CLM-L031 — Linguistic-frequency tagging (reverse-engineered rubric)
Status: 🔒 Locked (legacy) · 🔍 Practitioner-grounded · Falsifiable ✓ — operational in diagnostics/engine/careers/methodology/; not yet integrated into THEORY-OF-TRAITS.md
Topic: 09-tagging-and-clusters
CLAIM TEXT
The framework's operational tagging rubric for the 19 MN/MI dimensions was not authored top-down; it was reverse-engineered bottom-up from a gold set of 649 manually scored careers via linguistic frequency analysis. For each of the 19 dimensions, every term in the corpus is assigned a frequency ratio:
ratio(term, dim) = (frequency of term in HIGH-scoring careers, scores 8–10) / (frequency of term in LOW-scoring careers, scores 1–3)
The top ~15 HIGH-predictive and top ~10 LOW-predictive unigrams and bigrams per dimension constitute the derived rule set for that dimension. Signal strength is interpreted in three bands:
- ratio > 100 — very strong predictor (e.g., "engineers" → Logical, ratio ≈ 960).
- ratio 10–100 — moderate predictor.
- ratio 1–10 — weak predictor (often discarded).
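The ratio, the three bands, and the top-N selection can be sketched in a few lines of Python. This is a minimal illustration, not the code in the repository: the function names, the toy counts, and the handling of band boundaries (exactly 10 or exactly 100) are assumptions, and "frequency" is treated here as a raw count. The 0.1 floor is the minimum-frequency floor listed under the empirical evidence.

```python
FLOOR = 0.1  # minimum-frequency floor; keeps the denominator away from zero

def frequency_ratios(high_counts, low_counts, floor=FLOOR):
    """Return {term: HIGH/LOW ratio}, flooring both frequencies at `floor`."""
    terms = set(high_counts) | set(low_counts)
    return {
        t: max(high_counts.get(t, 0), floor) / max(low_counts.get(t, 0), floor)
        for t in terms
    }

def band(ratio):
    """Map a ratio onto the signal-strength bands (boundary handling assumed)."""
    if ratio > 100:
        return "very strong"
    if ratio >= 10:
        return "moderate"
    if ratio >= 1:
        return "weak"
    return "low-predictive"  # ratio < 1: term is more frequent in LOW careers

def derive_rules(high_counts, low_counts, n_high=15, n_low=10):
    """Pick the top HIGH-predictive and top LOW-predictive terms for one dimension."""
    ratios = frequency_ratios(high_counts, low_counts)
    ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
    high_rules = [(t, r) for t, r in ranked[:n_high] if r > 1]
    low_rules = [(t, r) for t, r in reversed(ranked[-n_low:]) if r < 1]
    return high_rules, low_rules

# Toy counts for one dimension, invented for illustration only.
high = {"engineers": 48, "analyze": 40, "patients": 1}
low = {"analyze": 2, "patients": 30}
high_rules, low_rules = derive_rules(high, low)
for term, ratio in high_rules:
    print(term, round(ratio, 2), band(ratio))
```

Under this sketch a term that never appears in LOW-scoring careers has its denominator floored at 0.1, which is how a modest HIGH count can produce a ratio in the hundreds.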
The framework's structural claims about this methodology:
- The rubric is empirical, not stipulated. No team meeting decided which words "mean" Healing or Adventurous; the 649-career gold set decided. The HIGH/LOW term lists per dimension are the rubric.
- The rubric is auditable. Every derived rule traces to specific careers, specific frequencies, and specific ratios — not to private practitioner intuition.
- The rubric is dimension-asymmetric. Some dimensions are corpus-rare (Musical: 17 careers ≥ 8) and others corpus-common (Logical: 336 careers ≥ 8). Rare dimensions yield fewer high-confidence terms and require more careful out-of-distribution handling.
- The rubric is iteratively refinable. Mis-tagged careers feed back into the gold set; the next iteration's frequency ratios shift accordingly. The tagging rubric is a living artifact.
The diagnostic operationalization: for any new career description, the tagger (a) extracts unigrams and bigrams, (b) looks up each term's HIGH/LOW ratios across all 19 dimensions, (c) accumulates evidence per dimension up to a governance-capped delta, and (d) returns a per-dimension score with cited rules-fired as evidence.
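Steps (a) through (d) can be sketched as follows. Everything here is hypothetical rather than taken from `tagger/engine.py`: the function names, the short stand-in stopword list, the baseline of 5.0, the cap of 3.0, and in particular the choice to turn each ratio into evidence via `log10(ratio)` (positive for HIGH-predictive terms, negative for LOW-predictive ones). The source specifies only that evidence accumulates up to a governance-capped delta and that fired rules are returned as evidence.

```python
import math
import re

# Tiny stand-in for the real stopword list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}

def extract_terms(text):
    """(a) Extract unigrams and bigrams; bigrams are formed after stopword removal."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def tag(text, rules, baseline=5.0, cap=3.0):
    """rules: {dimension: {term: ratio}}. Returns {dimension: (score, rules_fired)}."""
    terms = set(extract_terms(text))
    result = {}
    for dim, term_ratios in rules.items():
        # (b) Look up each rule term present in the description.
        fired = sorted((t, math.log10(r)) for t, r in term_ratios.items() if t in terms)
        # (c) Accumulate evidence, then clamp it to the capped delta.
        delta = max(-cap, min(cap, sum(w for _, w in fired)))
        # (d) Score on the 1-10 scale, with the fired rules as cited evidence.
        score = max(1.0, min(10.0, baseline + delta))
        result[dim] = (score, fired)
    return result

# Illustrative rule table; the ratios are invented, the dimension names are from the rubric.
rules = {
    "Logical": {"engineers": 960.0, "data analysis": 20.0},
    "Healing": {"patients": 50.0, "engineers": 0.05},
}
result = tag("Engineers perform data analysis for the design of bridges.", rules)
for dim, (score, fired) in result.items():
    print(dim, round(score, 2), [t for t, _ in fired])
```

The cap means that no pile-up of strong predictors can move a dimension more than a fixed distance from the baseline, which keeps a single keyword-dense description from saturating the scale.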
LOCATION (pre-adoption)
diagnostics/engine/careers/methodology/INDEX.md and DERIVED-RULES-REPORT.md (1,213 lines, 19-dimension rubric)
diagnostics/engine/careers/methodology/tagging-analysis-data.json (machine-readable rubric, 100 KB)
diagnostics/engine/careers/methodology/TAGGING-RULES-QUICK-REFERENCE.txt (110-line operational summary)
diagnostics/engine/careers/tagger/engine.py (operationalization)
LOCATION (post-adoption, when integrated)
Not yet integrated into THEORY-OF-TRAITS.md. Recommended cherry-pick: a Tagging & Methodology sub-section paired with CLM-L032 (tagger architecture) and CLM-L033 (chromatrait cluster lineage), naming the reverse-engineered rubric as the framework's empirical anchor for the 19-dimensional space (CLM-L025).
EVIDENCE TYPES
[P] Phenomenological
Moderate. The frequency-ratio rubric reproduces practitioner intuitions about which work activities load on which dimensions, with high agreement on strong predictors (ratio > 100) and lower agreement on weak predictors (ratio < 10). Practitioners reading the derived term lists report recognition ("yes, that is what Healing looks like in job text") rather than surprise.
[E] Empirical
- 649 careers with manually authored 19-dimensional scores constitute the gold set.
- Frequency ratios computed across roughly 285 or more unique terms per dimension, bigrams included.
- Stopword removal (102 common words) and 0.1 minimum-frequency floor to avoid division-by-zero.
- Cluster-mean validation: each dimension's HIGH-scoring careers cluster into recognizable career families, providing convergent validity.
- MISSING — cross-validation accuracy on held-out careers (planned in tagger/evaluate.py).
- MISSING — inter-rater reliability on the 649 gold scores themselves (the rubric is only as strong as its training labels).
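The missing held-out evaluation could take roughly the following shape. This is a sketch under stated assumptions and does not describe the planned `tagger/evaluate.py`: the function names, the choice of k-fold splitting, and mean absolute error as the metric are all hypothetical. The `fit` and `predict` callables stand in for "derive the rubric from the training careers" and "score one held-out career"; a mean-score baseline is included because the rubric would need to beat it to show signal.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    if n < k:
        raise ValueError("need at least one item per fold")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(careers, fit, predict, k=5, seed=0):
    """careers: list of (text, gold_score) for ONE dimension.

    fit(train) -> model; predict(model, text) -> score.
    Returns the mean absolute error on each held-out fold.
    """
    errors = []
    for held_out in k_fold_indices(len(careers), k, seed):
        held = set(held_out)
        train = [c for i, c in enumerate(careers) if i not in held]
        test = [careers[i] for i in held_out]
        model = fit(train)  # rules are derived from training careers only
        mae = sum(abs(predict(model, text) - gold) for text, gold in test) / len(test)
        errors.append(mae)
    return errors

# Baseline: always predict the mean gold score of the training fold.
def fit_mean(train):
    return sum(gold for _, gold in train) / len(train)

def predict_mean(model, text):
    return model

# Placeholder data, invented for illustration only.
toy = [(f"career {i}", float(1 + i % 10)) for i in range(20)]
baseline_mae = cross_validate(toy, fit_mean, predict_mean, k=5)
print([round(e, 2) for e in baseline_mae])
```

The key discipline is that ratios are recomputed inside each fold from the training careers alone; deriving the rubric once from all 649 careers and then testing on a subset would leak the held-out labels into the rules.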
[T] Theoretical
- Compatible with CLM-L023 (10 intelligences, modified from Gardner): the dimension list determines what is being predicted.
- Compatible with CLM-L024 (9 natures as engagements): the frequency analysis treats engagements as describable in occupational text.
- Compatible with CLM-L025 (combinatorial profile space): the rubric assumes 19 independent dimensions, not types — frequency analysis runs per-dimension, not per-cluster.
- Convergent with corpus-linguistics keyword analysis, distinctive-collexeme analysis, and TF-IDF-style discriminative methods in NLP.
[C] Convergent
- Corpus linguistics — log-likelihood ratio tests (Dunning 1993) for distinctive vocabulary; structural parallel.
- Distinctive-collexeme analysis (Stefanowitsch & Gries 2003) — identifying terms that distinctively co-occur with constructions; structural parallel.
- TF-IDF / discriminative classifiers (Salton, Joachims) — frequency-weighted term importance; weaker form of the same idea.
- MISSING — convergent rs- entries on corpus-linguistics frequency-ratio methods and distinctive-collexeme analysis.
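For contrast with the raw frequency ratio, the log-likelihood statistic cited above (Dunning 1993) can be computed from the same counts. The sketch below is not part of the framework's rubric; the function name and example counts are illustrative. It uses the standard 2x2 form, G2 = 2 * sum(O * ln(O / E)), where the cells are the term's occurrences and all other tokens in the HIGH and LOW sub-corpora.

```python
import math

def log_likelihood(a, b, n_high, n_low):
    """Log-likelihood (G2) for a term seen `a` times in HIGH text and `b` times in LOW text.

    n_high and n_low are the total token counts of the two sub-corpora.
    """
    observed = [[a, b], [n_high - a, n_low - b]]
    row_totals = [a + b, (n_high - a) + (n_low - b)]
    col_totals = [n_high, n_low]
    total = n_high + n_low
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total  # expected count under independence
            if o > 0:  # 0 * ln(0) is taken as 0
                g2 += o * math.log(o / e)
    return 2.0 * g2

# Invented counts: a well-attested term versus a term seen only once.
print(round(log_likelihood(50, 1, 1000, 1000), 2))
print(round(log_likelihood(1, 0, 1000, 1000), 2))
```

The second call shows why the parallel is useful. A term seen once in HIGH text and never in LOW text gets a frequency ratio of 10 once the 0.1 floor is applied, yet its G2 falls well below 3.84, the 5% critical value for one degree of freedom, because the statistic accounts for how little evidence a single occurrence carries.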
UPSTREAM SOURCES
diagnostics/engine/careers/methodology/DERIVED-RULES-REPORT.md (generated 2026-02-25).
diagnostics/engine/careers/methodology/README-ANALYSIS.md.
diagnostics/engine/careers/data/raw/careers.csv (649-career gold set).
POSITIONING IN LITERATURE
- Confirms: corpus-linguistics frequency-ratio methods, distinctive-collexeme analysis, discriminative term-weighting in NLP.
- Extends: applies frequency-ratio analysis to a 19-dimensional human-engagement-and-capacity space rather than a binary or category-discrete target. The framework's contribution: an auditable, reverse-engineered rubric as alternative to authored taxonomies (RIASEC, OPM, O*NET descriptors) which encode designer assumptions.
- Departs: from authored-taxonomy traditions (e.g., O*NET's authored work-activity descriptors) by treating the rubric as a derived artifact answerable to the gold set, not a fixed instrument.
FALSIFIABILITY
The reverse-engineered-rubric claim would be falsified if:
- Held-out cross-validation shows the derived rules predict gold scores no better than chance or no better than authored rubrics.
- Practitioner inter-rater reliability on the 649 gold scores is low enough (κ < .5) that the rubric inherits noise rather than signal.
- The HIGH/LOW term lists fail to generalize across occupational corpora (e.g., they work for U.S. O*NET text but not for non-U.S. ESCO text).
- A purely theoretical, top-down rubric outperforms the frequency-derived one on out-of-sample careers.
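The inter-rater criterion can be checked with Cohen's kappa once a second set of scores exists. The sketch below is an assumption about how that check might be run, not a description of an existing script: it collapses 1–10 scores into the LOW (1–3) and HIGH (8–10) bands the rubric already uses plus a middle band, and computes unweighted kappa over those labels. A weighted kappa on the raw ordinal scores may be a better fit; the source does not say which variant the κ < .5 threshold refers to.

```python
from collections import Counter

def score_band(score):
    """Collapse a 1-10 score into the LOW / MID / HIGH bands used by the ratio."""
    if score >= 8:
        return "HIGH"
    if score <= 3:
        return "LOW"
    return "MID"

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented scores from two hypothetical raters on one dimension.
rater_a = [9, 2, 5, 8, 1, 6]
rater_b = [9, 2, 5, 2, 9, 6]
kappa = cohens_kappa([score_band(s) for s in rater_a],
                     [score_band(s) for s in rater_b])
print(round(kappa, 2))
```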
EDGE CASES / KNOWN LIMITS
- Rare dimensions are noisy. Musical (17 high-scoring careers) yields fewer reliable predictors than Logical (336). The framework treats rare-dimension predictions as lower-confidence and requires OOD handling (CLM-L032).
- N-gram coverage is incomplete. The rubric extracts unigrams and bigrams but not longer phrases, so dimension signals carried by collocations of three or more words (e.g., "fine motor control") may be missed or captured only partially through their component bigrams.
- English-only corpus. The current rubric was derived on English career text; transferring to French ESCO data requires re-derivation, not translation.
- Gold-set bias. If the 649 careers over-represent certain occupational families (e.g., U.S. white-collar), the derived rules over-represent the vocabulary of those families.
DISCONFIRMING CASES TRACKED
- Careers where strong HIGH predictors fire but the resulting score conflicts with practitioner judgment are flagged in the tagger's review queue (scripts/seed-output/dedup-review-queue.json). These cases drive iterative refinement of the rubric and the gold set.
REFLEXIVITY NOTE
The framework's preference for reverse-engineered over authored rubrics reflects the originator's stance against typology (CLM-L025) — the same anti-stipulation discipline applied at the methodology layer. A practitioner from an authored-taxonomy tradition (RIASEC, MBTI-codified scoring) may experience the empirical rubric as unstable; the framework's view is that stability of an authored rubric is a defect, because it cannot self-correct against new evidence.
RELATIONSHIP TO CURRENT CANON
- Already integrated? No. THEORY-OF-TRAITS.md does not yet describe the tagging methodology.
- Contradicts current canon? No.
- Net-new? The reverse-engineered linguistic-frequency rubric, the per-dimension HIGH/LOW term lists as the operational rubric, and the rubric-as-living-artifact stance are net-new to master canon.
- Recommended action: Cherry-pick a Tagging & Methodology sub-section into THEORY-OF-TRAITS.md naming the frequency-ratio approach, the 649-career gold set, and the auditability claim. Pair with CLM-L032 and CLM-L033.
RESEARCH-BANK GAPS FLAGGED
For BACKLOG.md:
- Dunning (1993) — Accurate Methods for the Statistics of Surprise and Coincidence (log-likelihood for keyness).
- Stefanowitsch & Gries (2003) — distinctive-collexeme analysis.
- O*NET methodological documentation — authored work-activity descriptors, for contrast.
- Cross-validation literature — held-out evaluation in NLP classification.
NOTES
- This claim is the framework's empirical anchor for the 19-dimensional space. The rubric's auditability is what permits the framework to claim its dimensions are operational rather than aesthetic.
- Pairs with CLM-L032 (tagger architecture — how the rubric is fused with cluster-prior and embedding signals) and CLM-L033 (chromatrait cluster lineage — the upstream gold authored layer the frequency analysis was run against).