CLM-L031 — Linguistic-frequency tagging (reverse-engineered rubric)

Status: 🔒 Locked (legacy) · 🔍 Practitioner-grounded · Falsifiable ✓ — operational in diagnostics/engine/careers/methodology/; not yet integrated into THEORY-OF-TRAITS.md

Topic: 09-tagging-and-clusters

CLAIM TEXT

The framework's operational tagging rubric for the 19 MN/MI dimensions was not authored top-down; it was reverse-engineered bottom-up from a gold set of 649 manually scored careers via linguistic frequency analysis. For each of the 19 dimensions, every term in the corpus is assigned a frequency ratio:

ratio(term, dim) = frequency in HIGH-scoring careers (8–10) /
                   frequency in LOW-scoring careers (1–3)

The top ~15 HIGH-predictive and top ~10 LOW-predictive unigrams and bigrams per dimension constitute the derived rule set for that dimension. Signal strength is interpreted in three bands:

ratio > 100 — very strong predictor (e.g., "engineers" → Logical, ratio ≈ 960).
ratio 10–100 — moderate predictor.
ratio 1–10 — weak predictor (often discarded).
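
The ratio and its three bands can be sketched in a few lines. This is a minimal illustration, not the code in the methodology directory: the function names are hypothetical, raw counts stand in for "frequency" (the claim does not say whether frequencies are normalized), and the 0.1 floor is taken from the Empirical evidence section. The handling of exact boundary values (10 and 100) and of ratios below 1 is also an assumption.

```python
from collections import Counter

FLOOR = 0.1  # minimum-frequency floor, applied to avoid division by zero


def frequency_ratios(high_docs, low_docs):
    """Return {term: ratio} with ratio = HIGH frequency / LOW frequency.

    high_docs and low_docs are lists of token lists for careers scored
    8-10 and 1-3 on a single dimension. Using raw counts is an assumption.
    """
    high = Counter(t for doc in high_docs for t in doc)
    low = Counter(t for doc in low_docs for t in doc)
    terms = set(high) | set(low)
    return {t: max(high[t], FLOOR) / max(low[t], FLOOR) for t in terms}


def band(ratio):
    """Map a ratio to the three signal-strength bands.

    Boundary handling is an assumption: 10 and 100 both count as moderate.
    """
    if ratio > 100:
        return "very strong"
    if ratio >= 10:
        return "moderate"
    if ratio >= 1:
        return "weak"
    return "unbanded"  # ratio < 1: the term is more frequent in LOW careers


# Toy data, not drawn from the 649-career gold set.
high_docs = [["engineers", "design", "systems"], ["engineers", "analyze"]]
low_docs = [["design", "care"]]
ratios = frequency_ratios(high_docs, low_docs)
top_high = sorted(ratios, key=ratios.get, reverse=True)[:15]
```

With the toy data, "engineers" appears twice in HIGH careers and never in LOW careers, so the floor turns its ratio into 2 / 0.1 = 20 rather than a division by zero.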

The framework's structural claims about this methodology:

The rubric is empirical, not stipulated. No team meeting decided which words "mean" Healing or Adventurous; the 649-career gold set decided. The HIGH/LOW term lists per dimension are the rubric.
The rubric is auditable. Every derived rule traces to specific careers, specific frequencies, and specific ratios — not to private practitioner intuition.
The rubric is dimension-asymmetric. Some dimensions are corpus-rare (Musical: 17 careers ≥ 8) and others corpus-common (Logical: 336 careers ≥ 8). Rare dimensions yield fewer high-confidence terms and require more careful out-of-distribution handling.
The rubric is iteratively refinable. Mis-tagged careers feed back into the gold set; the next iteration's frequency ratios shift accordingly. The tagging rubric is a living artifact.

The diagnostic operationalization: for any new career description, the tagger (a) extracts unigrams and bigrams, (b) looks up each term's HIGH/LOW ratios across all 19 dimensions, (c) accumulates evidence per dimension up to a governance-capped delta, and (d) returns a per-dimension score with cited rules-fired as evidence.
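
Steps (a) through (d) can be sketched as follows. Everything concrete here is hypothetical: the rule table, the per-term weights, the baseline of 5.0, and the cap of 3.0 are placeholders, since this claim does not state how ratios are converted into score deltas or what the governance cap is. The stopword set is a small illustrative subset, not the 102-word list.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}  # illustrative subset
MAX_DELTA = 3.0  # hypothetical governance cap on accumulated evidence

# Hypothetical rule table: term -> {dimension: signed weight}.
RULES = {
    "engineers": {"Logical": 2.5},
    "design": {"Logical": 1.0},
    "systems": {"Logical": 1.0},
    "patient care": {"Healing": 2.0},
}


def extract_terms(text):
    """(a) Lowercase, drop stopwords, return unigrams plus adjacent bigrams."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tag(text, baseline=5.0):
    """(b)-(d) Look up rules, accumulate capped deltas, return scores with evidence."""
    deltas, fired = {}, {}
    for term in extract_terms(text):
        for dim, weight in RULES.get(term, {}).items():
            deltas[dim] = deltas.get(dim, 0.0) + weight
            fired.setdefault(dim, []).append(term)
    result = {}
    for dim, delta in deltas.items():
        capped = max(-MAX_DELTA, min(MAX_DELTA, delta))  # (c) governance cap
        result[dim] = {"score": baseline + capped, "rules_fired": fired[dim]}
    return result


scores = tag("Engineers design systems for patient care.")
```

In this toy run three Logical rules fire for a raw delta of 4.5, which the cap reduces to 3.0. The cap is what keeps a description dense with one dimension's vocabulary from saturating the score, and the rules_fired list is what makes each score traceable to specific terms.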

LOCATION (pre-adoption)

diagnostics/engine/careers/methodology/INDEX.md and DERIVED-RULES-REPORT.md (1,213 lines, 19-dimension rubric)
diagnostics/engine/careers/methodology/tagging-analysis-data.json (machine-readable rubric, 100 KB)
diagnostics/engine/careers/methodology/TAGGING-RULES-QUICK-REFERENCE.txt (110-line operational summary)
diagnostics/engine/careers/tagger/engine.py (operationalization)

LOCATION (post-adoption, when integrated)

Not yet integrated into THEORY-OF-TRAITS.md. Recommended cherry-pick: a Tagging & Methodology sub-section paired with CLM-L032 (tagger architecture) and CLM-L033 (chromatrait cluster lineage), naming the reverse-engineered rubric as the framework's empirical anchor for the 19-dimensional space (CLM-L025).

EVIDENCE TYPES

[P] Phenomenological

Moderate. The frequency-ratio rubric reproduces practitioner intuitions about which work activities load on which dimensions, with high agreement on strong predictors (ratio > 100) and lower agreement on weak predictors (ratio < 10). Practitioners reading the derived term lists report recognition ("yes, that is what Healing looks like in job text") rather than surprise.

[E] Empirical

649 careers with manually authored 19-dimensional scores constitute the gold set.
Frequency ratios computed across ~285+ unique terms per dimension; bigrams included.
Stopword removal (102 common words) and a 0.1 minimum-frequency floor, applied to avoid division by zero.
Cluster-mean validation: each dimension's HIGH-scoring careers cluster into recognizable career families, providing convergent validity.
MISSING — cross-validation accuracy on held-out careers (planned in tagger/evaluate.py).
MISSING — inter-rater reliability on the 649 gold scores themselves (the rubric is only as strong as its training labels).

[T] Theoretical

Compatible with CLM-L023 (10 intelligences, modified from Gardner): the dimension list determines what is being predicted.
Compatible with CLM-L024 (9 natures as engagements): the frequency analysis treats engagements as describable in occupational text.
Compatible with CLM-L025 (combinatorial profile space): the rubric assumes 19 independent dimensions, not types — frequency analysis runs per-dimension, not per-cluster.
Convergent with corpus-linguistics keyword analysis, distinctive-collexeme analysis, and TF-IDF-style discriminative methods in NLP.

[C] Convergent

Corpus linguistics — log-likelihood ratio tests (Dunning 1993) for distinctive vocabulary; structural parallel.
Distinctive-collexeme analysis (Stefanowitsch & Gries 2003) — identifying terms that distinctively co-occur with constructions; structural parallel.
TF-IDF / discriminative classifiers (Salton, Joachims) — frequency-weighted term importance; weaker form of the same idea.
MISSING — convergent rs- entries on corpus-linguistics frequency-ratio methods and distinctive-collexeme analysis.
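
For comparison with the raw frequency ratio, the log-likelihood keyness statistic named above can be sketched as below. This uses the two-term form commonly applied when comparing a term's frequency across two corpora, not the full contingency-table statistic, and it is not part of the framework's rubric; the function name and the example counts are illustrative assumptions.

```python
import math


def log_likelihood(a, b, c, d):
    """Keyness G2 for a term seen a times in corpus 1 (c tokens in total)
    and b times in corpus 2 (d tokens in total)."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # the 0 * ln(0) term is taken as 0
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Same 20:0 split, but in a corpus a tenth the size: the statistic shrinks.
large = log_likelihood(20, 0, 1000, 1000)
small = log_likelihood(2, 0, 100, 100)
```

Unlike a plain ratio, this statistic grows with the amount of evidence, which is one reason it is a natural reference point for corpus-rare dimensions where a high ratio may rest on very few occurrences.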

UPSTREAM SOURCES

diagnostics/engine/careers/methodology/DERIVED-RULES-REPORT.md (generated 2026-02-25).
diagnostics/engine/careers/methodology/README-ANALYSIS.md.
diagnostics/engine/careers/data/raw/careers.csv (649-career gold set).

POSITIONING IN LITERATURE

Confirms: corpus-linguistics frequency-ratio methods, distinctive-collexeme analysis, discriminative term-weighting in NLP.
Extends: applies frequency-ratio analysis to a 19-dimensional human-engagement-and-capacity space rather than a binary or category-discrete target. The framework's contribution: an auditable, reverse-engineered rubric as an alternative to authored taxonomies (RIASEC, OPM, O*NET descriptors), which encode designer assumptions.
Departs: from authored-taxonomy traditions (e.g., O*NET's authored work-activity descriptors) by treating the rubric as a derived artifact answerable to the gold set, not a fixed instrument.

FALSIFIABILITY

The reverse-engineered-rubric claim would be falsified if:

Held-out cross-validation shows the derived rules predict gold scores no better than chance or no better than authored rubrics.
Practitioner inter-rater reliability on the 649 gold scores is low enough (κ < .5) that the rubric inherits noise rather than signal.
The HIGH/LOW term lists fail to generalize across occupational corpora (e.g., they work for U.S. O*NET text but not for non-U.S. ESCO text).
A purely theoretical, top-down rubric outperforms the frequency-derived one on out-of-sample careers.
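
The κ threshold in the second condition could be checked along these lines. This is a sketch under stated assumptions: the claim does not say which κ variant is intended, so unweighted Cohen's kappa for two raters is used here, and scores are first collapsed into bands. The MID band (4–7) is an assumption, since the claim defines only HIGH (8–10) and LOW (1–3). The rater data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def to_band(score):
    """Collapse a 1-10 score into LOW (1-3), MID (4-7), HIGH (8-10)."""
    return "LOW" if score <= 3 else "HIGH" if score >= 8 else "MID"


# Invented scores from two hypothetical raters on one dimension.
rater_1 = [9, 2, 5, 8, 1, 6]
rater_2 = [8, 3, 5, 5, 2, 9]
kappa = cohens_kappa([to_band(s) for s in rater_1], [to_band(s) for s in rater_2])
```

Because the 1–10 scores are ordinal, a weighted kappa that gives partial credit for near-misses may be a better fit than the unweighted form shown here.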

EDGE CASES / KNOWN LIMITS

Rare dimensions are noisy. Musical (17 high-scoring careers) yields fewer reliable predictors than Logical (336). The framework treats rare-dimension predictions as lower-confidence and requires OOD handling (CLM-L032).
Phrase coverage is incomplete. The rubric extracts unigrams and bigrams but not longer phrases, so dimension signals carried by collocations of three or more words (e.g., "fine motor control") are captured only through their component unigrams and bigrams and may be missed as a unit.
English-only corpus. The current rubric was derived on English career text; transferring to French ESCO data requires re-derivation, not translation.
Gold-set bias. If the 649 careers over-represent certain occupational families (e.g., U.S. white-collar), the derived rules over-represent the vocabulary of those families.

DISCONFIRMING CASES TRACKED

Careers where strong HIGH predictors fire but the resulting score conflicts with practitioner judgment are flagged in the tagger's review queue (scripts/seed-output/dedup-review-queue.json). These cases drive iterative refinement of the rubric and the gold set.

REFLEXIVITY NOTE

The framework's preference for reverse-engineered over authored rubrics reflects the originator's stance against typology (CLM-L025) — the same anti-stipulation discipline applied at the methodology layer. A practitioner from an authored-taxonomy tradition (RIASEC, MBTI-codified scoring) may experience the empirical rubric as unstable; the framework's view is that stability of an authored rubric is a defect, because it cannot self-correct against new evidence.

RELATIONSHIP TO CURRENT CANON

Already integrated? No. THEORY-OF-TRAITS.md does not yet describe the tagging methodology.
Contradicts current canon? No.
Net-new? The reverse-engineered linguistic-frequency rubric, the per-dimension HIGH/LOW term lists as the operational rubric, and the rubric-as-living-artifact stance are net-new to master canon.
Recommended action: Cherry-pick a Tagging & Methodology sub-section into THEORY-OF-TRAITS.md naming the frequency-ratio approach, the 649-career gold set, and the auditability claim. Pair with CLM-L032 and CLM-L033.

RESEARCH-BANK GAPS FLAGGED

For BACKLOG.md:

Dunning (1993) — Accurate Methods for the Statistics of Surprise and Coincidence (log-likelihood for keyness).
Stefanowitsch & Gries (2003) — distinctive-collexeme analysis.
O*NET methodological documentation (authored work-activity descriptors, for contrast).
Cross-validation literature — held-out evaluation in NLP classification.

NOTES

This claim is the framework's empirical anchor for the 19-dimensional space. The rubric's auditability is what permits the framework to claim its dimensions are operational rather than aesthetic.
Pairs with CLM-L032 (tagger architecture: how the rubric is fused with cluster-prior and embedding signals) and CLM-L033 (chromatrait cluster lineage: the upstream authored gold layer that the frequency analysis was run against).

Citations · 0 research entries

No research entries linked yet. Gaps tracked in research/method/BACKLOG.md.
