
Linguistic-frequency tagging — the rubric is reverse-engineered, not authored


CLM-L031 — Linguistic-frequency tagging (reverse-engineered rubric)

Status: 🔒 Locked (legacy) · 🔍 Practitioner-grounded · Falsifiable ✓ — operational in diagnostics/engine/careers/methodology/; not yet integrated into THEORY-OF-TRAITS.md

Topic: 09-tagging-and-clusters


CLAIM TEXT

The framework's operational tagging rubric for the 19 MN/MI dimensions was not authored top-down; it was reverse-engineered bottom-up from a gold set of 649 manually scored careers via linguistic frequency analysis. For each of the 19 dimensions, every term in the corpus is assigned a frequency ratio:

ratio(term, dim) = frequency in HIGH-scoring careers (8–10) /
                   frequency in LOW-scoring careers (1–3)

The top ~15 HIGH-predictive and top ~10 LOW-predictive unigrams and bigrams per dimension constitute the derived rule set for that dimension. Signal strength is interpreted in three bands:

  • ratio > 100 — very strong predictor (e.g., "engineers" → Logical, ratio ≈ 960).
  • ratio 10–100 — moderate predictor.
  • ratio 1–10 — weak predictor (often discarded).
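The ratio computation above can be sketched in Python. This is a minimal illustration, not the framework's actual code: the tiny stopword list, the `(description, scores)` data shape, and the function names are assumptions; the 0.1 floor is the document's stated minimum-frequency guard against division by zero.

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "in", "a", "for", "with"}  # toy subset of the 102-word list
FLOOR = 0.1  # minimum LOW frequency, avoids division by zero

def terms(text):
    """Extract lowercase unigrams plus adjacent bigrams, minus stopwords."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def frequency_ratios(careers, dim):
    """careers: list of (description, {dim: score}) pairs.
    Returns {term: HIGH frequency / max(LOW frequency, FLOOR)} for one dimension."""
    high, low = Counter(), Counter()
    for text, scores in careers:
        s = scores.get(dim)
        if s is None:
            continue
        if s >= 8:            # HIGH-scoring careers (8-10)
            high.update(terms(text))
        elif s <= 3:          # LOW-scoring careers (1-3)
            low.update(terms(text))
    return {t: high[t] / max(low[t], FLOOR) for t in set(high) | set(low)}
```

A term that appears only in HIGH-scoring careers hits the floor in the denominator, so its ratio is 10x its HIGH count; a term appearing only in LOW-scoring careers gets a ratio of 0.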

The framework's structural claims about this methodology:

  • The rubric is empirical, not stipulated. No team meeting decided which words "mean" Healing or Adventurous; the 649-career gold set decided. The HIGH/LOW term lists per dimension are the rubric.
  • The rubric is auditable. Every derived rule traces to specific careers, specific frequencies, and specific ratios — not to private practitioner intuition.
  • The rubric is dimension-asymmetric. Some dimensions are corpus-rare (Musical: 17 careers ≥ 8) and others corpus-common (Logical: 336 careers ≥ 8). Rare dimensions yield fewer high-confidence terms and require more careful out-of-distribution handling.
  • The rubric is iteratively refinable. Mis-tagged careers feed back into the gold set; the next iteration's frequency ratios shift accordingly. The tagging rubric is a living artifact.

The diagnostic operationalization: for any new career description, the tagger (a) extracts unigrams and bigrams, (b) looks up each term's HIGH/LOW ratios across all 19 dimensions, (c) accumulates evidence per dimension up to a governance-capped delta, and (d) returns a per-dimension score with cited rules-fired as evidence.
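Steps (a) through (d) can be sketched as follows. The band-to-step mapping (1.5 / 0.5 / -0.5), the neutral base score of 5.0, and the cap of 3.0 are illustrative assumptions, not the governance values used in tagger/engine.py:

```python
def extract_terms(text):
    """(a) Extract unigrams plus adjacent bigrams."""
    words = text.lower().split()
    return set(words) | {f"{a} {b}" for a, b in zip(words, words[1:])}

def tag_career(text, rules, base=5.0, max_delta=3.0):
    """rules: {dim: {term: ratio}} as derived by the frequency analysis.
    Returns {dim: (score, rules_fired)} with evidence capped at +/- max_delta."""
    found = extract_terms(text)
    result = {}
    for dim, table in rules.items():
        delta, fired = 0.0, []
        for term in sorted(found & set(table)):       # (b) look up each term's ratio
            ratio = table[term]
            if ratio > 100:
                step = 1.5        # very strong predictor
            elif ratio >= 10:
                step = 0.5        # moderate predictor
            elif ratio < 1:
                step = -0.5       # LOW-predictive evidence
            else:
                continue          # weak band (1-10), discarded
            delta += step
            fired.append(term)
        delta = max(-max_delta, min(max_delta, delta))  # (c) governance-capped delta
        result[dim] = (base + delta, fired)             # (d) score + cited rules-fired
    return result
```

The `fired` list is what makes the output auditable: every score ships with the specific terms that produced it.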

LOCATION (pre-adoption)

  • diagnostics/engine/careers/methodology/INDEX.md and DERIVED-RULES-REPORT.md (1,213 lines, 19-dimension rubric)
  • diagnostics/engine/careers/methodology/tagging-analysis-data.json (machine-readable rubric, 100 KB)
  • diagnostics/engine/careers/methodology/TAGGING-RULES-QUICK-REFERENCE.txt (110-line operational summary)
  • diagnostics/engine/careers/tagger/engine.py (operationalization)

LOCATION (post-adoption, when integrated)

Not yet integrated into THEORY-OF-TRAITS.md. Recommended cherry-pick: a Tagging & Methodology sub-section paired with CLM-L032 (tagger architecture) and CLM-L033 (chromatrait cluster lineage), naming the reverse-engineered rubric as the framework's empirical anchor for the 19-dimensional space (CLM-L025).


EVIDENCE TYPES

[P] Phenomenological

Moderate. The frequency-ratio rubric reproduces practitioner intuitions about which work activities load on which dimensions, with high agreement on strong predictors (ratio > 100) and lower agreement on weak predictors (ratio < 10). Practitioners reading the derived term lists report recognition ("yes, that is what Healing looks like in job text") rather than surprise.

[E] Empirical

  • 649 careers with manually authored 19-dimensional scores constitute the gold set.
  • Frequency ratios computed across 285+ unique terms per dimension, bigrams included.
  • Stopword removal (102 common words) and 0.1 minimum-frequency floor to avoid division-by-zero.
  • Cluster-mean validation: each dimension's HIGH-scoring careers cluster into recognizable career families, providing convergent validity.
  • MISSING — cross-validation accuracy on held-out careers (planned in tagger/evaluate.py).
  • MISSING — inter-rater reliability on the 649 gold scores themselves (the rubric is only as strong as its training labels).
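Since tagger/evaluate.py is only planned, here is a minimal sketch of what the missing held-out evaluation could look like: deal the 649 gold careers into folds, re-derive rules on each training split, score the held-out fold, and compare against gold. The fold count, seed, and helper names are assumptions, not the actual design:

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mean_abs_error(predicted, gold):
    """Per-dimension error between tagger scores and gold scores (0-10 scale)."""
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(gold)
```

An authored rubric scored on the same folds would give the baseline the falsifiability section below calls for.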

[T] Theoretical

  • Compatible with CLM-L023 (10 intelligences modified Gardner): the dimension list determines what is being predicted.
  • Compatible with CLM-L024 (9 natures as engagements): the frequency analysis treats engagements as describable in occupational text.
  • Compatible with CLM-L025 (combinatorial profile space): the rubric assumes 19 independent dimensions, not types — frequency analysis runs per-dimension, not per-cluster.
  • Convergent with corpus-linguistics keyword analysis, distinctive-collexeme analysis, and TF-IDF-style discriminative methods in NLP.

[C] Convergent

  • Corpus linguistics — log-likelihood ratio tests (Dunning 1993) for distinctive vocabulary; structural parallel.
  • Distinctive-collexeme analysis (Stefanowitsch & Gries 2003) — identifying terms that distinctively co-occur with constructions; structural parallel.
  • TF-IDF / discriminative classifiers (Salton, Joachims) — frequency-weighted term importance; weaker form of the same idea.
  • MISSING — convergent rs- entries on corpus-linguistics frequency-ratio methods and distinctive-collexeme analysis.

UPSTREAM SOURCES

  • diagnostics/engine/careers/methodology/DERIVED-RULES-REPORT.md (generated 2026-02-25).
  • diagnostics/engine/careers/methodology/README-ANALYSIS.md.
  • diagnostics/engine/careers/data/raw/careers.csv (649-career gold set).

POSITIONING IN LITERATURE

  • Confirms: corpus-linguistics frequency-ratio methods, distinctive-collexeme analysis, discriminative term-weighting in NLP.
  • Extends: applies frequency-ratio analysis to a 19-dimensional human-engagement-and-capacity space rather than a binary or category-discrete target. The framework's contribution: an auditable, reverse-engineered rubric as an alternative to authored taxonomies (RIASEC, OPM, O*NET descriptors), which encode designer assumptions.
  • Departs: from authored-taxonomy traditions (e.g., O*NET's authored work-activity descriptors) by treating the rubric as a derived artifact answerable to the gold set, not a fixed instrument.

FALSIFIABILITY

The reverse-engineered-rubric claim would be falsified if:

  • Held-out cross-validation shows the derived rules predict gold scores no better than chance or no better than authored rubrics.
  • Practitioner inter-rater reliability on the 649 gold scores is low enough (κ < .5) that the rubric inherits noise rather than signal.
  • The HIGH/LOW term lists fail to generalize across occupational corpora (e.g., they work for U.S. O*NET text but not for non-U.S. ESCO text).
  • A purely theoretical, top-down rubric outperforms the frequency-derived one on out-of-sample careers.

EDGE CASES / KNOWN LIMITS

  • Rare dimensions are noisy. Musical (17 high-scoring careers) yields fewer reliable predictors than Logical (336). The framework treats rare-dimension predictions as lower-confidence and requires OOD handling (CLM-L032).
  • Bigram coverage is incomplete. The rubric extracts unigrams and bigrams but not longer phrases; dimension signals carried by longer collocations (e.g., the trigram "fine motor control") may be missed.
  • English-only corpus. The current rubric was derived on English career text; transferring to French ESCO data requires re-derivation, not translation.
  • Gold-set bias. If the 649 careers over-represent certain occupational families (e.g., U.S. white-collar), the derived rules over-represent the vocabulary of those families.

DISCONFIRMING CASES TRACKED

  • Careers where strong HIGH predictors fire but the resulting score conflicts with practitioner judgment are flagged in the tagger's review queue (scripts/seed-output/dedup-review-queue.json). These cases drive iterative refinement of the rubric and the gold set.

REFLEXIVITY NOTE

The framework's preference for reverse-engineered over authored rubrics reflects the originator's stance against typology (CLM-L025) — the same anti-stipulation discipline applied at the methodology layer. A practitioner from an authored-taxonomy tradition (RIASEC, MBTI-codified scoring) may experience the empirical rubric as unstable; the framework's view is that stability of an authored rubric is a defect, because it cannot self-correct against new evidence.


RELATIONSHIP TO CURRENT CANON

  • Already integrated? No. THEORY-OF-TRAITS.md does not yet describe the tagging methodology.
  • Contradicts current canon? No.
  • Net-new? The reverse-engineered linguistic-frequency rubric, the per-dimension HIGH/LOW term lists as the operational rubric, and the rubric-as-living-artifact stance are net-new to master canon.
  • Recommended action: Cherry-pick a Tagging & Methodology sub-section into THEORY-OF-TRAITS.md naming the frequency-ratio approach, the 649-career gold set, and the auditability claim. Pair with CLM-L032 and CLM-L033.

RESEARCH-BANK GAPS FLAGGED

For BACKLOG.md:

  1. Dunning (1993) — Accurate Methods for the Statistics of Surprise and Coincidence (log-likelihood for keyness).
  2. Stefanowitsch & Gries (2003) — distinctive-collexeme analysis.
  3. O*NET methodological documentation — work-activity authored descriptors, for contrast.
  4. Cross-validation literature — held-out evaluation in NLP classification.

NOTES

  • This claim is the framework's empirical anchor for the 19-dimensional space. The rubric's auditability is what permits the framework to claim its dimensions are operational rather than aesthetic.
  • Pairs with CLM-L032 (tagger architecture — how the rubric is fused with cluster-prior and embedding signals) and CLM-L033 (chromatrait cluster lineage — the upstream gold authored layer the frequency analysis was run against).
CITATIONS

No research entries linked yet. Gaps tracked in research/method/BACKLOG.md.