wlu03 · wlu03 · Jun 8, 2026 · Jun 8, 2026 · Jun 8, 2026 · Jun 8, 2026
diff --git a/README.md b/README.md
@@ -52,28 +52,36 @@ The patterns are organized by *why* the compiler fails to fix them:
 | deepseek-r1-distill-llama-70b | 70 | reasoning | 46 / 42 / 46 | 13.4 / 12.3 / 12.2 |
 | qwen2.5-72b | 72 | general | 57 / 54 / 62 | 6.8 / 14.1 / 10.3 |
 
-**Faithful × fast 2×2** (overall, n=55,980; "fast" = `speedup_vs_slow` > 1.5; full per-model/per-pattern breakdown in `results/aggregate_2x2/report.txt`):
+**Faithfulness cells × fast/slow** (overall, n=55,980; "fast" = `speedup_vs_slow` > 1.5; full per-model/per-pattern breakdown in `results/aggregate_2x2/report.txt`). The two-axis cascade (equivalence × expected-shape) routes each attempt into one of four cells rather than a binary faithful/unfaithful split:
 
-| | Faithful | Unfaithful | Row |
-|---|---|---|---|
-| **Fast** | 18.7% | 11.5% | 30.2% |
-| **Slow** | 13.0% | 56.7% | 69.8% |
-| **Col** | 31.8% | 68.2% | 100% |
+| | FAITHFUL | FAITHFUL_ALTERNATIVE | STRUCTURAL_ONLY | FAILED | Row |
+|---|---|---|---|---|---|
+| **Fast** | 19.2% | 10.9% | 0.2% | 0.0% | 30.2% |
+| **Slow** | 9.9% | 21.2% | 18.9% | 19.7% | 69.8% |
+| **Col** | 29.1% | 32.1% | 19.1% | 19.7% | 100% |
 
-Per-strategy faithful rates are close: generic 30.0%, pattern-aware 33.8%, taxonomy-guided 31.5%.
+- **FAITHFUL** — performed the labeled transformation *and* stays equivalent.
+- **FAITHFUL_ALTERNATIVE** — equivalent via a *different* valid transformation; deliberately not conflated with failure.
+- **STRUCTURAL_ONLY** — has the expected shape but breaks correctness (overfit / DCE / hardcoded output).
+- **FAILED** — neither.
 
-**Faithfulness-scoring caveats.** Two structural factors shape this aggregate, and the headline rate is sensitive to both:
-- **COMP composition (≈54% of rows).** COMP variants are scored against their constituent-pattern list (`composition` from `metadata.json`); the COMP checker *requires* it, and without it falls back to a generic regex battery that massively over-reports `FAITHFUL`. Earlier runs omitted it and reported an inflated ~45.6% overall (COMP alone read 58% faithful); both `faithfulness/report_2x2.py` and `scripts/rescore_faithfulness.py` now thread `composition`, which drops COMP to 33% faithful and the overall rate to 31.8%.
-- **Held-out patterns (`HO-*`, ≈14% of rows).** These post-cutoff patterns have no dedicated AST checker and fall through to a coarse structural fallback, so they essentially cannot earn a `FAITHFUL` verdict and weigh toward the unfaithful column. Authoring per-pattern held-out checkers is the remaining faithfulness-coverage gap.
+Faithful-family rate (FAITHFUL + FAITHFUL_ALTERNATIVE) by segment: base patterns 60.6%, COMP 65.0%, held-out 48.2%.
+
+**Faithfulness-scoring notes.**
+- **COMP composition (≈54% of rows).** COMP variants are scored against their constituent-pattern list (`composition` from `metadata.json`); the COMP checker *requires* it — without it a generic regex battery over-reports `FAITHFUL` (an earlier omission inflated the headline to ~45.6%). Both `scripts/rescore_faithfulness.py` and `faithfulness/report_2x2.py` thread `composition`, and `report_2x2.py` now consumes the canonical `faithfulness_cell` column written by the rescore (real `slow.c` + composition + the full checker registry) rather than recomputing per-row with an empty slow source.
+- **Held-out coverage (`HO-*`, ≈14% of rows) — gap closed.** All 36 held-out patterns now have dedicated per-pattern checkers (`faithfulness/checkers/held_out.py`); HO rows earn a real verdict (16.5% FAITHFUL, 31.7% FAITHFUL_ALTERNATIVE) instead of auto-failing the old coarse fallback. Several held-out patterns are un-fast *by design* on this single-socket test machine — inverted constant-time defenses that trade speed for leak-resistance, sub-1.5× tricks (shift-mask UB-guard elision), and NUMA/prefetch effects absent without remote DRAM — and correctly land in **(slow, FAITHFUL)**. The purely algorithmic held-out patterns (HLL/Count-Min sampling) read near-zero faithful because no model reproduces them, which is the contamination-defense working as intended.
 
 ### Findings from the sweep
 
 1. **A label helps speed but often hurts correctness.** A labeled strategy (pattern-aware *or* taxonomy-guided) beats generic on geomean speedup for **12 of 15** models — the 3 exceptions are all reasoning models, where the extra context slightly lowers speedup. Yet pattern-aware *reduces* pass@1 vs generic on **9 of 15** models. The category label pushes models toward a faster transformation at some cost to correctness.
 2. **The pattern-aware backfire effect reproduces — on correctness.** The clearest case is Qwen3-32B: pass@1 falls 64% → 52% (−12pp) from generic to pattern-aware, then recovers to 64% under taxonomy-guided. Taxonomy-guided generally recovers correctness that pattern-aware sheds (coder-7b 70→75, coder-14b 68→73, coder-32b 66→69).
 3. **Reasoning ≠ uniformly better.** Three ~32B reasoning recipes — DeepSeek-R1-distill (distilled CoT), QwQ (RL), Qwen3 (thinking) — span 44–64% pass@1 at the same size, with Qwen3-32B strongest on correctness and the R1-distill strongest on peak speedup.
 4. **Peak speedup is a fragile ranker.** DeepSeek-R1-Distill-Qwen-7B posts the single highest geomean (15.5×) but on only **21%** pass@1 — that mean is taken over a thin correct set. Rank by geomean *among models with healthy pass@1*, not by raw peak.
+5. **Category difficulty refutes the priors** (`scripts/category_difficulty.py`, full table in `results/category_difficulty.txt`). The hardest category by pass@1 is **DS** (47.9%, in the bottom-2 for **14 of 15** models), *not* IS; the easiest is **MI** (81.3%, top-2 for 14/15) — AL and SR sit mid-pack (~60%), so neither the "IS-hardest" nor "AL/SR-easiest" prior holds on correctness. IS *is* distinctly the hardest to **speed up** (1.24× geomean — barely above baseline even when correct). Within-category spread is large, though (DS 2–79%, IS 3–85%, HR 18–94%): category is a coarse proxy, and the aggregates are driven by individual killer patterns (DS-4 AoS→SoA at 2%, IS-5 alias-check fast-path at 3%). SR pays off most when solved (160× geomean, led by SR-3 redundant-aggregation hoisting).
+
+6. **Optimization skill is clustered, not monolithic** (`scripts/cross_pattern_transfer.py`). Across the 15 models, per-category pass@1 correlates only moderately (mean Spearman **+0.50**) — capability partly transfers but isn't a single axis. Two clusters stand out: a logic-restructuring group (**AL–CF +0.77**, AL–DS +0.67) and a data-reasoning link (**DS–IS +0.70**), while memory/IO is nearly independent of the rest (MI–SR +0.24, DS–MI +0.34). **AL is the best single predictor of overall model quality (+0.80); MI the worst (+0.57)** — most models clear the easy MI loop-swaps, so MI barely discriminates. (Part of the +0.50 baseline is just raw capability; the off-baseline pairs are the signal.)
 
-Still to analyze from the committed scored CSVs: per-category difficulty (the IS-hardest / AL-SR-easiest hypotheses), cross-pattern transfer Spearman correlations, and the fine-tune-vs-baseline paired-Wilcoxon test on the held-out set.
+7. **Fine-tuning the weak models did not transfer to held-out — and overfit the non-reasoning one** (`modal_app/finetune_weak3.py` → `scripts/finetune_transfer_summary.py`; held-out paired Wilcoxon, full table in `results/transfer_eval/summary.txt`). QLoRA-fine-tuning the 3 weakest models on the base+COMP training set (held-out excluded — guaranteed by authoring date: training data predates the held-out set) and evaluating on the 178 unseen held-out variants: the non-reasoning control **qwen2.5-coder-1.5b regressed significantly** — held-out pass@1 −39pp (generic) and −50pp (pattern-aware), both **p=0.001**. Its outputs hallucinate extern names and stop compiling — **catastrophic forgetting** the contamination-defense set surfaced (aggregate metrics would hide it). The two reasoning models (r1-distill-1.5b/7b) nudged upward off near-zero baselines (e.g. 7b pattern-aware 2.8→11.1%) as the SFT fixed their empty-output failure, but **no gain reached significance** (held-out paired n = 4–6). Net: narrow SFT on the training distribution overfits rather than generalizing — most damagingly on the small non-reasoning model.
 
 ---
 

diff --git a/faithfulness/checkers/__init__.py b/faithfulness/checkers/__init__.py
@@ -74,6 +74,45 @@
     SR4Checker,
     SR5Checker,
 )
+# Held-out (HO-*) checkers — added phase by phase per family.
+from .held_out import (
+    HOAL1Checker,
+    HOAL2Checker,
+    HOAL3Checker,
+    HOAL4Checker,
+    HOSR1Checker,
+    HOSR2Checker,
+    HOSR3Checker,
+    HOSR4Checker,
+    HOSR5Checker,
+    HOSR6Checker,
+    HOSR7Checker,
+    HOCF1Checker,
+    HOCF2Checker,
+    HOCF3Checker,
+    HOCF4Checker,
+    HOCF5Checker,
+    HODS1Checker,
+    HODS2Checker,
+    HODS3Checker,
+    HODS4Checker,
+    HODS5Checker,
+    HODS6Checker,
+    HOHR1Checker,
+    HOHR2Checker,
+    HOHR3Checker,
+    HOHR4Checker,
+    HOHR5Checker,
+    HOIS1Checker,
+    HOIS2Checker,
+    HOIS3Checker,
+    HOIS4Checker,
+    HOIS5Checker,
+    HOMI1Checker,
+    HOMI2Checker,
+    HOMI3Checker,
+    HOMI4Checker,
+)
 
 
 # ─────────────────────────────────────────────────────────────────────────────
@@ -113,6 +152,49 @@
     "MI-3":  MI3Checker(),
     "MI-4":  MI4Checker(),
     "COMP":  COMPChecker(),
+    # Held-out (HO-*) — Algorithmic Inefficiency family.
+    "HO-AL-1": HOAL1Checker(),
+    "HO-AL-2": HOAL2Checker(),
+    "HO-AL-3": HOAL3Checker(),
+    "HO-AL-4": HOAL4Checker(),
+    # Held-out (HO-*) — Semantic Redundancy family.
+    "HO-SR-1": HOSR1Checker(),
+    "HO-SR-2": HOSR2Checker(),
+    "HO-SR-3": HOSR3Checker(),
+    "HO-SR-4": HOSR4Checker(),
+    "HO-SR-5": HOSR5Checker(),
+    "HO-SR-6": HOSR6Checker(),
+    "HO-SR-7": HOSR7Checker(),
+    # Held-out (HO-*) — Control Flow family.
+    "HO-CF-1": HOCF1Checker(),
+    "HO-CF-2": HOCF2Checker(),
+    "HO-CF-3": HOCF3Checker(),
+    "HO-CF-4": HOCF4Checker(),
+    "HO-CF-5": HOCF5Checker(),
+    # Held-out (HO-*) — Data Structure Inefficiency family.
+    "HO-DS-1": HODS1Checker(),
+    "HO-DS-2": HODS2Checker(),
+    "HO-DS-3": HODS3Checker(),
+    "HO-DS-4": HODS4Checker(),
+    "HO-DS-5": HODS5Checker(),
+    "HO-DS-6": HODS6Checker(),
+    # Held-out (HO-*) — Human-Style Antipatterns family.
+    "HO-HR-1": HOHR1Checker(),
+    "HO-HR-2": HOHR2Checker(),
+    "HO-HR-3": HOHR3Checker(),
+    "HO-HR-4": HOHR4Checker(),
+    "HO-HR-5": HOHR5Checker(),
+    # Held-out (HO-*) — Input-Sensitive Inefficiency family.
+    "HO-IS-1": HOIS1Checker(),
+    "HO-IS-2": HOIS2Checker(),
+    "HO-IS-3": HOIS3Checker(),
+    "HO-IS-4": HOIS4Checker(),
+    "HO-IS-5": HOIS5Checker(),
+    # Held-out (HO-*) — Memory & IO family.
+    "HO-MI-1": HOMI1Checker(),
+    "HO-MI-2": HOMI2Checker(),
+    "HO-MI-3": HOMI3Checker(),
+    "HO-MI-4": HOMI4Checker(),
 }