30 changes: 19 additions & 11 deletions README.md
@@ -52,28 +52,36 @@ The patterns are organized by *why* the compiler fails to fix them:
| deepseek-r1-distill-llama-70b | 70 | reasoning | 46 / 42 / 46 | 13.4 / 12.3 / 12.2 |
| qwen2.5-72b | 72 | general | 57 / 54 / 62 | 6.8 / 14.1 / 10.3 |

**Faithful × fast 2×2** (overall, n=55,980; "fast" = `speedup_vs_slow` > 1.5; full per-model/per-pattern breakdown in `results/aggregate_2x2/report.txt`):
**Faithfulness cells × fast/slow** (overall, n=55,980; "fast" = `speedup_vs_slow` > 1.5; full per-model/per-pattern breakdown in `results/aggregate_2x2/report.txt`). The two-axis cascade (equivalence × expected-shape) routes each attempt into one of four cells rather than a binary faithful/unfaithful split:

| | Faithful | Unfaithful | Row |
|---|---|---|---|
| **Fast** | 18.7% | 11.5% | 30.2% |
| **Slow** | 13.0% | 56.7% | 69.8% |
| **Col** | 31.8% | 68.2% | 100% |
| | FAITHFUL | FAITHFUL_ALTERNATIVE | STRUCTURAL_ONLY | FAILED | Row |
|---|---|---|---|---|---|
| **Fast** | 19.2% | 10.9% | 0.2% | 0.0% | 30.2% |
| **Slow** | 9.9% | 21.2% | 18.9% | 19.7% | 69.8% |
| **Col** | 29.1% | 32.1% | 19.1% | 19.7% | 100% |

Per-strategy faithful rates are close: generic 30.0%, pattern-aware 33.8%, taxonomy-guided 31.5%.
- **FAITHFUL** — performed the labeled transformation *and* stays equivalent.
- **FAITHFUL_ALTERNATIVE** — equivalent via a *different* valid transformation; deliberately not conflated with failure.
- **STRUCTURAL_ONLY** — has the expected shape but breaks correctness (overfit / DCE / hardcoded output).
- **FAILED** — neither.
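
The four cells above follow from two boolean checks. A minimal sketch of that routing, assuming each attempt has already been reduced to an `equivalent` flag and an `expected_shape` flag (both names and the `route_cell` helper are hypothetical, not taken from the repository):

```python
from enum import Enum


class Cell(Enum):
    FAITHFUL = "FAITHFUL"
    FAITHFUL_ALTERNATIVE = "FAITHFUL_ALTERNATIVE"
    STRUCTURAL_ONLY = "STRUCTURAL_ONLY"
    FAILED = "FAILED"


def route_cell(equivalent: bool, expected_shape: bool) -> Cell:
    """Map the two axes (equivalence x expected-shape) onto one of four cells.

    `equivalent` and `expected_shape` are assumed inputs; how the real
    checkers derive them is not shown here.
    """
    if equivalent and expected_shape:
        return Cell.FAITHFUL              # labeled transformation, still correct
    if equivalent:
        return Cell.FAITHFUL_ALTERNATIVE  # correct via a different transformation
    if expected_shape:
        return Cell.STRUCTURAL_ONLY       # right shape, wrong behavior
    return Cell.FAILED                    # neither
```

Keeping `FAITHFUL_ALTERNATIVE` separate from `FAILED` is what lets an equivalent but unexpected rewrite count toward the faithful family.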

**Faithfulness-scoring caveats.** Two structural factors shape this aggregate, and the headline rate is sensitive to both:
- **COMP composition (≈54% of rows).** COMP variants are scored against their constituent-pattern list (`composition` from `metadata.json`); the COMP checker *requires* it, and without it falls back to a generic regex battery that massively over-reports `FAITHFUL`. Earlier runs omitted it and reported an inflated ~45.6% overall (COMP alone read 58% faithful); both `faithfulness/report_2x2.py` and `scripts/rescore_faithfulness.py` now thread `composition`, which drops COMP to 33% faithful and the overall rate to 31.8%.
- **Held-out patterns (`HO-*`, ≈14% of rows).** These post-cutoff patterns have no dedicated AST checker and fall through to a coarse structural fallback, so they essentially cannot earn a `FAITHFUL` verdict and weigh toward the unfaithful column. Authoring per-pattern held-out checkers is the remaining faithfulness-coverage gap.
Faithful-family rate (FAITHFUL + FAITHFUL_ALTERNATIVE) by segment: base patterns 60.6%, COMP 65.0%, held-out 48.2%.
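
As a rough consistency check, the per-segment rates should recombine to the overall faithful-family rate (29.1% + 32.1% = 61.2%) when weighted by row share. The sketch below assumes the three segments partition the rows and uses the approximate shares quoted in this README (COMP ≈54%, held-out ≈14%, base as the remainder), so agreement is only expected to within rounding:

```python
# Approximate row shares; the base share is inferred as the remainder (assumption).
shares = {"base": 1.0 - 0.54 - 0.14, "COMP": 0.54, "held-out": 0.14}
# Faithful-family rate (FAITHFUL + FAITHFUL_ALTERNATIVE) per segment, in percent.
rates = {"base": 60.6, "COMP": 65.0, "held-out": 48.2}

weighted = sum(shares[s] * rates[s] for s in shares)
overall = 29.1 + 32.1  # Col totals for FAITHFUL and FAITHFUL_ALTERNATIVE

print(f"{weighted:.1f} vs {overall:.1f}")  # prints "61.2 vs 61.2"
```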

**Faithfulness-scoring notes.**
- **COMP composition (≈54% of rows).** COMP variants are scored against their constituent-pattern list (`composition` from `metadata.json`); the COMP checker *requires* it — without it a generic regex battery over-reports `FAITHFUL` (an earlier omission inflated the headline to ~45.6%). Both `scripts/rescore_faithfulness.py` and `faithfulness/report_2x2.py` thread `composition`, and `report_2x2.py` now consumes the canonical `faithfulness_cell` column written by the rescore (real `slow.c` + composition + the full checker registry) rather than recomputing per-row with an empty slow source.
- **Held-out coverage (`HO-*`, ≈14% of rows): gap closed.** All 36 held-out patterns now have dedicated per-pattern checkers (`faithfulness/checkers/held_out.py`); HO rows earn a real verdict (16.5% FAITHFUL, 31.7% FAITHFUL_ALTERNATIVE) instead of auto-failing under the old coarse fallback. Several held-out patterns are not fast *by design* on this single-socket test machine and correctly land in **(slow, FAITHFUL)**: inverted constant-time defenses that trade speed for leak resistance, sub-1.5× tricks (shift-mask UB-guard elision), and NUMA/prefetch effects that do not appear without remote DRAM. The purely algorithmic held-out patterns (HLL/Count-Min sampling) read near-zero faithful because no model reproduces them, which is the contamination defense working as intended.
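
For orientation, the cell × fast/slow table can be rebuilt from scored rows with a plain tally. This is a sketch only: it assumes each row exposes the `faithfulness_cell` and `speedup_vs_slow` columns named above, the `cell_by_speed_table` helper is hypothetical, and the toy rows are illustrative rather than real results.

```python
from collections import Counter

CELLS = ("FAITHFUL", "FAITHFUL_ALTERNATIVE", "STRUCTURAL_ONLY", "FAILED")
FAST_THRESHOLD = 1.5  # "fast" means speedup_vs_slow > 1.5


def cell_by_speed_table(rows):
    """Return {(speed, cell): percent of all rows} for the 2 x 4 table."""
    counts = Counter()
    for row in rows:
        fast = float(row["speedup_vs_slow"]) > FAST_THRESHOLD
        counts[("Fast" if fast else "Slow", row["faithfulness_cell"])] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rows to tabulate")
    return {
        (speed, cell): 100.0 * counts[(speed, cell)] / total
        for speed in ("Fast", "Slow")
        for cell in CELLS
    }


# Illustrative rows only (not taken from the scored CSVs).
rows = [
    {"faithfulness_cell": "FAITHFUL", "speedup_vs_slow": "2.4"},
    {"faithfulness_cell": "FAITHFUL_ALTERNATIVE", "speedup_vs_slow": "1.1"},
    {"faithfulness_cell": "STRUCTURAL_ONLY", "speedup_vs_slow": "1.0"},
    {"faithfulness_cell": "FAILED", "speedup_vs_slow": "0.9"},
]
table = cell_by_speed_table(rows)
```

Reading the cell from a single canonical column, rather than re-deriving it per row, keeps the report and the rescore from disagreeing.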

### Findings from the sweep

1. **A label helps speed but often hurts correctness.** A labeled strategy (pattern-aware *or* taxonomy-guided) beats generic on geomean speedup for **12 of 15** models — the 3 exceptions are all reasoning models, where the extra context slightly lowers speedup. Yet pattern-aware *reduces* pass@1 vs generic on **9 of 15** models. The category label pushes models toward a faster transformation at some cost to correctness.
2. **The pattern-aware backfire effect reproduces — on correctness.** The clearest case is Qwen3-32B: pass@1 falls 64% → 52% (−12pp) from generic to pattern-aware, then recovers to 64% under taxonomy-guided. Taxonomy-guided generally recovers correctness that pattern-aware sheds (coder-7b 70→75, coder-14b 68→73, coder-32b 66→69).
3. **Reasoning ≠ uniformly better.** Three ~32B reasoning recipes — DeepSeek-R1-distill (distilled CoT), QwQ (RL), Qwen3 (thinking) — span 44–64% pass@1 at the same size, with Qwen3-32B strongest on correctness and the R1-distill strongest on peak speedup.
4. **Peak speedup is a fragile ranker.** DeepSeek-R1-Distill-Qwen-7B posts the single highest geomean (15.5×) but on only **21%** pass@1 — that mean is taken over a thin correct set. Rank by geomean *among models with healthy pass@1*, not by raw peak.
5. **Category difficulty refutes the priors** (`scripts/category_difficulty.py`, full table in `results/category_difficulty.txt`). The hardest category by pass@1 is **DS** (47.9%, in the bottom-2 for **14 of 15** models), *not* IS; the easiest is **MI** (81.3%, top-2 for 14/15) — AL and SR sit mid-pack (~60%), so neither the "IS-hardest" nor "AL/SR-easiest" prior holds on correctness. IS *is* distinctly the hardest to **speed up** (1.24× geomean — barely above baseline even when correct). Within-category spread is large, though (DS 2–79%, IS 3–85%, HR 18–94%): category is a coarse proxy, and the aggregates are driven by individual killer patterns (DS-4 AoS→SoA at 2%, IS-5 alias-check fast-path at 3%). SR pays off most when solved (160× geomean, led by SR-3 redundant-aggregation hoisting).

6. **Optimization skill is clustered, not monolithic** (`scripts/cross_pattern_transfer.py`). Across the 15 models, per-category pass@1 correlates only moderately (mean Spearman **+0.50**) — capability partly transfers but isn't a single axis. Two clusters stand out: a logic-restructuring group (**AL–CF +0.77**, AL–DS +0.67) and a data-reasoning link (**DS–IS +0.70**), while memory/IO is nearly independent of the rest (MI–SR +0.24, DS–MI +0.34). **AL is the best single predictor of overall model quality (+0.80); MI the worst (+0.57)** — most models clear the easy MI loop-swaps, so MI barely discriminates. (Part of the +0.50 baseline is just raw capability; the off-baseline pairs are the signal.)

Still to analyze from the committed scored CSVs: per-category difficulty (the IS-hardest / AL-SR-easiest hypotheses), cross-pattern transfer Spearman correlations, and the fine-tune-vs-baseline paired-Wilcoxon test on the held-out set.
7. **Fine-tuning the weak models did not transfer to held-out, and overfit the non-reasoning one** (`modal_app/finetune_weak3.py` → `scripts/finetune_transfer_summary.py`; held-out paired Wilcoxon, full table in `results/transfer_eval/summary.txt`). The 3 weakest models were QLoRA-fine-tuned on the base+COMP training set (held-out excluded, guaranteed by authoring date: the training data predates the held-out set) and evaluated on the 178 unseen held-out variants. The non-reasoning control **qwen2.5-coder-1.5b regressed significantly**: held-out pass@1 fell 39pp (generic) and 50pp (pattern-aware), both **p=0.001**. Its outputs hallucinate extern names and stop compiling, a case of **catastrophic forgetting** that the contamination-defense set surfaced (aggregate metrics would hide it). The two reasoning models (r1-distill-1.5b/7b) nudged upward off near-zero baselines (e.g. 7b pattern-aware 2.8→11.1%) as the SFT fixed their empty-output failure, but **no gain reached significance** (held-out paired n = 4–6). Net: narrow SFT on the training distribution overfits rather than generalizing, most damagingly on the small non-reasoning model.
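
To make the paired test in finding 7 concrete, here is a minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test. It is an illustration, not the code in `scripts/finetune_transfer_summary.py`; whether that script uses the exact distribution or a normal approximation is not stated here, and the `wilcoxon_exact` name is hypothetical. With n non-zero pairs, the smallest p-value this exact test can return is 2 / 2^n, which is worth keeping in mind when n is small.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_exact(before, after):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero diffs
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under H0 every sign pattern is equally likely; count those at least
    # as far from the center as the observed statistic.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n


# Six pairs that all improve: the most extreme outcome possible for n = 6.
p_six = wilcoxon_exact([0, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6])  # 2 / 64
```

The enumeration is exponential in n, so this form only suits small paired samples.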

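The cross-category correlations in finding 6 are Spearman rank correlations over per-model pass@1. A self-contained sketch of that statistic, assuming one pass@1 value per model for each of two categories (the `spearman` helper and the toy numbers are hypothetical, not repository code or results):

```python
from math import sqrt


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(vx * vy)


# Toy per-model pass@1 for two categories (made-up, perfectly monotone).
al_pass = [0.40, 0.55, 0.62, 0.70]
cf_pass = [0.35, 0.50, 0.58, 0.66]
rho = spearman(al_pass, cf_pass)
```

Because only ranks enter the statistic, it measures whether models are ordered the same way across categories, not whether their absolute pass@1 values match.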
---

82 changes: 82 additions & 0 deletions faithfulness/checkers/__init__.py
@@ -74,6 +74,45 @@
SR4Checker,
SR5Checker,
)
# Held-out (HO-*) checkers — added phase by phase per family.
from .held_out import (
HOAL1Checker,
HOAL2Checker,
HOAL3Checker,
HOAL4Checker,
HOSR1Checker,
HOSR2Checker,
HOSR3Checker,
HOSR4Checker,
HOSR5Checker,
HOSR6Checker,
HOSR7Checker,
HOCF1Checker,
HOCF2Checker,
HOCF3Checker,
HOCF4Checker,
HOCF5Checker,
HODS1Checker,
HODS2Checker,
HODS3Checker,
HODS4Checker,
HODS5Checker,
HODS6Checker,
HOHR1Checker,
HOHR2Checker,
HOHR3Checker,
HOHR4Checker,
HOHR5Checker,
HOIS1Checker,
HOIS2Checker,
HOIS3Checker,
HOIS4Checker,
HOIS5Checker,
HOMI1Checker,
HOMI2Checker,
HOMI3Checker,
HOMI4Checker,
)


# ─────────────────────────────────────────────────────────────────────────────
@@ -113,6 +152,49 @@
"MI-3": MI3Checker(),
"MI-4": MI4Checker(),
"COMP": COMPChecker(),
# Held-out (HO-*) — Algorithmic Inefficiency family.
"HO-AL-1": HOAL1Checker(),
"HO-AL-2": HOAL2Checker(),
"HO-AL-3": HOAL3Checker(),
"HO-AL-4": HOAL4Checker(),
# Held-out (HO-*) — Semantic Redundancy family.
"HO-SR-1": HOSR1Checker(),
"HO-SR-2": HOSR2Checker(),
"HO-SR-3": HOSR3Checker(),
"HO-SR-4": HOSR4Checker(),
"HO-SR-5": HOSR5Checker(),
"HO-SR-6": HOSR6Checker(),
"HO-SR-7": HOSR7Checker(),
# Held-out (HO-*) — Control Flow family.
"HO-CF-1": HOCF1Checker(),
"HO-CF-2": HOCF2Checker(),
"HO-CF-3": HOCF3Checker(),
"HO-CF-4": HOCF4Checker(),
"HO-CF-5": HOCF5Checker(),
# Held-out (HO-*) — Data Structure Inefficiency family.
"HO-DS-1": HODS1Checker(),
"HO-DS-2": HODS2Checker(),
"HO-DS-3": HODS3Checker(),
"HO-DS-4": HODS4Checker(),
"HO-DS-5": HODS5Checker(),
"HO-DS-6": HODS6Checker(),
# Held-out (HO-*) — Human-Style Antipatterns family.
"HO-HR-1": HOHR1Checker(),
"HO-HR-2": HOHR2Checker(),
"HO-HR-3": HOHR3Checker(),
"HO-HR-4": HOHR4Checker(),
"HO-HR-5": HOHR5Checker(),
# Held-out (HO-*) — Input-Sensitive Inefficiency family.
"HO-IS-1": HOIS1Checker(),
"HO-IS-2": HOIS2Checker(),
"HO-IS-3": HOIS3Checker(),
"HO-IS-4": HOIS4Checker(),
"HO-IS-5": HOIS5Checker(),
# Held-out (HO-*) — Memory & IO family.
"HO-MI-1": HOMI1Checker(),
"HO-MI-2": HOMI2Checker(),
"HO-MI-3": HOMI3Checker(),
"HO-MI-4": HOMI4Checker(),
}

