Fix/coder32b fence extraction by wlu03 · Pull Request #14 · wlu03/pattern-driven-optimization-benchmark

wlu03 · 2026-06-10T23:40:49Z

No description provided.

36 per-pattern HO-* checkers (held_out.py) close the fallback gap so held-out rows earn real verdicts (16.5% FAITHFUL). report_2x2 consumes faithfulness_cell and reports four cells x fast/slow. Overall FAITHFUL 26.8% -> 29.1%.

scripts/category_difficulty.py refutes the IS-hardest/AL-SR-easiest priors: DS hardest by pass@1 (47.9%, bottom-2 for 14/15 models), MI easiest (81.3%); IS is hardest only to speed up (1.24x geomean). README finding added.
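
For reference, a minimal sketch of why a geometric mean suits speedup ratios, as in the 1.24x figure above. The helper name `geomean` and the sample ratios are made up for illustration and are not taken from `scripts/category_difficulty.py`.

```python
import math


def geomean(values):
    """Geometric mean of strictly positive ratios (e.g. per-task speedups)."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("ratios must be strictly positive")
    # Average in log space, then map back.
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Illustrative ratios only: one 2x speedup and one 2x slowdown.
    print(geomean([2.0, 0.5]))
```

Unlike an arithmetic mean, a 2x speedup and a 2x slowdown cancel to roughly 1.0 here, so one large outlier does not dominate a category's score.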

scripts/cross_pattern_transfer.py: per-category pass@1 correlates only moderately across 15 models (mean Spearman +0.50). Clusters AL-CF +0.77, DS-IS +0.70; MI most independent; AL best predictor of overall skill (+0.80).
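
A rough sketch of the statistic behind these numbers: Spearman correlation between two categories, each represented as a vector of per-model pass@1 values. This is a plain-Python stand-in, not the code in `scripts/cross_pattern_transfer.py`; the function names and the sample values are assumptions.

```python
import math


def average_ranks(xs):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Undefined (division by zero) if either input is constant.
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed inputs."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return pearson(average_ranks(a), average_ranks(b))


if __name__ == "__main__":
    # Made-up pass@1 values for four models in two categories.
    cat_a = [0.20, 0.45, 0.60, 0.80]
    cat_b = [0.10, 0.50, 0.40, 0.90]
    print(spearman(cat_a, cat_b))
```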

modal_app/finetune_weak3.py trains QLoRA on r1-distill-qwen-7b, yi-coder-9b, opencoder-8b (held-out excluded), merges to 16-bit, stages on the pdob-finetuned volume. inference.py registers *-ft model keys from that volume so eval is the unchanged pipeline.

Swap targets to the weakest fine-tune-friendly models (rescue experiment): r1-distill-qwen-1.5b (2.8%), r1-distill-qwen-7b (26.7%), qwen2.5-coder-1.5b (59.4%, non-reasoning control). inference.py *-ft keys synced.

Eval the 3 fine-tuned weak models on the 178 unseen held-out variants, paired Wilcoxon vs base. Result: no positive transfer — non-reasoning qwen2.5-coder-1.5b regresses significantly (held-out pass@1 -39/-50pp, p=0.001; hallucinated externs, catastrophic forgetting); reasoning models nudge up off ~0 baselines but not significantly. README finding #7.

modal_app/finetune_sweep.py: grid over epochs/lr/LoRA-rank/dropout + completion-only loss (Unsloth train_on_responses_only) + replay data (CodeAlpaca-20k mix) to fight the phase-1 overfitting. 2 subjects (qwen2.5-coder-1.5b regressor, r1-distill-7b) x 7 configs; inference.py registers the *-ft variants for held-out eval.

prepare_indist_split.py holds out whole base-pattern variants (79) for a clean in-distribution test (the old random split leaked 255/273 variants). finetune_indist.py sweeps epochs {1,3,6,10} on the clean split to map the in-dist-transfer vs OOD-forgetting crossover (researched recipe: lr 2e-4, alpha=2r, dropout 0.1, completion-only).
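
The leak fix amounts to splitting by group rather than by row. A minimal sketch of that idea, assuming each row carries a key naming its variant; the field name `variant_id`, the function name, and the fraction are hypothetical and not taken from `prepare_indist_split.py`.

```python
import random
from collections import defaultdict


def split_by_group(rows, group_key, test_fraction, seed=0):
    """Hold out whole groups so no group appears in both train and test."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    keys = sorted(groups)               # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys[n_test:] for r in groups[k]]
    test = [r for k in keys[:n_test] for r in groups[k]]
    return train, test, test_keys


if __name__ == "__main__":
    # Toy data: 10 variants with 3 rows each.
    rows = [{"variant_id": f"v{i // 3}", "idx": i} for i in range(30)]
    train, test, held_out = split_by_group(rows, "variant_id", 0.3)
    print(len(train), len(test), sorted(held_out))
```

A row-level random split would scatter each variant's rows across both sides, which is the kind of leak the commit describes.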

evaluate_all_modal spawns generation+CSV-write on Modal (survives --detach disconnect, unlike evaluate_all's .map). score_modal.py scores cells on Modal CPU; compiler.py honors PDOB_*_TIMEOUT env so broken candidates die fast.
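
One way an environment-driven timeout like this could look. This is a sketch under assumptions, not the code in `compiler.py`: the commit only gives the pattern `PDOB_*_TIMEOUT`, so the stage suffix is left as a parameter, and the `DEMO` stage and both helper names below are hypothetical.

```python
import os
import subprocess
import sys


def timeout_from_env(stage, default):
    """Read PDOB_<stage>_TIMEOUT (seconds); fall back to default if unset or invalid."""
    raw = os.environ.get(f"PDOB_{stage}_TIMEOUT")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def run_with_timeout(cmd, stage, default=30.0):
    """Run cmd; return its exit code, or None if it exceeded the timeout."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_from_env(stage, default),
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return None
    return proc.returncode


if __name__ == "__main__":
    os.environ["PDOB_DEMO_TIMEOUT"] = "0.5"  # hypothetical stage name
    hang = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hang, "DEMO"))
```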

Interrupted merges left config.json + tokenizer but no safetensors, which the idempotency check treated as 'already merged' (so they were skipped) and vLLM then couldn't load. Now check for safetensors and wipe+retrain partials. Add crossover_tick.sh to idempotently drive the epoch-sweep eval->score->crossover.
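
A minimal sketch of the stricter idempotency check described above, assuming the merged model lives in one directory with `config.json` next to one or more `*.safetensors` files. The function names are hypothetical and the real check in the repo may differ.

```python
import shutil
from pathlib import Path


def merge_is_complete(model_dir):
    """True only if both the config and at least one non-empty weight file exist."""
    d = Path(model_dir)
    if not d.is_dir():
        return False
    has_config = (d / "config.json").is_file()
    has_weights = any(
        p.is_file() and p.stat().st_size > 0 for p in d.glob("*.safetensors")
    )
    return has_config and has_weights


def skip_or_reset(model_dir):
    """Return True if the merge can be skipped; otherwise wipe any partial output."""
    d = Path(model_dir)
    if merge_is_complete(d):
        return True
    if d.exists():
        shutil.rmtree(d)  # partial merge: config/tokenizer without weights
    return False
```

A non-empty file can still be truncated, so this only catches the failure mode in the commit message (no safetensors at all), not every kind of interrupted write.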

The orchestrator checkpoints incrementally, so a still-generating eval CSV looked ready and got scored on a partial (26/257 rows). Only score when all 257 in-dist+OOD rows are present; only mark DONE when the scored CSV is complete.
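
The gate described above can be sketched as a row count against the expected total. This assumes the eval CSV has a single header row and one data row per example; the function names are hypothetical, and 257 is the figure from the commit message.

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 257  # in-dist + OOD rows, per the commit message


def count_data_rows(csv_path):
    """Number of non-empty data rows, excluding the header; 0 if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        return 0
    with path.open(newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed present)
        return sum(1 for row in reader if row)


def ready_to_score(csv_path, expected=EXPECTED_ROWS):
    """Only allow scoring once every expected row has been written."""
    return count_data_rows(csv_path) >= expected
```

With this check, a checkpointed CSV holding 26 of 257 rows is left alone until generation finishes.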

A prematurely-scored cell (e.g. 36 rows from a partial eval) polluted the table with tiny-denominator garbage. Require ~257 rows or mark the cell incomplete.

wlu03 added 12 commits June 7, 2026 18:44

feat(analysis): per-category difficulty

92f29fc

feat(analysis): cross-pattern transfer correlations

eba90e3

chore(finetune): retarget to the 3 weakest models incl. a 1.5B

4e791fe

feat(modal): survivable server-side eval + scoring

7cebc94

fix(crossover): ignore incomplete cells (<250 rows)

3f40cf2

wlu03 wants to merge 12 commits into main from fix/coder32b-fence-extraction
