Fix/coder32b fence extraction #14

Open
wlu03 wants to merge 12 commits into main from fix/coder32b-fence-extraction

wlu03 (Owner) commented Jun 10, 2026

No description provided.

wlu03 added 12 commits June 7, 2026 18:44

36 per-pattern HO-* checkers (held_out.py) close the fallback gap so held-out
rows earn real verdicts (16.5% FAITHFUL). report_2x2 consumes faithfulness_cell
and reports four cells x fast/slow. Overall FAITHFUL 26.8% -> 29.1%.

scripts/category_difficulty.py refutes the IS-hardest/AL-SR-easiest priors:
DS hardest by pass@1 (47.9%, bottom-2 for 14/15 models), MI easiest (81.3%);
IS is hardest only to speed up (1.24x geomean). README finding added.

scripts/cross_pattern_transfer.py: per-category pass@1 correlates only
moderately across 15 models (mean Spearman +0.50). Clusters AL-CF +0.77,
DS-IS +0.70; MI most independent; AL best predictor of overall skill (+0.80).
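
The "mean Spearman" figure above could be computed along these lines. This is a minimal stdlib-only sketch, not the code in scripts/cross_pattern_transfer.py: the input shape (a dict mapping each category to one pass@1 value per model, in the same model order) and all function names are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_pairwise_spearman(pass_at_1):
    """Average Spearman correlation over every unordered pair of categories."""
    pairs = combinations(sorted(pass_at_1), 2)
    return mean(spearman(pass_at_1[a], pass_at_1[b]) for a, b in pairs)
```

With toy numbers (not results from this PR) such as `{"AL": [0.2, 0.5, 0.9], "CF": [0.1, 0.4, 0.8], "MI": [0.9, 0.5, 0.2]}`, AL and CF rank the three models identically while MI ranks them in reverse, so the per-pair values show which categories cluster and which stand apart.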

modal_app/finetune_weak3.py trains QLoRA on r1-distill-qwen-7b, yi-coder-9b,
opencoder-8b (held-out excluded), merges to 16-bit, stages on the pdob-finetuned
volume. inference.py registers *-ft model keys from that volume so eval is the
unchanged pipeline.

Swap targets to the weakest fine-tune-friendly models (rescue experiment):
r1-distill-qwen-1.5b (2.8%), r1-distill-qwen-7b (26.7%), qwen2.5-coder-1.5b
(59.4%, non-reasoning control). inference.py *-ft keys synced.

Eval the 3 fine-tuned weak models on the 178 unseen held-out variants, paired
Wilcoxon vs base. Result: no positive transfer. Non-reasoning qwen2.5-coder-1.5b
regresses significantly (held-out pass@1 -39/-50pp, p=0.001; hallucinated
externs, catastrophic forgetting); reasoning models nudge up off ~0 baselines
but not significantly. README finding #7.

modal_app/finetune_sweep.py: grid over epochs/lr/LoRA-rank/dropout + completion-only
loss (Unsloth train_on_responses_only) + replay data (CodeAlpaca-20k mix) to fight
the phase-1 overfitting. 2 subjects (qwen2.5-coder-1.5b regressor, r1-distill-7b)
x 7 configs; inference.py registers the *-ft variants for held-out eval.

prepare_indist_split.py holds out whole base-pattern variants (79) for a clean
in-distribution test (the old random split leaked 255/273 variants). finetune_indist.py
sweeps epochs {1,3,6,10} on the clean split to map the in-dist-transfer vs OOD-forgetting
crossover (researched recipe: lr 2e-4, alpha=2r, dropout 0.1, completion-only).
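
The leak fix in the split above comes down to splitting by group rather than by row. The sketch below illustrates that idea only; it is not the contents of prepare_indist_split.py. The `base_pattern` field name, the function names, and the row layout are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_base_pattern(rows, n_holdout, seed=0):
    """Hold out every variant of n_holdout randomly chosen base patterns.

    Each row is assumed to be a dict with a "base_pattern" key (hypothetical
    field name). All variants of a base pattern land on the same side.
    """
    by_base = defaultdict(list)
    for row in rows:
        by_base[row["base_pattern"]].append(row)
    bases = sorted(by_base)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(bases)
    test = [r for b in bases[:n_holdout] for r in by_base[b]]
    train = [r for b in bases[n_holdout:] for r in by_base[b]]
    return train, test


def leaked_bases(train, test):
    """Base patterns that appear on both sides; empty for a clean split."""
    return {r["base_pattern"] for r in train} & {r["base_pattern"] for r in test}
```

A row-level random split would instead scatter variants of the same base pattern across train and test, which is the kind of leak the commit message describes; `leaked_bases` gives a cheap assertion to run before training.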

evaluate_all_modal spawns generation+CSV-write on Modal (survives --detach
disconnect, unlike evaluate_all's .map). score_modal.py scores cells on Modal
CPU; compiler.py honors PDOB_*_TIMEOUT env so broken candidates die fast.
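
An environment-driven timeout of this kind might look like the following. This is a sketch under assumptions, not the code in compiler.py: the helper names and the 30-second default are invented here, and the caller passes whichever concrete PDOB_*_TIMEOUT variable applies (the commit message does not spell the names out, so none is hard-coded).

```python
import os
import subprocess


def timeout_from_env(name, default):
    """Read a positive timeout in seconds from the environment, else use default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # ignore malformed values rather than crash
    return value if value > 0 else default


def run_with_timeout(cmd, env_var, default=30.0):
    """Run cmd; return its exit code, or None if it exceeded the timeout.

    subprocess.run kills the child process when the timeout expires, so a
    hanging candidate cannot stall the rest of the scoring run.
    """
    limit = timeout_from_env(env_var, default)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=limit)
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode
```

Falling back to the default on a missing or malformed value keeps local runs working with no configuration, while a batch job can export a small value to fail fast.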

Interrupted merges left config.json + tokenizer but no safetensors, which the
idempotency check treated as 'already merged' (so they were skipped) and vLLM
then couldn't load them. Now check for safetensors and wipe+retrain partials. Add
crossover_tick.sh to idempotently drive the epoch-sweep eval->score->crossover.
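
The stricter idempotency check described above can be sketched as follows. The function names and the flat directory layout are assumptions, and the actual check in the fine-tune scripts may differ (for example, it may also validate sharded-weight index files).

```python
import shutil
from pathlib import Path


def merge_is_complete(model_dir):
    """True only if the directory has config.json AND non-empty weight files."""
    d = Path(model_dir)
    if not d.is_dir():
        return False
    has_config = (d / "config.json").is_file()
    has_weights = any(p.stat().st_size > 0 for p in d.glob("*.safetensors"))
    return has_config and has_weights


def prepare_output_dir(model_dir):
    """Return True if the merge can be skipped; otherwise wipe partials.

    A directory holding only config/tokenizer files is treated as a partial
    merge: it is deleted and recreated empty so the job reruns from scratch.
    """
    d = Path(model_dir)
    if merge_is_complete(d):
        return True
    if d.exists():
        shutil.rmtree(d)
    d.mkdir(parents=True)
    return False
```

Keying the "already done" decision on the last artifact written, rather than the first, is what makes the skip safe after an interruption.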

The orchestrator checkpoints incrementally, so a still-generating eval CSV
looked ready and got scored while partial (26/257 rows). Only score when all
257 in-dist+OOD rows are present; only mark DONE when the scored CSV is complete.

A prematurely-scored cell (e.g. 36 rows from a partial eval) polluted the table
with tiny-denominator garbage. Require ~257 rows or mark the cell incomplete.
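
The row-count gate from the last two commits can be illustrated like this. It is a sketch, not the logic in crossover_tick.sh or the scoring scripts: it assumes each CSV has one header row, and the GENERATING and READY_TO_SCORE status names are hypothetical (only DONE appears in the commit message).

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 257  # in-dist + OOD rows per cell, per the commit message


def count_data_rows(csv_path):
    """Count non-empty rows after the header; 0 if the file does not exist."""
    path = Path(csv_path)
    if not path.is_file():
        return 0
    with path.open(newline="") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header row (assumed present)
        return sum(1 for row in reader if row)


def cell_status(eval_csv, scored_csv, expected=EXPECTED_ROWS):
    """Classify a cell by how complete its eval and scored CSVs are."""
    if count_data_rows(eval_csv) < expected:
        return "GENERATING"  # eval still checkpointing; do not score yet
    if count_data_rows(scored_csv) < expected:
        return "READY_TO_SCORE"  # a missing or partial score file gets redone
    return "DONE"
```

Because the status is derived from the files on every call, a driver script can invoke it repeatedly without tracking state, and a partially scored cell is rescored instead of being reported.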