FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170
FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170kostandinang wants to merge 9 commits into
Conversation
PR SummaryMedium Risk Overview The Petri
Reviewed by Cursor Bugbot for commit a537038. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 85753f7. Configure here.
d064cdd to
1cbed39
Compare
`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix code during evaluation and report `done` on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Add per-action tool scoping (`toolsForAction`): the evaluator is read-only; code-producing actions keep the full toolset. Threads `tools` through `runPi`. Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a standalone component or Ladle story could satisfy without the feature working. It now executes the slice's verification targets (`bun test`) and gates `done` on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every target passing. The evaluator no longer dispatches pi, so it cannot mutate the sandbox at all. Removes the now-dead `extractJson` helper. Completes the cook-harness-fidelity frontier (Slice 2). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…stics `bun test` writes its results (failure detail, pass/fail counts) to stderr and only the version banner to stdout. The execSync failure path returned `err.stdout`, so every failing tests-run / epic-verified report carried just "bun test v1.3.14" with zero diagnostics (observed on cook run 289c9843). Switch to spawnSync and concatenate both streams; the arg-array form also drops shell interpolation of the target path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Slice 2 made the evaluator a deterministic oracle (it executes the verification targets; `done` = all pass), which makes the test-writer the load-bearing oracle: a test that passes without exercising the slice's behavior now marks broken code DONE. Port ln-build discipline into the two live writer prompts to close that gap: - test-writer: orient first (read + match the repo's conventions); one observable behavior per test mapped to an acceptance criterion; test through the public interface; make the red meaningful; ban trivially-passing tests (a false oracle under the deterministic gate). - code-writer: orient + inside-out build; no speculative abstraction; never weaken a test to go green; pre-release deletion posture. - code-writer: drop the hardcoded "TypeScript" — derive the language from the repo (the harness is meant to be host-agnostic; the remaining `bun test` coupling in test-writer is the executor's, to be removed with the ProjectProfile/toolchain adapter). Delete evaluator.md: dead since Slice 2 stopped dispatching pi for evaluation (verified zero references repo-wide). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…ter-prompt slice The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open. Mark it resolved with commit pointers, and extend the FE-813 status with Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host test-runner decoupling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
ln-sync of the cook-harness-fidelity work: records the read-only, execution-derived completion oracle as Active Decision D161-K and Critical Invariant I126-K (protected by pi-actions / engine-contract / brownfield-smoke tests), and points the PLAN frontier status at them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Adds SPEC A98 (brownfield integration-demanding verification without emitter host introspection) and the resolved 2026-06-04 spike note under the spec-to-cook-plan frontier: integration intent is latent in criterion prose, carried via kind:'integration-test' + prose into the runnable target; agent authors the app-reachability test at run time. Co-authored-by: Claude <noreply@anthropic.com>
Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake. Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
1cbed39 to
a537038
Compare


What
Makes
brunch cook's per-slice done trustworthy. The evaluator was an LLM judging criterion prose withread,write,edit,bash— it could fix code mid-evaluation and rubber-stampdone(collapsing the write-tests → write-code → evaluate loop) and pass orphan output.evaluate-donenow runs the verification targets (BunTestRunner) and gatesdoneon real pass/fail — requires ≥1 target and every target passing. It no longer callspi.toolsForAction): the evaluator can't mutate the sandbox; code-producing actions keep write tools.evaluator.mdprompt; hardened the writer prompts.Limit
donenow requires real test execution, not integration. Targets are agent-authored against the emitter's synthesized paths, so a weak/isolated target can still pass — guaranteeing a wired feature needs the emitter to emit integration-demanding targets (the FE-800 integration-blind-verification follow-on).Verify
pi-actions.test.ts+test-runner.test.ts+npm run checkpass. Two full-suite failures (build-boundary.test.ts,app.test.ts) are a fe-800-base failure and a parallel-run flake — not this diff (see #167).🤖 Generated with Claude Code