Skip to content

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170

Open
kostandinang wants to merge 9 commits into
ka/fe-800-spec-to-cook-planfrom
ka/fe-813-cook-harness-fidelity
Open

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170
kostandinang wants to merge 9 commits into
ka/fe-800-spec-to-cook-planfrom
ka/fe-813-cook-harness-fidelity

Conversation

@kostandinang
Copy link
Copy Markdown
Contributor

@kostandinang kostandinang commented Jun 4, 2026

What

Makes brunch cook's per-slice done trustworthy. The evaluator was an LLM judging criterion prose with read,write,edit,bash — it could fix code mid-evaluation and rubber-stamp done (collapsing the write-tests → write-code → evaluate loop) and pass orphan output.

  • evaluate-done now runs the verification targets (BunTestRunner) and gates done on real pass/fail — requires ≥1 target and every target passing. It no longer calls pi.
  • Per-action tool scoping (toolsForAction): the evaluator can't mutate the sandbox; code-producing actions keep write tools.
  • Dropped the now-dead evaluator.md prompt; hardened the writer prompts.

Limit

done now requires real test execution, not integration. Targets are agent-authored against the emitter's synthesized paths, so a weak/isolated target can still pass — guaranteeing a wired feature needs the emitter to emit integration-demanding targets (the FE-800 integration-blind-verification follow-on).

Verify

pi-actions.test.ts + test-runner.test.ts + npm run check pass. Two full-suite failures (build-boundary.test.ts, app.test.ts) are a fe-800-base failure and a parallel-run flake — not this diff (see #167).

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor Author

kostandinang commented Jun 4, 2026

@kostandinang kostandinang changed the title FE-813: Scope the cook evaluator to read-only tools FE-813: Cook harness fidelity — trustworthy per-slice completion signal Jun 4, 2026
@kostandinang kostandinang marked this pull request as ready for review June 4, 2026 16:27
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 4, 2026

PR Summary

Medium Risk
Changes how cook runs decide slice success (orchestrator TDD path), but behavior is covered by new unit tests and existing contract tests; integration reachability still depends on future emitter/harness work, not this PR alone.

Overview
Cook harness fidelity (FE-813): Per-slice completion is no longer an LLM verdict. evaluate-done runs bun test on every slice verification target via evaluateVerificationTargets (≥1 target, all must pass) and does not invoke pi. toolsForAction limits evaluate-done to read-only tools so evaluation cannot fix code mid-loop; write actions keep full tools.

The Petri run-tests handler now executes all verification targets (not only the first). BunTestRunner uses spawnSync with argv (no shell) and merges stdout + stderr so failure reports stay useful. verify-epic reuses the same runBunTest path.

evaluator.md is removed; test-writer / code-writer prompts are tightened so tests act as the oracle. SPEC adds D161-K, I126-K, and assumption A98; PLAN documents cook-harness-fidelity and marks the brownfield evaluate-done regression resolved. build-boundary.test.ts timeout is raised to 120s to reduce parallel CI flakes.

Reviewed by Cursor Bugbot for commit a537038. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 85753f7. Configure here.

Comment thread src/orchestrator/src/pi-actions.ts
Comment thread src/orchestrator/src/pi-actions.ts Outdated
@kostandinang kostandinang changed the title FE-813: Cook harness fidelity — trustworthy per-slice completion signal FE-813: Cook evaluator runs the verification tests instead of an LLM verdict Jun 4, 2026
@kostandinang kostandinang self-assigned this Jun 4, 2026
@kostandinang kostandinang force-pushed the ka/fe-813-cook-harness-fidelity branch from d064cdd to 1cbed39 Compare June 5, 2026 07:50
kostandinang and others added 9 commits June 5, 2026 09:50
`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix
code during evaluation and report `done` on the first call — collapsing the
write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke).

Add per-action tool scoping (`toolsForAction`): the evaluator is read-only;
code-producing actions keep the full toolset. Threads `tools` through `runPi`.
Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a
standalone component or Ladle story could satisfy without the feature working.
It now executes the slice's verification targets (`bun test`) and gates `done`
on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every
target passing. The evaluator no longer dispatches pi, so it cannot mutate the
sandbox at all. Removes the now-dead `extractJson` helper.

Completes the cook-harness-fidelity frontier (Slice 2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…stics

`bun test` writes its results (failure detail, pass/fail counts) to
stderr and only the version banner to stdout. The execSync failure path
returned `err.stdout`, so every failing tests-run / epic-verified
report carried just "bun test v1.3.14" with zero diagnostics (observed
on cook run 289c9843). Switch to spawnSync and concatenate both
streams; the arg-array form also drops shell interpolation of the
target path.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Slice 2 made the evaluator a deterministic oracle (it executes the
verification targets; `done` = all pass), which makes the test-writer the
load-bearing oracle: a test that passes without exercising the slice's
behavior now marks broken code DONE. Port ln-build discipline into the two
live writer prompts to close that gap:

- test-writer: orient first (read + match the repo's conventions); one
  observable behavior per test mapped to an acceptance criterion; test
  through the public interface; make the red meaningful; ban
  trivially-passing tests (a false oracle under the deterministic gate).
- code-writer: orient + inside-out build; no speculative abstraction;
  never weaken a test to go green; pre-release deletion posture.
- code-writer: drop the hardcoded "TypeScript" — derive the language from
  the repo (the harness is meant to be host-agnostic; the remaining
  `bun test` coupling in test-writer is the executor's, to be removed with
  the ProjectProfile/toolchain adapter).

Delete evaluator.md: dead since Slice 2 stopped dispatching pi for
evaluation (verified zero references repo-wide).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ter-prompt slice

The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under
cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open.
Mark it resolved with commit pointers, and extend the FE-813 status with
Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host
test-runner decoupling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
ln-sync of the cook-harness-fidelity work: records the read-only,
execution-derived completion oracle as Active Decision D161-K and Critical
Invariant I126-K (protected by pi-actions / engine-contract /
brownfield-smoke tests), and points the PLAN frontier status at them.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds SPEC A98 (brownfield integration-demanding verification without emitter host introspection) and the resolved 2026-06-04 spike note under the spec-to-cook-plan frontier: integration intent is latent in criterion prose, carried via kind:'integration-test' + prose into the runnable target; agent authors the app-reachability test at run time.

Co-authored-by: Claude <noreply@anthropic.com>
Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake.

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant