FE-813: Cook evaluator runs the verification tests instead of an LLM verdict by kostandinang · Pull Request #170 · hashintel/brunch

kostandinang · 2026-06-04T16:01:21Z

What

Makes brunch cook's per-slice done trustworthy. The evaluator was an LLM judging criterion prose with read,write,edit,bash — it could fix code mid-evaluation and rubber-stamp done (collapsing the write-tests → write-code → evaluate loop) and pass orphan output.

evaluate-done now runs the verification targets (BunTestRunner) and gates done on real pass/fail — requires ≥1 target and every target passing. It no longer calls pi.
Per-action tool scoping (toolsForAction): the evaluator can't mutate the sandbox; code-producing actions keep write tools.
Dropped the now-dead evaluator.md prompt; hardened the writer prompts.

Limit

done now requires real test execution, not integration. Targets are agent-authored against the emitter's synthesized paths, so a weak/isolated target can still pass — guaranteeing a wired feature needs the emitter to emit integration-demanding targets (the FE-800 integration-blind-verification follow-on).

Verify

pi-actions.test.ts + test-runner.test.ts + npm run check pass. Two full-suite failures (build-boundary.test.ts, app.test.ts) are a fe-800-base failure and a parallel-run flake — not this diff (see #167).

🤖 Generated with Claude Code

kostandinang · 2026-06-04T16:01:56Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

cursor · 2026-06-04T16:27:23Z

PR Summary

Medium Risk
Changes how cook runs decide slice success (orchestrator TDD path), but behavior is covered by new unit tests and existing contract tests; integration reachability still depends on future emitter/harness work, not this PR alone.

Overview
Cook harness fidelity (FE-813): Per-slice completion is no longer an LLM verdict. evaluate-done runs bun test on every slice verification target via evaluateVerificationTargets (≥1 target, all must pass) and does not invoke pi. toolsForAction limits evaluate-done to read-only tools so evaluation cannot fix code mid-loop; write actions keep full tools.

The Petri run-tests handler now executes all verification targets (not only the first). BunTestRunner uses spawnSync with argv (no shell) and merges stdout + stderr so failure reports stay useful. verify-epic reuses the same runBunTest path.

evaluator.md is removed; test-writer / code-writer prompts are tightened so tests act as the oracle. SPEC adds D161-K, I126-K, and assumption A98; PLAN documents cook-harness-fidelity and marks the brownfield evaluate-done regression resolved. build-boundary.test.ts timeout is raised to 120s to reduce parallel CI flakes.

^{Reviewed by Cursor Bugbot for commit a537038. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 85753f7. Configure here.}

`evaluate-done` ran with `read,write,edit,bash`, so the evaluator could fix code during evaluation and report `done` on the first call — collapsing the write-tests → write-code → evaluate TDD loop (2026-05-26 brownfield smoke). Add per-action tool scoping (`toolsForAction`): the evaluator is read-only; code-producing actions keep the full toolset. Threads `tools` through `runPi`. Opens the `cook-harness-fidelity` frontier (Slice 1) and records it in PLAN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

`evaluate-done` returned an LLM verdict over criterion prose — a soft oracle a standalone component or Ladle story could satisfy without the feature working. It now executes the slice's verification targets (`bun test`) and gates `done` on real pass/fail: `evaluateVerificationTargets` requires ≥1 target and every target passing. The evaluator no longer dispatches pi, so it cannot mutate the sandbox at all. Removes the now-dead `extractJson` helper. Completes the cook-harness-fidelity frontier (Slice 2). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…stics `bun test` writes its results (failure detail, pass/fail counts) to stderr and only the version banner to stdout. The execSync failure path returned `err.stdout`, so every failing tests-run / epic-verified report carried just "bun test v1.3.14" with zero diagnostics (observed on cook run 289c9843). Switch to spawnSync and concatenate both streams; the arg-array form also drops shell interpolation of the target path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Slice 2 made the evaluator a deterministic oracle (it executes the verification targets; `done` = all pass), which makes the test-writer the load-bearing oracle: a test that passes without exercising the slice's behavior now marks broken code DONE. Port ln-build discipline into the two live writer prompts to close that gap: - test-writer: orient first (read + match the repo's conventions); one observable behavior per test mapped to an acceptance criterion; test through the public interface; make the red meaningful; ban trivially-passing tests (a false oracle under the deterministic gate). - code-writer: orient + inside-out build; no speculative abstraction; never weaken a test to go green; pre-release deletion posture. - code-writer: drop the hardcoded "TypeScript" — derive the language from the repo (the harness is meant to be host-agnostic; the remaining `bun test` coupling in test-writer is the executor's, to be removed with the ProjectProfile/toolchain adapter). Delete evaluator.md: dead since Slice 2 stopped dispatching pi for evaluation (verified zero references repo-wide). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

…ter-prompt slice The "pi-actions evaluate-done collapses the TDD workflow" follow-on (under cook-codebase-mode) was closed by FE-813 Slices 1+2 but still listed as open. Mark it resolved with commit pointers, and extend the FE-813 status with Slice 3 (writer-prompt hardening, 9fb5af1) and the still-unowned bun→host test-runner decoupling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

ln-sync of the cook-harness-fidelity work: records the read-only, execution-derived completion oracle as Active Decision D161-K and Critical Invariant I126-K (protected by pi-actions / engine-contract / brownfield-smoke tests), and points the PLAN frontier status at them. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Adds SPEC A98 (brownfield integration-demanding verification without emitter host introspection) and the resolved 2026-06-04 spike note under the spec-to-cook-plan frontier: integration intent is latent in criterion prose, carried via kind:'integration-test' + prose into the runnable target; agent authors the app-reachability test at run time. Co-authored-by: Claude <noreply@anthropic.com>

Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake. Co-authored-by: Claude <noreply@anthropic.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

kostandinang mentioned this pull request Jun 4, 2026

FE-812: Completed-spec fixture for the plan emitter #168

Open

kostandinang changed the title ~~FE-813: Scope the cook evaluator to read-only tools~~ FE-813: Cook harness fidelity — trustworthy per-slice completion signal Jun 4, 2026

kostandinang marked this pull request as ready for review June 4, 2026 16:27

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread src/orchestrator/src/pi-actions.ts

Comment thread src/orchestrator/src/pi-actions.ts Outdated

kostandinang changed the title ~~FE-813: Cook harness fidelity — trustworthy per-slice completion signal~~ FE-813: Cook evaluator runs the verification tests instead of an LLM verdict Jun 4, 2026

kostandinang self-assigned this Jun 4, 2026

kostandinang force-pushed the ka/fe-813-cook-harness-fidelity branch from d064cdd to 1cbed39 Compare June 5, 2026 07:50

kostandinang and others added 9 commits June 5, 2026 09:50

FE-813: Raise build-boundary minify test timeout to 120s

33f0864

Two real vite builds (~50s) run inside the default-parallel vitest pool; 120s gives headroom so parallel contention doesn't surface as a timeout flake. Co-authored-by: Claude <noreply@anthropic.com>

FE-813: Run all verification targets safely

a537038

Co-authored-by: Cursor <cursoragent@cursor.com>

kostandinang force-pushed the ka/fe-813-cook-harness-fidelity branch from 1cbed39 to a537038 Compare June 5, 2026 13:14

This was referenced Jun 5, 2026

FE-815: Orchestrator demo fixtures: parallel, diamond, halt #174

Draft

FE-819: Orchestrator <> Petrinaut integration improvements #176

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170

FE-813: Cook evaluator runs the verification tests instead of an LLM verdict#170
kostandinang wants to merge 9 commits into
ka/fe-800-spec-to-cook-planfrom
ka/fe-813-cook-harness-fidelity

kostandinang commented Jun 4, 2026 •

edited

Loading

Uh oh!

kostandinang commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kostandinang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Limit

Verify

Uh oh!

kostandinang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kostandinang commented Jun 4, 2026 •

edited

Loading

kostandinang commented Jun 4, 2026 •

edited

Loading

cursor Bot commented Jun 4, 2026 •

edited

Loading