2 changes: 1 addition & 1 deletion docs/backend/admin-recording-data.md
@@ -31,7 +31,7 @@ token, and addresses the data with `recording_id` (below).
A **task-scoped agent** (authenticated by its own task id) may additionally read **by `recording_id`,
but only the recording its own task analyzes** (`taskScope.recordingId`). This lets a polish-script
run fetch the originating task's rrweb + injected events without the admin token — the polish setup
-harness (`scripts/polish/run-recorded.ts`) does exactly this.
+harness (`scripts/run-recorded-mcp.ts`) does exactly this.

- Not authenticated → `401`.
- Not an admin, and not requesting your own task's `recording_id` → `403 { "error": "Admin access required" }`.
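The access rule in this hunk (admin, or a task-scoped agent reading only its own task's recording) can be sketched as a pure decision function. This is a minimal sketch, not the endpoint's actual code: the `Caller` and `AccessDecision` shapes and the name `authorizeRecordingRead` are hypothetical, and only `taskScope.recordingId`, the status codes, and the error message come from the doc.

```typescript
// Hypothetical caller shape; only `taskScope.recordingId` is named in the doc.
interface Caller {
  authenticated: boolean;
  isAdmin: boolean;
  // Present only for a task-scoped agent (authenticated by its own task id).
  taskScope?: { recordingId: string };
}

// Hypothetical result shape mirroring the documented responses.
interface AccessDecision {
  status: 200 | 401 | 403;
  body?: { error: string };
}

function authorizeRecordingRead(caller: Caller, recordingId: string): AccessDecision {
  // Not authenticated -> 401.
  if (!caller.authenticated) return { status: 401 };
  // Admins may read any recording.
  if (caller.isAdmin) return { status: 200 };
  // A task-scoped agent may read only the recording its own task analyzes.
  if (caller.taskScope?.recordingId === recordingId) return { status: 200 };
  // Everyone else -> 403 with the documented error body.
  return { status: 403, body: { error: "Admin access required" } };
}
```

Keeping the check as a pure function makes the three outcomes easy to unit-test without standing up the endpoint.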
10 changes: 8 additions & 2 deletions docs/backend/admin-run-journey-eval.md
@@ -80,8 +80,9 @@ The work runs asynchronously in a container. Watch progress at `task_url`, or wa

1. **Run the steps deterministically.** `journey-run-script` executes `journey_steps` as the task's
fast-mode setup — recording itself and, when a failure handler judges a failure to be a real defect,
-filing at most one bug (through the same judged `file-bug` webhook). Guidance is resolved
-`--resolve latest` (staged).
+reporting `bug_seen` + a `bug_data` blob into the run's `setup_output` (fast-mode journeys no longer
+file bugs inline; a second-stage `file-journey-bug` task files the bug from the recording — see
+[`warm-recordings.md`](./warm-recordings.md)). Guidance is resolved `--resolve latest` (staged).
2. **Review + judge.** The review agent (guidance entry **`handle-journey-eval`**, created on demand if
missing) reads the run's filed bug and decides — fuzzy, natural-language — whether it matches
`expected_bug`. This is the one behavioral difference from `admin-run-polish-script`, which compares
@@ -104,6 +105,11 @@ When the run reaches a terminal state, `callback_url` is POSTed once (`Content-T
application/json`). **Unauthenticated** — secured only by the unguessable `runId` in the path.
Best-effort, exactly-once (edge-triggered via `eval_callback_fired`); not retried.

+> **Second-task ordering.** When the run saw a bug, a `file-journey-bug` task is scheduled to file it
+> from the recording. `container-events` DEFERS `finalizeJourneyEvalRun` + the callback until BOTH the
+> review task and the file-journey-bug task are terminal — so the bug is filed (and the review agent's
+> verdict recorded) before `bug_filed` is reported. See [`warm-recordings.md`](./warm-recordings.md) §6.
+
```jsonc
{
"event": "journey.eval_run.finished",
4 changes: 2 additions & 2 deletions docs/backend/admin-run-polish-script.md
@@ -5,7 +5,7 @@ for a single **signal** — the programmatic counterpart to the polish-script ev
signal's expected `results` are captured for a recording).

It kicks off one container task that runs **only** the polish setup script
-(`scripts/polish/run-recorded.ts`) against the recording and then stops — **no agent bug-filing
+(`scripts/run-recorded-mcp.ts`) against the recording and then stops — **no agent bug-filing
afterwards**. The task is pooled in the **guidance-update project / containers** (`proj-guidance`),
since the script analyzes an existing Replay recording by id and never drives the app. When the run
finishes, a webhook you supply is POSTed with the results.
@@ -174,7 +174,7 @@ bugs**):
(The harness only sanity-checks that an `eval_comparison` was produced; if a script forgot to call
`attachEvalComparison`, the harness records that as an `eval-comparison` setup error so it surfaces
instead of silently passing. The eval comparison is **no longer** computed harness-side — older
-builds did it in `run-recorded.ts` and never wrote it back to the output file, so the agent saw no
+builds did it in `run-recorded-mcp.ts` and never wrote it back to the output file, so the agent saw no
`eval_comparison` and took no action.)
3. The container then runs the agent with the **`handle-polish-script-eval`** guidance entry. The agent
reads `eval_comparison` (and looks for the logged `EVAL FAILED` / `eval-comparison` setup error); if
2 changes: 1 addition & 1 deletion docs/backend/journeys.md
@@ -60,7 +60,7 @@ If `step_count` is 0, the journey is unstepped.

1. `container-task-webhook.ts` checks if the journey is unstepped (latest version has zero actions)
2. If unstepped: builds prompt via `buildUnsteppedQAPrompt()` — agent uses Playwright MCP to execute the journey description. On success it saves the recording and exits; the journey stays unstepped. On failure it files bugs.
-3. If stepped: builds prompt via `buildQAPrompt()` — the journey setup harness (`scripts/journey/run-recorded.ts`) runs the journey runner, which executes the actions programmatically, and records the result into the task info
+3. If stepped: the browser-driver setup harness (`scripts/run-recorded-browse.ts --mode journey`) runs the journey runner, which executes the actions programmatically and records the result into the task info. In normal mode the container agent (`buildQAPrompt()`) then triages the result and files bugs. In **FAST MODE** the runner is setup-only (no agent) and does NOT file bugs — it reports `bug_seen` + a `bug_data` blob into `setup_output`, and a separate `file-journey-bug` task files the bug from the recording. See [`warm-recordings.md`](./warm-recordings.md).

## Journey origin

4 changes: 2 additions & 2 deletions docs/backend/mcp-error-reporting.md
@@ -47,7 +47,7 @@ client owns reporting for setup-script calls; the proxy owns reporting for agent
## Setup-script path (the common mechanism)

Every polish setup script lives as a guidance entry (`polish-script-<type>`) and drives Replay MCP
-**only** through `scripts/polish/replay-client.ts`. `run-recorded.ts` (the in-container runner)
+**only** through `scripts/polish/replay-client.ts`. `run-recorded-mcp.ts` (the in-container runner)
bundles the client with the script and execs it with the task context in env:

- `POLISH_ADMIN_TOKEN` — the task-scoped credential, which **is** the task id (`LOOPQA_TASK_ID`).
@@ -90,7 +90,7 @@ polish_pass-sourced guidance update; it no longer auto-creates the update) and
## Files

- `scripts/polish/replay-client.ts` — shared setup-script client; self-reports `callTool` failures.
-- `scripts/polish/run-recorded.ts` — in-container runner; injects `POLISH_ADMIN_TOKEN` / `POLISH_SITE_URL`.
+- `scripts/run-recorded-mcp.ts` — in-container runner; injects `POLISH_ADMIN_TOKEN` / `POLISH_SITE_URL`.
- `netlify/functions/mcp-error-webhook.ts` — receives `mcp.error`, writes `task_mcp_errors`.
- `netlify/functions/lib/task-mcp-errors.ts` — `task_mcp_errors` table interface + `kind` vocabulary.
- `netlify/functions/container-task-webhook.ts` — hands each task its `mcpErrorWebhook` URL (agent path).
4 changes: 4 additions & 0 deletions docs/backend/run-recorded-browse.md
@@ -1,5 +1,9 @@
# `run-recorded-browse.ts` — Fast-Mode Exploration harness architecture

+> Sibling harness: [`run-recorded-mcp.ts`](./warm-recordings.md) drives an existing *recording* via the
+> Replay MCP (no browser) for polish + `file-journey-bug` tasks. Both share `scripts/lib/setup-harness.ts`.
+
+
This documents how the **exploration ("browse") setup harness** works end to end, what each moving
part does, and — importantly — the **current open problem** with the driver-script recording, with the
evidence I actually have (so the parts I'm still unsure about are called out explicitly rather than
39 changes: 35 additions & 4 deletions docs/backend/tasks-and-containers.md
@@ -251,7 +251,7 @@ replenishes the pool.

## Polish setup script constraints

-Each polish pass runs a **setup script** before the agent: `scripts/polish/run-recorded.ts` fetches
+Each polish pass runs a **setup script** before the agent: `scripts/run-recorded-mcp.ts` fetches
the per-pass-type TypeScript, bundles it (esbuild), and runs it under `replay-node` to collect
diagnostic JSON from the recording via Replay MCP. Authoring one of these scripts has hard
constraints that are easy to violate:
@@ -299,15 +299,15 @@ constraints that are easy to violate:

- **Requirement: each setup script gets a 20-minute run window, and budgets must be ordered
`TIME_BUDGET_MS < hard-exit backstop < runner deadline`.** The runner hard-kills the bundled script
-  via `execSync` after `SCRIPT_EXEC_TIMEOUT_MS` = **20 min** (`run-recorded.ts`; other commands it
+  via `execSync` after `SCRIPT_EXEC_TIMEOUT_MS` = **20 min** (`run-recorded-mcp.ts`; other commands it
runs keep the 5-min default). A script's internal `TIME_BUDGET_MS`/`HARD_EXIT_MS` (and, in
`ui-details`, `BACKSTOP_MS`/`WATCHDOG_MS`) must sit *just under* 20 min — the convention is
`TIME_BUDGET_MS = 19 min < HARD_EXIT_MS = 19.5 min < 20-min kill` — so the script force-emits
parseable partial output before the SIGKILL. A process still alive at the runner deadline dies as
`spawnSync /bin/sh ETIMEDOUT` with **no output at all**. The 15-min `RecordingOverview` ceiling
lives inside this window. If you change the window, change both ends together and keep the ordering.

-- **stdout ≤ 1 MiB.** `run-recorded.ts` captures the script's stdout with Node's default `execSync`
+- **stdout ≤ 1 MiB.** `run-recorded-mcp.ts` captures the script's stdout with Node's default `execSync`
`maxBuffer` (1 MiB); overflowing it kills the process mid-print with `ENOBUFS` and no parseable
output. Scripts must bound their JSON (cap candidates, collapse repeated findings).

@@ -335,16 +335,47 @@ can also be set explicitly on the create request (UI, `POST /api/projects`, or `
report its own result), so no agent runs afterward.

- **Polish → the setup script files its own bugs, no agent.** The regular polish-pass branch gets a
-  no-op prompt, and the setup command sets `FAST_MODE=1`. `scripts/polish/run-recorded.ts` forwards
+  no-op prompt, and the setup command sets `FAST_MODE=1`. `scripts/run-recorded-mcp.ts` forwards
that (plus `POLISH_LLM_URL`, the run-scoped `call-llm` endpoint) into the setup script's exec env.
The `polish-script-<passType>` entry, on seeing `FAST_MODE`, calls the method exported by
**`polish-file-bug-script`**, which reviews its just-computed ("pass 2") results with an LLM via
**`polish-review-results-script`** and files each resulting bug through the `fileBug` helper from
**`file-bug-script`** — the same submit + poll bug path the polish agent prompt uses. `script_only`
(admin eval) and `ui-details` passes are unaffected.

+- **Journey → the runner reports a bug SIGNAL, a second task files it.** In fast mode a stepped
+  journey's runner (`journey-run-script`) no longer files bugs. When it sees one it reports
+  `bug_seen: true` plus a `bug_data` blob (a description of what it encountered + browser history) in
+  its result, saved to the run's `setup_output` via `save-setup-result`. `saveTestRunSetupResult` then
+  schedules ONE **file-journey-bug** task (`lib/file-journey-bug.ts`, discriminated by the
+  `File journey bug:` goal prefix, carrying the run's `test_run_id` + the browser `recording_id`),
+  linked from the run's task page (`test_run.file_journey_bug_task_id`). That task runs
+  `scripts/run-recorded-mcp.ts --mode file-journey-bug` — the recording-driver harness with NO browser
+  — which materializes the **`file-journey-bug-script`** entry, hands it the bug blob (fetched via the
+  `get-journey-bug-data` task-webhook action) + the recording, and files the bug against the run
+  (`fileBug` → `/api/test-runs/<run>`). The journey run's finalization (and, for a **journey eval**,
+  its result callback) is DEFERRED in `container-events` until the file-journey-bug task completes, so
+  the bug exists before the run is finalized / the eval verdict is reported.
+
+## Warm recordings
+
+> Full write-up: [`warm-recordings.md`](./warm-recordings.md) — warming, `run-recorded-mcp`, the
+> `file-journey-bug` task, and journey-eval second-task ordering. Summary below.
+
+A task that OPERATES ON an existing recording (a polish pass task, or a file-journey-bug task) carries
+that recording in `tasks.recording_id`. When such a task is scheduled we fire
+`warm-recording-background` (`lib/warm-recording.ts`), which drives `RecordingOverview` (a cold
+recording's first overview can take minutes while Replay indexes it) and waits up to **20 min**. On
+success it stamps `tasks.recording_warmed_at` on every queued task for the recording; on timeout/error
+it fails those tasks (marked `no_retry`, so the retry helpers leave them alone) and records a Replay
+MCP error against them so the failure shows in admin activity. `claimNextTask` orders by warm
+readiness FIRST — warmed-recording tasks, then no-recording tasks, then not-yet-warmed-recording tasks
+— so a container never waits on a cold recording while a warmed one is queued.
+
Like every other setup script, the four new entries (`exploration-run-script`,
`polish-file-bug-script`, `polish-review-results-script`, `file-bug-script`) live ONLY as guidance
entries — the repo holds stubs (`scripts/seed-guidance.ts`) with the real TypeScript authored in the
prod DB via the guidance API. Making fast-mode polish actually file bugs also requires editing the
`polish-script-<passType>` entries to check `FAST_MODE` and delegate to `polish-file-bug-script`.
+Fast-mode journeys additionally need `journey-run-script` edited to report `bug_seen`/`bug_data`
+instead of filing, and the new **`file-journey-bug-script`** entry authored to file from the recording.
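The claim ordering described under "Warm recordings" in this hunk (warmed-recording tasks, then no-recording tasks, then not-yet-warmed-recording tasks) can be illustrated with a small ranking function. This is a sketch under assumptions, not `claimNextTask` itself: the `QueuedTask` shape, the `created_at` tie-breaker, and the names `warmRank` / `pickNextTask` are hypothetical; only `recording_id` and `recording_warmed_at` come from the doc.

```typescript
// Hypothetical in-memory view of a queued task row.
interface QueuedTask {
  id: string;
  recording_id: string | null;       // set when the task operates on an existing recording
  recording_warmed_at: string | null; // stamped once the recording has been warmed
  created_at: number;                 // ASSUMPTION: used here only as a tie-breaker
}

// Lower rank is claimed first: warmed (0), no recording (1), not yet warmed (2).
function warmRank(task: QueuedTask): number {
  if (task.recording_id === null) return 1;
  return task.recording_warmed_at !== null ? 0 : 2;
}

// Pick the next task by warm readiness first, then by age (assumed tie-breaker).
function pickNextTask(queue: QueuedTask[]): QueuedTask | undefined {
  const sorted = queue.slice().sort(
    (a, b) => warmRank(a) - warmRank(b) || a.created_at - b.created_at,
  );
  return sorted[0];
}
```

With this ranking a cold-recording task is only claimed when nothing warmed and nothing recording-free is queued, which matches the stated goal that a container never waits on a cold recording while a warmed one is available.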
3 changes: 2 additions & 1 deletion docs/backend/test-runs.md
@@ -25,7 +25,7 @@
### Reads

- `getTestRun(id)` — fetch a single test run with project name.
-- `getTestRunWithBugs(id)` — fetch a test run with project name and associated bugs.
+- `getTestRunWithBugs(id)` — fetch a test run with project name and associated bugs. Also returns `file_journey_bug_task_id` — the second-stage task that files a FAST MODE journey's bug (if any), so the run's task page can link to it (see [`warm-recordings.md`](./warm-recordings.md)).
- `listTestRuns(filters?)` — paginated list of test runs, optionally filtered by project.
- `listRecentRunsForProject(opts)` — lightweight timing rows for a project's recent runs (newest first, default 20, max 50), optionally excluding one run and flagging each row with `overlaps_this_run` relative to a given `[overlapsAfter, overlapsBefore]` window (no `overlapsBefore` = window still open). Lets a running task — and the bug-submission judge — see what other journeys executed concurrently, since parallel journeys can mutate shared app state.

@@ -37,6 +37,7 @@
- `infraFailTestRun(id, data?)` — mark an in-progress run as `infra-failed` (transient infra error, retryable).
- `incompleteTestRun(id, data?)` — mark an in-progress run as `incomplete` (journey couldn't be completed, no error).
- `updateTestRunProgress(id, data)` — update bugs count or recording ID while a run is in progress.
+- `saveTestRunSetupResult(id, data)` — store a setup harness's `setup_output` + recordings on the run (the `save-setup-result` action). When the output carries a FAST MODE journey's `bug_seen: true` signal, it also schedules a `file-journey-bug` task to file the bug from the recording (see [`warm-recordings.md`](./warm-recordings.md) §5).
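The scheduling side effect added to `saveTestRunSetupResult` can be pictured as a small planning step. The sketch below is an assumption-heavy illustration, not the function's real body: `SetupOutput`, `FileJourneyBugRequest`, `planFileJourneyBugTask`, and the text after the goal prefix are hypothetical. Only `bug_seen`, `bug_data`, `test_run_id`, `recording_id`, `file_journey_bug_task_id`, and the `File journey bug:` prefix appear in the docs touched by this change.

```typescript
// Hypothetical slice of a run's setup_output; only bug_seen / bug_data are documented.
interface SetupOutput {
  bug_seen?: boolean;
  bug_data?: unknown;
}

// Hypothetical description of the task to schedule.
interface FileJourneyBugRequest {
  goal: string;
  test_run_id: string;
  recording_id: string;
}

// Decide whether a file-journey-bug task should be scheduled for this run.
function planFileJourneyBugTask(
  testRunId: string,
  recordingId: string | null,
  output: SetupOutput,
  existingTaskId: string | null, // ASSUMPTION: test_run.file_journey_bug_task_id, if already set
): FileJourneyBugRequest | null {
  if (output.bug_seen !== true) return null; // no bug signal, nothing to file
  if (recordingId === null) return null;     // the second task files from the recording
  if (existingTaskId !== null) return null;  // at most ONE file-journey-bug task per run
  return {
    // The prefix is documented; what follows it is illustrative.
    goal: `File journey bug: ${testRunId}`,
    test_run_id: testRunId,
    recording_id: recordingId,
  };
}
```

Returning `null` for an already-linked run is one simple way to model the "schedules ONE task" wording; the real code may enforce that differently.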

## Callers
