feat(web): semantic extraction on Firecrawl + dashboard semantic-vs-positional hit-rate by QodeXcli · Pull Request #92 · QodeXcli/QodeX

QodeXcli · 2026-07-01T03:02:17Z

Two follow-ups to the semantic web-extraction PR (#90): make it work across all backends, and measure its value.

1. Semantic selection on the Firecrawl backend

Firecrawl content-mode returned scraped markdown capped by a blind md.slice(0, 1500) head cut, the same positional weakness that #90 addressed. Now:

- parse.ts gains capMarkdown(md, query, cap) (PURE), which trims with the same selectRelevantPassages: semantic when the search query is given, head+tail otherwise
- mapFirecrawlResults threads the search query and an onExtract callback through, and strips base64 image bombs
- Backward compatible: no query → head+tail window, still ≤ cap (verified that the existing ≤1500 assertion holds)
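The capping behavior in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in parse.ts: scorePassage is a naive keyword-overlap stand-in for selectRelevantPassages (whose real signature and scoring are not shown in this PR), and capMarkdownSketch, headTail, GAP, and the returned mode field are hypothetical names.

```typescript
// Hypothetical sketch: NOT the actual parse.ts implementation.
type ExtractMode = "whole" | "semantic" | "head-tail";

interface Capped {
  text: string;
  mode: ExtractMode;
}

interface Passage {
  text: string;
  index: number;
  score: number;
}

// Marker placed between the head and tail windows (assumed, not from the PR).
const GAP = "\n…\n";

// Positional fallback: keep the start and the end, never exceeding `cap`.
function headTail(md: string, cap: number): string {
  if (cap <= GAP.length) return md.slice(0, Math.max(0, cap));
  const budget = cap - GAP.length;
  const head = Math.ceil(budget / 2);
  const tail = budget - head;
  return md.slice(0, head) + GAP + md.slice(md.length - tail);
}

// Naive stand-in for the real relevance scoring: count query terms present.
function scorePassage(passage: string, terms: string[]): number {
  const lower = passage.toLowerCase();
  return terms.reduce((n, t) => n + (lower.includes(t) ? 1 : 0), 0);
}

function capMarkdownSketch(md: string, query: string | undefined, cap: number): Capped {
  if (md.length <= cap) return { text: md, mode: "whole" };

  const terms = (query ?? "").toLowerCase().split(/\s+/).filter((t) => t.length > 2);
  if (terms.length === 0) return { text: headTail(md, cap), mode: "head-tail" };

  // Split on blank lines, rank passages by score, then fill the budget greedily.
  const ranked: Passage[] = md
    .split(/\n{2,}/)
    .map((text, index) => ({ text, index, score: scorePassage(text, terms) }))
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score || a.index - b.index);

  const chosen: Passage[] = [];
  let used = 0;
  for (const p of ranked) {
    const cost = p.text.length + (chosen.length > 0 ? 2 : 0); // 2 = "\n\n" joiner
    if (used + cost > cap) continue;
    chosen.push(p);
    used += cost;
  }
  if (chosen.length === 0) return { text: headTail(md, cap), mode: "head-tail" };

  // Restore document order so the excerpt still reads top to bottom.
  chosen.sort((a, b) => a.index - b.index);
  return { text: chosen.map((p) => p.text).join("\n\n"), mode: "semantic" };
}
```

In this sketch the result never exceeds cap in either mode, which is the property the existing ≤1500 assertion checks.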

So a Firecrawl content-mode search now returns each result trimmed to the passages most relevant to what you searched for.
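To make the per-result flow concrete, here is a hedged sketch of a mapping step that strips inline base64 images, trims each body, and reports the extract mode through a callback. All names here (mapResultsSketch, stripBase64Images, RawResult, MappedResult, TrimMode, Trim) and the result shapes are assumptions for illustration; the real mapFirecrawlResults signature and the Firecrawl response shape are not shown in this PR. The trimming function is injected so that the sketch stays self-contained.

```typescript
// Hypothetical sketch: NOT the actual mapFirecrawlResults implementation.
type TrimMode = "whole" | "semantic" | "head-tail";

// Any trimming function with this shape can be plugged in (assumed shape).
type Trim = (md: string, query: string | undefined, cap: number) => { text: string; mode: TrimMode };

interface RawResult {
  url: string;
  title?: string;
  markdown?: string;
}

interface MappedResult {
  url: string;
  title: string;
  content: string;
}

interface MapOptions {
  query?: string;
  cap: number;
  onExtract?: (mode: TrimMode) => void;
}

// Matches markdown images whose target is an inline base64 data URI.
const BASE64_IMAGE = /!\[[^\]]*\]\(data:image\/[^)]*\)/g;

function stripBase64Images(md: string): string {
  return md.replace(BASE64_IMAGE, "");
}

function mapResultsSketch(results: RawResult[], trim: Trim, opts: MapOptions): MappedResult[] {
  return results.map((r) => {
    // Strip image payloads first so they cannot consume the character budget.
    const cleaned = stripBase64Images(r.markdown ?? "");
    const { text, mode } = trim(cleaned, opts.query, opts.cap);
    opts.onExtract?.(mode); // lets the caller record which mode was used
    return { url: r.url, title: r.title ?? "", content: text };
  });
}
```

Stripping before trimming is a design choice in this sketch: a large data URI would otherwise crowd out relevant text or be cut mid-payload.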

2. Dashboard hit-rate counter (show the real value)

- extract-metrics.ts: a PURE aggregator over a best-effort, append-only JSONL file under ~/.qodex (always on, deliberately not the opt-in telemetry DB). parseExtractMetrics → {semantic, headTail, truncated, semanticRate}
- web_fetch and firecrawl record the mode of each truncation (whole pages are ignored)
- Dashboard panel "Web extract — semantic vs positional": semantic hits, head+tail fallbacks, and the semantic hit-rate, so you can see how often agents pass a query and get the relevant middle instead of a blind window
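The aggregation described above can be sketched as a pure function over the JSONL text. This is a minimal illustration, not the code in extract-metrics.ts: the record shape ({"mode": ...}), the mode strings, the zero rate for an empty log, and the name parseExtractMetricsSketch are assumptions; only the output fields come from the description.

```typescript
// Hypothetical sketch: NOT the actual extract-metrics.ts implementation.
interface ExtractMetrics {
  semantic: number;
  headTail: number;
  truncated: number;
  semanticRate: number;
}

// Pure: takes the raw JSONL text, does no I/O, and tolerates malformed lines.
function parseExtractMetricsSketch(jsonl: string): ExtractMetrics {
  let semantic = 0;
  let headTail = 0;

  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;

    let mode: unknown;
    try {
      mode = (JSON.parse(trimmed) as { mode?: unknown }).mode;
    } catch {
      continue; // best-effort log: skip lines that are not usable JSON records
    }

    if (mode === "semantic") semantic++;
    else if (mode === "head-tail") headTail++;
    // Any other mode (for example a whole, untruncated page) is ignored.
  }

  const truncated = semantic + headTail;
  const semanticRate = truncated === 0 ? 0 : semantic / truncated;
  return { semantic, headTail, truncated, semanticRate };
}
```

Keeping the aggregator free of file access means the dashboard can read the file however it likes and the counting logic can be tested with plain strings.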

Tests

extract-metrics.test.ts (+6): aggregator counts/rate/empty; capMarkdown semantic vs head-tail; mapFirecrawlResults semantic + onExtract mode; whole-page ignored. The dashboard test asserts that the panel and the rate render. Full suite: 1488 green, tsc clean.

…emantic-vs-positional hit-rate

Extends the query-aware semantic extraction (that beat Hermes's positional truncate-and-store) to ALL backends, and surfaces its real-world value in the dashboard.

- parse.ts: capMarkdown(md, query, cap) (PURE) trims scraped markdown with the SAME selectRelevantPassages: semantic when a query is given, head+tail otherwise. mapFirecrawlResults now threads the search query + an onExtract callback, and strips base64 image bombs. Backward compatible (no query → head+tail, still ≤ cap).
- firecrawl.ts: passes the search query so scraped result bodies return the most RELEVANT passages, and records each result's extract mode.
- extract-metrics.ts (PURE aggregator + best-effort append-only JSONL under ~/.qodex; always on, not the opt-in telemetry DB): parseExtractMetrics → {semantic, headTail, truncated, semanticRate}. web_fetch + firecrawl record their mode.
- dashboard: "Web extract — semantic vs positional" panel with semantic hits, head+tail fallback, and the semantic hit-rate, so you can see agents pass a query and get the relevant part.

QodeXcli merged commit 1dc686f into main Jul 1, 2026
2 checks passed

QodeXcli deleted the feat/web-extract-firecrawl-metrics branch July 1, 2026 03:02

QodeXcli merged 1 commit into main from feat/web-extract-firecrawl-metrics
