
feat(web): semantic extraction on Firecrawl + dashboard semantic-vs-positional hit-rate #92

Merged
QodeXcli merged 1 commit into main from feat/web-extract-firecrawl-metrics on Jul 1, 2026



@QodeXcli QodeXcli commented Jul 1, 2026


Two follow-ups to the semantic web-extraction PR (#90): make it work across all backends, and measure its value.

1. Semantic selection on the Firecrawl backend

Firecrawl content-mode returned scraped markdown capped by a blind md.slice(0, 1500) head cut — the same positional weakness. Now:

  • parse.ts gains capMarkdown(md, query, cap) (PURE) that trims with the same selectRelevantPassages — semantic when the search query is given, head+tail otherwise
  • mapFirecrawlResults threads the search query + an onExtract callback, and strips base64 image bombs
  • Backward compatible: no query → head+tail window, still ≤ cap (verified the existing ≤1500 assertion holds)

So a Firecrawl content-mode search now returns each result trimmed to the passages most relevant to what you searched for.
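
The trimming path described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the merged code: only the names capMarkdown, selectRelevantPassages, and onExtract come from this PR. The return shape, the mode strings, the keyword-overlap scoring, and the helpers headTail, stripBase64Images, and mapBody are all hypothetical stand-ins.

```typescript
// Sketch only: apart from the names capMarkdown, selectRelevantPassages and
// onExtract, every type, helper, and heuristic here is an assumption.

type ExtractMode = "whole" | "semantic" | "head-tail";

interface Extracted {
  text: string;
  mode: ExtractMode;
}

const GAP = "\n[...]\n";

// Positional fallback: keep the start and the end, drop the middle.
function headTail(text: string, cap: number): string {
  if (cap <= GAP.length) return text.slice(0, Math.max(0, cap));
  const half = Math.floor((cap - GAP.length) / 2);
  return text.slice(0, half) + GAP + text.slice(text.length - half);
}

// Toy stand-in for the real selector: ranks paragraphs by query-term overlap.
function selectRelevantPassages(text: string, query: string | undefined, cap: number): Extracted {
  if (text.length <= cap) return { text, mode: "whole" };

  const terms = (query ?? "").toLowerCase().split(/\W+/).filter((t) => t.length > 2);
  const ranked = text
    .split(/\n{2,}/)
    .map((body, index) => {
      const lower = body.toLowerCase();
      return { body, index, score: terms.filter((t) => lower.includes(t)).length };
    })
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score || a.index - b.index);

  // Greedily keep the best-scoring paragraphs that still fit under the cap.
  const kept: { body: string; index: number }[] = [];
  let used = 0;
  for (const p of ranked) {
    const cost = p.body.length + 2; // +2 for the "\n\n" joiner
    if (used + cost <= cap) {
      kept.push(p);
      used += cost;
    }
  }
  // No query, or nothing relevant fits: fall back to the positional window.
  if (kept.length === 0) return { text: headTail(text, cap), mode: "head-tail" };

  kept.sort((a, b) => a.index - b.index); // restore document order
  return { text: kept.map((p) => p.body).join("\n\n"), mode: "semantic" };
}

// Pure: same inputs always give the same trimmed output.
function capMarkdown(md: string, query: string | undefined, cap: number): Extracted {
  return selectRelevantPassages(md, query, cap);
}

// Drops inline base64 images, which can dwarf the actual page text.
function stripBase64Images(md: string): string {
  return md.replace(/!\[[^\]]*\]\(data:image\/[^)]*\)/g, "");
}

// Hypothetical per-result step: strip images, cap the body, report the mode.
function mapBody(
  md: string,
  query: string | undefined,
  cap: number,
  onExtract?: (mode: ExtractMode) => void,
): string {
  const { text, mode } = capMarkdown(stripBase64Images(md), query, cap);
  onExtract?.(mode);
  return text;
}
```

In this sketch a call such as mapBody(scraped, "rate limits", 1500, record) would return the matching paragraphs in document order, while the same call with no query degrades to the head+tail window and still stays within the cap.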

2. Dashboard hit-rate counter (show the real value)

  • extract-metrics.ts — a PURE aggregator over a best-effort append-only JSONL under ~/.qodex (always on, deliberately not the opt-in telemetry DB). parseExtractMetrics → {semantic, headTail, truncated, semanticRate}
  • web_fetch + firecrawl record each truncation's mode (whole pages ignored)
  • Dashboard panel "Web extract — semantic vs positional": semantic hits, head+tail fallback, and the semantic hit-rate — so you can see that agents pass a query and get the relevant middle instead of a blind window
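
A pure aggregator of this kind might be sketched as follows. Only the name parseExtractMetrics and the four result fields come from this PR. The record shape (one JSON object per line with a mode field), the mode strings, and the definitions truncated = semantic + headTail and semanticRate = semantic / truncated are assumptions made for illustration.

```typescript
// Sketch only: the JSONL record shape and the rate definition are assumptions.

interface ExtractMetrics {
  semantic: number;
  headTail: number;
  truncated: number;
  semanticRate: number;
}

// Pure: takes the raw JSONL text, does no file I/O of its own.
function parseExtractMetrics(jsonl: string): ExtractMetrics {
  let semantic = 0;
  let headTail = 0;

  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;

    let record: unknown;
    try {
      record = JSON.parse(trimmed);
    } catch {
      continue; // best-effort log: skip corrupt or partially written lines
    }
    if (typeof record !== "object" || record === null) continue;

    const mode = (record as { mode?: unknown }).mode;
    if (mode === "semantic") semantic += 1;
    else if (mode === "head-tail") headTail += 1;
    // Any other mode (for example a page returned whole) is not counted.
  }

  const truncated = semantic + headTail;
  const semanticRate = truncated === 0 ? 0 : semantic / truncated;
  return { semantic, headTail, truncated, semanticRate };
}
```

Keeping the parser free of file access means the dashboard can read the log however it likes and the tests can feed it plain strings, including empty and malformed input.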

Tests

extract-metrics.test.ts (+6): aggregator counts/rate/empty; capMarkdown semantic vs head-tail; mapFirecrawlResults semantic + onExtract mode; whole-page ignored. Dashboard asserts the panel + rate render. Full suite 1488 green, tsc clean.

…emantic-vs-positional hit-rate

Extends the query-aware semantic extraction (that beat Hermes's positional truncate-and-store)
to ALL backends, and surfaces its real-world value in the dashboard.

- parse.ts: capMarkdown(md, query, cap) (PURE) trims scraped markdown with the SAME
  selectRelevantPassages — semantic when a query is given, head+tail otherwise. mapFirecrawlResults
  now threads the search query + an onExtract callback, and strips base64 image bombs. Backward
  compatible (no query → head+tail, still ≤ cap).
- firecrawl.ts: passes the search query so scraped result bodies return the most RELEVANT passages,
  and records each result's extract mode.
- extract-metrics.ts (PURE aggregator + best-effort append-only JSONL under ~/.qodex — always on,
  not the opt-in telemetry DB): parseExtractMetrics → {semantic, headTail, truncated, semanticRate}.
  web_fetch + firecrawl record their mode.
- dashboard: "Web extract — semantic vs positional" panel — semantic hits, head+tail fallback, and
  the semantic hit-rate, so you can see agents pass a query and get the relevant part.
@QodeXcli QodeXcli merged commit 1dc686f into main Jul 1, 2026
2 checks passed
@QodeXcli QodeXcli deleted the feat/web-extract-firecrawl-metrics branch July 1, 2026 03:02