feat(web): semantic extraction on Firecrawl + dashboard semantic-vs-positional hit-rate #92
Merged
Extends the query-aware semantic extraction (that beat Hermes's positional truncate-and-store)
to ALL backends, and surfaces its real-world value in the dashboard.
- parse.ts: capMarkdown(md, query, cap) (PURE) trims scraped markdown with the SAME
selectRelevantPassages — semantic when a query is given, head+tail otherwise. mapFirecrawlResults
now threads the search query + an onExtract callback, and strips base64 image bombs. Backward
compatible (no query → head+tail, still ≤ cap).
- firecrawl.ts: passes the search query so scraped result bodies return the most RELEVANT passages,
and records each result's extract mode.
- extract-metrics.ts (PURE aggregator + best-effort append-only JSONL under ~/.qodex — always on,
not the opt-in telemetry DB): parseExtractMetrics → {semantic, headTail, truncated, semanticRate}.
web_fetch + firecrawl record their mode.
- dashboard: "Web extract — semantic vs positional" panel — semantic hits, head+tail fallback, and
the semantic hit-rate, so you can see agents pass a query and get the relevant part.
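For illustration, the base64 stripping mentioned in the parse.ts bullet could look roughly like the sketch below. The helper name stripBase64Images, the regex, and the alt-text placeholder are assumptions made for this example; the actual code in parse.ts is not shown in this text.

```typescript
// Hypothetical helper (not the code from parse.ts): drops inline data-URI
// images from scraped markdown so a single embedded image cannot eat the cap.
// Matches markdown images of the form ![alt](data:image/<type>;base64,<payload>)
const DATA_IMAGE = /!\[([^\]]*)\]\(\s*data:image\/[a-z0-9.+-]+;base64,[^)]*\)/gi;

function stripBase64Images(md: string): string {
  // Keep the alt text (if any) as a short placeholder; drop the payload entirely.
  return md.replace(DATA_IMAGE, (_match: string, alt: string) =>
    alt ? `[image: ${alt}]` : ""
  );
}
```

Images that point at a normal URL are left untouched; only data:image payloads are removed. Running this before the cap is applied is a design assumption here, so that the character budget is spent on text rather than on encoded bytes.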
Two follow-ups to the semantic web-extraction PR (#90): make it work across all backends, and measure its value.
1. Semantic selection on the Firecrawl backend
Firecrawl content-mode returned scraped markdown capped by a blind md.slice(0, 1500) head cut, the same positional weakness. Now:
- parse.ts gains capMarkdown(md, query, cap) (PURE) that trims with the same selectRelevantPassages: semantic when the search query is given, head+tail otherwise
- mapFirecrawlResults threads the search query + an onExtract callback, and strips base64 image bombs
So a Firecrawl content-mode search now returns each result trimmed to the passages most relevant to what you searched for.
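A minimal sketch of the capMarkdown idea described above, under stated assumptions. The real selectRelevantPassages is not shown in this PR text, so the term-overlap scoring here is only a stand-in; the return shape, the mode names ("whole", "semantic", "head-tail"), the separator, and the fallback to head+tail when no passage matches the query are all hypothetical.

```typescript
type ExtractMode = "whole" | "semantic" | "head-tail";

interface Capped { text: string; mode: ExtractMode; }
interface Passage { text: string; index: number; score: number; }

const SEP = "\n[...]\n"; // marks the cut in head+tail mode (assumed format)

function tokenize(s: string): string[] {
  return s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Positional fallback: keep the start and the end, drop the middle.
function headTail(md: string, cap: number): string {
  if (cap <= SEP.length) return md.slice(0, Math.max(cap, 0));
  const budget = cap - SEP.length;
  const head = Math.ceil(budget / 2);
  const tail = budget - head;
  return md.slice(0, head) + SEP + md.slice(md.length - tail);
}

// Stand-in for the PR's capMarkdown(md, query, cap); output never exceeds cap.
function capMarkdown(md: string, query: string | undefined, cap: number): Capped {
  if (md.length <= cap) return { text: md, mode: "whole" };

  const terms = tokenize(query ?? "");
  if (terms.length > 0) {
    // Score each blank-line-separated passage by how many of its tokens are query terms.
    const passages: Passage[] = md.split(/\n{2,}/).map((text, index) => ({
      text,
      index,
      score: tokenize(text).filter((t) => terms.indexOf(t) !== -1).length,
    }));
    const ranked = passages
      .filter((p) => p.score > 0)
      .sort((a, b) => b.score - a.score || a.index - b.index);

    // Greedily take the best passages that still fit in the budget.
    const picked: Passage[] = [];
    let used = 0;
    for (const p of ranked) {
      const cost = p.text.length + (picked.length > 0 ? 2 : 0); // 2 = "\n\n" joiner
      if (used + cost > cap) continue;
      picked.push(p);
      used += cost;
    }
    if (picked.length > 0) {
      picked.sort((a, b) => a.index - b.index); // restore document order
      return { text: picked.map((p) => p.text).join("\n\n"), mode: "semantic" };
    }
  }
  return { text: headTail(md, cap), mode: "head-tail" };
}
```

Returning the mode alongside the text is one way a caller such as mapFirecrawlResults could hand the result to an onExtract callback, though the actual callback signature is not given here.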
2. Dashboard hit-rate counter (show the real value)
- extract-metrics.ts: a PURE aggregator over a best-effort append-only JSONL under ~/.qodex (always on, deliberately not the opt-in telemetry DB)
- parseExtractMetrics → {semantic, headTail, truncated, semanticRate}
- web_fetch + firecrawl record each truncation's mode (whole pages ignored)
Tests
extract-metrics.test.ts (+6): aggregator counts/rate/empty; capMarkdown semantic vs head-tail; mapFirecrawlResults semantic + onExtract mode; whole-page ignored. Dashboard asserts the panel + rate render. Full suite 1488 green, tsc clean.
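For reference, the aggregator from section 2 could be sketched as below. Only the function name and the four result fields come from this PR; the JSONL record shape ({"mode": ...}), the mode strings, truncated being the sum of the two counters, and semanticRate being a 0..1 fraction are assumptions made for this example.

```typescript
interface ExtractMetrics {
  semantic: number;     // truncations that used query-aware selection
  headTail: number;     // truncations that fell back to head+tail
  truncated: number;    // assumed here to be semantic + headTail
  semanticRate: number; // assumed here to be semantic / truncated, 0 when empty
}

// Pure: takes the raw JSONL text, does no I/O.
function parseExtractMetrics(jsonl: string): ExtractMetrics {
  let semantic = 0;
  let headTail = 0;

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let mode: unknown;
    try {
      mode = (JSON.parse(line) as { mode?: unknown } | null)?.mode;
    } catch {
      continue; // best-effort log: skip malformed or partially written lines
    }
    if (mode === "semantic") semantic++;
    else if (mode === "head-tail") headTail++;
    // any other mode (e.g. a whole page that was never truncated) is ignored
  }

  const truncated = semantic + headTail;
  return {
    semantic,
    headTail,
    truncated,
    semanticRate: truncated === 0 ? 0 : semantic / truncated,
  };
}
```

Keeping the parser pure means the file read under ~/.qodex can stay a thin wrapper, and the counts, rate, and empty-log cases listed in the tests above can be exercised with plain strings.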