Exact-ts log lookup, error-log badge propagation, and Prometheus scraping#451
Exact-ts log lookup, error-log badge propagation, and Prometheus scraping#451tonyalaribe wants to merge 24 commits into
Conversation
…ndow
Single-row item detail lookups (web log-item detail + the CLI/API
/events/{id}/time/{ts} path, both via lookupOtelRecord) matched
`timestamp BETWEEN ts-1s AND ts+1s AND id = ?`. The caller always holds
the exact stored timestamp (it comes from a prior query result), so match
`timestamp = ts` instead: both PG and TF store microsecond precision,
which round-trips losslessly through the UTCTime↔Hasql encoder, and exact
equality lets the planner hit the (timestamp, id) ordering rather than
scanning a window. Regression test asserts an off-by-1s timestamp no
longer resolves the row on either the TF or PG read path.
…missing-span parents The trace waterfall lights a parent row's [+N] badge red via childErrors, but error LOGS never contributed: the worker hardcoded hasErrors=false for kind=log, and the trace-view `errors` column only reflected structured error payloads (which severity-only error logs lack). So an error log — or an orphan error log nested under a synthetic "missing span" parent — left its parent's badge un-reddened. - log-worker-functions: derive a log's error-ness from the errors flag and severity (text or OTel number >=17); spans unchanged. Now error logs and their (possibly synthetic) parents propagate childErrors. - defaultSelectSqlQuery: fold log severity into the `errors` projection (mirrors LogQueries.is_error) so the signal actually reaches the client. Vitest covers direct + synthetic-parent propagation and the info-log non-regression; an integration test asserts the errors column is set for ERROR-severity logs.
Adds the ability to register Prometheus scrape targets per project and poll them on a per-target interval, parsing the text exposition format and ingesting the samples as metrics (queryable via the existing metrics store). - Pkg.Prometheus: exposition-format (v0.0.4) parser — names, labels (with escapes / commas in values), values (incl. +Inf/-Inf/NaN), optional ms timestamps, and # TYPE/# HELP metadata. Doctested. - Models.Apis.PrometheusScrapeConfigs: model + CRUD, due-target selection (enabled AND interval elapsed), and ingestScrapedBody — parse → convert to MetricRecord (counter→monotonic sum, else gauge; config name as service.name; static + sample labels as attributes; non-finite dropped) → bulkInsertMetrics. - BackgroundJobs.PrometheusScrapeTick: per-minute ticker (seeded by the daily scheduler) that GETs each due target (optional auth header), ingests it, and records last_scraped_at + last_status; failures are isolated. - Settings: per-project Prometheus settings page (add / enable-disable / delete targets) with typed handlers + routes + nav entry. - migration 0101: apis.prometheus_scrape_configs. Integration test (PrometheusSpec) covers ingest (incl. non-finite drop) and the due/scraped scheduling.
…de scaling The per-minute tick scraped every due target sequentially in one job on one pod — at hundreds/thousands of targets the tick wall-time exceeds the interval, one hung endpoint stalls the batch, and two pods running it would double-scrape. Rework into a distributed-queue model (leans on odd-jobs, which already gives a cross-pod worker pool + retries): - claimDueConfigs: atomic `UPDATE ... WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED LIMIT n) RETURNING ...`. Safe to run on every node — each due row is leased to exactly one caller, and the same statement bumps last_scraped_at so the next tick (any node) won't re-pick it before its interval. Oldest-first so a backlog drains fairly. - PrometheusScrapeTick is now a cheap dispatcher: claim + fan out one PrometheusScrapeOne job per target. Workers across all pods drain those, so throughput scales horizontally with the fleet instead of a single loop. - PrometheusScrapeOne worker: hard per-scrape HTTP response timeout (10s) via a manager-level managerResponseTimeout, so a dead endpoint can't wedge a worker; failures are caught (the lease handles back-off, no retry storm). Test now asserts claim-once-and-lease (multi-node safety) and disabled skipping.
Code ReviewThis PR ships three independently useful pieces: an exact-timestamp lookup optimisation, error-log badge propagation via severity, and a full Prometheus scraping pipeline (exposition parser, CRUD model, distributed-queue background jobs, settings UI). The design of the distributed scraper (SKIP LOCKED lease, fan-out jobs, per-scrape timeout) is solid. A few issues worth addressing before merge: 1. SSRF — no URL scheme or host validation (security, confirmed)
The only validation on the submitted URL is At a minimum, validate that the scheme is 2. Counter misclassification —
|
Code ReviewThis PR ships three independent pieces: exact-timestamp log lookup (removes the ±1s window), error-log badge propagation (folds severity into the `errors` projection and client-side `rowHasError`), and a full Prometheus scrape pipeline (parser, model, CRUD, distributed-queue dispatch, settings UI). The design is solid — the `FOR UPDATE SKIP LOCKED` claim pattern is the right approach for multi-node safety. Bugs1. whenJustM (PromCfg.getConfig cid) \cfg -> void $ PromCfg.setEnabled pid cid (not cfg.enabled)Two concurrent clicks both read UPDATE apis.prometheus_scrape_configs SET enabled = NOT enabled WHERE id = ? AND project_id = ?2. attrs = AE.Object $ case cfg.extraLabels of
AE.Object o -> o <> labelMap -- ← left-biased; config wins over sample labels
_ -> labelMap
3. Error-badge detection diverges between the DB projection and the JS waterfall — The DB projection ( Fix: align both predicates. The DB projection already has Performance4. A new TLS HTTP manager is created on every scrape — let mgr = Left tlsManagerSettings{managerResponseTimeout = responseTimeoutMicro prometheusScrapeTimeoutMicros}
opts = auth $ defaults & Wreq.manager .~ mgr
Simplification / less code5. Four
6. splitBrace = go False "" . T.unpack -- Text → [Char] → (reverse acc) → TextThis allocates O(n) linked-list chars plus a second O(n) reverse for every label block. The project already depends on Minor
|
… validation Turn the Prometheus settings page from a blind CRUD form into a health-aware management surface (addresses the /critique findings): - Scrape health (ok / failing / pending, derived from last_status) is now a first-class colored+iconed badge, decoupled from the on/off state — an enabled-but-failing target no longer shows a reassuring green. - Test button: one-shot scrape (prometheusTestH, 10s timeout, no DB write) that reports "Scraped N samples · Xms" or a clean error, so targets are validated before saving instead of blind. - Edit: per-row edit modal → prometheusUpdateH, so URL/interval changes don't require delete+recreate (preserves history). - Require name (validatePrometheusForm) — it becomes service.name, so blank names would silently collide targets into one series namespace. - Row surfaces last-scraped time, sample count, an auth-configured lock icon, and a "View metrics" link; auth input is masked; interval is a preset select; static-labels prefill on edit; client-side filter; real onboarding empty state. Lib compiles clean (121 modules).
…verity branch /simplify pass over this session's diff (no behavior change): - Extract BackgroundJobs.prometheusScrapeOpts (timeout + auth-header wreq Options), shared by the worker and the settings Test button so they can't drift. - Collapse prometheusPostH/prometheusUpdateH into one promSave helper (they differed only by persist call + verb). - Inline the one-use `finite` predicate in ingestScrapedBody. - rowHasError: drop the severity_text/severity_number branch — it was inert (the trace select never ships those columns) now that the server `errors` flag folds in log severity; the worker just consumes that flag. Update the unit tests to drive error logs via the errors flag (removes a test that guarded the dead branch). - Fix the defaultSelectSqlQuery comment to stop claiming it "mirrors" LogQueries.is_error (it's a log-scoped subset). Lib compiles (121 modules); web-component tests 12/12.
…trict errors flag /code-review findings on this session's diff: - Drop the 15s/30s interval options (and floor validation at 60s): the scrape dispatcher ticks once per minute, so sub-minute intervals could never be honoured — the UI was promising a cadence the scheduler can't deliver. - ingestScrapedBody now logs the count of non-finite (NaN/±Inf) samples it drops instead of dropping them silently (CLAUDE.md: never silently drop data); an all-NaN target no longer reports a misleading "ok: 0 samples". - COALESCE the trace-view errors projection to false so the column is a strict boolean rather than tri-valued NULL. Lib compiles (121 modules).
… scrape interval - regular.svg lacked the `objects-column` symbol (it only existed in solid.svg), so the Prometheus nav icon and the settings/modal/empty-state icon badges rendered blank. Add the symbol (CLAUDE.md: add the <symbol> to the sprite the helper references before using the id). - Interval <select>: always include the config's current value as a selected option, even when it isn't one of the canonical presets — otherwise editing an unrelated field (name/url) on a config with an off-list interval silently reset it to 60s because no option matched.
…se rewind, URL scheme From the GitHub Claude PR review: - resolveMeta: also strip `_total` so OpenMetrics counters (`# TYPE http_requests counter` exposed as `http_requests_total`) are classified as Sum, not Gauge — otherwise rate queries are wrong for most real-world counters. Doctest added. - prometheusToggleH: flip via a single `UPDATE ... SET enabled = NOT enabled` (new toggleEnabled) instead of getConfig+setEnabled — fixes the read-modify-write race where two concurrent toggles cancel out. - markScraped: stamp `last_scraped_at = now()` (DB clock) instead of a host UTCTime, so host/DB clock skew can't rewind the claim lease and cause double-scraping. - validatePrometheusForm: require an http(s):// scheme (SSRF floor, matching the S3-endpoint field's bar) instead of accepting any string. Lib compiles (121 modules).
|
Thanks for the review — addressed in 2bc5fd5 (and earlier 1db17581/8c32b26a/211345ff): Fixed
Considered, kept as-is (rationale)
|
Review — PR #451: Exact-ts lookup, error-log badge propagation, Prometheus scrapingThree logically independent changesets in one PR. The exact-timestamp lookup and error-badge propagation changes are clean and well-reasoned. The Prometheus feature is the bulk of the code; notes below focus there. 🔴 SSRF —
|
Follow-ups from the latest PR review: - prometheusTestH fetched the URL without the scheme check (it doesn't go through validatePrometheusForm), leaving the SSRF floor open via the Test button. Extract isHttpUrl and apply it in both the Test handler and save-validation. - ingestScrapedBody: single-pass partition instead of filter + length-diff (which traversed the sample list twice). Lib compiles (121 modules).
Code ReviewThis PR cleanly delivers three independent pieces: (1) exact-timestamp lookup replacing the ±1s window with a micro-level equality match, (2) propagating ERROR-severity log badges to trace parents, and (3) a distributed Prometheus scraper with UI, CRUD, distributed-queue dispatch, and a doctest-covered exposition-format parser. The architecture choices (SKIP LOCKED lease, per-scrape HTTP timeout, fan-out job model) are solid. Below are the findings worth addressing before merge. 1. SSRF — test endpoint and scrape worker reach internal network addresses (Security, Medium)
Minimum mitigation: resolve the URL's hostname and reject requests to RFC 1918 ( 2.
|
Code Review — PR #451This PR ships three well-motivated pieces of work: exact-timestamp lookups (removing a ±1s window the caller never needed), error-log badge propagation via severity folding in Finding 1 — SSRF:
Finding 2 — Both functions convert Finding 3 —
Finding 4 —
Finding 5 — The migration correctly uses Minor positive notes
|
…st count 4th PR review pass: - safeScrapeUrl (replaces isHttpUrl): parse with Network.URI, require an http(s) scheme case-insensitively, and reject loopback/link-local/RFC-1918/ULA literal hosts. Applied to both save-validation and the Test handler. (Literal-host check; DNS-rebinding-safe resolve-and-pin noted as a follow-up.) - errors projection: add `status_code = 'ERROR'` inside the log clause for full parity with LogQueries.is_error — logs that carry only status_code now badge. - prometheusTestH: count only finite samples, matching what ingestScrapedBody stores, so the Test result doesn't overstate (e.g. +Inf bucket values). Build green (121 modules).
|
Round 4 addressed in c060908:
Remaining items I'm intentionally deferring (not blocking): the megaparsec rewrite of |
- web-components/package-lock.json was out of sync (npm ci failed: "Missing @emnapi/* from lock file"), breaking Build Frontend and UI Tests. Regenerated; `npm ci`, full vitest (135), and the production vite build all pass locally. - hlint (fail-on: warning) flagged 14 warnings in this branch's new code. Cleared them in Pkg.Prometheus (lines/words/unwords/toString from Relude, isAscii*/isDigit, stripSuffix section, parenthesized negation) and Pages.Settings (merged Data.Time import, sortWith, bound the filtered list to drop the length·filter suggestion).
…series The per-target 'View metrics' button linked to the unfiltered metrics explorer. Point it at ?metric_source=<name>, which the explorer maps to service_name = <name> — and sampleToMetricRecord stamps each scraped sample's service.name with the config name, so this scopes the view to exactly that Prometheus target's metrics.
Code ReviewPR does three things cleanly: (1) switches the single-row lookup to exact timestamp equality (safe — the full µs round-trip is verified), (2) propagates error-log severity into the Security
Correctness
parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))When
lower(level) = 'error' OR severity___severity_number >= 17 OR status_code = 'ERROR'A log with Cleanup / succinctnessNon-finite predicate duplicated —
Nested
Blocking: the two SSRF bypasses and the |
Code ReviewThis PR ships three independent pieces: exact-timestamp log lookups (removing the ±1 s window), error-log badge propagation in the trace waterfall, and a full Prometheus scrape pipeline (parser, model, settings UI, distributed job queue). The scope is substantial; below are the issues worth acting on. 1. SSRF –
|
- Drop the decorative icon pill above the heading (it violated our own
'no decorative icons above headings' brand rule and was the main AI-slop tell);
replace with a left-aligned heading + one-line subhead explaining scraping.
- Make 'Test' a real ghost button ('Test connection') in a shared action row
with Cancel + the commit button, instead of bare clickable text.
- Add persistent helper text under URL / auth header / static labels so format
guidance survives typing (recognition over recall).
- Collapse the optional auth + static-label fields behind an 'Advanced'
disclosure, auto-expanded when editing a target that already uses them.
- prometheusFields_ now owns its action row (modalId + submitLabel params),
removing the duplicated submit button from both the add and edit modals.
Code ReviewThree cleanly separated pieces of work. The 🔴 Security — SSRF: IPv4-mapped IPv6 not blocked
`safeScrapeUrl` blocks `[::1]` and the `[fd` / `[fe80` prefixes, but IPv4-mapped IPv6 addresses slip through: ``` The OS kernel routes all of these to the corresponding IPv4 private address. `fc00::/7` ULA has the same gap: the prefix guard is `"[fd"` but the range covers both `fc` and `fd` (bit 7 of the first octet), so `[fc00::1]` passes too. Suggested addition to `internalHost`: 🟡 Correctness — float timestamps stored without ×1000 conversion
The float fallback in `parseTs` truncates to milliseconds as stored, but `sampleToMetricRecord` converts back to wall time with `posixSecondsToUTCTime . (/ 1000) . fromIntegral`. If an exporter emits a float timestamp in seconds (as OpenMetrics specifies, and many Go/Python exporters do): ``` The fix in the float branch is `truncate (d * 1000)` so seconds-as-float become milliseconds. Classic Prometheus v0.0.4 says timestamps are integer ms, so integer-first parse stays unchanged; this only affects the non-standard but common float path. 🟡 Correctness — `parseLabels` drops all subsequent labels on first malformed pair
When a label pair has no quoted value (e.g. an unquoted integer `code=200`) or a missing `=`, the `_ -> []` arm returns immediately, discarding every following label on that line. For a line like: ``` `code=200` hits `_ -> []`, and `path="/api"` is silently lost. Spec-compliant exporters never emit this, but real-world exporters sometimes do. A more resilient approach: skip just the bad token up to the next comma and `go rest'`. 🟠 Simplification — `prometheusTestH` re-implements `validatePrometheusForm`
The test handler manually re-checks `T.null url`, `safeScrapeUrl url`, and the auth-header strip (`mfilter (not . T.null) (T.strip <$> form.authHeader)`) — all of which already live in `validatePrometheusForm`. If validation tightens (e.g. to reject non-`/metrics`-path URLs), the Test button diverges silently. A minimal fix is to run `validatePrometheusForm form` in `prometheusTestH` and extract the normalised values from the `Right` branch instead of re-deriving them. 🟠 Code quality — `splitBrace` / `takeQuoted` round-trip through `String`
`splitBrace` calls `toString` to convert the whole label block to `[Char]`, accumulates in a reversed `[Char]` list, then calls `toText . reverse`. `takeQuoted` uses `T.uncons` on `Text` input but also accumulates into `[Char]`. `attoparsec` is already a project dependency and handles exactly this pattern — an escape-aware quoted-string parser is ~10 lines with `Data.Attoparsec.Text` and avoids both the `toString` allocation and the `reverse` on exit. At minimum `takeQuoted` can keep its `T.uncons` discipline but use `Text.Builder` instead of `[Char]`. 🟢 Cleanup — screenshot PNGs committed to repo root
These are referenced in the PR description for illustration but will stay in git history permanently. They are not referenced from any code path. Either drop them from the commit or move them to `docs/` if they're meant to be kept. |
The redesigned modal had flat spacing — equal gaps between every field made helper text crowd the next label. Establish tight-within / generous-between rhythm: - gap-4 between fields, gap-1 within (label/input/helper) for clear grouping - wrap heading + form in a gap-5 container so the modal's space-y-2 no longer governs heading->form spacing (deterministic 1.25rem) - footer action row gets a top border + pt-4 to demarcate the commit zone - drop the Advanced toggle's border-t so it doesn't compete with the footer
Code Review — PR #451This PR lands three well-scoped changes: an exact-timestamp lookup optimisation, error-log badge propagation through the trace waterfall, and full Prometheus endpoint scraping with a distributed-queue dispatcher. The distributed-claim design ( Below are the findings, most-severe first. 1.
|
items-start + mt-0.5 (intended for wrapped multi-line errors) left the icon sitting too high on the common single-line case. Use items-center to match the success badge.
Code ReviewThis PR lands three distinct pieces of work: an exact-timestamp lookup optimisation, error-log badge propagation to trace-view ancestors, and a full Prometheus scrape-target feature (parser, model, CRUD, distributed queue dispatch, settings UI, integration tests). The first two are tight and correct. The Prometheus piece is solid overall — good use of 🔴 SSRF via IPv4-mapped IPv6 —
|
CI (node 22) runs npm 10, which read the npm-11-generated lockfile as out of sync — the top-level @emnapi/core and @emnapi/runtime entries the wasm Tailwind fallback pulls in were missing, failing 'npm ci'. Regenerated with npm 10 so the lock matches what CI resolves.
Code ReviewThree clean, well-scoped pieces of work: exact-timestamp lookup eliminates the ±1s window scan (good — PG and TF store microseconds losslessly), error-log badge propagation is correctly split between server-side SQL and the TS worker, and the Prometheus scraper has solid distributed-queue design (SKIP LOCKED, hard HTTP timeout, per-scrape fan-out). The test coverage is thorough. A few issues below, ranked by severity. Bug — label merge order in
|
Code Review — PR #451This PR lands three independent pieces: an exact-timestamp lookup optimization, error-log badge propagation, and a new Prometheus scraping subsystem. The distributed-queue design for Prometheus ( 🔴 Security — SSRF blocklist bypasses (
|
| Input | What uriRegName returns |
Resolves to |
|---|---|---|
http://localhost./metrics |
localhost. |
127.0.0.1 (trailing dot) |
http://[::ffff:127.0.0.1]/metrics |
[::ffff:127.0.0.1] |
127.0.0.1 (IPv4-mapped) |
http://0177.0.0.1/metrics |
0177.0.0.1 |
127.0.0.1 (octal) |
http://2130706433/metrics |
2130706433 |
127.0.0.1 (decimal) |
The prefix check "[fd" and "[fe80" cover ULA/link-local but not [::ffff:. The elem list uses literal "localhost" which won't match "localhost.". Suggest normalising by resolving the hostname (and pinning the result to reject later DNS changes) — the comment already flags DNS rebinding as a known gap. At minimum add "[::ffff:" to the prefix list and strip trailing dots before the elem check.
🟠 Security — Authorization header forwarded on redirect (src/BackgroundJobs.hs ~line 3052)
prometheusScrapeOpts sets Authorization globally on Wreq options. Haskell's http-client follows 301/302 redirects and, unlike browsers, does not strip Authorization when the redirect crosses hosts. A misconfigured target returning a 302 → http://attacker.net/ leaks the bearer token. Same issue in prometheusTestH. Fix: set managerModifyRequest or checkResponse to strip Authorization on cross-origin redirects, or simply disable redirects for scrape requests.
🟡 Bug — parseTs float fallback misinterprets float-seconds as float-milliseconds (src/Pkg/Prometheus.hs line 186)
parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))The float branch fires when readMaybe :: Maybe Int64 fails (e.g. "1395066363.5"). truncate 1395066363.5 = 1395066363 is then treated as milliseconds in sampleToMetricRecord (/ 1000), yielding ~January 1970 for any exporter that emits Unix seconds as a float. The Prometheus spec mandates integer milliseconds, but some exporters deviate. Either drop the float branch entirely (only spec-compliant integer ms are accepted) or document that float inputs are treated as float-ms (which would be correct only if the exporter means ms).
🧹 Cleanup — splitBrace roundtrips through String (src/Pkg/Prometheus.hs ~line 134)
splitBrace = go False "" . toString -- Text → String
where
go _ acc [] = Just (toText (reverse acc), "")
go inQ acc (c : cs) ...The entire Text is converted to [Char], accumulated in reverse, then reversed and packed. Data.Text already exposes T.span, T.break, and T.uncons that would avoid the roundtrip. Similarly, takeQuoted inside parseLabels uses the same prepend-then-reverse pattern with three toText (reverse acc) call sites. A T.span-based scan would halve allocations and remove the nesting.
🧹 Cleanup — prometheusTestH re-implements URL validation (src/Pages/Settings.hs ~line 536)
The test handler manually repeats the first two guards from validatePrometheusForm (null URL, safeScrapeUrl) with the same error strings. Extract:
validateScrapeUrl :: Text -> Either Text Text -- Left = error, Right = stripped URLBoth validatePrometheusForm and prometheusTestH call it, eliminating the duplication and the nested if/else if/else in the test handler.
🧹 Cleanup — parseLabelsText magic T.drop 1 (src/Pages/Settings.hs ~line 584)
let (k, v) = T.breakOn "=" pair
...
AE.String (T.strip (T.drop 1 v)) -- drop the leading "="T.breakOn "=" includes the delimiter in the suffix, so T.drop 1 silently skips =. Replace with T.stripPrefix "=" which makes the intent explicit and naturally collapses the separate not (T.null v) guard into a Just match:
[ (AEK.fromText k', AE.String v')
| pair <- T.splitOn "," t
, let (k, rest) = T.breakOn "=" pair
, let k' = T.strip k
, not (T.null k')
, Just v' <- [T.strip <$> T.stripPrefix "=" rest]
]🧹 Cleanup — ingestScrapedBody three intermediate lists (src/Models/Apis/PrometheusScrapeConfigs.hs ~line 357)
let (finiteS, dropped) = partition (...) (Prom.parsePrometheus (decodeUtf8 body))
records = V.fromList (map (sampleToMetricRecord cfg now) finiteS)partition builds two lists; map a third; V.fromList converts. dropped is only needed for null dropped and length dropped. One pass with V.mapMaybe eliminates two intermediate lists:
let samples = Prom.parsePrometheus (decodeUtf8 body)
records = V.mapMaybe (\s -> if isNaN s.value || isInfinite s.value then Nothing else Just (sampleToMetricRecord cfg now s)) (V.fromList samples)
droppedCount = length samples - V.length recordsMinor — safeScrapeUrl 172.x check (src/Pages/Settings.hs ~line 508)
any (\n -> ("172." <> show n <> ".") `T.isPrefixOf` h) ([16..31] :: [Int])This has a false positive: "172.16." is a prefix of "172.160.", "172.161." … "172.169." (public space). No security bypass in the other direction, but a user with a legitimate 172.160.x.x endpoint is incorrectly rejected. A structured check (parse the second octet and test 16 ≤ n ≤ 31) is unambiguous.
…, redirect token leak Correctness/security: - sampleToMetricRecord: scraped per-sample labels now win over the config's static extraLabels on key collision (KeyMap <> is left-biased; was clobbering e.g. env) - safeScrapeUrl: structured IPv4 octet parsing replaces prefix matching — fixes the 172.16-31 false-positive (172.160.x was wrongly rejected) and now blocks IPv4-mapped IPv6 ([::ffff:127.0.0.1]), trailing-dot FQDNs (localhost.), and obfuscated numeric hosts (decimal 2130706433, octal 0177.0.0.1, short 127.1) - prometheusScrapeOpts: disable redirect following — http-client re-sends Authorization across a cross-host 3xx, leaking the bearer token to a malicious redirect target - parseTs: drop the float fallback that misread float-seconds as float-ms (dated points to ~1970); spec mandates integer ms, non-integers fall back to scrape time Cleanup: - isFiniteSample lives in Pkg.Prometheus; ingestScrapedBody + Test handler share it (was copy-pasted), and ingest is now a single V.mapMaybe pass (no partition/map/fromList trio) - validateScrapeUrl extracted; validatePrometheusForm and the Test handler share it - parseLabelsText uses explicit T.stripPrefix "="; drop redundant Data.List import; prometheusScrapeOpts reads as a left-to-right & chain - 0102 migration: composite (last_scraped_at) WHERE enabled index for the claim query, dropping the now-redundant partial enabled index Doctests pin the new invariants (safeScrapeUrl bypasses, parseLabelsText, isFiniteSample).
Code Review — PR #451Solid three-in-one PR. The exact-timestamp lookup is a clean, well-justified simplification. The error-badge propagation is well-tested (the Vitest suite covers the direct + synthetic-parent cases). The Prometheus scraper has good bones: distributed dispatch, lease semantics, SSRF floor, hard HTTP timeout. A few findings below, roughly severity-ordered. 1. SSRF bypass via hex numeric host —
|
The errors column projection changed from 'errors is not null' to the COALESCE expression that also flags error-level logs (kind='log' with error level/severity/ status). Update the 11 ParserSpec golden strings to match.
This comment has been minimized.
This comment has been minimized.
Code ReviewThis PR bundles three independent pieces: exact-timestamp log lookup (removing the ±1 s window), error-log badge propagation to the trace waterfall, and a full Prometheus scraping pipeline (parser, model, settings UI, distributed-queue background jobs). The architecture is solid — the SKIP LOCKED dispatcher, SSRF check, and test-before-save flow are all well-considered. A handful of issues worth addressing before merge: 1. Auth header pre-populated in password field HTML — bearer token in page source
field_ "authHeader" "Authorization header" "Bearer <token>" "password" False ... (maybe "" (fromMaybe "" . (.authHeader)) mcfg)
2. Silent scrape loss when
|
…y, parser cleanup Security/correctness: - Edit form no longer echoes the saved auth token into the password field's HTML (it sat in page source / proxy caches); blank on edit now means 'keep saved token', and prometheusUpdateH preserves the existing token when the field is left empty - safeScrapeUrl: block the full ULA range fc00::/7 (add 'fc'; only 'fd' was covered) - dispatchPrometheusScrapes: guard each createJob with tryAny — targets are leased at claim time, so a mid-batch enqueue failure must not silently strand the rest; log each - scrapePrometheusTarget: capture the fallback metricTime after the HTTP response, not before, so a slow endpoint doesn't backdate every timestamp-less sample by the latency Cleanup: - ingestScrapedBody: single list comprehension (drop the list->Vector->mapMaybe trio) - splitBrace: forward Text scan (T.break) instead of round-tripping through String; doctest pins the }-inside-quoted-value case - parseLabelsText: remove the duplicate floating Haddock line - remove accidentally-committed prometheus-*.png screenshots; gitignore root *.png Doctest count 699 -> 700.
Code ReviewThis PR delivers three well-scoped pieces of work: exact-timestamp lookups (removing the wasteful ±1s window), error-log badge propagation to parent spans (server-side SQL + frontend flag), and a full Prometheus scraping subsystem (parser, model, background jobs, UI, migration). The distributed-claim design with FOR UPDATE SKIP LOCKED is solid. A few things worth fixing before merge: 1. Duplicate Haddock block on safeScrapeUrl — stray | in rendered docs
There are two consecutive 2. IPv6 link-local check is too narrow — fe81:: through febf:: slip through
The link-local range is fe80::/10, covering fe80:: through febf::. The "fe80" prefix only blocks the fe80::/16 slice; http://[fe81::1]/metrics through http://[febf::1]/metrics all pass. Real-world NICs virtually always use fe80:: so this is theoretical, but the comment claims link-local is blocked so the check should match the claim. Fix: extend the prefix list to include "fe8", "fe9", "fea", "feb" instead of just "fe80". 3. takeQuoted accumulates in String — contradicts the module design note
splitBrace in the same file (same job: scan a quoted string with escapes) has the explicit comment "Stays in Text (T.break chunks) rather than round-tripping through String". takeQuoted does the opposite: it builds a reversed [Char] list and calls 4. No UI path to clear authHeader once set
The edit handler intentionally treats a blank auth field as "keep the saved token" — correct, avoids echoing the secret into HTML. But there is no mechanism to remove a previously-configured header; the only recourse is delete-and-recreate or a direct DB write. A "Clear saved token" checkbox mapping to an explicit Nothing would complete the CRUD. 5. scrapePrometheusTarget — single ~200-char expression
Three monadic steps (HTTP fetch → capture time → ingest) plus the case branch are chained into one long line. A short do-block would surface the "capture metricTime after the response arrives" invariant in the code structure rather than in a comment, and make the error branches readable. Everything else looks good: the parser doctests are thorough, the SKIP LOCKED claim design is correct, the markScraped/claimDueConfigs lease semantics are sound, and the SSRF IPv4 coverage (obfuscated numeric hosts, octal octets, short-form addresses) is solid. Generated with Claude Code |
…readability - safeScrapeUrl: remove the leftover duplicate Haddock block; widen IPv6 link-local to the full fe80::/10 (fe8/fe9/fea/feb, not just fe80::/16); doctest the link-local/ULA cases - Settings: add a 'Clear saved token' checkbox so a configured auth header can be removed via the UI (blank still = keep; ticked = explicit Nothing), completing the CRUD - Prometheus.takeQuoted: accumulate in Text (T.snoc) instead of a reversed String, matching splitBrace's 'stay in Text' design note - scrapePrometheusTarget: expand the chained one-liner into a do-block so the 'capture metricTime after the response' ordering is visible in the structure Doctest count 700 -> 701.
ReviewThis PR does three independent, well-scoped things: narrows the log-item lookup from a ±1 s window to exact equality (safe because the caller always holds the stored timestamp), plumbs error-log severity into the trace waterfall's error badge (with matching SQL + TS refactor + good test coverage), and adds a full Prometheus scrape pipeline — exposition parser, CRUD model, distributed-queue dispatcher, settings UI, and migration. The architecture is solid: atomic SKIP LOCKED claim, per-scrape HTTP timeout, SSRF guard on the URL. A few issues worth addressing before merge: 1. SSRF bypass via hex-encoded IP literal —
|
…okup, name uniqueness Security: - safeScrapeUrl: block hex-encoded IP literals (0x7f000001, 0x7f.0x0.0x0.0x1) — the decimal-only numeric guard let them through; doctest added - getConfigByProject: project-scoped lookup for the update handler so a foreign config id can't load another project's auth token into the request context; getConfig kept only for the trusted background worker (which holds a claimed id) Data integrity: - migration 0103: UNIQUE(project_id, name) — two targets sharing a name collide on the metrics_meta (project_id, metric_name, service_name) key and clobber each other - promSave rejects a duplicate name with a friendly toast (DB constraint is the race backstop) Cleanup: - configsByProjectId returns Vector (conversion was duplicated at both call sites) - splitBrace returns Nothing on an unclosed brace instead of Just partial-content Doctest count 701 -> 702.
Three independent pieces of work, one commit each (+ a scaling follow-up).
1. Exact-timestamp random-access lookup (
5878ab203)Single-row item detail lookups (web log-item detail + the CLI/API
/events/{id}/time/{ts}path, both vialookupOtelRecord) matchedtimestamp BETWEEN ts-1s AND ts+1s AND id = ?. The caller always holds the exact stored timestamp, so we now matchtimestamp = ts: PG and TF both store microsecond precision (lossless through the UTCTime↔Hasql encoder), and exact equality lets the planner hit the(timestamp, id)ordering instead of scanning a window. Regression test asserts an off-by-1s timestamp no longer resolves on the TF or PG read path.2. Error-log
[+1]badge propagation (be1145bba)The trace waterfall lights a parent's red
[+N]badge viachildErrors, but error logs never contributed: the worker hardcodedhasErrors=falseforkind=log, and the trace-viewerrorscolumn only reflected structured error payloads (which severity-only error logs lack). So an error log — or an orphan error log under a synthetic "missing span" parent — left its parent un-reddened.log-worker-functions: derive a log's error-ness from theerrorsflag + severity (text or OTel number ≥17); spans unchanged.defaultSelectSqlQuery: fold log severity into theerrorsprojection (mirrorsLogQueries.is_error) so the signal reaches the client.Vitest covers direct + synthetic-parent propagation and the info-log non-regression (13/13). An integration test asserts the
errorscolumn is set for ERROR-severity logs.3. Prometheus endpoint scraping (
125f96469,141013057)Configure and scrape Prometheus
/metricsendpoints per project, ingesting samples as metrics (queryable via the existing metrics store).Pkg.Prometheus: text exposition-format (v0.0.4) parser — names, labels (escapes, commas in values), values (+Inf/-Inf/NaN), optional ms timestamps,# TYPE/# HELP. Doctested.Models.Apis.PrometheusScrapeConfigs: model + CRUD +ingestScrapedBody(counter→monotonic sum, else gauge; config name asservice.name; static+sample labels as attributes; non-finite dropped).0101.Built for multiple nodes/pods (
141013057)The naive per-minute tick scraped every target sequentially in one job — at scale the tick exceeds its interval, one hung endpoint stalls the batch, and two pods would double-scrape. Reworked into a distributed-queue model on top of odd-jobs:
claimDueConfigs: atomicUPDATE ... WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED LIMIT n) RETURNING ...— safe on every node (each due row leased to exactly one caller; the same statement bumpslast_scraped_atso it won't re-fire before its interval).PrometheusScrapeTickis now a cheap dispatcher (claim + fan out onePrometheusScrapeOneper target); workers across all pods drain those → horizontal scaling.Verification
SUM, gauge→GAUGE,+Infdropped) and the claim-once/lease invariant; it surfaced two real bugs that are fixed in these commits (scrape_interval_secondsneededBIGINTfor the int8 decoder; test assertion case).🤖 Generated with Claude Code