Skip to content

Exact-ts log lookup, error-log badge propagation, and Prometheus scraping#451

Open
tonyalaribe wants to merge 24 commits into
masterfrom
exact-lookup-error-badge-prometheus
Open

Exact-ts log lookup, error-log badge propagation, and Prometheus scraping#451
tonyalaribe wants to merge 24 commits into
masterfrom
exact-lookup-error-badge-prometheus

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Three independent pieces of work, one commit each (+ a scaling follow-up).

1. Exact-timestamp random-access lookup (5878ab203)

Single-row item detail lookups (web log-item detail + the CLI/API /events/{id}/time/{ts} path, both via lookupOtelRecord) matched timestamp BETWEEN ts-1s AND ts+1s AND id = ?. The caller always holds the exact stored timestamp, so we now match timestamp = ts: PG and TF both store microsecond precision (lossless through the UTCTime↔Hasql encoder), and exact equality lets the planner hit the (timestamp, id) ordering instead of scanning a window. Regression test asserts an off-by-1s timestamp no longer resolves on the TF or PG read path.

2. Error-log [+1] badge propagation (be1145bba)

The trace waterfall lights a parent's red [+N] badge via childErrors, but error logs never contributed: the worker hardcoded hasErrors=false for kind=log, and the trace-view errors column only reflected structured error payloads (which severity-only error logs lack). So an error log — or an orphan error log under a synthetic "missing span" parent — left its parent un-reddened.

  • log-worker-functions: derive a log's error-ness from the errors flag + severity (text or OTel number ≥17); spans unchanged.
  • defaultSelectSqlQuery: fold log severity into the errors projection (mirrors LogQueries.is_error) so the signal reaches the client.

Vitest covers direct + synthetic-parent propagation and the info-log non-regression (13/13). An integration test asserts the errors column is set for ERROR-severity logs.

3. Prometheus endpoint scraping (125f96469, 141013057)

Configure and scrape Prometheus /metrics endpoints per project, ingesting samples as metrics (queryable via the existing metrics store).

  • Pkg.Prometheus: text exposition-format (v0.0.4) parser — names, labels (escapes, commas in values), values (+Inf/-Inf/NaN), optional ms timestamps, # TYPE/# HELP. Doctested.
  • Models.Apis.PrometheusScrapeConfigs: model + CRUD + ingestScrapedBody (counter→monotonic sum, else gauge; config name as service.name; static+sample labels as attributes; non-finite dropped).
  • Settings: per-project Prometheus page (add / enable-disable / delete) with typed handlers + routes + nav entry.
  • migration 0101.

Built for multiple nodes/pods (141013057)

The naive per-minute tick scraped every target sequentially in one job — at scale the tick exceeds its interval, one hung endpoint stalls the batch, and two pods would double-scrape. Reworked into a distributed-queue model on top of odd-jobs:

  • claimDueConfigs: atomic UPDATE ... WHERE id IN (SELECT ... FOR UPDATE SKIP LOCKED LIMIT n) RETURNING ... — safe on every node (each due row leased to exactly one caller; the same statement bumps last_scraped_at so it won't re-fire before its interval).
  • PrometheusScrapeTick is now a cheap dispatcher (claim + fan out one PrometheusScrapeOne per target); workers across all pods drain those → horizontal scaling.
  • Per-scrape hard HTTP timeout (10s) so a dead endpoint can't wedge a worker; failures caught (the lease handles back-off).

Verification

  • Compiles clean: lib 121 + integration 174 modules, 0 errors.
  • Prometheus parser confirmed by direct evaluation of every doctest case.
  • Frontend vitest: 13/13.
  • Prometheus integration test ran against TimescaleDB and proved correct ingestion (counter→SUM, gauge→GAUGE, +Inf dropped) and the claim-once/lease invariant; it surfaced two real bugs that are fixed in these commits (scrape_interval_seconds needed BIGINT for the int8 decoder; test assertion case).

🤖 Generated with Claude Code

…ndow

Single-row item detail lookups (web log-item detail + the CLI/API
/events/{id}/time/{ts} path, both via lookupOtelRecord) matched
`timestamp BETWEEN ts-1s AND ts+1s AND id = ?`. The caller always holds
the exact stored timestamp (it comes from a prior query result), so match
`timestamp = ts` instead: both PG and TF store microsecond precision,
which round-trips losslessly through the UTCTime↔Hasql encoder, and exact
equality lets the planner hit the (timestamp, id) ordering rather than
scanning a window. Regression test asserts an off-by-1s timestamp no
longer resolves the row on either the TF or PG read path.
…missing-span parents

The trace waterfall lights a parent row's [+N] badge red via childErrors,
but error LOGS never contributed: the worker hardcoded hasErrors=false for
kind=log, and the trace-view `errors` column only reflected structured
error payloads (which severity-only error logs lack). So an error log — or
an orphan error log nested under a synthetic "missing span" parent — left
its parent's badge un-reddened.

- log-worker-functions: derive a log's error-ness from the errors flag and
  severity (text or OTel number >=17); spans unchanged. Now error logs and
  their (possibly synthetic) parents propagate childErrors.
- defaultSelectSqlQuery: fold log severity into the `errors` projection
  (mirrors LogQueries.is_error) so the signal actually reaches the client.

Vitest covers direct + synthetic-parent propagation and the info-log
non-regression; an integration test asserts the errors column is set for
ERROR-severity logs.
Adds the ability to register Prometheus scrape targets per project and
poll them on a per-target interval, parsing the text exposition format and
ingesting the samples as metrics (queryable via the existing metrics store).

- Pkg.Prometheus: exposition-format (v0.0.4) parser — names, labels (with
  escapes / commas in values), values (incl. +Inf/-Inf/NaN), optional ms
  timestamps, and # TYPE/# HELP metadata. Doctested.
- Models.Apis.PrometheusScrapeConfigs: model + CRUD, due-target selection
  (enabled AND interval elapsed), and ingestScrapedBody — parse → convert
  to MetricRecord (counter→monotonic sum, else gauge; config name as
  service.name; static + sample labels as attributes; non-finite dropped)
  → bulkInsertMetrics.
- BackgroundJobs.PrometheusScrapeTick: per-minute ticker (seeded by the
  daily scheduler) that GETs each due target (optional auth header), ingests
  it, and records last_scraped_at + last_status; failures are isolated.
- Settings: per-project Prometheus settings page (add / enable-disable /
  delete targets) with typed handlers + routes + nav entry.
- migration 0101: apis.prometheus_scrape_configs.

Integration test (PrometheusSpec) covers ingest (incl. non-finite drop) and
the due/scraped scheduling.
…de scaling

The per-minute tick scraped every due target sequentially in one job on one
pod — at hundreds/thousands of targets the tick wall-time exceeds the interval,
one hung endpoint stalls the batch, and two pods running it would double-scrape.

Rework into a distributed-queue model (leans on odd-jobs, which already gives a
cross-pod worker pool + retries):

- claimDueConfigs: atomic `UPDATE ... WHERE id IN (SELECT ... FOR UPDATE SKIP
  LOCKED LIMIT n) RETURNING ...`. Safe to run on every node — each due row is
  leased to exactly one caller, and the same statement bumps last_scraped_at so
  the next tick (any node) won't re-pick it before its interval. Oldest-first so
  a backlog drains fairly.
- PrometheusScrapeTick is now a cheap dispatcher: claim + fan out one
  PrometheusScrapeOne job per target. Workers across all pods drain those, so
  throughput scales horizontally with the fleet instead of a single loop.
- PrometheusScrapeOne worker: hard per-scrape HTTP response timeout (10s) via a
  manager-level managerResponseTimeout, so a dead endpoint can't wedge a worker;
  failures are caught (the lease handles back-off, no retry storm).

Test now asserts claim-once-and-lease (multi-node safety) and disabled skipping.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR ships three independently useful pieces: an exact-timestamp lookup optimisation, error-log badge propagation via severity, and a full Prometheus scraping pipeline (exposition parser, CRUD model, distributed-queue background jobs, settings UI). The design of the distributed scraper (SKIP LOCKED lease, fan-out jobs, per-scrape timeout) is solid. A few issues worth addressing before merge:


1. SSRF — no URL scheme or host validation (security, confirmed)

src/Pages/Settings.hsprometheusPostH

The only validation on the submitted URL is T.null (T.strip form.url). http://169.254.169.254/latest/meta-data/ (or any other internal/cloud-metadata URL) passes the check, is stored verbatim, and is later fetched unconditionally by W.getWith opts (toString cfg.url) in scrapePrometheusTarget. The response body ends up in last_status and/or ingested as metrics.

At a minimum, validate that the scheme is http:// or https:// and consider a blocklist for link-local ranges (169.254.0.0/16, fd00::/8, 127.0.0.0/8). The S3-endpoint field elsewhere in Settings uses pattern_ "https?://.*" — at least match that floor.


2. Counter misclassification — _total suffix not stripped in resolveMeta (bug, confirmed)

src/Pkg/Prometheus.hs:120

candidates = name : mapMaybe (\sfx -> T.stripSuffix sfx name) ["_bucket", "_sum", "_count"]

The Prometheus text format spec (and OpenMetrics) distinguish the family name in # TYPE from the sample name: a counter declared as # TYPE http_requests counter is exposed as http_requests_total{…} 1027. resolveMeta strips _bucket, _sum, _count but not _total, so http_requests_total never matches the map key http_requests. The sample is classified as Untyped → stored as a Gauge with aggregationTemporality = Nothing, isMonotonic = Nothing instead of a monotonic Sum — downstream rate queries will be wrong for the majority of real-world counter metrics.

The existing doctest hides this: it uses # TYPE http_requests_total counter (base name = sample name), which is valid spec but atypical in practice.

Fix: add "_total" to the suffix list.


3. markScraped can rewind the claim lease → double-scraping (plausible)

src/BackgroundJobs.hs:3067 and src/Models/Apis/PrometheusScrapeConfigs.hs:122

claimDueConfigs writes last_scraped_at = now() (DB clock). now is then captured on the Haskell host before the scrape begins, and markScraped unconditionally overwrites last_scraped_at = #{t} with that host-clock value. If the Haskell host lags the DB by δ seconds, markScraped rewinds the timestamp by at least δ + scrape_duration. On the next tick, last_scraped_at < now() - make_interval(secs => scrape_interval_seconds) can fire again before the interval has elapsed, double-scraping the target.

Simplest fix: use now() in the markScraped SQL instead of passing a Haskell UTCTime:

UPDATESET last_scraped_at = now(), last_status = #{status} WHERE id = #{cid}

4. Exact timestamp regression — callers that round-trip via JSON/URL may 404 (plausible)

src/Models/Telemetry/Telemetry.hslookupOtelRecord

The change from BETWEEN ts-1s AND ts+1s to timestamp = #{createdAt} is correct when the caller holds the losslessly stored value. The risk is in callers that reconstruct UTCTime from a serialised form. If any JS-side formatting (e.g. Date.toISOString()) or a JSON round-trip truncates microseconds before the value is sent back in a URL path parameter, the exact match returns 0 rows and the detail panel silently 404s. The PR description says "the caller always holds the exact stored timestamp" — worth confirming that every call site (the web log-item detail, the CLI /events/{id}/time/{ts} path) flows the value through without a format round-trip that could drop sub-millisecond precision.


5. splitBrace / takeQuotedT.unpack + reversed [Char] accumulator (succinctness)

src/Pkg/Prometheus.hs:126 and :153

Both helpers convert Text → [Char], build a reversed character list, then call toText (reverse acc). This is ~30 lines of boilerplate for what is essentially a quoted-string scanner. Given the project's succinctness priority, a Data.Text.Builder-backed helper (or a single attoparsec/megaparsec combinator) would halve the code and avoid the unpack/pack round-trip. The two functions also duplicate essentially the same escape-handling loop — extracting one shared takeQuotedText would remove the duplication entirely.


Minor

  • parseValue (Pkg/Prometheus.hs): the five-case +Inf/-Inf/NaN match re-implements what Data.Attoparsec.Text.double (or Scientific) handles natively. Worth delegating to avoid a bespoke spec table.
  • ingestScrapedBody: no response-size cap — a misbehaving exporter returning hundreds of MB will materialise the full Vector MetricRecord before any insert. Consider capping at, say, 10 MB or chunking the bulk insert.
  • forM_ job enqueue in dispatchPrometheusScrapes: 1000 sequential createJob round-trips is fine at current scale but worth noting if prometheusScrapeBatchLimit ever grows; a batched insert would make this O(1) RTTs.

🤖 Generated with Claude Code

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR ships three independent pieces: exact-timestamp log lookup (removes the ±1s window), error-log badge propagation (folds severity into the `errors` projection and client-side `rowHasError`), and a full Prometheus scrape pipeline (parser, model, CRUD, distributed-queue dispatch, settings UI). The design is solid — the `FOR UPDATE SKIP LOCKED` claim pattern is the right approach for multi-node safety.


Bugs

1. prometheusToggleH is a racy read-modify-writesrc/Pages/Settings.hs

whenJustM (PromCfg.getConfig cid) \cfg -> void $ PromCfg.setEnabled pid cid (not cfg.enabled)

Two concurrent clicks both read enabled=true, both write false, leaving the config disabled with no net toggle. The fix is a single atomic SQL statement:

UPDATE apis.prometheus_scrape_configs SET enabled = NOT enabled WHERE id = ? AND project_id = ?

2. extraLabels merge order silently overwrites per-sample labelssrc/Models/Apis/PrometheusScrapeConfigs.hs

attrs = AE.Object $ case cfg.extraLabels of
  AE.Object o -> o <> labelMap   -- ← left-biased; config wins over sample labels
  _ -> labelMap

Data.Aeson.KeyMap.(<>) is left-biased, so if the config has {"env": "prod"} and the scraped metric exports env="staging", the stored attribute is env="prod". The per-sample label is silently dropped. Flip to labelMap <> o so static config labels serve as defaults, not overrides.


3. Error-badge detection diverges between the DB projection and the JS waterfallsrc/Pkg/Parser.hs vs web-components/src/log-worker-functions.ts

The DB projection (Parser.hs:533) uses lower(level) = 'error' for text-based severity. The JS rowHasError additionally matches fatal, critical, alert, emerg via regex. Because level and severity_text are the same column, a log with level='fatal' gets errors=false from the DB (badge count = 0) but rowHasError returns true (row appears red in the waterfall). The waterfall and the DB-backed badge count diverge.

Fix: align both predicates. The DB projection already has severity___severity_number >= 17 which covers FATAL (severity 21) numerically — adding OR lower(level) IN ('fatal','critical') (or whatever the agreed set is) would close the gap, or narrow the JS regex to match only what the DB checks.


Performance

4. A new TLS HTTP manager is created on every scrapesrc/BackgroundJobs.hs

let mgr = Left tlsManagerSettings{managerResponseTimeout = responseTimeoutMicro prometheusScrapeTimeoutMicros}
    opts = auth $ defaults & Wreq.manager .~ mgr

Left ManagerSettings tells wreq to call newManager on each request, allocating a fresh connection pool with no reuse. With many targets and short intervals, every scrape opens a new TCP+TLS handshake. Create one shared Manager (with the desired timeout) at startup and store it in the context (or as a module-level unsafePerformIO, consistent with how the rest of the codebase handles shared managers).


Simplification / less code

5. Four V.fromList <$> conversions are unnecessarysrc/Pages/Settings.hs

configsByProjectId returns [PrometheusScrapeConfig]. All four handlers convert to V.Vector only to call V.null and V.forM_ — both available on plain lists. Changing PrometheusGet / PrometheusMut and prometheusTargetsList to use [PrometheusScrapeConfig] drops four V.fromList calls and the Data.Vector import from the settings module with no behaviour change.

6. splitBrace in the Prometheus parser uses T.unpack char-list traversalsrc/Pkg/Prometheus.hs

splitBrace = go False "" . T.unpack  -- Text → [Char] → (reverse acc) → Text

This allocates O(n) linked-list chars plus a second O(n) reverse for every label block. The project already depends on attoparsec (used in Pkg.Parser); a small Attoparsec.Text parser for the brace block would be text-native, use no intermediate allocation, and be shorter. Even without attoparsec, Data.Text.span/break with a quote-tracking state passed as an argument avoids the list.


Minor

  • recordHelp joins the HELP text via T.words then T.unwords (drop 2 ws), collapsing any intentional multi-space runs in the HELP description. Consider using T.breakOn on the original line to preserve spacing.
  • claimDueConfigs sets last_scraped_at = now() at claim time; markScraped overwrites it at completion. This is fine (completion time is the more meaningful anchor), but the two writes are easy to confuse when reading the lease logic — a one-line comment on claimDueConfigs noting that the lease is intentionally overwritten by markScraped would help.

… validation

Turn the Prometheus settings page from a blind CRUD form into a health-aware
management surface (addresses the /critique findings):

- Scrape health (ok / failing / pending, derived from last_status) is now a
  first-class colored+iconed badge, decoupled from the on/off state — an
  enabled-but-failing target no longer shows a reassuring green.
- Test button: one-shot scrape (prometheusTestH, 10s timeout, no DB write) that
  reports "Scraped N samples · Xms" or a clean error, so targets are validated
  before saving instead of blind.
- Edit: per-row edit modal → prometheusUpdateH, so URL/interval changes don't
  require delete+recreate (preserves history).
- Require name (validatePrometheusForm) — it becomes service.name, so blank
  names would silently collide targets into one series namespace.
- Row surfaces last-scraped time, sample count, an auth-configured lock icon,
  and a "View metrics" link; auth input is masked; interval is a preset select;
  static-labels prefill on edit; client-side filter; real onboarding empty state.

Lib compiles clean (121 modules).
…verity branch

/simplify pass over this session's diff (no behavior change):

- Extract BackgroundJobs.prometheusScrapeOpts (timeout + auth-header wreq Options),
  shared by the worker and the settings Test button so they can't drift.
- Collapse prometheusPostH/prometheusUpdateH into one promSave helper (they
  differed only by persist call + verb).
- Inline the one-use `finite` predicate in ingestScrapedBody.
- rowHasError: drop the severity_text/severity_number branch — it was inert
  (the trace select never ships those columns) now that the server `errors`
  flag folds in log severity; the worker just consumes that flag. Update the
  unit tests to drive error logs via the errors flag (removes a test that
  guarded the dead branch).
- Fix the defaultSelectSqlQuery comment to stop claiming it "mirrors"
  LogQueries.is_error (it's a log-scoped subset).

Lib compiles (121 modules); web-component tests 12/12.
…trict errors flag

/code-review findings on this session's diff:

- Drop the 15s/30s interval options (and floor validation at 60s): the scrape
  dispatcher ticks once per minute, so sub-minute intervals could never be
  honoured — the UI was promising a cadence the scheduler can't deliver.
- ingestScrapedBody now logs the count of non-finite (NaN/±Inf) samples it
  drops instead of dropping them silently (CLAUDE.md: never silently drop data);
  an all-NaN target no longer reports a misleading "ok: 0 samples".
- COALESCE the trace-view errors projection to false so the column is a strict
  boolean rather than tri-valued NULL.

Lib compiles (121 modules).
… scrape interval

- regular.svg lacked the `objects-column` symbol (it only existed in solid.svg),
  so the Prometheus nav icon and the settings/modal/empty-state icon badges
  rendered blank. Add the symbol (CLAUDE.md: add the <symbol> to the sprite the
  helper references before using the id).
- Interval <select>: always include the config's current value as a selected
  option, even when it isn't one of the canonical presets — otherwise editing an
  unrelated field (name/url) on a config with an off-list interval silently reset
  it to 60s because no option matched.
…se rewind, URL scheme

From the GitHub Claude PR review:

- resolveMeta: also strip `_total` so OpenMetrics counters (`# TYPE http_requests
  counter` exposed as `http_requests_total`) are classified as Sum, not Gauge —
  otherwise rate queries are wrong for most real-world counters. Doctest added.
- prometheusToggleH: flip via a single `UPDATE ... SET enabled = NOT enabled`
  (new toggleEnabled) instead of getConfig+setEnabled — fixes the read-modify-write
  race where two concurrent toggles cancel out.
- markScraped: stamp `last_scraped_at = now()` (DB clock) instead of a host UTCTime,
  so host/DB clock skew can't rewind the claim lease and cause double-scraping.
- validatePrometheusForm: require an http(s):// scheme (SSRF floor, matching the
  S3-endpoint field's bar) instead of accepting any string.

Lib compiles (121 modules).
@tonyalaribe

Copy link
Copy Markdown
Contributor Author

Thanks for the review — addressed in 2bc5fd5 (and earlier 1db17581/8c32b26a/211345ff):

Fixed

  • Counter _total misclassificationresolveMeta now strips _total too, so OpenMetrics counters are typed as Sum, not Gauge. Doctest added.
  • prometheusToggleH race — replaced get+set with an atomic UPDATE … SET enabled = NOT enabled (toggleEnabled).
  • markScraped lease rewind — now stamps last_scraped_at = now() (DB clock) instead of a host UTCTime, so clock skew can't re-fire a target early.
  • SSRF floorvalidatePrometheusForm now requires an http(s):// scheme (matching the S3-endpoint field's bar). A full link-local/metadata blocklist is a worthwhile follow-up.
  • rowHasError severity divergence — already resolved earlier this branch: the client now consumes the server errors flag (which folds in severity) and no longer re-derives fatal/critical/..., so the waterfall and the DB badge can't diverge.
  • Non-finite samples — now logged (count) instead of dropped silently; sub-60s interval options removed (the dispatcher ticks per minute); errors projection wrapped in COALESCE(..., false).

Considered, kept as-is (rationale)

  • extraLabels merge order (config wins) — intentional: matches Prometheus' default honor_labels: false, where configured target labels take precedence over scraped ones.
  • Shared TLS manager — marginal here: intervals are ≥60s and each target is a distinct host, so cross-scrape connection reuse wouldn't kick in; a fresh manager also gives clean per-scrape timeout isolation.
  • Exact-timestamp lookup — traced all call sites: the value flows as the raw stored string into the URL and back via servant's microsecond-preserving FromHttpApiData UTCTime; no lossy round-trip. Covered by the off-by-1s regression test on both PG and TF.
  • splitBrace/parseValue rewrites, response-size cap, batched enqueue — fair points; deferred as non-blocking (parser is doctested and runs once/scrape; batch enqueue is fine at the current 1000 cap).

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review — PR #451: Exact-ts lookup, error-log badge propagation, Prometheus scraping

Three logically independent changesets in one PR. The exact-timestamp lookup and error-badge propagation changes are clean and well-reasoned. The Prometheus feature is the bulk of the code; notes below focus there.


🔴 SSRF — prometheusTestH (and stored-target scraping)

src/Pages/Settings.hsprometheusTestH

res <- tryAny (W.getWith opts (toString url))

url is taken directly from form input with no protocol or address validation. An authenticated project member can direct the server to fetch:

  • http://169.254.169.254/latest/meta-data/ (AWS/GCP/Azure instance metadata)
  • Any RFC 1918 address (internal network scanning)

The same vector exists for saved targets: validatePrometheusForm only checks non-emptiness, not protocol or IP range. Fix: reject non-http(s):// schemes and optionally block link-local / private ranges before the HTTP call.


🟠 TOCTOU race + unnecessary DB round-trip — prometheusToggleH

src/Pages/Settings.hsprometheusToggleH

whenJustM (PromCfg.getConfig cid) \cfg -> void $ PromCfg.setEnabled pid cid (not cfg.enabled)

Two concurrent toggles both read enabled = true, both write false — the expected flip never happens. The read is also unnecessary. Replace both operations with a single atomic UPDATE:

prometheusToggleH pid cid = do
  _ <- Projects.sessionAndProject pid
  Hasql.interpExecute [HI.sql|UPDATE apis.prometheus_scrape_configs SET enabled = NOT enabled WHERE id = #{cid} AND project_id = #{pid}|]
  prometheusMut pid

This eliminates the race, the extra DB round-trip, and the getConfig call entirely.


🟡 Pkg.Prometheus — hand-rolled char-list parser when attoparsec/megaparsec are in scope

src/Pkg/Prometheus.hssplitBrace, takeQuoted, parseLabels

splitBrace and takeQuoted both do T.unpack → character-list accumulation → reverse accT.pack:

splitBrace = go False "" . T.unpack
  where
    go inQ acc (c : cs) | c == '"' = go (not inQ) (c : acc) cs
    ...
    go _ acc [] = Just (toText (reverse acc), "")

Both attoparsec and megaparsec are already project dependencies and used elsewhere (Pkg.Parser.Expr, Pkg.Parser.Stats). The quoted-string scanner alone collapses from ~12 lines to ~4 with Text.Megaparsec.Char:

quotedVal :: Parser Text
quotedVal = char '"' *> (T.pack <$> manyTill (unescape <|> anySingle) (char '"'))
  where unescape = char '\\' *> (('\n' <$ char 'n') <|> anySingle)

Using a proper combinator eliminates reverse acc, the T.unpack allocation, and the Nothing-on-unclosed-brace ambiguity (currently go _ acc [] = Just ... silently accepts malformed input).


🟡 ingestScrapedBody — two-pass filter when partition gives one

src/Models/Apis/PrometheusScrapeConfigs.hsingestScrapedBody

let finiteS   = filter (\s -> not (isNaN s.value || isInfinite s.value)) samples
    records   = V.fromList (map (sampleToMetricRecord cfg now) finiteS)
    droppedN  = length samples - length finiteS   -- traverses `samples` twice

length samples re-traverses the full list. One-pass with partition:

let (finiteS, dropped) = partition (\s -> not (isNaN s.value || isInfinite s.value)) samples
    records  = V.fromList (map (sampleToMetricRecord cfg now) finiteS)
    droppedN = length dropped

🟢 Minor — prometheusPostH/prometheusUpdateH lambda is an eta-expansion

prometheusPostH pid form = promSave pid form "Added"
  \(name, url, interval, authH, labels) -> PromCfg.insertConfig pid name url interval authH labels

The lambda just applies the tuple's components to insertConfig pid. Since promSave's persist argument already receives the 5-tuple, this could use uncurry or a local alias — or better, accept the tuple directly in insertConfig / updateConfig — but the current form at least reads clearly. Worth tidying if the tuple grows.


✅ What's solid

  • Exact-timestamp lookup (lookupOtelRecord): the reasoning (microsecond round-trip fidelity, planner index hit) is correct and the regression test in LogSpec is exactly the right kind.
  • claimDueConfigs SQL: UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED LIMIT n) RETURNING … is the correct PostgreSQL pattern for a distributed lease; bumping last_scraped_at inside the claim statement rather than after is the right call.
  • Pkg.Prometheus doctests cover the representative cases (counter w/ labels, no-label, ±Inf/NaN, type-scope inheritance across a family, comma-in-label-value).
  • rowHasError extraction in the TS worker is clean; the comment explicitly documents why logs must not scan summary (false-positive risk), which is the non-obvious part.

Follow-ups from the latest PR review:

- prometheusTestH fetched the URL without the scheme check (it doesn't go through
  validatePrometheusForm), leaving the SSRF floor open via the Test button. Extract
  isHttpUrl and apply it in both the Test handler and save-validation.
- ingestScrapedBody: single-pass partition instead of filter + length-diff (which
  traversed the sample list twice).

Lib compiles (121 modules).
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR cleanly delivers three independent pieces: (1) exact-timestamp lookup replacing the ±1s window with a micro-level equality match, (2) propagating ERROR-severity log badges to trace parents, and (3) a distributed Prometheus scraper with UI, CRUD, distributed-queue dispatch, and a doctest-covered exposition-format parser. The architecture choices (SKIP LOCKED lease, per-scrape HTTP timeout, fan-out job model) are solid. Below are the findings worth addressing before merge.


1. SSRF — test endpoint and scrape worker reach internal network addresses (Security, Medium)

src/Pages/Settings.hs ~L527, src/BackgroundJobs.hs ~L3079

isHttpUrl only blocks non-HTTP schemes (e.g. file://). It does not prevent targeting RFC 1918 addresses, link-local (169.254.169.254), or localhost. Any project member can use the "Test" button to probe internal services with immediate feedback, or persist a scrape config that hits an internal endpoint every minute. The test handler even reports response latency, making enumeration easy.

Minimum mitigation: resolve the URL's hostname and reject requests to RFC 1918 (10/8, 172.16/12, 192.168/16), loopback (127/8, ::1), and link-local (169.254/16) address ranges before fetching. Note that isHttpUrl is also case-sensitive ("HTTP://" slips through despite RFC 3986 §3.1 requiring case-insensitive scheme comparison).


2. errors column omits status_code = 'ERROR' vs LogQueries.is_error (Bug, Low)

src/Pkg/Parser.hs L534

The PR description says the new expression "mirrors LogQueries.is_error", and indeed LogQueries.is_error is:

lower(level) = 'error' OR severity___severity_number >= 17 OR status_code = 'ERROR'

But the new column omits OR status_code = 'ERROR':

kind = 'log' AND (lower(level) = 'error' OR severity___severity_number >= 17)

Log records from some SDKs (e.g. those that emit status_code = 'ERROR' without a level field) won't propagate the error badge. Add the missing term for parity:

kind = 'log' AND (lower(level) = 'error' OR severity___severity_number >= 17 OR status_code = 'ERROR')

3. Test-button sample count includes non-finite samples that ingest drops (UX/Correctness, Low)

src/Pages/Settings.hs ~L537

prometheusTestH reports length (Prom.parsePrometheus ...), which includes +Inf/-Inf/NaN samples. ingestScrapedBody partitions these out before counting. A metrics endpoint that exports +Inf values (common for histograms' +Inf bucket) shows a higher count in the test result than what will actually be ingested. Apply the same finite filter before reporting:

Right resp ->
  let samples = filter (\s -> not (isNaN s.value || isInfinite s.value))
                       (Prom.parsePrometheus (decodeUtf8 (resp ^. Wreq.responseBody)))
  Right (length samples, ms)

4. Pkg.Prometheus hand-rolls a char-list parser when Megaparsec is already a dependency (Cleanup, Low)

src/Pkg/Prometheus.hs L133–170

splitBrace and takeQuoted both convert Text to [Char] via T.unpack and accumulate into a reversed [Char] list before toText . reverse. Every other parser in the project uses Megaparsec (Pkg.Parser.Expr, Pkg.Parser.Stats, Pkg.Components.TimePicker). A small Megaparsec parser here would:

  • avoid the String allocation on every label block
  • express the grammar more clearly (quoted-string, escape handling, brace matching)
  • follow the established project pattern

The current implementation works, but it diverges from the project standard without a clear reason.


5. PrometheusScrapeTick UTCTime — carries an always-discarded field (Simplification, Low)

src/BackgroundJobs.hs L160, L563

PrometheusScrapeTick _ -> dispatchPrometheusScrapes authCtx always ignores the UTCTime. Compare QueryMonitorsCheck which is a nullary constructor seeded via const BackgroundJobs.QueryMonitorsCheck. PrometheusScrapeTick could do the same:

-- constructor becomes nullary
| PrometheusScrapeTick

-- seed line
seed 1440 60 (const BackgroundJobs.PrometheusScrapeTick)

This removes a constructor field that carries no information at dispatch time.


Overall the distributed-queue design for multi-node scraping, the exact-timestamp lookup regression test, and the error-badge integration test are well-reasoned and cleanly implemented. The SSRF gap (finding 1) is the one item that warrants a fix before shipping.

🤖 Generated with Claude Code

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review — PR #451

This PR ships three well-motivated pieces of work: exact-timestamp lookups (removing a ±1s window the caller never needed), error-log badge propagation via severity folding in defaultSelectSqlQuery, and a distributed Prometheus scraper built on the existing odd-jobs queue. The architecture is sound and the integration tests cover the real failure modes (multi-node lease invariant, exact-ts miss, +Inf drop). A few findings below, ranked by severity.


Finding 1 — SSRF: prometheusTestH skips URL validation (src/Pages/Settings.hs:514)

promSave calls validatePrometheusForm which rejects anything not starting with http:// or https://. The Test handler bypasses that entirely — it only checks T.null url. Any authenticated user can supply file:///etc/passwd, http://169.254.169.254/latest/meta-data/, http://localhost:2375/ (Docker daemon), etc., and the error text leaks the response. The fix is one line — call validatePrometheusForm (or extract the URL predicate) before fetching.


Finding 2 — splitBrace / takeQuoted use String-based processing when the project has attoparsec (src/Pkg/Prometheus.hs:133-170)

Both functions convert Text -> String (via T.unpack), accumulate a reversed [Char] list, then convert back. The project already has both attoparsec and megaparsec in the cabal. Even staying in Data.Text, T.uncons in a loop (no pack/unpack) would remove the allocation overhead. splitBrace could use T.span/T.breakOn for unescaped runs and only handle escapes char-by-char. Same for the c : acc accumulator in takeQuotedT.span (\c -> c /= '"' && c /= '\\') handles the bulk in one pass.


Finding 3 — parseTs evaluates toString t twice (src/Pkg/Prometheus.hs:185)

toString t allocates a String each call. Bind it once: let s = toString t in readMaybe s <|> (truncate <$> (readMaybe s :: Maybe Double))


Finding 4 — ingestScrapedBody traverses finiteS twice (src/Models/Apis/PrometheusScrapeConfigs.hs:153-155)

records is a 1-to-1 map of finiteS, so V.length records is already the count: use droppedN = length samples - V.length records instead of re-traversing finiteS.


Finding 5 — scrapeIntervalSeconds typed Int but column is BIGINT (src/Models/Apis/PrometheusScrapeConfigs.hs:63)

The migration correctly uses BIGINT (with inline comment: "Haskell Int decodes as int8"). The Haskell field should be Int64 to make the intent explicit and avoid surprises on 32-bit builds.


Minor positive notes

  • The FOR UPDATE SKIP LOCKED claim pattern in claimDueConfigs is the right primitive for multi-node scraping; advancing last_scraped_at in the same statement is correct atomic lease design.
  • Exact-timestamp regression test covering both TF and PG paths is good.
  • prometheusScrapeOpts shared between the worker and the Test button is good hygiene — they cannot drift on timeout or auth.
  • bool Nothing (Just X) isCounter works, but isCounter $> Just X is marginally more idiomatic.

…st count

4th PR review pass:

- safeScrapeUrl (replaces isHttpUrl): parse with Network.URI, require an http(s)
  scheme case-insensitively, and reject loopback/link-local/RFC-1918/ULA literal
  hosts. Applied to both save-validation and the Test handler. (Literal-host
  check; DNS-rebinding-safe resolve-and-pin noted as a follow-up.)
- errors projection: add `status_code = 'ERROR'` inside the log clause for full
  parity with LogQueries.is_error — logs that carry only status_code now badge.
- prometheusTestH: count only finite samples, matching what ingestScrapedBody
  stores, so the Test result doesn't overstate (e.g. +Inf bucket values).

Build green (121 modules).
@tonyalaribe

Copy link
Copy Markdown
Contributor Author

Round 4 addressed in c060908:

  • SSRFsafeScrapeUrl (Network.URI): case-insensitive http(s) scheme + reject loopback/link-local/RFC-1918/ULA literal hosts, applied to both save-validation and the Test button. Full DNS-resolve-and-pin (rebinding) is noted as a follow-up.
  • status_code = 'ERROR' added to the log-scoped errors projection for parity with LogQueries.is_error.
  • Test count now filters non-finite samples to match what ingestScrapedBody stores.

Remaining items I'm intentionally deferring (not blocking): the megaparsec rewrite of splitBrace/takeQuoted (works, doctested), DNS-rebinding-safe SSRF resolution, and making PrometheusScrapeTick nullary (would break already-enqueued job payloads). Happy to take any of these as follow-ups.

- web-components/package-lock.json was out of sync (npm ci failed: "Missing
  @emnapi/* from lock file"), breaking Build Frontend and UI Tests. Regenerated;
  `npm ci`, full vitest (135), and the production vite build all pass locally.
- hlint (fail-on: warning) flagged 14 warnings in this branch's new code. Cleared
  them in Pkg.Prometheus (lines/words/unwords/toString from Relude, isAscii*/isDigit,
  stripSuffix section, parenthesized negation) and Pages.Settings (merged Data.Time
  import, sortWith, bound the filtered list to drop the length·filter suggestion).
…series

The per-target 'View metrics' button linked to the unfiltered metrics
explorer. Point it at ?metric_source=<name>, which the explorer maps to
service_name = <name> — and sampleToMetricRecord stamps each scraped
sample's service.name with the config name, so this scopes the view to
exactly that Prometheus target's metrics.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

PR does three things cleanly: (1) switches the single-row lookup to exact timestamp equality (safe — the full µs round-trip is verified), (2) propagates error-log severity into the errors SQL column and honours it in the TS worker, and (3) adds a full Prometheus scraping stack (parser, model, dispatcher/worker, UI). The SKIP LOCKED dispatcher pattern is correct and the exposition-format parser is solid. A few real issues below, ranked by severity.


Security

safeScrapeUrl: IPv4-mapped IPv6 bypasses the RFC-1918 blocksrc/Pages/Settings.hs ~line 555
internalHost checks for "[fd" and "[fe80" prefixes but not "[::ffff:". parseURI "http://[::ffff:192.168.1.1]/metrics" returns uriRegName = "[::ffff:192.168.1.1]" — nothing in the prefix list matches, so the URL is allowed. The Linux dual-stack socket connects to 192.168.1.1 (RFC-1918). Fix: add "[::ffff:" to any (\p -> p \T.isPrefixOf` h)`.

safeScrapeUrl: bare "0" host bypasses the "0.0.0.0" checksrc/Pages/Settings.hs ~line 553
parseURI "http://0/metrics" (RFC 3986: "0" is a reg-name, not an IPv4 address) gives uriRegName = "0". The check h \elem` ["localhost", "[::1]", "0.0.0.0"]tests the exact string"0.0.0.0"and misses"0". On Linux, connecting to "0"reaches127.0.0.1. Fix: add "0"` to the blocklist.


Correctness

parseTs float-second fallback produces timestamps ~1000× too earlysrc/Pkg/Prometheus.hs ~line 184

parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))

When readMaybe :: Maybe Int64 fails on a decimal string like "1700000000.123" (OpenMetrics float-second format), the Double fallback returns Just 1700000000. sampleToMetricRecord then does posixSecondsToUTCTime . (/ 1000) . fromIntegral, so 1700000000 / 1000 = 1700000 seconds from epoch ≈ January 20, 1970 instead of November 2023. Any OpenMetrics exporter that emits float-second timestamps will have every metric stored ~53 years in the past, making them invisible in any normal time-range query. The Prometheus text v0.0.4 format specifies integer milliseconds, so the safest fix is to drop the Double fallback entirely (or detect and reject non-integer timestamps).

level = 'fatal' logs with no severity number miss the error badgesrc/Pkg/Parser.hs ~line 534

lower(level) = 'error' OR severity___severity_number >= 17 OR status_code = 'ERROR'

A log with level = 'FATAL' and severity___severity_number IS NULL evaluates false OR NULL OR false = NULL → COALESCE(..., false) = false. The badge is not set. The comment on this line says "ERROR/FATAL logs also propagate the badge" but fatal is only caught by severity_number >= 17, not the level string check. Any SDK that normalises to "FATAL" as the level string but omits the numeric severity (logged via isAlertableLogLevel which does handle "FATAL") will produce un-badged fatal logs. Fix: lower(level) IN ('error', 'fatal').


Cleanup / succinctness

Non-finite predicate duplicatedsrc/Pages/Settings.hs ~line 554 and src/Models/Apis/PrometheusScrapeConfigs.hs ~line 153
(\s -> not (isNaN s.value || isInfinite s.value)) is written identically in both prometheusTestH and ingestScrapedBody. They must stay in sync or the Test button and the actual scraper will diverge on what counts as storable. Either export a predicate from Pkg.Prometheus (e.g. isFiniteSample :: Sample -> Bool) or use base's GHC.Float.isFinite directly: filter (isFinite . (.value)).

prometheusScrapeOpts placed in BackgroundJobs, consumed by Pages.Settingssrc/BackgroundJobs.hs ~line 3050
This function has no dependency on background-job infrastructure (only wreq + http-client-tls). Pages/Settings.hs already imports it qualified as BJ.prometheusScrapeOpts, which is a sign of the wrong home. The natural home is Pkg.Prometheus alongside the parser, or in Models.Apis.PrometheusScrapeConfigs. Moving it breaks the coupling from the settings page into the 3 000-line orchestration module.

sampleToMetricRecord exported but never used outside the modulesrc/Models/Apis/PrometheusScrapeConfigs.hs line 14
grep across the whole codebase shows it is only referenced inside PrometheusScrapeConfigs.hs. Remove it from the export list.

Nested if/else/if/else in prometheusTestHMultiWayIf is already a default extensionsrc/Pages/Settings.hs ~line 540
The same two guard conditions (T.null url, safeScrapeUrl url) already appear in validatePrometheusForm. The handler could reuse that validation or, at minimum, use the enabled MultiWayIf to flatten the nesting. As written, a third validation condition added later must be added in two places with slightly different error strings.

markScraped redundantly writes last_scraped_atsrc/Models/Apis/PrometheusScrapeConfigs.hs ~line 144
claimDueConfigs already advances last_scraped_at = now() as the lease. markScraped then writes last_scraped_at = now() again alongside last_status. Two writes per successful scrape. The markScraped SQL only needs SET last_status = #{status}.


Blocking: the two SSRF bypasses and the parseTs timestamp bug. The fatal badge miss is worth fixing while the surrounding code is touched. The rest are cleanup.

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR ships three independent pieces: exact-timestamp log lookups (removing the ±1 s window), error-log badge propagation in the trace waterfall, and a full Prometheus scrape pipeline (parser, model, settings UI, distributed job queue). The scope is substantial; below are the issues worth acting on.


1. SSRF – safeScrapeUrl does not block IPv4-mapped IPv6 loopback — src/Pages/Settings.hs:506

`elem` ["localhost", "[::1]", "0.0.0.0"]
|| any (`T.isPrefixOf` h) ["127.", "10.", "192.168.", "169.254.", "[fd", "[fe80"]

[::ffff:127.0.0.1] and [::ffff:7f00:1] (IPv4-mapped loopback, valid RFC-2732 URI hosts) pass this check. A saved target — or the Test button — will reach any service bound to 127.0.0.1 on the monoscope pod.

Fix: add "[::ffff:7f" (covers the whole 127.x.x.x mapped range) and "[::ffff:0:7f" (SIIT form) to the prefix list, or resolve the hostname and check the resulting SockAddr.


2. extraLabels (static config) silently overrides scraped metric labels — src/Models/Apis/PrometheusScrapeConfigs.hs:194

attrs = AE.Object $ case cfg.extraLabels of
  AE.Object o -> o <> labelMap   -- left-biased: o wins on collision

KeyMap (<>) / HashMap union is left-biased, so a config-level job=my-service replaces a scraped job=prometheus. Prometheus convention is the opposite: external/static labels have lower priority than scraped ones (they fill in when the target doesn't export the key). Any query filtering on a label that appears in both places silently gets the wrong value.

Fix: flip to labelMap <> o so scraped labels win, or document the current behaviour explicitly if the override semantics are intentional.


3. splitBrace returns Just on an unclosed {src/Pkg/Prometheus.hs:137

go _ acc [] = Just (toText (reverse acc), "")

A truncated scrape body whose last line contains an unclosed { (e.g. the HTTP connection was cut mid-line) is not rejected — splitBrace succeeds with afterLabels = "", parseSample then fails to find a value token and returns Nothing, and every remaining sample in the body is silently dropped. The caller gets a partial result with no indication of truncation.

Fix: change the base case to Nothing (an unclosed brace is malformed). Parsers that consume unterminated constructs silently are hard to debug.


4. \t in Prometheus label values is mis-unescaped — src/Pkg/Prometheus.hs:161-170

The takeQuoted comment says it unescapes \\, \", \n, but the Prometheus text-format spec also permits \t (horizontal tab). The catch-all branch:

Just (c, r) -> go' (c : acc) r   -- \t → 't', not '\t'

produces a literal t instead of a tab character. Any label value containing \t (rare but valid) will not match the same value read back via a query.

Fix: add Just ('t', r) -> go' ('\t' : acc) r before the catch-all, mirroring the \n branch.


5. decodeUtf8 on the response body is outside tryAnysrc/Pages/Settings.hs:554

res <- tryAny (W.getWith opts (toString url))   -- only covers the HTTP GET
...
Right resp ->
  let finite = filter ... (Prom.parsePrometheus (decodeUtf8 (resp ^. Wreq.responseBody)))

decodeUtf8 (strict) throws a pure exception on invalid UTF-8. An endpoint returning Latin-1 or a binary body bypasses the Left e -> Left (...) display path and returns a 500 with no user-visible error from the Test button. The background worker wraps the full pipeline in tryAny, so only the interactive Test button is broken.

Fix: move the decodeUtf8 call inside the tryAny, or use decodeUtf8Lenient.


6. Claim-and-enqueue span two transactions — src/BackgroundJobs.hs:3065–3069

due <- PromCfg.claimDueConfigs prometheusScrapeBatchLimit  -- commits last_scraped_at (Hasql pool)
...
liftIO $ withResource authCtx.jobsPool \conn ->
  forM_ due \cfg -> void $ createJob conn "background_jobs" (PrometheusScrapeOne cfg.id)

claimDueConfigs bumps last_scraped_at and commits immediately (Hasql pool connection). createJob runs on a separate connection from the jobs pool. A pod restart, OOM kill, or pool exhaustion between the two leaves those configs with their lease advanced but no PrometheusScrapeOne job ever enqueued — a silent gap for the full scrape interval (up to scrape_interval_seconds, possibly hours).

The same transaction-spanning pattern already exists in detectLogPatternSpikes (line 3478) for reference — but that one is unless (null issueIds) guarded, so it only creates jobs when there's actually work. Here the claim itself is the problem.


7. dispatchPrometheusScrapes takes an explicit AuthContextsrc/BackgroundJobs.hs:3063

dispatchPrometheusScrapes :: Config.AuthContext -> ATBackgroundCtx ()
dispatchPrometheusScrapes authCtx = ...

Every other ATBackgroundCtx function that needs the context uses ctx <- ask @Config.AuthContext (lines 2568, 2601, 2651, 2708, 2826, 2870, 2964, 3002, 3106, 3152…). The explicit parameter is unnecessary boilerplate and forces the call-site to thread authCtx too. Drop the parameter and ask.


8. splitBrace converts Text → String → Textsrc/Pkg/Prometheus.hs:135

splitBrace = go False "" . toString
  where
    go _ acc [] = ...
    go inQ acc (c : cs) = ...   -- reversed [Char] accumulator

This allocates a String from the input Text, accumulates a reversed [Char], and then calls toText (reverse acc) twice. For a /metrics body with thousands of labeled series this creates needless GC pressure. A T.uncons-based loop with a Text builder (or T.span/T.break where the state allows) avoids both copies entirely.


Minor / worth noting

  • isFiniteSample predicate is copy-pasted. The filter (\s -> not (isNaN s.value || isInfinite s.value)) in prometheusTestH (Settings.hs:554) should call a shared predicate with ingestScrapedBody — if the definition ever drifts the Test button will count differently from what actually ingests.
  • ingestScrapedBody materialises the full lazy ByteString as a strict Text before parsing. For large /metrics responses this means two full copies in memory simultaneously. Passing LText to parsePrometheus and using Data.Text.Lazy.lines would halve peak RSS for this path.

- Drop the decorative icon pill above the heading (it violated our own
  'no decorative icons above headings' brand rule and was the main AI-slop tell);
  replace with a left-aligned heading + one-line subhead explaining scraping.
- Make 'Test' a real ghost button ('Test connection') in a shared action row
  with Cancel + the commit button, instead of bare clickable text.
- Add persistent helper text under URL / auth header / static labels so format
  guidance survives typing (recognition over recall).
- Collapse the optional auth + static-label fields behind an 'Advanced'
  disclosure, auto-expanded when editing a target that already uses them.
- prometheusFields_ now owns its action row (modalId + submitLabel params),
  removing the duplicated submit button from both the add and edit modals.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

Three cleanly separated pieces of work. The claimDueConfigs distributed-queue model is well designed, the Prometheus parser has good doctest coverage, and the error-badge propagation fix is correctly threaded from SQL through the TypeScript worker. Below are findings in severity order.


🔴 Security — SSRF: IPv4-mapped IPv6 not blocked

src/Pages/Settings.hs:508

`safeScrapeUrl` blocks `[::1]` and the `[fd` / `[fe80` prefixes, but IPv4-mapped IPv6 addresses slip through:

```
http://[::ffff:127.0.0.1]/metrics → h = "[::ffff:127.0.0.1]", no match
http://[::ffff:10.0.0.1]/metrics → h = "[::ffff:10.0.0.1]", no match
http://[::ffff:169.254.0.1]/metrics → h = "[::ffff:169.254.0.1]", no match
```

The OS kernel routes all of these to the corresponding IPv4 private address. `fc00::/7` ULA has the same gap: the prefix guard is `"[fd"` but the range covers both `fc` and `fd` (bit 7 of the first octet), so `[fc00::1]` passes too.

Suggested addition to `internalHost`:
```haskell
|| "::ffff:" T.isInfixOf h -- IPv4-mapped IPv6
|| "[fc" T.isPrefixOf h -- fc00::/7 ULA (fc + fd)
```


🟡 Correctness — float timestamps stored without ×1000 conversion

src/Pkg/Prometheus.hs:186 / src/Models/Apis/PrometheusScrapeConfigs.hs:172

The float fallback in `parseTs` truncates to milliseconds as stored, but `sampleToMetricRecord` converts back to wall time with `posixSecondsToUTCTime . (/ 1000) . fromIntegral`. If an exporter emits a float timestamp in seconds (as OpenMetrics specifies, and many Go/Python exporters do):

```
parseTs "1700000000.123"
→ truncate 1700000000.123 = 1700000000 -- treated as ms
→ posixSecondsToUTCTime (1700000000 / 1000)
= posixSecondsToUTCTime 1700000.0
≈ Jan 20 1970 -- off by ×1000
```

The fix in the float branch is `truncate (d * 1000)` so seconds-as-float become milliseconds. Classic Prometheus v0.0.4 says timestamps are integer ms, so integer-first parse stays unchanged; this only affects the non-standard but common float path.


🟡 Correctness — `parseLabels` drops all subsequent labels on first malformed pair

src/Pkg/Prometheus.hs:158 (_ -> [])

When a label pair has no quoted value (e.g. an unquoted integer `code=200`) or a missing `=`, the `_ -> []` arm returns immediately, discarding every following label on that line. For a line like:

```
requests_total{method="get",code=200,path="/api"} 5
```

`code=200` hits `_ -> []`, and `path="/api"` is silently lost. Spec-compliant exporters never emit this, but real-world exporters sometimes do. A more resilient approach: skip just the bad token up to the next comma and `go rest'`.


🟠 Simplification — `prometheusTestH` re-implements `validatePrometheusForm`

src/Pages/Settings.hs:534–556

The test handler manually re-checks `T.null url`, `safeScrapeUrl url`, and the auth-header strip (`mfilter (not . T.null) (T.strip <$> form.authHeader)`) — all of which already live in `validatePrometheusForm`. If validation tightens (e.g. to reject non-`/metrics`-path URLs), the Test button diverges silently. A minimal fix is to run `validatePrometheusForm form` in `prometheusTestH` and extract the normalised values from the `Right` branch instead of re-deriving them.


🟠 Code quality — `splitBrace` / `takeQuoted` round-trip through `String`

src/Pkg/Prometheus.hs:133–171

`splitBrace` calls `toString` to convert the whole label block to `[Char]`, accumulates in a reversed `[Char]` list, then calls `toText . reverse`. `takeQuoted` uses `T.uncons` on `Text` input but also accumulates into `[Char]`. `attoparsec` is already a project dependency and handles exactly this pattern — an escape-aware quoted-string parser is ~10 lines with `Data.Attoparsec.Text` and avoids both the `toString` allocation and the `reverse` on exit. At minimum `takeQuoted` can keep its `T.uncons` discipline but use `Text.Builder` instead of `[Char]`.


🟢 Cleanup — screenshot PNGs committed to repo root

prometheus-add-modal.png, prometheus-after-empty.png, prometheus-before.png, prometheus-row.png, prometheus-sprite-check.png, prometheus-test-error.png

These are referenced in the PR description for illustration but will stay in git history permanently. They are not referenced from any code path. Either drop them from the commit or move them to `docs/` if they're meant to be kept.

The redesigned modal had flat spacing — equal gaps between every field made
helper text crowd the next label. Establish tight-within / generous-between
rhythm:
- gap-4 between fields, gap-1 within (label/input/helper) for clear grouping
- wrap heading + form in a gap-5 container so the modal's space-y-2 no longer
  governs heading->form spacing (deterministic 1.25rem)
- footer action row gets a top border + pt-4 to demarcate the commit zone
- drop the Advanced toggle's border-t so it doesn't compete with the footer
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review — PR #451

This PR lands three well-scoped changes: an exact-timestamp lookup optimisation, error-log badge propagation through the trace waterfall, and full Prometheus endpoint scraping with a distributed-queue dispatcher. The distributed-claim design (FOR UPDATE SKIP LOCKED) is solid, the Vitest coverage is good, and the defaultSelectSqlQuery change for error-log severity is clean.

Below are the findings, most-severe first.


1. Pkg/Prometheus.hs — hand-rolled parser when attoparsec / megaparsec are already in build-depends

splitBrace and takeQuoted both follow the same pattern: go False "" . toString — convert Text → String, accumulate a reversed [Char], then reverse acc on exit. This is ~40 lines of boilerplate for what attoparsec/megaparsec give you in 2–3 combinators (char '{', manyTill anyChar (char '}'), etc.). The whole Pkg.Prometheus module is 186 lines of which ~50 are this String-accumulation scaffolding. Using attoparsec-text (already a transitive dep via attoparsec-aeson) would halve the module and keep parsing in Text slices without the toString/reverse round-trips.

Concrete change: replace splitBrace+takeQuoted with an Attoparsec parser; parseSample can then be a straightforward combinator sequence.


2. Pkg/Prometheus.hs:186parseTs converts to String twice

parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))

toString t is evaluated twice. A simple let s = toString t in readMaybe s <|> ... fixes it.


3. Models/Apis/PrometheusScrapeConfigs.hs:153–154 — three-pass pipeline

let (finiteS, dropped) = partition (\s -> not (isNaN s.value || isInfinite s.value)) (Prom.parsePrometheus )
    records = V.fromList (map (sampleToMetricRecord cfg now) finiteS)

partitionmapV.fromList is three full traversals. V.mapMaybe from Data.Vector collapses them into one:

let allS   = V.fromList (Prom.parsePrometheus (decodeUtf8 body))
    records = V.mapMaybe (\s -> if isFinite s.value then Just (sampleToMetricRecord cfg now s) else Nothing) allS
    dropped = V.length allS - V.length records

(Bonus: isFinite = not . isNaN &&& not . isInfinite is already in Relude as GHC.Float.isIEEE — or just not (isNaN v || isInfinite v) is fine.)


4. 0101_prometheus_scrape_configs.sql — partial index missing last_scraped_at

The migration creates:

CREATE INDEX … idx_prometheus_scrape_configs_enabled ON … (enabled) WHERE enabled;

But claimDueConfigs also filters by last_scraped_at and sorts ORDER BY last_scraped_at ASC NULLS FIRST. Postgres cannot satisfy this sort from the above index, so it must fetch all enabled rows and sort externally. A composite partial index fixes both the filter and the sort in one pass:

CREATE INDEX … ON apis.prometheus_scrape_configs (last_scraped_at ASC NULLS FIRST) WHERE enabled;

5. Pages/Settings.hs — DNS rebinding bypasses safeScrapeUrl (documented gap, ships as-is)

The code comment already flags this:

"a public hostname that resolves to an internal IP (DNS rebinding) is not caught here"

The safeScrapeUrl check is purely lexical. An attacker registers evil.attacker.com, passes validation, then rotates the DNS record to 169.254.169.254 (cloud metadata). The scraper then fetches metadata credentials on every 60-second tick. This is a known limitation but it's the primary SSRF vector for multi-tenant scrapers. Resolve-and-pin at save time (not scrape time) plus a short TTL-ignore policy is the standard mitigation.


6. Pages/Settings.hs / BackgroundJobs.hsauth_header plaintext storage + last_status exception leakage

auth_header TEXT is stored unencrypted in apis.prometheus_scrape_configs. Any DB read-replica breach or SELECT-level role exposes every project's Bearer tokens at once.

Separately, in scrapePrometheusTarget:

void $ PromCfg.markScraped cfg.id ("error: " <> toText (displayException err))

displayException on http-client / wreq exceptions can include the full request URL, port, and TLS negotiation details. This lands verbatim in last_status and is rendered in the settings UI, confirming internal network topology to any project member who opens the page. Consider truncating at 200 chars and redacting the host portion for auth failures.


7. Pages/Settings.hs:619–638prometheusTestH duplicates URL validation from validatePrometheusForm

Both functions independently check T.null url and safeScrapeUrl url. Extracting a shared validateUrl :: Text -> Either Text Text (used by both) would mean a future SSRF-rule change only needs to happen once.


Minor

  • ingestScrapedBody drops +Inf histogram sentinel buckets — every Prometheus histogram has a _bucket{le="+Inf"} series. These are dropped (non-finite), so last_status will always under-report sample count for histogram-heavy exporters (e.g., "ok: 9 samples" when 10 were scraped). Consider logging the dropped +Inf count separately from NaN so users understand why the count doesn't match curl /metrics | wc -l.
  • 6 PNG screenshots committed to repo root — these look like review aids and should probably live in docs/ or not be committed.

items-start + mt-0.5 (intended for wrapped multi-line errors) left the icon
sitting too high on the common single-line case. Use items-center to match
the success badge.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR lands three distinct pieces of work: an exact-timestamp lookup optimisation, error-log badge propagation to trace-view ancestors, and a full Prometheus scrape-target feature (parser, model, CRUD, distributed queue dispatch, settings UI, integration tests). The first two are tight and correct. The Prometheus piece is solid overall — good use of FOR UPDATE SKIP LOCKED, the 10s scrape timeout, and the sampleToMetricRecord mapping — but has a few issues worth addressing.


🔴 SSRF via IPv4-mapped IPv6 — src/Pages/Settings.hs ~line 590

safeScrapeUrl blocks [::1] and the [fd/[fe80 ULA prefixes, but not [::ffff:]. IPv4-mapped IPv6 addresses resolve directly to IPv4 without DNS:

http://[::ffff:127.0.0.1]/metrics  →  127.0.0.1 (loopback)
http://[::ffff:10.0.0.1]/metrics   →  10.0.0.1  (RFC-1918)
http://[::ffff:192.168.1.1]/metrics →  192.168.1.1
http://[::ffff:172.16.0.1]/metrics  →  172.16.0.1

The PR's comment already calls out DNS-rebinding as a known gap, but this is a literal-host bypass that the current code claims to prevent. Fix: add "[::ffff:" to the isPrefixOf blocklist:

|| any (`T.isPrefixOf` h) ["127.", "10.", "192.168.", "169.254.", "[fd", "[fe80", "[::ffff:"]

🟡 Missing sort index for claimDueConfigsstatic/migrations/0101_prometheus_scrape_configs.sql

The dispatcher query filters by enabled and orders by last_scraped_at ASC NULLS FIRST LIMIT n. The migration adds a partial index on (enabled) WHERE enabled, which covers the filter but not the sort — Postgres has to scan all enabled rows and sort before applying LIMIT. A composite index eliminates the sort step entirely:

CREATE INDEX IF NOT EXISTS idx_prometheus_scrape_configs_claim
  ON apis.prometheus_scrape_configs (last_scraped_at ASC NULLS FIRST)
  WHERE enabled;

Fine at current scale; worth landing before this table grows.


🟡 Duplicated finite-sample predicate — src/Pkg/Prometheus.hs / src/Models/Apis/PrometheusScrapeConfigs.hs / src/Pages/Settings.hs

\s -> not (isNaN s.value || isInfinite s.value) appears verbatim in both ingestScrapedBody and prometheusTestH. Since Sample lives in Pkg.Prometheus, the natural home is there:

-- in Pkg.Prometheus
isFiniteSample :: Sample -> Bool
isFiniteSample s = not (isNaN s.value || isInfinite s.value)

Then ingestScrapedBody uses partition isFiniteSample and prometheusTestH uses filter isFiniteSample.


🟡 parseLabelsText — empty-value labels silently stored — src/Pages/Settings.hs ~line 663

The guard not (T.null v) where v = T.breakOn "=" pair checks the raw "=..." string, not the stripped value. Input "env= " gives v = "= " (non-null → passes guard), but stores label env = "". Fix: guard on the trimmed value instead:

[ (AEK.fromText k', AE.String v')
| pair <- T.splitOn "," t
, let (k, raw) = T.breakOn "=" pair
, let k' = T.strip k
, let v' = T.strip (T.drop 1 raw)
, not (T.null k')
, not (T.null v')           -- ← guards the actual value, not the raw "=..." fragment
]

🟢 Style: prometheusTestH nested if-elsesrc/Pages/Settings.hs ~line 619

The handler has two levels of nested if-else for URL validation. MultiWayIf is already enabled project-wide; this reads more clearly as:

prometheusTestH pid form = do
  _ <- Projects.sessionAndProject pid
  let url = T.strip form.url
  addRespHeaders =<< if
    | T.null url            -> pure $ prometheusTestResult (Left "Enter a URL first")
    | not (safeScrapeUrl url) -> pure $ prometheusTestResult (Left "URL must be a public http:// or https:// endpoint")
    | otherwise -> do
        ...
        pure $ prometheusTestResult ...

🟢 Minor: V.fromList round-trip — src/Pages/Settings.hs lines 548, 658

configsByProjectId returns [PrometheusScrapeConfig]; both call sites immediately wrap with V.fromList. prometheusTargetsList only uses V.null and V.forM_, both of which have direct list equivalents. Keeping the wrapper types and the rendering function on [] avoids the conversion.

CI (node 22) runs npm 10, which read the npm-11-generated lockfile as out of
sync — the top-level @emnapi/core and @emnapi/runtime entries the wasm Tailwind
fallback pulls in were missing, failing 'npm ci'. Regenerated with npm 10 so the
lock matches what CI resolves.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

Three clean, well-scoped pieces of work: exact-timestamp lookup eliminates the ±1s window scan (good — PG and TF store microseconds losslessly), error-log badge propagation is correctly split between server-side SQL and the TS worker, and the Prometheus scraper has solid distributed-queue design (SKIP LOCKED, hard HTTP timeout, per-scrape fan-out). The test coverage is thorough. A few issues below, ranked by severity.


Bug — label merge order in sampleToMetricRecord

src/Models/Apis/PrometheusScrapeConfigs.hs ~line 194

attrs = AE.Object $ case cfg.extraLabels of
  AE.Object o -> o <> labelMap   -- ← wrong precedence
  _ -> labelMap

KeyMap's (<>) is left-biased (Data.Map.Strict.union), so o <> labelMap means extraLabels from the config silently overwrites same-key labels from the scraped sample. A target with extraLabels = {"env": "prod"} would clobber any env label on the actual metric. The per-sample labels (specific) should win over the static config labels (general context):

AE.Object o -> labelMap <> o

Cleanup — duplicate finite-sample predicate

src/Models/Apis/PrometheusScrapeConfigs.hs:153 and src/Pages/Settings.hs:555

\s -> not (isNaN s.value || isInfinite s.value) is copy-pasted verbatim. The Settings.hs comment even says "matches what ingestScrapedBody actually stores", which is an explicit coupling that will silently break if the filter definition ever changes. Since Sample lives in Pkg.Prometheus, exporting a helper there removes the sync hazard:

-- Pkg.Prometheus
isFiniteSample :: Sample -> Bool
isFiniteSample s = not (isNaN s.value || isInfinite s.value)

Both sites can then use filter isFiniteSample / partition isFiniteSample.


Cleanup — redundant import Data.List (partition)

src/Models/Apis/PrometheusScrapeConfigs.hs:22

Relude 1.2.2.2 already re-exports partition. The explicit import is noise (and triggers -Wredundant-imports). Drop it.


Style — prometheusScrapeOpts inside-out composition

src/BackgroundJobs.hs ~line 3051

prometheusScrapeOpts authHeader =
  maybe id (\h -> header "Authorization" .~ [encodeUtf8 h]) authHeader
    $ defaults & Wreq.manager .~ Left tlsManagerSettings{...}

The maybe id f $ base & ... pattern reads inside-out. A &-chain is the conventional Wreq style and reads left-to-right:

prometheusScrapeOpts authHeader =
  defaults
    & Wreq.manager .~ Left tlsManagerSettings{managerResponseTimeout = responseTimeoutMicro prometheusScrapeTimeoutMicros}
    & maybe id (\h -> header "Authorization" .~ [encodeUtf8 h]) authHeader

Minor — parseTs calls toString twice

src/Pkg/Prometheus.hs ~line 186

parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))

toString t allocates a [Char] each call; the second allocation is wasted when the first parse succeeds. One let s = toString t covers both.


Minor — 172.x range check allocates 16 Text values per call

src/Pages/Settings.hs ~line 509

any (\n -> ("172." <> show n <> ".") `T.isPrefixOf` h) ([16 .. 31] :: [Int])

This builds 16 Text values on every safeScrapeUrl invocation (every Test press and every form save). Hoisting to a top-level CAF:

rfc1918_172 :: [Text]
rfc1918_172 = ["172." <> show n <> "." | n <- [16..31 :: Int]]

...keeps the list allocation to once per process lifetime.


Note — histogram/summary _count/_sum series ingested as gauge

src/Models/Apis/PrometheusScrapeConfigs.hs ~line 187

isCounter = s.sampleType == Prom.Counter

_count and _sum sub-series of histograms and summaries are cumulative monotonic counters by definition, but they'll be stored with metricType = MTGauge and aggregationTemporality = Nothing. The PR description acknowledges this ("ingests each as its own series, which is enough to make them queryable"), but it means any downstream rate-of-change query that checks aggregationTemporality will treat these as instantaneous gauges. Worth a comment in sampleToMetricRecord so a future maintainer doesn't try to "fix" this without understanding the tradeoff.

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review — PR #451

This PR lands three independent pieces: an exact-timestamp lookup optimization, error-log badge propagation, and a new Prometheus scraping subsystem. The distributed-queue design for Prometheus (SKIP LOCKED claim + per-target jobs) is solid, and the parser's doctest coverage is good. A few issues to address:


🔴 Security — SSRF blocklist bypasses (src/Pages/Settings.hs ~line 496)

internalHost misses several encoding variants that OS network stacks resolve to loopback/internal addresses:

Input What uriRegName returns Resolves to
http://localhost./metrics localhost. 127.0.0.1 (trailing dot)
http://[::ffff:127.0.0.1]/metrics [::ffff:127.0.0.1] 127.0.0.1 (IPv4-mapped)
http://0177.0.0.1/metrics 0177.0.0.1 127.0.0.1 (octal)
http://2130706433/metrics 2130706433 127.0.0.1 (decimal)

The prefix check "[fd" and "[fe80" cover ULA/link-local but not [::ffff:. The elem list uses literal "localhost" which won't match "localhost.". Suggest normalising by resolving the hostname (and pinning the result to reject later DNS changes) — the comment already flags DNS rebinding as a known gap. At minimum add "[::ffff:" to the prefix list and strip trailing dots before the elem check.


🟠 Security — Authorization header forwarded on redirect (src/BackgroundJobs.hs ~line 3052)

prometheusScrapeOpts sets Authorization globally on Wreq options. Haskell's http-client follows 301/302 redirects and, unlike browsers, does not strip Authorization when the redirect crosses hosts. A misconfigured target returning a 302 → http://attacker.net/ leaks the bearer token. Same issue in prometheusTestH. Fix: set managerModifyRequest or checkResponse to strip Authorization on cross-origin redirects, or simply disable redirects for scrape requests.


🟡 Bug — parseTs float fallback misinterprets float-seconds as float-milliseconds (src/Pkg/Prometheus.hs line 186)

parseTs t = readMaybe (toString t) <|> (truncate <$> (readMaybe (toString t) :: Maybe Double))

The float branch fires when readMaybe :: Maybe Int64 fails (e.g. "1395066363.5"). truncate 1395066363.5 = 1395066363 is then treated as milliseconds in sampleToMetricRecord (/ 1000), yielding ~January 1970 for any exporter that emits Unix seconds as a float. The Prometheus spec mandates integer milliseconds, but some exporters deviate. Either drop the float branch entirely (only spec-compliant integer ms are accepted) or document that float inputs are treated as float-ms (which would be correct only if the exporter means ms).


🧹 Cleanup — splitBrace roundtrips through String (src/Pkg/Prometheus.hs ~line 134)

splitBrace = go False "" . toString   -- Text → String
  where
    go _ acc [] = Just (toText (reverse acc), "")
    go inQ acc (c : cs) ...

The entire Text is converted to [Char], accumulated in reverse, then reversed and packed. Data.Text already exposes T.span, T.break, and T.uncons that would avoid the roundtrip. Similarly, takeQuoted inside parseLabels uses the same prepend-then-reverse pattern with three toText (reverse acc) call sites. A T.span-based scan would halve allocations and remove the nesting.


🧹 Cleanup — prometheusTestH re-implements URL validation (src/Pages/Settings.hs ~line 536)

The test handler manually repeats the first two guards from validatePrometheusForm (null URL, safeScrapeUrl) with the same error strings. Extract:

validateScrapeUrl :: Text -> Either Text Text  -- Left = error, Right = stripped URL

Both validatePrometheusForm and prometheusTestH call it, eliminating the duplication and the nested if/else if/else in the test handler.


🧹 Cleanup — parseLabelsText magic T.drop 1 (src/Pages/Settings.hs ~line 584)

let (k, v) = T.breakOn "=" pair
...
AE.String (T.strip (T.drop 1 v))   -- drop the leading "="

T.breakOn "=" includes the delimiter in the suffix, so T.drop 1 silently skips =. Replace with T.stripPrefix "=" which makes the intent explicit and naturally collapses the separate not (T.null v) guard into a Just match:

[ (AEK.fromText k', AE.String v')
| pair <- T.splitOn "," t
, let (k, rest) = T.breakOn "=" pair
, let k' = T.strip k
, not (T.null k')
, Just v' <- [T.strip <$> T.stripPrefix "=" rest]
]

🧹 Cleanup — ingestScrapedBody three intermediate lists (src/Models/Apis/PrometheusScrapeConfigs.hs ~line 357)

let (finiteS, dropped) = partition (...) (Prom.parsePrometheus (decodeUtf8 body))
    records = V.fromList (map (sampleToMetricRecord cfg now) finiteS)

partition builds two lists; map a third; V.fromList converts. dropped is only needed for null dropped and length dropped. One pass with V.mapMaybe eliminates two intermediate lists:

let samples = Prom.parsePrometheus (decodeUtf8 body)
    records = V.mapMaybe (\s -> if isNaN s.value || isInfinite s.value then Nothing else Just (sampleToMetricRecord cfg now s)) (V.fromList samples)
    droppedCount = length samples - V.length records

Minor — safeScrapeUrl 172.x check (src/Pages/Settings.hs ~line 508)

any (\n -> ("172." <> show n <> ".") `T.isPrefixOf` h) ([16..31] :: [Int])

This has a false positive: "172.16." is a prefix of "172.160.", "172.161.""172.169." (public space). No security bypass in the other direction, but a user with a legitimate 172.160.x.x endpoint is incorrectly rejected. A structured check (parse the second octet and test 16 ≤ n ≤ 31) is unambiguous.

…, redirect token leak

Correctness/security:
- sampleToMetricRecord: scraped per-sample labels now win over the config's static
  extraLabels on key collision (KeyMap <> is left-biased; was clobbering e.g. env)
- safeScrapeUrl: structured IPv4 octet parsing replaces prefix matching — fixes the
  172.16-31 false-positive (172.160.x was wrongly rejected) and now blocks IPv4-mapped
  IPv6 ([::ffff:127.0.0.1]), trailing-dot FQDNs (localhost.), and obfuscated numeric
  hosts (decimal 2130706433, octal 0177.0.0.1, short 127.1)
- prometheusScrapeOpts: disable redirect following — http-client re-sends Authorization
  across a cross-host 3xx, leaking the bearer token to a malicious redirect target
- parseTs: drop the float fallback that misread float-seconds as float-ms (dated points
  to ~1970); spec mandates integer ms, non-integers fall back to scrape time

Cleanup:
- isFiniteSample lives in Pkg.Prometheus; ingestScrapedBody + Test handler share it
  (was copy-pasted), and ingest is now a single V.mapMaybe pass (no partition/map/fromList trio)
- validateScrapeUrl extracted; validatePrometheusForm and the Test handler share it
- parseLabelsText uses explicit T.stripPrefix "="; drop redundant Data.List import;
  prometheusScrapeOpts reads as a left-to-right & chain
- 0102 migration: composite (last_scraped_at) WHERE enabled index for the claim query,
  dropping the now-redundant partial enabled index

Doctests pin the new invariants (safeScrapeUrl bypasses, parseLabelsText, isFiniteSample).
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review — PR #451

Solid three-in-one PR. The exact-timestamp lookup is a clean, well-justified simplification. The error-badge propagation is well-tested (the Vitest suite covers the direct + synthetic-parent cases). The Prometheus scraper has good bones: distributed dispatch, lease semantics, SSRF floor, hard HTTP timeout. A few findings below, roughly severity-ordered.


1. SSRF bypass via hex numeric host — Pages/Settings.hs (safeScrapeUrl)

obfuscatedNumericHost rejects bare-decimal and octal-octet forms but not hex. http://0x7f000001/metrics passes safeScrapeUrlisDigit excludes 'x', so T.all (\c -> isDigit c || c == '.') returns False, the host is not classified as numeric, and internalIPv4 also misses it. Most OS resolvers (curl, glibc) expand 0x7f000001127.0.0.1.

-- current: only catches decimal/octal
obfuscatedNumericHost h =
  not (T.null h)
    && T.all (\c -> isDigit c || c == '.') h
    ...

-- fix: extend the all-numeric check to allow [0-9a-fA-Fx.]
-- or reject any host that (when dots are stripped) is hex-parseable to a private range

The docstring lists "octal/decimal/short forms" explicitly; hex is a known gap worth either plugging or documenting alongside the DNS-rebinding caveat.


2. splitBrace returns Just for unclosed {Pkg/Prometheus.hs:1050

go _ acc [] = Just (toText (reverse acc), "")   -- unclosed brace → success?

When the input ends without a closing }, the function returns Just (inside, ""). In practice the sample is still dropped (the empty suffix gives words "" = [], which hits the [] -> Nothing arm of parseSample), but:

  • the function's Maybe return type conveys a success/failure signal — returning Just on malformed input breaks that contract;
  • if the exposition format ever has content after the brace (it doesn't in the spec, but...) this could silently parse garbage.
go _ _ [] = Nothing   -- unclosed brace → failure

3. ingestScrapedBody — unnecessary [Sample] → Vector → [Sample] round-trip — Models/Apis/PrometheusScrapeConfigs.hs:362

-- current: constructs an intermediate Vector just to use V.mapMaybe
records = V.mapMaybe (\s -> if Prom.isFiniteSample s then Just (sampleToMetricRecord cfg now s) else Nothing) (V.fromList samples)

-- simpler + one fewer allocation:
records = V.fromList [sampleToMetricRecord cfg now s | s <- samples, Prom.isFiniteSample s]
-- or: mapMaybe (\s -> sampleToMetricRecord cfg now s <$ guard (isFiniteSample s)) samples |> V.fromList

parsePrometheus already returns a list; convert to Vector once at the end.


4. parseLabels silently drops all labels after the first malformed one — Pkg/Prometheus.hs:1072

_ -> []   -- stops the whole parse on any label whose value isn't a quoted string

The Prometheus spec mandates quoted values, so this only fires on non-conforming exposition, but when it does it drops every subsequent label in that metric's {...} block silently. A go rest' in the _ branch (skipping the bad pair and continuing) would be more forgiving and match how the rest of the parser handles bad input (best-effort, skip line).


5. recordHelp collapses internal whitespace — Pkg/Prometheus.hs:998

recordHelp meta body = case words body of
  ws@(_ : fam : _) -> Map.insertWith mergeHelp fam (Meta Untyped (unwords (drop 2 ws))) meta

words splits on any whitespace; unwords rejoins with single spaces. # HELP latency_ms Latency in ms becomes "Latency in ms". Help text is display-only metadata so this is low impact, but the fix is trivial: use a single T.drop past the two-word prefix instead of words/unwords.


6. Minor: scrape-interval floor not enforced at the DB layer — 0101_prometheus_scrape_configs.sql

validatePrometheusForm applies max 60, but there is no CHECK (scrape_interval_seconds >= 60) on the column. A direct SQL insert or a future code path that bypasses validation can store a sub-minute interval that the once-per-minute dispatcher can never honour. A DB constraint is cheap and removes the silent invariant.


7. Minor: duplicate -- | doc block on safeScrapeUrlPages/Settings.hs:598–607

Two consecutive Haddock comment blocks describe the same function (-- | SSRF floor… then -- | Reject non-http(s)…). Only the second one is picked up by Haddock; the first is effectively a dead comment.


Non-issues / confirmed correct

  • Lease semantics in claimDueConfigs + markScraped: intentional double-write; the lease prevents re-fire within the interval regardless of worker completion time. ✓
  • prometheusScrapeOpts operator chain: compiles correctly despite the unusual line splits; the lens (&)/(.~) precedences hold. Readability could be improved by grouping each & setter .~ val on one line.
  • errors IS NOT NULL OR (kind = 'log' AND ...) COALESCE in defaultSelectSqlQuery: SQL NULL semantics are correct — all-NULL log fields fold to COALESCE(NULL, false) = false. ✓
  • setEnabled + toggleEnabled both present: setEnabled is used by the integration test's setup; toggleEnabled is the UI path. Both have value. ✓

The errors column projection changed from 'errors is not null' to the COALESCE
expression that also flags error-level logs (kind='log' with error level/severity/
status). Update the 11 ParserSpec golden strings to match.
@blacksmith-sh

This comment has been minimized.

@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR bundles three independent pieces: exact-timestamp log lookup (removing the ±1 s window), error-log badge propagation to the trace waterfall, and a full Prometheus scraping pipeline (parser, model, settings UI, distributed-queue background jobs). The architecture is solid — the SKIP LOCKED dispatcher, SSRF check, and test-before-save flow are all well-considered. A handful of issues worth addressing before merge:


1. Auth header pre-populated in password field HTML — bearer token in page source

src/Pages/Settings.hs:700

field_ "authHeader" "Authorization header" "Bearer <token>" "password" False ... (maybe "" (fromMaybe "" . (.authHeader)) mcfg)

field_ emits value_ val on a type="password" input. The browser masks it visually but the raw bearer token travels in the HTML response body on every edit-form load. Any XSS on the settings page, a caching proxy that stores HTML, or a user with DevTools can harvest it. Conventional practice is to omit value= on password fields and send a boolean "token is set" indicator instead, requiring re-entry on edit.


2. Silent scrape loss when createJob fails mid-batch

src/BackgroundJobs.hs:3067

dispatchPrometheusScrapes authCtx = do
  due <- PromCfg.claimDueConfigs prometheusScrapeBatchLimit   -- bumps last_scraped_at for all N
  unless (null due) do
    liftIO $ withResource authCtx.jobsPool \conn ->
      forM_ due \cfg -> void $ createJob conn "background_jobs" (PrometheusScrapeOne cfg.id)

claimDueConfigs commits last_scraped_at = now() for all claimed targets in one atomic UPDATE. The subsequent forM_ enqueues jobs on a separate pool/connection with no tryAny guard. If createJob throws for target k (pool exhaustion, transient DB error), the forM_ aborts and targets k+1 … N are never enqueued — but their leases are already set, so they silently wait a full interval with last_status still showing the previous "ok". Wrapping each enqueue in tryAny and logging on failure is the minimal fix.


3. Metric timestamps backdated by scrape latency

src/BackgroundJobs.hs:3081

scrapePrometheusTarget cid = whenJustM (PromCfg.getConfig cid) \cfg -> when cfg.enabled do
  now <- Time.currentTime           -- captured before the HTTP request
  tryAny (W.getWith opts url >>= \resp -> PromCfg.ingestScrapedBody cfg now (resp ^. responseBody))

now is used as the fallback metricTime for samples that lack an explicit exposition timestamp (the common case). For a slow endpoint (say 8 s of the 10 s budget) every sample in that scrape is backdated 8 s, creating an apparent gap at the end of each polling window. Capture now immediately after getWith returns instead.


4. fc00::/8 half of IPv6 ULA range not blocked

src/Pages/Settings.hs:627

any (`T.isPrefixOf` ip6) ["::1", "fd", "fe80", "::ffff:"]

RFC 4193 defines the full ULA block as fc00::/7 — both fd00::/8 (blocked) and fc00::/8 (not blocked). http://[fc00::1]/metrics passes today. Add "fc" alongside "fd".


5. ingestScrapedBody — unnecessary list→Vector round-trip

src/Models/Apis/PrometheusScrapeConfigs.hs:362

records = V.mapMaybe (\s -> if Prom.isFiniteSample s then Just (sampleToMetricRecord cfg now s) else Nothing) (V.fromList samples)

V.fromList forces the full [Sample] list into an intermediate Vector before V.mapMaybe produces a second. A list comprehension is shorter and avoids both intermediate structures:

records = V.fromList [sampleToMetricRecord cfg now s | s <- samples, Prom.isFiniteSample s]

6. splitBrace round-trips through String

src/Pkg/Prometheus.hs:1047

splitBrace = go False "" . toString
  where
    go _ acc [] = Just (toText (reverse acc), "")
    go inQ acc (c : cs) ...

toString converts Text → [Char], builds a reversed String accumulator, then toText converts back — O(n) Char allocation per label block. Writing this with T.span/T.break/T.stripPrefix in forward Text order avoids the round-trip and the reverse.


7. Duplicate Haddock comment on parseLabelsText

src/Pages/Settings.hs:712–713

-- | "k=v, k2=v2" → JSON object of static labels (blank/invalid pairs skipped).
-- | Parse @k=v, k2=v2@ static-label text into a JSON object, ...

Two consecutive -- | lines before the same function: Haddock attaches only the second (the nearer one); the first is a floating dead comment. Remove the first line.


Overall this is a well-structured feature addition with good test coverage. The scrape-loss reliability risk (#2) and auth-header exposure (#1) are the ones most worth addressing before merge; the rest are smaller cleanups.

tonyalaribe and others added 2 commits June 28, 2026 23:42
…y, parser cleanup

Security/correctness:
- Edit form no longer echoes the saved auth token into the password field's HTML
  (it sat in page source / proxy caches); blank on edit now means 'keep saved token',
  and prometheusUpdateH preserves the existing token when the field is left empty
- safeScrapeUrl: block the full ULA range fc00::/7 (add 'fc'; only 'fd' was covered)
- dispatchPrometheusScrapes: guard each createJob with tryAny — targets are leased at
  claim time, so a mid-batch enqueue failure must not silently strand the rest; log each
- scrapePrometheusTarget: capture the fallback metricTime after the HTTP response, not
  before, so a slow endpoint doesn't backdate every timestamp-less sample by the latency

Cleanup:
- ingestScrapedBody: single list comprehension (drop the list->Vector->mapMaybe trio)
- splitBrace: forward Text scan (T.break) instead of round-tripping through String;
  doctest pins the }-inside-quoted-value case
- parseLabelsText: remove the duplicate floating Haddock line
- remove accidentally-committed prometheus-*.png screenshots; gitignore root *.png

Doctest count 699 -> 700.
@claude

claude Bot commented Jun 28, 2026

Copy link
Copy Markdown

Code Review

This PR delivers three well-scoped pieces of work: exact-timestamp lookups (removing the wasteful ±1s window), error-log badge propagation to parent spans (server-side SQL + frontend flag), and a full Prometheus scraping subsystem (parser, model, background jobs, UI, migration). The distributed-claim design with FOR UPDATE SKIP LOCKED is solid. A few things worth fixing before merge:


1. Duplicate Haddock block on safeScrapeUrl — stray | in rendered docs

src/Pages/Settings.hs:502–507

There are two consecutive -- | starters in the comment block. The first (lines 502–506) is a leftover draft; Haddock treats the second -- | as a new paragraph so the rendered output shows a literal | Reject non-http(s)… at the top. Delete the first four comment lines.


2. IPv6 link-local check is too narrow — fe81:: through febf:: slip through

src/Pages/Settings.hs:530

The link-local range is fe80::/10, covering fe80:: through febf::. The "fe80" prefix only blocks the fe80::/16 slice; http://[fe81::1]/metrics through http://[febf::1]/metrics all pass. Real-world NICs virtually always use fe80:: so this is theoretical, but the comment claims link-local is blocked so the check should match the claim. Fix: extend the prefix list to include "fe8", "fe9", "fea", "feb" instead of just "fe80".


3. takeQuoted accumulates in String — contradicts the module design note

src/Pkg/Prometheus.hs:187–196

splitBrace in the same file (same job: scan a quoted string with escapes) has the explicit comment "Stays in Text (T.break chunks) rather than round-tripping through String". takeQuoted does the opposite: it builds a reversed [Char] list and calls toText (reverse acc) at exit. For short label values the cost is negligible, but the inconsistency is surprising. Replacing (c : acc) / toText (reverse acc) with T.snoc acc c / acc would stay in Text throughout.


4. No UI path to clear authHeader once set

src/Pages/Settings.hs:554–561

The edit handler intentionally treats a blank auth field as "keep the saved token" — correct, avoids echoing the secret into HTML. But there is no mechanism to remove a previously-configured header; the only recourse is delete-and-recreate or a direct DB write. A "Clear saved token" checkbox mapping to an explicit Nothing would complete the CRUD.


5. scrapePrometheusTarget — single ~200-char expression

src/BackgroundJobs.hs — scrapePrometheusTarget body

Three monadic steps (HTTP fetch → capture time → ingest) plus the case branch are chained into one long line. A short do-block would surface the "capture metricTime after the response arrives" invariant in the code structure rather than in a comment, and make the error branches readable.


Everything else looks good: the parser doctests are thorough, the SKIP LOCKED claim design is correct, the markScraped/claimDueConfigs lease semantics are sound, and the SSRF IPv4 coverage (obfuscated numeric hosts, octal octets, short-form addresses) is solid.

Generated with Claude Code

…readability

- safeScrapeUrl: remove the leftover duplicate Haddock block; widen IPv6 link-local
  to the full fe80::/10 (fe8/fe9/fea/feb, not just fe80::/16); doctest the link-local/ULA cases
- Settings: add a 'Clear saved token' checkbox so a configured auth header can be removed
  via the UI (blank still = keep; ticked = explicit Nothing), completing the CRUD
- Prometheus.takeQuoted: accumulate in Text (T.snoc) instead of a reversed String, matching
  splitBrace's 'stay in Text' design note
- scrapePrometheusTarget: expand the chained one-liner into a do-block so the
  'capture metricTime after the response' ordering is visible in the structure

Doctest count 700 -> 701.
@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review

This PR does three independent, well-scoped things: narrows the log-item lookup from a ±1 s window to exact equality (safe because the caller always holds the stored timestamp), plumbs error-log severity into the trace waterfall's error badge (with matching SQL + TS refactor + good test coverage), and adds a full Prometheus scrape pipeline — exposition parser, CRUD model, distributed-queue dispatcher, settings UI, and migration. The architecture is solid: atomic SKIP LOCKED claim, per-scrape HTTP timeout, SSRF guard on the URL. A few issues worth addressing before merge:


1. SSRF bypass via hex-encoded IP literal — src/Pages/Settings.hs:608 (security)

safeScrapeUrl relies on obfuscatedNumericHost to block bare integer IPs, but that check uses T.all (\c -> isDigit c || c == '.') which only covers decimal digits 09. The hex character x fails isDigit, so "0x7f000001" falls through both internalIPv4 (not a 4-octet quad) and obfuscatedNumericHost (contains non-digit x, f), and safeScrapeUrl "http://0x7f000001/metrics" returns True. network-uri's parseURI accepts hex reg-names, so uriRegName hands back "0x7f000001" which an OS resolver expands to 127.0.0.1.

Fix: add "0x" to the hex-prefix guard, or add a isHexNumericHost check analogous to obfuscatedNumericHost that rejects hosts matching [0-9a-fA-Fx]+ when they aren't a clean dotted quad.


2. getConfig reads across project boundaries — src/Models/Apis/PrometheusScrapeConfigs.hs:304 (security)

getConfig cid = Hasql.interpOne
  ([HI.sql|SELECT |] <> selectCols <> [HI.sql| FROM apis.prometheus_scrape_configs WHERE id = #{cid}|])

No project_id filter. In prometheusUpdateH, the "keep saved token" branch (Nothing auth + no clearAuth) calls getConfig cid with a user-supplied UUID. If cid belongs to a different project, the foreign row's auth_header (a Bearer token) is loaded into keptAuth. The subsequent updateConfig pid cid is project-scoped so it's a no-op, but the foreign token is briefly live in the request context where structured logging or APM instrumentation can capture it.

UUID v4 makes guessing infeasible in practice, but this still violates the principle of least privilege. The simplest fix is a scoped variant:

getConfigByProject :: DB es => Projects.ProjectId -> PrometheusScrapeConfigId -> Eff es (Maybe PrometheusScrapeConfig)
getConfigByProject pid cid = Hasql.interpOne
  ([HI.sql|SELECT |] <> selectCols <> [HI.sql| FROM apis.prometheus_scrape_configs WHERE id = #{cid} AND project_id = #{pid}|])

and use it in prometheusUpdateH.


3. No UNIQUE(project_id, name) constraint — static/migrations/0101_prometheus_scrape_configs.sql (data integrity)

Two targets with the same name in the same project share the same service.name bucket. The metrics_meta table has a unique constraint on (project_id, metric_name, service_name), so one target silently overwrites the other's metadata on every upsert. The raw metric rows accumulate from both sources and are indistinguishable in queries, dashboards, and alerts.

Add to the migration (or a follow-up):

ALTER TABLE apis.prometheus_scrape_configs
  ADD CONSTRAINT uq_prometheus_scrape_configs_project_name UNIQUE (project_id, name);

and surface a friendly "a target named X already exists" error in prometheusPostH / prometheusUpdateH when the insert/update violates it.


4. configsByProjectId returns [] but callers always want Vectorsrc/Models/Apis/PrometheusScrapeConfigs.hs:300 (simplification)

Both call sites do V.fromList <$> PromCfg.configsByProjectId pid. The return type should be V.Vector PrometheusScrapeConfig so the conversion happens once at the source — consistent with how similar list-type DB functions are consumed elsewhere in the codebase.


5. splitBrace returns Just on unclosed brace — src/Pkg/Prometheus.hs:157 (code clarity)

_ -> Just (acc <> chunk, "") -- no closing brace; take what we have

The Maybe return type conventionally signals success/failure. Returning Just ("partial-content", "") when the closing } is absent means callers cannot distinguish "parsed ok, nothing after the brace" from "no closing brace found". Today parseSample catches the silent Nothing downstream (empty words ""Nothing), but any future caller that relies on splitBrace returning Nothing for failure will be surprised. Return Nothing here — parseSample's monadic bind will propagate it cleanly.


Posted by Claude (claude-sonnet-4-6) via /review

…okup, name uniqueness

Security:
- safeScrapeUrl: block hex-encoded IP literals (0x7f000001, 0x7f.0x0.0x0.0x1) — the
  decimal-only numeric guard let them through; doctest added
- getConfigByProject: project-scoped lookup for the update handler so a foreign config id
  can't load another project's auth token into the request context; getConfig kept only for
  the trusted background worker (which holds a claimed id)

Data integrity:
- migration 0103: UNIQUE(project_id, name) — two targets sharing a name collide on the
  metrics_meta (project_id, metric_name, service_name) key and clobber each other
- promSave rejects a duplicate name with a friendly toast (DB constraint is the race backstop)

Cleanup:
- configsByProjectId returns Vector (conversion was duplicated at both call sites)
- splitBrace returns Nothing on an unclosed brace instead of Just partial-content

Doctest count 701 -> 702.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant