Skip to content

[fm] Add saga diagnosis engine#10592

Draft
smklein wants to merge 41 commits into
mainfrom
fm-saga-diagnoser
Draft

[fm] Add saga diagnosis engine#10592
smklein wants to merge 41 commits into
mainfrom
fm-saga-diagnoser

Conversation

@smklein

@smklein smklein commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Adds the a fault management diagnosis engine: the problematic saga diagnoser. Stacked on #10541, which establishes the typed per-engine fact tables this engine uses.

What it does

Opens a case (keyed by saga ID) for any unfinished saga that needs attention.

This PR uses three fact kinds:

Inputs

The preparation phase builds an ObservedSaga projection, the saga analog of #10541's InServiceDisk: all unfinished sagas (running, unwinding, or abandoned), joined with each saga's latest node-event time and its owner's classification against db_metadata_nexus. This is the executed view read directly from the database, and it is included in the analysis input report so "why did FM (not) flag this saga" is answerable from a sitrep.

Case lifecycle

The two live-saga facts may coexist on one case; as conditions change, facts are removed and re-added under the same case (with stable fact UUIDs when the observation is unchanged). The case closes when the saga completes, or when every condition clears (progress resumes, owner re-adopted), and a recurrence later opens a fresh case. An abandoned saga's case never closes on its own; it persists, carrying the Abandoned fact, until remediation removes the saga row. Actually remediating an Abandoned saga seems like a case-by-case resolution, so I'm not doing it in this PR!

Fact payload design

Payloads carry minimal condition-defining fields: the saga ID plus the parameters whose change means the condition itself changed (which rotates the fact UUID). Presentation data (e.g. the saga's name) is deliberately excluded; it can be looked up from the saga row when a case is acted on, and a case is only open while that row exists.

Context

Review notes

smklein added 20 commits June 3, 2026 11:34
The second fault management diagnosis engine: opens a case (keyed by
saga_id) for any non-terminal saga that is either not making progress
(no node event recorded within STALE_SAGA_THRESHOLD) or orphaned (owned
by a Nexus that is no longer of the current generation). These are two
independent fact kinds; a saga's case may carry either or both. A case
closes when the saga reaches a terminal state. Tracks #10530.

Supporting infrastructure:

- DiagnosisEngineKind::Saga variant (Rust + DB enum)
- fm_fact_saga typed child table with two fact kinds (not_progressing,
  owner_not_current_generation), per-kind nullable columns gated by a
  CHECK, participating in copy-forward + GC like other sitrep child
  tables
- SagaFact / FactPayload::Saga and the ObservedSaga nexus-types
  projection
- saga_list_running_or_unwinding_batched and a grouped
  saga_latest_node_event_times datastore query (the wall-clock progress
  signal); owner currency read DB-direct via the existing
  get_db_metadata_nexus_in_state

Schema migration: fm-saga-de (version 263) adds the 'saga' enum value,
the fm_fact_saga_kind and fm_fact_saga_orphan_reason enums, and the
fm_fact_saga table.
…lerate duplicate zpools

- The per-kind CHECK constraint on fm_fact_physical_disk is now an
  implication (kind != 'zpool_unhealthy' OR columns present), so future
  fact kinds add their own constraint instead of rewriting this one
- The fm_analysis in-service disk projection skips soft-deleted
  physical_disk rows, matching what its comment already claimed
- A duplicate physical disk in the zpool listing now logs a warning and
  keeps the first zpool instead of panicking the background task
SitrepBuilder::cases is seeded from the open cases of the Input the
builder was constructed from, but diagnosis::analyze() previously took
the Input as a separate argument; nothing tied the two together, and an
engine handed a mismatched pair would panic looking up a case that was
never seeded. Store the Input on the builder and have engines read it
from there, making the mismatch unrepresentable.
Facts are only read during sitrep load, which paginates over the
(sitrep_id, id) primary key; nothing queries by case directly. Easy to
re-add with a migration if omdb grows a case-scoped fact query.
Replace the per-disk linear scan of parent cases with a one-time
inverse index. First case ID wins on (pathological) collisions, same
as the scan it replaces.
Match the fm_fact_physical_disk convention: each kind's constraint
validates that the columns it expects are present, and future kinds add
their own constraint instead of rewriting an exhaustive CASE. This also
stops enforcing that other kinds' columns are NULL, leaving room for
kinds to share columns.
A case is an episode of a problem, not a dossier on the saga. When a
flagged saga is still running but no condition holds anymore (progress
resumed, owner re-adopted by a current Nexus), close the case rather
than stripping its facts and leaving it open. Previously the case would
be left open with zero facts, making it uninterpretable to the next
analysis pass, which would then carry it forward unexamined forever,
even after the saga terminated.

The closing case keeps its facts attached as the record of why it
existed; they age out with the case once it stops being copied forward
and its sitreps are GC'd.
Facts are only read during sitrep load, which paginates over the
(sitrep_id, id) primary key; nothing queries by case directly. Matches
the same change to fm_fact_physical_disk.
The input report listed in-service disks but said nothing about the
sagas visible to the saga diagnosis engine, leaving no way to answer
"why did FM (not) flag this saga" from the report. List each
non-terminal saga with its name, state, latest node-event time, and
owner classification.
A mass-stuck-saga incident is exactly when this query runs with a large
ID list; chunk the eq_any so it never becomes one giant statement.
Chunks are disjoint, so GROUP BY keys never cross them and the
concatenated results are identical to the single-statement form.
A saga case carries at most one fact per kind. The parent-case summary
previously kept whichever duplicate happened to have the highest fact
UUID and lost track of the rest, so a (pathological) duplicate would be
carried forward and re-persisted in every sitrep, and removing the
tracked copy while an untracked match survived would re-add a fresh
fact, regenerating the duplicate pair.

ParentSagaCase now separates the fact to consider when advancing the
case (the lowest fact UUID of each kind) from duplicates, which are
removed unconditionally with a warning. A corrupt case converges to one
fact per kind in a single pass.
A saga with no current_sec yields owner_state = None, not Absent; the
variant doc claimed otherwise. Also replace em-dashes in this file's
comments.
A fact payload contains exactly what defines the condition for the
analysis loop: the subject's ID plus the parameters whose change should
rotate the fact. Anything a human wants for presentation is looked up
from the database when a case is acted on; a case is only open while
its saga row still exists, so the lookup always works.

Accordingly:
- saga_name leaves both payloads and the fm_fact_saga table; it never
  defines a condition. Names still appear in debug comments.
- time_created leaves NotProgressing; the staleness condition folds it
  into last_event_time already.
- adopt_generation leaves OwnerNotCurrentGeneration; it is not
  condition-defining, and sagas_reassign_sec bumping it would have
  rotated the fact UUID over a meaningless ownership shuffle.
Per omicron#10581, Nexus will explicitly abandon sagas it fails to
recover for non-transient reasons. Abandonment is the beginning of an
escalation, not a resolution: the saga may be holding
partially-allocated resources and needs saga-specific manual
remediation (RFD 555). Without this change the engine would close the
case the moment a stuck saga was abandoned, exactly when the escalation
should begin.

- The projection now lists unfinished sagas (running, unwinding, or
  abandoned); only 'done' drops a saga from observation. ObservedSaga
  carries the three-variant ObservedSagaState; SagaProgressState
  remains the live-saga subset recorded in NotProgressing facts.
- New SagaFact::Abandoned with a pure-identity payload (the condition
  is boolean, so nothing can rotate it). Abandonment supersedes the
  live-saga conditions: NotProgressing and OwnerNotCurrentGeneration
  facts are removed when a saga is abandoned, and the case carries the
  Abandoned fact alone, staying open until the saga row is deleted.
- The module doc also records why the owner fact detects stranded
  sagas rather than wrongly-resumed ones, and what mitigates the
  latter.
@smklein smklein changed the title Fm saga diagnoser [fm] Add saga diagnosis engine Jun 11, 2026
smklein added 9 commits June 11, 2026 12:50
Hardening from a pre-review pass:

- Include zpool_id in the "same observation" comparison for facts, so a
  zpool that is destroyed and recreated (still unhealthy) rotates the
  fact instead of carrying a stale zpool reference forward.

- Close cases the diagnosis engine cannot interpret (foreign fact
  payload, facts disagreeing on the disk, no facts) and duplicate cases
  for the same disk, rather than carrying them forward unprocessable
  forever. Closing is safe for fault coverage: detection is independent
  of case bookkeeping, so a still-faulty disk gets a fresh well-formed
  case in the same analysis pass. Reasons are a typed enum
  (UninterpretableCase) surfaced in the close comment and warn logs.

- In the omdb output test, drive the FM tasks to their steady state
  with explicit background task activations (analysis -> loader ->
  analysis -> rendezvous) instead of racing the watch-channel triggers.
  This makes the expected output deterministic for both fm_analysis and
  fm_rendezvous, and removes the sitrep_load_rx test plumbing
  (NexusServer trait method, accessors, wait helper) that existed only
  to support the old wait.
Apply the lessons from the disk diagnoser review (the same pattern
landed there in 6f2e4a5):

- Close cases the engine cannot interpret (foreign fact payload, facts
  disagreeing on the saga, no facts) instead of skipping them, which
  left them open and unprocessable in every future sitrep with no path
  to closure. Reasons are a typed UninterpretableCase enum surfaced in
  the close comment and warn logs. Closing is safe for fault coverage:
  detection iterates observed sagas independently of case bookkeeping,
  so a saga that still needs attention gets a fresh well-formed case in
  the same pass.

- Close duplicate cases for the same saga as superseded, keeping the
  lowest case ID. Previously the last-indexed case silently won the
  saga_id index while both cases had stale facts removed, so the loser
  decayed into an empty open case that could never close.

- New tests: empty case closed, duplicate closed, corrupt case replaced
  by a fresh one in the same pass, and foreign-payload cases closed in
  both engines (newly testable, since two FactPayload variants now
  exist).
Previously from_sitrep() set a default kind in the base row that every
match arm had to remember to override; a future DiskFact variant that
forgot would silently inherit the wrong discriminant. Route kind
through an exhaustive match on the payload so that can't compile.
Remove the unused CaseBuilder::facts() accessor, collapse the three
identical multi-line .expect() strings in analyze() to one short
message, and drop the redundant case-level comment on a freshly opened
disk case (its specifics already live in the fact added alongside it).
Lets the per-DE row constructor take (metadata, payload) instead of a
whole Fact plus a redundant DiskFact, removing the consistency footgun
where the two could disagree. Mirrors Case's metadata/payload split.
observed_health -> observed_zpool_health (keeps the load-bearing
'observed' semantic), summarize_case -> parse_case (it validates and
rejects, not just condenses), and ParentCaseSummary -> ParsedDiskCase to
match.
smklein added 7 commits June 15, 2026 11:51
The duplicate-case detection re-scanned parent_cases and re-looked-up
case_by_disk to find the cases that weren't kept. Since parent_cases
iterates ascending by CaseUuid, the keep-lowest / close-the-rest decision
can be made in one pass with the Entry API.
Fact reconciliation was split across two loops over two collections: the
first loop (over parent cases) removed stale facts, the second (over
faulty in-service disks) added fresh ones, coupled by a 'removed above'
comment. Make the first loop closing-only and let the second own all fact
state for a disk (remove stale + add fresh) in one place.
…ookup

Replace the (case, fact, DiskFact) tuple returned by disk_facts with a
named DiskFactRef so call sites read .case/.fact/.disk_fact instead of
.0/.1/.2, and extract the repeated parent-fact-id chain into a
sole_disk_fact_id helper. Also fix a stale doc comment (identity fields,
not facts).
Exhaustively destructure FactMetadata in from_sitrep so a new field fails to compile until it is mapped to a column.

remove_fact now takes a comment (logged on removal) and logs the removed fact's payload; it warns if asked to remove a fact that does not exist.

Drop a redundant 'Empty today' line from known_ereport_classes' doc.

Close uninterpretable Disk cases inline in the parse loop instead of collecting them into a vec and looping again (the iterator borrows the input, not the builder), which also drops the now-redundant separate warn.

Reframe the duplicate-case tie-break comment: keeping the lowest-ID case is arbitrary-but-deterministic, not 'correct'.
Brings the disk diagnoser's typed-fact-table work (and main) into the
saga branch. The merge propagates the Fact -> FactMetadata + payload
split and the new remove_fact(comment) signature into the saga engine,
its db-model row (fm_fact_saga), the datastore, and their tests so the
tree builds; the saga-specific cleanups mirroring #10541 follow in a
separate commit.
Mirror the disk diagnoser cleanups in the saga engine, now that both
share the FactMetadata/payload split:

- summarize_case -> parse_case (it validates and rejects, not just
  condenses) and ParentSagaCase -> ParsedSagaCase to match.
- observed -> observed_sagas.
- Close uninterpretable cases inline in the parse loop instead of
  collecting them into a vec and looping again.
- Build case_for_saga in a single pass with the Entry API, closing
  duplicate cases inline; reframe the tie-break comment as
  arbitrary-but-deterministic rather than 'correct'.
- Separate case-closing from fact reconciliation: the first pass over
  surviving cases is now closing-only, and the second pass owns all of
  a saga's fact state (drop duplicate + stale facts, add fresh) in one
  place. A new covered() helper folds the three per-kind match arms.
- FmFactSaga::from_sitrep takes (metadata, payload) and destructures
  FactMetadata exhaustively, so a new field fails to compile until it
  is mapped to a column.
- Tests: replace the (Fact, SagaFact) tuple from saga_facts with a
  named SagaFactRef, and extract the repeated sole-fact lookup into a
  sole_saga_fact_id helper.
@smklein smklein force-pushed the fm-saga-diagnoser branch from 90185bb to 5b48c08 Compare June 16, 2026 19:10
@smklein smklein force-pushed the fm-saga-diagnoser branch from b5f8ee1 to dffc15c Compare June 16, 2026 21:29
smklein added 2 commits June 16, 2026 16:36
The fm_sitrep_gc task's "orphaned sitreps deleted" and "orphaned
fm_sitrep_analysis_report rows deleted" counts are timing-dependent.
When two fm_analysis activations overlap (e.g. the boot-time activation
and the explicit drive in the test), both can insert a first sitrep
before either is made current; the loser's sitrep and its stashed
analysis report are inserted but orphaned (fm_sitrep_insert's
ParentNotCurrent path), then deleted by a later GC pass. Whether that
race occurred before the omdb snapshot is non-deterministic, which made
test_omdb_success_cases flaky. Redact both counts.
Base automatically changed from fm-disk-diagnoser-typed to main June 18, 2026 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant