Epoch boundary resume is not idempotent — root cause of persisted pool-snapshot lag #1018

@scarmuega

Summary

A node restart (Mithril-bootstrapped preview, "WAL in sync with state") panicked at RUPD for epoch 1223 because at least one persisted PoolState had snapshot.epoch() != EpochState.number — its snapshot lagged the current epoch. #1016 turned that obscure unwrap panic into a clear ChainError::PoolSnapshotLagging; #1017 added two fail-loud guards. Neither fixes the root cause. This issue captures what we know and what the real fix is.

What we established (by code reading)

  1. The healthy boundary path cannot persist a lag. EpochState.number advances only in ESTART commit_finalize (crates/cardano/src/estart/commit.rs), via EpochTransition, in the same atomic writer.commit() as every PoolTransition and the cursor. Both backends (redb3, fjall) commit a Writer atomically. So pools and the epoch advance together-or-not-at-all. The import path (crates/core/src/import.rs) runs the same lifecycle.

  2. Therefore a persisted lag originates in an abnormal crash/resume. The sharding work (#978, "refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline") already flags this directly in crates/cardano/src/lib.rs:

    correctness depends on shard idempotency … AccountTransition is not natively idempotent … Operators should monitor the subsequent boundary for inconsistency. TODO: implement true shard resume.

    The boundary transitions (AccountTransition, PoolTransition, EpochTransition, NonceTransition, DRep/Proposal transitions) call default_transition(next_epoch) unconditionally, so re-applying one double-rotates the EpochValue snapshot window and increments the epoch again. The only things preventing re-application today are the shard-level skip (start_shard reading estart_progress.committed) and the finalize guard from #1017 ("fix(cardano): fail loud on lagging pool snapshots and unfinished epoch boundaries"). The authors of that comment flagged the shard-level protection as insufficient, and their warning carries more weight than our inference from reading the code.

  3. Atomicity ≠ idempotency. Atomic per-shard commit (verified) prevents torn writes within a shard. It does not guarantee that re-running a boundary from a partially-committed state reconstructs correct state — that is the missing property.
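
To make point 3 concrete, here is a minimal sketch of an unconditional transition being replayed. `Window` and `transition_unconditional` are hypothetical stand-ins invented for illustration; they are not the real `EpochValue` or `default_transition`, and the two-slot layout is an assumption. Each call is all-or-nothing (so "atomic" in the toy sense), yet applying it a second time still yields a different state.

```rust
/// Toy snapshot window (hypothetical; not the real `EpochValue`).
#[derive(Debug, Clone, PartialEq)]
pub struct Window {
    pub epoch: u64,
    /// Value being accumulated during the current epoch.
    pub live: u64,
    /// [most recent snapshot, older snapshot]
    pub snapshots: [u64; 2],
}

/// Unconditional boundary transition: rotates the window and bumps the epoch
/// on every call, without checking what the entity already reflects.
/// `_next_epoch` is deliberately ignored to model the missing guard.
pub fn transition_unconditional(w: &mut Window, _next_epoch: u64) {
    w.snapshots[1] = w.snapshots[0];
    w.snapshots[0] = w.live;
    w.epoch += 1;
}

/// Applies the boundary once, then once more, as a resume that re-runs an
/// already-committed transition would.
pub fn apply_then_replay(start: &Window, next_epoch: u64) -> (Window, Window) {
    let mut once = start.clone();
    transition_unconditional(&mut once, next_epoch);

    let mut twice = once.clone();
    transition_unconditional(&mut twice, next_epoch);

    (once, twice)
}
```

In this toy, the replay loses the older snapshot and pushes the entity one epoch past the target, which is the "lead" direction discussed in the next section.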

Open question: lead vs lag

The simplest non-idempotency story (re-running a transition) produces an epoch lead (entity ahead). The reported symptom is a lag (pool behind EpochState), i.e. the pool was transitioned fewer times than the epoch. Candidate lag mechanisms to confirm:

  • Import re-apply: import skips the WAL ("crash recovery is handled by re-import", import.rs). If the cursor can sit behind already-committed state, re-import re-applies block deltas / transitions with no undo. First thing to verify: does the import cursor advance in lockstep with state commits, or only at coarse points?
  • Rollback/undo asymmetry: an undo that reverts a pool's PoolTransition but not the matching EpochTransition (e.g. a transition delta missing from the WAL undo set, or a pool that was absent at finalize and reappears at a stale epoch).

The exact mechanism is not confirmed — it needs reproduction, not more reading.
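
The lead/lag distinction can be sketched with a toy model. Everything here (`State`, `Delta`, `skew`) is hypothetical and only illustrates which direction each story pushes the skew; it does not claim to be the mechanism behind the reported panic.

```rust
/// Toy state: one epoch counter and one pool's snapshot epoch (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct State {
    pub epoch_number: u64,
    pub pool_snapshot_epoch: u64,
}

/// Toy boundary deltas, one per entity kind.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Delta {
    Epoch,
    Pool,
}

pub fn apply(s: &mut State, d: Delta) {
    match d {
        Delta::Epoch => s.epoch_number += 1,
        Delta::Pool => s.pool_snapshot_epoch += 1,
    }
}

pub fn undo(s: &mut State, d: Delta) {
    match d {
        Delta::Epoch => s.epoch_number -= 1,
        Delta::Pool => s.pool_snapshot_epoch -= 1,
    }
}

/// Positive = pool leads the epoch, negative = pool lags, zero = in sync.
pub fn skew(s: &State) -> i64 {
    s.pool_snapshot_epoch as i64 - s.epoch_number as i64
}

/// Story 1: the pool transition is re-applied after a full boundary.
pub fn replayed_pool_transition(start: State) -> State {
    let mut s = start;
    apply(&mut s, Delta::Epoch);
    apply(&mut s, Delta::Pool);
    apply(&mut s, Delta::Pool); // replay
    s
}

/// Story 2: the undo set contains the pool delta but not the epoch delta.
pub fn asymmetric_undo(start: State) -> State {
    let mut s = start;
    apply(&mut s, Delta::Epoch);
    apply(&mut s, Delta::Pool);
    undo(&mut s, Delta::Pool); // Delta::Epoch is missing from the undo set
    s
}
```

Only the second story produces a negative skew, which is why a plain re-run of a transition does not by itself explain the reported symptom.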

Recommendation

Step 1 — reproduce. Crash an import mid-boundary (and separately, force a rollback across a boundary) and record which entities land at which epochs. #1016/#1017 added the tripwires that make this legible:

  • ChainError::PoolSnapshotLagging (RUPD + MintedBlocksInc::apply assert)
  • BrokenInvariant::EpochBoundaryIncomplete (ESTART finalize guard)
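
For recording "which entities land at which epochs" after a reproduction, a small classifier along these lines may help. This is a sketch only: `Skew`, `classify`, and `out_of_sync` are hypothetical helpers, the string pool ids are placeholders, and nothing here mirrors the actual shape of `ChainError::PoolSnapshotLagging`.

```rust
use std::collections::BTreeMap;

/// How an entity's snapshot epoch relates to the persisted epoch number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Skew {
    InSync,
    /// Entity is behind the epoch by this many epochs.
    Lagging(u64),
    /// Entity is ahead of the epoch by this many epochs.
    Leading(u64),
    /// Entity has no snapshot epoch at all.
    NoSnapshot,
}

pub fn classify(snapshot_epoch: Option<u64>, epoch_number: u64) -> Skew {
    match snapshot_epoch {
        None => Skew::NoSnapshot,
        Some(e) if e == epoch_number => Skew::InSync,
        Some(e) if e < epoch_number => Skew::Lagging(epoch_number - e),
        Some(e) => Skew::Leading(e - epoch_number),
    }
}

/// Returns every entity that is not in sync, ordered by id.
pub fn out_of_sync<'a>(
    pools: &BTreeMap<&'a str, Option<u64>>,
    epoch_number: u64,
) -> Vec<(&'a str, Skew)> {
    pools
        .iter()
        .map(|(id, epoch)| (*id, classify(*epoch, epoch_number)))
        .filter(|(_, skew)| *skew != Skew::InSync)
        .collect()
}
```

Reporting lag and lead separately keeps the open question above answerable from a single crash run.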

Step 2 — make transitions entity-level idempotent (the realistic fix; "true shard resume"). In each *Transition::apply, no-op if the entity already reflects the target epoch:

if entity.snapshot.epoch() >= Some(self.next_epoch) { return; }

EpochValue already tracks its epoch, so the check is cheap. The work is the audit:

  • Apply to every transition delta (AccountTransition, PoolTransition, NonceTransition, EpochTransition(V2), DRep/Proposal).
  • Make undo honor the skip — a delta that no-op'd on apply must no-op on undo (its prev_* must record "did nothing").
  • For import: ensure the cursor/applied-marker advances in lockstep with committed state so re-import never re-applies committed deltas.
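
A minimal sketch of the apply/undo pairing described in the first two bullets, assuming a toy entity. `Entity`, `Transition`, and the `prev` field are hypothetical and much simpler than the real deltas; the point is only that a skipped apply records "did nothing" so the matching undo is also a no-op.

```rust
/// Toy entity with an epoch-tagged snapshot (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Entity {
    pub epoch: Option<u64>,
    pub live: u64,
    pub snapshot: u64,
}

/// Toy boundary delta. `prev` holds the pre-apply state, or `None` when
/// apply was skipped because the entity already reflected `next_epoch`.
#[derive(Debug)]
pub struct Transition {
    pub next_epoch: u64,
    prev: Option<Entity>,
}

impl Transition {
    pub fn new(next_epoch: u64) -> Self {
        Self { next_epoch, prev: None }
    }

    pub fn apply(&mut self, entity: &mut Entity) {
        // `None < Some(_)` for Option, so an entity without an epoch is
        // always transitioned.
        if entity.epoch >= Some(self.next_epoch) {
            self.prev = None; // record "did nothing"
            return;
        }
        self.prev = Some(entity.clone());
        entity.snapshot = entity.live;
        entity.epoch = Some(self.next_epoch);
    }

    pub fn undo(&self, entity: &mut Entity) {
        // Only revert if apply actually changed the entity.
        if let Some(prev) = &self.prev {
            *entity = prev.clone();
        }
    }
}
```

With this shape, undoing a skipped delta cannot roll an entity back past the epoch it legitimately reached, which is the property the second bullet asks for.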

With entity-level idempotency in place, start_shard/progress fields and the finalize guard become belt-and-suspenders rather than the sole defense.

Rejected — atomic whole-boundary commit (Approach 2): conceptually clean (boundary is all-or-nothing) but reintroduces the memory blowup that #978 sharded away. Not viable at mainnet scale.

Out of scope here — repair of already-corrupted nodes: a node that already persisted a lag keeps failing loud with PoolSnapshotLagging and needs a re-bootstrap. Rebuilding a pool's snapshot window from the archive StakeLog is a separate, optional CLI-repair effort.

Acceptance criteria

  • Reproduction: a deterministic test (e.g. via dolos_testing::harness::cardano::LedgerHarness) that crashes mid-boundary and asserts the resulting per-entity epochs — confirming the exact lag mechanism.
  • Re-running any boundary transition on an already-transitioned entity is a no-op (apply and undo).
  • Import resume cannot re-apply already-committed deltas.
  • epoch_pots / tests/cardano stay green; a new crash-resume test produces byte-identical state to an uninterrupted run.
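
The last criterion can be sketched as a toy crash-resume comparison. This does not use `LedgerHarness` (its API is not shown in this issue); `Pool`, `run_boundary`, and `crash_and_resume` are hypothetical, and structural equality stands in for "byte-identical".

```rust
/// Toy pool with an epoch-tagged snapshot (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Pool {
    pub epoch: u64,
    pub live: u64,
    pub snapshot: u64,
}

/// Entity-level idempotent transition: a no-op if already at `next_epoch`.
pub fn transition(p: &mut Pool, next_epoch: u64) {
    if p.epoch >= next_epoch {
        return;
    }
    p.snapshot = p.live;
    p.epoch = next_epoch;
}

/// Runs the boundary shard by shard. If `crash_after` is `Some(i)`, stops
/// before shard `i` to simulate a crash between shard commits.
/// Returns whether the boundary finished. `shard_size` must be non-zero.
pub fn run_boundary(
    pools: &mut [Pool],
    shard_size: usize,
    next_epoch: u64,
    crash_after: Option<usize>,
) -> bool {
    for (i, shard) in pools.chunks_mut(shard_size).enumerate() {
        if crash_after == Some(i) {
            return false;
        }
        for p in shard.iter_mut() {
            transition(p, next_epoch);
        }
    }
    true
}

/// Crashes mid-boundary, then resumes from shard 0 with no progress marker,
/// relying only on entity-level idempotency.
pub fn crash_and_resume(
    start: &[Pool],
    shard_size: usize,
    next_epoch: u64,
    crash_after: usize,
) -> Vec<Pool> {
    let mut pools = start.to_vec();
    if !run_boundary(&mut pools, shard_size, next_epoch, Some(crash_after)) {
        run_boundary(&mut pools, shard_size, next_epoch, None);
    }
    pools
}
```

A real test would crash at every shard index and compare against the uninterrupted run, as the toy does when `crash_after` ranges over all shards.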

Labels: area:cardano (Cardano ledger / epoch / pots logic), bug (Something isn't working)
