Summary
A node restart (Mithril-bootstrapped preview, "WAL in sync with state") panicked at RUPD for epoch 1223 because at least one persisted PoolState had snapshot.epoch() != EpochState.number: its snapshot lagged the current epoch. #1016 turned that obscure unwrap panic into a clear ChainError::PoolSnapshotLagging; #1017 added two fail-loud guards. Neither fixes the root cause. This issue captures what we know and what the real fix is.
What we established (by code reading)
The healthy boundary path cannot persist a lag. EpochState.number advances only in ESTART commit_finalize (crates/cardano/src/estart/commit.rs), via EpochTransition, in the same atomic writer.commit() as every PoolTransition and the cursor. Both backends (redb3, fjall) commit a Writer atomically, so pools and the epoch advance together or not at all. The import path (crates/core/src/import.rs) runs the same lifecycle.
Therefore a persisted lag originates in an abnormal crash/resume. The sharding work (#978) already flags this directly in crates/cardano/src/lib.rs:
"correctness depends on shard idempotency … AccountTransition is not natively idempotent … Operators should monitor the subsequent boundary for inconsistency. TODO: implement true shard resume."
The boundary transitions (AccountTransition, PoolTransition, EpochTransition, NonceTransition, DRep/Proposal transitions) call default_transition(next_epoch) unconditionally, so re-applying one double-rotates the EpochValue snapshot window and increments the epoch again. The only thing preventing re-application today is the shard-level skip (start_shard reading estart_progress.committed) plus #1017's finalize guard. The authors flagged the shard-level protection as insufficient, and they were closer to the code than our from-reading inference.
Atomicity ≠ idempotency. Atomic per-shard commit (verified) prevents torn writes within a shard. It does not guarantee that re-running a boundary from a partially-committed state reconstructs correct state — that is the missing property.
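To make the distinction concrete, here is a minimal, self-contained model of the failure. It is a sketch under stated assumptions, not the project's code: Window and rotate_unconditionally are hypothetical stand-ins for an epoch-tagged snapshot window and for the unconditional rotation described above. Each call stands for one atomically committed write, so nothing is ever torn, yet running the boundary twice still diverges from running it once.

```rust
/// Hypothetical stand-in for an epoch-tagged snapshot window.
#[derive(Debug, Clone, PartialEq)]
struct Window {
    epoch: u64,
    /// Values ordered newest to oldest.
    slots: [u64; 3],
}

impl Window {
    /// Unconditional rotation: shift every slot one step older and bump the epoch.
    /// The target epoch is ignored, which is what makes a re-run destructive.
    fn rotate_unconditionally(&mut self, _next_epoch: u64) {
        self.slots = [self.slots[0], self.slots[0], self.slots[1]];
        self.epoch += 1;
    }
}

fn main() {
    let start = Window { epoch: 10, slots: [30, 20, 10] };

    // Uninterrupted run: the boundary is applied exactly once.
    let mut once = start.clone();
    once.rotate_unconditionally(11);

    // Crash/resume: every commit was atomic (no torn write), but the
    // transition ran a second time for the same target epoch.
    let mut twice = start.clone();
    twice.rotate_unconditionally(11);
    twice.rotate_unconditionally(11);

    assert_eq!(once, Window { epoch: 11, slots: [30, 30, 20] });
    assert_eq!(twice, Window { epoch: 12, slots: [30, 30, 30] });
    assert_ne!(once, twice);
    println!("once = {once:?}, twice = {twice:?}");
}
```

In this model the second run both loses the oldest snapshot and pushes the epoch one past the target, which is the "double-rotate" effect the paragraph describes.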
Open question: lead vs lag
The simplest non-idempotency story (re-running a transition) produces an epoch lead (entity ahead). The reported symptom is a lag (pool behind EpochState), i.e. the pool was transitioned fewer times than the epoch. Candidate lag mechanisms to confirm:
Import re-apply: import skips the WAL ("crash recovery is handled by re-import", import.rs). If the cursor can sit behind already-committed state, re-import re-applies block deltas / transitions with no undo. First thing to verify: does the import cursor advance in lockstep with state commits, or only at coarse points?
Rollback/undo asymmetry: an undo that reverts a pool's PoolTransition but not the matching EpochTransition (e.g. a transition delta missing from the WAL undo set, or a pool that was absent at finalize and reappears at a stale epoch).
The exact mechanism is not confirmed — it needs reproduction, not more reading.
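The lead/lag distinction can be written down as a small classification. This is purely illustrative: Ledger, Skew and skew are hypothetical names, and the two scenarios only model the epoch counters, not any real transition or undo code. It shows why a plain re-run cannot explain the reported symptom, while an undo that reverts only the pool can.

```rust
/// Hypothetical, simplified view: only the epoch numbers matter here.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ledger {
    epoch_state: u64,
    pool_snapshot_epoch: u64,
}

#[derive(Debug, PartialEq)]
enum Skew {
    Aligned,
    Lead(u64),
    Lag(u64),
}

/// Classify how the pool's snapshot epoch relates to the epoch state.
fn skew(l: Ledger) -> Skew {
    use std::cmp::Ordering::*;
    match l.pool_snapshot_epoch.cmp(&l.epoch_state) {
        Equal => Skew::Aligned,
        Greater => Skew::Lead(l.pool_snapshot_epoch - l.epoch_state),
        Less => Skew::Lag(l.epoch_state - l.pool_snapshot_epoch),
    }
}

fn main() {
    let healthy = Ledger { epoch_state: 11, pool_snapshot_epoch: 11 };
    assert_eq!(skew(healthy), Skew::Aligned);

    // Re-running only the pool transition: the pool moves ahead of the epoch.
    let mut rerun = healthy;
    rerun.pool_snapshot_epoch += 1;
    assert_eq!(skew(rerun), Skew::Lead(1));

    // Asymmetric undo: the pool transition is reverted, the epoch transition is not.
    let mut asymmetric_undo = healthy;
    asymmetric_undo.pool_snapshot_epoch -= 1;
    assert_eq!(skew(asymmetric_undo), Skew::Lag(1));
}
```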
Recommendation
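Step 1: reproduce. Crash an import mid-boundary (and separately, force a rollback across a boundary) and record which entities land at which epochs. #1016/#1017 added the tripwires that make this legible:
ChainError::PoolSnapshotLagging (RUPD + MintedBlocksInc::apply assert)
BrokenInvariant::EpochBoundaryIncomplete (ESTART finalize guard)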
Step 2: make transitions entity-level idempotent (the realistic fix; "true shard resume"). In each *Transition::apply, no-op if the entity already reflects the target epoch:

    // Already at or past the target epoch: skip.
    if entity.snapshot.epoch() >= Some(self.next_epoch) {
        return;
    }

EpochValue already tracks its epoch, so the check is cheap. The work is the audit:
Apply to every transition delta (AccountTransition, PoolTransition, NonceTransition, EpochTransition(V2), DRep/Proposal).
Make undo honor the skip — a delta that no-op'd on apply must no-op on undo (its prev_* must record "did nothing").
For import: ensure the cursor/applied-marker advances in lockstep with committed state so re-import never re-applies committed deltas.
With entity-level idempotency in place, start_shard/progress fields and the finalize guard become belt-and-suspenders rather than the sole defense.
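A minimal sketch of the guard together with the "undo honors the skip" rule, under stated assumptions: Entity, Transition and the prev field are hypothetical stand-ins, not the project's types, and storing a full copy of the previous entity is a simplification of whatever the real prev_* fields record. The guard uses the same Option comparison as the check above (None compares less than any Some).

```rust
/// Hypothetical entity with an epoch-tagged snapshot window (newest first).
#[derive(Debug, Clone, PartialEq)]
struct Entity {
    epoch: Option<u64>,
    slots: [u64; 3],
}

/// Hypothetical transition delta. `prev` records what apply replaced;
/// `None` means "apply did nothing", so undo must do nothing too.
struct Transition {
    next_epoch: u64,
    prev: Option<Entity>,
}

impl Transition {
    fn new(next_epoch: u64) -> Self {
        Self { next_epoch, prev: None }
    }

    fn apply(&mut self, e: &mut Entity) {
        // Guard: the entity is already at or past the target epoch, so skip.
        if e.epoch >= Some(self.next_epoch) {
            self.prev = None;
            return;
        }
        self.prev = Some(e.clone());
        e.slots = [e.slots[0], e.slots[0], e.slots[1]];
        // Set the target epoch explicitly instead of incrementing.
        e.epoch = Some(self.next_epoch);
    }

    fn undo(&mut self, e: &mut Entity) {
        // Restore only if apply actually changed the entity.
        if let Some(prev) = self.prev.take() {
            *e = prev;
        }
    }
}

fn main() {
    let original = Entity { epoch: Some(10), slots: [30, 20, 10] };
    let mut e = original.clone();

    let mut first = Transition::new(11);
    first.apply(&mut e);
    let after_first = e.clone();

    // A resume re-applies the same boundary: nothing changes.
    let mut replay = Transition::new(11);
    replay.apply(&mut e);
    assert_eq!(e, after_first);

    // Undoing the skipped replay must not revert the real transition.
    replay.undo(&mut e);
    assert_eq!(e, after_first);

    // Undoing the real transition restores the original state.
    first.undo(&mut e);
    assert_eq!(e, original);
}
```

Assigning the target epoch rather than incrementing is a deliberate choice in this sketch: even if the guard were bypassed, the epoch could not run past next_epoch, although the window would still be rotated twice.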
Rejected — atomic whole-boundary commit (Approach 2): conceptually clean (boundary is all-or-nothing) but reintroduces the memory blowup that #978 sharded away. Not viable at mainnet scale.
Out of scope here — repair of already-corrupted nodes: a node that already persisted a lag keeps failing loud with PoolSnapshotLagging and needs a re-bootstrap. Rebuilding a pool's snapshot window from the archive StakeLog is a separate, optional CLI-repair effort.
Acceptance criteria
Reproduction: a deterministic test (e.g. via dolos_testing::harness::cardano::LedgerHarness) that crashes mid-boundary and asserts the resulting per-entity epochs — confirming the exact lag mechanism.
Re-running any boundary transition on an already-transitioned entity is a no-op (apply and undo).
epoch_pots / tests/cardano stay green; a new crash-resume test produces byte-identical state to an uninterrupted run.
References
ensure_pool_aligned / PoolSnapshotLagging
crates/cardano/src/estart/{commit,work_unit,loading}.rs
crates/cardano/src/lib.rs (resume diagnostics)
crates/cardano/src/model/{pools,accounts,epochs}.rs (transition deltas)
crates/core/src/import.rs