Epoch boundary resume is not idempotent — root cause of persisted pool-snapshot lag

## Summary

A node restart (Mithril-bootstrapped preview, "WAL in sync with state") panicked at RUPD for epoch 1223 because at least one persisted `PoolState` had `snapshot.epoch() != EpochState.number` — its snapshot lagged the current epoch. #1016 turned that obscure `unwrap` panic into a clear `ChainError::PoolSnapshotLagging`; #1017 added two fail-loud guards. **Neither fixes the root cause.** This issue captures what we know and what the real fix is.

## What we established (by code reading)

1. **The healthy boundary path cannot persist a lag.** `EpochState.number` advances only in ESTART `commit_finalize` (`crates/cardano/src/estart/commit.rs`), via `EpochTransition`, in the **same atomic `writer.commit()`** as every `PoolTransition` and the cursor. Both backends (`redb3`, `fjall`) commit a `Writer` atomically. So pools and the epoch advance together-or-not-at-all. The import path (`crates/core/src/import.rs`) runs the same lifecycle.

2. **Therefore a persisted lag originates in an abnormal crash/resume.** The sharding work (#978) already flags this directly in `crates/cardano/src/lib.rs`:
   > correctness depends on shard idempotency … **`AccountTransition` is not natively idempotent** … Operators should **monitor the subsequent boundary for inconsistency**. **TODO: implement true shard resume.**

   The boundary transitions (`AccountTransition`, `PoolTransition`, `EpochTransition`, `NonceTransition`, DRep/Proposal transitions) call `default_transition(next_epoch)` **unconditionally**: re-applying one double-rotates the `EpochValue` snapshot window and increments the epoch again. The only things preventing re-application today are the **shard-level** skip (`start_shard` reading `estart_progress.committed`) and #1017's finalize guard. The #978 authors flagged the shard-level protection as insufficient, and their warning should carry more weight than what we inferred from reading the code.

3. **Atomicity ≠ idempotency.** Atomic per-shard commit (verified) prevents torn writes *within* a shard. It does **not** guarantee that re-running a boundary from a partially-committed state reconstructs correct state — that is the missing property.
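
To make point 3 concrete, here is a minimal toy model, not the real code. `Entity`, `Shard`, `commit_shard` and `run_boundary` are hypothetical names, and the three-slot window only loosely imitates `EpochValue`. Each shard commit is atomic, yet a naive resume that re-runs every shard still diverges from an uninterrupted run:

```rust
// Toy stand-in for an entity with an epoch-windowed snapshot (assumption, not the real type).
#[derive(Debug, Clone, PartialEq)]
struct Entity {
    epoch: u64,
    window: [u64; 3], // [oldest, middle, newest]
}

impl Entity {
    // Unconditional transition: rotate the window and advance the epoch.
    fn transition(&mut self) {
        self.window = [self.window[1], self.window[2], self.window[2]];
        self.epoch += 1;
    }
}

type Shard = Vec<Entity>;

// Atomic per-shard commit: the shard is replaced wholesale or not at all.
fn commit_shard(store: &mut [Shard], idx: usize) {
    let mut staged = store[idx].clone();
    for entity in &mut staged {
        entity.transition();
    }
    store[idx] = staged;
}

// Runs the boundary over all shards; "crashes" before shard `crash_before` if set.
// Returns true when the boundary completed.
fn run_boundary(store: &mut [Shard], crash_before: Option<usize>) -> bool {
    for idx in 0..store.len() {
        if crash_before == Some(idx) {
            return false;
        }
        commit_shard(store, idx);
    }
    true
}

fn initial() -> Vec<Shard> {
    vec![
        vec![Entity { epoch: 10, window: [1, 2, 3] }],
        vec![Entity { epoch: 10, window: [4, 5, 6] }],
    ]
}

fn main() {
    let mut clean = initial();
    run_boundary(&mut clean, None);

    let mut resumed = initial();
    run_boundary(&mut resumed, Some(1)); // crash after shard 0 committed
    run_boundary(&mut resumed, None); // naive resume re-runs every shard

    // Shard 0 was transitioned twice: no torn write, but the wrong state.
    assert_ne!(clean, resumed);
    println!("clean:   {:?}", clean);
    println!("resumed: {:?}", resumed);
}
```

In this model the re-run entity ends up *ahead* of the others, which is the "lead" case discussed in the next section.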

## Open question: lead vs lag

The simplest non-idempotency story (re-running a transition) produces an epoch **lead** (entity ahead). The reported symptom is a **lag** (pool behind `EpochState`), i.e. the pool was transitioned *fewer* times than the epoch. Candidate lag mechanisms to confirm:
- **Import re-apply:** import skips the WAL ("crash recovery is handled by re-import", `import.rs`). If the cursor can sit behind already-committed state, re-import re-applies block deltas / transitions with no undo. **First thing to verify: does the import cursor advance in lockstep with state commits, or only at coarse points?**
- **Rollback/undo asymmetry:** an undo that reverts a pool's `PoolTransition` but not the matching `EpochTransition` (e.g. a transition delta missing from the WAL undo set, or a pool that was absent at finalize and reappears at a stale epoch).

The exact mechanism is **not confirmed** — it needs reproduction, not more reading.
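
For intuition only, the rollback/undo asymmetry candidate can be sketched as a toy. `Delta`, `State` and the undo set below are hypothetical and do not mirror the real WAL types. The point is that an undo set missing the epoch delta yields exactly the reported shape, a pool one epoch behind `EpochState`:

```rust
// Hypothetical delta kinds; the real transitions carry far more state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Delta {
    Pool,
    Epoch,
}

#[derive(Debug, PartialEq)]
struct State {
    pool_epoch: u64,
    epoch_number: u64,
}

impl State {
    fn apply(&mut self, delta: Delta) {
        match delta {
            Delta::Pool => self.pool_epoch += 1,
            Delta::Epoch => self.epoch_number += 1,
        }
    }

    fn undo(&mut self, delta: Delta) {
        match delta {
            Delta::Pool => self.pool_epoch -= 1,
            Delta::Epoch => self.epoch_number -= 1,
        }
    }

    // How many epochs the pool is behind the epoch state (0 when aligned or ahead).
    fn lag(&self) -> u64 {
        self.epoch_number.saturating_sub(self.pool_epoch)
    }
}

fn main() {
    let mut state = State { pool_epoch: 1222, epoch_number: 1222 };

    // Healthy boundary: both deltas are applied together.
    for delta in [Delta::Pool, Delta::Epoch] {
        state.apply(delta);
    }
    assert_eq!(state.lag(), 0);

    // Assumed faulty undo set: the epoch delta was never recorded.
    let undo_set = [Delta::Pool];
    for delta in undo_set.iter().rev() {
        state.undo(*delta);
    }

    // The pool is back at 1222 while the epoch stays at 1223.
    assert_eq!(state.lag(), 1);
    println!("{:?}, lag = {}", state, state.lag());
}
```

Whether the real undo path can actually drop a delta like this is precisely what the reproduction has to establish.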

## Recommendation

**Step 1 — reproduce.** Crash an import mid-boundary (and separately, force a rollback across a boundary) and record which entities land at which epochs. #1016/#1017 added the tripwires that make this legible:
- `ChainError::PoolSnapshotLagging` (RUPD + `MintedBlocksInc::apply` assert)
- `BrokenInvariant::EpochBoundaryIncomplete` (ESTART finalize guard)

**Step 2 — make transitions entity-level idempotent** (the realistic fix; "true shard resume"). In each `*Transition::apply`, no-op if the entity already reflects the target epoch:
```rust
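// Already at (or past) the target epoch: skip. `None < Some(_)`, so an unset epoch still transitions.
if entity.snapshot.epoch() >= Some(self.next_epoch) { return; }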
```
`EpochValue` already tracks its epoch, so the check is cheap. The work is the audit:
- Apply to every transition delta (`AccountTransition`, `PoolTransition`, `NonceTransition`, `EpochTransition(V2)`, DRep/Proposal).
- Make `undo` honor the skip — a delta that no-op'd on apply must no-op on undo (its `prev_*` must record "did nothing").
- For import: ensure the cursor/applied-marker advances in lockstep with committed state so re-import never re-applies committed deltas.

With entity-level idempotency in place, `start_shard`/progress fields and the finalize guard become belt-and-suspenders rather than the sole defense.
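
A minimal sketch of what Step 2 could look like for one entity, including the undo rule. `PoolEntity`, `Transition` and the `prev` field are hypothetical names chosen for illustration, and the window rotation is a simplification of whatever `default_transition` really does:

```rust
// Hypothetical entity; `snapshot_epoch` plays the role of `snapshot.epoch()`.
#[derive(Debug, Clone, PartialEq)]
struct PoolEntity {
    snapshot_epoch: Option<u64>,
    window: [u64; 3], // [oldest, middle, newest]
}

// Hypothetical transition delta. `prev == None` records "apply did nothing".
struct Transition {
    next_epoch: u64,
    prev: Option<PoolEntity>,
}

impl Transition {
    fn new(next_epoch: u64) -> Self {
        Self { next_epoch, prev: None }
    }

    fn apply(&mut self, entity: &mut PoolEntity) {
        // Entity-level guard: already at (or past) the target epoch, so skip.
        if entity.snapshot_epoch >= Some(self.next_epoch) {
            self.prev = None;
            return;
        }
        self.prev = Some(entity.clone());
        entity.window = [entity.window[1], entity.window[2], entity.window[2]];
        entity.snapshot_epoch = Some(self.next_epoch);
    }

    fn undo(&self, entity: &mut PoolEntity) {
        // A delta that skipped on apply must also skip on undo.
        if let Some(prev) = &self.prev {
            *entity = prev.clone();
        }
    }
}

fn main() {
    let original = PoolEntity { snapshot_epoch: Some(1222), window: [1, 2, 3] };
    let mut pool = original.clone();

    let mut first = Transition::new(1223);
    first.apply(&mut pool);
    let after_first = pool.clone();

    // Re-running the same boundary (e.g. on resume) changes nothing.
    let mut replay = Transition::new(1223);
    replay.apply(&mut pool);
    assert_eq!(pool, after_first);

    // Undoing the replay is a no-op; undoing the first restores the original.
    replay.undo(&mut pool);
    assert_eq!(pool, after_first);
    first.undo(&mut pool);
    assert_eq!(pool, original);
}
```

Storing the whole previous entity in `prev` keeps the sketch short; the real deltas presumably record narrower `prev_*` fields, and the audit above still has to be done per transition type.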

**Rejected — atomic whole-boundary commit (Approach 2):** conceptually clean (boundary is all-or-nothing) but reintroduces the memory blowup that #978 sharded *away*. Not viable at mainnet scale.

**Out of scope here — repair of already-corrupted nodes:** a node that already persisted a lag keeps failing loud with `PoolSnapshotLagging` and needs a re-bootstrap. Rebuilding a pool's snapshot window from the archive `StakeLog` is a separate, optional CLI-repair effort.

## Acceptance criteria
- [ ] Reproduction: a deterministic test (e.g. via `dolos_testing::harness::cardano::LedgerHarness`) that crashes mid-boundary and asserts the resulting per-entity epochs — confirming the exact lag mechanism.
- [ ] Re-running any boundary transition on an already-transitioned entity is a no-op (apply and undo).
- [ ] Import resume cannot re-apply already-committed deltas.
- [ ] `epoch_pots` / `tests/cardano` stay green; a new crash-resume test produces byte-identical state to an uninterrupted run.
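
As a rough illustration of the import criterion, the toy below advances the cursor in the same write as the state and compares a crash-then-re-import run against an uninterrupted one. `Store`, `Batch`, `import` and `sample_batches` are hypothetical and stand in for the real import machinery and harness:

```rust
// Hypothetical store: `cursor` is the sequence number of the last committed batch.
#[derive(Debug, Clone, PartialEq, Default)]
struct Store {
    balance: i64,
    cursor: u64,
}

struct Batch {
    seq: u64,
    balance_delta: i64,
}

impl Store {
    // State change and cursor advance land in one assignment (one "commit").
    // Returns false when the batch was already committed and is skipped.
    fn commit(&mut self, batch: &Batch) -> bool {
        if batch.seq <= self.cursor {
            return false;
        }
        *self = Store {
            balance: self.balance + batch.balance_delta,
            cursor: batch.seq,
        };
        true
    }
}

// Imports from the start of `batches`; "crashes" before index `crash_before` if set.
fn import(store: &mut Store, batches: &[Batch], crash_before: Option<usize>) {
    for (idx, batch) in batches.iter().enumerate() {
        if crash_before == Some(idx) {
            return;
        }
        store.commit(batch);
    }
}

fn sample_batches() -> Vec<Batch> {
    vec![
        Batch { seq: 1, balance_delta: 5 },
        Batch { seq: 2, balance_delta: -2 },
        Batch { seq: 3, balance_delta: 7 },
    ]
}

fn main() {
    let batches = sample_batches();

    let mut clean = Store::default();
    import(&mut clean, &batches, None);

    let mut resumed = Store::default();
    import(&mut resumed, &batches, Some(2)); // crash with batches 1 and 2 committed
    import(&mut resumed, &batches, None); // re-import from the start

    // Committed batches are skipped, so both runs end in the same state.
    assert_eq!(clean, resumed);
    println!("{:?}", resumed);
}
```

A real test would compare serialized state from the storage backend rather than an in-memory struct.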

## References
- #1016 — surface lagging pool at RUPD (`ensure_pool_aligned` / `PoolSnapshotLagging`)
- #1017 — fail-loud guards (this issue is the follow-up root-cause work)
- #978 — sharded mem-hungry work units (introduced the resume scaffolding + the TODOs)
- Key files: `crates/cardano/src/estart/{commit,work_unit,loading}.rs`, `crates/cardano/src/lib.rs` (resume diagnostics), `crates/cardano/src/model/{pools,accounts,epochs}.rs` (transition deltas), `crates/core/src/import.rs`
