Skip to content

feat(encryption): Stage 6E-2e-1 — Applier FSM-apply wrap installer hook#944

Open
bootjp wants to merge 2 commits into
mainfrom
feat/encryption-6e-2-e-1-fsm-apply-auto-install-wrap
Open

feat(encryption): Stage 6E-2e-1 — Applier FSM-apply wrap installer hook#944
bootjp wants to merge 2 commits into
mainfrom
feat/encryption-6e-2-e-1-fsm-apply-auto-install-wrap

Conversation

@bootjp

@bootjp bootjp commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Summary

Stage 6E-2e-1 closes BLOCKER (b) from PR #933's codex P1 round-3:
the leader-failover hazard between cutover commit and the
EnableRaftEnvelope handler's InstallWrap call.

Every replica's FSM-apply of the EnableRaftEnvelope cutover marker now
invokes a registered RaftCutoverWrapInstaller callback that publishes
the §4.2 raft envelope wrap closure on this node. A follower that
becomes leader post-cutover already has wrap active without needing the
admin RPC to re-run.

This is the apply-layer half of BLOCKER (b). Production wiring of
the installer closure (the main.go side that constructs a wrap closure
from cipher + active raft DEK and calls SetRaftPayloadWrap on every
participating kv.ShardGroup) lands in 6E-2e-3 alongside the
CutoverBarrierController fanout.

The PR is production-inert until 6E-2e-3 wires the installer:
default-nil installer = no behavior change for any caller that does
not opt in via WithRaftCutoverWrapInstaller.

What ships

  • RaftCutoverWrapInstaller func type (cutoverIdx, activeRaftDEKID) error.
  • WithRaftCutoverWrapInstaller Applier option.
  • applyEnableRaftEnvelope invokes the installer on:
    • Fresh-success branch (after WriteSidecar, after
      RefreshFromSidecar) with (raftIdx, p.DEKID).
    • Already-active branch (FSM replay safety: a snapshot that
      excludes the cutover entry replays it on restart) with
      (sc.RaftEnvelopeCutoverIndex, p.DEKID).
  • Stale-DEKID no-op branch deliberately skips the installer — no
    cutover took effect, installing would publish a closure keyed to the
    wrong DEK.
  • Installer errors halt apply (fail-closed). Sidecar write precedes
    installer call so crash recovery converges via the already-active
    branch on restart.
  • Cyclop-driven refactor: shared invokeRaftCutoverWrapInstaller
    helper keeps applyEnableRaftEnvelope under the per-function
    complexity cap.

Behavior change, risk, test evidence

  • Behavior change: production-observable change is none.
    Default nil installer = the existing apply path runs unchanged.
  • Risk: low. The hook is opt-in; production wiring lands in
    6E-2e-3 with its own review cycle.
  • Test evidence:
    • go test -race ./internal/encryption/... — all pass (existing
      TestApplyRotation_EnableRaftEnvelope_* tests confirmed unchanged
      • four new installer-hook tests).
    • golangci-lint --config=.golangci.yaml run — 0 issues.
    • Four new tests pin all four branches of the dispatch:
      FreshSuccess_InvokesInstaller,
      AlreadyActive_InvokesInstaller,
      StaleDEKID_SkipsInstaller,
      InstallerError_HaltsApply.

Caller audit

  • applyEnableRaftEnvelope (semantic change: installer-call insertion):
    only callee path is ApplyRotation dispatch in the same file. No
    external callers affected.
  • Default-nil installer preserves the entire pre-6E-2e-1 caller surface
    unchanged; no existing test or production caller observes the new
    branches.

5-lens self-review

  1. Data loss — none. Sidecar write is durable before installer
    call. The reverse ordering (install first, write second) would lose
    the install state on crash; we chose write→install so a crash
    between fsync and installer error converges on restart via the
    already-active branch.
  2. Concurrency / distributed failures — FSM apply is
    single-threaded per replica. Each replica installs independently
    against the same sidecar values, so the wrap closure key is
    consistent cluster-wide. No cross-replica coordination needed.
  3. Performance — one closure call per cutover apply. Cutover is a
    one-shot operation. Negligible cost.
  4. Data consistency — installer key matches sidecar state on
    success. On installer error, sidecar is durable + apply halts;
    operator restart converges via the already-active branch
    re-install.
  5. Test coverage — four new tests covering all four branches of
    the dispatch.

Test plan

  • go test -race ./internal/encryption/...
  • golangci-lint run clean
  • Full make test in CI
  • No Jepsen suite — the hook is unreachable from production paths
    until 6E-2e-3 wires the installer. Jepsen coverage will attach
    to 6E-2f / Stage 9.

Next slices

  • 6E-2e-2 — Admin RPCs (RotateDEK, RegisterEncryptionWriter)
    routed through the wrap-aware proposer; closes BLOCKER (a).
  • 6E-2e-3 — main.go OpenConfig.RaftCipher + RaftCutoverIndex
    wiring + concrete RaftCutoverWrapInstaller closure +
    CutoverBarrierController fanout + startup-time install on
    sidecar.RaftEnvelopeCutoverIndex != 0.
  • 6E-2fraftEnvelopeWrapEnabled.Store(true) (single-line gate
    flip).

Closes BLOCKER (b) at the apply layer (codex P1 round-3 on PR933).
Every replica's FSM-apply of the EnableRaftEnvelope cutover marker now
invokes a registered RaftCutoverWrapInstaller callback that publishes
the §4.2 raft envelope wrap closure on this node — so a follower that
becomes leader post-cutover already has wrap active without needing the
EnableRaftEnvelope handler to re-run.

Without this hook, the per-leader InstallWrap call in
adapter/encryption_admin.go's runRaftEnvelopeCutoverBarrier is the only
path that installs the wrap. A leader failover between cutover commit
and InstallWrap would let the new leader admit cleartext user writes at
indexes gt cutoverIdx; followers applying those entries would halt on
the §6.3 strict-gt unwrap hook cluster-wide.

What ships:
- RaftCutoverWrapInstaller func type (cutoverIdx, activeRaftDEKID) -> error.
- WithRaftCutoverWrapInstaller Applier option. nil = no-op (preserves
  the pre-6E-2e-1 test surface and the production posture until
  6E-2e-3 wires the installer in main.go).
- applyEnableRaftEnvelope invokes the installer on:
  * Fresh-success branch (after WriteSidecar, after RefreshFromSidecar)
    with (raftIdx, p.DEKID).
  * Already-active branch (FSM replay safety: snapshot-excluded cutover
    re-applies on restart) with (sc.RaftEnvelopeCutoverIndex, p.DEKID).
- Stale-DEKID no-op branch deliberately SKIPS the installer — no
  cutover took effect, installing would publish a closure keyed to the
  wrong DEK.
- Installer errors halt apply (fail-closed). Sidecar write precedes
  installer call so crash recovery converges via the already-active
  branch on restart.

5-lens self-review:
1. Data loss — none; sidecar write is durable before installer call.
   Reverse ordering would lose install state on crash.
2. Concurrency / distributed failures — FSM apply is single-threaded
   per replica; each replica installs independently against the same
   sidecar values, so wrap closure key is consistent cluster-wide. No
   cross-replica coordination needed.
3. Performance — one closure call per cutover apply. Cutover is rare.
4. Data consistency — installer key matches sidecar state on success.
   On installer error, sidecar is durable + apply halts; operator
   restart converges via already-active branch re-install.
5. Test coverage — four new tests: FreshSuccess invokes installer with
   (raftIdx, p.DEKID); AlreadyActive invokes installer with
   (firstIdx, p.DEKID) for replay safety; StaleDEKID skips installer;
   InstallerError halts apply with sidecar still written.

Caller audit for the new semantic:
- applyEnableRaftEnvelope is only called from ApplyRotation's dispatch
  in the same file. No external callers affected.
- Default nil installer = no behavior change for callers that don't
  opt in to WithRaftCutoverWrapInstaller (test surface and current
  main.go wiring).

Cyclop-driven refactor: extracted invokeRaftCutoverWrapInstaller helper
shared between fresh-success and already-active branches so
applyEnableRaftEnvelope stays under the per-function complexity cap.

Next slices (6E-2e-2: admin RPCs through wrap-aware proposer to close
BLOCKER (a); 6E-2e-3: main.go closure source + startup-time install;
6E-2f: gate flip) tracked in the partial design doc.
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 17 minutes and 25 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ff82c45-80dd-4504-a9d6-f09f32098f8d

📥 Commits

Reviewing files that changed from the base of the PR and between ed82233 and 031031e.

📒 Files selected for processing (3)
  • docs/design/2026_05_31_partial_6e_enable_raft_envelope.md
  • internal/encryption/applier.go
  • internal/encryption/applier_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/encryption-6e-2-e-1-fsm-apply-auto-install-wrap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp

bootjp commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

@claude review

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @bootjp's task in 3m 1s —— View job


PR Review: feat(encryption): Stage 6E-2e-1 — Applier FSM-apply wrap installer hook

  • Gather context and understand request
  • Read changed files
  • Analyze design doc update
  • Review applier.go changes
  • Review applier_test.go changes
  • Post final review feedback

Overall Assessment

Solid, focused implementation. The crash-recovery ordering (WriteSidecarinvokeInstaller) is correctly reasoned and verified by test. The design — installing the wrap on every FSM apply path rather than only on the proposing leader — is the right fix for the BLOCKER (b) failover hazard. The four-branch test coverage is good. A few findings below, two of which are worth addressing.


Findings

Finding 1 — Minor doc inaccuracy: activeRaftDEKID contract comment is misleading (non-blocking)

applier.go:155

The RaftCutoverWrapInstaller type comment says:

activeRaftDEKID is the sidecar.Active.Raft value at apply time.

On the already-active replay branch, the code passes p.DEKID (from the Raft entry payload), not sc.Active.Raft. These are equal on fresh-success (the stale-DEK check guarantees it), but they diverge if a RotateDEK runs after the cutover — at that point sc.Active.Raft has advanced past the cutover DEK, while p.DEKID correctly reflects the DEK that was active at cutover time.

The code is correct — p.DEKID is what the §6.3 hook needs. The comment is what's wrong. Suggested fix:

// activeRaftDEKID is the Raft DEK ID that was active at cutover time
// (i.e. the DEKID carried in the EnableRaftEnvelope entry payload).
// On fresh-success apply this equals sidecar.Active.Raft; on
// already-active replay it is the payload DEKID from the original
// entry, which may differ from sidecar.Active.Raft if a RotateDEK
// ran after cutover.

Fix this →


Finding 2 — Test gap: AlreadyActive_InvokesInstaller doesn't assert activeRaftDEKID (worth fixing)

applier_test.go:2635

The test checks rec.calls[1].cutoverIdx (passes) but never checks rec.calls[1].activeRaftDEKID. Since Finding 1 hinges on which DEK is passed on the already-active replay path, this is worth pinning explicitly:

if got, want := rec.calls[1].activeRaftDEKID, cutoverBootstrapRaftDEKID; got != want {
    t.Errorf("already-active installer activeRaftDEKID = %d, want %d (must be the cutover-entry DEKID, not sc.Active.Raft)",
        got, want)
}

Fix this →


Finding 3 — Test gap: no coverage for installer error on the already-active branch (minor, non-blocking)

TestApplyRotation_EnableRaftEnvelope_InstallerError_HaltsApply exercises only the fresh-success branch. A corresponding test for the already-active replay branch would verify the same fail-closed contract and the crash-recovery sidecar-before-installer invariant for FSM replays. This is low urgency since the shared invokeRaftCutoverWrapInstaller helper is a single call site, but it would improve branch coverage for the idempotent re-install path.


Positive notes

  • Crash-recovery ordering (WriteSidecar before invokeInstaller) is correctly implemented and the InstallerError_HaltsApply test explicitly asserts that the sidecar is durable after an installer failure. This is exactly the right way to document an ordering invariant.
  • Stale-DEK branch deliberately skips the installer — and the test pins this with a clear rationale (wrong-DEK closure). Correct.
  • invokeRaftCutoverWrapInstaller helper keeps the two call sites DRY and makes the nil-guard a single place to audit.
  • Design doc update correctly splits 6E-2e into three numbered sub-slices (6E-2e-1/2/3) and marks 6E-2d with the right PR reference. Good housekeeping.

Findings 1 and 2 are linked — fixing the comment and adding the missing assertion together would fully lock down the activeRaftDEKID contract on the already-active path. Everything else LGTM.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Stage 6E-2e-1 of the Raft envelope cutover design by introducing the WithRaftCutoverWrapInstaller hook in internal/encryption/applier.go. This hook is invoked during both fresh-success and already-active branches of applyEnableRaftEnvelope to ensure that every replica's FSM-apply of the cutover marker publishes the wrap closure, mitigating leader-failover hazards. Comprehensive unit tests have been added in internal/encryption/applier_test.go to validate the installer's invocation, replay safety, stale-DEK handling, and error propagation. The review feedback suggests a minor improvement to pass sc.Active.Raft directly instead of p.DEKID when invoking the installer to ensure robustness and consistency with the state cache.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread internal/encryption/applier.go Outdated
return errors.Wrap(err, "applier: write sidecar for already-active raft-cutover no-op")
}
return nil
return a.invokeRaftCutoverWrapInstaller(sc.RaftEnvelopeCutoverIndex, p.DEKID, "already-active replay")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the already-active replay branch, the active Raft DEK ID is passed to the installer. While p.DEKID is guaranteed to be equal to sc.Active.Raft due to the preceding check on line 1214, passing sc.Active.Raft directly is more robust, self-documenting, and strictly adheres to the RaftCutoverWrapInstaller contract which specifies that the second argument should be the sidecar.Active.Raft value at apply time.

Suggested change
return a.invokeRaftCutoverWrapInstaller(sc.RaftEnvelopeCutoverIndex, p.DEKID, "already-active replay")
return a.invokeRaftCutoverWrapInstaller(sc.RaftEnvelopeCutoverIndex, sc.Active.Raft, "already-active replay")
References
  1. Prioritize using the primary state cache directly rather than passing state variables through call chains to ensure consistency and correctness of critical state updates.

Comment thread internal/encryption/applier.go Outdated
// first) would leave the wrap published but the sidecar
// pre-cutover on crash, breaking the equality the §6.3 hook
// relies on cluster-wide.
return a.invokeRaftCutoverWrapInstaller(raftIdx, p.DEKID, "fresh-success apply")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In the fresh-success apply branch, the active Raft DEK ID is passed to the installer. While p.DEKID is guaranteed to be equal to sc.Active.Raft due to the preceding check on line 1214, passing sc.Active.Raft directly is more robust, self-documenting, and strictly adheres to the RaftCutoverWrapInstaller contract which specifies that the second argument should be the sidecar.Active.Raft value at apply time.

Suggested change
return a.invokeRaftCutoverWrapInstaller(raftIdx, p.DEKID, "fresh-success apply")
return a.invokeRaftCutoverWrapInstaller(raftIdx, sc.Active.Raft, "fresh-success apply")
References
  1. Prioritize using the primary state cache directly rather than passing state variables through call chains to ensure consistency and correctness of critical state updates.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a510abcab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/encryption/applier.go Outdated
return errors.Wrap(err, "applier: write sidecar for already-active raft-cutover no-op")
}
return nil
return a.invokeRaftCutoverWrapInstaller(sc.RaftEnvelopeCutoverIndex, p.DEKID, "already-active replay")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Install wrap even when replayed cutover DEK is stale

This installer call is only reachable after the stale-DEK guard above, so a crash/replay after the cutover has already applied and a later raft-DEK rotation has advanced sidecar.Active.Raft will skip installation entirely because the original cutover payload's p.DEKID no longer matches the sidecar. That is exactly the replay path this hook is meant to repair: the sidecar already has RaftEnvelopeCutoverIndex != 0, but the in-process ShardGroup wrap pointer is nil, so a restarted node that becomes leader can admit cleartext proposals above the cutover. The already-active case should be handled before treating the payload DEK as stale, and should install using the current sc.Active.Raft DEK.

Useful? React with 👍 / 👎.

Round-1 review fixes for PR #944:

Codex P1 round-1 + Gemini medium x2 — Install wrap even when replayed
cutover DEK is stale.
applyEnableRaftEnvelope tested the stale-DEK condition BEFORE the
already-active condition. After a successful cutover at index N with
DEK D1, a subsequent RotateDEK advances sidecar.Active.Raft to D2.
On process restart, FSM replay re-applies the original cutover marker
(p.DEKID=D1). The reverse order short-circuits at the stale-DEK
check: p.DEKID != sc.Active.Raft → benign no-op → installer NEVER
fires → a freshly-elected leader on this node admits cleartext
proposals above the cutover index and bricks the §6.3 strict-gt
unwrap hook on every replica.

Fix:
- Swap the check order: test RaftEnvelopeCutoverIndex != 0 FIRST so
  a replay of the original marker after RotateDEK lands on the
  already-active branch (which DOES invoke the installer).
- Pass sc.Active.Raft to the installer (gemini medium x2): the wrap
  closure must key to the CURRENT active DEK on every replica so the
  §6.3 hook unwraps with the same key. p.DEKID was technically
  equivalent to sc.Active.Raft in the fresh-success branch (the
  stale-DEK guard above ensures it), but the already-active replay
  branch is precisely the case where they diverge.

The storage variant applyEnableStorageEnvelope keeps its original
check order because it has no installer hook — the stale-DEK and
already-active branches both result in advanceRaftAppliedIndex-only
behavior, so the order is observationally irrelevant. The raft path's
order swap is documented in the function-level comment.

New regression test:
TestApplyRotation_EnableRaftEnvelope_AlreadyActive_StaleDEKReplay_InvokesInstaller
pins the exact codex P1 scenario:
  1. Original cutover at idx 500 with D1.
  2. RotateDEK advances Active.Raft to D2.
  3. FSM replay of the original marker (p.DEKID=D1, sc.Active.Raft=D2).
  4. Asserts: installer fires; activeRaftDEKID == D2 (sc.Active.Raft,
     NOT replayed p.DEKID); cutoverIdx == 500 (original sidecar value).

Updated existing assertions in
TestApplyRotation_EnableRaftEnvelope_FreshSuccess_InvokesInstaller
and TestApplyRotation_EnableRaftEnvelope_AlreadyActive_InvokesInstaller
remain valid because Active.Raft == p.DEKID in those scenarios.

Caller audit for the semantic change:
- applyEnableRaftEnvelope is internal; only called from
  ApplyRotation dispatch. No external callers affected.
- The behavior change affects ONLY the replay-after-rotate scenario;
  every other branch (fresh-success, already-active with matching
  DEK, stale-DEK race without cutover) is unchanged.
@bootjp

bootjp commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

@claude review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant