feat(redis): flip one-phase txn dedup default to on (closes #937)#943
feat(redis): flip one-phase txn dedup default to on (closes #937)#943bootjp wants to merge 3 commits into
Conversation
Closes the rollout of the option-2 idempotency design landed by docs/design/2026_05_21_proposed_txn_secondary_idempotency.md by flipping adapter.RedisServer.onePhaseTxnDedup from default-off to default-on. No new mechanism — the FSM probe, the PrevCommitTS wire field, and every adapter-side write-set reuse path already ship. Default-on is authorized by the parent design's M4 7-day green criterion in the dedup-mode Jepsen workflow; 12 consecutive green runs on the stress profile (2026-05-31 → 2026-06-10) exceed that threshold. Triggered by Jepsen scheduled-stress run 27033231956 (2026-06-05) on the control baseline workflow (gate off) which surfaced :duplicate-elements and the related real-time cycle anomalies, exactly the shape the parent design was written to prevent. Issue #937 tracks that failure; this proposal closes it by retiring the unprotected default. Doc commit only. Implementation commit follows in the same PR.
Implements the proposal in docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md. The Redis adapter's one-phase idempotency dedup (adapter.RedisServer.onePhaseTxnDedup) now defaults on. The ELASTICKV_REDIS_ONEPHASE_DEDUP env var inverts from opt-in (=1 enabled, anything else disabled) to opt-out (=0 disabled, anything else enabled). The WithOnePhaseTxnDedup constructor option still trumps the env var. Authorization for the flip is the parent design's M4 7-day green criterion in the dedup-mode Jepsen workflow (.github/workflows/jepsen-test-scheduled-dedup.yml), met by 12 consecutive green runs over 10 days (2026-05-31 → 2026-06-10) on the stress profile that produced the parent design's trigger anomaly. The parent's R5 (FSM determinism across a rolling upgrade) is discharged — the probe code has shipped on every production node for months, so the writer (emission) flipping on is uniformly understood. To preserve the legacy-path coverage during the post-flip observation window, the control workflow (.github/workflows/jepsen-test-scheduled.yml) now sets ELASTICKV_REDIS_ONEPHASE_DEDUP=0 explicitly at the job env level so its semantics survive the default change. Retirement of that workflow is a 30-day follow-up. Closes #937 — the failure that triggered this work is the control baseline's expected unprotected behaviour; with default-on it falls back into the path the dedup-mode workflow has been exercising clean for 12 consecutive days. Verified: - go vet ./... clean - go build ./... clean - go test -run 'Dedup|OnePhase|PrevCommit|Idempot' ./adapter/ — pass - go test -run 'TestRedis|TestList|TestSet|TestZSet|TestStream|TestExec|TestMulti' ./adapter/ — pass (169s) - golangci-lint run ./adapter/... — 0 issues
|
Note Currently processing new changes in this PR. This may take a few minutes, please wait... ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR flips the default state of one-phase transaction dedup in the Redis adapter from off to on. The initialization logic in ChangesRedis One-Phase Transaction Dedup Default-On Implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request flips the default state of the Redis one-phase transaction deduplication feature (onePhaseTxnDedup) to enabled, updating the configuration logic and adding a detailed design document. The review feedback highlights a broken reference link in the new design document and suggests improving the environment variable parsing in adapter/redis.go to also recognize 'false' as a disabling value, preventing potential operator misconfiguration.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
| The control workflow (`jepsen-test-scheduled.yml`, gate off) continues | ||
| to surface `:duplicate-elements` and the related real-time cycle | ||
| anomalies as expected; the 2026-06-05 18:37 failure ([#937]) is the |
There was a problem hiding this comment.
The reference link [#937] on this line is broken because there is no corresponding link definition at the bottom of the document. Please use an inline link here to ensure it resolves correctly.
| anomalies as expected; the 2026-06-05 18:37 failure ([#937]) is the | |
| anomalies as expected; the 2026-06-05 18:37 failure ([#937](https://github.com/bootjp/elastickv/issues/937)) is the |
| // See docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md. | ||
| // ELASTICKV_REDIS_ONEPHASE_DEDUP=0 opts out; the WithOnePhaseTxnDedup | ||
| // constructor option still trumps the env var. | ||
| onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") != "0", |
There was a problem hiding this comment.
Checking only != "0" to enable the feature by default can be a usability footgun for operators. If an operator attempts to explicitly disable the feature using common falsey values like "false", the check will evaluate to true and keep the feature enabled. Consider also checking against "false" to prevent accidental misconfiguration.
| onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") != "0", | |
| onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") != "0" && strings.ToLower(os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP")) != "false", |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6af333f48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // See docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md. | ||
| // ELASTICKV_REDIS_ONEPHASE_DEDUP=0 opts out; the WithOnePhaseTxnDedup | ||
| // constructor option still trumps the env var. | ||
| onePhaseTxnDedup: os.Getenv("ELASTICKV_REDIS_ONEPHASE_DEDUP") != "0", |
There was a problem hiding this comment.
Preserve SET overwrites before defaulting dedup on
With this default flip, every standalone SET now takes the set() dedup branch unless operators opt out; that branch calls runTransactionWithDedup, whose applySet path does not match the legacy standalone executeSet semantics for existing collection keys. For example, SET k v after RPUSH k x now returns WRONGTYPE (and for other collection encodings can write the string without deleting old collection rows), whereas the previous setLegacy path used replaceWithStringTxn/deleteLogicalKeyElems to overwrite the logical key. This breaks normal Redis overwrite behavior for any default-configured server that sets a key previously used as a list/hash/set/zset/stream.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/jepsen-test-scheduled.yml (1)
42-54: ⚡ Quick winAdd a fail-fast assertion that this workflow is truly dedup-OFF.
The explicit env var is good, but adding a launch-step assertion (
== "0") will prevent silent control/baseline drift in future edits.Suggested hardening
- name: Launch demo cluster run: | set -euo pipefail + if [ "${ELASTICKV_REDIS_ONEPHASE_DEDUP:-}" != "0" ]; then + echo "FATAL: ELASTICKV_REDIS_ONEPHASE_DEDUP must be '0' in the control workflow" + exit 2 + fi mkdir -p "$GOCACHE" /tmp/go-tmp export GOTMPDIR=/tmp/go-tmp nohup go run cmd/server/demo.go > /tmp/elastickv-demo.log 2>&1 &🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/jepsen-test-scheduled.yml around lines 42 - 54, Add a hard fail-fast assertion that ELASTICKV_REDIS_ONEPHASE_DEDUP is exactly "0" at workflow startup to prevent silent drift; implement this as an early job/step (e.g., a "validate-dedup-flag" step) that reads the ELASTICKV_REDIS_ONEPHASE_DEDUP env var and exits non‑zero if its value is not the literal string "0", causing the workflow to fail immediately when the flag is incorrect. Reference the ELASTICKV_REDIS_ONEPHASE_DEDUP env var in this validation step and place it before any test/launch steps that depend on the dedup-OFF baseline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md`:
- Around line 83-84: The markdown uses an unresolved reference token "[`#937`]" in
the sentence about the 2026-06-05 failure; fix by either adding a matching
reference definition for "[`#937`]" (e.g., add a reference entry at the end of the
document with the correct URL and optional title) or replace the token with an
inline link such as [`#937`](https://example.com/issue/937) so the link renders
correctly; update the reference to point to the actual issue/PR URL and ensure
the label matches the token used in the text.
---
Nitpick comments:
In @.github/workflows/jepsen-test-scheduled.yml:
- Around line 42-54: Add a hard fail-fast assertion that
ELASTICKV_REDIS_ONEPHASE_DEDUP is exactly "0" at workflow startup to prevent
silent drift; implement this as an early job/step (e.g., a "validate-dedup-flag"
step) that reads the ELASTICKV_REDIS_ONEPHASE_DEDUP env var and exits non‑zero
if its value is not the literal string "0", causing the workflow to fail
immediately when the flag is incorrect. Reference the
ELASTICKV_REDIS_ONEPHASE_DEDUP env var in this validation step and place it
before any test/launch steps that depend on the dedup-OFF baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8efe9515-79ab-4fbd-9052-b117197c57a1
📒 Files selected for processing (3)
.github/workflows/jepsen-test-scheduled.ymladapter/redis.godocs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.md
|
@claude review |
PR #943 round-1 codex P1: flipping onePhaseTxnDedup to default-on exposed a pre-existing semantic divergence on the standalone SET path. txnContext.applySet (adapter/redis.go:2356) returns WRONGTYPE when the target key already holds a list/hash/set/zset/stream, while the legacy setLegacy → executeSet → replaceWithStringTxn path correctly deletes the collection's logical elements and writes the string. Real Redis allows SET to overwrite any prior type unconditionally; the default flip would have silently broken normal overwrite semantics for any default-configured server. This commit narrows the default flip — the standalone SET path now requires *two* gates to route through runTransactionWithDedup: onePhaseTxnDedup (the umbrella, default on) AND a new standaloneSetDedup sub-gate (default off, opt-in via WithStandaloneSetDedup or ELASTICKV_REDIS_ONEPHASE_DEDUP_SET=1). With the default config, the standalone SET branch falls through to setLegacy, preserving legacy overwrite semantics. MULTI/EXEC and list-push paths — the parent design's actual M4-validated workloads — keep the dedup default-on win unchanged. Bringing applySet to parity (collection-deletion + string write inside the dedup txn buffer) is a larger surgery tracked as a follow-up; until then the sub-gate stays default-off. Per the CLAUDE.md "add failing test then fix" convention: - TestRedis_SET_OverwritesList_UnderDefaultGate (new file adapter/redis_set_overwrite_default_test.go) seeds RPUSH k x, runs SET k v under default config, asserts OK + GET returns v. Fails on the pre-fix build (WRONGTYPE), passes on the fixed build. - Existing TestStandaloneSetDedup_* tests now set standaloneSetDedup: true alongside onePhaseTxnDedup: true to keep pinning the dedup-on routing they were designed to exercise. Also addresses gemini medium: "[#937]" broken reference-style link in the design doc replaced with an inline link. Verified: go vet ./... clean, go build ./... clean, golangci-lint run ./adapter/... 0 issues, TestRedis_SET_OverwritesList_UnderDefaultGate + TestStandaloneSetDedup_* + Dedup/OnePhase test family all pass.
|
@claude review Round-1 responses (commit a5909eb): codex P1 — Preserve SET overwrites before defaulting dedup on ✅ FIXEDYou're right — txnContext.applySet (adapter/redis.go:2356) returns WRONGTYPE for SET-over-collection, while the legacy setLegacy → executeSet → replaceWithStringTxn path correctly deletes the collection's logical elements and writes the string. Flipping onePhaseTxnDedup default-on without a separate gate on standalone SET would have broken normal Redis overwrite semantics. Fix: introduced a per-path sub-gate Regression test Bringing applySet to parity with executeSet is a larger surgery; tracking as a follow-up before re-flipping the sub-gate. gemini medium — broken [#937] reference link ✅ FIXEDReplaced with an inline link on line 83. gemini medium — accept "false" as opt-out ❌ DECLINEDelastickv's env-var convention across the codebase is binary "0"/"1" (e.g. ELASTICKV_REDIS_TRACE == "1" two lines above the SET dedup site). Accepting "false" without also accepting "no" / "off" / "FALSE" would be inconsistent; accepting all of them would diverge from every other elastickv env-var read. The opt-out value "0" is documented in the godoc and in-file comment. Project-wide convention shift toward truthy-string parsing is a separate refactor — out of scope here. CI Analyze (go) failure — workflow permissions, not code 🚧 NOT FIXED HERECodeQL ran fine (287/605 files, 7 diagnostics, SARIF exported) but the upload step failed with coderabbit minor — same [#937] link as gemini ✅ FIXED (subsumed)Verification
|
|
Claude finished @bootjp's task in 3m 51s —— View job PR #943 Review — Round 2
SummaryThe round-1 fixes are solid. The core default-on flip is well-supported and the codex-P1 regression is correctly contained behind the
|
Summary
Flips
adapter.RedisServer.onePhaseTxnDedupfrom default-off to default-on, closing the rollout of the option-2 idempotency design landed bydocs/design/2026_05_21_proposed_txn_secondary_idempotency.md. No new mechanism — the FSM probe (dedupProbeOnePhase), the wire change (TxnMeta.PrevCommitTS, V2-only when non-zero), and every adapter-side reuse path already ship and are exercised every day by the dedup-mode Jepsen workflow.Closes #937.
Authorization
The parent design's
M47-day criterion (source) reads:jepsen-test-scheduled-dedup.yml(Redis stress profile,ELASTICKV_REDIS_ONEPHASE_DEDUP=1) has produced 12 consecutive green runs over 10 calendar days, 2026-05-31 → 2026-06-10. Run IDs are listed in the new design doc; the most recent are:Exceeds the 7-day threshold. The parent's R5 (FSM determinism across a rolling upgrade) is discharged — the probe code has shipped on every production node for months.
Change
docs/design/2026_06_10_proposed_redis_onephase_dedup_default_on.mdadapter/redis.go== "1"to!= "0"; updates the comment, godoc, and field comment..github/workflows/jepsen-test-scheduled.ymlELASTICKV_REDIS_ONEPHASE_DEDUP: "0"to the job env so the control baseline keeps its meaning across the default change. Retirement of the control workflow is a 30-day follow-up.Behaviour change
ELASTICKV_REDIS_ONEPHASE_DEDUP=0(single-env-var rollback, no binary revert). The constructor optionWithOnePhaseTxnDedup(false)still trumps the env.=1(the dedup-mode workflow's existing setting) continues to resolve totrue. No breakage.Risk
prev_commit_ts(the parent design's R1 + the existingkv/fsm.goprobe). No write is dropped.PrevCommitTSprobe on each retryable attempt. Cost is oneCommittedVersionAtMVCC lookup at the primary key + stalecommit_ts; this is the same cost the dedup-mode workflow has been paying for 12 days without regression. No hot-path allocation change.:duplicate-elements,:future-read,:G-single-item-realtime) become unreachable on the protected path. Issue #937 is the most recent control-baseline reproduction.adapter/redis_*_dedup_test.gosuites cover both gate states. Tests that construct&RedisServer{...}directly (zero-valueonePhaseTxnDedup) still observe the legacy path, so the off-path remains covered at unit level.Verification
go vet ./...— cleango build ./...— cleango test -run 'Dedup|OnePhase|PrevCommit|Idempot' ./adapter/— pass (1.2s)go test -run 'TestRedis|TestList|TestSet|TestZSet|TestStream|TestExec|TestMulti' ./adapter/— pass (169s)golangci-lint run ./adapter/...— 0 issuesFollow-ups
jepsen-test-scheduled.yml(now explicit dedup-off) still adds signal. If not, retire it.2026_06_03_partial_dynamodb_onephase_dedup.md'sM2. Not part of this PR.Summary by CodeRabbit
Documentation
Chores
Tests