docs(design): propose hotspot split M3 — automation#951
Conversation
|
Warning Review limit reached
More reviews will be available in 52 minutes and 6 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 3m 20s —— View job Review: M3 Automation Design Doc
This is a high-quality, well-structured proposal. All cited Issues1. Score normalization formula is underspecified (§5.1) The doc says the score is computed over "a 1-minute equivalent window (trailing flush columns summed and normalized to the flush Step)." But the exact formula is not given. When This is exactly what OQ-8 is pointing at, but the issue goes beyond units — the doc should commit to a specific formula. If the threshold stays in "ops/min", the normalization formula must be stated explicitly so unit tests in M3-PR2 can pin it. Without that, two contributors can independently implement different normalizations that both pass tests but produce different sensitivities. Recommendation: close OQ-8 in this doc by committing to 2.
The detector needs a stable reference to the last flush time. It's worth calling out explicitly whether the detector gets this from a This is a concrete implementation question that will arise in M3-PR2 — resolving it here avoids a revision mid-PR. 3. The design proposes seeding cooldown after election from "any route whose This means the durable-cooldown reconstruction requires adding a timestamp field to Recommendation: add 4. Single-key isolation (§5.4.2) requires two The proposed single-key isolation approach is: "split key = hot key, and a follow-on split at hot key's successor so the child contains only that key." This is two Options worth deciding now:
This affects both the scheduler implementation and the acceptance criteria in §8.4. The design should commit to one approach. 5. Detector state map doesn't describe how manual splits are reconciled (§6.3) The Recommendation: at the start of each evaluation cycle the detector should reconcile its state map against the live route set (e.g. discard state for any 6. §3.3 correctly notes that This is minor given detection is off-path, but worth noting: the five-lens performance review in M3-PR2 should benchmark or at least measure the lock contention under load (e.g. with Open Questions — RecommendationsOQ-1 ( OQ-2 (down-side hysteresis shape): Hold (current proposal) is the right choice. Decaying by one per cycle introduces an implicit max-hold time that's a function of OQ-3 (no-evidence fallback): Decline (current proposal). Allowing a min/max-observed-key midpoint as fallback re-introduces the midpoint-only problem the parent doc §3.1.3 explicitly rejects for unobserved subspaces. Better to push the operator toward enabling OQ-5 (leader-local vs cluster-wide): Leader-local is sufficient for M3. The rationale in §4.1 is well-argued (the leader serves the load it would split on). Flag this as a known limitation in the doc, not as a deferred OQ — it's already a deliberate non-goal. OQ-6 (cross-group gate): Recommend gating behind Minor / Nits
SummaryFundamentally sound design. The key things to resolve before M3-PR1 begins:
|
There was a problem hiding this comment.
Code Review
This pull request proposes a design document for Milestone 3 of the Hotspot Shard Split, focusing on automation. The design leverages the existing keyviz sampler to detect hot routes, implements a pure detector with hysteresis and cooldown, and schedules splits via the SplitRange RPC. The review feedback highlights three key design improvements: implementing a garbage-collection mechanism for the leader-local state map to prevent memory leaks, addressing the lack of creation timestamps in RouteDescriptor to enable accurate cooldown seeding on new leaders, and restricting auto-split eligibility to routes in RouteStateActive to avoid interfering with transitional migration states.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| The detector's hysteresis counters and cooldown deadlines live in a **leader-local in-memory state map** — they are **not** replicated through Raft. On a default-group leadership change: | ||
|
|
||
| - The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses). | ||
| - The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed. | ||
| - **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state. | ||
|
|
||
| Keeping detector state leader-local (not Raft-replicated) is a deliberate cost/safety trade: replicating per-cycle counters would add Raft traffic for what is a best-effort heuristic, and the worst case of a lost counter is a *delayed* split (safe), never an *erroneous* one. We state this explicitly so a reviewer does not mistake the non-replication for an oversight. |
There was a problem hiding this comment.
To prevent unbounded memory growth in the leader-local in-memory state map, the design should specify a pruning/garbage-collection mechanism. Since route IDs are deleted from the catalog after a split (and replaced by child routes), keeping their hysteresis counters and cooldown deadlines in the map indefinitely will cause a memory leak.\n\nThe detector should periodically (e.g., at the end of each evaluation cycle) compare the keys of the state map against the active routes in the current catalog snapshot and evict any route IDs that are no longer present.
|
|
||
| - The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses). | ||
| - The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed. | ||
| - **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state. |
There was a problem hiding this comment.
The proposal to seed cooldowns on a new leader by checking if a route's creation is "recent per the catalog" has a design gap: the RouteDescriptor struct in catalog.go does not contain any creation timestamp, split timestamp, or catalog version of origin. Without this metadata, a newly elected leader cannot determine whether a route with a parent_route_id was split 1 minute ago (and should be in cooldown) or 1 hour ago (cooldown expired).\n\nTo resolve OQ-7 and make this safety mechanism viable, the design should either:\n1. Add a durable split_timestamp or created_at field to the RouteDescriptor (or catalog metadata).\n2. Store a mapping of catalog_version -> timestamp in the catalog metadata so the leader can resolve the age of the route's origin version.
|
|
||
| The target-group selection slots **behind the same `SplitRange` interface** — the scheduler code path is identical; only the `target_group_id` field flips from zero (same-group) to a chosen group once M2 is present. M3 ships the policy hook disabled (always same-group) and a follow-on M3.x / M2-completion PR enables target selection. OQ-6 asks whether to gate target selection on an explicit `--autoSplitCrossGroup` sub-flag. | ||
|
|
||
| ### 6.3 Concurrency with in-flight splits |
There was a problem hiding this comment.
To ensure safety during migrations, the design should explicitly state that only routes in RouteStateActive are eligible for auto-split detection. Routes in transitional states such as RouteStateWriteFenced, RouteStateMigratingSource, or RouteStateMigratingTarget must be excluded from candidacy. This prevents concurrent automated splits from interfering with active M2 migration jobs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3dc133ecb7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **M3 reads detection input from the keyviz sampler**, not from `Route.Load`. | ||
| - The engine's `RecordAccess` / `splitRange` / `Route.Load` path is **retired** (§3.4). No new call site is added for it. | ||
|
|
||
| This means **auto-split requires keyviz sampling to be active**. `--autoSplit` implies the sampler is constructed and flushing (§7.1). We do *not* require the operator to also pass `--keyvizEnabled`; enabling auto-split turns on the minimum sampling auto-split needs (route-granular counts; sub-range buckets and Top-K stay opt-in via the existing keyviz flags and only sharpen split-key selection — see §5.4). |
There was a problem hiding this comment.
Enable split-key evidence with autoSplit defaults
With the current keyviz defaults (DefaultKeyBucketsPerRoute = 1 and --keyvizHotKeysEnabled=false), turning on only --autoSplit constructs a route-granular sampler, but §5.4.3 later says the detector declines whenever neither sub-range buckets nor Top-K are enabled. That means the default advertised --autoSplit mode can never produce a split key or satisfy the M3 acceptance criteria unless operators also discover and enable another keyviz flag; either make --autoSplit enable a split-key evidence source or document it as required configuration.
Useful? React with 👍 / 👎.
| Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint: | ||
|
|
||
| 1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket boundary nearest the cumulative-load median, and use that boundary key (reconstructed via `MemSampler.SubBucketBoundsFor`, `keyviz/sampler.go:559`) as the split key. This places the boundary where load is actually halved. | ||
| 2. **Single-key skew — isolate the hot key.** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key is isolated into its own narrow child (split key = hot key, and a follow-on split at hot key's successor so the child contains only that key). This is the single-hot-key hotspot case the whole feature targets. |
There was a problem hiding this comment.
Exempt compound hot-key splits from cooldown
This single-key isolation plan requires two SplitRange calls: first at the hot key, then a follow-on split at its successor. However §5.3 immediately puts both children into the 10-minute cooldown after a split is scheduled, and §6.3 defaults to one split per cycle; unless the scheduler treats this as an atomic/explicitly exempt compound operation, the child containing the hot key cannot be split again until cooldown expires, so the hot key is not isolated as claimed.
Useful? React with 👍 / 👎.
Design-doc-only PR. No implementation code. Per the repo's design-doc-first workflow, this lands the M3 proposal for review before any implementation.
New file:
docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md(Status: Proposed,Author: bootjp,Date: 2026-06-11).Summary
Milestone 3 of the hotspot shard split design: automation — auto-detection + an auto-split scheduler. M1 (durable versioned route catalog, manual same-group
SplitRange, route watcher) has shipped; M2 (#945) proposes the cross-group migration plane. M3 composes with M2 but delivers first value standalone: auto-detection + same-group auto-split (the M1 capability) work without M2; cross-group target selection slots in behind the sameSplitRangeinterface once M2 lands.It closes the gap the parent doc §1 calls out:
distribution/engine.go'sRecordAccess/threshold-splitRangeexists but is wired into nothing (only tests call it; verified by grep) and is midpoint-only + mutates the engine directly (bypassing the catalog/watcher).Key design points
MemSampleris already wired into the exact request-path functions the parent doc names (groupMutations→observeMutation,observeReadinkv/sharded_coordinator.go), is allocation-free on the hot path, splits reads vs writes, is windowed, has sub-range + Top-K split-key evidence, and bounds cardinality. Driving detection off it avoids a second parallel pipeline. The engine's deadRecordAccess/splitRange/Route.Loadpath is removed in M3-PR1. M3 adds zero hot-path cost — it consumes already-flushed windows off-path.MemSampler.Snapshotfor routes it leads (the leader serves the load it would split on). NoReportAccessRPC and no RPC piggybacking; cluster fan-out stays an admin-UI concern and can slot behind the detector interface later if needed.score = write_ops*Ww + read_ops*Wr(4/1), 50k ops/min threshold, 3-consecutive-window up-side hysteresis + 0.8× down-side band, per-route 10-min cooldown (monotonic deadline, not wall clock), split-key from observed distribution (sub-range p50 / single-key isolation / decline when no evidence — never blind midpoint), guardrails: min route span, max routes, per-cycle split budget.proto.Distribution.SplitRangeRPC withexpected_catalog_versionCAS; all catalog mutations go throughSplitRange(repo convention). Same-group today; once M2 lands the least-loaded-group target policy flips thetarget_group_idfield on the same RPC.--autoSplit; runtime kill switch (atomic.Bool); fixed-enum metrics (no per-route/key labels); stable slog keys; detector state is leader-local and resets on election (re-earns confidence; cooldown re-seeded fromparent_route_idlineage) — stated explicitly so non-replication isn't mistaken for an oversight.Every current-state claim cites concrete
file:lineevidence, verified before writing.Open questions
Route.Load/Engine.Stats, or keepStatsfor diagnostics?[0.8×, 1.0×)band, or decay it?minRouteSpanwithout per-route key counts: approximate from sampler evidence, or add a cheap key-count estimate?--autoSplitCrossGroup?parent_route_id+ a catalog split timestamp enough, or do we need durablelast_split_at?Stepto avoid coupling sensitivity to--keyvizStep?Test plan
Doc-only — no Go tests run. Markdown verified clean; all cited
file:linereferences confirmed againstmain. The doc's own §8.2 lays out the implementation-phase test strategy (table-driven detector unit tests +pgregory.net/rapidinvariants, detector→SplitRangeintegration in the 3-node demo, Jepsen hotspot workload deferred to M4 per the parent doc).