Skip to content

docs(design): propose hotspot split M3 — automation#951

Open
bootjp wants to merge 1 commit into
mainfrom
design/hotspot-split-m3-automation
Open

docs(design): propose hotspot split M3 — automation#951
bootjp wants to merge 1 commit into
mainfrom
design/hotspot-split-m3-automation

Conversation

@bootjp

@bootjp bootjp commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Design-doc-only PR. No implementation code. Per the repo's design-doc-first workflow, this lands the M3 proposal for review before any implementation.

New file: docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md (Status: Proposed, Author: bootjp, Date: 2026-06-11).

Summary

Milestone 3 of the hotspot shard split design: automation — auto-detection + an auto-split scheduler. M1 (durable versioned route catalog, manual same-group SplitRange, route watcher) has shipped; M2 (#945) proposes the cross-group migration plane. M3 composes with M2 but delivers first value standalone: auto-detection + same-group auto-split (the M1 capability) work without M2; cross-group target selection slots in behind the same SplitRange interface once M2 lands.

It closes the gap the parent doc §1 calls out: distribution/engine.go's RecordAccess/threshold-splitRange exists but is wired into nothing (only tests call it; verified by grep) and is midpoint-only + mutates the engine directly (bypassing the catalog/watcher).

Key design points

  • Detection signal — reuse the keyviz sampler, retire the engine counters. The keyviz MemSampler is already wired into the exact request-path functions the parent doc names (groupMutationsobserveMutation, observeRead in kv/sharded_coordinator.go), is allocation-free on the hot path, splits reads vs writes, is windowed, has sub-range + Top-K split-key evidence, and bounds cardinality. Driving detection off it avoids a second parallel pipeline. The engine's dead RecordAccess/splitRange/Route.Load path is removed in M3-PR1. M3 adds zero hot-path cost — it consumes already-flushed windows off-path.
  • Aggregation — leader-local, no new RPC. The detector runs on the default-group leader and reads its own MemSampler.Snapshot for routes it leads (the leader serves the load it would split on). No ReportAccess RPC and no RPC piggybacking; cluster fan-out stays an admin-UI concern and can slot behind the detector interface later if needed.
  • Detectorscore = write_ops*Ww + read_ops*Wr (4/1), 50k ops/min threshold, 3-consecutive-window up-side hysteresis + 0.8× down-side band, per-route 10-min cooldown (monotonic deadline, not wall clock), split-key from observed distribution (sub-range p50 / single-key isolation / decline when no evidence — never blind midpoint), guardrails: min route span, max routes, per-cycle split budget.
  • Scheduler — invokes the existing proto.Distribution.SplitRange RPC with expected_catalog_version CAS; all catalog mutations go through SplitRange (repo convention). Same-group today; once M2 lands the least-loaded-group target policy flips the target_group_id field on the same RPC.
  • Safety/ops — default OFF behind --autoSplit; runtime kill switch (atomic.Bool); fixed-enum metrics (no per-route/key labels); stable slog keys; detector state is leader-local and resets on election (re-earns confidence; cooldown re-seeded from parent_route_id lineage) — stated explicitly so non-replication isn't mistaken for an oversight.
  • PR breakdown — M3-PR1 (retire engine path + metric scaffolding), M3-PR2 (aggregation + pure detector, observe-only), M3-PR3 (scheduler + flag + kill switch), each independently shippable; M3-PR4 enables cross-group target selection post-M2.

Every current-state claim cites concrete file:line evidence, verified before writing.

Open questions

  1. OQ-1 — Fully delete Route.Load/Engine.Stats, or keep Stats for diagnostics?
  2. OQ-2 — Down-side hysteresis: hold the counter in the [0.8×, 1.0×) band, or decay it?
  3. OQ-3 — No sub-range/Top-K evidence: decline to split (proposed), or allow a min/max-observed-key midpoint?
  4. OQ-4minRouteSpan without per-route key counts: approximate from sampler evidence, or add a cheap key-count estimate?
  5. OQ-5 — Leader-local scoring sufficient, or is there a heavy-follower-read workload needing cluster fan-out?
  6. OQ-6 — Post-M2: enable cross-group target automatically, or gate behind --autoSplitCrossGroup?
  7. OQ-7 — Durable cooldown across elections: is parent_route_id + a catalog split timestamp enough, or do we need durable last_split_at?
  8. OQ-8 — Express threshold in ops/min (proposed) or ops-per-Step to avoid coupling sensitivity to --keyvizStep?

Test plan

Doc-only — no Go tests run. Markdown verified clean; all cited file:line references confirmed against main. The doc's own §8.2 lays out the implementation-phase test strategy (table-driven detector unit tests + pgregory.net/rapid invariants, detector→SplitRange integration in the 3-node demo, Jepsen hotspot workload deferred to M4 per the parent doc).

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 6 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e16e4c99-a354-41cb-85b3-187c76ff76fb

📥 Commits

Reviewing files that changed from the base of the PR and between ed82233 and 3dc133e.

📒 Files selected for processing (1)
  • docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch design/hotspot-split-m3-automation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp

bootjp commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

@claude review

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Claude finished @bootjp's task in 3m 20s —— View job


Review: M3 Automation Design Doc

  • Load tools
  • Read the M3 design doc
  • Read parent design doc and related source files for context
  • Write and post review feedback

This is a high-quality, well-structured proposal. All cited file:line references were verified against the current codebase and are accurate. The rationale for choosing the keyviz sampler over the engine's dead RecordAccess path is sound. The PR breakdown (5 independently-shippable PRs) is excellent. The following are issues worth resolving before implementation begins.


Issues

1. Score normalization formula is underspecified (§5.1)

The doc says the score is computed over "a 1-minute equivalent window (trailing flush columns summed and normalized to the flush Step)." But the exact formula is not given. When --keyvizStep ≠ 60s, "summed and normalized to the flush Step" is ambiguous. Proposed clarification:

score = (sum of write_ops over N columns / (N × Step_seconds)) × 60
      × Ww + ...

This is exactly what OQ-8 is pointing at, but the issue goes beyond units — the doc should commit to a specific formula. If the threshold stays in "ops/min", the normalization formula must be stated explicitly so unit tests in M3-PR2 can pin it. Without that, two contributors can independently implement different normalizations that both pass tests but produce different sensitivities.

Recommendation: close OQ-8 in this doc by committing to score = (sum_over_N_columns / (N × Step_s) × 60) × weights, and document that as the invariant the table-driven unit tests verify.


2. MemSampler.Snapshot takes time.Time params — the "last N columns" query needs a concrete recipe (§4.2)

MemSampler.Snapshot(from, to time.Time) []MatrixColumn (keyviz/sampler.go:1463) takes a time range, not a column count. The doc repeatedly says "last N flush columns" but never specifies how the detector computes from/to. Suggested recipe:

to   = time of last completed flush (flusher's most recent flush boundary)
from = to − N × Step

The detector needs a stable reference to the last flush time. It's worth calling out explicitly whether the detector gets this from a Flusher handle, a channel notification, or simply reads sampler.Snapshot(now − N×Step, now) and accepts that the trailing edge might capture a partially-flushed column. The latter would silently skew scores; the design should rule it out.

This is a concrete implementation question that will arise in M3-PR2 — resolving it here avoids a revision mid-PR.


3. OQ-7 needs a data-model change that isn't scoped to any PR (§7.6)

The design proposes seeding cooldown after election from "any route whose parent_route_id lineage indicates a recent split, or whose creation is recent per the catalog." But RouteDescriptor (parent doc §5) has parent_route_id uint64 and state, but no split_at or created_at timestamp field. Without a timestamp, the new leader cannot tell whether a recent-lineage split happened within the last splitCooldown window. "Recent per the catalog" has no basis in the current schema.

This means the durable-cooldown reconstruction requires adding a timestamp field to RouteDescriptor. That's a catalog schema change — it belongs in M3-PR1 (which already touches engine cleanup), not as an implicit assumption in §7.6.

Recommendation: add split_at_hlc uint64 (HLC timestamp of when the split was applied) to RouteDescriptor, set by the SplitRange handler when it writes the two child RouteDescriptors. State explicitly in §8.1 M3-PR1 or M3-PR2 that this field is added. OQ-7 then becomes "is split_at_hlc sufficient or do we also need last_split_at_hlc updated on every re-split?"


4. Single-key isolation (§5.4.2) requires two SplitRange calls per detection cycle

The proposed single-key isolation approach is: "split key = hot key, and a follow-on split at hot key's successor so the child contains only that key." This is two SplitRange calls: one to split [A, B) into [A, hotkey) and [hotkey, B), and another to split [hotkey, B) into [hotkey, hotkey+1) and [hotkey+1, B). With maxSplitsPerCycle = 1, the second call cannot be issued in the same cycle.

Options worth deciding now:

  • Count the pair as 1 budget unit (special-case the two calls atomically, or issue back-to-back in the same cycle despite the per-cycle limit).
  • Defer the second split to the next cycle: the child [hotkey, B) is now much smaller; it re-evaluates next cycle and the isolation completes. Simpler but takes 2 cycles.
  • Atomic two-call variant: issue both in one cycle, counting as 2 budget units; maxSplitsPerCycle = 2 minimum for single-key isolation to work in one pass.

This affects both the scheduler implementation and the acceptance criteria in §8.4. The design should commit to one approach.


5. Detector state map doesn't describe how manual splits are reconciled (§6.3)

The expected_catalog_version CAS correctly makes the scheduler fail-closed on a manual split. But the detector's in-memory state map retains the parent route entry after its catalog entry disappears. On the next evaluation cycle, the detector would try to evaluate a route that no longer exists in the engine, wasting work at minimum and potentially logging confusing output.

Recommendation: at the start of each evaluation cycle the detector should reconcile its state map against the live route set (e.g. discard state for any routeID that is no longer present in the engine's routes). This is a small but load-bearing correctness detail that should be in §5 or §6.


6. historyMu.Lock() in Snapshot contends with RunFlusher (§3.3)

§3.3 correctly notes that Snapshot "takes historyMu — never touched by Observe." However, RunFlusher also takes historyMu on each flush to append the new column. The lock is a write mutex (s.historyMu.Lock()). If the default eval interval equals --keyvizStep, the detector's Snapshot call and the flusher's append will regularly race for the same lock, adding tail latency to one or the other.

This is minor given detection is off-path, but worth noting: the five-lens performance review in M3-PR2 should benchmark or at least measure the lock contention under load (e.g. with go test -race plus a timing check). The doc could add a one-line note to the performance lens in §8.3 to check this.


Open Questions — Recommendations

OQ-1 (Route.Load / Stats fate): Recommend keeping Engine.Stats() but deleting RecordAccess/splitRange/the threshold constructor. Stats() is useful as a debug surface for the admin tool and costs nothing when unused. The comment at distribution/engine.go:377-393 shows it already returns a safe copy.

OQ-2 (down-side hysteresis shape): Hold (current proposal) is the right choice. Decaying by one per cycle introduces an implicit max-hold time that's a function of candidateWindows, making the threshold–cooldown interaction harder to reason about in unit tests. Hold is simple, deterministic, and easy to assert: "any score in [0.8×, threshold) must never change consecutiveOver."

OQ-3 (no-evidence fallback): Decline (current proposal). Allowing a min/max-observed-key midpoint as fallback re-introduces the midpoint-only problem the parent doc §3.1.3 explicitly rejects for unobserved subspaces. Better to push the operator toward enabling --keyvizKeyBucketsPerRoute via a clear metric (autosplit_skipped_no_split_key).

OQ-5 (leader-local vs cluster-wide): Leader-local is sufficient for M3. The rationale in §4.1 is well-argued (the leader serves the load it would split on). Flag this as a known limitation in the doc, not as a deferred OQ — it's already a deliberate non-goal.

OQ-6 (cross-group gate): Recommend gating behind --autoSplitCrossGroup. Cross-group splits trigger data movement (M2's SplitJob); operators should opt into that separately from range splitting. The flag cost is trivial; the blast radius of unexpected cross-group data migration is not.


Minor / Nits

  • §5.4.1 "p50 from sub-range buckets" — clarify which bucket boundary is chosen: the lower (lo) or upper (hi) boundary of the bucket whose cumulative load crosses 50%. SubBucketBoundsFor returns (lo, hi []byte, ok bool); the implementation needs to pick one.

  • §7.6 "leader-loss callback" — add a file:line pointer to the actual mechanism (presumably the same kv/lease_state.go / kv/raft_engine.go callbacks used by lease invalidation). This doc targets a reviewer unfamiliar with the lease layer; the cross-reference closes the loop.

  • §7.1 --autoSplit implies sampler construction — add a note about the interaction with cmd/server/demo.go and main.go: the wiring condition becomes if keyvizEnabled || autoSplit { constructSampler() }. Flag this for M3-PR3 so the flag-wiring reviewer doesn't miss it.


Summary

Fundamentally sound design. The key things to resolve before M3-PR1 begins:

  1. Commit to a precise scoring/normalization formula (closes OQ-8 and pins unit tests).
  2. Specify how the detector queries Snapshot for "last N columns" (concrete time-range recipe).
  3. Add split_at_hlc to RouteDescriptor schema in M3-PR1's scope (unblocks OQ-7 reconstruction).
  4. Decide single-key isolation budget counting (2 calls = 1 or 2 budget units).
  5. Add detector state-map reconciliation to §5 or §6.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a design document for Milestone 3 of the Hotspot Shard Split, focusing on automation. The design leverages the existing keyviz sampler to detect hot routes, implements a pure detector with hysteresis and cooldown, and schedules splits via the SplitRange RPC. The review feedback highlights three key design improvements: implementing a garbage-collection mechanism for the leader-local state map to prevent memory leaks, addressing the lack of creation timestamps in RouteDescriptor to enable accurate cooldown seeding on new leaders, and restricting auto-split eligibility to routes in RouteStateActive to avoid interfering with transitional migration states.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +229 to +235
The detector's hysteresis counters and cooldown deadlines live in a **leader-local in-memory state map** — they are **not** replicated through Raft. On a default-group leadership change:

- The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses).
- The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed.
- **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state.

Keeping detector state leader-local (not Raft-replicated) is a deliberate cost/safety trade: replicating per-cycle counters would add Raft traffic for what is a best-effort heuristic, and the worst case of a lost counter is a *delayed* split (safe), never an *erroneous* one. We state this explicitly so a reviewer does not mistake the non-replication for an oversight.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent unbounded memory growth in the leader-local in-memory state map, the design should specify a pruning/garbage-collection mechanism. Since route IDs are deleted from the catalog after a split (and replaced by child routes), keeping their hysteresis counters and cooldown deadlines in the map indefinitely will cause a memory leak.\n\nThe detector should periodically (e.g., at the end of each evaluation cycle) compare the keys of the state map against the active routes in the current catalog snapshot and evict any route IDs that are no longer present.


- The deposed leader's detector goroutine stops (it observes the leader-loss callback, the same mechanism the lease invalidation uses).
- The **new** leader starts its detector with an **empty** state map. Consecutive-over counters reset to zero, so a route must re-accumulate `candidateWindows` cycles before it can be promoted again. This is intentionally conservative: after an election, the new leader re-earns confidence before splitting, avoiding a split based on counters it never observed.
- **Cooldowns are partially lost** across an election (the new leader does not know a route was recently split). To avoid a too-soon re-split, the new leader seeds cooldown from the catalog: any route whose `parent_route_id` lineage (parent doc §5, `RouteDescriptor.parent_route_id`) indicates a recent split, or whose creation is recent per the catalog, starts in cooldown. (OQ-7: is `parent_route_id` + a catalog-recorded split timestamp enough to reconstruct cooldown, or do we need a durable `last_split_at` per route?) In-flight cross-group `SplitJob`s are durable (M2) so the new leader sees them and excludes those routes (§6.3) regardless of detector state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposal to seed cooldowns on a new leader by checking if a route's creation is "recent per the catalog" has a design gap: the RouteDescriptor struct in catalog.go does not contain any creation timestamp, split timestamp, or catalog version of origin. Without this metadata, a newly elected leader cannot determine whether a route with a parent_route_id was split 1 minute ago (and should be in cooldown) or 1 hour ago (cooldown expired).\n\nTo resolve OQ-7 and make this safety mechanism viable, the design should either:\n1. Add a durable split_timestamp or created_at field to the RouteDescriptor (or catalog metadata).\n2. Store a mapping of catalog_version -> timestamp in the catalog metadata so the leader can resolve the age of the route's origin version.


The target-group selection slots **behind the same `SplitRange` interface** — the scheduler code path is identical; only the `target_group_id` field flips from zero (same-group) to a chosen group once M2 is present. M3 ships the policy hook disabled (always same-group) and a follow-on M3.x / M2-completion PR enables target selection. OQ-6 asks whether to gate target selection on an explicit `--autoSplitCrossGroup` sub-flag.

### 6.3 Concurrency with in-flight splits

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure safety during migrations, the design should explicitly state that only routes in RouteStateActive are eligible for auto-split detection. Routes in transitional states such as RouteStateWriteFenced, RouteStateMigratingSource, or RouteStateMigratingTarget must be excluded from candidacy. This prevents concurrent automated splits from interfering with active M2 migration jobs.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3dc133ecb7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- **M3 reads detection input from the keyviz sampler**, not from `Route.Load`.
- The engine's `RecordAccess` / `splitRange` / `Route.Load` path is **retired** (§3.4). No new call site is added for it.

This means **auto-split requires keyviz sampling to be active**. `--autoSplit` implies the sampler is constructed and flushing (§7.1). We do *not* require the operator to also pass `--keyvizEnabled`; enabling auto-split turns on the minimum sampling auto-split needs (route-granular counts; sub-range buckets and Top-K stay opt-in via the existing keyviz flags and only sharpen split-key selection — see §5.4).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enable split-key evidence with autoSplit defaults

With the current keyviz defaults (DefaultKeyBucketsPerRoute = 1 and --keyvizHotKeysEnabled=false), turning on only --autoSplit constructs a route-granular sampler, but §5.4.3 later says the detector declines whenever neither sub-range buckets nor Top-K are enabled. That means the default advertised --autoSplit mode can never produce a split key or satisfy the M3 acceptance criteria unless operators also discover and enable another keyviz flag; either make --autoSplit enable a split-key evidence source or document it as required configuration.

Useful? React with 👍 / 👎.

Per parent doc §3.1.3 and §6.3, the split key comes from the **observed key distribution**, not a byte midpoint:

1. **Default — sub-range p50.** When `--keyvizKeyBucketsPerRoute > 1`, use the route's sub-range bucket rows to find the bucket boundary nearest the cumulative-load median, and use that boundary key (reconstructed via `MemSampler.SubBucketBoundsFor`, `keyviz/sampler.go:559`) as the split key. This places the boundary where load is actually halved.
2. **Single-key skew — isolate the hot key.** When Top-K is enabled (`--keyvizHotKeysEnabled`) and one key's share of the route's writes is `>= topKeyShare` (parent doc §6.3 default 0.8), split so the hot key is isolated into its own narrow child (split key = hot key, and a follow-on split at hot key's successor so the child contains only that key). This is the single-hot-key hotspot case the whole feature targets.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exempt compound hot-key splits from cooldown

This single-key isolation plan requires two SplitRange calls: first at the hot key, then a follow-on split at its successor. However §5.3 immediately puts both children into the 10-minute cooldown after a split is scheduled, and §6.3 defaults to one split per cycle; unless the scheduler treats this as an atomic/explicitly exempt compound operation, the child containing the hot key cannot be split again until cooldown expires, so the hot key is not isolated as claimed.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant