feat: [DSM-142] Add ColdStats aggregate to CanisterStates #10286

Merged
alin-at-dfinity merged 14 commits into master from alin/DSM-142-canister-states-cold-stats
May 29, 2026
@alin-at-dfinity

Maintains a small ColdStats aggregate over the canisters in the cold pool, updated incrementally on every transition into / out of cold. This lets the "touch every canister" aggregate queries — total_compute_allocation, total_canister_memory_usage, memory_taken, callback_count, guaranteed_response_message_memory_taken, best_effort_message_memory_taken — run in O(|hot|) instead of O(|all canisters|), which is the primary motivation for the hot/cold split on subnets with a long tail of idle canisters.
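
As a rough illustration of the incremental bookkeeping described above, here is a minimal sketch. It is not the code from this PR: `Canister`, `Pools`, `cool` and `warm` are hypothetical names, ids are plain `u64`s, and only two of the aggregated quantities are modelled.

```rust
use std::collections::BTreeMap;

/// Hypothetical, heavily simplified stand-in for `CanisterState`.
#[derive(Clone, Debug, PartialEq)]
pub struct Canister {
    pub compute_allocation: u64,
    pub memory_usage: u64,
}

/// Running totals over the cold pool only (derived, never persisted).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct ColdStats {
    compute_allocation: u64,
    memory_usage: u64,
}

impl ColdStats {
    fn add(&mut self, c: &Canister) {
        self.compute_allocation += c.compute_allocation;
        self.memory_usage += c.memory_usage;
    }

    fn sub(&mut self, c: &Canister) {
        self.compute_allocation -= c.compute_allocation;
        self.memory_usage -= c.memory_usage;
    }
}

/// Hypothetical two-pool container; the two maps are kept disjoint.
#[derive(Debug, Default)]
pub struct Pools {
    hot: BTreeMap<u64, Canister>,
    cold: BTreeMap<u64, Canister>,
    cold_stats: ColdStats,
}

impl Pools {
    /// Inserts into `hot`, evicting any cold entry with the same id.
    pub fn insert(&mut self, id: u64, c: Canister) {
        if let Some(old) = self.cold.remove(&id) {
            self.cold_stats.sub(&old);
        }
        self.hot.insert(id, c);
    }

    /// Moves a canister from `hot` to `cold`, adding it to the aggregate.
    pub fn cool(&mut self, id: u64) -> bool {
        match self.hot.remove(&id) {
            Some(c) => {
                self.cold_stats.add(&c);
                self.cold.insert(id, c);
                true
            }
            None => false,
        }
    }

    /// Moves a canister from `cold` to `hot`, subtracting it from the aggregate.
    pub fn warm(&mut self, id: u64) -> bool {
        match self.cold.remove(&id) {
            Some(c) => {
                self.cold_stats.sub(&c);
                self.hot.insert(id, c);
                true
            }
            None => false,
        }
    }

    /// Iterates over `hot` only; the cold contribution is a stored total.
    pub fn total_memory_usage(&self) -> u64 {
        self.cold_stats.memory_usage + self.hot.values().map(|c| c.memory_usage).sum::<u64>()
    }

    /// Same shape as above for compute allocation.
    pub fn total_compute_allocation(&self) -> u64 {
        self.cold_stats.compute_allocation
            + self.hot.values().map(|c| c.compute_allocation).sum::<u64>()
    }
}
```

In this sketch the totals are unaffected by which pool a canister sits in; only the cost of computing them changes.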

The aggregates are derived (not persisted) and are reconstructed by CanisterStates::new on checkpoint load. debug_assert_invariants ensures every mutating method keeps them in sync, and the ColdStats struct stays module-private — callers always reach the totals through the public aggregator methods on CanisterStates.

MemoryTaken's fields are bumped from private to pub(crate) so that CanisterStates::memory_taken can construct the struct directly, keeping MemoryTaken in its current home in replicated_state.rs. CanisterStates::memory_taken itself is pub(crate) and will be wired up to ReplicatedState::memory_taken in the next PR.

alin-at-dfinity and others added 2 commits May 22, 2026 07:23
Lays the foundation for splitting `ReplicatedState::canister_states` into
"hot" (potentially active) and "cold" (definitely idle) pools, so that
per-round operations can skip the long tail of idle canisters.

This PR is intentionally a no-op for the running replica: it only adds
the new types and predicates. The integration into `ReplicatedState` and
the migration of all consumers follow in subsequent PRs.

Specifically:

  * `CanisterState::is_cold()` — pure predicate that classifies a canister
    as "definitely idle": no input/output, no task queue entries, no
    heartbeat method, inactive global timer, not `Stopping`, no
    unexpired best-effort callbacks, and no scheduler debits.
  * `CallContextManager::has_unexpired_callbacks()` and the matching
    `SystemState::has_unexpired_callbacks()` accessor, used by `is_cold`.
  * `CanisterStates`, a hot/cold-partitioned container with eager
    promotion (mutations land in `hot`) and lazy demotion (via
    `try_cool`/`try_cool_all`), plus the common map operations
    (`get`/`get_mut`/`insert`/`remove`/`contains_key`/`len`/`is_empty`/
    `retain`), per-pool iterators (`hot_iter`/`hot_values`/
    `hot_values_mut`), merged iterators in `CanisterId` order
    (`all_iter`/`all_keys`/`all_values`), and bulk mutation
    (`for_each_mut`/`try_for_each_mut`).
  * `CanisterStates::validate_strict_split()` for the canonical-partition
    invariant used in checkpoint validation.
  * `debug_assert_invariants()` runs on every mutating operation in
    debug builds.
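
The eager-promotion / lazy-demotion behaviour listed above can be sketched roughly as follows. This is an illustrative toy, not the `CanisterStates` implementation: `HotColdMap`, the `u64` ids, and the three-field `Canister` (whose `is_cold` checks only a subset of the conditions named in the list) are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified canister; the real predicate checks more conditions.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Canister {
    pub pending_messages: usize,
    pub has_heartbeat: bool,
    pub stopping: bool,
}

impl Canister {
    /// Pure predicate: "definitely idle" under this toy model.
    pub fn is_cold(&self) -> bool {
        self.pending_messages == 0 && !self.has_heartbeat && !self.stopping
    }
}

/// Hypothetical hot/cold-partitioned map; the pools are kept disjoint.
#[derive(Debug, Default)]
pub struct HotColdMap {
    hot: BTreeMap<u64, Canister>,
    cold: BTreeMap<u64, Canister>,
}

impl HotColdMap {
    /// Mutations land in `hot`; returns the previous value from either pool.
    pub fn insert(&mut self, id: u64, c: Canister) -> Option<Canister> {
        let old_cold = self.cold.remove(&id);
        self.hot.insert(id, c).or(old_cold)
    }

    /// Read-only access does not move the canister.
    pub fn get(&self, id: u64) -> Option<&Canister> {
        self.hot.get(&id).or_else(|| self.cold.get(&id))
    }

    /// Eager promotion: handing out `&mut` first moves the canister to `hot`.
    pub fn get_mut(&mut self, id: u64) -> Option<&mut Canister> {
        if let Some(c) = self.cold.remove(&id) {
            self.hot.insert(id, c);
        }
        self.hot.get_mut(&id)
    }

    /// Lazy demotion: moves the canister to `cold` only if it is idle now.
    pub fn try_cool(&mut self, id: u64) -> bool {
        let idle = self.hot.get(&id).map_or(false, |c| c.is_cold());
        if !idle {
            return false;
        }
        if let Some(c) = self.hot.remove(&id) {
            self.cold.insert(id, c);
        }
        true
    }

    /// Demotes every hot canister that currently satisfies `is_cold`.
    pub fn try_cool_all(&mut self) {
        let ids: Vec<u64> = self
            .hot
            .iter()
            .filter(|(_, c)| c.is_cold())
            .map(|(id, _)| *id)
            .collect();
        for id in ids {
            self.try_cool(id);
        }
    }

    /// Keys of both pools, merged in ascending id order.
    pub fn all_keys(&self) -> Vec<u64> {
        let mut keys: Vec<u64> = self.hot.keys().chain(self.cold.keys()).copied().collect();
        keys.sort_unstable();
        keys
    }

    pub fn hot_len(&self) -> usize {
        self.hot.len()
    }

    pub fn cold_len(&self) -> usize {
        self.cold.len()
    }
}
```

One consequence visible in the sketch: `hot` may temporarily hold canisters that are idle (until the next `try_cool*`), but `cold` never holds one that was mutated after being cooled.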

`ColdStats` and the aggregate accessors (`total_compute_allocation`,
`total_canister_memory_usage`, `memory_taken`, `callback_count`, ...)
are intentionally **not** part of this PR — they will be added once the
struct is in place.

Co-authored-by: Cursor <cursoragent@cursor.com>
Maintains a small `ColdStats` aggregate over the canisters in the
`cold` pool, updated incrementally on every transition into / out of
`cold`. This lets the "touch every canister" aggregate queries —
`total_compute_allocation`, `total_canister_memory_usage`,
`memory_taken`, `callback_count`,
`guaranteed_response_message_memory_taken`,
`best_effort_message_memory_taken` — run in `O(|hot|)` instead of
`O(|all canisters|)`, which is the primary motivation for the
hot/cold split on subnets with a long tail of idle canisters.

The aggregates are derived (not persisted) and are reconstructed by
`CanisterStates::new` on checkpoint load. `debug_assert_invariants`
now also runs an `O(|cold|)` recompute and compares the result against
the live aggregate, so debug builds check that every mutating method
keeps them in sync. The `ColdStats` struct stays module-private:
callers always reach the totals through the public aggregator methods
on `CanisterStates`.
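
A recompute-and-compare check of this kind might look roughly like the sketch below. The `ColdPool` type, its `u64` ids and the two tracked fields are hypothetical simplifications; the actual `debug_assert_invariants` in this PR is not reproduced here.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified canister with two aggregated quantities.
#[derive(Clone, Debug, PartialEq)]
pub struct Canister {
    pub memory_usage: u64,
    pub callback_count: u64,
}

/// Live aggregate, updated incrementally.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct ColdStats {
    memory_usage: u64,
    callback_count: u64,
}

/// Hypothetical cold pool plus its derived totals.
#[derive(Debug, Default)]
pub struct ColdPool {
    cold: BTreeMap<u64, Canister>,
    stats: ColdStats,
}

impl ColdPool {
    /// O(|cold|) recomputation from scratch, used only for checking.
    fn recompute(&self) -> ColdStats {
        self.cold.values().fold(ColdStats::default(), |mut acc, c| {
            acc.memory_usage += c.memory_usage;
            acc.callback_count += c.callback_count;
            acc
        })
    }

    /// True if the live aggregate matches a fresh recompute.
    pub fn in_sync(&self) -> bool {
        self.stats == self.recompute()
    }

    /// Compiled out in release builds, so the O(|cold|) cost is debug-only.
    fn debug_assert_invariants(&self) {
        debug_assert!(
            self.in_sync(),
            "ColdStats out of sync: live {:?}, recomputed {:?}",
            self.stats,
            self.recompute()
        );
    }

    /// Adds (or replaces) a cold canister and updates the aggregate.
    pub fn insert(&mut self, id: u64, c: Canister) {
        self.stats.memory_usage += c.memory_usage;
        self.stats.callback_count += c.callback_count;
        if let Some(old) = self.cold.insert(id, c) {
            self.stats.memory_usage -= old.memory_usage;
            self.stats.callback_count -= old.callback_count;
        }
        self.debug_assert_invariants();
    }

    /// Removes a cold canister and updates the aggregate.
    pub fn remove(&mut self, id: u64) -> Option<Canister> {
        let removed = self.cold.remove(&id);
        if let Some(c) = &removed {
            self.stats.memory_usage -= c.memory_usage;
            self.stats.callback_count -= c.callback_count;
        }
        self.debug_assert_invariants();
        removed
    }

    pub fn cold_memory_usage(&self) -> u64 {
        self.stats.memory_usage
    }

    pub fn cold_callback_count(&self) -> u64 {
        self.stats.callback_count
    }
}
```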

`MemoryTaken`'s fields are bumped from private to `pub(crate)` so that
`CanisterStates::memory_taken` can construct the struct directly,
keeping `MemoryTaken` in its current home in `replicated_state.rs`.
`CanisterStates::memory_taken` itself is `pub(crate)` and will be
wired up to `ReplicatedState::memory_taken` in the next PR; an
`#[allow(dead_code)]` keeps the build warning-free until then.

Aggregator behaviour is exercised by two new tests
(`memory_aggregators_combine_hot_and_cold`,
`callback_count_combines_hot_and_cold`) and the bookkeeping
discipline is exercised by an extended set of `*_updates_cold_stats*`
tests covering `insert`, `remove`, `try_cool*`, `for_each_mut`,
`try_for_each_mut`, and `retain`.

Co-authored-by: Cursor <cursoragent@cursor.com>
@alin-at-dfinity alin-at-dfinity requested a review from a team as a code owner May 22, 2026 08:55
@github-actions github-actions Bot added the feat label May 22, 2026
alin-at-dfinity and others added 8 commits May 22, 2026 09:14
…ry. Rename raw_memory to execution_memory, so it better matches the equivalent MemoryTaken field. Update documentation and tests.
A canister can satisfy `CanisterState::is_cold()` while still holding a
guaranteed-response slot reservation: `is_cold()` only requires empty
input/output *messages* (the pool count) and no unexpired best-effort
callback, both of which are independent of whether the canister has
in-flight guaranteed-response requests. A canister that has pushed a
guaranteed-response request that's already been moved to an outgoing
stream still keeps the input-slot reservation for the eventual response,
which contributes `MAX_RESPONSE_COUNT_BYTES` to its
`guaranteed_response_message_memory_usage()`.

The previous commit dropped this field from `ColdStats` on the
assumption it was always zero. It isn't, and the consequence is that
`guaranteed_response_message_memory_taken()` quietly under-reports
subnet-wide memory: promoting a cold canister with a reservation to
`hot` (e.g. on the next `get_mut`) makes the subnet total jump up out
of nowhere, breaking conservation invariants in downstream code
(stream handler `debug_assert!`s, in particular).

Restore the field and the corresponding `add`/`sub` bookkeeping, fold
it into `guaranteed_response_message_memory_taken`,
`total_canister_memory_usage`, and `memory_taken`, and add a focused
test (`cold_canister_with_guaranteed_response_reservation_is_aggregated`)
exercising the case via `push_output_request` followed by draining the
output queue.
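
To make the failure mode concrete, the following toy model compares the subnet-wide total before and after promoting a cold canister that holds one reservation, with and without the cold-side field. Everything here is hypothetical: `Pools`, `promote_one`, `totals_around_promotion`, and `RESERVATION_BYTES` (an arbitrary placeholder value, not the real `MAX_RESPONSE_COUNT_BYTES`).

```rust
/// Arbitrary placeholder for the size of one response-slot reservation.
pub const RESERVATION_BYTES: u64 = 1_000;

/// Toy model: each entry is one canister's guaranteed-response memory usage.
pub struct Pools {
    hot: Vec<u64>,
    cold: Vec<u64>,
    /// Cold-side aggregate; stays 0 when `track_cold` is false (the buggy variant).
    cold_total: u64,
    track_cold: bool,
}

impl Pools {
    pub fn new(hot: Vec<u64>, cold: Vec<u64>, track_cold: bool) -> Self {
        let cold_total: u64 = if track_cold { cold.iter().sum::<u64>() } else { 0 };
        Pools { hot, cold, cold_total, track_cold }
    }

    /// Subnet-wide total: stored cold aggregate plus a pass over `hot`.
    pub fn guaranteed_response_memory_taken(&self) -> u64 {
        self.cold_total + self.hot.iter().sum::<u64>()
    }

    /// Moves one canister from `cold` to `hot`, as a `get_mut` would.
    pub fn promote_one(&mut self) -> bool {
        match self.cold.pop() {
            Some(bytes) => {
                if self.track_cold {
                    self.cold_total -= bytes;
                }
                self.hot.push(bytes);
                true
            }
            None => false,
        }
    }
}

/// Returns (total before promotion, total after promotion) for a subnet
/// with a single cold canister holding one reservation.
pub fn totals_around_promotion(track_cold: bool) -> (u64, u64) {
    let mut pools = Pools::new(vec![], vec![RESERVATION_BYTES], track_cold);
    let before = pools.guaranteed_response_memory_taken();
    pools.promote_one();
    (before, pools.guaranteed_response_memory_taken())
}
```

With `track_cold` set to `false` the reported total changes across a promotion even though no memory was allocated; with `true` it is conserved, which is the property the restored field is meant to provide.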

Best-effort message memory remains hot-only: an unexpired best-effort
callback forces the canister into `hot`, and any expired best-effort
callback shows up as a pending input which also forces `hot`.

Co-authored-by: Cursor <cursoragent@cursor.com>
… canisters from one pool to the other; add more tests for is_cold(); misc test additions.
Base automatically changed from alin/canister-states-foundations to master May 27, 2026 19:06
…lementation; also apply pub(crate) to best_effort_message_memory_taken() and guaranteed_response_message_memory_taken(), as they are also potentially dangerous to use directly.
…d() test, so that all stats are covered; and for both hot and cold canisters.
@alin-at-dfinity alin-at-dfinity enabled auto-merge May 29, 2026 10:11
@alin-at-dfinity alin-at-dfinity added this pull request to the merge queue May 29, 2026
Merged via the queue into master with commit 7813995 May 29, 2026
37 checks passed
@alin-at-dfinity alin-at-dfinity deleted the alin/DSM-142-canister-states-cold-stats branch May 29, 2026 11:24