Skip to content

fix: disk-space pre-check before payment verification on full nodes#133

Merged
jacderida merged 3 commits into
WithAutonomi:rc-2026.6.2from
jacderida:v2-411
Jun 9, 2026
Merged

fix: disk-space pre-check before payment verification on full nodes#133
jacderida merged 3 commits into
WithAutonomi:rc-2026.6.2from
jacderida:v2-411

Conversation

@jacderida

Copy link
Copy Markdown
Collaborator

Resolves V2-411.

Why

Investigation of the PROD-UL-01 merkle upload failures found that a large fraction of the node
fleet sits below the 0.49 GiB disk reserve. A full node that receives a PUT it can never satisfy
still ran the expensive payment verification first — ML-DSA pool checks, a synchronous
Kademlia closeness lookup, and an on-chain Arbitrum RPC (the call that intermittently failed
against arb1.arbitrum.io during the incident) — and only discovered the disk problem afterwards
at storage.put. At fleet scale that is wasted RPC/network load and it delays the client's
fallback to another peer, contributing to the slow uploads observed.

The node is not taking money here — merkle/EVM payment is already settled on-chain before the
PUT; verification only validates a pre-paid proof. The issue is wasted work + latency, not lost
funds.

What

Add a cheap disk-space pre-check ahead of payment verification on the two paths where a full node
would otherwise verify-then-fail:

  • PUT handler (handle_put_inner): call a new LmdbStorage::check_capacity() immediately
    after the already-exists check and before verify_payment; on insufficient space, return the
    existing ProtocolError::StorageFailed early.
  • Replication fresh-offer path (handle_fresh_offer): the same guard before verify_payment,
    returning a Rejected response. (Found via the testnet verification below — the PUT-handler fix
    alone left this path still doing wasted verification.)

Both emit an info-level reject log under target ant_node::storage::disk_precheck for log-based
verification. check_capacity() wraps the existing cached check (passing results only, so freed
space is still detected promptly; effectively free per-PUT). The store-path check inside
LmdbStorage::put() is kept as defence-in-depth.

Scope notes:

  • Backwards compatible — no wire/response-variant change. The client already falls back to the
    next peer on any Err. A dedicated OutOfCapacity/TryElsewhere response is left to V2-469.
  • handle_paid_notify (the third verify_payment caller) is intentionally not guarded: it
    stores no chunk data, so there is no doomed store to skip.
  • The pull/fetch store path does no payment verification, so it is unchanged.

Commits

  • fix(storage): disk-space pre-check before payment verification in PUT
  • fix(replication): disk-space pre-check before payment verification on fresh offers

Test plan

Local gates (all green):

  • cargo fmt --all
  • cargo clippy --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used
  • cargo test — storage handler 15/15 (incl. a new regression test asserting a disk-full PUT of
    an uncached chunk returns the disk error before verification), replication 233/233.

Testnet verification (DEV-02, 40 nodes, 8 full @ 20%, this build), via Elasticsearch log queries
scoped to the known full hosts:

Run 1 (PUT-handler fix):

  • disk_precheck reject log fired only on the full hosts; zero on healthy hosts.
  • On full hosts put_rpc count == disk_precheck count (115 == 115) — 100% of client PUT RPCs
    rejected at the pre-check; put_rpc with duration_ms >= 30 = 0; Stored chunk = 0.
  • PUT latency collapsed: full hosts p50 1 ms / p99 8 ms / max 10 ms vs healthy p50 152 ms /
    p99 937 ms / max 2148 ms (~100x), exactly the skipped verification work.
  • Revealed the replication push path still verified-then-failed (~491 EVM payment verified
    followed by Failed to store fetched record ...: Insufficient disk), motivating the second commit.

Run 2 (both fixes):

  • disk_precheck rejects only on full hosts; breakdown PUT-path 70, replication-path 313
    (Rejecting fresh replication offer before payment verification: ... Insufficient disk space).
  • PUT latency still collapsed: full hosts p50 0 ms / p99 2 ms vs healthy p50 146 ms / p99 534 ms;
    Stored chunk on full hosts = 0.
  • Residual EVM payment verified on full hosts traced structurally to handle_paid_notify
    (no store) — i.e. no wasted store-path verification remains.

🤖 Generated with Claude Code

A disk-full node previously ran full payment verification — ML-DSA pool
checks, a Kademlia closeness lookup, and an on-chain Arbitrum RPC — on
every PUT before discovering at storage time that it had no space, only
to reject the chunk. At fleet scale this wasted RPC/network load and
delayed the client's fallback to another peer (V2-411).

Add a cheap disk-space pre-check that runs immediately after the
already-exists check and before payment verification:

- Expose `LmdbStorage::check_capacity()` wrapping the existing cached
  `check_disk_space_cached()` (passing results only, so freed space is
  still detected promptly; effectively free per-PUT).
- In `handle_put_inner`, call it before `verify_payment`; on insufficient
  space, emit an info-level reject log (target
  `ant_node::storage::disk_precheck`) and return the existing
  `ProtocolError::StorageFailed` early, preserving client behaviour.
- Keep the store-path check as defence-in-depth (pre-check↔store race,
  replication writes).

The reject log is greppable for testnet verification. Backwards
compatible: no wire/response-variant change. A dedicated
OutOfCapacity/TryElsewhere response is left to V2-469.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 22:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a lightweight disk-capacity guard so full nodes can reject PUT/replication work before running expensive payment verification, reducing wasted RPC/network load and improving client failover latency (V2-411).

Changes:

  • Add LmdbStorage::check_capacity() as a cheap, cached disk-space pre-check.
  • Add early disk-space rejection in the PUT handler (handle_put_inner) before payment verification.
  • Add the same early rejection in replication fresh-offer handling (handle_fresh_offer) plus a regression test covering the PUT short-circuit.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/storage/lmdb.rs Introduces a public capacity pre-check wrapper around the existing cached disk-space guard.
src/storage/handler.rs Adds a disk pre-check before payment verification in PUT flow and adds a regression test for early rejection.
src/replication/mod.rs Adds a disk pre-check before payment verification for fresh replication offers, returning a rejection early.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/storage/lmdb.rs Outdated
Comment thread src/replication/mod.rs
Comment thread src/storage/handler.rs Outdated
… fresh offers

DEV-02 testnet verification of the PUT-handler pre-check (the previous
commit) showed full nodes still doing wasted payment verification on the
replication push path: ~491 "EVM payment verified" log lines on the two
full VMs, each followed by "Failed to store fetched record ...:
Insufficient disk space". That work runs in handle_fresh_offer, which is
not covered by the handle_put_inner reordering.

Mirror V2-411 into the fresh-offer path: call LmdbStorage::check_capacity()
immediately after the responsibility check and before verify_payment. On
insufficient space, emit the same info-level reject log (target
ant_node::storage::disk_precheck, with a replication-specific message) and
return a Rejected response, instead of verifying a payment for a record the
node can never store. The store-path check inside put() remains as
defence-in-depth.

The pull/fetch path (FetchResponse store) does no payment verification, so
it is left unchanged. No behavioural change for nodes with free space.

Verification: re-run the same small testnet and confirm the full hosts'
"EVM payment verified" volume drops while disk_precheck reject logs cover
the replication path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Narrow `LmdbStorage::check_capacity` from `pub` to `pub(crate)`: it is only
  used in-crate (PUT handler + replication), so it should not widen the lib API.
- Drop the redundant "Storage error: " prefix from the replication rejection
  reason: `Error::Storage` already renders as "storage error: ...", so
  `format!("Storage error: {e}")` produced "Storage error: storage error: ...".
  Use the error's own message instead, here and on the adjacent pre-existing
  store-failure branch for consistency.
- Reword the PUT pre-check comment: the cached check is free per-PUT on a
  healthy node, but a disk-full node re-runs a cheap `available_space` syscall
  each PUT (still negligible next to the verification it avoids).

No behavioural change. cargo fmt/clippy/doc clean; replication 233/233,
storage 30/30 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 22:32

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@dirvine dirvine left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #133 — disk-space pre-check before payment verification

Scope: Focused on PR #133's own commits (4c4f60d and f4e4ec5) on v2-411main. CI: fmt ✅, clippy ✅, build (ubuntu/macos/windows) ✅, security audit ✅, Test (no logging) ✅; the three platform Test jobs were still in progress at review time. Re-verify before merge.

TL;DR

LGTM with one should-fix and a few nits. The core idea is correct and well-evidenced by the testnet data in the PR body, the diff is small and localised, and the chosen error surface preserves the existing fallback contract. Approve once the wire-string change is acknowledged (or revised).

What the PR does

Adds a cheap, cached LmdbStorage::check_capacity() (just a wrapper over the existing check_disk_space_cached) and calls it on the two paths that previously verified-then-failed at the store layer:

  1. PUT handler (handle_put_inner) — rejects with ProtocolError::StorageFailed before verify_payment (V2-411 root cause).
  2. Replication fresh-offer path (handle_fresh_offer) — sends FreshReplicationResponse::Rejected before the on-chain PaymentVerifier::verify_payment call.

handle_paid_notify is intentionally left alone (no store) and the pull/fetch path is unchanged (no payment). Defence-in-depth is preserved because LmdbStorage::put() keeps its own reserve check.

✅ Strengths

  • Right scope. The fix is exactly the two wasted-work paths. The handle_paid_notify exclusion is correct and explicitly justified in the body.
  • Backwards compatible. No new response variant, no wire change other than the reason text in one reject path (see ⚠️ below). Client fallback behaviour is preserved.
  • Reuses an existing, already-bounded primitive. check_capacity() is a 4-line wrapper over check_disk_space_cached (TTL-cached passing results, failing results always re-statted, single fs2::available_space syscall on miss). No new background work, no new locks, no new async state.
  • Test is well-designed. create_test_protocol_with_reserve(u64::MAX) makes the pre-check deterministically fail without fabricating LMDB state. Asserting StorageFailed containing "Insufficient disk space" plus !exists(&address) proves the short-circuit: the chunk is uncached, so reaching verify_payment would have produced PaymentRequired/PaymentFailed, not StorageFailed. The test name and docstring explain the negative-space reasoning clearly.
  • Testnet evidence is concrete. Run 1/Run 2 breakdown (PUT-path 70 vs replication-path 313 rejects; 100% of full-host put_rpc short-circuited; ~100× latency drop on full hosts; Stored chunk = 0) is the right level of proof for a change like this, and the discovery of the replication path during Run 1 is a nice iterative signal.
  • Observability. The ant_node::storage::disk_precheck log target gives a cheap way to detect over-eager rejection in production without re-instrumenting the call sites.

⚠️ Should-fix

1. FreshReplicationResponse::Rejected { reason } text changed on the post-store Err arm.
src/replication/mod.rs:1265 changes the existing format!("Storage error: {e}") to e.to_string(). The new value is "Insufficient disk space: …" (a more accurate, more specific message — good!), but this is a wire-text change for an existing reject path, and the PR body doesn't flag it. Two options:

  • Acknowledge it explicitly in the PR body ("reason string on the post-store error path changes from Storage error: … to the more specific underlying error; same variant, free-form String field, no schema change"). The reason is a String, so it's wire-compatible, but a downstream client that pattern-matches on the prefix would break.
  • Or: leave the old format!("Storage error: {e}") and only use the raw e.to_string() on the new pre-check arm. That keeps the existing reject-path text stable and isolates the new wire text to the new arm.

Either is fine — pick one and document it. I lean toward acknowledging the change, since the new text is more informative for the operator on the other end.

🟡 Nits

  1. check_capacity visibility is pub(crate). Fine for now, but two of the three call sites that benefit (PUT handler, replication) live in different sub-trees and both go through an Arc<LmdbStorage> they already hold. If a fourth caller appears (e.g. a future self-check task or a metrics emitter), consider pub. Not blocking.

  2. Doc comment on check_capacity claims it is "free per-PUT on a healthy node". Strictly true only for the cache TTL window (DISK_CHECK_INTERVAL_SECS). After the TTL elapses, every PUT does a statvfs syscall. Worth tightening to "free per-PUT within the cache TTL" — the existing check_disk_space_cached docstring already says it correctly, so this is a consistency fix.

  3. The new PUT pre-check logs at info! with the peer address. That's good for incident response but consider whether the existing error!/warn! policy on the surrounding StorageFailed return path already covers it — if so, debug! is probably enough for the pre-check and reduces log volume on a hot path. (I am not strongly opposed to info!; flagging because the testnet cluster has 8 full hosts out of 40 and the per-PUT log will show up.)

  4. Test is unit-only. The e2e harness at tests/e2e/replication.rs has good coverage for the empty-PoP reject path; a small e2e test that fills the storage reserve and asserts a Rejected with the new pre-check reason would harden it. Not blocking — the unit test plus the testnet data are convincing.

  5. put_rpc log path doesn't get a disk_precheck paired entry on the successful path. That's correct (nothing to log on a hit), just noting that a counter at the same target (e.g. disk_precheck ok = N) would make "are we rejecting anything we shouldn't?" trivially answerable in Kibana. Optional.

Verdict

Approve once the reason-text wire change is acknowledged or revised. This is a textbook small, well-evidenced PR: cheap guard, correct placement, real testnet numbers, a regression test that proves the short-circuit, and explicit scope notes for the path it deliberately doesn't touch.

Happy to re-review after the should-fix lands or after CI goes green.

@jacderida jacderida changed the base branch from main to rc-2026.6.2 June 9, 2026 12:48
@jacderida jacderida merged commit 0d10a85 into WithAutonomi:rc-2026.6.2 Jun 9, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants