fix: disk-space pre-check before payment verification on full nodes by jacderida · Pull Request #133 · WithAutonomi/ant-node

jacderida · 2026-06-08T22:18:25Z

Resolves V2-411.

Why

Investigation of the PROD-UL-01 merkle upload failures found that a large fraction of the node
fleet sits below the 0.49 GiB disk reserve. A full node that receives a PUT it can never satisfy
still ran the expensive payment verification first — ML-DSA pool checks, a synchronous
Kademlia closeness lookup, and an on-chain Arbitrum RPC (the call that intermittently failed
against arb1.arbitrum.io during the incident) — and only discovered the disk problem afterwards
at storage.put. At fleet scale that is wasted RPC/network load and it delays the client's
fallback to another peer, contributing to the slow uploads observed.

The node is not taking money here — merkle/EVM payment is already settled on-chain before the
PUT; verification only validates a pre-paid proof. The issue is wasted work + latency, not lost
funds.

What

Add a cheap disk-space pre-check ahead of payment verification on the two paths where a full node
would otherwise verify-then-fail:

PUT handler (handle_put_inner): call a new LmdbStorage::check_capacity() immediately
after the already-exists check and before verify_payment; on insufficient space, return the
existing ProtocolError::StorageFailed early.
Replication fresh-offer path (handle_fresh_offer): the same guard before verify_payment,
returning a Rejected response. (Found via the testnet verification below — the PUT-handler fix
alone left this path still doing wasted verification.)

Both emit an info-level reject log under target ant_node::storage::disk_precheck for log-based
verification. check_capacity() wraps the existing cached check (passing results only, so freed
space is still detected promptly; effectively free per-PUT). The store-path check inside
LmdbStorage::put() is kept as defence-in-depth.

Scope notes:

Backwards compatible — no wire/response-variant change. The client already falls back to the
next peer on any Err. A dedicated OutOfCapacity/TryElsewhere response is left to V2-469.
handle_paid_notify (the third verify_payment caller) is intentionally not guarded: it
stores no chunk data, so there is no doomed store to skip.
The pull/fetch store path does no payment verification, so it is unchanged.

Commits

fix(storage): disk-space pre-check before payment verification in PUT
fix(replication): disk-space pre-check before payment verification on fresh offers

Test plan

Local gates (all green):

cargo fmt --all
cargo clippy --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used
cargo test — storage handler 15/15 (incl. a new regression test asserting a disk-full PUT of
an uncached chunk returns the disk error before verification), replication 233/233.

Testnet verification (DEV-02, 40 nodes, 8 full @ 20%, this build), via Elasticsearch log queries
scoped to the known full hosts:

Run 1 (PUT-handler fix):

disk_precheck reject log fired only on the full hosts; zero on healthy hosts.
On full hosts put_rpc count == disk_precheck count (115 == 115) — 100% of client PUT RPCs
rejected at the pre-check; put_rpc with duration_ms >= 30 = 0; Stored chunk = 0.
PUT latency collapsed: full hosts p50 1 ms / p99 8 ms / max 10 ms vs healthy p50 152 ms /
p99 937 ms / max 2148 ms (~100x), exactly the skipped verification work.
Revealed the replication push path still verified-then-failed (~491 EVM payment verified
followed by Failed to store fetched record ...: Insufficient disk), motivating the second commit.

Run 2 (both fixes):

disk_precheck rejects only on full hosts; breakdown PUT-path 70, replication-path 313
(Rejecting fresh replication offer before payment verification: ... Insufficient disk space).
PUT latency still collapsed: full hosts p50 0 ms / p99 2 ms vs healthy p50 146 ms / p99 534 ms;
Stored chunk on full hosts = 0.
Residual EVM payment verified on full hosts traced structurally to handle_paid_notify
(no store) — i.e. no wasted store-path verification remains.

🤖 Generated with Claude Code

A disk-full node previously ran full payment verification — ML-DSA pool checks, a Kademlia closeness lookup, and an on-chain Arbitrum RPC — on every PUT before discovering at storage time that it had no space, only to reject the chunk. At fleet scale this wasted RPC/network load and delayed the client's fallback to another peer (V2-411). Add a cheap disk-space pre-check that runs immediately after the already-exists check and before payment verification: - Expose `LmdbStorage::check_capacity()` wrapping the existing cached `check_disk_space_cached()` (passing results only, so freed space is still detected promptly; effectively free per-PUT). - In `handle_put_inner`, call it before `verify_payment`; on insufficient space, emit an info-level reject log (target `ant_node::storage::disk_precheck`) and return the existing `ProtocolError::StorageFailed` early, preserving client behaviour. - Keep the store-path check as defence-in-depth (pre-check↔store race, replication writes). The reject log is greppable for testnet verification. Backwards compatible: no wire/response-variant change. A dedicated OutOfCapacity/TryElsewhere response is left to V2-469. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds a lightweight disk-capacity guard so full nodes can reject PUT/replication work before running expensive payment verification, reducing wasted RPC/network load and improving client failover latency (V2-411).

Changes:

Add LmdbStorage::check_capacity() as a cheap, cached disk-space pre-check.
Add early disk-space rejection in the PUT handler (handle_put_inner) before payment verification.
Add the same early rejection in replication fresh-offer handling (handle_fresh_offer) plus a regression test covering the PUT short-circuit.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`src/storage/lmdb.rs`	Introduces a public capacity pre-check wrapper around the existing cached disk-space guard.
`src/storage/handler.rs`	Adds a disk pre-check before payment verification in PUT flow and adds a regression test for early rejection.
`src/replication/mod.rs`	Adds a disk pre-check before payment verification for fresh replication offers, returning a rejection early.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

… fresh offers DEV-02 testnet verification of the PUT-handler pre-check (the previous commit) showed full nodes still doing wasted payment verification on the replication push path: ~491 "EVM payment verified" log lines on the two full VMs, each followed by "Failed to store fetched record ...: Insufficient disk space". That work runs in handle_fresh_offer, which is not covered by the handle_put_inner reordering. Mirror V2-411 into the fresh-offer path: call LmdbStorage::check_capacity() immediately after the responsibility check and before verify_payment. On insufficient space, emit the same info-level reject log (target ant_node::storage::disk_precheck, with a replication-specific message) and return a Rejected response, instead of verifying a payment for a record the node can never store. The store-path check inside put() remains as defence-in-depth. The pull/fetch path (FetchResponse store) does no payment verification, so it is left unchanged. No behavioural change for nodes with free space. Verification: re-run the same small testnet and confirm the full hosts' "EVM payment verified" volume drops while disk_precheck reject logs cover the replication path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Narrow `LmdbStorage::check_capacity` from `pub` to `pub(crate)`: it is only used in-crate (PUT handler + replication), so it should not widen the lib API. - Drop the redundant "Storage error: " prefix from the replication rejection reason: `Error::Storage` already renders as "storage error: ...", so `format!("Storage error: {e}")` produced "Storage error: storage error: ...". Use the error's own message instead, here and on the adjacent pre-existing store-failure branch for consistency. - Reword the PUT pre-check comment: the cached check is free per-PUT on a healthy node, but a disk-full node re-runs a cheap `available_space` syscall each PUT (still negligible next to the verification it avoids). No behavioural change. cargo fmt/clippy/doc clean; replication 233/233, storage 30/30 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

dirvine

Review: PR #133 — disk-space pre-check before payment verification

Scope: Focused on PR #133's own commits (4c4f60d and f4e4ec5) on v2-411 → main. CI: fmt ✅, clippy ✅, build (ubuntu/macos/windows) ✅, security audit ✅, Test (no logging) ✅; the three platform Test jobs were still in progress at review time. Re-verify before merge.

TL;DR

LGTM with one should-fix and a few nits. The core idea is correct and well-evidenced by the testnet data in the PR body, the diff is small and localised, and the chosen error surface preserves the existing fallback contract. Approve once the wire-string change is acknowledged (or revised).

What the PR does

Adds a cheap, cached LmdbStorage::check_capacity() (just a wrapper over the existing check_disk_space_cached) and calls it on the two paths that previously verified-then-failed at the store layer:

PUT handler (handle_put_inner) — rejects with ProtocolError::StorageFailed before verify_payment (V2-411 root cause).
Replication fresh-offer path (handle_fresh_offer) — sends FreshReplicationResponse::Rejected before the on-chain PaymentVerifier::verify_payment call.

handle_paid_notify is intentionally left alone (no store) and the pull/fetch path is unchanged (no payment). Defence-in-depth is preserved because LmdbStorage::put() keeps its own reserve check.

✅ Strengths

Right scope. The fix is exactly the two wasted-work paths. The handle_paid_notify exclusion is correct and explicitly justified in the body.
Backwards compatible. No new response variant, no wire change other than the reason text in one reject path (see ⚠️ below). Client fallback behaviour is preserved.
Reuses an existing, already-bounded primitive. check_capacity() is a 4-line wrapper over check_disk_space_cached (TTL-cached passing results, failing results always re-statted, single fs2::available_space syscall on miss). No new background work, no new locks, no new async state.
Test is well-designed. create_test_protocol_with_reserve(u64::MAX) makes the pre-check deterministically fail without fabricating LMDB state. Asserting StorageFailed containing "Insufficient disk space" plus !exists(&address) proves the short-circuit: the chunk is uncached, so reaching verify_payment would have produced PaymentRequired/PaymentFailed, not StorageFailed. The test name and docstring explain the negative-space reasoning clearly.
Testnet evidence is concrete. Run 1/Run 2 breakdown (PUT-path 70 vs replication-path 313 rejects; 100% of full-host put_rpc short-circuited; ~100× latency drop on full hosts; Stored chunk = 0) is the right level of proof for a change like this, and the discovery of the replication path during Run 1 is a nice iterative signal.
Observability. The ant_node::storage::disk_precheck log target gives a cheap way to detect over-eager rejection in production without re-instrumenting the call sites.

⚠️ Should-fix

1. FreshReplicationResponse::Rejected { reason } text changed on the post-store Err arm.
src/replication/mod.rs:1265 changes the existing format!("Storage error: {e}") to e.to_string(). The new value is "Insufficient disk space: …" (a more accurate, more specific message — good!), but this is a wire-text change for an existing reject path, and the PR body doesn't flag it. Two options:

Acknowledge it explicitly in the PR body ("reason string on the post-store error path changes from Storage error: … to the more specific underlying error; same variant, free-form String field, no schema change"). The reason is a String, so it's wire-compatible, but a downstream client that pattern-matches on the prefix would break.
Or: leave the old format!("Storage error: {e}") and only use the raw e.to_string() on the new pre-check arm. That keeps the existing reject-path text stable and isolates the new wire text to the new arm.

Either is fine — pick one and document it. I lean toward acknowledging the change, since the new text is more informative for the operator on the other end.

🟡 Nits

check_capacity visibility is pub(crate). Fine for now, but two of the three call sites that benefit (PUT handler, replication) live in different sub-trees and both go through an Arc<LmdbStorage> they already hold. If a fourth caller appears (e.g. a future self-check task or a metrics emitter), consider pub. Not blocking.
Doc comment on check_capacity claims it is "free per-PUT on a healthy node". Strictly true only for the cache TTL window (DISK_CHECK_INTERVAL_SECS). After the TTL elapses, every PUT does a statvfs syscall. Worth tightening to "free per-PUT within the cache TTL" — the existing check_disk_space_cached docstring already says it correctly, so this is a consistency fix.
The new PUT pre-check logs at info! with the peer address. That's good for incident response but consider whether the existing error!/warn! policy on the surrounding StorageFailed return path already covers it — if so, debug! is probably enough for the pre-check and reduces log volume on a hot path. (I am not strongly opposed to info!; flagging because the testnet cluster has 8 full hosts out of 40 and the per-PUT log will show up.)
Test is unit-only. The e2e harness at tests/e2e/replication.rs has good coverage for the empty-PoP reject path; a small e2e test that fills the storage reserve and asserts a Rejected with the new pre-check reason would harden it. Not blocking — the unit test plus the testnet data are convincing.
put_rpc log path doesn't get a disk_precheck paired entry on the successful path. That's correct (nothing to log on a hit), just noting that a counter at the same target (e.g. disk_precheck ok = N) would make "are we rejecting anything we shouldn't?" trivially answerable in Kibana. Optional.

Verdict

Approve once the reason-text wire change is acknowledged or revised. This is a textbook small, well-evidenced PR: cheap guard, correct placement, real testnet numbers, a regression test that proves the short-circuit, and explicit scope notes for the path it deliberately doesn't touch.

Happy to re-review after the should-fix lands or after CI goes green.

Copilot AI review requested due to automatic review settings June 8, 2026 22:18

Copilot started reviewing on behalf of jacderida June 8, 2026 22:18 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread src/storage/lmdb.rs Outdated

Comment thread src/replication/mod.rs

Comment thread src/storage/handler.rs Outdated

jacderida force-pushed the v2-411 branch from 4c4f60d to 0dc5926 Compare June 8, 2026 22:23

Copilot AI review requested due to automatic review settings June 8, 2026 22:32

Copilot started reviewing on behalf of jacderida June 8, 2026 22:33 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

dirvine requested changes Jun 8, 2026

View reviewed changes

jacderida changed the base branch from main to rc-2026.6.2 June 9, 2026 12:48

mickvandijke approved these changes Jun 9, 2026

View reviewed changes

jacderida merged commit 0d10a85 into WithAutonomi:rc-2026.6.2 Jun 9, 2026
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: disk-space pre-check before payment verification on full nodes#133

fix: disk-space pre-check before payment verification on full nodes#133
jacderida merged 3 commits into
WithAutonomi:rc-2026.6.2from
jacderida:v2-411

jacderida commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

dirvine left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

jacderida commented Jun 8, 2026

Why

What

Commits

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

dirvine left a comment

Choose a reason for hiding this comment

Review: PR #133 — disk-space pre-check before payment verification

TL;DR

What the PR does

✅ Strengths

⚠️ Should-fix

🟡 Nits

Verdict

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants