fix: disk-space pre-check before payment verification on full nodes#133
Conversation
A disk-full node previously ran full payment verification — ML-DSA pool checks, a Kademlia closeness lookup, and an on-chain Arbitrum RPC — on every PUT before discovering at storage time that it had no space, only to reject the chunk. At fleet scale this wasted RPC/network load and delayed the client's fallback to another peer (V2-411). Add a cheap disk-space pre-check that runs immediately after the already-exists check and before payment verification: - Expose `LmdbStorage::check_capacity()` wrapping the existing cached `check_disk_space_cached()` (passing results only, so freed space is still detected promptly; effectively free per-PUT). - In `handle_put_inner`, call it before `verify_payment`; on insufficient space, emit an info-level reject log (target `ant_node::storage::disk_precheck`) and return the existing `ProtocolError::StorageFailed` early, preserving client behaviour. - Keep the store-path check as defence-in-depth (pre-check↔store race, replication writes). The reject log is greppable for testnet verification. Backwards compatible: no wire/response-variant change. A dedicated OutOfCapacity/TryElsewhere response is left to V2-469. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a lightweight disk-capacity guard so full nodes can reject PUT/replication work before running expensive payment verification, reducing wasted RPC/network load and improving client failover latency (V2-411).
Changes:
- Add
LmdbStorage::check_capacity()as a cheap, cached disk-space pre-check. - Add early disk-space rejection in the PUT handler (
handle_put_inner) before payment verification. - Add the same early rejection in replication fresh-offer handling (
handle_fresh_offer) plus a regression test covering the PUT short-circuit.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/storage/lmdb.rs |
Introduces a public capacity pre-check wrapper around the existing cached disk-space guard. |
src/storage/handler.rs |
Adds a disk pre-check before payment verification in PUT flow and adds a regression test for early rejection. |
src/replication/mod.rs |
Adds a disk pre-check before payment verification for fresh replication offers, returning a rejection early. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… fresh offers DEV-02 testnet verification of the PUT-handler pre-check (the previous commit) showed full nodes still doing wasted payment verification on the replication push path: ~491 "EVM payment verified" log lines on the two full VMs, each followed by "Failed to store fetched record ...: Insufficient disk space". That work runs in handle_fresh_offer, which is not covered by the handle_put_inner reordering. Mirror V2-411 into the fresh-offer path: call LmdbStorage::check_capacity() immediately after the responsibility check and before verify_payment. On insufficient space, emit the same info-level reject log (target ant_node::storage::disk_precheck, with a replication-specific message) and return a Rejected response, instead of verifying a payment for a record the node can never store. The store-path check inside put() remains as defence-in-depth. The pull/fetch path (FetchResponse store) does no payment verification, so it is left unchanged. No behavioural change for nodes with free space. Verification: re-run the same small testnet and confirm the full hosts' "EVM payment verified" volume drops while disk_precheck reject logs cover the replication path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Narrow `LmdbStorage::check_capacity` from `pub` to `pub(crate)`: it is only
used in-crate (PUT handler + replication), so it should not widen the lib API.
- Drop the redundant "Storage error: " prefix from the replication rejection
reason: `Error::Storage` already renders as "storage error: ...", so
`format!("Storage error: {e}")` produced "Storage error: storage error: ...".
Use the error's own message instead, here and on the adjacent pre-existing
store-failure branch for consistency.
- Reword the PUT pre-check comment: the cached check is free per-PUT on a
healthy node, but a disk-full node re-runs a cheap `available_space` syscall
each PUT (still negligible next to the verification it avoids).
No behavioural change. cargo fmt/clippy/doc clean; replication 233/233,
storage 30/30 tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dirvine
left a comment
There was a problem hiding this comment.
Review: PR #133 — disk-space pre-check before payment verification
Scope: Focused on PR #133's own commits (
4c4f60dandf4e4ec5) onv2-411→main. CI: fmt ✅, clippy ✅, build (ubuntu/macos/windows) ✅, security audit ✅,Test (no logging)✅; the three platformTestjobs were still in progress at review time. Re-verify before merge.
TL;DR
LGTM with one should-fix and a few nits. The core idea is correct and well-evidenced by the testnet data in the PR body, the diff is small and localised, and the chosen error surface preserves the existing fallback contract. Approve once the wire-string change is acknowledged (or revised).
What the PR does
Adds a cheap, cached LmdbStorage::check_capacity() (just a wrapper over the existing check_disk_space_cached) and calls it on the two paths that previously verified-then-failed at the store layer:
- PUT handler (
handle_put_inner) — rejects withProtocolError::StorageFailedbeforeverify_payment(V2-411 root cause). - Replication fresh-offer path (
handle_fresh_offer) — sendsFreshReplicationResponse::Rejectedbefore the on-chainPaymentVerifier::verify_paymentcall.
handle_paid_notify is intentionally left alone (no store) and the pull/fetch path is unchanged (no payment). Defence-in-depth is preserved because LmdbStorage::put() keeps its own reserve check.
✅ Strengths
- Right scope. The fix is exactly the two wasted-work paths. The
handle_paid_notifyexclusion is correct and explicitly justified in the body. - Backwards compatible. No new response variant, no wire change other than the reason text in one reject path (see
⚠️ below). Client fallback behaviour is preserved. - Reuses an existing, already-bounded primitive.
check_capacity()is a 4-line wrapper overcheck_disk_space_cached(TTL-cached passing results, failing results always re-statted, singlefs2::available_spacesyscall on miss). No new background work, no new locks, no new async state. - Test is well-designed.
create_test_protocol_with_reserve(u64::MAX)makes the pre-check deterministically fail without fabricating LMDB state. AssertingStorageFailedcontaining"Insufficient disk space"plus!exists(&address)proves the short-circuit: the chunk is uncached, so reachingverify_paymentwould have producedPaymentRequired/PaymentFailed, notStorageFailed. The test name and docstring explain the negative-space reasoning clearly. - Testnet evidence is concrete. Run 1/Run 2 breakdown (PUT-path 70 vs replication-path 313 rejects; 100% of full-host
put_rpcshort-circuited; ~100× latency drop on full hosts;Stored chunk= 0) is the right level of proof for a change like this, and the discovery of the replication path during Run 1 is a nice iterative signal. - Observability. The
ant_node::storage::disk_prechecklog target gives a cheap way to detect over-eager rejection in production without re-instrumenting the call sites.
⚠️ Should-fix
1. FreshReplicationResponse::Rejected { reason } text changed on the post-store Err arm.
src/replication/mod.rs:1265 changes the existing format!("Storage error: {e}") to e.to_string(). The new value is "Insufficient disk space: …" (a more accurate, more specific message — good!), but this is a wire-text change for an existing reject path, and the PR body doesn't flag it. Two options:
- Acknowledge it explicitly in the PR body ("reason string on the post-store error path changes from
Storage error: …to the more specific underlying error; same variant, free-formStringfield, no schema change"). The reason is aString, so it's wire-compatible, but a downstream client that pattern-matches on the prefix would break. - Or: leave the old
format!("Storage error: {e}")and only use the rawe.to_string()on the new pre-check arm. That keeps the existing reject-path text stable and isolates the new wire text to the new arm.
Either is fine — pick one and document it. I lean toward acknowledging the change, since the new text is more informative for the operator on the other end.
🟡 Nits
-
check_capacityvisibility ispub(crate). Fine for now, but two of the three call sites that benefit (PUT handler, replication) live in different sub-trees and both go through anArc<LmdbStorage>they already hold. If a fourth caller appears (e.g. a future self-check task or a metrics emitter), considerpub. Not blocking. -
Doc comment on
check_capacityclaims it is "free per-PUT on a healthy node". Strictly true only for the cache TTL window (DISK_CHECK_INTERVAL_SECS). After the TTL elapses, every PUT does a statvfs syscall. Worth tightening to "free per-PUT within the cache TTL" — the existingcheck_disk_space_cacheddocstring already says it correctly, so this is a consistency fix. -
The new PUT pre-check logs at
info!with the peer address. That's good for incident response but consider whether the existingerror!/warn!policy on the surroundingStorageFailedreturn path already covers it — if so,debug!is probably enough for the pre-check and reduces log volume on a hot path. (I am not strongly opposed toinfo!; flagging because the testnet cluster has 8 full hosts out of 40 and the per-PUT log will show up.) -
Test is unit-only. The e2e harness at
tests/e2e/replication.rshas good coverage for the empty-PoP reject path; a small e2e test that fills the storage reserve and asserts aRejectedwith the new pre-check reason would harden it. Not blocking — the unit test plus the testnet data are convincing. -
put_rpclog path doesn't get adisk_precheckpaired entry on the successful path. That's correct (nothing to log on a hit), just noting that a counter at the same target (e.g.disk_precheck ok = N) would make "are we rejecting anything we shouldn't?" trivially answerable in Kibana. Optional.
Verdict
Approve once the reason-text wire change is acknowledged or revised. This is a textbook small, well-evidenced PR: cheap guard, correct placement, real testnet numbers, a regression test that proves the short-circuit, and explicit scope notes for the path it deliberately doesn't touch.
Happy to re-review after the should-fix lands or after CI goes green.
Resolves V2-411.
Why
Investigation of the PROD-UL-01 merkle upload failures found that a large fraction of the node
fleet sits below the 0.49 GiB disk reserve. A full node that receives a PUT it can never satisfy
still ran the expensive payment verification first — ML-DSA pool checks, a synchronous
Kademlia closeness lookup, and an on-chain Arbitrum RPC (the call that intermittently failed
against
arb1.arbitrum.ioduring the incident) — and only discovered the disk problem afterwardsat
storage.put. At fleet scale that is wasted RPC/network load and it delays the client'sfallback to another peer, contributing to the slow uploads observed.
The node is not taking money here — merkle/EVM payment is already settled on-chain before the
PUT; verification only validates a pre-paid proof. The issue is wasted work + latency, not lost
funds.
What
Add a cheap disk-space pre-check ahead of payment verification on the two paths where a full node
would otherwise verify-then-fail:
handle_put_inner): call a newLmdbStorage::check_capacity()immediatelyafter the already-exists check and before
verify_payment; on insufficient space, return theexisting
ProtocolError::StorageFailedearly.handle_fresh_offer): the same guard beforeverify_payment,returning a
Rejectedresponse. (Found via the testnet verification below — the PUT-handler fixalone left this path still doing wasted verification.)
Both emit an info-level reject log under target
ant_node::storage::disk_precheckfor log-basedverification.
check_capacity()wraps the existing cached check (passing results only, so freedspace is still detected promptly; effectively free per-PUT). The store-path check inside
LmdbStorage::put()is kept as defence-in-depth.Scope notes:
next peer on any
Err. A dedicatedOutOfCapacity/TryElsewhereresponse is left to V2-469.handle_paid_notify(the thirdverify_paymentcaller) is intentionally not guarded: itstores no chunk data, so there is no doomed store to skip.
Commits
fix(storage): disk-space pre-check before payment verification in PUTfix(replication): disk-space pre-check before payment verification on fresh offersTest plan
Local gates (all green):
cargo fmt --allcargo clippy --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_usedcargo test— storage handler 15/15 (incl. a new regression test asserting a disk-full PUT ofan uncached chunk returns the disk error before verification), replication 233/233.
Testnet verification (DEV-02, 40 nodes, 8 full @ 20%, this build), via Elasticsearch log queries
scoped to the known full hosts:
Run 1 (PUT-handler fix):
disk_precheckreject log fired only on the full hosts; zero on healthy hosts.put_rpccount ==disk_precheckcount (115 == 115) — 100% of client PUT RPCsrejected at the pre-check;
put_rpcwithduration_ms >= 30= 0;Stored chunk= 0.p99 937 ms / max 2148 ms (~100x), exactly the skipped verification work.
EVM payment verifiedfollowed by
Failed to store fetched record ...: Insufficient disk), motivating the second commit.Run 2 (both fixes):
disk_precheckrejects only on full hosts; breakdown PUT-path 70, replication-path 313(
Rejecting fresh replication offer before payment verification: ... Insufficient disk space).Stored chunkon full hosts = 0.EVM payment verifiedon full hosts traced structurally tohandle_paid_notify(no store) — i.e. no wasted store-path verification remains.
🤖 Generated with Claude Code