apollo_l1_events,apollo_l1_events_config: chunk L1 getLogs range into bounded windows (M-25) #14602
Open
asaf-sw wants to merge 1 commit into
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.
… bounded windows (M-25)

The L1 events scraper fetched the entire range from its cursor to the latest (finality-adjusted) L1 block in a single events()/eth_getLogs request. After downtime, an L1 outage, or with a large startup rewind, that range can span thousands of blocks, materializing an unbounded Vec<L1Event>/Vec<Event> and a single oversized RPC request that most providers reject or that hits the 1s base-layer timeout. Either outcome wedges the scraper and (via the cyclic wrapper) misfires the primary-down-since alert.

Cap each fetch to a configurable max_blocks_per_fetch (default 1000, validated >= 1). fetch_events now requests scraping_start..=min(latest, start + cap - 1) and returns the L1BlockReference for the window end, so the cursor advances by exactly one window per poll and the steady-state loop drains a backlog over successive polls. The finality ceiling and the once-per-poll reorg check are preserved. initialize sends the first bounded window via the provider's initialize() and lets the steady loop drain the rest via add_events, so the provider's initialize-once contract is unchanged. Regenerated config_schema.json for the new field.

Assumptions (documented in code): default 1000 is conservative versus common public-RPC eth_getLogs caps (~1k-10k) and the 1s timeout; operators on private RPCs may raise it. Saturating arithmetic guards against a 0 cap even though validation rejects it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
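The window bound described in the commit message can be sketched as a small pure function. This is only an illustration, not the code in this PR: the helper name `window_end` and the use of bare `u64` block numbers are assumptions, and treating a 0 cap as a one-block window is one plausible reading of "saturating arithmetic guards against a 0 cap".

```rust
/// Hypothetical helper (not the PR's actual function): inclusive end of the
/// next fetch window, i.e. min(latest, scraping_start + cap - 1).
fn window_end(scraping_start: u64, latest: u64, max_blocks_per_fetch: u64) -> u64 {
    // saturating_sub keeps a 0 cap from underflowing (it degrades to a
    // one-block window here); saturating_add guards the top of the u64 range.
    let span = max_blocks_per_fetch.saturating_sub(1);
    scraping_start.saturating_add(span).min(latest)
}

fn main() {
    // Large backlog: the window is capped at 1000 blocks (100..=1099).
    assert_eq!(window_end(100, 5_000, 1_000), 1_099);
    // Small backlog: the finality-adjusted latest block is the ceiling.
    assert_eq!(window_end(100, 150, 1_000), 150);
    // A 0 cap (rejected by validation) still yields a valid one-block window.
    assert_eq!(window_end(100, 5_000, 0), 100);
}
```

The sketch does not handle `scraping_start > latest`; a caller would need to treat that case as "nothing to fetch" before building the range.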

M-25 — Chunk the L1 getLogs range into bounded windows
Security review finding: M-25 (Low–Medium). The L1 events scraper built the inclusive range `scraping_start..=latest_l1_block` and asked the base layer for all tracked events across that entire range in a single `events()`/`eth_getLogs` call, with no per-iteration cap and no chunking (a `// If this gets too high, send in batches.` comment acknowledged the gap). After downtime, an L1 outage, or with a large startup rewind, that range can span thousands of L1 blocks. The dominant failure mode is a liveness stall: the single call is wrapped in a 1 s timeout and incurs an N+1 per-event header fetch, so an oversized range times out and the scraper spins in its retry loop without ever advancing, and (via the cyclic wrapper) misfires the primary-down-since signal. Unbounded `Vec<L1Event>`/`Vec<Event>` materialization is a secondary memory risk.

What changed

- `l1_events_scraper_config.max_blocks_per_fetch: u64` (`apollo_l1_events_config`), default 1000, validated `>= 1`.
- `apollo_l1_events/src/l1_scraper.rs`: `fetch_events` now requests `scraping_start..=min(latest, scraping_start + max_blocks_per_fetch - 1)` and returns the `L1BlockReference` for the window end, so the cursor advances by exactly one window per poll and the steady-state loop drains a backlog over successive polls. The finality ceiling and the once-per-poll reorg check are preserved. `initialize` sends the first bounded window via the provider's `initialize()` and lets the steady loop drain the rest via `add_events`, so the provider's initialize-once contract is unchanged. Saturating arithmetic guards against a 0 cap even though validation rejects it.
- `l1_events_scraper_config.max_blocks_per_fetch: 1000` added to the hand-maintained `l1_events_scraper_config.json` and `replacer_l1_events_scraper_config.json` presets, and `config_schema.json` regenerated. A new required schema param missing from the presets would `MissingParam`-panic the deployed node at startup (CrashLoopBackOff in `system_test_hybrid`), which is exactly the bug that L-16 (apollo_l1_events: bound catch-up commit-block backlog with a cap and metric #14590) hit.

Assumptions (documented in code)

Default 1000 is conservative versus common public-RPC `eth_getLogs` caps (~1k–10k) and the 1 s timeout; operators on private RPCs may raise it.

Tests

- `SEED=0 cargo nextest run -p papyrus_base_layer -p apollo_l1_events -p apollo_l1_events_config`: 125 passed (new: range-cap, partial-window, finality-ceiling, multi-poll backlog drain).
- `cargo nextest run -p apollo_deployments` (incl. `deployment_files_are_up_to_date`) and `-p apollo_node_config` (`default_config_file_is_up_to_date`) both pass.

🤖 Generated with Claude Code
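For reviewers who want a feel for the multi-poll backlog drain that the new tests cover, here is a minimal standalone sketch. It is not taken from the PR's test suite: `Cursor`, `poll`, and `drain` are hypothetical stand-ins, block numbers are plain `u64` rather than `L1BlockReference`, and no base-layer calls or reorg checks are modeled.

```rust
/// Hypothetical stand-in for the scraper's cursor; not the real type.
struct Cursor {
    next_block: u64,
}

/// One poll: returns the inclusive window that would be fetched, or None when
/// the cursor is already past the (finality-adjusted) latest block.
fn poll(cursor: &mut Cursor, latest: u64, max_blocks_per_fetch: u64) -> Option<(u64, u64)> {
    if cursor.next_block > latest {
        return None;
    }
    let end = cursor
        .next_block
        .saturating_add(max_blocks_per_fetch.saturating_sub(1))
        .min(latest);
    let window = (cursor.next_block, end);
    // Advance by exactly one window. Assumes block numbers stay below u64::MAX.
    cursor.next_block = end + 1;
    Some(window)
}

/// Polls repeatedly until the backlog is drained, collecting every window.
fn drain(start: u64, latest: u64, max_blocks_per_fetch: u64) -> Vec<(u64, u64)> {
    let mut cursor = Cursor { next_block: start };
    let mut windows = Vec::new();
    while let Some(window) = poll(&mut cursor, latest, max_blocks_per_fetch) {
        windows.push(window);
    }
    windows
}

fn main() {
    // A 2450-block backlog drains in three polls: two full windows, one partial.
    let windows = drain(100, 2_549, 1_000);
    assert_eq!(windows, vec![(100, 1_099), (1_100, 2_099), (2_100, 2_549)]);
}
```

In this model each poll issues at most one bounded request, so the size of a single fetch no longer grows with the backlog; only the number of polls does.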