apollo_l1_events,apollo_l1_events_config: chunk L1 getLogs range into bounded windows (M-25) #14602

Open: asaf-sw wants to merge 1 commit into asaf/l13-l1-error-classification from asaf/m25-l1-getlogs-chunking

asaf-sw (Contributor) commented Jun 23, 2026

M-25 — Chunk the L1 getLogs range into bounded windows

Security review finding: M-25 (Low–Medium). The L1 events scraper built the inclusive range scraping_start..=latest_l1_block and requested all tracked events across that entire range in a single events() / eth_getLogs call, with no per-iteration cap and no chunking. A "// If this gets too high, send in batches." comment acknowledged the gap. After downtime, an L1 outage, or a large startup rewind, that range can span thousands of L1 blocks.

The dominant failure mode is a liveness stall: the single call, plus an N+1 per-event header fetch, runs under a 1 s timeout, so an oversized range times out and the scraper spins in its retry loop without ever advancing. Via the cyclic wrapper, this also misfires the primary-down-since signal. Unbounded Vec<L1Event> / Vec<Event> materialization is a secondary memory risk.

What changed

  • New config param l1_events_scraper_config.max_blocks_per_fetch: u64 (apollo_l1_events_config), default 1000, validated >= 1.
  • Bounded fetch window (apollo_l1_events/src/l1_scraper.rs): fetch_events now requests scraping_start..=min(latest, scraping_start + max_blocks_per_fetch - 1) and returns the L1BlockReference for the window end. The cursor therefore advances by exactly one window per poll, and the steady-state loop drains a backlog over successive polls. The finality ceiling and the once-per-poll reorg check are preserved. initialize sends the first bounded window via the provider's initialize() and lets the steady loop drain the rest via add_events, so the provider's initialize-once contract is unchanged. Saturating arithmetic guards against a cap of 0 even though validation rejects it.
  • Deployment presets updated. l1_events_scraper_config.max_blocks_per_fetch: 1000 was added to the hand-maintained l1_events_scraper_config.json and replacer_l1_events_scraper_config.json presets, and config_schema.json was regenerated. A new required schema param that is missing from the presets would cause a MissingParam panic at startup on the deployed node (CrashLoopBackOff in system_test_hybrid). This is exactly the bug hit by L-16 (#14590, "apollo_l1_events: bound catch-up commit-block backlog with a cap and metric").
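
The window arithmetic described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in l1_scraper.rs: the function name `fetch_window` and the use of plain `u64` block numbers are assumptions made for the example, and the real fetch_events also applies the finality ceiling and returns an L1BlockReference.

```rust
use std::ops::RangeInclusive;

/// Hypothetical helper (not the PR's API): the inclusive block range for the
/// next bounded fetch, starting at `scraping_start` and never passing `latest`.
fn fetch_window(
    scraping_start: u64,
    latest: u64,
    max_blocks_per_fetch: u64,
) -> RangeInclusive<u64> {
    // A window of N blocks starting at `start` ends at `start + N - 1`.
    // Saturating ops keep this defined for a cap of 0 (treated as a
    // one-block window) and for starts close to u64::MAX.
    let capped_end = scraping_start.saturating_add(max_blocks_per_fetch.saturating_sub(1));
    // If `scraping_start > latest`, the resulting range is simply empty.
    scraping_start..=capped_end.min(latest)
}
```

With a backlog larger than the cap, the window is exactly `max_blocks_per_fetch` blocks long; once the cursor is within one window of `latest`, the window shrinks to end at `latest`.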

Assumptions (documented in code)

Default 1000 is conservative versus common public-RPC eth_getLogs caps (~1k–10k) and the 1 s timeout; operators on private RPCs may raise it.
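
As a rough sketch of the default and the >= 1 validation, assuming a standalone struct rather than the real apollo_l1_events_config type: `ScraperConfigSketch` and its `validate` method are hypothetical stand-ins, and the actual crate presumably relies on its own config and validation machinery.

```rust
/// Hypothetical stand-in for the scraper config; only the new field is shown.
#[derive(Debug, Clone, PartialEq)]
struct ScraperConfigSketch {
    /// Maximum number of L1 blocks requested in one fetch.
    max_blocks_per_fetch: u64,
}

impl Default for ScraperConfigSketch {
    fn default() -> Self {
        // Default taken from the PR description.
        Self { max_blocks_per_fetch: 1000 }
    }
}

impl ScraperConfigSketch {
    /// Rejects a cap of 0, mirroring the "validated >= 1" rule.
    fn validate(&self) -> Result<(), String> {
        if self.max_blocks_per_fetch >= 1 {
            Ok(())
        } else {
            Err("max_blocks_per_fetch must be at least 1".to_string())
        }
    }
}
```

An operator on a private RPC would override the default with a larger value, for example `ScraperConfigSketch { max_blocks_per_fetch: 10_000 }`, which passes the same validation.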

Tests

SEED=0 cargo nextest run -p papyrus_base_layer -p apollo_l1_events -p apollo_l1_events_config: 125 passed (new tests: range-cap, partial-window, finality-ceiling, multi-poll backlog drain). cargo nextest run -p apollo_deployments (including deployment_files_are_up_to_date) and cargo nextest run -p apollo_node_config (default_config_file_is_up_to_date) both pass.
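
To illustrate what the multi-poll backlog drain case exercises, here is a small simulation under simplifying assumptions: `latest` stays fixed, the cursor is modeled as the next block to scrape, and `drain_windows` is a hypothetical name rather than a function from the PR or its test suite.

```rust
use std::ops::RangeInclusive;

/// Hypothetical simulation: the sequence of windows fetched, one per poll,
/// until the cursor passes `latest`. `latest` is held constant here, whereas
/// a live scraper would see it move between polls.
fn drain_windows(
    mut cursor: u64,
    latest: u64,
    max_blocks_per_fetch: u64,
) -> Vec<RangeInclusive<u64>> {
    // Treat a cap of 0 as 1 so that every poll makes progress.
    let cap = max_blocks_per_fetch.max(1);
    let mut windows = Vec::new();
    while cursor <= latest {
        let end = cursor.saturating_add(cap - 1).min(latest);
        windows.push(cursor..=end);
        // Advance past the window just fetched; stop if that would overflow.
        match end.checked_add(1) {
            Some(next) => cursor = next,
            None => break,
        }
    }
    windows
}
```

A 5000-block backlog with the default cap of 1000 drains in five polls, and one extra block adds a sixth, single-block window.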

🤖 Generated with Claude Code

asaf-sw commented Jun 23, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite.

… bounded windows (M-25)

The L1 events scraper fetched the entire range from its cursor to the latest
(finality-adjusted) L1 block in a single events()/eth_getLogs request. After
downtime, an L1 outage, or with a large startup rewind, that range can span
thousands of blocks, materializing an unbounded Vec<L1Event>/Vec<Event> and a
single oversized RPC request that most providers reject or that hits the 1s
base-layer timeout — wedging the scraper and (via the cyclic wrapper) misfiring
the primary-down-since alert.

Cap each fetch to a configurable max_blocks_per_fetch (default 1000, validated
>= 1). fetch_events now requests scraping_start..=min(latest, start + cap - 1)
and returns the L1BlockReference for the window end, so the cursor advances by
exactly one window per poll and the steady-state loop drains a backlog over
successive polls. The finality ceiling and the once-per-poll reorg check are
preserved. initialize sends the first bounded window via the provider's
initialize() and lets the steady loop drain the rest via add_events, so the
provider's initialize-once contract is unchanged. Regenerated config_schema.json
for the new field.

Assumptions (documented in code): default 1000 is conservative versus common
public-RPC eth_getLogs caps (~1k-10k) and the 1s timeout; operators on private
RPCs may raise it. Saturating arithmetic guards against a 0 cap even though
validation rejects it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
asaf-sw force-pushed the asaf/m25-l1-getlogs-chunking branch from 5910da2 to 1ddf871 on June 23, 2026 12:15
asaf-sw force-pushed the asaf/l13-l1-error-classification branch from 66ffac0 to 9821da4 on June 23, 2026 12:15