[diagnostic, do-not-merge] repro WAL advisory-lock race on master #5543

Closed. grishasobol wants to merge 3 commits into master from gsobol/ethexe/malachite-wal-repro

@grishasobol (Member)

This is a diagnostic PR, not for merging.

It modifies `seven_validators_full_network_restart` so the WAL advisory-lock race
(`MalachiteService::shutdown` returns before the upstream Malachite WAL writer
`std::thread` releases `flock` on `consensus.wal`) is reliably surfaced on CI.

Expected outcome: `build / workspace (release)` fails with

```
service starts: building Malachite engine: Actor panicked during startup
'Failed to acquire exclusive advisory lock: the file is already locked'
```

at `ethexe/malachite/core/tests/multi_validators.rs:start_service`. The companion
branch `gsobol/ethexe/malachite-wal-fix` applies the production fix in
`MalachiteService::shutdown` and is expected to pass under the same conditions.
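
The shape of the race can be illustrated with a toy model. This is a minimal sketch, not the Malachite code: `ToyService`, its methods, and the `AtomicBool` standing in for the `flock` on `consensus.wal` are all hypothetical, and the 200 ms sleep only simulates writer teardown. It contrasts a shutdown that signals the writer thread and returns with one that also joins the thread before returning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a service whose background writer thread owns a lock.
/// The `AtomicBool` models "the advisory lock is currently held".
struct ToyService {
    stop_tx: Sender<()>,
    writer: JoinHandle<()>,
    lock_held: Arc<AtomicBool>,
}

impl ToyService {
    fn start() -> Self {
        let lock_held = Arc::new(AtomicBool::new(true));
        let (stop_tx, stop_rx) = channel::<()>();
        let flag = Arc::clone(&lock_held);
        let writer = thread::spawn(move || {
            // Block until asked to stop (or until the sender is dropped).
            let _ = stop_rx.recv();
            // Simulated teardown work that happens before the lock is released.
            thread::sleep(Duration::from_millis(200));
            flag.store(false, Ordering::SeqCst);
        });
        Self { stop_tx, writer, lock_held }
    }

    /// Racy variant: signals the writer and returns without waiting for it.
    /// Dropping the JoinHandle detaches the thread, which keeps running.
    fn shutdown_without_join(self) -> Arc<AtomicBool> {
        let _ = self.stop_tx.send(());
        self.lock_held
    }

    /// Joining variant: signals the writer, then waits for the thread to exit,
    /// so the lock is guaranteed to be released when this returns.
    fn shutdown_with_join(self) -> Arc<AtomicBool> {
        let _ = self.stop_tx.send(());
        self.writer.join().expect("writer thread panicked");
        self.lock_held
    }
}

fn lock_held_after_racy_shutdown() -> bool {
    ToyService::start().shutdown_without_join().load(Ordering::SeqCst)
}

fn lock_held_after_joined_shutdown() -> bool {
    ToyService::start().shutdown_with_join().load(Ordering::SeqCst)
}

fn main() {
    // The racy variant usually still reports the lock as held; the result is timing-dependent.
    println!("racy shutdown, lock still held: {}", lock_held_after_racy_shutdown());
    // The joining variant always reports the lock as released.
    println!("joined shutdown, lock still held: {}", lock_held_after_joined_shutdown());
}
```

Whether the companion branch fixes the real service by joining the writer thread or by some other means is not stated here; the sketch only shows why a restart that follows a non-joining shutdown can find the lock still taken.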

See https://github.com/gear-tech/gear/actions/runs/26678194627/job/78634373130 for the
original master-side flake this repros.

…ors_full_network_restart

Tightens the restart cycle (no inter-cohort sleep, two extra back-to-back
shutdown+restart rounds on the same home dirs) so the race in MalachiteService::shutdown
is reliably surfaced on CI: arc-malachitebft-engine's WAL writer std::thread holds
flock on consensus.wal beyond the engine actor's JoinHandle, so the next
MalachiteService::new on the same base dir panics with 'Failed to acquire
exclusive advisory lock'.

This commit is the repro half of the diagnosis; the production fix is on a
sibling branch.

@grishasobol added the `ci: release` label (Run release build, `cargo --release`) on May 31, 2026

This branch only ever runs the workspace-release nextest of
ethexe-malachite-core::multi_validators::seven_validators_full_network_restart,
5 iterations. Everything else (clippy, fmt, debug build, examples,
docs, ethexe-cli, benchmarks, …) is gone so the WAL-lock race repro
finishes in minutes.

Adds rapid_restart_exposes_wal_lock_race: 50 cycles of (start 3-validator
cohort, shutdown, immediately probe the consensus.wal flock from a fresh
fd). The post-condition of MalachiteService::shutdown().await is that
every file lock the service held is released by the time it returns;
without the fix the upstream WAL writer std::thread is still alive after
the engine actor's JoinHandle resolves and continues to hold flock, so
the probe fails immediately with AlreadyLocked.

Locally (release) this fails in cycle 0 within ~120ms — deterministic
enough for CI to surface the race in seconds. Companion fix sits on
gsobol/ethexe/malachite-wal-fix.

Also reverts the seven_validators_full_network_restart churn tweak from
the previous commit — the probe test makes it redundant.
@grishasobol (Member, Author)

Superseded by #5546 (the actual fix). This diagnostic branch only exists as a trail of how the WAL advisory-lock race was isolated.
