[diagnostic, do-not-merge] repro WAL advisory-lock race on master#5543
Closed
grishasobol wants to merge 3 commits into
…ors_full_network_restart Tightens the restart cycle (no inter-cohort sleep, two extra back-to-back shutdown+restart rounds on the same home dirs) so the race in MalachiteService::shutdown is reliably surfaced on CI: arc-malachitebft-engine's WAL writer std::thread holds flock on consensus.wal beyond the engine actor's JoinHandle, so the next MalachiteService::new on the same base dir panics with 'Failed to acquire exclusive advisory lock'. This commit is the repro half of the diagnosis; the production fix is on a sibling branch.
This branch only ever runs the workspace-release nextest of ethexe-malachite-core::multi_validators::seven_validators_full_network_restart, 5 iterations. Everything else (clippy, fmt, debug build, examples, docs, ethexe-cli, benchmarks, …) is gone so the WAL-lock race repro finishes in minutes.
Adds rapid_restart_exposes_wal_lock_race: 50 cycles of (start 3-validator cohort, shutdown, immediately probe the consensus.wal flock from a fresh fd). The post-condition of MalachiteService::shutdown().await is that every file lock the service held is released by the time it returns; without the fix the upstream WAL writer std::thread is still alive after the engine actor's JoinHandle resolves and continues to hold flock, so the probe fails immediately with AlreadyLocked. Locally (release) this fails in cycle 0 within ~120ms — deterministic enough for CI to surface the race in seconds. Companion fix sits on gsobol/ethexe/malachite-wal-fix. Also reverts the seven_validators_full_network_restart churn tweak from the previous commit — the probe test makes it redundant.
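For readers without the repo at hand, the probe cycle described above can be modelled with the standard library alone. This is a minimal sketch, not the test from this branch: `Service`, `start`, `shutdown`, `join_writer` and `first_failing_cycle` are hypothetical names, an `AtomicBool` stands in for the `flock` on `consensus.wal`, and the 20 ms sleep stands in for whatever the writer thread still does after the stop signal. Joining the writer is one way to satisfy the post-condition; the actual change on the companion branch may differ.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the service. The shared `AtomicBool` models the
/// advisory lock on consensus.wal (assumption: no real file lock is taken).
struct Service {
    stop: Sender<()>,
    writer: JoinHandle<()>,
}

impl Service {
    /// Acquires the modelled lock and spawns the writer thread that owns it.
    /// Fails if the lock is still held, mirroring the startup panic.
    fn start(locked: &Arc<AtomicBool>) -> Result<Service, &'static str> {
        if locked
            .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
            .is_err()
        {
            return Err("the file is already locked");
        }
        let (stop, stop_rx) = mpsc::channel();
        let lock = Arc::clone(locked);
        let writer = thread::spawn(move || {
            let _ = stop_rx.recv(); // wait for the shutdown signal
            thread::sleep(Duration::from_millis(20)); // simulated final work
            lock.store(false, Ordering::SeqCst); // release the modelled lock
        });
        Ok(Service { stop, writer })
    }

    /// With `join_writer == false` the handle is dropped and the thread is
    /// detached, so this returns while the lock is still held.
    fn shutdown(self, join_writer: bool) {
        let _ = self.stop.send(());
        if join_writer {
            self.writer.join().expect("writer thread panicked");
        }
    }
}

/// Runs start/shutdown/probe cycles and returns the first cycle in which the
/// lock is still held after shutdown (or in which start fails), if any.
fn first_failing_cycle(cycles: usize, join_writer: bool) -> Option<usize> {
    let locked = Arc::new(AtomicBool::new(false));
    for cycle in 0..cycles {
        let service = match Service::start(&locked) {
            Ok(service) => service,
            Err(_) => return Some(cycle),
        };
        service.shutdown(join_writer);
        // Probe immediately, as the test does with a fresh fd.
        if locked.load(Ordering::SeqCst) {
            return Some(cycle);
        }
    }
    None
}

fn main() {
    println!("without join: {:?}", first_failing_cycle(50, false));
    println!("with join:    {:?}", first_failing_cycle(50, true));
}
```

In this model the variant that does not join is expected to fail on the first probe, because the probe runs long before the writer's simulated work finishes, while the joining variant passes every cycle by construction.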
Superseded by #5546 (the actual fix). This diagnostic branch only exists as a trail of how the WAL advisory-lock race was isolated.
This is a diagnostic PR, not for merging.
It modifies `seven_validators_full_network_restart` so the WAL advisory-lock race
(`MalachiteService::shutdown` returns before the upstream Malachite WAL writer
`std::thread` releases `flock` on `consensus.wal`) is reliably surfaced on CI.
Expected outcome: `build / workspace (release)` fails with
```
service starts: building Malachite engine: Actor panicked during startup
'Failed to acquire exclusive advisory lock: the file is already locked'
```
at `ethexe/malachite/core/tests/multi_validators.rs:start_service`. The companion
branch `gsobol/ethexe/malachite-wal-fix` applies the production fix in
`MalachiteService::shutdown` and is expected to pass under the same conditions.
See https://github.com/gear-tech/gear/actions/runs/26678194627/job/78634373130 for the
original master-side flake this repros.