[diagnostic, do-not-merge] repro WAL advisory-lock race on master by grishasobol · Pull Request #5543 · gear-tech/gear

grishasobol · 2026-05-31T16:22:43Z

This is a diagnostic PR, not for merging.

It modifies `seven_validators_full_network_restart` so the WAL advisory-lock race
(`MalachiteService::shutdown` returns before the upstream Malachite WAL writer
`std::thread` releases `flock` on `consensus.wal`) is reliably surfaced on CI.

Expected outcome: `build / workspace (release)` fails with

```
service starts: building Malachite engine: Actor panicked during startup
'Failed to acquire exclusive advisory lock: the file is already locked'
```

at `ethexe/malachite/core/tests/multi_validators.rs:start_service`. The companion
branch `gsobol/ethexe/malachite-wal-fix` applies the production fix in
`MalachiteService::shutdown` and is expected to pass under the same conditions.

See https://github.com/gear-tech/gear/actions/runs/26678194627/job/78634373130 for the
original flake on master that this PR reproduces.
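For readers unfamiliar with the failure mode, here is a minimal sketch of the same shape of race in plain Rust. It is not Malachite or ethexe code: `demo_race`, `open_wal`, and `try_lock_exclusive` are hypothetical names, the `flock` binding and the `LOCK_EX`/`LOCK_NB` values are assumptions that hold on Linux and macOS, and the "shutdown" is just a channel acknowledgement standing in for the engine actor's `JoinHandle` resolving while the writer thread still owns the locked fd.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::Path;
use std::sync::mpsc;
use std::thread;

// flock(2) from the platform libc, which std already links on Unix.
extern "C" {
    fn flock(fd: c_int, operation: c_int) -> c_int;
}
const LOCK_EX: c_int = 2; // exclusive lock (Linux/macOS value, assumed)
const LOCK_NB: c_int = 4; // fail instead of blocking (Linux/macOS value, assumed)

/// Try to take an exclusive, non-blocking flock on `file`.
fn try_lock_exclusive(file: &File) -> io::Result<()> {
    if unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

/// Open (or create) the stand-in WAL file with a fresh file description.
fn open_wal(path: &Path) -> io::Result<File> {
    OpenOptions::new().read(true).write(true).create(true).open(path)
}

/// Returns (still locked right after "shutdown", free after the writer is joined).
fn demo_race() -> io::Result<(bool, bool)> {
    let path = std::env::temp_dir().join(format!("wal-race-demo-{}.wal", std::process::id()));
    let writer_file = open_wal(&path)?;
    try_lock_exclusive(&writer_file)?;

    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let (ack_tx, ack_rx) = mpsc::channel::<()>();
    let (linger_tx, linger_rx) = mpsc::channel::<()>();
    let writer = thread::spawn(move || {
        stop_rx.recv().ok();
        ack_tx.send(()).ok(); // the "actor" reports that it has stopped
        linger_rx.recv().ok(); // the writer is still alive (made explicit here)
        drop(writer_file); // the lock is released only when the fd closes
    });

    // Flawed shutdown: returns on the acknowledgement, not on writer exit.
    stop_tx.send(()).ok();
    ack_rx.recv().ok();
    let locked_after_shutdown = try_lock_exclusive(&open_wal(&path)?).is_err();

    // Correct ordering: let the writer finish and join it before probing.
    linger_tx.send(()).ok();
    writer.join().expect("writer thread panicked");
    let free_after_join = try_lock_exclusive(&open_wal(&path)?).is_ok();

    std::fs::remove_file(&path).ok();
    Ok((locked_after_shutdown, free_after_join))
}
```

In the real race the lingering window is a short timing gap rather than an explicit channel; the channel only makes the sketch deterministic. Because `flock` locks belong to the open file description, a second `open` of the same path conflicts with the writer's lock even inside one process.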

…ors_full_network_restart

Tightens the restart cycle (no inter-cohort sleep, two extra back-to-back shutdown+restart rounds on the same home dirs) so the race in MalachiteService::shutdown is reliably surfaced on CI. The WAL writer std::thread in arc-malachitebft-engine holds flock on consensus.wal beyond the engine actor's JoinHandle, so the next MalachiteService::new on the same base dir panics with 'Failed to acquire exclusive advisory lock'.

This commit is the repro half of the diagnosis; the production fix is on a sibling branch.

This branch only ever runs the workspace-release nextest of ethexe-malachite-core::multi_validators::seven_validators_full_network_restart, 5 iterations. Everything else (clippy, fmt, debug build, examples, docs, ethexe-cli, benchmarks, …) is gone so the WAL-lock race repro finishes in minutes.

Adds rapid_restart_exposes_wal_lock_race: 50 cycles of (start 3-validator cohort, shutdown, immediately probe the consensus.wal flock from a fresh fd).

The post-condition of MalachiteService::shutdown().await is that every file lock the service held is released by the time it returns. Without the fix, the upstream WAL writer std::thread is still alive after the engine actor's JoinHandle resolves and continues to hold flock, so the probe fails immediately with AlreadyLocked. Locally (release) this fails in cycle 0 within ~120ms, which is deterministic enough for CI to surface the race in seconds.

Companion fix sits on gsobol/ethexe/malachite-wal-fix.

Also reverts the seven_validators_full_network_restart churn tweak from the previous commit, since the probe test makes it redundant.
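The probe step can be sketched as below. This is an illustration, not the test in this branch and not the fix in the companion branch: `probe_unlocked` and `wait_until_unlocked` are hypothetical helpers, the raw `flock` binding and constants are assumptions valid on Linux and macOS, and the 5 ms poll interval is arbitrary. The first helper checks the post-condition once from a fresh fd; the second shows one way a caller could wait, with a deadline, for a lingering holder to let go.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

// flock(2) from the platform libc, which std already links on Unix.
extern "C" {
    fn flock(fd: c_int, operation: c_int) -> c_int;
}
const LOCK_EX: c_int = 2; // exclusive lock (Linux/macOS value, assumed)
const LOCK_NB: c_int = 4; // fail instead of blocking (Linux/macOS value, assumed)

/// Open a fresh fd on `path` and try an exclusive, non-blocking flock.
/// Ok(true) means nobody else holds the lock; Ok(false) means it is still held.
fn probe_unlocked(path: &Path) -> io::Result<bool> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    if unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) } == 0 {
        return Ok(true); // the probe's own lock goes away when `file` is closed
    }
    let err = io::Error::last_os_error();
    if err.kind() == io::ErrorKind::WouldBlock {
        Ok(false)
    } else {
        Err(err) // any other failure is a real I/O error, not contention
    }
}

/// Poll the probe until the lock is free or `timeout` elapses.
/// Ok(true) if the lock was observed free, Ok(false) on timeout.
fn wait_until_unlocked(path: &Path, timeout: Duration) -> io::Result<bool> {
    let deadline = Instant::now() + timeout;
    loop {
        if probe_unlocked(path)? {
            return Ok(true);
        }
        if Instant::now() >= deadline {
            return Ok(false);
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

A test in the style described above would call `probe_unlocked` straight after shutdown and assert `true`. Polling is only one option for a fix; joining the thread that owns the lock, where a handle to it is available, avoids polling altogether.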

grishasobol · 2026-05-31T17:28:56Z

Superseded by #5546 (the actual fix). This diagnostic branch only exists as a trail of how the WAL advisory-lock race was isolated.

grishasobol added the `ci: release` label (Run release build, `cargo --release`) May 31, 2026

grishasobol mentioned this pull request May 31, 2026

[diagnostic, do-not-merge] fix for WAL advisory-lock race + same repro #5544

Closed

grishasobol mentioned this pull request May 31, 2026

fix(ethexe/malachite-core): wait for WAL advisory lock release in MalachiteService::shutdown #5546

Merged

3 tasks

grishasobol closed this May 31, 2026

grishasobol wants to merge 3 commits into `master` from `gsobol/ethexe/malachite-wal-repro`
