fix(pollers): prevent duplicate processing across overlapping cron runs by rbren · Pull Request #290 · OpenHands/extensions

rbren · 2026-06-02T15:34:21Z

What

Fixes a race condition in both the github-repo-monitor and slack-channel-monitor cron-based polling skills where a single trigger comment / message could spin up two OpenHands conversations (and post two acknowledgements) when the previous poll run hadn't finished saving state before the next cron tick fired.

Also fixes a packaging bug that caused intermittent gzip decompression failed run failures.

This PR was authored by an AI agent (OpenHands) on behalf of @rbren.

How I found it

Set up the github-repo-monitor locally on OpenHands/extensions with a 1-minute cron. A single trigger comment on PR #288 produced two 🤖 OpenHands is on it! acks, two distinct conversation IDs, and two completion summaries. Run timeline:

Time (UTC)	Run status	Side effects
15:13:22	(user comment posted)	—
15:14:07	FAILED	tar error on stray `__pycache__/main.cpython-312.pyc`
15:15:07	FAILED	same tar error
15:16:07	COMPLETED	created conv `7823f1…`, posted ack #1 at 15:17:16 (run took ~1m9s)
15:17:07	COMPLETED	created conv `4376b5…`, posted ack #2 at 15:17:17

The 15:16 run was still in flight when 15:17 fired. Both reads of processed_comment_ids showed the comment as unprocessed, so both created conversations. The slack-channel-monitor has the identical pattern.

The race

Original flow in both pollers:

state = load_state(path)                  # ① read processed IDs
# ... poll API, for each new trigger:
#     create_conversation(...)            # side effect, slow network call
#     post_ack_comment(...)               # side effect
save_state(path, state)                   # ② persist at the end

When step ① to ② took longer than the cron interval (60s), two runs would race on identical state.

The fix (same pattern in both skills)

Atomic save_state — write to {path}.tmp.{pid} and os.replace() so readers never see a half-written file.
New claim_comment / claim_message helper — re-reads disk, atomically appends the ID to processed_comment_ids / processed_message_keys, returns True only if this run won the claim. Called before any side effect.
End-of-run save_state re-reads & merges the processed-ID set and conversations dict so claims written by an overlapping run aren't clobbered by a stale in-memory write.

Slack message keys are namespaced {channel_id}:{msg_ts} because Slack ts is unique per-channel, not globally.

Residual race window

The claim helper is read-then-write without an OS lock, so two runs hitting the same millisecond can both think they won. For 1-minute (or larger) cron schedules this window is effectively zero. The docstrings flag fcntl.flock as the upgrade path if a stronger guarantee is ever needed.

Packaging fix (root cause of those two FAILED runs above)

SKILL.md syntax validation switched from python3 -m py_compile (which writes __pycache__/main.cpython-*.pyc) to python3 -c "import ast; ast.parse(…)". No more stray pyc.
Tarball commands now use tar --exclude='__pycache__' --exclude='*.pyc' as defense in depth.

Misc

Slack scripts/main.py wrapped its bottom main() call in if __name__ == "__main__": so the module can be imported by tests. Runtime behavior when invoked as python3 main.py is unchanged.
New slack-channel-monitor/tests/test_main.py (8 tests) — slack had no test suite before.
Extended github-repo-monitor/tests/test_main.py with 5 new tests.

Test results

github-repo-monitor: 44 tests pass (5 new)
slack-channel-monitor: 8 tests pass (new file)

Files

skills/github-repo-monitor/scripts/main.py — atomic save, claim_comment, end-of-run merge
skills/github-repo-monitor/SKILL.md — ast.parse validation + --exclude tarball
skills/github-repo-monitor/references/state-schema.md — documents the claim-before-process pattern
skills/github-repo-monitor/tests/test_main.py — 5 new tests
skills/slack-channel-monitor/scripts/main.py — same fixes + __main__ guard
skills/slack-channel-monitor/SKILL.md — same packaging fixes
skills/slack-channel-monitor/references/state-schema.md — same docs
skills/slack-channel-monitor/tests/test_main.py — new file (8 tests)

…ocessing across overlapping cron runs Both pollers had the same race: when a single poll cycle's network calls took longer than the cron interval (e.g. ~60s for GitHub conversation creation), the next cron tick fired before the first run had saved state. Two runs then read identical state, both saw the same trigger comment/message as unprocessed, and both spun up an OpenHands conversation + posted an acknowledgement. Fix (same pattern in both skills): - Make save_state atomic via tmp file + os.replace(). - Add claim_comment/claim_message helper that re-reads disk and atomically appends the comment/message ID to processed_comment_ids / processed_message_keys *before* any side effect (conversation create, GitHub/Slack ack, reaction, reply forward). Returns False if another concurrent run already claimed the same item. - End-of-run save_state now re-reads disk and merges the processed-ID set and conversations map so claims written by an overlapping run are never clobbered. - Slack message keys are namespaced `{channel_id}:{msg_ts}` since Slack ts values are unique per-channel, not globally. Also fixes a packaging bug that caused intermittent run failures: - SKILL.md syntax validation switched from `python3 -m py_compile` (writes __pycache__/main.cpython-*.pyc) to `python3 -c "import ast; ast.parse(...)"` (no .pyc). - Tarball commands now `--exclude='__pycache__' --exclude='*.pyc'` as defense in depth. The stray .pyc was occasionally failing to decompress in the sandbox with "gzip decompression failed". Other small changes: - Slack scripts/main.py wrapped its bottom `main()` call in `if __name__ == '__main__':` so the module is importable for tests. Runtime behavior is unchanged when invoked as `python3 main.py`. - Added slack-channel-monitor/tests/test_main.py covering atomic save and claim_message race semantics (8 tests). - Extended github-repo-monitor/tests/test_main.py with 5 new tests for the same behavior. Full suite: 44 tests pass. - references/state-schema.md updated in both skills to document the claim-before-process pattern. Co-authored-by: openhands <openhands@all-hands.dev>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pollers): prevent duplicate processing across overlapping cron runs#290

fix(pollers): prevent duplicate processing across overlapping cron runs#290
rbren wants to merge 1 commit into
mainfrom
fix/poller-concurrent-runs-race

rbren commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rbren commented Jun 2, 2026

What

How I found it

The race

The fix (same pattern in both skills)

Residual race window

Packaging fix (root cause of those two FAILED runs above)

Misc

Test results

Files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant