fix(pollers): prevent duplicate processing across overlapping cron runs#290
Draft
rbren wants to merge 1 commit into
Draft
fix(pollers): prevent duplicate processing across overlapping cron runs#290rbren wants to merge 1 commit into
rbren wants to merge 1 commit into
Conversation
…ocessing across overlapping cron runs
Both pollers had the same race: when a single poll cycle's network calls took
longer than the cron interval (e.g. ~60s for GitHub conversation creation),
the next cron tick fired before the first run had saved state. Two runs then
read identical state, both saw the same trigger comment/message as
unprocessed, and both spun up an OpenHands conversation + posted an
acknowledgement.
Fix (same pattern in both skills):
- Make save_state atomic via tmp file + os.replace().
- Add claim_comment/claim_message helper that re-reads disk and atomically
appends the comment/message ID to processed_comment_ids /
processed_message_keys *before* any side effect (conversation create,
GitHub/Slack ack, reaction, reply forward). Returns False if another
concurrent run already claimed the same item.
- End-of-run save_state now re-reads disk and merges the processed-ID set
and conversations map so claims written by an overlapping run are never
clobbered.
- Slack message keys are namespaced `{channel_id}:{msg_ts}` since Slack
ts values are unique per-channel, not globally.
Also fixes a packaging bug that caused intermittent run failures:
- SKILL.md syntax validation switched from `python3 -m py_compile` (writes
__pycache__/main.cpython-*.pyc) to `python3 -c "import ast; ast.parse(...)"`
(no .pyc).
- Tarball commands now `--exclude='__pycache__' --exclude='*.pyc'` as
defense in depth. The stray .pyc was occasionally failing to decompress
in the sandbox with "gzip decompression failed".
Other small changes:
- Slack scripts/main.py wrapped its bottom `main()` call in `if __name__ ==
'__main__':` so the module is importable for tests. Runtime behavior is
unchanged when invoked as `python3 main.py`.
- Added slack-channel-monitor/tests/test_main.py covering atomic save and
claim_message race semantics (8 tests).
- Extended github-repo-monitor/tests/test_main.py with 5 new tests for the
same behavior. Full suite: 44 tests pass.
- references/state-schema.md updated in both skills to document the
claim-before-process pattern.
Co-authored-by: openhands <openhands@all-hands.dev>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes a race condition in both the
github-repo-monitorandslack-channel-monitorcron-based polling skills where a single trigger comment / message could spin up two OpenHands conversations (and post two acknowledgements) when the previous poll run hadn't finished saving state before the next cron tick fired.Also fixes a packaging bug that caused intermittent
gzip decompression failedrun failures.How I found it
Set up the github-repo-monitor locally on
OpenHands/extensionswith a 1-minute cron. A single trigger comment on PR #288 produced two🤖 OpenHands is on it!acks, two distinct conversation IDs, and two completion summaries. Run timeline:__pycache__/main.cpython-312.pyc7823f1…, posted ack #1 at 15:17:16 (run took ~1m9s)4376b5…, posted ack #2 at 15:17:17The 15:16 run was still in flight when 15:17 fired. Both reads of
processed_comment_idsshowed the comment as unprocessed, so both created conversations. The slack-channel-monitor has the identical pattern.The race
Original flow in both pollers:
When step ① to ② took longer than the cron interval (60s), two runs would race on identical state.
The fix (same pattern in both skills)
save_state— write to{path}.tmp.{pid}andos.replace()so readers never see a half-written file.claim_comment/claim_messagehelper — re-reads disk, atomically appends the ID toprocessed_comment_ids/processed_message_keys, returnsTrueonly if this run won the claim. Called before any side effect.save_statere-reads & merges the processed-ID set and conversations dict so claims written by an overlapping run aren't clobbered by a stale in-memory write.Slack message keys are namespaced
{channel_id}:{msg_ts}because Slacktsis unique per-channel, not globally.Residual race window
The claim helper is read-then-write without an OS lock, so two runs hitting the same millisecond can both think they won. For 1-minute (or larger) cron schedules this window is effectively zero. The docstrings flag
fcntl.flockas the upgrade path if a stronger guarantee is ever needed.Packaging fix (root cause of those two FAILED runs above)
SKILL.mdsyntax validation switched frompython3 -m py_compile(which writes__pycache__/main.cpython-*.pyc) topython3 -c "import ast; ast.parse(…)". No more stray pyc.tar --exclude='__pycache__' --exclude='*.pyc'as defense in depth.Misc
scripts/main.pywrapped its bottommain()call inif __name__ == "__main__":so the module can be imported by tests. Runtime behavior when invoked aspython3 main.pyis unchanged.slack-channel-monitor/tests/test_main.py(8 tests) — slack had no test suite before.github-repo-monitor/tests/test_main.pywith 5 new tests.Test results
Files
skills/github-repo-monitor/scripts/main.py— atomic save, claim_comment, end-of-run mergeskills/github-repo-monitor/SKILL.md— ast.parse validation +--excludetarballskills/github-repo-monitor/references/state-schema.md— documents the claim-before-process patternskills/github-repo-monitor/tests/test_main.py— 5 new testsskills/slack-channel-monitor/scripts/main.py— same fixes +__main__guardskills/slack-channel-monitor/SKILL.md— same packaging fixesskills/slack-channel-monitor/references/state-schema.md— same docsskills/slack-channel-monitor/tests/test_main.py— new file (8 tests)