Skip to content

fix(pollers): prevent duplicate processing across overlapping cron runs#290

Draft
rbren wants to merge 1 commit into
mainfrom
fix/poller-concurrent-runs-race
Draft

fix(pollers): prevent duplicate processing across overlapping cron runs#290
rbren wants to merge 1 commit into
mainfrom
fix/poller-concurrent-runs-race

Conversation

@rbren

@rbren rbren commented Jun 2, 2026

Copy link
Copy Markdown
Member

What

Fixes a race condition in both the github-repo-monitor and slack-channel-monitor cron-based polling skills where a single trigger comment / message could spin up two OpenHands conversations (and post two acknowledgements) when the previous poll run hadn't finished saving state before the next cron tick fired.

Also fixes a packaging bug that caused intermittent gzip decompression failed run failures.

This PR was authored by an AI agent (OpenHands) on behalf of @rbren.

How I found it

Set up the github-repo-monitor locally on OpenHands/extensions with a 1-minute cron. A single trigger comment on PR #288 produced two 🤖 OpenHands is on it! acks, two distinct conversation IDs, and two completion summaries. Run timeline:

Time (UTC) Run status Side effects
15:13:22 (user comment posted)
15:14:07 FAILED tar error on stray __pycache__/main.cpython-312.pyc
15:15:07 FAILED same tar error
15:16:07 COMPLETED created conv 7823f1…, posted ack #1 at 15:17:16 (run took ~1m9s)
15:17:07 COMPLETED created conv 4376b5…, posted ack #2 at 15:17:17

The 15:16 run was still in flight when 15:17 fired. Both reads of processed_comment_ids showed the comment as unprocessed, so both created conversations. The slack-channel-monitor has the identical pattern.

The race

Original flow in both pollers:

state = load_state(path)                  # ① read processed IDs
# ... poll API, for each new trigger:
#     create_conversation(...)            # side effect, slow network call
#     post_ack_comment(...)               # side effect
save_state(path, state)                   # ② persist at the end

When step ① to ② took longer than the cron interval (60s), two runs would race on identical state.

The fix (same pattern in both skills)

  1. Atomic save_state — write to {path}.tmp.{pid} and os.replace() so readers never see a half-written file.
  2. New claim_comment / claim_message helper — re-reads disk, atomically appends the ID to processed_comment_ids / processed_message_keys, returns True only if this run won the claim. Called before any side effect.
  3. End-of-run save_state re-reads & merges the processed-ID set and conversations dict so claims written by an overlapping run aren't clobbered by a stale in-memory write.

Slack message keys are namespaced {channel_id}:{msg_ts} because Slack ts is unique per-channel, not globally.

Residual race window

The claim helper is read-then-write without an OS lock, so two runs hitting the same millisecond can both think they won. For 1-minute (or larger) cron schedules this window is effectively zero. The docstrings flag fcntl.flock as the upgrade path if a stronger guarantee is ever needed.

Packaging fix (root cause of those two FAILED runs above)

  • SKILL.md syntax validation switched from python3 -m py_compile (which writes __pycache__/main.cpython-*.pyc) to python3 -c "import ast; ast.parse(…)". No more stray pyc.
  • Tarball commands now use tar --exclude='__pycache__' --exclude='*.pyc' as defense in depth.

Misc

  • Slack scripts/main.py wrapped its bottom main() call in if __name__ == "__main__": so the module can be imported by tests. Runtime behavior when invoked as python3 main.py is unchanged.
  • New slack-channel-monitor/tests/test_main.py (8 tests) — slack had no test suite before.
  • Extended github-repo-monitor/tests/test_main.py with 5 new tests.

Test results

github-repo-monitor: 44 tests pass (5 new)
slack-channel-monitor: 8 tests pass (new file)

Files

  • skills/github-repo-monitor/scripts/main.py — atomic save, claim_comment, end-of-run merge
  • skills/github-repo-monitor/SKILL.md — ast.parse validation + --exclude tarball
  • skills/github-repo-monitor/references/state-schema.md — documents the claim-before-process pattern
  • skills/github-repo-monitor/tests/test_main.py — 5 new tests
  • skills/slack-channel-monitor/scripts/main.py — same fixes + __main__ guard
  • skills/slack-channel-monitor/SKILL.md — same packaging fixes
  • skills/slack-channel-monitor/references/state-schema.md — same docs
  • skills/slack-channel-monitor/tests/test_main.py — new file (8 tests)

…ocessing across overlapping cron runs

Both pollers had the same race: when a single poll cycle's network calls took
longer than the cron interval (e.g. ~60s for GitHub conversation creation),
the next cron tick fired before the first run had saved state. Two runs then
read identical state, both saw the same trigger comment/message as
unprocessed, and both spun up an OpenHands conversation + posted an
acknowledgement.

Fix (same pattern in both skills):

- Make save_state atomic via tmp file + os.replace().
- Add claim_comment/claim_message helper that re-reads disk and atomically
  appends the comment/message ID to processed_comment_ids /
  processed_message_keys *before* any side effect (conversation create,
  GitHub/Slack ack, reaction, reply forward). Returns False if another
  concurrent run already claimed the same item.
- End-of-run save_state now re-reads disk and merges the processed-ID set
  and conversations map so claims written by an overlapping run are never
  clobbered.
- Slack message keys are namespaced `{channel_id}:{msg_ts}` since Slack
  ts values are unique per-channel, not globally.

Also fixes a packaging bug that caused intermittent run failures:

- SKILL.md syntax validation switched from `python3 -m py_compile` (writes
  __pycache__/main.cpython-*.pyc) to `python3 -c "import ast; ast.parse(...)"`
  (no .pyc).
- Tarball commands now `--exclude='__pycache__' --exclude='*.pyc'` as
  defense in depth. The stray .pyc was occasionally failing to decompress
  in the sandbox with "gzip decompression failed".

Other small changes:

- Slack scripts/main.py wrapped its bottom `main()` call in `if __name__ ==
  '__main__':` so the module is importable for tests. Runtime behavior is
  unchanged when invoked as `python3 main.py`.
- Added slack-channel-monitor/tests/test_main.py covering atomic save and
  claim_message race semantics (8 tests).
- Extended github-repo-monitor/tests/test_main.py with 5 new tests for the
  same behavior. Full suite: 44 tests pass.
- references/state-schema.md updated in both skills to document the
  claim-before-process pattern.

Co-authored-by: openhands <openhands@all-hands.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant