fix: make MergeRun wg.Wait() and CnServerMessageHandler connection wait cancelable #25035

Draft
ck89119 wants to merge 2 commits into matrixorigin:3.0-dev from ck89119:issue-25025-3.0-dev

ck89119 (Contributor) commented Jun 17, 2026

Summary

Fix a distributed INSERT...SELECT deadlock in which a stalled cross-CN data stream causes Scope.MergeRun to block forever in wg.Wait(), leaving the query unkillable and its table locks unreleasable without a CN restart.

Root Cause

Three blocking points form a cascade that ignores KILL/context cancellation:

  1. sendNotifyMessage's closeWithError blocks on reg.Ch2 <- signal when the pipeline consumer has stopped
  2. MergeRun's deferred wg.Wait() never returns because the sub-routines can't call wg.Done()
  3. CnServerMessageHandler waits on <-receiver.connectionCtx.Done(), which fires only on TCP close and is not cancelable

Changes

| File | Change |
| --- | --- |
| scope.go | closeWithError: reg.Ch2 <- changed to select with ctx.Done() |
| scope.go | MergeRun defer: wg.Wait() wrapped in goroutine + select with ctx.Done() |
| remoterunServer.go | connection wait: added select with messageCtx.Done() |

Tests

3 new unit tests + 3 existing tests all pass:

  • TestCloseWithErrorContextCancel — verify select doesn't block on Ch2
  • TestCnConnectionWaitContextCancel — verify connection wait observes context cancel
  • TestMergeRunWgWaitCancelable — verify MergeRun returns after ctx cancel

Issue

Fixes #25025

🤖 Generated with Claude Code

…it cancelable (matrixorigin#25025)

Three fixes to prevent distributed query deadlock when cross-CN data
stream stalls:

1. MergeRun defer wg.Wait(): use goroutine+select with ctx.Done() to
   avoid blocking forever when sub-routines fail to call wg.Done().

2. sendNotifyMessage closeWithError: use select for Ch2 send to avoid
   blocking when pipeline consumer has stopped.

3. CnServerMessageHandler connection wait: observe messageCtx.Done()
   in addition to connectionCtx.Done() so killed queries don't leave
   handlers blocked waiting for TCP close.

Previously KILL/cancel had no effect on these blocking points, making
the query unkillable and table locks unreleasable without CN restart.

Co-Authored-By: Claude <noreply@anthropic.com>

Labels

size/M Denotes a PR that changes [100,499] lines
