Skip to content

Fix race simplify logic#2

Merged
gabriel-samfira merged 2 commits into
cloudbase:mainfrom
gabriel-samfira:fix-race-simplify-logic
Jun 16, 2026
Merged

Fix race simplify logic#2
gabriel-samfira merged 2 commits into
cloudbase:mainfrom
gabriel-samfira:fix-race-simplify-logic

Conversation

@gabriel-samfira

Copy link
Copy Markdown
Member

This change fixes a number of potential race conditions and simplifies the loops. The agent (and the machine its running on) are expected to be disposed of once shut down. So I was lazy with cleanup. This fixes that.

On shutdown the daemon returned as soon as the done channel was closed, but the
goroutine that terminates the runner (keepRunnerAlive) was still running its
cleanup. The process could exit before the runner's process group was killed,
leaving the runner behind.

The service now tracks its goroutines with a WaitGroup and exposes a Wait
method. keepRunnerAlive signals the group only after its cleanup defer has run,
so both the Unix daemon and the Windows service handler now wait for the runner
to be torn down before they return.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
The service read runnerAlive in determineRunnerState, the cli field in loop and
Stop, and the connecting/connected channel fields from multiple goroutines
without consistent locking. This could cause stale reads, torn pointer values,
and a nil dereference if the websocket connected before the runner command was
created.

runnerAlive is now snapshotted under the lock and passed into determineRunnerState
and the status senders. The cli field is owned exclusively by cliMux: keepAliveLoop
uses its local after the write, loop reads through getClient, and Stop reads under
cliMux. The connecting/connected handshake channels are guarded by a dedicated
connMux with snapshot-before-select and two small helpers (connectionUp and
connectionDown). loop was restructured from a goto into a nested for loop so the
client snapshot does not tangle with label scoping.

A reconnect test (TestServiceReconnect) drives the full connect, drop, reconnect
cycle against an httptest websocket server under -race.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira gabriel-samfira merged commit d0b779e into cloudbase:main Jun 16, 2026
2 checks passed
@gabriel-samfira gabriel-samfira deleted the fix-race-simplify-logic branch June 16, 2026 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant