Skip to content

Exempt SSE endpoints from the 60s request timeout middleware#287

Closed
hiroTamada wants to merge 1 commit into
mainfrom
hypeship/exempt-sse-from-request-timeout
Closed

Exempt SSE endpoints from the 60s request timeout middleware#287
hiroTamada wants to merge 1 commit into
mainfrom
hypeship/exempt-sse-from-request-timeout

Conversation

@hiroTamada

@hiroTamada hiroTamada commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

  • middleware.Timeout(60 * time.Second) on the authenticated route group cancels the request context after 60s, which hard-kills SSE streams (/builds/{id}/events, /instances/{id}/logs) even while events are actively flowing.
  • For build events this is a correctness bug: a consumer following a build that takes >60s (including post-build image conversion) sees the stream close while the build status is still building, and misclassifies a successful build as timed out. In production this caused deployments to be marked failed at exactly stream-start+60s while the underlying build succeeded seconds later.
  • Fix: skip the timeout middleware for paths ending in /logs or /events, mirroring the existing SSE skip in the HTTP metrics middleware directly above. All other routes keep the 60s timeout.

Test plan

  • go build ./cmd/api and go vet ./cmd/api pass
  • Follow a >60s build via GET /builds/{id}/events?follow=true and confirm the stream stays open until the terminal status event
  • Confirm non-streaming endpoints still time out after 60s

🤖 Generated with Claude Code


Note

Low Risk
Narrow routing change for two SSE path suffixes; other endpoints keep the 60s timeout.

Overview
Replaces the blanket 60s middleware.Timeout on the authenticated API group with a wrapper that bypasses the timeout for paths ending in /logs or /events, while other routes still use the 60s limit.

middleware.Timeout cancels the request context at 60s, which was cutting off long-lived SSE streams (instance logs, build events) even when data was still flowing—e.g. builds over 60s could look failed when the stream died before the terminal status. The skip mirrors the existing HTTP metrics middleware pattern for /logs.

Reviewed by Cursor Bugbot for commit d15020d. Bugbot is set up for automated code reviews on this repo. Configure here.

middleware.Timeout cancels the request context after 60s, which kills
/builds/{id}/events and /instances/{id}/logs streams mid-flight even
while events are flowing. Build-event consumers then observe a closed
stream with the build still in progress and treat it as a failure.

Skip the timeout for streaming paths, mirroring the existing SSE skip
in the HTTP metrics middleware.
@hiroTamada hiroTamada marked this pull request as ready for review June 10, 2026 17:15
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Fixes a bug where long-running builds (>60s) appeared to fail in the eyes of stream consumers — the SSE connection was cut mid-stream, leaving callers with a build stuck in building status instead of receiving the terminal success/failure event.

Intended effect:

  • Build event stream lifetime: baseline cut at 60s; confirmed if streams following >60s builds now deliver the terminal status event without disconnecting.
  • Kernel API 5xx rate: baseline 0–30 errors/hour; confirmed if it remains stable (no sustained increase) post-deploy.

Risks:

  • Non-SSE timeout loss — if a future non-streaming route happens to end in /logs or /events, it silently loses its 60s timeout; alert if any handler shows unexpectedly long response times on new routes.
  • Hidden stream errors now surfacing — errors that previously appeared as client-side 60s timeouts may now be emitted as proper 5xx responses; alert if Railway HTTP 5xx on the kernel API exceeds 150/hour sustained for 2+ hours (query: opentelemetry/logs/railway-http-logs, attributes.railway.service_name = 'api').
  • Unclosed goroutines on the hypeman fleet — without a timeout, misbehaving consumers could hold goroutines open indefinitely; alert on any OOM or panic in Railway stdout for prod-jfk-hypeman-{0,1,2} within 2 hours of deploy.

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant