OtlpProxy: buffer request body to survive stale-connection retry#3510
OtlpProxy: buffer request body to survive stale-connection retry#3510reakaleek wants to merge 4 commits into
Conversation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthrough
Sequence DiagramsequenceDiagram
participant Client
participant OtlpProxy as OTLP Proxy
participant SocketsHttpHandler
participant PooledConnection as Pooled Conn
participant Counter
participant Collector
Client->>OtlpProxy: POST /traces (StreamContent)
OtlpProxy->>SocketsHttpHandler: SendAsync
SocketsHttpHandler->>PooledConnection: use cached connection
PooledConnection-->>SocketsHttpHandler: stale/closed
SocketsHttpHandler-->>OtlpProxy: HttpRequestException(IOException)
OtlpProxy->>OtlpProxy: MapExceptionToStatusCode → 204
OtlpProxy->>Counter: increment otlp.proxy.stale_connection.dropped
OtlpProxy-->>Client: 204 NoContent
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs`:
- Around line 319-350: The test currently accepts either NoContent or BadGateway
status codes and only requires callCount to be greater than zero, which allows
the test to pass without verifying actual retry behavior occurred. To fix this,
strengthen the assertions in the OtlpProxyTests test method to definitively
prove the retry/replay behavior: either assert that callCount equals 2
(demonstrating the second send attempt was made after the first failed), or
assert the response status code is specifically NoContent (indicating the retry
succeeded), or both. This ensures the test actually validates the retry
mechanism rather than just checking that the error was handled somehow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b232e094-5008-4bfe-9443-a4bb7b03f73d
📒 Files selected for processing (2)
src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cstests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs
The mock handler sits above SocketsHttpHandler so its transparent retry can't be exercised here. Narrow the test to what it actually proves: one SendAsync attempt is made and a failing throw maps cleanly to 502. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mpdreamz
left a comment
There was a problem hiding this comment.
Alternative: zero-copy, best-effort
This buffer fixes the InvalidOperationException correctly, but it allocates a MemoryStream (and copies the entire body into it) on every OTLP forward. For a high-frequency telemetry path that's real memory pressure with no real gain — OTLP is fire-and-forget, so dropping one batch on a rare stale connection is totally acceptable.
The alternative is to stay zero-copy and let stale connections fail gracefully:
1. Keep StreamContent(requestBody) — no buffer needed.
StreamContent exposes an internal AllowRetry property that returns stream.CanSeek. For ASP.NET's non-seekable request body, AllowRetry = false, so SocketsHttpHandler will not attempt the transparent retry — it throws HttpRequestException instead. Your existing catch already handles that.
2. Cycle connections before they go stale (in ServicesExtension.AddOtlpProxyService):
.ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
{
PooledConnectionLifetime = TimeSpan.FromSeconds(30), // localhost sidecar — reconnect is ~free
ConnectTimeout = TimeSpan.FromSeconds(1),
})With a 30s lifetime, the window for a connection going stale is tiny. This makes the drop rate negligible in practice.
3. Don't surface stale-connection failures as 502 to the caller. The browser OTLP exporter will interpret 502 as a signal to retry, which defeats the fire-and-forget intent. Map the stale-connection case to 204 instead (silent drop).
4. Emit a metric so we can watch the drop rate:
// in TelemetryConstants or alongside ActivitySource:
private static readonly Meter Meter = new(TelemetryConstants.OtlpProxySourceName);
internal static readonly Counter<int> StaleConnectionDrops =
Meter.CreateCounter<int>("otlp.proxy.stale_connection.dropped",
description: "Traces silently dropped due to stale pooled connection to ADOT collector");Then in MapExceptionToStatusCode, add a case before the generic HttpRequestException arm:
// Stale pooled connection — SocketsHttpHandler detected it before sending body.
// OTLP is best-effort; drop silently and let the next request reconnect.
HttpRequestException { InnerException: IOException }
=> (204, string.Empty), // caller should not retryAnd increment StaleConnectionDrops in the catch block when the status is 204.
This is effectively what YARP does in its no-buffer mode — accept the occasional drop on a stale connection in exchange for zero copy and zero allocation on the hot path. Since the collector is a localhost sidecar, reconnect is microseconds and drops should be extremely rare (<< 0.1% of requests) once PooledConnectionLifetime keeps connections fresh.
| // The raw request body is forward-only; a stale pooled connection to the localhost | ||
| // collector triggers a silent resend that otherwise throws "stream was already consumed". | ||
| using var buffer = new MemoryStream(); | ||
| await requestBody.CopyToAsync(buffer, ctx); |
There was a problem hiding this comment.
This MemoryStream allocation (+ full body copy) happens on every OTLP forward. Since the request body is non-seekable, StreamContent.AllowRetry already returns false, which tells SocketsHttpHandler to not attempt the transparent retry — it will throw HttpRequestException instead.
So the buffer isn't needed to prevent a crash; the existing catch already handles HttpRequestException. The tradeoff: we copy the entire body into heap memory on every request, vs. occasionally dropping a trace on a rare stale connection. For fire-and-forget telemetry, the drop is preferable — see the PR-level comment for the full alternative.
There was a problem hiding this comment.
makes a lot of sense. thank you. Addressing it now
StreamContent.AllowRetry=false for non-seekable streams, so SocketsHttpHandler
throws HttpRequestException (not InvalidOperationException) on a stale connection
without retrying. Remove the MemoryStream buffer — it was unnecessary allocation.
Instead: add PooledConnectionLifetime=30s to proactively recycle connections
before the sidecar closes them; map HttpRequestException{InnerException:IOException}
to 204 (silent drop) so the browser OTLP exporter doesn't treat it as a retryable
502; and emit an otlp.proxy.stale_connection.dropped counter for observability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why
AdotOtlpServicewas wrappingcontext.Request.Bodydirectly inStreamContent.SocketsHttpHandlerhas a built-in transparent connection retry (SendWithVersionDetectionAndRetryAsync) that fires when it picks a stale pooled keep-alive connection. For a non-seekable stream,StreamContent.AllowRetryreturnsfalse, soSocketsHttpHandlerskips the transparent retry and throwsHttpRequestExceptionwrapping anIOException. The existingcatchhandles this, but it was returning 502 — which the browser OTLP exporter interprets as "please retry", defeating fire-and-forget semantics. The noisy error logs were also misleading.What
Three small changes:
PooledConnectionLifetime = 30son theSocketsHttpHandler— proactively recycles connections before the sidecar can close them, making stale connections extremely rare.HttpRequestException { InnerException: IOException }→ 204 — a silent drop. The browser exporter sees 204 (success) and moves on rather than retrying.otlp.proxy.stale_connection.droppedcounter — emitted on every silent drop so the rate is observable.