Skip to content

OtlpProxy: buffer request body to survive stale-connection retry#3510

Open
reakaleek wants to merge 4 commits into
mainfrom
wood-sapphire
Open

OtlpProxy: buffer request body to survive stale-connection retry#3510
reakaleek wants to merge 4 commits into
mainfrom
wood-sapphire

Conversation

@reakaleek

@reakaleek reakaleek commented Jun 15, 2026

Copy link
Copy Markdown
Member

Why

AdotOtlpService was wrapping context.Request.Body directly in StreamContent. SocketsHttpHandler has a built-in transparent connection retry (SendWithVersionDetectionAndRetryAsync) that fires when it picks a stale pooled keep-alive connection. For a non-seekable stream, StreamContent.AllowRetry returns false, so SocketsHttpHandler skips the transparent retry and throws HttpRequestException wrapping an IOException. The existing catch handles this, but it was returning 502 — which the browser OTLP exporter interprets as "please retry", defeating fire-and-forget semantics. The noisy error logs were also misleading.

What

Three small changes:

  • PooledConnectionLifetime = 30s on the SocketsHttpHandler — proactively recycles connections before the sidecar can close them, making stale connections extremely rare.
  • Map HttpRequestException { InnerException: IOException } → 204 — a silent drop. The browser exporter sees 204 (success) and moves on rather than retrying.
  • otlp.proxy.stale_connection.dropped counter — emitted on every silent drop so the rate is observable.

reakaleek and others added 2 commits June 15, 2026 13:32
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9e245fa-203d-402c-b9ea-65d1d5027c72

📥 Commits

Reviewing files that changed from the base of the PR and between 121a781 and 41f6ee3.

📒 Files selected for processing (3)
  • src/api/Elastic.Documentation.Api/ServicesExtension.cs
  • src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs
  • tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs

📝 Walkthrough

Walkthrough

AdotOtlpService now tracks OTLP batches dropped due to stale pooled connections by introducing a Meter and internal counter. When an HttpRequestException with an inner IOException is caught in ForwardOtlp, it is mapped to a 204 NoContent status code, the counter is incremented, and the failure is logged at debug level. ServicesExtension configures the OTLP proxy's HttpClient with a SocketsHttpHandler that enforces a 30-second PooledConnectionLifetime. Three new tests validate byte-for-byte body forwarding, error mapping for non-IO exceptions to 502 BadGateway, and graceful handling of stale connections with 204 NoContent.

Sequence Diagram

sequenceDiagram
  participant Client
  participant OtlpProxy as OTLP Proxy
  participant SocketsHttpHandler
  participant PooledConnection as Pooled Conn
  participant Counter
  participant Collector

  Client->>OtlpProxy: POST /traces (StreamContent)
  OtlpProxy->>SocketsHttpHandler: SendAsync
  SocketsHttpHandler->>PooledConnection: use cached connection
  PooledConnection-->>SocketsHttpHandler: stale/closed
  SocketsHttpHandler-->>OtlpProxy: HttpRequestException(IOException)
  OtlpProxy->>OtlpProxy: MapExceptionToStatusCode → 204
  OtlpProxy->>Counter: increment otlp.proxy.stale_connection.dropped
  OtlpProxy-->>Client: 204 NoContent
Loading
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: handling stale-connection retries by mapping specific exceptions to 204 instead of 502, with proactive connection recycling via PooledConnectionLifetime.
Description check ✅ Passed The description clearly explains the root cause (non-seekable stream with transparent retry), the solution (connection recycling, exception mapping, and observability), and why 204 is appropriate instead of 502.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch wood-sapphire

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs`:
- Around line 319-350: The test currently accepts either NoContent or BadGateway
status codes and only requires callCount to be greater than zero, which allows
the test to pass without verifying actual retry behavior occurred. To fix this,
strengthen the assertions in the OtlpProxyTests test method to definitively
prove the retry/replay behavior: either assert that callCount equals 2
(demonstrating the second send attempt was made after the first failed), or
assert the response status code is specifically NoContent (indicating the retry
succeeded), or both. This ensures the test actually validates the retry
mechanism rather than just checking that the error was handled somehow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b232e094-5008-4bfe-9443-a4bb7b03f73d

📥 Commits

Reviewing files that changed from the base of the PR and between ed624de and 121a781.

📒 Files selected for processing (2)
  • src/api/Elastic.Documentation.Api/Telemetry/AdotOtlpService.cs
  • tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs

Comment thread tests/Elastic.Documentation.Api.Tests/OtlpProxyTests.cs
The mock handler sits above SocketsHttpHandler so its transparent retry
can't be exercised here. Narrow the test to what it actually proves:
one SendAsync attempt is made and a failing throw maps cleanly to 502.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@reakaleek reakaleek temporarily deployed to integration-tests June 15, 2026 11:48 — with GitHub Actions Inactive
@reakaleek reakaleek marked this pull request as draft June 15, 2026 11:50
@reakaleek reakaleek marked this pull request as ready for review June 15, 2026 11:55

@Mpdreamz Mpdreamz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative: zero-copy, best-effort

This buffer fixes the InvalidOperationException correctly, but it allocates a MemoryStream (and copies the entire body into it) on every OTLP forward. For a high-frequency telemetry path that's real memory pressure with no real gain — OTLP is fire-and-forget, so dropping one batch on a rare stale connection is totally acceptable.

The alternative is to stay zero-copy and let stale connections fail gracefully:

1. Keep StreamContent(requestBody) — no buffer needed.
StreamContent exposes an internal AllowRetry property that returns stream.CanSeek. For ASP.NET's non-seekable request body, AllowRetry = false, so SocketsHttpHandler will not attempt the transparent retry — it throws HttpRequestException instead. Your existing catch already handles that.

2. Cycle connections before they go stale (in ServicesExtension.AddOtlpProxyService):

.ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
{
    PooledConnectionLifetime = TimeSpan.FromSeconds(30), // localhost sidecar — reconnect is ~free
    ConnectTimeout = TimeSpan.FromSeconds(1),
})

With a 30s lifetime, the window for a connection going stale is tiny. This makes the drop rate negligible in practice.

3. Don't surface stale-connection failures as 502 to the caller. The browser OTLP exporter will interpret 502 as a signal to retry, which defeats the fire-and-forget intent. Map the stale-connection case to 204 instead (silent drop).

4. Emit a metric so we can watch the drop rate:

// in TelemetryConstants or alongside ActivitySource:
private static readonly Meter Meter = new(TelemetryConstants.OtlpProxySourceName);
internal static readonly Counter<int> StaleConnectionDrops =
    Meter.CreateCounter<int>("otlp.proxy.stale_connection.dropped",
        description: "Traces silently dropped due to stale pooled connection to ADOT collector");

Then in MapExceptionToStatusCode, add a case before the generic HttpRequestException arm:

// Stale pooled connection — SocketsHttpHandler detected it before sending body.
// OTLP is best-effort; drop silently and let the next request reconnect.
HttpRequestException { InnerException: IOException }
    => (204, string.Empty), // caller should not retry

And increment StaleConnectionDrops in the catch block when the status is 204.

This is effectively what YARP does in its no-buffer mode — accept the occasional drop on a stale connection in exchange for zero copy and zero allocation on the hot path. Since the collector is a localhost sidecar, reconnect is microseconds and drops should be extremely rare (<< 0.1% of requests) once PooledConnectionLifetime keeps connections fresh.

// The raw request body is forward-only; a stale pooled connection to the localhost
// collector triggers a silent resend that otherwise throws "stream was already consumed".
using var buffer = new MemoryStream();
await requestBody.CopyToAsync(buffer, ctx);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This MemoryStream allocation (+ full body copy) happens on every OTLP forward. Since the request body is non-seekable, StreamContent.AllowRetry already returns false, which tells SocketsHttpHandler to not attempt the transparent retry — it will throw HttpRequestException instead.

So the buffer isn't needed to prevent a crash; the existing catch already handles HttpRequestException. The tradeoff: we copy the entire body into heap memory on every request, vs. occasionally dropping a trace on a rare stale connection. For fire-and-forget telemetry, the drop is preferable — see the PR-level comment for the full alternative.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes a lot of sense. thank you. Addressing it now

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StreamContent.AllowRetry=false for non-seekable streams, so SocketsHttpHandler
throws HttpRequestException (not InvalidOperationException) on a stale connection
without retrying. Remove the MemoryStream buffer — it was unnecessary allocation.

Instead: add PooledConnectionLifetime=30s to proactively recycle connections
before the sidecar closes them; map HttpRequestException{InnerException:IOException}
to 204 (silent drop) so the browser OTLP exporter doesn't treat it as a retryable
502; and emit an otlp.proxy.stale_connection.dropped counter for observability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@reakaleek reakaleek requested a review from Mpdreamz June 15, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants