feat(sdk): retry transient network errors and rate limits #1378

Open
jakubno wants to merge 11 commits into main from feat/retry-transient-errors

Conversation

@jakubno
Member

@jakubno jakubno commented Jun 2, 2026

Automatically retry requests on transient failures across the JS and Python SDKs. Retry connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honor a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently.

Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects.
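
The idempotency rule above can be sketched as a small predicate. This is a minimal illustration, not the SDK's code: `FailureKind`, `IDEMPOTENT_METHODS`, and `should_retry` are hypothetical names, and the method set is the conventional HTTP one (RFC 9110), which the PR does not spell out.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    # The server provably did not process the request
    # (throttling, connection refused, DNS failure).
    REJECTED = "rejected"
    # The request may have been processed (e.g. a mid-flight drop).
    AMBIGUOUS = "ambiguous"


# Assumption: the conventional idempotent HTTP methods from RFC 9110.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE", "TRACE"})


def should_retry(method: str, kind: Optional[FailureKind]) -> bool:
    """Return True if a failed request may be replayed safely."""
    if kind is None:
        return False  # not classified as a transient failure
    if method.upper() in IDEMPOTENT_METHODS:
        return True  # idempotent: any transient failure is retryable
    # Non-idempotent (e.g. POST): replay only when nothing was processed.
    return kind is FailureKind.REJECTED
```

With this shape, `should_retry("POST", FailureKind.AMBIGUOUS)` is false, which is the case that protects calls like Sandbox.create from duplicate side effects.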

Configure via the new retries option or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.
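
One way the retry count could be resolved is sketched below. `resolve_retries` is a hypothetical helper; the precedence of the explicit option over the env var and the fallback on an unparsable value are assumptions, since the description only states the default (3) and that 0 disables retries.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_RETRIES = 3  # default stated in the PR description


def resolve_retries(
    option: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve the effective retry count; 0 disables retries."""
    if env is None:
        env = os.environ
    if option is not None:
        # Assumption: an explicit option wins over the env var.
        return max(0, option)
    raw = env.get("E2B_MAX_RETRIES")
    if raw is None:
        return DEFAULT_MAX_RETRIES
    try:
        return max(0, int(raw))
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_MAX_RETRIES
```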

@cla-bot cla-bot Bot added the cla-signed label Jun 2, 2026
@cursor

cursor Bot commented Jun 2, 2026

PR Summary

Medium Risk
Cross-cutting change to all API/envd/volume request paths; incorrect retry classification on non-idempotent calls could duplicate side effects, though the PR explicitly limits those to “rejected” failures only.

Overview
Both JS and Python SDKs now automatically retry control-plane, volume, and sandbox (envd) HTTP traffic on transient failures—connection errors and 429 / 408 / 502 / 503 / 504 (not 500)—using exponential backoff with jitter and honoring Retry-After.

Retries are idempotency-aware: safe methods retry on any transient failure; non-idempotent calls (e.g. Sandbox.create, envd POST RPCs) only retry when the failure is provably unprocessed (“rejected”, e.g. throttling, refused connection, DNS). Ambiguous mid-flight failures are not replayed for those calls. Large or streaming bodies are sent once without retry (JS buffers up to 1 MiB for replay).
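
The "sent once without retry" rule for large or streaming bodies might look roughly like this. `is_replayable` and `MAX_REPLAY_BYTES` are hypothetical names; the summary gives the 1 MiB limit only for the JS SDK, so applying it in a Python sketch, and treating a body of exactly 1 MiB as replayable, are assumptions.

```python
MAX_REPLAY_BYTES = 1024 * 1024  # 1 MiB, the JS buffer limit from the summary


def is_replayable(body: object) -> bool:
    """Return True if the request body can be sent again on a retry."""
    if body is None:
        return True  # nothing to replay
    if isinstance(body, str):
        body = body.encode("utf-8")
    if isinstance(body, (bytes, bytearray)):
        # Fully buffered content can be resent, up to the size limit.
        return len(body) <= MAX_REPLAY_BYTES
    # Generators, iterators, and file-like objects are one-shot streams.
    return False
```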

Configuration is unified via a new retries option and E2B_MAX_RETRIES (default 3, 0 disables). Python httpx transports and envd Connect unary RPCs use the shared policy; envd RPC backoff is added. wait() in JS respects AbortSignal during backoff.

Reviewed by Cursor Bugbot for commit 9bb851f.

@changeset-bot

changeset-bot Bot commented Jun 2, 2026

🦋 Changeset detected

Latest commit: 9bb851f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
| Name | Type |
| --- | --- |
| e2b | Minor |
| @e2b/python-sdk | Minor |


Comment thread packages/python-sdk/e2b/connection_config.py
jakubno added 3 commits June 2, 2026 08:13
First retry now waits up to ~100ms (was ~500ms) before backing off
exponentially, keeping the cap at 8s.
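
The delay schedule described here can be sketched as follows. Only the ~100 ms first ceiling and the 8 s cap come from the commit message; the doubling factor and the "full jitter" strategy (a uniform draw below the ceiling) are assumptions, and `backoff_delay` is a hypothetical name.

```python
import random
from typing import Callable

BASE_DELAY = 0.1  # seconds; the first retry waits up to ~100 ms
MAX_DELAY = 8.0   # seconds; the cap stated in the commit message


def backoff_delay(attempt: int, rng: Callable[[], float] = random.random) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    # Assumption: the ceiling doubles on each retry until it hits the cap.
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
    # Assumption: full jitter, i.e. a uniform draw below the ceiling.
    return rng() * ceiling
```

Under these assumptions the ceilings run 0.1 s, 0.2 s, 0.4 s and so on, and stay at 8 s from the eighth retry onward.
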
@jakubno jakubno force-pushed the feat/retry-transient-errors branch from 42725e5 to 330a96d on June 2, 2026 08:35
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

Package Artifacts

Built from 16a13a2. Download artifacts from this workflow run.

JS SDK (e2b@2.27.2-feat-retry-transient-errors.0):

npm install ./e2b-2.27.2-feat-retry-transient-errors.0.tgz

CLI (@e2b/cli@2.10.4-feat-retry-transient-errors.0):

npm install ./e2b-cli-2.10.4-feat-retry-transient-errors.0.tgz

Python SDK (e2b==2.25.1+feat-retry-transient-errors):

pip install ./e2b-2.25.1+feat.retry.transient.errors-py3-none-any.whl

@jakubno jakubno marked this pull request as ready for review June 2, 2026 13:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 330a96d56d


Comment thread packages/js-sdk/src/retry.ts Outdated
Comment thread packages/python-sdk/e2b/_retry.py Outdated
Comment thread packages/js-sdk/src/connectionConfig.ts
Comment thread packages/python-sdk/e2b_connect/client.py
Comment thread packages/python-sdk/e2b/api/client_async/__init__.py
jakubno added 3 commits June 2, 2026 14:40
- Treat 503 as ambiguous (not rejected) so non-idempotent POSTs are not
  replayed when the server may have processed the request (JS + Python)
- Correct misleading retry docs that referenced an idempotency-key
  mechanism that is not implemented
- Plumb config.retries through the Python envd Connect RPC client instead
  of a hardcoded count of 3
- Pass retries=config.retries to the Python envd HTTP transport and include
  it in the transport cache key
- Fix a deadlock in the JS retry body buffering: cancelling one branch of a
  teed request body (request.clone()) never resolves while the other branch
  is unread, hanging any >1MiB non-stream upload (volume PUT, filesystem
  POST). Send the pristine original once instead of cancelling.
- Collapse the duplicated response/error retry guards into a single
  shouldRetry predicate (JS) / _should_retry (Python) so the POST safety rule
  has one source of truth.
- Export and lock the classification tables with tests and cross-SDK sync
  notes to catch JS<->Python drift.
- Add edge tests: large non-replayable body sent once, abort-race.
Comment thread packages/python-sdk/e2b/_retry.py
Comment thread packages/python-sdk/e2b/_retry.py
…lay streamed DELETE

- Bound the whole retried operation by the request's timeout (a monotonic
  deadline + per-attempt clamp), instead of letting each attempt use the full
  timeout so N retries could run ~N*timeout. Mirrors the JS single-signal bound.
- _is_replayable now requires buffered content for all methods; a DELETE or
  OPTIONS carrying a one-shot streaming body is no longer treated as replayable.
- Add sync+async tests for both.
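
The deadline bound in the first bullet can be sketched like this. It is a simplified illustration that omits the backoff sleep between attempts; `call_with_deadline` and its parameters are hypothetical names, not the SDK's.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_deadline(
    send: Callable[[float], T],
    timeout: float,
    max_retries: int,
    is_transient: Callable[[Exception], bool],
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `send(per_attempt_timeout)` so all attempts fit within `timeout`."""
    deadline = clock() + timeout  # one monotonic deadline for the whole operation
    attempt = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        try:
            # Clamp this attempt to what is left of the overall budget.
            return send(min(timeout, remaining))
        except Exception as exc:
            attempt += 1
            if attempt > max_retries or not is_transient(exc):
                raise
```

With a 10 s timeout and attempts that each fail after 4 s, the per-attempt timeouts shrink to 10 s, 6 s, and 2 s instead of staying at 10 s each.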
Comment thread packages/python-sdk/e2b/_retry.py

@cursor cursor Bot left a comment


Stale comment

Security review (run 2/2) complete on head f96449a49b11d4cf933a95bae4fba43bc18205b1.

No new security vulnerabilities were identified in the current diff.

I specifically re-checked the retry safety surface (replayability checks and timeout/deadline bounding in Python, and abort-aware backoff behavior in JS) and did not find a remaining security regression to report.


Sent by Cursor Security Agent: Security Reviewer

Comment thread packages/python-sdk/e2b_connect/client.py Outdated
Comment thread packages/js-sdk/src/retry.ts
Match the JS SDK, which retries envd RPC through withRetry. Python unary RPC
now retries on rejected failures — HTTP 429 (honoring Retry-After) and
connection errors (ConnectError/ConnectTimeout) — in addition to the existing
RemoteProtocolError handling. Ambiguous statuses (502/503/504) are not retried
since RPC is a non-idempotent POST. Streaming RPC is left unwrapped, matching
the JS isStreamLike pass-through.
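
Honoring Retry-After on a 429 requires parsing the header, which HTTP allows in two forms: a number of seconds or an HTTP-date. A minimal stdlib-only sketch follows; `parse_retry_after` is a hypothetical helper and not necessarily how the SDK parses the header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds requested by Retry-After, or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "5"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparsable header: caller falls back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now".
    return max(0.0, (when - now).total_seconds())
```
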
Comment thread packages/python-sdk/e2b_connect/client.py

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit df035e9.

Comment thread packages/python-sdk/e2b_connect/client.py
const kind = RETRYABLE_ERROR_CODES.get(code)
if (kind) return kind
// undici low-level socket/transport errors are ambiguous mid-flight drops.
if (code.startsWith('und_err_') || code === 'fetch failed') {

Should probably check `.message` here, since `fetch failed` is the error's message and isn't set in `.code`.

https://github.com/nodejs/undici/blob/c995513094903c67151907213296e91179279b50/lib/web/fetch/index.js#L261
