feat(sdk): retry transient network errors and rate limits#1378
feat(sdk): retry transient network errors and rate limits#1378jakubno wants to merge 11 commits into
Conversation
PR SummaryMedium Risk Overview Retries are idempotency-aware: safe methods retry on any transient failure; non-idempotent calls (e.g. Configuration is unified via a new Reviewed by Cursor Bugbot for commit 9bb851f. Bugbot is set up for automated code reviews on this repo. Configure here. |
🦋 Changeset detectedLatest commit: 9bb851f The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Automatically retry requests on transient failures across the JS and Python SDKs. Retries connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honor a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently. Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects. Configure via the new `retries` option or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.
First retry now waits up to ~100ms (was ~500ms) before backing off exponentially, keeping the cap at 8s.
42725e5 to
330a96d
Compare
Package ArtifactsBuilt from 16a13a2. Download artifacts from this workflow run. JS SDK ( npm install ./e2b-2.27.2-feat-retry-transient-errors.0.tgzCLI ( npm install ./e2b-cli-2.10.4-feat-retry-transient-errors.0.tgzPython SDK ( pip install ./e2b-2.25.1+feat.retry.transient.errors-py3-none-any.whl |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 330a96d56d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Treat 503 as ambiguous (not rejected) so non-idempotent POSTs are not replayed when the server may have processed the request (JS + Python) - Correct misleading retry docs that referenced an idempotency-key mechanism that is not implemented - Plumb config.retries through the Python envd Connect RPC client instead of a hardcoded count of 3 - Pass retries=config.retries to the Python envd HTTP transport and include it in the transport cache key
- Fix a deadlock in the JS retry body buffering: cancelling one branch of a teed request body (request.clone()) never resolves while the other branch is unread, hanging any >1MiB non-stream upload (volume PUT, filesystem POST). Send the pristine original once instead of cancelling. - Collapse the duplicated response/error retry guards into a single shouldRetry predicate (JS) / _should_retry (Python) so the POST safety rule has one source of truth. - Export and lock the classification tables with tests and cross-SDK sync notes to catch JS<->Python drift. - Add edge tests: large non-replayable body sent once, abort-race.
…lay streamed DELETE - Bound the whole retried operation by the request's timeout (a monotonic deadline + per-attempt clamp), instead of letting each attempt use the full timeout so N retries could run ~N*timeout. Mirrors the JS single-signal bound. - _is_replayable now requires buffered content for all methods; a DELETE or OPTIONS carrying a one-shot streaming body is no longer treated as replayable. - Add sync+async tests for both.
There was a problem hiding this comment.
Stale comment
Security review (run 2/2) complete on head
f96449a49b11d4cf933a95bae4fba43bc18205b1.No new security vulnerabilities were identified in the current diff.
I specifically re-checked the retry safety surface (replayability checks and timeout/deadline bounding in Python, and abort-aware backoff behavior in JS) and did not find a remaining security regression to report.
Sent by Cursor Security Agent: Security Reviewer
Match the JS SDK, which retries envd RPC through withRetry. Python unary RPC now retries on rejected failures — HTTP 429 (honoring Retry-After) and connection errors (ConnectError/ConnectTimeout) — in addition to the existing RemoteProtocolError handling. Ambiguous statuses (502/503/504) are not retried since RPC is a non-idempotent POST. Streaming RPC is left unwrapped, matching the JS isStreamLike pass-through.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit df035e9. Configure here.
| const kind = RETRYABLE_ERROR_CODES.get(code) | ||
| if (kind) return kind | ||
| // undici low-level socket/transport errors are ambiguous mid-flight drops. | ||
| if (code.startsWith('und_err_') || code === 'fetch failed') { |
There was a problem hiding this comment.
Should probably check .message here since fetch failed isn't set in .code.



Automatically retry requests on transient failures across the JS and Python SDKs. Retries connection errors and 429/502/503/504 responses using exponential backoff with jitter, and honor a server-provided Retry-After header so rate limiting (e.g. listing sandboxes) is handled transparently.
Retries are idempotency-aware: idempotent methods retry on any transient failure, while non-idempotent ones (e.g. Sandbox.create) only retry on "rejected" failures where the server provably did not process the request (throttling, connection-refused, DNS), avoiding duplicate side effects.
Configure via the new
retriesoption or E2B_MAX_RETRIES env var (default 3, set 0 to disable). The Python envd RPC retry now also uses backoff between attempts.