Skip to content

feat(ssh): classify SSH connection-failure stderr into actionable banner labels#12917

Open
david-engelmann wants to merge 2 commits into
warpdotdev:masterfrom
david-engelmann:david/ssh-failure-classification
Open

feat(ssh): classify SSH connection-failure stderr into actionable banner labels#12917
david-engelmann wants to merge 2 commits into
warpdotdev:masterfrom
david-engelmann:david/ssh-failure-classification

Conversation

@david-engelmann

@david-engelmann david-engelmann commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Closes #3317
Refs #1922, #2931

Summary

When the SSH extension fails to start, the terminal shows a banner that's currently always titled "Failed to start SSH extension" with the raw stderr in the detail field. For the user this is "something broke — read the wall of text and figure it out." This PR splits that single label into structured failure modes so the banner can lead with the actual cause and a suggested fix.

After this PR a permission-denied looks like:

Public-key auth rejected
Make sure your key is loaded (ssh-add -l), the host has it in authorized_keys, and your SSH agent is reachable.
Error: <original error>
SSH output: <raw stderr, truncated>

A host-key-changed looks like:

Host key CHANGED — investigate
The server's host key doesn't match what's in known_hosts. Either the host was reinstalled or your connection is being intercepted. Don't bypass this until you've confirmed which.

…and 16 more (one per ConnectionFailureMode variant). When stderr doesn't match any known mode, the banner falls back to the existing "Failed to start SSH extension" copy — so unrecognized failures keep the current behavior verbatim.

Issues addressed (partial fix angle)

The classification doesn't fix any of these underlying bugs, but it dramatically narrows down what each one is — which is what reporters keep asking for in the threads. All three issues carry `ready-to-implement`.

Structure

Two commits, separable:

1. `feat(ssh-diagnostics): new crate for SSH failure-mode taxonomy + agent/identity probing` (8a6f0e4f)

Adds a new `crates/warp_ssh_diagnostics` crate:

  • `failure_modes` — `ConnectionFailureMode` enum (18 variants), `classify_ssh_stderr(stderr) -> Option`, plus `label()` / `suggested_fix()` / `is_transient()` accessors per mode
  • `agent` — `ssh_agent_status()` with an `SshAddRunner` trait so unit tests don't shell out
  • `identities` — enumerate `~/.ssh` for known identity files

Pure library code. 43 unit tests. No consumer dependency outside this PR's banner integration. Wasm-gated where it shells out.

2. `feat(ssh): classify SSH connection-failure errors into actionable banner labels` (d35c51c5)

Wires `classify_ssh_stderr` into the path in `app/src/terminal/view.rs` that builds the `RemoteServerManagerEvent::SessionConnectionFailed` banner. The classification helper lives in `app/src/terminal/ssh/failure_classification.rs` as a pure function so it's unit-testable without a transport setup. 10 tests on the consumer side.

Adjacency notes

Test plan

  • `cargo test -p warp_ssh_diagnostics --lib` (43 pass)
  • `cargo test -p warp --lib terminal::ssh::failure_classification` (10 pass)
  • `cargo check -p warp --bin warp-oss` (clean)
  • `cargo fmt -p warp_ssh_diagnostics -p warp`
  • Manual repro on each `ConnectionFailureMode` variant — happy to record a Loom for each major class (auth-rejected, host-key-changed, port-closed) if useful; let me know.

Things explicitly NOT in this PR

  • No new feature flag — classification is behavior-preserving for unrecognized stderr, additive for recognized stderr. Happy to gate behind a flag if reviewers prefer.
  • No change to the other two `show_ssh_remote_server_failed_banner` call sites (`BinaryInstallComplete` / `BinaryCheckComplete`); those already go through `error.user_facing_error(SetupStage::...)` which is a separate well-typed surface.
  • No SSH-hosts UI integration of the diagnostics crate — that's a follow-up PR (R3.5).
  • No reconnect / session-journal work — that's a separate scope, would link to Can not restore normal session when accident disconnect remote server #3230.

…t/identity probing (R3.3a foundation)

Self-contained data layer that turns opaque ssh client output into
structured, user-actionable diagnostics. Built so it can plug into
either the diagnostics chip UI (display side) or the auto-reconnect
policy (decision side) without re-implementing the classification
in either consumer.

Three pieces:

1. `ConnectionFailureMode` + `classify_ssh_stderr(stderr) -> Option<Mode>`
   — pattern matcher for ~18 common ssh failure modes. Each variant
   carries a `label()`, `suggested_fix()`, and `is_transient()` flag.
   Pure logic, no IO. Returns `None` for unknown patterns so the UI
   can fall back to raw text instead of misclassifying.

   Modes covered: HostnameUnresolved, NetworkUnreachable, PortClosed,
   ConnectTimeout, SessionTimeout, BrokenPipe, AuthPublicKeyRejected,
   AuthPasswordRejected, AuthRejected, HostKeyMismatch, HostKeyChanged
   (separate so the UI can surface the stronger MitM warning),
   KexClosedByRemote, ProxyJumpFailed, IdentityFileBadPermissions,
   IdentityFileMissing, SshConfigSyntaxError, TooManyAuthFailures,
   ConnectionResetByPeer.

   The classifier deliberately picks specific-over-general: a stderr
   that contains both `UNPROTECTED PRIVATE KEY FILE` *and* a
   downstream `Permission denied` classifies as
   `IdentityFileBadPermissions` (the actionable cause) rather than
   `AuthPublicKeyRejected` (the symptom). Same for `HostKeyChanged`
   beating `HostKeyMismatch`.

2. `SshIdentity` + `list_ssh_identities()` — enumerate visible
   public keys in `~/.ssh` with a private-key-presence flag for
   each. Deliberately does NOT read the private key (would require
   passphrase handling); only stat-checks the matching path. UI uses
   the flag to highlight `id_rsa.pub` orphans where the private
   key isn't on this machine.

3. `SshAgentStatus` + `ssh_agent_status()` — probe ssh-agent health
   via `ssh-add -l` and classify as Available / NotConfigured /
   Stale. Driven through the `SshAddRunner` trait so tests
   substitute canned outputs; production uses `ProcessSshAddRunner`
   with a 750ms timeout and a non-blocking poll loop so a wedged
   socket doesn't stall the diagnostics chip refresh.

Cross-platform:
- All three pieces compile to wasm. The subprocess-spawning bits
  (`ProcessSshAddRunner`, `ssh_agent_status`, `run_with_timeout`)
  are non-wasm only — wasm callers stay with `ssh_agent_status_with`
  + a stubbed runner if they need the type-level surface.
- Uses `command::blocking::Command` (Warp's wrapper that avoids the
  Windows terminal-flash on subprocess spawn) instead of
  `std::process::Command`.
- Uses `instant::Instant` (wasm-compatible) instead of
  `std::time::Instant`.

Tests: 43 unit tests covering every classifier variant, the
specific-over-general precedence rules, identity enumeration
edge cases (missing dir, empty dir, orphan pub key, non-pub files
ignored, sorted output), and the agent state machine (NotConfigured
on missing/empty socket, Available on exit 0 and exit 1
"no identities" path, Stale on other exits and spawn failures).

(Fork-only commit; no PR until R3 scope has a maintainer green-light.)
…ner labels

Wires `warp_ssh_diagnostics::classify_ssh_stderr` into the path that
builds the `RemoteServerManagerEvent::SessionConnectionFailed` banner
in the terminal view. The banner previously always said "Failed to
start SSH extension" with the raw stderr in the detail field;
classified failure modes now surface as the banner body itself with
an actionable suggested-fix line in the detail.

Banner labels include:
- "Public-key auth rejected" — for `Permission denied (publickey)`
- "Host key CHANGED — investigate" — for the openssh MitM warning
- "Identity file permissions too open" — for UNPROTECTED PRIVATE KEY
- "Hostname could not be resolved" — for DNS failures
- "SSH port closed" — for `Connection refused`
- ... and 13 more (see `ConnectionFailureMode` taxonomy)

The helper is a pure function in `app/src/terminal/ssh/failure_classification.rs`
that takes the raw error + proxy_stderr from the manager event and
returns a structured `UserFacingError`. Pure code; unit-testable
without a transport setup.

10 tests cover:
- classified-path bodies for auth / network / host-key / identity-permissions modes
- precedence (UNPROTECTED PRIVATE KEY beats downstream auth failure;
  HOST IDENTIFICATION CHANGED beats generic mismatch)
- fallback path (unrecognized stderr keeps the pre-existing banner shape)
- empty-error / empty-stderr / `Some("")` edge cases
- UTF-8-safe truncation at MAX_RAW_TAIL_CHARS

When classification doesn't match, the banner falls back to the
original "Failed to start SSH extension" body with the raw error text
as detail -- behavior-preserving for unrecognized failures, additive
for recognized ones.

Addresses (partial fix angle):
- warpdotdev#3317 — SSH does not work with remote server settings
- warpdotdev#1922 — Shell on remote server will not load
- warpdotdev#2931 — Error when running commands on remote servers
@cla-bot cla-bot Bot added the cla-signed label Jun 22, 2026
@oz-for-oss

oz-for-oss Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@david-engelmann

Every PR must be linked to a same-repo issue before Oz can review it.

Next step: open or find a same-repo issue describing this change, then link it to this PR by adding Closes #123 to the PR description (or using the "Development" sidebar on GitHub). A maintainer will mark the issue ready-to-implement when it is ready. Once it is marked, comment /oz-review to re-trigger review.

See the contribution guidelines for the full readiness model.

Powered by Oz

@github-actions github-actions Bot added the external-contributor Indicates that a PR has been opened by someone outside the Warp team. label Jun 22, 2026

@oz-for-oss oz-for-oss Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-engelmann

Every PR must be linked to a same-repo issue before Oz can review it.

Next step: open or find a same-repo issue describing this change, then link it to this PR by adding Closes #123 to the PR description (or using the "Development" sidebar on GitHub). A maintainer will mark the issue ready-to-implement when it is ready. Once it is marked, comment /oz-review to re-trigger review.

See the contribution guidelines for the full readiness model.

Powered by Oz

@david-engelmann

Copy link
Copy Markdown
Contributor Author

Updated PR body to add Closes #3317 at the top (and Refs #1922, #2931 for the other related issues — each carries ready-to-implement, and this PR is partial-fix shape for the broader "SSH does not work" report in #3317). Happy to switch the linkage to a different issue or split into per-issue PRs if the team prefers; #3317 felt closest to the user-facing problem the classification surface solves.

/oz-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed external-contributor Indicates that a PR has been opened by someone outside the Warp team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SSH does not work with remote server settings

1 participant