feat: agent-initiated gRPC stream, decoupled image build #7

Merged: aarani merged 4 commits into main from feat/agent-initiated-control-stream on Jun 20, 2026
Conversation

@aarani aarani commented Jun 20, 2026

Member

No description provided.

Nami-Ashkan and others added 4 commits June 20, 2026 16:12
The control plane no longer dials each agent's HTTP API. Instead, every agent
holds one persistent gRPC bidirectional stream open to the control plane, which
pushes VM lifecycle commands down it; the agent runs them against its local
Runtime and answers on the same stream. Agents now need no inbound
reachability, and the open stream both delivers commands and serves as the
host's liveness signal.

- proto/agentlink + generated internal/agentlink/pb: AgentLink.Connect bidi
  stream. Command/result payloads are JSON of the existing agent.VMSpec/agent.VM
  types; heartbeats ride the same stream.
- internal/agentlink/hub.go: CP-side gRPC server + per-host connection registry;
  registers hosts (reconstructing committed capacity), correlates results,
  stamps liveness on heartbeats, and marks a host down the instant its stream
  drops.
- internal/agent/link.go: agent-side dial/register/serve loop with heartbeat
  ticker and reconnect backoff.
- provisioner/remote.go calls the hub by host id (Commander interface);
  repository.HostRepository gains MarkDown for immediate down-on-disconnect.
- Remove the old HTTP path: agent /vms server + client, HTTP register/heartbeat
  handlers, AgentAuth middleware, and ADVERTISE_ADDR. Agent dials
  CONTROL_PLANE_GRPC_ADDR; control plane listens on GRPC_PORT (8090).

Co-Authored-By: Afshin Arani <afshin@arani.dev>

A cold image build pulls layers lazily, so the layer download streams
during the squashfs write and is bound to the context passed in. When that
was a short request- or command-scoped context, a dropped agent stream or
the reconciler's 30s batch timeout cancelled it mid-stream, aborting a
build that legitimately takes minutes and surfacing as "stream layer tar:
write file ...: context canceled". Since the .tmp is removed on error,
every retry restarted from zero.

Build the rootfs under the runtime's lifetime context plus a generous
ImagePullTimeout instead of the command ctx. The build is
content-addressed, idempotent, and shared across every VM booting the
ref, so a cancelled provision now leaves a populated cache for the next
attempt rather than killing the work. ImageStore becomes an ImageEnsurer
interface so the non-KVM tests can assert the build context.

Also drop the reconciler's 30s per-batch timeout: nothing in the
reconciler should impose a command/request-scoped deadline on
provisioning. Operations run under the reconciler lifetime, bounded by
the agent-side pull timeout and process shutdown.

Co-Authored-By: Afshin Arani <afshin@arani.dev>

The e2e suite was wired for the old "control plane dials agent over HTTP"
model (agent.NewRouter/NewClient, provisioner.NewRemote(hostRepo, ...),
and HTTP /agent/hosts register+heartbeat endpoints), all removed when the
control channel inverted to an agent-dialed gRPC stream. The suite no
longer compiled.

- main_test.go: stand up the real agentlink.Hub on a localhost gRPC
  listener and an in-process placement agent via agent.RunLink over a
  FakeRuntime, exactly as production wires it; provisioner.NewRemote(hub).
  The host registers itself over the stream, so wait on the fleet
  inventory for it to come ready before running tests. Adds startAgent()
  for ad-hoc per-test agents.
- agent_test.go: the agent has no inbound API anymore; query the
  in-process FakeRuntime directly to confirm the VM landed on the agent.
- hosts_test.go: rewrite host lifecycle/identity as real gRPC flows
  (connect->ready, disconnect->down, reconnect->ready; agent-supplied id
  authoritative and updated in place on reconnect). Drop the HTTP
  request-validation tests (bad id, missing fields, unknown heartbeat):
  they have no analog on the typed gRPC Register path, and that seam is
  unit-covered in internal/agentlink/hub_test.go.

Full suite passes against a live Postgres testcontainer.

Co-Authored-By: Afshin Arani <afshin@arani.dev>
@aarani aarani merged commit 11bd603 into main Jun 20, 2026
3 checks passed
@aarani aarani deleted the feat/agent-initiated-control-stream branch June 20, 2026 16:35
2 participants