
resolve volumes by stable name (logical→physical owned by GlideFS) + node-scoped recovery + ublk idle-spin fix #81

Merged
jaredLunde merged 4 commits into main from jared/resolve-by-name on Jun 23, 2026


@jaredLunde

GlideFS now owns the logical→physical mapping: callers address volumes by stable name and never supply s3_prefix/manifest_name/snapshot_sequence. Three related changes:

1. Resolve-by-name logical API (12a6fa7)

  • block/registry.rs: FromRef (image:/volume:/snapshot:), ResolvedSource, and durable image + snapshot indexes (siblings of export.json).
  • resolve_export + GET /api/resolve/{name} (reads export.json from S3, so it works on any node). resolve_source turns a logical `from` ref into physical coordinates that feed the existing fork machinery; the new volume lands in the source's pool for CoW. PUT re-attaches a volume by name when this node does not hold it locally.
  • The PUT body is fully logical (CreateVolumeRequest{size_gb, from, …}); the physical knobs are removed.
  • Snapshots return a stable snapshot_id and write the snapshot index; the image index is written by HTTP bless, the glidefs bless CLI, promote-base, and tags; GET /api/images/{name}; lineage is recorded in ExportConfig::source.
  • Physical S3 layout unchanged (no migration). Docs: ARCHITECTURE.md + README.md.
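A minimal sketch of how a logical `from` ref might be parsed into the three forms listed above. The enum, its variant names, and the `ParseError` type are assumptions for illustration; the actual `FromRef` in block/registry.rs is not shown in this PR and may differ.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the image:/volume:/snapshot: ref forms; the real
/// FromRef in block/registry.rs may be shaped differently.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FromRef {
    Image(String),
    Volume(String),
    Snapshot(String),
}

/// Hypothetical error type for malformed refs.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    MissingScheme,
    UnknownScheme(String),
    EmptyName,
}

impl FromStr for FromRef {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split on the first ':' only, so a name that itself contains ':'
        // (for example an image tag) is kept intact.
        let (scheme, name) = s.split_once(':').ok_or(ParseError::MissingScheme)?;
        if name.is_empty() {
            return Err(ParseError::EmptyName);
        }
        let name = name.to_string();
        match scheme {
            "image" => Ok(FromRef::Image(name)),
            "volume" => Ok(FromRef::Volume(name)),
            "snapshot" => Ok(FromRef::Snapshot(name)),
            other => Err(ParseError::UnknownScheme(other.to_string())),
        }
    }
}
```

Parsing only classifies the ref; turning it into physical coordinates (s3_prefix, manifest_name, snapshot_sequence) is the job of the resolve step described above.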

2. Node-scoped boot recovery (10484d3)

discover_exports() listed every export.json under the single global {db_path}/exports/ prefix, so on boot a node resurrected every export ever created on the shared bucket as a live ublk device (observed: 350+ devices when ~3 VMs existed). The new discover_local_exports() recovers only the node's working set, taken from the local device maps (ublk_devices.json/nbd_devices.json); everything else stays dormant in S3 and attaches on demand by name. This is enabled by (1).

3. ublk idle-worker spin fix (e94b6a7)

The worker loop used to_wait = if all_done() { 0 } else { 1 }, so a worker with no hosted queues busy-spun a full core. With N idle workers, that burned ~N cores (observed: ~1576% CPU with few or no devices). The eventfd watcher daemon already wakes blocked workers on queue assignment, so idle workers now always block (to_wait=1). Verified live: CPU dropped from 1590% to 0% with 0 devices, and an 8MB direct-IO round-trip stays intact.
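The before/after wait count can be sketched as plain functions. These are hypothetical stand-ins, not the worker's code: they assume `all_done()` is true for a worker hosting no queues, and the returned value stands for the wait count passed to io_uring_enter. `spinning_workers` restates the arithmetic from the commit message (idle workers = pool size − workers hosting queues).

```rust
/// Old rule (hypothetical stand-in): a worker whose queues are all done,
/// assumed to include a worker with no queues at all, asks io_uring_enter to
/// wait for 0 completions, so the call returns at once and the loop spins.
pub fn to_wait_before(all_done: bool) -> u32 {
    if all_done { 0 } else { 1 }
}

/// New rule: always block for at least one completion. An idle worker is
/// woken by the eventfd watcher's PollAdd completion (or its idle timeout).
pub fn to_wait_after(_all_done: bool) -> u32 {
    1
}

/// Workers that busy-spin under the old rule: every pool worker that hosts
/// no queue. saturating_sub guards against inconsistent inputs.
pub fn spinning_workers(pool_size: usize, workers_hosting_queues: usize) -> usize {
    pool_size.saturating_sub(workers_hosting_queues)
}
```

For example, under the old rule a pool of 16 workers with one worker hosting queues would leave 15 workers spinning; under the new rule none spin.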

Testing

  • 467 lib tests pass, clippy clean, all feature-gated targets compile.
  • Verified e2e on the homelab: VM boots via from:"image:…" fork; node-scoped recovery + detach behavior; worker idle CPU 0%.

Note: this requires the coordinated instd cutover (separate beyond PR); instd must speak the logical API.

🤖 Generated with Claude Code

jaredLunde and others added 3 commits June 22, 2026 20:39
GlideFS now owns the name→location mapping; callers address everything by
logical name and never supply an s3_prefix/manifest_name/snapshot_sequence.

- block/registry.rs: FromRef (image:/volume:/snapshot:), ResolvedSource,
  durable image + snapshot indexes (sibling of export.json).
- router: resolve_export + GET /api/resolve/{name} (reads export.json from S3,
  works on any node); resolve_source turns a logical `from` ref into physical
  coords feeding the existing fork machinery; new volume lands in the source's
  pool for CoW. Re-attach-by-name on PUT when not held locally.
- api: PUT body is fully logical (CreateVolumeRequest{size_gb, from, ...});
  physical knobs removed. create_or_attach_volume is the shared core.
- Phase 3: snapshot_export returns a stable snapshot_id + writes the snapshot
  index; image index written by HTTP bless, `glidefs bless` CLI, promote-base,
  and tags; GET /api/images/{name}; lineage in ExportConfig::source.
- Physical S3 layout unchanged (no migration); explicit-s3_prefix admin
  endpoints (manifests/profile HEAD/GET) retained for build-time use.
- Docs: ARCHITECTURE.md + README.md updated to the logical model.
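A hypothetical model of the decision a shared create-or-attach path has to make on PUT, based on the behavior described above (re-attach by name when the volume is not held locally). Only `CreateVolumeRequest{size_gb, from}` comes from this PR; `PutAction`, `plan_put`, and the two boolean inputs are illustrative assumptions, and the other request fields elided above are omitted.

```rust
/// Logical PUT body; only size_gb and from are taken from the PR, the
/// remaining fields are left out.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CreateVolumeRequest {
    pub size_gb: u64,
    /// Logical source such as "image:base" or "snapshot:<id>"; None = blank.
    pub from: Option<String>,
}

/// Hypothetical outcomes of a PUT.
#[derive(Debug, PartialEq, Eq)]
pub enum PutAction {
    /// This node already holds the volume; reuse it.
    UseLocal,
    /// The export exists (resolvable by name) but is not held here: re-attach.
    ReattachByName,
    /// Nothing exists yet: create, forking from `from` when it is set.
    Create { size_gb: u64, from: Option<String> },
}

/// Hypothetical planner: local state wins, then a by-name re-attach, and
/// only then a fresh create from the logical request.
pub fn plan_put(held_locally: bool, export_exists: bool, req: &CreateVolumeRequest) -> PutAction {
    if held_locally {
        PutAction::UseLocal
    } else if export_exists {
        PutAction::ReattachByName
    } else {
        PutAction::Create { size_gb: req.size_gb, from: req.from.clone() }
    }
}
```

Ordering the checks this way means a PUT is idempotent by name: repeating it never forks a second copy of an existing volume.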

465 lib tests pass, clippy clean, all feature-gated targets compile. Verified
e2e on live ublk: blank→snapshot→fork-by-snapshot:→resolve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orking set

discover_exports() lists every export.json under the single global
{db_path}/exports/ prefix, and boot re-attaches + binds a kernel ublk device
for all of them. On a shared bucket a node thus resurrects every export ever
created (deep-sleeping/detached/dead/other-node) as a live /dev/ublkbN — a
restart re-attached 350+ devices when ~3 VMs existed.

Add discover_local_exports(): read the node-local device maps
(cache_dir/ublk_devices.json + nbd_devices.json — rewritten on every device
add/remove, so they name exactly the exports this node owns a device for) and
load_export() only those. Swap the cold-start call (cli/server.rs). Everything
else a node doesn't hold locally stays dormant in S3 and attaches on demand by
name — enabled by resolve-by-name. A fresh node (no maps) recovers nothing and
attaches by name when asked (dead-node recovery).
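The boot-time working set can be sketched as a union over the two device maps. This assumes each map has already been parsed into a device-id → export-name map; the real JSON shape of ublk_devices.json/nbd_devices.json is not shown in the PR, and `DeviceMap`/`local_working_set` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical parsed form of one device map: device id -> export name.
pub type DeviceMap = BTreeMap<u32, String>;

/// Exports this node should load at boot: the union of export names in its
/// local device maps. A missing map (as on a fresh node) contributes nothing,
/// so nothing is recovered and exports attach later by name on demand.
pub fn local_working_set(ublk: Option<&DeviceMap>, nbd: Option<&DeviceMap>) -> BTreeSet<String> {
    ublk.into_iter()
        .chain(nbd)
        .flat_map(|m| m.values().cloned())
        .collect()
}
```

Because the maps are rewritten on every device add/remove, an export dropped from them (for example after a detach) falls out of this set even though its export.json remains in S3.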

Verified live: boot recovers from the device map (not S3); a detached export
(export.json present, dropped from the map) is not resurrected on restart but
remains resolvable by name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The worker loop used `to_wait = if all_done() { 0 } else { 1 }`, so a worker
with no hosted queues passed to_wait=0 to io_uring_enter, which returned
instantly and busy-looped a full core. With N idle workers (N = pool size −
workers hosting queues) that burned ~N cores doing nothing. On a host running
few VMs that is most of the pool (observed: ~1576% CPU with no/few devices),
so the waste grows as the device count falls.

The eventfd watcher is a daemon task that keeps a PollAdd permanently armed on
the worker's eventfd and re-arms forever, independent of any hosted queue.
Every WorkerHandle::send (AddQueue/RemoveQueue/Shutdown) writes the eventfd,
generating a PollAdd CQE that unblocks io_uring_enter immediately. So a
queue-less worker has nothing to gain from busy-polling: no I/O can target it
until a queue is assigned, and that assignment wakes it. glidefs already blocks
with a WORKER_IDLE_NSEC (250ms) timeout — it is not a busy-poll-for-latency
design — so always blocking is strictly better. Channel close is still noticed
within one idle tick.
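The always-block argument can be illustrated with a std channel standing in for io_uring plus the eventfd watcher; this is a swapped-in analogy, not the glidefs mechanism. The worker always blocks with a 250 ms timeout (mirroring WORKER_IDLE_NSEC), wakes as soon as a message is sent, and exits on Shutdown or when every sender is dropped. `Msg` and `run_worker` are hypothetical names; unlike the eventfd design, a std channel reports disconnection immediately rather than at the next idle tick.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Stand-in for WORKER_IDLE_NSEC (250 ms).
const WORKER_IDLE: Duration = Duration::from_millis(250);

/// Hypothetical analog of the AddQueue/RemoveQueue/Shutdown worker messages.
#[derive(Debug, PartialEq, Eq)]
pub enum Msg {
    AddQueue(u32),
    RemoveQueue(u32),
    Shutdown,
}

/// Analogy of an idle worker: it never polls with a zero timeout, so it uses
/// no CPU while nothing is assigned, and a send wakes it immediately.
/// Returns the queue ids still hosted when the loop stops.
pub fn run_worker(rx: Receiver<Msg>) -> Vec<u32> {
    let mut queues = Vec::new();
    loop {
        match rx.recv_timeout(WORKER_IDLE) {
            Ok(Msg::AddQueue(q)) => queues.push(q),
            Ok(Msg::RemoveQueue(q)) => queues.retain(|&x| x != q),
            Ok(Msg::Shutdown) => break,
            // Idle tick: nothing arrived; a real worker would service hosted queues here.
            Err(RecvTimeoutError::Timeout) => continue,
            // Every sender dropped: treat it like Shutdown.
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    queues
}
```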

Verified live: CPU 1590%→0% with 0 devices; 8MB direct-IO write/read round-trip
intact (workers still service I/O); idles to 0% after I/O.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde changed the title from "Resolve volumes by stable name (logical→physical owned by GlideFS) + node-scoped recovery + ublk idle-spin fix" to "resolve volumes by stable name (logical→physical owned by GlideFS) + node-scoped recovery + ublk idle-spin fix" Jun 23, 2026
…ver_local_exports)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit adad677 into main Jun 23, 2026
25 checks passed
@jaredLunde jaredLunde deleted the jared/resolve-by-name branch June 23, 2026 05:01