resolve volumes by stable name (logical→physical owned by GlideFS) + node-scoped recovery + ublk idle-spin fix #81
Merged
GlideFS now owns the name→location mapping; callers address everything by
logical name and never supply an s3_prefix/manifest_name/snapshot_sequence.
- block/registry.rs: FromRef (image:/volume:/snapshot:), ResolvedSource,
durable image + snapshot indexes (sibling of export.json).
- router: resolve_export + GET /api/resolve/{name} (reads export.json from S3,
works on any node); resolve_source turns a logical `from` ref into physical
coords feeding the existing fork machinery; new volume lands in the source's
pool for CoW. Re-attach-by-name on PUT when not held locally.
- api: PUT body is fully logical (CreateVolumeRequest{size_gb, from, ...});
physical knobs removed. create_or_attach_volume is the shared core.
- Phase 3: snapshot_export returns a stable snapshot_id + writes the snapshot
index; image index written by HTTP bless, `glidefs bless` CLI, promote-base,
and tags; GET /api/images/{name}; lineage in ExportConfig::source.
- Physical S3 layout unchanged (no migration); explicit-s3_prefix admin
endpoints (manifests/profile HEAD/GET) retained for build-time use.
- Docs: ARCHITECTURE.md + README.md updated to the logical model.
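
The logical `from` reference described in the list above can be sketched as a small parser. This is an illustrative sketch only: the enum shape, the `ParseError` type, and the `parse_from_ref` function are assumptions made for the example, and the real `FromRef` in block/registry.rs may look different.

```rust
/// Hypothetical mirror of a logical `from` reference. Only the three
/// schemes (image:/volume:/snapshot:) come from the commit message.
#[derive(Debug, PartialEq, Eq)]
enum FromRef {
    Image(String),
    Volume(String),
    Snapshot(String),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    MissingScheme,
    UnknownScheme(String),
    EmptyName,
}

/// Split "scheme:name" at the first ':' and map the scheme to a variant.
/// Callers pass only the logical name; no physical coordinates appear here.
fn parse_from_ref(s: &str) -> Result<FromRef, ParseError> {
    let (scheme, name) = s.split_once(':').ok_or(ParseError::MissingScheme)?;
    if name.is_empty() {
        return Err(ParseError::EmptyName);
    }
    match scheme {
        "image" => Ok(FromRef::Image(name.to_string())),
        "volume" => Ok(FromRef::Volume(name.to_string())),
        "snapshot" => Ok(FromRef::Snapshot(name.to_string())),
        other => Err(ParseError::UnknownScheme(other.to_string())),
    }
}
```

In this sketch a physical-looking input such as `s3://bucket/x` is rejected as an unknown scheme, which matches the intent that callers never supply physical locations.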
465 lib tests pass, clippy clean, all feature-gated targets compile. Verified
e2e on live ublk: blank→snapshot→fork-by-snapshot:→resolve.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…orking set
discover_exports() lists every export.json under the single global
{db_path}/exports/ prefix, and boot re-attaches + binds a kernel ublk device
for all of them. On a shared bucket a node thus resurrects every export ever
created (deep-sleeping/detached/dead/other-node) as a live /dev/ublkbN — a
restart re-attached 350+ devices when ~3 VMs existed.
Add discover_local_exports(): read the node-local device maps
(cache_dir/ublk_devices.json + nbd_devices.json — rewritten on every device
add/remove, so they name exactly the exports this node owns a device for) and
load_export() only those. Swap the cold-start call (cli/server.rs). Everything
else a node doesn't hold locally stays dormant in S3 and attaches on demand by
name — enabled by resolve-by-name. A fresh node (no maps) recovers nothing and
attaches by name when asked (dead-node recovery).
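
The recovery rule above (recover only what the node-local device maps name, leave the rest dormant) can be sketched as set logic. The `DeviceMap` type and both function names are hypothetical; the on-disk schema of ublk_devices.json and nbd_devices.json is not shown in this commit, so the maps are modelled as already-parsed name→device-id tables.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory form of one node-local device map:
/// export name -> kernel device id. The real JSON schema is an assumption.
type DeviceMap = BTreeMap<String, u32>;

/// Union of export names across all local device maps (ublk + nbd).
/// A fresh node with no maps yields an empty set and recovers nothing.
fn local_working_set(maps: &[DeviceMap]) -> BTreeSet<String> {
    maps.iter().flat_map(|m| m.keys().cloned()).collect()
}

/// Exports present in the shared bucket but not owned by this node.
/// Shown only for illustration: in this model they stay in S3 and are
/// attached later by name, so boot never needs to list them.
fn dormant_exports(all_in_bucket: &[&str], local: &BTreeSet<String>) -> Vec<String> {
    all_in_bucket
        .iter()
        .filter(|name| !local.contains(**name))
        .map(|name| name.to_string())
        .collect()
}
```

With a bucket holding hundreds of exports and a local map naming three, only those three would be re-attached at boot under this model.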
Verified live: boot recovers from the device map (not S3); a detached export
(export.json present, dropped from the map) is not resurrected on restart but
remains resolvable by name.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The worker loop used `to_wait = if all_done() { 0 } else { 1 }`, so a worker
with no hosted queues passed to_wait=0 to io_uring_enter, which returned
instantly and busy-looped a full core. With N idle workers (N = pool size −
workers hosting queues) that burned ~N cores doing nothing. On a host running
few VMs that is most of the pool (observed: ~1576% CPU with no/few devices),
so the wasted CPU grew as the device count shrank.
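
The arithmetic behind that observation can be sketched as follows. The function names are hypothetical, and the sketch assumes `all_done()` is true for a worker that hosts no queues, as the description above implies; a spinning worker is modelled as one fully busy core.

```rust
/// The pre-fix policy quoted above: a worker whose queues are all done
/// (including a worker with no queues) asks io_uring_enter to wait for
/// zero completions, so the call returns immediately.
fn to_wait_before_fix(all_done: bool) -> u32 {
    if all_done { 0 } else { 1 }
}

/// N = pool size minus workers hosting queues.
fn idle_workers(pool_size: usize, hosting: usize) -> usize {
    pool_size.saturating_sub(hosting)
}

/// Rough CPU cost of the spin, at about 100% per idle worker.
fn approx_spin_cpu_percent(pool_size: usize, hosting: usize) -> usize {
    idle_workers(pool_size, hosting) * 100
}
```

For a hypothetical pool of 16 workers with none hosting a queue, this estimate gives 1600%, the same order as the figure observed above.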
The eventfd watcher is a daemon task that keeps a PollAdd permanently armed on
the worker's eventfd and re-arms forever, independent of any hosted queue.
Every WorkerHandle::send (AddQueue/RemoveQueue/Shutdown) writes the eventfd,
generating a PollAdd CQE that unblocks io_uring_enter immediately. So a
queue-less worker has nothing to gain from busy-polling: no I/O can target it
until a queue is assigned, and that assignment wakes it. glidefs already blocks
with a WORKER_IDLE_NSEC (250ms) timeout — it is not a busy-poll-for-latency
design — so always blocking is strictly better. Channel close is still noticed
within one idle tick.
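
The always-block behaviour can be illustrated with a std-only analogy. This is not the io_uring code path: a `std::sync::mpsc` channel with `recv_timeout` stands in for the eventfd-armed `io_uring_enter` with its idle timeout, and the `Cmd`, `Tick`, and `idle_tick` names are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Stand-in for the commands that write the worker's eventfd.
#[derive(Debug, PartialEq, Eq)]
enum Cmd {
    AddQueue(u16),
    RemoveQueue(u16),
    Shutdown,
}

/// What one blocking wait of an idle worker can observe.
#[derive(Debug, PartialEq, Eq)]
enum Tick {
    Woken(Cmd), // a command arrived and unblocked the wait at once
    TimedOut,   // nothing happened for one idle interval
    Closed,     // the sending side went away
}

/// Block for at most `idle` instead of polling in a tight loop.
fn idle_tick(rx: &Receiver<Cmd>, idle: Duration) -> Tick {
    match rx.recv_timeout(idle) {
        Ok(cmd) => Tick::Woken(cmd),
        Err(RecvTimeoutError::Timeout) => Tick::TimedOut,
        Err(RecvTimeoutError::Disconnected) => Tick::Closed,
    }
}
```

The point of the analogy is that a queue assignment wakes the blocked worker without delay, while an idle worker costs one wakeup per timeout rather than a full core.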
Verified live: CPU 1590%→0% with 0 devices; 8MB direct-IO write/read round-trip
intact (workers still service I/O); idles to 0% after I/O.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ver_local_exports) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GlideFS now owns the logical→physical mapping: callers address volumes by stable name and never supply
`s3_prefix`/`manifest_name`/`snapshot_sequence`. Three related changes:

1. Resolve-by-name logical API (12a6fa7)
   - `block/registry.rs`: `FromRef` (`image:`/`volume:`/`snapshot:`), `ResolvedSource`, durable image + snapshot indexes (sibling of `export.json`).
   - `resolve_export` + `GET /api/resolve/{name}` (reads `export.json` from S3, so it works on any node). `resolve_source` turns a logical `from` ref into physical coords feeding the existing fork machinery; the new volume lands in the source's pool for CoW. Re-attach-by-name on `PUT` when not held locally.
   - `PUT` body is fully logical (`CreateVolumeRequest{size_gb, from, …}`); physical knobs removed.
   - `snapshot_export` returns a stable `snapshot_id` + writes the snapshot index; image index written by HTTP bless, `glidefs bless` CLI, `promote-base`, and tags; `GET /api/images/{name}`; lineage in `ExportConfig::source`.
   - Docs: `ARCHITECTURE.md` + `README.md`.

2. Node-scoped boot recovery (10484d3)
   `discover_exports()` listed every `export.json` under the single global `{db_path}/exports/` prefix, so a node resurrected every export ever created on the shared bucket as a live ublk device (observed: 350+ devices when ~3 VMs existed). New `discover_local_exports()` recovers only the node's working set from the local device maps (`ublk_devices.json`/`nbd_devices.json`); everything else stays dormant in S3 and attaches on demand by name. Enabled by (1).

3. ublk idle-worker spin fix (e94b6a7)
   The worker loop used `to_wait = if all_done() { 0 } else { 1 }`, so a worker with no hosted queues busy-spun a full core. With N idle workers that burned ~N cores (observed ~1576% CPU with few/no devices). The eventfd watcher daemon already wakes blocked workers on queue assignment, so idle workers now always block (`to_wait=1`). Verified live: 1590%→0% CPU with 0 devices; 8MB direct-IO round-trip intact.

Testing
`from:"image:…"` fork; node-scoped recovery + detach behavior; worker idle CPU 0%.

🤖 Generated with Claude Code