68 changes: 55 additions & 13 deletions ARCHITECTURE.md
@@ -177,15 +177,20 @@ Each pack is self-describing — the block index is a footer (trailer → index

```
{db_path}/
└── exports/{s3_prefix}/ ← ContentStore root (shared by exports + bless)
├── manifests/{export_name} ← VolumeManifest GLVM (chunk_idx → [pack_ids])
├── manifests/{tag_name} ← Named manifest tag (same format, arbitrary name)
├── manifests/bases/{image_name} ← Blessed base image VolumeManifest (glidefs bless)
├── snapshots/{export_name}/{sequence:020} ← Versioned VolumeManifest (zero-padded sequence)
└── chunks/{chunk_idx:04}/
└── {pack_id:016x}.pack ← GLPK pack (self-describing: header+index+data)
├── exports/{s3_prefix}/ ← ContentStore root (shared by exports + bless)
│ ├── manifests/{export_name} ← VolumeManifest GLVM (chunk_idx → [pack_ids])
│ ├── manifests/{tag_name} ← Named manifest tag (same format, arbitrary name)
│ ├── manifests/bases/{image_name} ← Blessed base image VolumeManifest (glidefs bless)
│ ├── snapshots/{export_name}/{sequence:020} ← Versioned VolumeManifest (zero-padded sequence)
│ └── chunks/{chunk_idx:04}/
│ └── {pack_id:016x}.pack ← GLPK pack (self-describing: header+index+data)
└── index/ ← Logical→physical resolution (name-keyed, prefix-independent)
├── images/{image_name}.json ← image:<name> → {pool, manifest}
└── snapshots/{volume}@{seq}.json ← snapshot:<id> → {pool, volume, sequence, parent}
```

(The volume index is `exports/{name}/export.json` itself — name-keyed and pool-independent — so it doubles as both the export definition and the `volume:<name>` resolver.)

Chunk directories use 4-digit zero-padded indices (`chunks/0000/`, `chunks/0001/`, ...). A 1 TiB device has 8,192 chunks (128 MiB each). A compacted chunk has exactly 1 pack file; an uncompacted chunk may have up to `DEFAULT_COMPACTION_THRESHOLD` (16) packs.
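
As a rough check of the numbers above, the following sketch computes the chunk count for a device and formats a pack path. It assumes the 128 MiB chunk size and the `chunks/{chunk_idx:04}/{pack_id:016x}.pack` pattern quoted in this section; the function names are hypothetical and not part of GlideFS.

```python
CHUNK_SIZE = 128 * 1024 * 1024  # 128 MiB per chunk, as stated above


def chunk_count(device_bytes: int) -> int:
    """Number of chunks needed to cover a device (ceiling division)."""
    return -(-device_bytes // CHUNK_SIZE)


def pack_path(chunk_idx: int, pack_id: int) -> str:
    """Relative pack path: 4-digit chunk directory, 16-hex-digit pack id."""
    return f"chunks/{chunk_idx:04}/{pack_id:016x}.pack"


if __name__ == "__main__":
    print(chunk_count(1 << 40))  # chunk count for a 1 TiB (2**40 byte) device
    print(pack_path(1, 0xABC))   # pack path under chunks/0001/
```

Note that the count of 8,192 corresponds to a binary terabyte (2^40 bytes); a decimal 10^12-byte device needs fewer chunks.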

**Manifest size by scenario:**
@@ -462,14 +467,18 @@ GC reconcile_prefix():

```
PUT /api/exports/fork-vm
{ "manifest_name": "prod-vm", "snapshot_sequence": 42, "size_gb": 10 }
{ "from": "snapshot:prod-vm@42", "size_gb": 10 } ← logical: no pool, no manifest, no sequence
resolve_source(Snapshot("prod-vm@42")) ← GET index/snapshots/prod-vm@42.json
→ ResolvedSource { pool: "prod-vm", manifest_name: "prod-vm", snapshot_sequence: 42 }
router.create_export(config, readonly=false, manifest_name=Some("prod-vm"), snapshot_sequence=Some(42))
router.create_export(config{s3_prefix="prod-vm"}, readonly=false, manifest_name=Some("prod-vm"), snapshot_sequence=Some(42))
├── content_store.get_snapshot("prod-vm", 42) ← GET snapshots/prod-vm/00000000000000000042
├── VolumeManifest::deserialize()
├── ContentStore::put_manifest("fork-vm", ...) ← PUT manifests/fork-vm
├── ContentStore::put_manifest("fork-vm", ...) ← PUT manifests/fork-vm (in prod-vm's pool, CoW)
└── WriteCache::open_fresh_active(config) ← empty local block map
```
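
The resolution step in the flow above can be illustrated with a small sketch that splits a snapshot id into its volume and sequence and builds the zero-padded snapshot key. This is an assumption-laden illustration in Python, not the Rust code in `router.rs`; `parse_snapshot_id` and `snapshot_key` are hypothetical names, and the sketch assumes the id always has the `{volume}@{seq}` shape shown in the layout.

```python
def parse_snapshot_id(snapshot_id: str) -> tuple[str, int]:
    """Split '<volume>@<seq>' into (volume, sequence)."""
    # rpartition splits on the last '@', so the sequence is always the tail.
    volume, sep, seq = snapshot_id.rpartition("@")
    if not sep or not volume or not (seq.isascii() and seq.isdigit()):
        raise ValueError(f"malformed snapshot id: {snapshot_id!r}")
    return volume, int(seq)


def snapshot_key(export_name: str, sequence: int) -> str:
    """Versioned manifest key with a 20-digit zero-padded sequence."""
    return f"snapshots/{export_name}/{sequence:020}"


if __name__ == "__main__":
    volume, seq = parse_snapshot_id("prod-vm@42")
    print(snapshot_key(volume, seq))
```

Zero-padding the sequence to a fixed width keeps lexicographic key order identical to numeric order, which is convenient when listing a prefix.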


| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/exports/{name}` | `PUT` | Create or resize export (idempotent). With `manifest_name` + optional `snapshot_sequence`: fork from parent or specific snapshot. |
| `/api/exports/{name}` | `PUT` | Create, fork, re-attach, or resize a volume by **name alone** (idempotent). Body is fully logical — no `s3_prefix`/`manifest_name`/`snapshot_sequence`. To fork, set `from` to `"image:<name>"`, `"volume:<name>"`, or `"snapshot:<id>"`; GlideFS resolves it to a pool + manifest and places the new volume in the source's pool for CoW. Omit `from` for a blank volume. |
| `/api/resolve/{name}` | `GET` | Resolve a volume's physical location (`{s3_prefix, manifest_name, size_gb,…}`) from the durable name-keyed index, reading `export.json` straight from S3. Works on **any** node — even one that has never attached or discovered the volume. The primitive behind dead-node recovery. |
| `/api/images/{name}` | `GET` | Resolve a blessed image's location (`{name, pool, manifest}`) from the logical image index. |
| `/api/exports/{name}` | `GET` | Get export info (size, readonly, transport, device path) |
| `/api/exports/{name}` | `DELETE` | Remove export. `?purge=true` also deletes local cache and all S3 snapshots. |
| `/api/exports` | `GET` | List all active exports |
| `/api/exports/{name}/snapshot` | `POST` | Flush dirty blocks → S3, upload versioned manifest. Optional body `{"tag":"name"}` also publishes named alias. Returns `{sequence, manifest_etag}`. |
| `/api/exports/{name}/snapshot` | `POST` | Flush dirty blocks → S3, upload versioned manifest, and register the snapshot in the logical index. Optional body `{"tag":"name"}` also publishes a named alias. Returns `{snapshot_id, sequence, manifest_etag}` — fork from it via `from: "snapshot:<snapshot_id>"`. |
| `/api/exports/{name}/snapshots` | `GET` | List snapshot sequences in ascending order |
| `/api/exports/{name}/snapshots/{seq}` | `DELETE` | Delete a specific snapshot (idempotent) |
| `/api/exports/{name}/tag` | `POST` | Publish current manifest under a named alias without re-flushing. Body: `{"tag":"name"}`. |
| `/api/manifests/{s3_prefix}/{name}` | `HEAD` | Check manifest existence (200/404). No data transfer, no running export required. |
| `/api/exports/{name}/drain` | `POST` | Flush all dirty blocks to S3 (no versioned snapshot) |
| `/api/exports/{name}/promote` | `POST` | Toggle readonly → read-write |
| `/api/exports/{name}/promote-base` | `POST` | Publish a snapshot's manifest under `bases/{base_name}` (no data re-upload). Body: `{"base_name":"...","sequence":N}`. Idempotent; the promoted base is forkable and profileable like a blessed one. |
| `/api/exports/{name}/promote-base` | `POST` | Publish a snapshot's manifest under `bases/{base_name}` (no data re-upload) and register it in the image index. Body: `{"base_name":"...","sequence":N}`. Idempotent; the promoted base is forkable (`from: "image:<base_name>"`) and profileable like a blessed one. |
| `/api/profile/{s3_prefix}/{name}` | `POST` | Start a background boot-set profile of `bases/{name}` (202). Body (all optional): `{"cmd","seed_paths","fs_type","runs","timeout_secs","force","untrusted","max_blocks"}`. `seed_paths` are faulted under the tracer before the entrypoint. 503 when the server has no `[profile]` config. |
| `/api/profile/{s3_prefix}/{name}` | `GET` | Profile status: `{"state":"running"}` in-flight; `{"state":"complete"}` when `.boot-set.meta` exists; 404 when neither (never profiled, or last attempt failed). |
| `/api/exports/{name}/metrics` | `GET` | Per-export metrics snapshot (JSON) |
@@ -724,6 +735,37 @@ HTTP REST API for orchestrators (scale-to-zero, live migration). (`api.rs`)

Export definitions are saved to S3 as `{db_path}/exports/{name}/export.json`. On startup, `discover_exports()` lists all `export.json` files under the `exports/` prefix and loads them with 32-way parallelism; `create_export()` then recovers each export from the local WAL with 16-way parallelism. The recovery path performs no S3 writes. (`router.rs:save_export`, `router.rs:discover_exports`, `cli/server.rs`)

### Logical Naming & Resolution (GlideFS owns the logical→physical mapping)

Callers address everything by **stable logical name** and never supply a physical
`s3_prefix` or `manifest_name`. GlideFS owns three durable, name-keyed, prefix-independent
indexes — read on every resolve so **any node** can locate data from a name alone (the
basis for dead-node recovery: kill the node holding a mapping, the bytes stay addressable).

| Index | Key | S3 location | Resolves a… | Written by |
|-------|-----|-------------|-------------|------------|
| **Volume** | volume name | `{db_path}/exports/{name}/export.json` | `volume:<name>` → `(pool, manifests/{name})` | every create/fork/re-attach (`save_export`) |
| **Image** | image name | `{db_path}/index/images/{name}.json` | `image:<name>` → `(pool, bases/{name})` | bless (HTTP + `glidefs bless` CLI) and `promote-base` (`index_image` / `registry::put_image_entry`) |
| **Snapshot** | `{volume}@{seq}` | `{db_path}/index/snapshots/{id}.json` | `snapshot:<id>` → `(pool, volume, sequence)` | `snapshot_export` (`save_snapshot_entry`) |
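
The table above can be read as a mapping from a logical kind and key to an S3 object key. A minimal Python sketch of that mapping follows; `index_key` is a hypothetical helper written for illustration, and only the path templates are taken from the table.

```python
def index_key(db_path: str, kind: str, key: str) -> str:
    """Return the S3 object key of the index entry for a logical name.

    `kind` is one of 'volume', 'image', 'snapshot'; `key` is the volume
    name, the image name, or the '<volume>@<seq>' snapshot id.
    """
    if kind == "volume":
        # The export definition doubles as the volume index entry.
        return f"{db_path}/exports/{key}/export.json"
    if kind == "image":
        return f"{db_path}/index/images/{key}.json"
    if kind == "snapshot":
        return f"{db_path}/index/snapshots/{key}.json"
    raise ValueError(f"unknown index kind: {kind!r}")


if __name__ == "__main__":
    for kind, key in [("volume", "prod-vm"),
                      ("image", "ubuntu-22.04-v1"),
                      ("snapshot", "prod-vm@42")]:
        print(index_key("db", kind, key))
```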

A create/fork request carries a logical `from` ref (`FromRef`, `block/registry.rs`); the
router's `resolve_source()` turns it into the physical coordinates the existing fork
machinery needs, and **places the new volume in the source's pool so CoW pack sharing
works**. The physical S3 layout is unchanged — only the *addressing* moved from the caller
into GlideFS. Lineage (`ExportConfig::source`, and a snapshot entry's `parent`) records the
`from` ref so GlideFS, not the caller, owns the parent/child graph.
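
To make the `from` syntax concrete, here is a sketch of parsing a ref string into a kind and a name. It is a Python stand-in only: the real `FromRef` lives in `block/registry.rs` and its exact shape is not shown in this document, so the class fields, the `parse_from` function, and the validation rules below are all assumptions.

```python
from dataclasses import dataclass

# The three ref kinds named in this section.
KINDS = ("image", "volume", "snapshot")


@dataclass(frozen=True)
class FromRef:
    """Hypothetical stand-in for the Rust FromRef type."""
    kind: str
    name: str


def parse_from(ref: str) -> FromRef:
    """Parse 'image:<name>', 'volume:<name>', or 'snapshot:<id>'."""
    # Split on the first ':' only, so names may themselves contain ':'.
    kind, sep, name = ref.partition(":")
    if sep != ":" or kind not in KINDS or not name:
        raise ValueError(f"invalid from ref: {ref!r}")
    return FromRef(kind, name)


if __name__ == "__main__":
    print(parse_from("snapshot:prod-vm@42"))
    print(parse_from("image:ubuntu-22.04-v1"))
```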

Re-attach: a `PUT` for a volume not held locally consults the volume index first; if it
exists, GlideFS adopts the persisted pool + geometry and attaches the real data instead of
creating a fresh empty volume at the wrong pool. (`router.rs:resolve_export`,
`router.rs:resolve_source`, `api.rs:create_or_attach_volume`.)
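
The re-attach rule can be sketched as a small decision function. This is a simplified model under stated assumptions: the volume index is represented as an in-memory dict rather than S3, the entry fields `s3_prefix` and `size_gb` are taken from the `/api/resolve/{name}` description, and `plan_put` with its action labels is hypothetical, not the logic in `api.rs:create_or_attach_volume`.

```python
from typing import Optional


def plan_put(name: str, held_locally: set, volume_index: dict) -> dict:
    """Decide what a PUT for `name` should do on this node."""
    if name in held_locally:
        # Already attached here: the idempotent PUT path applies.
        return {"action": "use_local"}
    entry: Optional[dict] = volume_index.get(name)
    if entry is not None:
        # Volume exists elsewhere: adopt its persisted pool and geometry.
        return {
            "action": "reattach",
            "pool": entry["s3_prefix"],
            "size_gb": entry["size_gb"],
        }
    # No index entry: this really is a new volume.
    return {"action": "create_blank"}


if __name__ == "__main__":
    index = {"prod-vm": {"s3_prefix": "prod-vm", "size_gb": 500}}
    print(plan_put("prod-vm", set(), index))
    print(plan_put("new-vm", set(), index))
```

Consulting the index before creating anything is what prevents a node that has never seen the volume from shadowing real data with an empty volume.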

**Remaining physical surface (build-time admin only).** A few endpoints still take a
`{s3_prefix}` path segment — `HEAD /api/manifests/{s3_prefix}/{name}` and
`POST|GET /api/profile/{s3_prefix}/{name}`. These are image-authoring/admin operations, not
the volume create/fork data path; the orchestrator's runtime volume lifecycle uses logical
names exclusively.

## Observability

Per-export Prometheus metrics exposed at `/metrics`. Latency histograms are sampled 1:64 to reduce mutex contention at high IOPS. (`metrics.rs`)
34 changes: 18 additions & 16 deletions README.md
@@ -77,13 +77,13 @@ curl -X PUT localhost:8080/api/exports/my-vm \
-d '{"size_gb": 500}'
# → {"name":"my-vm","size_bytes":500000000000,"readonly":false,"transport":"nbd","device":"/dev/nbd0"}

# Fork from the current state of an existing export
# Fork from the current state of an existing export — by logical name alone
curl -X PUT localhost:8080/api/exports/my-vm-fork \
-d '{"size_gb": 500, "manifest_name": "my-vm"}'
-d '{"size_gb": 500, "from": "volume:my-vm"}'

# Fork from a specific snapshot (returns sequence from POST /snapshot)
# Fork from a specific snapshot (snapshot_id comes from POST /snapshot)
curl -X PUT localhost:8080/api/exports/my-vm-fork \
-d '{"size_gb": 500, "manifest_name": "my-vm", "snapshot_sequence": 42}'
-d '{"size_gb": 500, "from": "snapshot:my-vm@42"}'

# Use ublk transport (Linux 6.0+, requires --features ublk)
curl -X PUT localhost:8080/api/exports/my-vm \
@@ -112,7 +112,9 @@ PUT is idempotent. Same size → returns current state. Larger size → grows th
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/exports` | GET | List exports (includes transport + device path) |
| `/api/exports/{name}` | PUT | Create or resize export. `manifest_name` + optional `snapshot_sequence` to fork. |
| `/api/exports/{name}` | PUT | Create or resize export by name. To fork, set `from` to `"image:<name>"`, `"volume:<name>"`, or `"snapshot:<id>"`. |
| `/api/resolve/{name}` | GET | Resolve a volume's location (`s3_prefix`, `manifest_name`, …) by name — reads S3 directly, works on any node. |
| `/api/images/{name}` | GET | Resolve a blessed image's location (`pool`, `manifest`) by name. |
| `/api/exports/{name}` | GET | Get export info |
| `/api/exports/{name}` | DELETE | Remove export. `?purge=true` deletes local cache and all S3 snapshots. |
| `/api/exports/{name}/drain` | POST | Flush all dirty blocks to S3 (no snapshot created) |
@@ -137,11 +139,11 @@ glidefs bless --image ubuntu-22.04.raw --name ubuntu-22.04-v1 --s3-prefix bases

Exports forked from base images share blocks via content addressing. Identical data is stored once.

Fork from a blessed image using `manifest_name: "bases/{name}"`:
Fork from a blessed image using `from: "image:{name}"`:

```sh
curl -X PUT localhost:8080/api/exports/vm-1 \
-d '{"size_gb": 50, "manifest_name": "bases/ubuntu-22.04-v1"}'
-d '{"size_gb": 50, "from": "image:ubuntu-22.04-v1"}'
```

### Boot-set profiling (faster cold start)
@@ -180,7 +182,7 @@ curl -sX POST localhost:8080/api/exports/prod/snapshot \

# 2. Fork — instant CoW, no data copied
curl -X PUT localhost:8080/api/exports/vm-deploy-7 \
-d '{"size_gb": 50, "manifest_name": "prod"}'
-d '{"size_gb": 50, "from": "volume:prod"}'
# → {"device": "/dev/nbd1", ...}

# 3. Mount + sync code + start
@@ -229,7 +231,7 @@ if [ "$STATUS" -eq 200 ]; then
else
# Miss: fork from base, run setup, tag result
curl -X PUT localhost:8080/api/exports/setup-work \
-d '{"size_gb": 50, "manifest_name": "bases/ubuntu-24.04-v1"}'
-d '{"size_gb": 50, "from": "image:ubuntu-24.04-v1"}'

mount /dev/nbd1 /mnt
mise install node@22 && npm ci --prefix /mnt/app
@@ -242,9 +244,9 @@ else
SOURCE="setup-${SETUP_HASH}"
fi

# Fork from setup state, sync code, deploy
# Fork from setup state (a tag is forkable as an image), sync code, deploy
curl -X PUT localhost:8080/api/exports/vm-deploy-8 \
-d "{\"size_gb\": 50, \"manifest_name\": \"${SOURCE}\"}"
-d "{\"size_gb\": 50, \"from\": \"image:${SOURCE}\"}"
```

Same `IMAGE_ID + LOCKFILE_HASH` next deploy → HEAD returns 200 → setup is skipped entirely.
@@ -405,15 +407,15 @@ SEQ=$(curl -sX POST localhost:8080/api/exports/my-vm/snapshot | jq .sequence)
curl localhost:8080/api/exports/my-vm/snapshots
# → [1, 5, 42]

# Fork a new export from snapshot 42 (read-only parent blocks, CoW overlay for writes)
# Fork a new export from snapshot $SEQ (read-only parent blocks, CoW overlay for writes)
curl -X PUT localhost:8080/api/exports/my-vm-test \
-d "{\"size_gb\": 500, \"manifest_name\": \"my-vm\", \"snapshot_sequence\": $SEQ}"
-d "{\"size_gb\": 500, \"from\": \"snapshot:my-vm@$SEQ\"}"

# Delete a snapshot when done
curl -X DELETE localhost:8080/api/exports/my-vm/snapshots/5
```

`snapshot_sequence` is optional. Omit it to fork from the current state.
To fork from the live current state instead of a snapshot, use `from: "volume:my-vm"`.
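
For orchestrators that build requests programmatically, the same bodies can be produced from a snapshot response. The sketch below is in Python and only constructs JSON; it sends nothing. It assumes the snapshot response carries a `snapshot_id` field as described in the API table of this PR, and `fork_body` is a hypothetical helper name.

```python
import json
from typing import Optional


def fork_body(size_gb: int, from_ref: Optional[str] = None) -> str:
    """JSON body for PUT /api/exports/{name}; omit `from` for a blank volume."""
    body = {"size_gb": size_gb}
    if from_ref is not None:
        body["from"] = from_ref
    return json.dumps(body)


def snapshot_ref(snapshot_response: dict) -> str:
    """Build a `from` ref out of a POST /snapshot response."""
    return f"snapshot:{snapshot_response['snapshot_id']}"


if __name__ == "__main__":
    # Example response shape; the etag value is a placeholder.
    resp = {"snapshot_id": "my-vm@42", "sequence": 42, "manifest_etag": "etag"}
    print(fork_body(500, snapshot_ref(resp)))  # fork from a snapshot
    print(fork_body(500, "volume:my-vm"))      # fork from live state
    print(fork_body(500))                      # blank volume
```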

**GC and snapshots**: GC scans all snapshot manifests before deleting any pack. Packs referenced by a snapshot are kept alive even if they're no longer in the current manifest. Deleting a snapshot unpins its exclusive packs — they become eligible for GC after the grace period (default 24h).
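
The pinning rule can be modeled in a few lines. This is a simplified sketch, not the GlideFS GC: manifests are reduced to sets of pack ids, and it assumes the grace period is measured from the time a pack became unreferenced, which the text above does not specify. All names are hypothetical.

```python
GRACE_SECS = 24 * 3600  # default grace period of 24h, as stated above


def gc_eligible(packs, current_manifest, snapshot_manifests,
                unreferenced_since, now, grace_secs=GRACE_SECS):
    """Return the pack ids that may be deleted.

    A pack is kept if the current manifest or any snapshot references it.
    Unreferenced packs become eligible once the grace period has elapsed.
    """
    live = set(current_manifest)
    for manifest in snapshot_manifests:
        live |= set(manifest)  # every snapshot pins its packs
    return {
        p for p in packs
        if p not in live
        # Packs with no recorded timestamp are treated as just unreferenced.
        and now - unreferenced_since.get(p, now) >= grace_secs
    }


if __name__ == "__main__":
    packs = {1, 2, 3, 4}
    since = {3: 0, 4: 90_000}
    # Pack 2 is only in a snapshot, so it stays pinned.
    print(gc_eligible(packs, {1}, [{2}], since, now=100_000))
    # After deleting that snapshot, pack 2 still waits out the grace period.
    print(gc_eligible(packs, {1}, [], {**since, 2: 100_000}, now=100_000))
```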

@@ -425,7 +427,7 @@ curl -X DELETE localhost:8080/api/exports/my-vm

# 2. Fork from the target snapshot into the same name
curl -X PUT localhost:8080/api/exports/my-vm \
-d '{"size_gb": 500, "manifest_name": "my-vm", "snapshot_sequence": 5}'
-d '{"size_gb": 500, "from": "snapshot:my-vm@5"}'
```

No data is copied — the new export reads parent blocks from the existing S3 packs via the CoW overlay.
@@ -434,7 +436,7 @@ No data is copied — the new export reads parent blocks from the existing S3 pa

```sh
curl -X PUT localhost:8080/api/exports/my-vm-rollback \
-d '{"size_gb": 500, "manifest_name": "my-vm", "snapshot_sequence": 5}'
-d '{"size_gb": 500, "from": "snapshot:my-vm@5"}'
# verify my-vm-rollback, then swap at the load balancer
```
