feat(local): MoE/VRAM offloading helper + local turbo-cache docs by QodeXcli · Pull Request #49 · QodeXcli/QodeX

QodeXcli · 2026-06-30T02:39:30Z

Why

The inference-engine gap the user named: running large MoE coders (Qwen3-Coder-MoE, DeepSeek-MoE) on limited VRAM, plus a local "turbo cache." QodeX already forwards providers.ollama.options verbatim — so num_gpu (layers on GPU, the rest on CPU) Just Works — but it was undocumented guesswork. This makes it usable.

What

src/llm/offload.ts (PURE) — suggestGpuLayers({ modelSizeGB, vramBudgetGB, totalLayers, reserveGB? }) → a sensible num_gpu, clamped [0, total]. E.g. a 48 GB MoE on a 12 GB GPU → keep ~14/64 layers on GPU. describeOffload() renders a one-line summary for qodex setup / docs.
Wider options typing — number → number | string | boolean (config + Ollama provider), so any llama.cpp/Ollama runtime flag passes through, not just numeric ones.
README recipe — "Large (MoE) models on limited VRAM + local turbo-cache." Documents that local speed comes from two things working together:
1. keep_alive keeps the model + its KV cache warm (no cold reload), and
2. QodeX's byte-stable prompt prefix (the perf(cache): hierarchical prompt cache — stop re-billing the conversation prefix #44–perf(cache): fold stable guidance into the cached core #48 hierarchical-cache work) means the engine's KV prefix cache hits instead of re-prefilling the whole context each turn — the local counterpart to Anthropic prompt caching.

Honest scope

The actual offloading is done by the local engine (Ollama/llama.cpp); QodeX's job is to expose, compute, and document the knobs — which is what this does. Auto-detecting VRAM/model-layers to set num_gpu automatically is a future qodex setup step; suggestGpuLayers is the pure core it would call.

Tests

+6 (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summary strings). Full suite 1345 green, tsc clean.

The "real gap" the user named — running large MoE coders on limited VRAM. QodeX already forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU) Just Works; this makes it usable and documented instead of guesswork. - src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) → a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers on GPU; describeOffload() renders a one-line summary for the wizard/docs. - options typing widened number → number|string|bool (config + Ollama provider) so ANY llama.cpp/Ollama runtime flag passes through, not just numeric ones. - README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix (the #44–#48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling every turn — the local counterpart to Anthropic prompt caching. +6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries). Full suite 1345 green; tsc clean.

…gpu (#50) The follow-up to #49: stop making users guess num_gpu. `qodex offload` detects the VRAM budget and the model's facts, runs them through suggestGpuLayers, and prints (or --apply writes) a num_gpu. - src/setup/offload-detect.ts: • PURE parsers (unit-tested): parseNvidiaSmiVram (MiB→GB, biggest GPU), parseMacMemGB (hw.memsize→GiB), extractBlockCount (arch-prefixed `<arch>.block_count` from Ollama /api/show model_info). • Best-effort detectors: detectVramGB (nvidia-smi → Apple unified memory ×0.7 → null), fetchOllamaModelFacts (/api/show block_count + /api/tags size), planOffload orchestrator. Everything degrades to null and never throws — no GPU / daemon down / odd platform just means "couldn't auto-detect, set it manually". - `qodex offload [--model <id>] [--vram <gb>] [--apply]` — prints the plan; --apply writes providers.ollama.options.num_gpu to ~/.qodex/config.yaml. Ollama-only (LM Studio reports a clear unsupported message). Smoke-tested: with no daemon it exits with a clear note, no crash. - README: the MoE recipe now leads with `qodex offload`. +6 tests (parsers); full suite 1351 green; tsc clean. Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>

QodeXcli merged commit dcda8a7 into main Jun 30, 2026
2 checks passed

QodeXcli deleted the feat/moe-offload branch June 30, 2026 02:39

QodeXcli mentioned this pull request Jun 30, 2026

feat(local): qodex offload — auto-detect VRAM + model, suggest num_gpu #50

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(local): MoE/VRAM offloading helper + local turbo-cache docs#49

feat(local): MoE/VRAM offloading helper + local turbo-cache docs#49
QodeXcli merged 1 commit into
mainfrom
feat/moe-offload

QodeXcli commented Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

QodeXcli commented Jun 30, 2026

Why

What

Honest scope

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant