feat(local): MoE/VRAM offloading helper + local turbo-cache docs #49
Merged
Addresses the "real gap" the user named: running large MoE coders on limited VRAM. QodeX
already forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU)
just works; this change makes it usable and documented instead of guesswork.
- src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) →
a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers
on GPU; describeOffload() renders a one-line summary for the wizard/docs.
- options typing widened number → number|string|boolean (config + Ollama provider) so ANY
llama.cpp/Ollama runtime flag passes through, not just numeric ones.
- README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that
local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix
(the #44–#48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling
every turn — the local counterpart to Anthropic prompt caching.
+6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries).
Full suite 1345 green; tsc clean.
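
The layer arithmetic behind the 48 GB / 12 GB / 64-layer example can be sketched as below. This is a minimal illustration and not the contents of src/llm/offload.ts: the even per-layer split, the 1 GB default for `reserveGB`, and the `describeOffload(numGpu, totalLayers)` signature and wording are all assumptions.

```typescript
// Hypothetical sketch; the real helper may weigh layers differently.
interface OffloadInput {
  modelSizeGB: number;  // size of the model weights
  vramBudgetGB: number; // VRAM available on the GPU
  totalLayers: number;  // number of transformer blocks
  reserveGB?: number;   // VRAM held back for KV cache/buffers (assumed default: 1)
}

function suggestGpuLayers(input: OffloadInput): number {
  const { modelSizeGB, vramBudgetGB, totalLayers, reserveGB = 1 } = input;
  if (!(modelSizeGB > 0) || !(totalLayers > 0)) return 0;

  // Assume every layer costs the same share of the model size.
  const perLayerGB = modelSizeGB / totalLayers;
  const usableGB = vramBudgetGB - reserveGB;
  const fit = Math.floor(usableGB / perLayerGB);
  if (isNaN(fit)) return 0;

  // Clamp to [0, totalLayers]: 0 means CPU only, totalLayers means full fit.
  return Math.max(0, Math.min(totalLayers, fit));
}

// Hypothetical signature and wording for the one-line summary.
function describeOffload(numGpu: number, totalLayers: number): string {
  if (numGpu >= totalLayers) return `all ${totalLayers} layers on GPU`;
  if (numGpu <= 0) return `CPU only (0/${totalLayers} layers on GPU)`;
  return `${numGpu}/${totalLayers} layers on GPU, ${totalLayers - numGpu} on CPU`;
}

const layers = suggestGpuLayers({ modelSizeGB: 48, vramBudgetGB: 12, totalLayers: 64 });
console.log(describeOffload(layers, 64));
```

Under these assumptions each layer costs 0.75 GB, so 11 GB of usable VRAM holds 14 layers, matching the ~14/64 figure above.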
QodeXcli added a commit that referenced this pull request on Jun 30, 2026
…gpu (#50)

The follow-up to #49: stop making users guess num_gpu. `qodex offload` detects the VRAM
budget and the model's facts, runs them through suggestGpuLayers, and prints (or --apply
writes) a num_gpu.
- src/setup/offload-detect.ts:
  • PURE parsers (unit-tested): parseNvidiaSmiVram (MiB→GB, biggest GPU), parseMacMemGB
    (hw.memsize→GiB), extractBlockCount (arch-prefixed `<arch>.block_count` from Ollama
    /api/show model_info).
  • Best-effort detectors: detectVramGB (nvidia-smi → Apple unified memory ×0.7 → null),
    fetchOllamaModelFacts (/api/show block_count + /api/tags size), planOffload orchestrator.
    Everything degrades to null and never throws: no GPU / daemon down / odd platform just
    means "couldn't auto-detect, set it manually".
- `qodex offload [--model <id>] [--vram <gb>] [--apply]`: prints the plan; --apply writes
  providers.ollama.options.num_gpu to ~/.qodex/config.yaml. Ollama-only (LM Studio reports a
  clear unsupported message). Smoke-tested: with no daemon it exits with a clear note, no crash.
- README: the MoE recipe now leads with `qodex offload`.
+6 tests (parsers); full suite 1351 green; tsc clean.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
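
To illustrate the "pure parsers, degrade to null" pattern described in this commit, here is a hedged sketch. The function names come from the commit message, but the bodies, signatures, and input formats are assumptions: the VRAM parser assumes one MiB figure per line (as printed by `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`) and divides by 1024, and `pickVramGB` is a hypothetical stand-in for the fallback order in detectVramGB.

```typescript
// Hypothetical sketches; not the contents of src/setup/offload-detect.ts.

// Assumed input: one total-memory value in MiB per line, one line per GPU.
function parseNvidiaSmiVram(stdout: string): number | null {
  const mib = stdout
    .split(/\r?\n/)
    .map((line) => parseInt(line.trim(), 10))
    .filter((n) => isFinite(n) && n > 0);
  if (mib.length === 0) return null;
  return Math.max(...mib) / 1024; // biggest GPU, MiB converted by dividing by 1024
}

// Assumed input: `sysctl hw.memsize` output in bytes, with or without the label.
function parseMacMemGB(stdout: string): number | null {
  const match = /(\d+)\s*$/.exec(stdout.trim());
  if (!match) return null;
  const bytes = Number(match[1]);
  return bytes > 0 ? bytes / (1024 * 1024 * 1024) : null;
}

// Finds the first "<arch>.block_count" key in the model_info object from /api/show.
function extractBlockCount(
  modelInfo: Record<string, unknown> | null | undefined,
): number | null {
  if (!modelInfo) return null;
  for (const key of Object.keys(modelInfo)) {
    const value = modelInfo[key];
    if (/\.block_count$/.test(key) && typeof value === "number" && value > 0) {
      return value;
    }
  }
  return null;
}

// Fallback order from the commit message: NVIDIA VRAM, then 70% of Apple
// unified memory, then null ("couldn't auto-detect, set it manually").
function pickVramGB(nvidiaGB: number | null, macMemGB: number | null): number | null {
  if (nvidiaGB !== null) return nvidiaGB;
  if (macMemGB !== null) return macMemGB * 0.7;
  return null;
}

console.log(parseNvidiaSmiVram("8192\n24576\n"));
console.log(extractBlockCount({ "general.architecture": "llama", "llama.block_count": 32 }));
```

Keeping the parsers free of process and network calls is what makes them unit-testable; only the thin detector layer would need to shell out or call the daemon.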
Why
The inference-engine gap the user named: running large MoE coders (Qwen3-Coder-MoE, DeepSeek-MoE) on limited VRAM, plus a local "turbo cache." QodeX already forwards `providers.ollama.options` verbatim, so `num_gpu` (layers on GPU, the rest on CPU) Just Works, but it was undocumented guesswork. This makes it usable.
What
- `src/llm/offload.ts` (PURE): `suggestGpuLayers({ modelSizeGB, vramBudgetGB, totalLayers, reserveGB? })` → a sensible `num_gpu`, clamped `[0, total]`. E.g. a 48 GB MoE on a 12 GB GPU → keep ~14/64 layers on GPU. `describeOffload()` renders a one-line summary for `qodex setup` / docs.
- `options` typing: `number` → `number | string | boolean` (config + Ollama provider), so any llama.cpp/Ollama runtime flag passes through, not just numeric ones.
- … `keep_alive` keeps the model + its KV cache warm (no cold reload), and …
Honest scope
The actual offloading is done by the local engine (Ollama/llama.cpp); QodeX's job is to expose, compute, and document the knobs, which is what this does. Auto-detecting VRAM/model-layers to set `num_gpu` automatically is a future `qodex setup` step; `suggestGpuLayers` is the pure core it would call.
Tests
+6 (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summary strings). Full suite 1345 green, `tsc` clean.
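
As a rough illustration of how the widened `options` type and `keep_alive` fit together, the sketch below builds a request body for Ollama's `/api/chat` endpoint. `buildChatBody` and `sharedPrefixLength` are hypothetical helpers, not QodeX code, and the option values are examples only. The second helper shows the property the prefix cache relies on: if each turn's prompt starts with the previous turn's prompt unchanged, the engine has a shared prefix it can reuse.

```typescript
// Widened option value type, as described in the "What" list.
type OllamaOptionValue = number | string | boolean;
type OllamaOptions = Record<string, OllamaOptionValue>;

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: forwards options verbatim and sets keep_alive so the
// model stays loaded between turns.
function buildChatBody(
  model: string,
  messages: ChatMessage[],
  options: OllamaOptions,
  keepAlive: string,
) {
  return { model, messages, stream: false, keep_alive: keepAlive, options: { ...options } };
}

// Hypothetical check: length of the common prefix of two prompts, counted in
// UTF-16 code units (an approximation of byte stability).
function sharedPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

const body = buildChatBody(
  "example-moe-coder", // placeholder model id
  [{ role: "user", content: "hello" }],
  { num_gpu: 14, num_ctx: 8192, use_mmap: true },
  "30m",
);
console.log(JSON.stringify(body));
```

A prompt builder that reorders or rewrites earlier turns would shrink the shared prefix, which is why prefix stability matters for this kind of cache.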