feat(local): MoE/VRAM offloading helper + local turbo-cache docs #49

Merged: QodeXcli merged 1 commit into main from feat/moe-offload on Jun 30, 2026
@QodeXcli (Owner)

Why

The inference-engine gap the user named: running large MoE coders (Qwen3-Coder-MoE, DeepSeek-MoE) on limited VRAM, plus a local "turbo cache." QodeX already forwards providers.ollama.options verbatim — so num_gpu (layers on GPU, the rest on CPU) Just Works — but it was undocumented guesswork. This makes it usable.

What

  • src/llm/offload.ts (PURE): suggestGpuLayers({ modelSizeGB, vramBudgetGB, totalLayers, reserveGB? }) → a sensible num_gpu, clamped to [0, totalLayers]. E.g. a 48 GB MoE on a 12 GB GPU → keep ~14/64 layers on GPU. describeOffload() renders a one-line summary for qodex setup / docs.
  • Wider options typing: number → number | string | boolean (config + Ollama provider), so any llama.cpp/Ollama runtime flag passes through, not just numeric ones.
  • README recipe — "Large (MoE) models on limited VRAM + local turbo-cache." Documents that local speed comes from two things working together:
    1. keep_alive keeps the model + its KV cache warm (no cold reload), and
    2. QodeX's byte-stable prompt prefix (the hierarchical-cache work in #44 and #48) means the engine's KV prefix cache hits instead of re-prefilling the whole context each turn. This is the local counterpart to Anthropic prompt caching.
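
For readers who want to see the arithmetic behind the "~14/64 layers" figure, here is a minimal sketch of how a helper like suggestGpuLayers could work. It is not the code in src/llm/offload.ts. It assumes every layer costs the same share of the model size, and it assumes a default reserve of 1.5 GB, a value picked only because it reproduces the 48 GB / 12 GB / 64-layer example above. The describeOffload signature shown here is also hypothetical.

```typescript
// Hypothetical sketch, not the actual src/llm/offload.ts.
interface OffloadInput {
  modelSizeGB: number;   // size of the model weights
  vramBudgetGB: number;  // VRAM available on the GPU
  totalLayers: number;   // number of transformer blocks in the model
  reserveGB?: number;    // headroom kept free (KV cache, runtime overhead)
}

// ASSUMPTION: 1.5 GB is not confirmed by the PR; it matches the ~14/64 example.
const ASSUMED_DEFAULT_RESERVE_GB = 1.5;

function suggestGpuLayers(input: OffloadInput): number {
  const { modelSizeGB, vramBudgetGB, totalLayers } = input;
  const reserveGB = input.reserveGB ?? ASSUMED_DEFAULT_RESERVE_GB;
  if (totalLayers <= 0 || modelSizeGB <= 0) return 0;

  // Simplification: treat all layers as equally sized.
  const perLayerGB = modelSizeGB / totalLayers;
  const usableGB = vramBudgetGB - reserveGB;
  const fit = Math.floor(usableGB / perLayerGB);
  if (!Number.isFinite(fit)) return 0;

  // Clamp to [0, totalLayers]: 0 means CPU only, totalLayers means full fit.
  return Math.min(totalLayers, Math.max(0, fit));
}

// Hypothetical signature: renders a one-line summary of the split.
function describeOffload(numGpu: number, totalLayers: number): string {
  if (numGpu >= totalLayers) return `all ${totalLayers} layers on GPU`;
  if (numGpu <= 0) return `CPU only (0/${totalLayers} layers on GPU)`;
  return `${numGpu}/${totalLayers} layers on GPU, ${totalLayers - numGpu} on CPU`;
}

const layers = suggestGpuLayers({ modelSizeGB: 48, vramBudgetGB: 12, totalLayers: 64 });
console.log(describeOffload(layers, 64));
```

With these assumptions each layer costs 0.75 GB, 10.5 GB is usable after the reserve, and 14 layers fit. Real MoE layers are not uniform in size, so the result is a starting point to tune, not a guarantee.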

Honest scope

The actual offloading is done by the local engine (Ollama/llama.cpp); QodeX's job is to expose, compute, and document the knobs — which is what this does. Auto-detecting VRAM/model-layers to set num_gpu automatically is a future qodex setup step; suggestGpuLayers is the pure core it would call.
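
To illustrate what "expose the knobs" means in practice, the sketch below shows a widened option type being copied verbatim into an Ollama-style chat request body. The type names, the buildChatRequest helper, and the model id are hypothetical and are not taken from the QodeX source. The options field, num_gpu, and use_mmap are Ollama request options.

```typescript
// Widened value type as described in this PR: not just numbers.
type OptionValue = number | string | boolean;
type EngineOptions = Record<string, OptionValue>;

interface ChatMessage {
  role: string;
  content: string;
}

interface ChatRequestBody {
  model: string;
  messages: ChatMessage[];
  options: EngineOptions;
}

// Hypothetical helper: forwards user-configured options untouched,
// so the local engine (not the client) decides what each flag means.
function buildChatRequest(
  model: string,
  messages: ChatMessage[],
  options: EngineOptions,
): ChatRequestBody {
  return { model, messages, options: { ...options } };
}

const body = buildChatRequest(
  "example-moe-model", // placeholder model id
  [{ role: "user", content: "hello" }],
  { num_gpu: 14, use_mmap: true }, // numeric and boolean flags both pass through
);
console.log(JSON.stringify(body.options));
```

Because the client does not validate or rename keys, a mistyped flag is passed along as-is and it is up to the engine to reject or ignore it.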

Tests

+6 (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summary strings). Full suite 1345 green, tsc clean.

The "real gap" the user named — running large MoE coders on limited VRAM. QodeX already
forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU) Just
Works; this makes it usable and documented instead of guesswork.

  - src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) →
    a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers
    on GPU; describeOffload() renders a one-line summary for the wizard/docs.
  - options typing widened number → number|string|bool (config + Ollama provider) so ANY
    llama.cpp/Ollama runtime flag passes through, not just numeric ones.
  - README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that
    local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix
    (the #44 and #48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling
    every turn — the local counterpart to Anthropic prompt caching.

+6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries).
Full suite 1345 green; tsc clean.

QodeXcli merged commit dcda8a7 into main on Jun 30, 2026
2 checks passed
QodeXcli deleted the feat/moe-offload branch on June 30, 2026 02:39
QodeXcli added a commit that referenced this pull request Jun 30, 2026
…gpu (#50)

The follow-up to #49: stop making users guess num_gpu. `qodex offload` detects the VRAM
budget and the model's facts, runs them through suggestGpuLayers, and prints (or --apply
writes) a num_gpu.

  - src/setup/offload-detect.ts:
      • PURE parsers (unit-tested): parseNvidiaSmiVram (MiB→GB, biggest GPU), parseMacMemGB
        (hw.memsize→GiB), extractBlockCount (arch-prefixed `<arch>.block_count` from Ollama
        /api/show model_info).
      • Best-effort detectors: detectVramGB (nvidia-smi → Apple unified memory ×0.7 → null),
        fetchOllamaModelFacts (/api/show block_count + /api/tags size), planOffload orchestrator.
        Everything degrades to null and never throws — no GPU / daemon down / odd platform just
        means "couldn't auto-detect, set it manually".
  - `qodex offload [--model <id>] [--vram <gb>] [--apply]` — prints the plan; --apply writes
    providers.ollama.options.num_gpu to ~/.qodex/config.yaml. Ollama-only (LM Studio reports a
    clear unsupported message). Smoke-tested: with no daemon it exits with a clear note, no crash.
  - README: the MoE recipe now leads with `qodex offload`.

+6 tests (parsers); full suite 1351 green; tsc clean.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
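
The three pure parsers named in the commit above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of src/setup/offload-detect.ts. It assumes the nvidia-smi text has one memory total in MiB per line (as produced by `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`), that the macOS text comes from `sysctl hw.memsize`, and that model_info is the key/value object returned by Ollama's /api/show. Each function returns null instead of throwing, matching the "degrades to null" behaviour described.

```typescript
// Hypothetical sketches of the pure parsers; real signatures may differ.

// Takes one MiB value per line and returns the largest GPU's memory.
// MiB is divided by 1024 (strictly GiB; the commit message calls it GB).
function parseNvidiaSmiVram(stdout: string): number | null {
  const mib = stdout
    .split(/\r?\n/)
    .map((line) => Number.parseInt(line.trim(), 10))
    .filter((n) => Number.isFinite(n) && n > 0);
  if (mib.length === 0) return null;
  return Math.max(...mib) / 1024;
}

// Accepts "hw.memsize: 17179869184" or a bare byte count; returns GiB.
function parseMacMemGB(stdout: string): number | null {
  const match = stdout.match(/(\d+)\s*$/);
  if (!match) return null;
  const bytes = Number.parseInt(match[1], 10);
  if (!Number.isFinite(bytes) || bytes <= 0) return null;
  return bytes / 1024 ** 3;
}

// Finds the architecture-prefixed "<arch>.block_count" entry in model_info.
function extractBlockCount(modelInfo: Record<string, unknown>): number | null {
  for (const [key, value] of Object.entries(modelInfo)) {
    if (key.endsWith(".block_count") && typeof value === "number" && value > 0) {
      return value;
    }
  }
  return null;
}

console.log(parseNvidiaSmiVram("8192\n24576\n"));
console.log(parseMacMemGB("hw.memsize: 17179869184\n"));
console.log(extractBlockCount({ "general.architecture": "llama", "llama.block_count": 32 }));
```

Keeping these as string-in, value-out functions means they can be unit-tested without a GPU or a running daemon; the process spawning and HTTP calls live in the best-effort detectors.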