perf(cache): fold stable guidance into the cached core by QodeXcli · Pull Request #48 · QodeXcli/QodeX

QodeXcli · 2026-06-30T02:31:49Z

Why

Follow-up to the static/volatile split (#47), and the cleaner half of the "volatile-after-history" idea. #47 placed the cache boundary right after the base prompt — so session-stable injections (the code-style profile, failure lessons) landed in the uncached volatile tail and got re-billed every turn, even though they're byte-identical across the whole session.

What

Injections now route into two buffers and the cache boundary lands between them:

- stableTail: code-style profile + failure lessons (byte-identical across turns) → folded into the cached core → a cache hit for the whole session.
- volatileTail: auto-retrieval, dep-graph, episodic recall (genuinely query-dependent) → stays after the boundary, uncached.
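The two-buffer routing described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Injection` and `SystemBlock` types, the `stable` flag, and the `buildSystemBlocks` name are all hypothetical, and the `cached` boolean merely stands in for however the provider layer marks the cache boundary.

```typescript
// Hypothetical types: QodeX's real injection and block shapes are not shown in this PR.
type Injection = { label: string; text: string; stable: boolean };

interface SystemBlock {
  text: string;
  cached: boolean; // true: part of the byte-stable prefix that ends at the cache boundary
}

function buildSystemBlocks(basePrompt: string, injections: Injection[]): SystemBlock[] {
  const stableTail: string[] = [];   // e.g. code-style profile, failure lessons
  const volatileTail: string[] = []; // e.g. auto-retrieval, dep-graph, episodic recall

  for (const inj of injections) {
    (inj.stable ? stableTail : volatileTail).push(inj.text);
  }

  // The cached core is the base prompt plus everything session-stable.
  const blocks: SystemBlock[] = [
    { text: [basePrompt, ...stableTail].join("\n\n"), cached: true },
  ];

  // Query-dependent context goes after the boundary and is never cached.
  if (volatileTail.length > 0) {
    blocks.push({ text: volatileTail.join("\n\n"), cached: false });
  }
  return blocks;
}
```

Under these assumptions, two turns that differ only in their volatile injections produce an identical first block, which is the property a prefix cache needs in order to hit.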

Why this is the right shape (not "volatile-after-history")

Moving volatile context into the message history would have bloated it — the system prompt is regenerated each turn and never persisted, but messages are. So volatile stays in the (regenerated) system prompt; we just enlarge the cached portion to include the stable guidance. No bloat, no behavior change beyond guidance now preceding per-turn context.
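To illustrate why keeping volatile context in the system prompt avoids bloat, here is a small sketch under assumed names (`Session`, `buildRequest`, and `recordReply` are hypothetical and not taken from the QodeX codebase). The system prompt is rebuilt on every call and discarded afterwards, while only user and assistant messages accumulate in the persisted history.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Hypothetical session wrapper, for illustration only.
class Session {
  readonly history: Message[] = []; // persisted across turns
  private readonly cachedCore: string;

  constructor(cachedCore: string) {
    this.cachedCore = cachedCore; // base prompt + stable guidance
  }

  // The system prompt is regenerated per turn; volatile context never enters history.
  buildRequest(userText: string, volatileContext: string) {
    this.history.push({ role: "user", content: userText });
    return {
      system: [this.cachedCore, volatileContext],
      messages: [...this.history],
    };
  }

  recordReply(text: string): void {
    this.history.push({ role: "assistant", content: text });
  }
}
```

In this sketch, history grows by exactly one user message and one assistant message per turn, regardless of how much per-turn context was retrieved.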

Bonus: a larger byte-stable prefix also helps local backends — Ollama/llama.cpp KV prefix-cache hits more across turns (local "turbo cache"), which sets up the MoE/offload work next.

Safety

Pure content regrouping; the failure-lessons taskKey side effect is preserved. Full suite 1339 green, tsc clean. The cache-block split itself is covered by the #47 tests.

…turn hit)

Follow-up to the static/volatile split (#47). #47 put the boundary right after the base prompt, so ALL injections, including session-STABLE ones (code-style profile, failure lessons), landed in the uncached volatile tail and were re-billed every turn.

Now injections route into two buffers:
- stableTail (code style, failure lessons): byte-identical across turns → folded INTO the cached core, so they're a cache HIT for the whole session.
- volatileTail (auto-retrieval, dep-graph, episodic recall): genuinely query-dependent → stays after the boundary, uncached.

The cache boundary now lands between them. Pure content regrouping (no message bloat: volatile stays in the regenerated system prompt, never persisted to history); guidance simply precedes per-turn context now.

Also helps LOCAL backends: a larger byte-stable prefix means Ollama/llama.cpp KV prefix-cache hits more across turns (local "turbo cache").

Full suite 1339 green; tsc clean. The failure-lessons taskKey side effect is preserved.

The "real gap" the user named: running large MoE coders on limited VRAM. QodeX already forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU) Just Works; this makes it usable and documented instead of guesswork.

- src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) → a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers on GPU; describeOffload() renders a one-line summary for the wizard/docs.
- options typing widened number → number|string|bool (config + Ollama provider) so ANY llama.cpp/Ollama runtime flag passes through, not just numeric ones.
- README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix (the #44–#48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling every turn, the local counterpart to Anthropic prompt caching.

+6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries). Full suite 1345 green; tsc clean.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
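For readers who want a feel for the layer arithmetic, the following is a hedged sketch of what a helper like suggestGpuLayers could look like. It is not the contents of src/llm/offload.ts: the even per-layer size split, the `reserveGB` parameter name, and its 1.5 GB default are assumptions, chosen so that the commit message's example (48 GB model, 12 GB budget, 64 layers) comes out at 14. The `describeOffload` signature and wording here are likewise hypothetical.

```typescript
interface OffloadInput {
  modelSizeGB: number;
  vramBudgetGB: number;
  totalLayers: number;
  reserveGB?: number; // assumed name: VRAM held back for KV cache and runtime overhead
}

// Sketch only: assumes every layer takes an equal share of the model size.
function suggestGpuLayers({
  modelSizeGB,
  vramBudgetGB,
  totalLayers,
  reserveGB = 1.5, // assumed default, not confirmed by the source
}: OffloadInput): number {
  if (modelSizeGB <= 0 || totalLayers <= 0) return 0;
  const perLayerGB = modelSizeGB / totalLayers;
  const usableGB = vramBudgetGB - reserveGB;
  const fit = Math.floor(usableGB / perLayerGB);
  return Math.max(0, Math.min(totalLayers, fit)); // clamp to [0, totalLayers]
}

// Hypothetical one-line summary, loosely modeled on the describeOffload() mention.
function describeOffload(gpuLayers: number, totalLayers: number): string {
  if (gpuLayers >= totalLayers) return `all ${totalLayers} layers on GPU`;
  if (gpuLayers <= 0) return `CPU only (0/${totalLayers} layers on GPU)`;
  return `${gpuLayers}/${totalLayers} layers on GPU, ${totalLayers - gpuLayers} on CPU`;
}

const layers = suggestGpuLayers({ modelSizeGB: 48, vramBudgetGB: 12, totalLayers: 64 });
console.log(describeOffload(layers, 64)); // prints "14/64 layers on GPU, 50 on CPU"
```

The returned value would then be passed as `num_gpu` in the Ollama options that QodeX forwards verbatim.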

QodeXcli merged commit b9db52e into main Jun 30, 2026
2 checks passed

QodeXcli deleted the perf/enlarge-cached-core branch June 30, 2026 02:31

QodeXcli mentioned this pull request Jun 30, 2026

feat(local): MoE/VRAM offloading helper + local turbo-cache docs #49

Merged

QodeXcli merged 1 commit into main from perf/enlarge-cached-core
