perf(cache): fold stable guidance into the cached core #48
Merged
…turn hit)

Follow-up to the static/volatile split (#47). #47 put the boundary right after the base prompt, so ALL injections, including session-STABLE ones (code-style profile, failure lessons), landed in the uncached volatile tail and were re-billed every turn.

Now injections route into two buffers:
- stableTail (code style, failure lessons): byte-identical across turns → folded INTO the cached core, so they're a cache HIT for the whole session.
- volatileTail (auto-retrieval, dep-graph, episodic recall): genuinely query-dependent → stays after the boundary, uncached.

The cache boundary now lands between them. Pure content regrouping (no message bloat: volatile stays in the regenerated system prompt, never persisted to history); guidance simply precedes per-turn context now.

Also helps LOCAL backends: a larger byte-stable prefix means the Ollama/llama.cpp KV prefix cache hits more often across turns (local "turbo cache").

Full suite 1339 green; tsc clean. The failure-lessons taskKey side effect is preserved.
QodeXcli added a commit that referenced this pull request on Jun 30, 2026:
The "real gap" the user named — running large MoE coders on limited VRAM. QodeX already
forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU) Just
Works; this makes it usable and documented instead of guesswork.
- src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) →
a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers
on GPU; describeOffload() renders a one-line summary for the wizard/docs.
- options typing widened number → number|string|bool (config + Ollama provider) so ANY
llama.cpp/Ollama runtime flag passes through, not just numeric ones.
- README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that
local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix
(the #44–#48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling
every turn — the local counterpart to Anthropic prompt caching.
+6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries).
Full suite 1345 green; tsc clean.
Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
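The layer suggestion described in the commit above can be sketched as follows. This is a minimal illustration, not the actual contents of src/llm/offload.ts: the `reserveGB` parameter, its 1.5 GB default, the equal-cost-per-layer assumption, and the `describeOffload` signature are all hypothetical, chosen only so that the 48 GB / 12 GB / 64-layer example from the commit message comes out at 14 layers.

```typescript
// Hypothetical sketch of a pure layer-offload heuristic.
interface OffloadInput {
  modelSizeGB: number;  // total size of the model weights
  vramBudgetGB: number; // VRAM available on the GPU
  totalLayers: number;  // number of layers in the model
  reserveGB?: number;   // ASSUMPTION: headroom kept free for KV cache and buffers
}

function suggestGpuLayers(input: OffloadInput): number {
  const { modelSizeGB, vramBudgetGB, totalLayers, reserveGB = 1.5 } = input;
  if (modelSizeGB <= 0 || totalLayers <= 0) return 0;
  const usableGB = vramBudgetGB - reserveGB;
  // ASSUMPTION: every layer costs the same share of the model size.
  const layers = Math.floor((usableGB / modelSizeGB) * totalLayers);
  // Clamp to [0, totalLayers]: 0 means CPU only, totalLayers means a full fit.
  return Math.min(totalLayers, Math.max(0, layers));
}

// Hypothetical one-line summary for a wizard or docs page.
function describeOffload(numGpu: number, totalLayers: number): string {
  if (numGpu >= totalLayers) return `all ${totalLayers} layers on GPU`;
  if (numGpu <= 0) return `CPU only (0/${totalLayers} layers on GPU)`;
  return `${numGpu}/${totalLayers} layers on GPU, ${totalLayers - numGpu} on CPU`;
}

const numGpu = suggestGpuLayers({ modelSizeGB: 48, vramBudgetGB: 12, totalLayers: 64 });
console.log(describeOffload(numGpu, 64));
```

The returned value is meant to be passed through as `num_gpu` in the Ollama options; real per-layer memory use varies by model and quantization, so treat the result as a starting point to tune.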
Why
Follow-up to the static/volatile split (#47), and the cleaner half of the "volatile-after-history" idea. #47 placed the cache boundary right after the base prompt — so session-stable injections (the code-style profile, failure lessons) landed in the uncached volatile tail and got re-billed every turn, even though they're byte-identical across the whole session.
What
Injections now route into two buffers and the cache boundary lands between them:
- `stableTail`: code-style profile + failure lessons (byte-identical across turns) → folded into the cached core → a cache hit for the whole session.
- `volatileTail`: auto-retrieval, dep-graph, episodic recall (genuinely query-dependent) → stays after the boundary, uncached.
Why this is the right shape (not "volatile-after-history")
Moving volatile context into the message history would have bloated it — the system prompt is regenerated each turn and never persisted, but messages are. So volatile stays in the (regenerated) system prompt; we just enlarge the cached portion to include the stable guidance. No bloat, no behavior change beyond guidance now preceding per-turn context.
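As a rough sketch of the two-buffer routing, assuming a hypothetical `Injection` type and `buildSystemBlocks` helper (neither is taken from the QodeX source): stable injections are appended to the base prompt inside one cached block, and volatile ones go into a second, unmarked block. The `cache_control: { type: "ephemeral" }` marker follows the Anthropic Messages API convention for system text blocks, where the cached prefix ends at the marked block.

```typescript
type Stability = "stable" | "volatile";

// Hypothetical shape for one injected piece of context.
interface Injection {
  name: string;
  stability: Stability;
  text: string;
}

// System text block as accepted by the Anthropic Messages API.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

function buildSystemBlocks(basePrompt: string, injections: Injection[]): SystemBlock[] {
  const stableTail: string[] = [];
  const volatileTail: string[] = [];
  for (const inj of injections) {
    (inj.stability === "stable" ? stableTail : volatileTail).push(inj.text);
  }
  // Cached core = base prompt + stable guidance; the cache boundary sits at its end.
  const cachedCore = [basePrompt, ...stableTail].join("\n\n");
  const blocks: SystemBlock[] = [
    { type: "text", text: cachedCore, cache_control: { type: "ephemeral" } },
  ];
  // The volatile tail is regenerated each turn and stays after the boundary, uncached.
  if (volatileTail.length > 0) {
    blocks.push({ type: "text", text: volatileTail.join("\n\n") });
  }
  return blocks;
}
```

Because the first block depends only on the base prompt and the stable injections, it stays byte-identical from turn to turn as long as those inputs do, which is what lets the cache hit.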
Bonus: a larger byte-stable prefix also helps local backends — Ollama/llama.cpp KV prefix-cache hits more across turns (local "turbo cache"), which sets up the MoE/offload work next.
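To illustrate why ordering matters for a local prefix cache, the sketch below measures how much of the prompt two consecutive turns share from the start. It assumes, purely for illustration, an earlier layout where guidance followed the per-turn context. `sharedPrefixLength` is a hypothetical helper that counts UTF-16 code units; real engines match on tokens, so this is only a proxy.

```typescript
// Length of the common prefix of two prompts, in UTF-16 code units.
function sharedPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

const base = "BASE\n";
const stable = "STYLE\n";        // stands in for code style + failure lessons
const turn1 = "A-context\n";     // stands in for per-turn retrieval
const turn2 = "B-context\n";

// Illustrative old order: per-turn context first, guidance after it.
const oldShared = sharedPrefixLength(base + turn1 + stable, base + turn2 + stable);
// New order: guidance precedes per-turn context.
const newShared = sharedPrefixLength(base + stable + turn1, base + stable + turn2);

console.log({ oldShared, newShared });
```

In the new order the shared prefix covers the base prompt plus the stable guidance, so a prefix cache can reuse more work before it has to prefill the per-turn tail.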
Safety
Pure content regrouping; the failure-lessons `taskKey` side effect is preserved. Full suite 1339 green, `tsc` clean. The cache-block split itself is covered by the #47 tests.