perf(cache): fold stable guidance into the cached core #48

Merged
QodeXcli merged 1 commit into main from perf/enlarge-cached-core on Jun 30, 2026
@QodeXcli (Owner)

Why

Follow-up to the static/volatile split (#47), and the cleaner half of the "volatile-after-history" idea. #47 placed the cache boundary right after the base prompt — so session-stable injections (the code-style profile, failure lessons) landed in the uncached volatile tail and got re-billed every turn, even though they're byte-identical across the whole session.

What

Injections now route into two buffers and the cache boundary lands between them:

  • stableTail — code-style profile + failure lessons (byte-identical across turns) → folded into the cached core → a cache hit for the whole session.
  • volatileTail — auto-retrieval, dep-graph, episodic recall (genuinely query-dependent) → stays after the boundary, uncached.
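
A minimal sketch of the two-buffer routing described above. Only `stableTail` and `volatileTail` come from this PR; the `Injection` and `PromptBlock` shapes, the kind names, and `buildSystemPrompt` are hypothetical stand-ins for illustration, not QodeX's actual code.

```typescript
// Hypothetical shapes: only stableTail / volatileTail are named in the PR.
type InjectionKind =
  | "codeStyle"
  | "failureLessons"
  | "autoRetrieval"
  | "depGraph"
  | "episodicRecall";

interface Injection {
  kind: InjectionKind;
  text: string;
}

interface PromptBlock {
  text: string;
  cached: boolean; // true = sits before the cache boundary
}

// Session-stable kinds are byte-identical across turns, so they can be cached.
const STABLE_KINDS: ReadonlySet<InjectionKind> = new Set<InjectionKind>([
  "codeStyle",
  "failureLessons",
]);

function buildSystemPrompt(basePrompt: string, injections: Injection[]): PromptBlock[] {
  const stableTail: string[] = [];
  const volatileTail: string[] = [];

  // Route each injection by kind, preserving arrival order within each buffer.
  for (const inj of injections) {
    (STABLE_KINDS.has(inj.kind) ? stableTail : volatileTail).push(inj.text);
  }

  // Cached core = base prompt + stable guidance; the boundary falls right after it.
  const blocks: PromptBlock[] = [
    { text: [basePrompt, ...stableTail].join("\n\n"), cached: true },
  ];
  if (volatileTail.length > 0) {
    blocks.push({ text: volatileTail.join("\n\n"), cached: false });
  }
  return blocks;
}
```

In this sketch the first block stays the same from turn to turn as long as the base prompt and stable injections do not change, which is the property a prefix cache needs. The stable injections must also arrive in a consistent order, because a reordering would change the cached bytes.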

Why this is the right shape (not "volatile-after-history")

Moving volatile context into the message history would have bloated it — the system prompt is regenerated each turn and never persisted, but messages are. So volatile stays in the (regenerated) system prompt; we just enlarge the cached portion to include the stable guidance. No bloat, no behavior change beyond guidance now preceding per-turn context.

Bonus: a larger byte-stable prefix also helps local backends. The Ollama/llama.cpp KV prefix cache hits more often across turns (a local "turbo cache"), which sets up the MoE/offload work next.
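
To illustrate why ordering matters for a prefix cache, here is a rough character-level proxy. Real engines compare tokens, not characters, and `sharedPrefixLength` is a hypothetical helper. The "earlier" layout below assumes per-turn context came before the guidance, which the PR implies but does not spell out.

```typescript
// Length (in UTF-16 code units) of the leading run two prompts have in common.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

const base = "BASE PROMPT\n";
const guidance = "STYLE: 2-space indent\n";
const retrievedTurn1 = "RETRIEVED: a.ts\n";
const retrievedTurn2 = "RETRIEVED: b.ts\n";

// Assumed earlier layout: per-turn context first, so the prefix diverges early.
const sharedBefore = sharedPrefixLength(
  base + retrievedTurn1 + guidance,
  base + retrievedTurn2 + guidance,
);

// Layout after this PR: stable guidance precedes per-turn context.
const sharedAfter = sharedPrefixLength(
  base + guidance + retrievedTurn1,
  base + guidance + retrievedTurn2,
);

console.log({ sharedBefore, sharedAfter });
```

With guidance ahead of the per-turn context, the shared prefix grows by the full length of the guidance, so a prefix-matching cache has more to reuse on each turn.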

Safety

Pure content regrouping; the failure-lessons taskKey side effect is preserved. Full suite 1339 green, tsc clean. The cache-block split itself is covered by the #47 tests.

…turn hit)

Follow-up to the static/volatile split (#47). #47 put the boundary right after the base
prompt, so ALL injections — including session-STABLE ones (code-style profile, failure
lessons) — landed in the uncached volatile tail and re-billed every turn.

Now injections route into two buffers:
  - stableTail (code style, failure lessons) — byte-identical across turns → folded INTO the
    cached core, so they're a cache HIT for the whole session.
  - volatileTail (auto-retrieval, dep-graph, episodic recall) — genuinely query-dependent →
    stays after the boundary, uncached.

The cache boundary now lands between them. This is pure content regrouping with no message
bloat: volatile context stays in the regenerated system prompt and is never persisted to
history; guidance simply precedes per-turn context now. It also helps LOCAL backends: a larger
byte-stable prefix means the Ollama/llama.cpp KV prefix cache hits more often across turns
(local "turbo cache"). Full suite 1339 green; tsc clean. The failure-lessons taskKey side
effect is preserved.
@QodeXcli QodeXcli merged commit b9db52e into main Jun 30, 2026
2 checks passed
@QodeXcli QodeXcli deleted the perf/enlarge-cached-core branch June 30, 2026 02:31
QodeXcli added a commit that referenced this pull request Jun 30, 2026
The "real gap" the user named — running large MoE coders on limited VRAM. QodeX already
forwards providers.ollama.options verbatim, so `num_gpu` (layers on GPU; rest on CPU) Just
Works; this makes it usable and documented instead of guesswork.

  - src/llm/offload.ts (PURE): suggestGpuLayers({modelSizeGB, vramBudgetGB, totalLayers}) →
    a sensible num_gpu (clamped [0,total]) so a 48 GB MoE on a 12 GB GPU keeps ~14/64 layers
    on GPU; describeOffload() renders a one-line summary for the wizard/docs.
  - options typing widened number → number|string|boolean (config + Ollama provider) so ANY
    llama.cpp/Ollama runtime flag passes through, not just numeric ones.
  - README: "Large (MoE) models on limited VRAM + local turbo-cache" recipe. Documents that
    local speed comes from keep_alive (model+KV warm) + QodeX's byte-stable prompt prefix
    (the #44#48 cache work) → the engine's KV PREFIX cache hits instead of re-prefilling
    every turn — the local counterpart to Anthropic prompt caching.

+6 tests (full fit, partial offload, cpu-only fallback, clamping, custom reserve, summaries).
Full suite 1345 green; tsc clean.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
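
A hedged sketch of how a helper like `suggestGpuLayers` could work. The real `src/llm/offload.ts` is not shown here, so everything beyond the three documented inputs is an assumption: equally sized layers, a `reserveGB` parameter (suggested only by the "custom reserve" test name), and its 1.5 GB default, which was picked so the 48 GB / 12 GB / 64-layer example lands on 14.

```typescript
interface OffloadInput {
  modelSizeGB: number;
  vramBudgetGB: number;
  totalLayers: number;
  reserveGB?: number; // hypothetical: VRAM held back for KV cache and buffers
}

function suggestGpuLayers({
  modelSizeGB,
  vramBudgetGB,
  totalLayers,
  reserveGB = 1.5, // assumed default, not taken from the source
}: OffloadInput): number {
  if (totalLayers <= 0 || modelSizeGB <= 0) return 0;

  // Assumption: every layer costs the same amount of memory.
  const perLayerGB = modelSizeGB / totalLayers;
  const usableGB = vramBudgetGB - reserveGB;
  const fit = Math.floor(usableGB / perLayerGB);

  // Clamp to [0, totalLayers]: 0 means CPU only, totalLayers means a full fit.
  return Math.min(totalLayers, Math.max(0, fit));
}

// The widened option typing lets numeric, string, and boolean flags pass through.
const ollamaOptions: Record<string, number | string | boolean> = {
  num_gpu: suggestGpuLayers({ modelSizeGB: 48, vramBudgetGB: 12, totalLayers: 64 }),
};

console.log(ollamaOptions);
```

Under these assumptions, the resulting object is what would go under `providers.ollama.options`, which QodeX forwards verbatim. MoE layers are not uniform in practice, so a per-layer average like this is only a starting estimate.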