PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime by jimhester · Pull Request #2 · jimhester/ggsql

jimhester · 2026-05-29T15:39:34Z

Draft — stacked on posit-dev/ggsql#423 (HybridReader). This PR lives in my fork (jimhester/ggsql) with base = pr2-hybrid-reader so the diff shows only the cache work on top of PR2. Opened in draft per posit-dev/ggsql#423 (comment) so the proposed staging-layer design can be informed by the concrete caching mechanism here. When PR2 merges to posit-dev/ggsql:main, I'll rebase and re-target this PR (or re-open it) against upstream.

Summary

Bundles four follow-ups to PR2 (HybridReader):

- Query-result cache for HybridReader: memoizes (reader_uri, sql) results in the staging DuckDB so visualization iteration
  (tweak DRAW, change a SCALE, re-run) is sub-millisecond on
  cache hits instead of round-tripping to the primary reader.
- Reader::clear_cache() trait method: defaults to Ok(()) so
  readers without a cache (DuckDB, SQLite, ODBC, ADBC) inherit a
  no-op; HybridReader overrides it to drop its cache tables.
- Jupyter -- @uncache meta-command: clears the active reader's
  cache from inside a notebook cell without restarting the kernel.
- Dual Vega-Lite v5 + v6 mime emission in the Jupyter kernel:
  JupyterLab 4.x (built-in v5 renderer) and nteract / newer
  Lab extensions (v6) both display natively without the embedded
  HTML fallback.

Each piece on its own is below the threshold of "worth a PR": the
trait method has no motivation without the cache, and the
meta-command has no motivation without the trait method. They're a
unit. The VL mime change is small and rides along since it ships in
the same ggsql-jupyter/src/display.rs file that the -- @uncache
dispatch lives next to. If reviewers prefer to split the VL mime
change into PR3.5, I'm happy to oblige.

Companion design comment with the broader sequencing context:
#341 (comment).

Motivation

Visualization workflows iterate fast: change DRAW, change a
SCALE, re-run. Each iteration re-runs the SQL (usually identical
text) against the primary reader. For a remote backend (Trino,
Snowflake, anything over Flight SQL) that wastes network round-trips
and time. The cache memoizes results in the staging DuckDB, keyed on
(reader_uri, sql) with TTL and a byte-budget LRU. Hits are
sub-millisecond; misses fall through transparently. Scope is
deliberately narrow to HybridReader — other reader types don't
have a place to store results.

The trait method exists so non-Hybrid readers can be passed to
generic code that calls clear_cache() without per-type
match-and-cast. The Jupyter meta-command exists because the
notebook is the primary place users encounter the cache: when a
remote table changes underneath you, -- @uncache is faster than
restarting the kernel and re-running every preceding cell.

The VL mime change addresses a JupyterLab 4.x reality: its
built-in @jupyterlab/vega5-extension only handles v5, while the
ecosystem (nteract, newer Lab extensions) increasingly emits v6.
Emitting both means the client picks whichever it has, with the
HTML/vega-embed fallback only kicking in when neither native
renderer is present.

Design

Cache (`src/reader/hybrid_cache.rs` + `hybrid.rs` integration)

The cache module is reader-agnostic: SHA-256 over (reader_uri, sql) joined by \n (the separator prevents (ab, c) and (a, bc)
from colliding), truncated to 16 hex chars (64 bits — collision
odds negligible at any realistic cache size), used as the suffix in
the per-query staging table name __ggsql_cache_<hex>. A
single __ggsql_cache_meta__ table tracks
(cache_key, reader_uri, sql, fetched_at, last_accessed, row_count, byte_estimate).
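The key derivation described above can be sketched with the standard library alone. This is a minimal sketch, not the module's code: `DefaultHasher` stands in for the truncated SHA-256 digest (it also yields 64 bits, so the 16-hex-character shape is the same), and the helper names `key_material`, `cache_key`, and `cache_table_name` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Bytes fed to the hash: reader_uri, a newline separator, then the SQL text.
/// The separator keeps ("ab", "c") and ("a", "bc") from producing equal input.
fn key_material(reader_uri: &str, sql: &str) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(reader_uri.len() + 1 + sql.len());
    bytes.extend_from_slice(reader_uri.as_bytes());
    bytes.push(b'\n');
    bytes.extend_from_slice(sql.as_bytes());
    bytes
}

/// 16 hex characters (64 bits).
/// ASSUMPTION: DefaultHasher replaces SHA-256 so the sketch needs no crates.
fn cache_key(reader_uri: &str, sql: &str) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(&key_material(reader_uri, sql));
    format!("{:016x}", hasher.finish())
}

/// Staging table name for one cached query.
fn cache_table_name(key: &str) -> String {
    format!("__ggsql_cache_{}", key)
}

fn main() {
    let key = cache_key("backend://host", "SELECT 1");
    println!("{}", cache_table_name(&key));
}
```

The same (reader_uri, sql) pair always maps to the same table name within one build, which is what lets a repeat query find its staged result.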

HybridReader::execute_sql consults the cache:

1. If the query references a registered name → route to staging
   (PR2 behavior, unchanged).
2. Else if cache disabled → direct passthrough to primary.
3. Else look up the cache key; if present and within TTL, return
   the staged result and touch the last-accessed timestamp.
4. Else fetch from primary, register the result under
   __ggsql_cache_<hex> in staging, insert the meta row,
   then run LRU eviction over max_bytes.
Empty-width DataFrames bypass caching (DuckDB's arrow(...) table
function rejects zero-column schemas).

Defaults: enabled, 300s TTL, 512 MB max. Tunable via
HybridReader::with_cache_config(CacheConfig). Globally disabled
with GGSQL_HYBRID_CACHE_DISABLED=1 (read once at
CacheConfig::default()).
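A configuration type with these defaults might look roughly like the sketch below. The field names follow the PR text (`enabled`, `ttl_secs`, `max_bytes`), but the exact struct layout, the `from_env_value` helper, the reading of "512 MB" as 512 MiB, and the rule that only the value `1` disables the cache are assumptions made for illustration.

```rust
use std::env;

/// ASSUMPTION: field names mirror the PR text; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct CacheConfig {
    enabled: bool,
    ttl_secs: u64,
    max_bytes: u64,
}

impl CacheConfig {
    /// Build the defaults from the raw value of GGSQL_HYBRID_CACHE_DISABLED.
    /// ASSUMPTION: only the exact value "1" disables the cache.
    fn from_env_value(disabled: Option<&str>) -> Self {
        CacheConfig {
            enabled: disabled != Some("1"),
            ttl_secs: 300,
            max_bytes: 512 * 1024 * 1024, // ASSUMPTION: "512 MB" read as MiB
        }
    }
}

impl Default for CacheConfig {
    fn default() -> Self {
        // The environment is read once, when the default is constructed.
        let raw = env::var("GGSQL_HYBRID_CACHE_DISABLED").ok();
        Self::from_env_value(raw.as_deref())
    }
}

fn main() {
    // Struct-update syntax: override one knob, keep the other defaults.
    let short_lived = CacheConfig { ttl_secs: 30, ..CacheConfig::default() };
    println!("{:?}", short_lived);
}
```

Splitting the environment lookup from the value parsing keeps the default testable without mutating process-wide state.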

`Reader::clear_cache()`

fn clear_cache(&self) -> Result<()> {
    Ok(())
}

Trait default. HybridReader overrides to call
hybrid_cache::clear_all, which iterates the meta table and drops
each per-key cache table, then defensively deletes any leftover
meta rows.
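To illustrate why a defaulted trait method removes the need for match-and-cast, here is a small self-contained sketch. The `Reader` trait below is cut down to the one method, the `Result` alias uses a `String` error as a placeholder for the crate's error type, and `PlainReader`, `CachingReader`, and `clear_every_reader` are hypothetical names.

```rust
use std::cell::RefCell;

// ASSUMPTION: a String error stands in for the crate's own error type.
type Result<T> = std::result::Result<T, String>;

trait Reader {
    /// Default: readers without a cache inherit a no-op.
    fn clear_cache(&self) -> Result<()> {
        Ok(())
    }
}

/// Hypothetical reader with no cache; it relies on the default method.
struct PlainReader;
impl Reader for PlainReader {}

/// Hypothetical caching reader; the Vec stands in for staged cache tables.
struct CachingReader {
    tables: RefCell<Vec<String>>,
}

impl Reader for CachingReader {
    fn clear_cache(&self) -> Result<()> {
        self.tables.borrow_mut().clear();
        Ok(())
    }
}

/// Generic code can call clear_cache() without knowing the concrete type.
fn clear_every_reader(readers: &[Box<dyn Reader>]) -> Result<()> {
    for reader in readers {
        reader.clear_cache()?;
    }
    Ok(())
}

fn main() {
    let readers: Vec<Box<dyn Reader>> = vec![
        Box::new(PlainReader),
        Box::new(CachingReader {
            tables: RefCell::new(vec!["__ggsql_cache_abc".to_string()]),
        }),
    ];
    clear_every_reader(&readers).expect("clearing should not fail");
}
```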

Jupyter `-- @uncache` meta-command

Mirrors the existing -- @connect: <uri> pattern:

- META_UNCACHE_PREFIX = "-- @uncache"
- parse_uncache_meta_command(code) returns Some(()) iff the
  trimmed code is the prefix followed only by whitespace.
- QueryExecutor::execute checks this first; on match, calls
  self.reader.clear_cache()? and returns an empty DataFrame so
  the cell renders cleanly.
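A parser with the behavior described above could be as small as the following sketch. The constant and function names come from the PR text; the body is an assumption about one way to satisfy the stated rule, not a copy of the kernel's code.

```rust
const META_UNCACHE_PREFIX: &str = "-- @uncache";

/// Some(()) iff the trimmed code is the prefix followed only by whitespace.
fn parse_uncache_meta_command(code: &str) -> Option<()> {
    // Drop surrounding whitespace, then require the prefix at the start.
    let rest = code.trim().strip_prefix(META_UNCACHE_PREFIX)?;
    // Anything left after the prefix (e.g. "-- @uncached") is not a match.
    if rest.trim().is_empty() {
        Some(())
    } else {
        None
    }
}

fn main() {
    for cell in ["-- @uncache", "-- @uncache \n", "SELECT 1"] {
        println!("{:?} -> {:?}", cell, parse_uncache_meta_command(cell));
    }
}
```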

Vega-Lite v5 + v6 mime

format_vegalite in display.rs builds two serde_json::Values:
the original spec (emitted as application/vnd.vegalite.v6+json)
and a clone with the $schema field rewritten to the v5 schema URL
(emitted as application/vnd.vegalite.v5+json). The $schema
rewrite is necessary because JupyterLab's vega5 extension validates
schema-URL-vs-mime-version. The HTML/vega-embed fallback and
text/plain remain as before.
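The clone-and-rewrite step can be sketched without any JSON crate. In the sketch below a flat string map stands in for the `serde_json::Value` spec, and `dual_mime_bundle` is a hypothetical name; only the two mime types and the idea of rewriting `$schema` on the v5 copy come from the description above.

```rust
use std::collections::BTreeMap;

const V5_SCHEMA: &str = "https://vega.github.io/schema/vega-lite/v5.json";
const V5_MIME: &str = "application/vnd.vegalite.v5+json";
const V6_MIME: &str = "application/vnd.vegalite.v6+json";

/// ASSUMPTION: a flat string map stands in for the serde_json::Value spec.
type Spec = BTreeMap<String, String>;

/// Build the two native payloads: the spec as-is under the v6 mime type and
/// a clone with `$schema` pointed at the v5 schema under the v5 mime type.
fn dual_mime_bundle(spec: &Spec) -> BTreeMap<&'static str, Spec> {
    let mut v5 = spec.clone();
    v5.insert("$schema".to_string(), V5_SCHEMA.to_string());

    let mut bundle = BTreeMap::new();
    bundle.insert(V6_MIME, spec.clone());
    bundle.insert(V5_MIME, v5);
    bundle
}

fn main() {
    let mut spec = Spec::new();
    spec.insert(
        "$schema".to_string(),
        "https://vega.github.io/schema/vega-lite/v6.json".to_string(),
    );
    spec.insert("mark".to_string(), "point".to_string());

    let bundle = dual_mime_bundle(&spec);
    for (mime, payload) in &bundle {
        println!("{}: {}", mime, payload["$schema"]);
    }
}
```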

A drive-by fix in the same function: the output_location: "plot"
hint (Positron Plot-pane routing) was previously placed at the top
level of the output object, which fails Jupyter's notebook-format
schema validation and causes JupyterLab to silently drop the
output. Moving it inside metadata per the schema fixes the
dropped-output case.

Testing

All offline, no external setup:

Cache (`hybrid_cache.rs`)

- Key stability across calls; URI and SQL each affect the key;
  separator-injection collision resistance.
- Meta-table DDL idempotency (ensure_meta callable twice).
- INSERT-OR-REPLACE behavior on the cache_key PK.
- Lookup → insert → touch (advances last_accessed) → drop cycle.

Cache (`hybrid.rs`)

All cache-hit assertions go through a CountingReader that wraps
DuckDBReader and shares an Arc<AtomicUsize> call counter so the
test can read it after the reader is moved into Box<dyn Reader>:

- Default config has enabled=true, ttl_secs=300.
- with_cache_config applies custom TTL / byte budget / enabled.
- A repeat execute_sql of the same query reaches the primary
  exactly once (counter stops at 1).
- ttl=0 always misses: the counter advances on every call (the strict
  < comparison guarantees this even within the same millisecond).
- LRU eviction with a 1-byte budget evicts entry A when B is
  inserted, so re-querying A increments the counter again.
- clear_cache() resets the cache; the next call increments the
  counter as if from scratch.
- The viz pipeline emits SQL strings through execute_sql that the
  cache memoizes: re-issuing one of the pipeline's sub-queries
  (the schema-fetch SQL targeting the global temp table) is served
  from cache and does NOT advance the data counter. This guarantees
  that any pipeline that re-emits the same SQL string within the
  TTL is memoized.
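The counting-wrapper technique these tests rely on can be sketched as follows. Everything here is a stand-in: the one-method `Reader` trait returns a `String` instead of a DataFrame, `EchoReader` replaces DuckDBReader, and `MemoReader` is a hypothetical minimal memoizer (no TTL, no eviction) in place of HybridReader. The point is only that the test keeps its own `Arc` handle to the counter after the reader has been boxed.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// ASSUMPTION: simplified trait; the real one returns a DataFrame result.
trait Reader {
    fn execute_sql(&self, sql: &str) -> String;
}

/// Hypothetical primary that just echoes the query.
struct EchoReader;
impl Reader for EchoReader {
    fn execute_sql(&self, sql: &str) -> String {
        format!("rows for: {}", sql)
    }
}

/// Wraps a reader and counts how often it is actually called.
struct CountingReader<R: Reader> {
    inner: R,
    calls: Arc<AtomicUsize>,
}

impl<R: Reader> Reader for CountingReader<R> {
    fn execute_sql(&self, sql: &str) -> String {
        self.calls.fetch_add(1, Ordering::SeqCst);
        self.inner.execute_sql(sql)
    }
}

/// Minimal memoizing wrapper standing in for HybridReader.
struct MemoReader {
    primary: Box<dyn Reader>,
    cache: RefCell<HashMap<String, String>>,
}

impl MemoReader {
    fn execute_sql(&self, sql: &str) -> String {
        // Clone the hit out so the borrow ends before any insert below.
        let cached = self.cache.borrow().get(sql).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let fresh = self.primary.execute_sql(sql);
        self.cache.borrow_mut().insert(sql.to_string(), fresh.clone());
        fresh
    }

    fn clear_cache(&self) {
        self.cache.borrow_mut().clear();
    }
}

fn main() {
    let calls = Arc::new(AtomicUsize::new(0));
    let counting = CountingReader { inner: EchoReader, calls: Arc::clone(&calls) };
    // The reader is moved into the box; `calls` stays readable from here.
    let hybrid = MemoReader { primary: Box::new(counting), cache: RefCell::new(HashMap::new()) };
    hybrid.execute_sql("SELECT 1");
    hybrid.execute_sql("SELECT 1");
    hybrid.clear_cache();
    println!("primary calls: {}", calls.load(Ordering::SeqCst));
}
```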

Jupyter

- parse_uncache_meta_command accepts -- @uncache,
  -- @uncache \n; rejects SELECT 1.
- QueryExecutor::execute("-- @uncache") returns
  ExecutionResult::DataFrame(...) against the default
  duckdb://memory reader (a no-op clear_cache on the trait
  default, which proves the dispatch arm is wired).
- format_vegalite emits both v5 and v6 mime keys; the v5 payload
  has its $schema rewritten to the v5 URL.

Limitations

- Cache key uses a fixed "hybrid-primary" placeholder for the
  reader URI portion (the Reader trait doesn't expose a URI).
  Each HybridReader instance has its own staging DuckDB namespace,
  so keys from different instances cannot collide. If a future
  trait extension exposes a real URI, the placeholder becomes the
  actual URI and the test on key uniqueness across URIs starts
  passing for free.
- No spill-to-disk; the cache lives in the staging DuckDB instance
  owned by a single HybridReader and dies with it.
- Eviction failures are logged and swallowed (eprintln!) since
  the user's data is already returned at that point. If review
  prefers structured logging, swapping to tracing::warn! is a
  trivial follow-up.
- The viz pipeline's whole-iteration cache reuse (issuing
  r.execute(viz_query) twice and serving the second call entirely
  from cache) is gated on upstream pipeline determinism: the
  current pipeline issues schema/range/data sub-queries whose SQL
  depends on HashMap iteration order, and the temp-table DDL is
  uncacheable by design (zero-column DDL result). Each individual
  sub-query is cached and replayed correctly when re-issued
  verbatim. Improving end-to-end viz iteration deduplication is a
  separate follow-up; the cache infrastructure here is the
  prerequisite.

What's next

PR4 (ggsql-python PyO3 bindings) depends on PR1+PR2+PR3.

Adds two small crates needed by the upcoming hybrid_cache module: sha2 for SHA-256 hashing of (reader_uri, sql) cache keys, hex for fixed-width hex encoding of the truncated digest. Both pull in only `generic-array`/`typenum`/`block-buffer` transitively — no large graphs.

Generic cache primitives for HybridReader: SHA-256 cache-key derivation from (reader_uri, sql), DuckDB meta-table DDL, lookup/insert/touch/drop operations, and LRU eviction over a configurable byte budget. Reader-agnostic: the only 'remote' reference is via the caller-supplied reader_uri string. Tests cover key stability, separator-injection collision resistance, meta-table idempotency, INSERT-OR-REPLACE behavior on the cache_key PK, and the lookup/insert/touch/drop cycle. Module not yet wired into mod.rs; that lands together with the HybridReader cache integration in the next commit.
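As an illustration of LRU eviction over a byte budget, the sketch below works on an in-memory list rather than the DuckDB meta table. `MetaRow` keeps only the columns eviction needs, and `evict_lru` is a hypothetical helper. One assumption to note: this version evicts oldest-first until the total fits, so a budget smaller than every entry empties the list; the source does not say whether the real implementation spares the newest entry.

```rust
/// One row of cache metadata, reduced to the fields eviction needs.
#[derive(Debug, Clone)]
struct MetaRow {
    cache_key: String,
    last_accessed: i64, // milliseconds
    byte_estimate: u64,
}

/// Remove least-recently-accessed rows until the total fits in `max_bytes`.
/// Returns the evicted keys, oldest first.
fn evict_lru(rows: &mut Vec<MetaRow>, max_bytes: u64) -> Vec<String> {
    // Oldest access first.
    rows.sort_by_key(|r| r.last_accessed);
    let mut total: u64 = rows.iter().map(|r| r.byte_estimate).sum();
    let mut evicted = Vec::new();
    while total > max_bytes && !rows.is_empty() {
        let victim = rows.remove(0);
        total -= victim.byte_estimate;
        evicted.push(victim.cache_key);
    }
    evicted
}

/// Three 100-byte rows accessed at t = 1, 2, 3 (test fixture).
fn sample_rows() -> Vec<MetaRow> {
    ["a", "b", "c"]
        .iter()
        .enumerate()
        .map(|(i, key)| MetaRow {
            cache_key: key.to_string(),
            last_accessed: i as i64 + 1,
            byte_estimate: 100,
        })
        .collect()
}

fn main() {
    let mut rows = sample_rows();
    let evicted = evict_lru(&mut rows, 250);
    println!("evicted {:?}, kept {} rows", evicted, rows.len());
}
```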

The cache module is reader-agnostic, but the test fixtures inherited from the source-of-truth file used 'quiver+trino://' URIs that would read as Netflix-specific to upstream reviewers. Replace with neutral 'backend://' literals — the substance of the tests (key stability, URI sensitivity, lookup/insert cycle) is unchanged.

… mod New `Reader::clear_cache()` method with default `Ok(())` so existing readers (DuckDB, SQLite, ODBC, ADBC) inherit a no-op implementation. `HybridReader` will override it in the next commit to drop its staging-DuckDB cache tables. Also declares `mod hybrid_cache;` (non-pub) so the new helper module is reachable from `hybrid.rs`.

execute_sql now consults the staging DuckDB cache (TTL + LRU byte-budget eviction) before falling through to the primary reader, and stages miss results back under hashed table names. clear_cache() override drops all cache tables. with_cache_config() lets callers tune TTL or byte budget; default is 300s TTL / 512 MB / enabled (gated by GGSQL_HYBRID_CACHE_DISABLED=1). Tests cover default config, custom config application, repeat-query cache hit, ttl=0 always-miss, LRU eviction under tight budget, clear_cache wiping the meta and tables, and the empty-width fast-path.

Lets notebook users invalidate any caches their reader holds without restarting the kernel. Mirrors the existing -- @connect: <uri> pattern: prefix constant, parser returning Some(()) on a clean prefix-only line, and a dispatch arm in QueryExecutor::execute that calls Reader::clear_cache(). Returns an empty DataFrame so the cell renders without further machinery. For non-HybridReader readers the trait default makes this a clean no-op; HybridReader overrides to drop its DuckDB cache tables.

Frontends vary in which Vega-Lite version they render natively: JupyterLab 4.x's built-in @jupyterlab/vega5-extension only handles v5; nteract and newer Lab extensions render v6. Emitting both lets the client pick whichever it supports without the user installing an extension or trusting a CDN-loaded vega-embed. The v5 payload has its $schema URL rewritten to the v5 schema since the JupyterLab vega5 extension validates schema-URL-vs-mime-version agreement. The two specs are otherwise identical; ggsql's generated output uses core Vega-Lite features stable across v5 and v6.

Mirrors the equivalence-tests pattern from src/reader/adbc.rs (PR1): gated #[cfg(all(test, feature = "sqlite", feature = "adbc"))], each test #[ignore]'d so default CI doesn't hit the missing dynamic driver, runnable via 'cargo test --features "adbc duckdb sqlite" -- --ignored cache_equivalence' after 'dbc install sqlite'.

Four tests, each with HybridReader<AdbcSqlite> as primary + in-memory DuckDB as staging:

- equiv_cache_returns_same_data_as_bare_adbc: cache miss + cache hit both return data byte-identical to a bare AdbcReader on the same query. Validates the miss-then-stage path doesn't corrupt data and the hit-then-SELECT-from-staging path returns a faithful copy.
- equiv_cache_hit_avoids_adbc_call: counter-based; second call to the same SQL must not increment the ADBC call counter.
- equiv_clear_cache_forces_adbc_refetch: clear_cache() must drop cache state so the next call round-trips back to ADBC.
- equiv_ttl_zero_always_hits_adbc: ttl=0 must always evict + refetch.

Verified locally with the SQLite ADBC driver installed via 'dbc install sqlite': all 4 pass.

Six fixes prompted by an llm-panel pre-review of PR3:

1. Cache key namespacing: each HybridReader mints a per-instance UUID used as the `reader_uri` half of the cache key, so two HybridReader instances cannot alias each other's cached results even when they share a staging DuckDB. Replaces the fixed "hybrid-primary" placeholder.
2. Identifier quoting in hybrid_cache SQL: every reference to the meta table and the per-key cache tables now goes through `naming::quote_ident`, and every string literal goes through `naming::quote_literal`. Removes the local `esc()` helper. Eliminates reserved-word and injection brittleness from manual interpolation.
3. Vega-Lite parse-failure path: format_vegalite no longer emits the v5 / v6 mime bundle or vega-embed HTML when the spec fails to parse. It returns a plain text/plain error output instead, so the failure surfaces cleanly to the user rather than rendering as a silently broken chart.
4. Clock-skew robustness: `age_ms` is clamped to `(now - fetched).max(0)` so a backwards-moving clock can no longer keep stale entries alive. `now_ms()` returns `i64::MAX` on the (practically impossible) case where SystemTime::now() is before the UNIX epoch, which is the safer failure mode for a correctness-flavoured cache (force re-fetch instead of "infinitely fresh").
5. clear_all no longer orphans tables: the blanket `DELETE FROM meta` at the end is gone. Failed DROP TABLEs now propagate as an error with the meta rows still in place so a retry can pick them up.
6. Library hygiene: `eprintln!` swapped for `tracing::warn!` on the non-fatal eviction-failure path.

Adds four tests covering (1), (4), and (5): instance_id uniqueness, shared-staging non-aliasing, future fetched_at clamped to age=0, and clear_cache on a clean state.
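The clock-skew handling in fix 4 can be sketched in a few lines. The function names `now_ms` and `age_ms` appear in the commit message; `is_fresh` is a hypothetical helper, and the use of `saturating_sub` (to avoid overflow when `now` is `i64::MAX`) is this sketch's own choice rather than something the source states.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Milliseconds since the UNIX epoch, or i64::MAX if the clock reads
/// earlier than the epoch (which forces every entry to look expired).
fn now_ms() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as i64)
        .unwrap_or(i64::MAX)
}

/// Age of an entry, clamped so a `fetched_at` in the future counts as age 0.
/// ASSUMPTION: saturating_sub guards the subtraction against overflow.
fn age_ms(now: i64, fetched_at: i64) -> i64 {
    now.saturating_sub(fetched_at).max(0)
}

/// Strict `<`, so ttl_ms == 0 never yields a hit, even at age 0.
fn is_fresh(now: i64, fetched_at: i64, ttl_ms: i64) -> bool {
    age_ms(now, fetched_at) < ttl_ms
}

fn main() {
    let now = now_ms();
    // An entry stamped one second in the future is treated as age 0.
    println!("age = {}", age_ms(now, now.saturating_add(1_000)));
    println!("fresh with ttl=0? {}", is_fresh(now, now, 0));
}
```

Without the clamp, a negative age would compare as smaller than any TTL, including a TTL of zero, so a future-dated entry could be served even when caching is meant to always miss.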

jimhester added 11 commits May 28, 2026 14:58

style: apply cargo fmt + clippy fixes for PR3 changes

c83335d

docs(changelog): announce HybridReader cache, clear_cache trait method, @uncache, and dual VL mime

e471fa0

jimhester mentioned this pull request May 29, 2026

Add a HybridReader for use in write constrained databases posit-dev/ggsql#423

Open

jimhester wants to merge 11 commits into pr2-hybrid-reader from pr3-hybrid-cache