Skip to content

PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime#2

Draft
jimhester wants to merge 11 commits into
pr2-hybrid-readerfrom
pr3-hybrid-cache
Draft

PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime#2
jimhester wants to merge 11 commits into
pr2-hybrid-readerfrom
pr3-hybrid-cache

Conversation

@jimhester
Copy link
Copy Markdown
Owner

Draft — stacked on posit-dev/ggsql#423 (HybridReader). This PR lives in my fork (jimhester/ggsql) with base = pr2-hybrid-reader so the diff shows only the cache work on top of PR2. Opened in draft per posit-dev/ggsql#423 (comment) so the proposed staging-layer design can be informed by the concrete caching mechanism here. When PR2 merges to posit-dev/ggsql:main, I'll rebase and re-target this PR (or re-open it) against upstream.

Summary

Bundles four follow-ups to PR2 (HybridReader):

  1. Query-result cache for HybridReader — memoizes (reader_uri, sql) results in the staging DuckDB so visualization iteration
    (tweak DRAW, change a SCALE, re-run) is sub-millisecond on
    cache hits instead of round-tripping to the primary reader.
  2. Reader::clear_cache() trait method — default Ok(()) so
    readers without a cache (DuckDB, SQLite, ODBC, ADBC) inherit a
    no-op; HybridReader overrides to drop its cache tables.
  3. Jupyter -- @uncache meta-command — clears the active reader's
    cache from inside a notebook cell without restarting the kernel.
  4. Dual Vega-Lite v5 + v6 mime emission in the Jupyter kernel —
    JupyterLab 4.x (built-in v5 renderer) and nteract / newer
    Lab extensions (v6) both display natively without the embedded
    HTML fallback.

Each piece on its own is below the threshold of "worth a PR" — the
trait method has no motivation without the cache, and the
meta-command has no motivation without the trait method. They're a
unit. The VL mime change is small and rides along since it ships in
the same ggsql-jupyter/src/display.rs file the -- @uncache
dispatch lives next to. Reviewers who prefer to split the VL mime
change into PR3.5 — happy to oblige.

Companion design comment with the broader sequencing context:
#341 (comment).

Motivation

Visualization workflows iterate fast: change DRAW, change a
SCALE, re-run. Each iteration re-runs the SQL — usually identical
text — against the primary reader. For a remote backend (Trino,
Snowflake, anything over Flight SQL) that's wasteful network and
time. The cache memoizes results in the staging DuckDB, keyed on
(reader_uri, sql) with TTL and a byte-budget LRU. Hits are
sub-millisecond; misses fall through transparently. Scope is
deliberately narrow to HybridReader — other reader types don't
have a place to store results.

The trait method exists so non-Hybrid readers can be passed to
generic code that calls clear_cache() without per-type
match-and-cast. The Jupyter meta-command exists because the
notebook is the primary place users encounter the cache: when a
remote table changes underneath you, -- @uncache is faster than
# Restart Kernel + re-running every preceding cell.

The VL mime change addresses a JupyterLab 4.x reality: its
built-in @jupyterlab/vega5-extension only handles v5, while the
ecosystem (nteract, newer Lab extensions) increasingly emits v6.
Emitting both means the client picks whichever it has, with the
HTML/vega-embed fallback only kicking in when neither native
renderer is present.

Design

Cache (src/reader/hybrid_cache.rs + hybrid.rs integration)

The cache module is reader-agnostic: SHA-256 over (reader_uri, sql) joined by \n (the separator prevents (ab, c) and (a, bc)
from colliding), truncated to 16 hex chars (64 bits — collision
odds negligible at any realistic cache size), used as the suffix in
the per-query staging table name __ggsql_cache_<hex>. A
single __ggsql_cache_meta__ table tracks
(cache_key, reader_uri, sql, fetched_at, last_accessed, row_count, byte_estimate).

HybridReader::execute_sql consults the cache:

  1. If the query references a registered name → route to staging
    (PR2 behavior, unchanged).
  2. Else if cache disabled → direct passthrough to primary.
  3. Else look up the cache key; if present and within TTL, return
    the staged result and touch the last-accessed timestamp.
  4. Else fetch from primary, register the result under
    __ggsql_cache_<hex> in staging, insert the meta row,
    then run LRU eviction over max_bytes.

Empty-width DataFrames bypass caching (DuckDB's arrow(...) table
function rejects zero-column schemas).

Defaults: enabled, 300s TTL, 512 MB max. Tunable via
HybridReader::with_cache_config(CacheConfig). Globally disabled
with GGSQL_HYBRID_CACHE_DISABLED=1 (read once at
CacheConfig::default()).

Reader::clear_cache()

fn clear_cache(&self) -> Result<()> {
    Ok(())
}

Trait default. HybridReader overrides to call
hybrid_cache::clear_all, which iterates the meta table and drops
each per-key cache table, then defensively deletes any leftover
meta rows.

Jupyter -- @uncache meta-command

Mirrors the existing -- @connect: <uri> pattern:

  • META_UNCACHE_PREFIX = "-- @uncache"
  • parse_uncache_meta_command(code) returns Some(()) iff the
    trimmed code is the prefix followed only by whitespace.
  • QueryExecutor::execute checks this first; on match, calls
    self.reader.clear_cache()? and returns an empty DataFrame so
    the cell renders cleanly.

Vega-Lite v5 + v6 mime

format_vegalite in display.rs builds two serde_json::Values:
the original spec (emitted as application/vnd.vegalite.v6+json)
and a clone with the $schema field rewritten to the v5 schema URL
(emitted as application/vnd.vegalite.v5+json). The $schema
rewrite is necessary because JupyterLab's vega5 extension validates
schema-URL-vs-mime-version. The HTML/vega-embed fallback and
text/plain remain as before.

A drive-by fix in the same function: the output_location: "plot"
hint (Positron Plot-pane routing) was previously placed at the top
level of the output object, which fails Jupyter's notebook-format
schema validation and causes JupyterLab to silently drop the
output. Moving it inside metadata per the schema fixes the
dropped-output case.

Testing

All offline, no external setup:

Cache (hybrid_cache.rs)

  • Key stability across calls; URI and SQL each affect the key;
    separator-injection collision resistance.
  • Meta-table DDL idempotency (ensure_meta callable twice).
  • INSERT-OR-REPLACE behavior on the cache_key PK.
  • Lookup → insert → touch (advances last_accessed) → drop cycle.

Cache (hybrid.rs)

All cache-hit assertions go through a CountingReader that wraps
DuckDBReader and shares an Arc<AtomicUsize> call counter so the
test can read it after the reader is moved into Box<dyn Reader>:

  • Default config has enabled=true, ttl_secs=300.
  • with_cache_config applies custom TTL / byte budget / enabled.
  • A repeat execute_sql of the same query reaches the primary
    exactly once (counter stops at 1).
  • ttl=0 always misses — counter advances on every call (the strict
    < comparison guarantees this even within the same millisecond).
  • LRU eviction with a 1-byte budget evicts entry A when B is
    inserted, so re-querying A increments the counter again.
  • clear_cache() resets the cache; the next call increments the
    counter as if from scratch.
  • The viz pipeline emits SQL strings through execute_sql that the
    cache memoizes — re-issuing one of the pipeline's sub-queries
    (the schema-fetch SQL targeting the global temp table) is served
    from cache and does NOT advance the data counter. This guarantees
    that any pipeline that re-emits the same SQL string within the
    TTL is memoized.

Jupyter

  • parse_uncache_meta_command accepts -- @uncache,
    -- @uncache \n; rejects SELECT 1.
  • QueryExecutor::execute("-- @uncache") returns
    ExecutionResult::DataFrame(...) against the default
    duckdb://memory reader (a no-op clear_cache on the trait
    default — proves the dispatch arm is wired).
  • format_vegalite emits both v5 and v6 mime keys; the v5 payload
    has its $schema rewritten to the v5 URL.

Limitations

  • Cache key uses a fixed "hybrid-primary" placeholder for the
    reader URI portion (the Reader trait doesn't expose a URI).
    Each HybridReader instance has its own staging DuckDB namespace,
    so keys don't need to cross-collide between instances. If a future
    trait extension exposes a real URI, the placeholder becomes the
    actual URI and the test on key uniqueness across URIs starts
    passing for free.
  • No spill-to-disk; the cache lives in the staging DuckDB instance
    owned by a single HybridReader and dies with it.
  • Eviction failures are logged and swallowed (eprintln!) since
    the user's data is already returned at that point. If review
    prefers structured logging, swapping to tracing::warn! is a
    trivial follow-up.
  • The viz pipeline's whole-iteration cache reuse (issuing
    r.execute(viz_query) twice and serving the second call entirely
    from cache) is gated on upstream pipeline determinism — the
    current pipeline issues schema/range/data sub-queries whose SQL
    depends on HashMap iteration order, and the temp-table DDL is
    uncacheable by design (zero-column DDL result). Each individual
    sub-query is cached and replayed correctly when re-issued
    verbatim. Improving end-to-end viz iteration deduplication is a
    separate follow-up — the cache infrastructure here is the
    prerequisite.

What's next

PR4 (ggsql-python PyO3 bindings) depends on PR1+PR2+PR3.

jimhester added 11 commits May 28, 2026 14:58
Adds two small crates needed by the upcoming hybrid_cache module:
sha2 for SHA-256 hashing of (reader_uri, sql) cache keys, hex for
fixed-width hex encoding of the truncated digest. Both pull in only
`generic-array`/`typenum`/`block-buffer` transitively — no large
graphs.
Generic cache primitives for HybridReader: SHA-256 cache-key derivation
from (reader_uri, sql), DuckDB meta-table DDL, lookup/insert/touch/drop
operations, and LRU eviction over a configurable byte budget. Reader-
agnostic — the only 'remote' reference is via the caller-supplied
reader_uri string. Tests cover key stability, separator-injection
collision resistance, meta-table idempotency, INSERT-OR-REPLACE
behavior on the cache_key PK, and the lookup/insert/touch/drop cycle.

Module not yet wired into mod.rs — that lands together with the
HybridReader cache integration in the next commit.
The cache module is reader-agnostic, but the test fixtures inherited
from the source-of-truth file used 'quiver+trino://' URIs that would
read as Netflix-specific to upstream reviewers. Replace with neutral
'backend://' literals — the substance of the tests (key stability,
URI sensitivity, lookup/insert cycle) is unchanged.
… mod

New `Reader::clear_cache()` method with default `Ok(())` so existing
readers (DuckDB, SQLite, ODBC, ADBC) inherit a no-op implementation.
`HybridReader` will override it in the next commit to drop its
staging-DuckDB cache tables.

Also declares `mod hybrid_cache;` (non-pub) so the new helper module
is reachable from `hybrid.rs`.
execute_sql now consults the staging DuckDB cache (TTL + LRU byte-budget
eviction) before falling through to the primary reader, and stages
miss results back under hashed table names. clear_cache() override drops
all cache tables. with_cache_config() lets callers tune TTL or byte
budget; default is 300s TTL / 512 MB / enabled (gated by
GGSQL_HYBRID_CACHE_DISABLED=1).

Tests cover default config, custom config application, repeat-query
cache hit, ttl=0 always-miss, LRU eviction under tight budget,
clear_cache wiping the meta and tables, and the empty-width
fast-path.
Lets notebook users invalidate any caches their reader holds without
restarting the kernel. Mirrors the existing -- @connect: <uri> pattern:
prefix constant, parser returning Some(()) on a clean prefix-only line,
and a dispatch arm in QueryExecutor::execute that calls
Reader::clear_cache(). Returns an empty DataFrame so the cell renders
without further machinery.

For non-HybridReader readers the trait default makes this a clean
no-op; HybridReader overrides to drop its DuckDB cache tables.
Frontends vary in which Vega-Lite version they render natively:
JupyterLab 4.x's built-in @jupyterlab/vega5-extension only handles v5;
nteract and newer Lab extensions render v6. Emitting both lets the
client pick whichever it supports without the user installing an
extension or trusting a CDN-loaded vega-embed.

The v5 payload has its \$schema URL rewritten to the v5 schema since
the JupyterLab vega5 extension validates schema-URL-vs-mime-version
agreement. The two specs are otherwise identical; ggsql's generated
output uses core Vega-Lite features stable across v5 and v6.
Mirrors the equivalence-tests pattern from src/reader/adbc.rs (PR1):
gated #[cfg(all(test, feature = "sqlite", feature = "adbc"))], each
test #[ignore]'d so default CI doesn't hit the missing dynamic
driver, runnable via 'cargo test --features "adbc duckdb sqlite" --
--ignored cache_equivalence' after 'dbc install sqlite'.

Four tests, each with HybridReader<AdbcSqlite> as primary + in-memory
DuckDB as staging:

- equiv_cache_returns_same_data_as_bare_adbc: cache miss + cache hit
  both return data byte-identical to a bare AdbcReader on the same
  query. Validates the miss-then-stage path doesn't corrupt data and
  the hit-then-SELECT-from-staging path returns a faithful copy.
- equiv_cache_hit_avoids_adbc_call: counter-based; second call to the
  same SQL must not increment the ADBC call counter.
- equiv_clear_cache_forces_adbc_refetch: clear_cache() must drop cache
  state so the next call round-trips back to ADBC.
- equiv_ttl_zero_always_hits_adbc: ttl=0 must always evict + refetch.

Verified locally with the SQLite ADBC driver installed via 'dbc
install sqlite' — all 4 pass.
Six fixes prompted by an llm-panel pre-review of PR3:

1. Cache key namespacing: each HybridReader mints a per-instance UUID
   used as the `reader_uri` half of the cache key, so two HybridReader
   instances cannot alias each other's cached results even when they
   share a staging DuckDB. Replaces the fixed "hybrid-primary"
   placeholder.

2. Identifier quoting in hybrid_cache SQL: every reference to the meta
   table and the per-key cache tables now goes through
   `naming::quote_ident`, and every string literal goes through
   `naming::quote_literal`. Removes the local `esc()` helper. Eliminates
   reserved-word and injection brittleness from manual interpolation.

3. Vega-Lite parse-failure path: format_vegalite no longer emits the v5
   / v6 mime bundle or vega-embed HTML when the spec fails to parse —
   it returns a plain text/plain error output instead, so the failure
   surfaces cleanly to the user rather than rendering as a silently
   broken chart.

4. Clock-skew robustness: `age_ms` is clamped to `(now - fetched).max(0)`
   so a backwards-moving clock can no longer keep stale entries alive.
   `now_ms()` returns `i64::MAX` on the (practically impossible) case
   where SystemTime::now() is before the UNIX epoch — the safer
   failure mode for a correctness-flavoured cache (force re-fetch
   instead of "infinitely fresh").

5. clear_all no longer orphans tables: the blanket `DELETE FROM meta`
   at the end is gone. Failed DROP TABLEs now propagate as an error
   with the meta rows still in place so a retry can pick them up.

6. Library hygiene: `eprintln!` swapped for `tracing::warn!` on the
   non-fatal eviction-failure path.

Adds four tests covering (1), (4), and (5): instance_id uniqueness,
shared-staging non-aliasing, future fetched_at clamped to age=0, and
clear_cache on a clean state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant