PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime#2
Draft
jimhester wants to merge 11 commits into
Draft
PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime#2jimhester wants to merge 11 commits into
jimhester wants to merge 11 commits into
Conversation
Adds two small crates needed by the upcoming hybrid_cache module: sha2 for SHA-256 hashing of (reader_uri, sql) cache keys, hex for fixed-width hex encoding of the truncated digest. Both pull in only `generic-array`/`typenum`/`block-buffer` transitively — no large graphs.
Generic cache primitives for HybridReader: SHA-256 cache-key derivation from (reader_uri, sql), DuckDB meta-table DDL, lookup/insert/touch/drop operations, and LRU eviction over a configurable byte budget. Reader- agnostic — the only 'remote' reference is via the caller-supplied reader_uri string. Tests cover key stability, separator-injection collision resistance, meta-table idempotency, INSERT-OR-REPLACE behavior on the cache_key PK, and the lookup/insert/touch/drop cycle. Module not yet wired into mod.rs — that lands together with the HybridReader cache integration in the next commit.
The cache module is reader-agnostic, but the test fixtures inherited from the source-of-truth file used 'quiver+trino://' URIs that would read as Netflix-specific to upstream reviewers. Replace with neutral 'backend://' literals — the substance of the tests (key stability, URI sensitivity, lookup/insert cycle) is unchanged.
… mod New `Reader::clear_cache()` method with default `Ok(())` so existing readers (DuckDB, SQLite, ODBC, ADBC) inherit a no-op implementation. `HybridReader` will override it in the next commit to drop its staging-DuckDB cache tables. Also declares `mod hybrid_cache;` (non-pub) so the new helper module is reachable from `hybrid.rs`.
execute_sql now consults the staging DuckDB cache (TTL + LRU byte-budget eviction) before falling through to the primary reader, and stages miss results back under hashed table names. clear_cache() override drops all cache tables. with_cache_config() lets callers tune TTL or byte budget; default is 300s TTL / 512 MB / enabled (gated by GGSQL_HYBRID_CACHE_DISABLED=1). Tests cover default config, custom config application, repeat-query cache hit, ttl=0 always-miss, LRU eviction under tight budget, clear_cache wiping the meta and tables, and the empty-width fast-path.
Lets notebook users invalidate any caches their reader holds without restarting the kernel. Mirrors the existing -- @connect: <uri> pattern: prefix constant, parser returning Some(()) on a clean prefix-only line, and a dispatch arm in QueryExecutor::execute that calls Reader::clear_cache(). Returns an empty DataFrame so the cell renders without further machinery. For non-HybridReader readers the trait default makes this a clean no-op; HybridReader overrides to drop its DuckDB cache tables.
Frontends vary in which Vega-Lite version they render natively: JupyterLab 4.x's built-in @jupyterlab/vega5-extension only handles v5; nteract and newer Lab extensions render v6. Emitting both lets the client pick whichever it supports without the user installing an extension or trusting a CDN-loaded vega-embed. The v5 payload has its \$schema URL rewritten to the v5 schema since the JupyterLab vega5 extension validates schema-URL-vs-mime-version agreement. The two specs are otherwise identical; ggsql's generated output uses core Vega-Lite features stable across v5 and v6.
Mirrors the equivalence-tests pattern from src/reader/adbc.rs (PR1): gated #[cfg(all(test, feature = "sqlite", feature = "adbc"))], each test #[ignore]'d so default CI doesn't hit the missing dynamic driver, runnable via 'cargo test --features "adbc duckdb sqlite" -- --ignored cache_equivalence' after 'dbc install sqlite'. Four tests, each with HybridReader<AdbcSqlite> as primary + in-memory DuckDB as staging: - equiv_cache_returns_same_data_as_bare_adbc: cache miss + cache hit both return data byte-identical to a bare AdbcReader on the same query. Validates the miss-then-stage path doesn't corrupt data and the hit-then-SELECT-from-staging path returns a faithful copy. - equiv_cache_hit_avoids_adbc_call: counter-based; second call to the same SQL must not increment the ADBC call counter. - equiv_clear_cache_forces_adbc_refetch: clear_cache() must drop cache state so the next call round-trips back to ADBC. - equiv_ttl_zero_always_hits_adbc: ttl=0 must always evict + refetch. Verified locally with the SQLite ADBC driver installed via 'dbc install sqlite' — all 4 pass.
Six fixes prompted by an llm-panel pre-review of PR3: 1. Cache key namespacing: each HybridReader mints a per-instance UUID used as the `reader_uri` half of the cache key, so two HybridReader instances cannot alias each other's cached results even when they share a staging DuckDB. Replaces the fixed "hybrid-primary" placeholder. 2. Identifier quoting in hybrid_cache SQL: every reference to the meta table and the per-key cache tables now goes through `naming::quote_ident`, and every string literal goes through `naming::quote_literal`. Removes the local `esc()` helper. Eliminates reserved-word and injection brittleness from manual interpolation. 3. Vega-Lite parse-failure path: format_vegalite no longer emits the v5 / v6 mime bundle or vega-embed HTML when the spec fails to parse — it returns a plain text/plain error output instead, so the failure surfaces cleanly to the user rather than rendering as a silently broken chart. 4. Clock-skew robustness: `age_ms` is clamped to `(now - fetched).max(0)` so a backwards-moving clock can no longer keep stale entries alive. `now_ms()` returns `i64::MAX` on the (practically impossible) case where SystemTime::now() is before the UNIX epoch — the safer failure mode for a correctness-flavoured cache (force re-fetch instead of "infinitely fresh"). 5. clear_all no longer orphans tables: the blanket `DELETE FROM meta` at the end is gone. Failed DROP TABLEs now propagate as an error with the meta rows still in place so a retry can pick them up. 6. Library hygiene: `eprintln!` swapped for `tracing::warn!` on the non-fatal eviction-failure path. Adds four tests covering (1), (4), and (5): instance_id uniqueness, shared-staging non-aliasing, future fetched_at clamped to age=0, and clear_cache on a clean state.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft — stacked on posit-dev/ggsql#423 (HybridReader). This PR lives in my fork (
jimhester/ggsql) with base =pr2-hybrid-readerso the diff shows only the cache work on top of PR2. Opened in draft per posit-dev/ggsql#423 (comment) so the proposed staging-layer design can be informed by the concrete caching mechanism here. When PR2 merges toposit-dev/ggsql:main, I'll rebase and re-target this PR (or re-open it) against upstream.Summary
Bundles four follow-ups to PR2 (HybridReader):
HybridReader— memoizes(reader_uri, sql)results in the staging DuckDB so visualization iteration(tweak
DRAW, change aSCALE, re-run) is sub-millisecond oncache hits instead of round-tripping to the primary reader.
Reader::clear_cache()trait method — defaultOk(())soreaders without a cache (DuckDB, SQLite, ODBC, ADBC) inherit a
no-op;
HybridReaderoverrides to drop its cache tables.-- @uncachemeta-command — clears the active reader'scache from inside a notebook cell without restarting the kernel.
JupyterLab 4.x (built-in v5 renderer) and nteract / newer
Lab extensions (v6) both display natively without the embedded
HTML fallback.
Each piece on its own is below the threshold of "worth a PR" — the
trait method has no motivation without the cache, and the
meta-command has no motivation without the trait method. They're a
unit. The VL mime change is small and rides along since it ships in
the same
ggsql-jupyter/src/display.rsfile the-- @uncachedispatch lives next to. Reviewers who prefer to split the VL mime
change into PR3.5 — happy to oblige.
Companion design comment with the broader sequencing context:
#341 (comment).
Motivation
Visualization workflows iterate fast: change
DRAW, change aSCALE, re-run. Each iteration re-runs the SQL — usually identicaltext — against the primary reader. For a remote backend (Trino,
Snowflake, anything over Flight SQL) that's wasteful network and
time. The cache memoizes results in the staging DuckDB, keyed on
(reader_uri, sql)with TTL and a byte-budget LRU. Hits aresub-millisecond; misses fall through transparently. Scope is
deliberately narrow to
HybridReader— other reader types don'thave a place to store results.
The trait method exists so non-Hybrid readers can be passed to
generic code that calls
clear_cache()without per-typematch-and-cast. The Jupyter meta-command exists because the
notebook is the primary place users encounter the cache: when a
remote table changes underneath you,
-- @uncacheis faster than# Restart Kernel+ re-running every preceding cell.The VL mime change addresses a JupyterLab 4.x reality: its
built-in
@jupyterlab/vega5-extensiononly handles v5, while theecosystem (nteract, newer Lab extensions) increasingly emits v6.
Emitting both means the client picks whichever it has, with the
HTML/vega-embed fallback only kicking in when neither native
renderer is present.
Design
Cache (
src/reader/hybrid_cache.rs+hybrid.rsintegration)The cache module is reader-agnostic: SHA-256 over
(reader_uri, sql)joined by\n(the separator prevents(ab, c)and(a, bc)from colliding), truncated to 16 hex chars (64 bits — collision
odds negligible at any realistic cache size), used as the suffix in
the per-query staging table name
__ggsql_cache_<hex>. Asingle
__ggsql_cache_meta__table tracks(cache_key, reader_uri, sql, fetched_at, last_accessed, row_count, byte_estimate).HybridReader::execute_sqlconsults the cache:(PR2 behavior, unchanged).
the staged result and
touchthe last-accessed timestamp.__ggsql_cache_<hex>in staging, insert the meta row,then run LRU eviction over
max_bytes.Empty-width DataFrames bypass caching (DuckDB's
arrow(...)tablefunction rejects zero-column schemas).
Defaults: enabled, 300s TTL, 512 MB max. Tunable via
HybridReader::with_cache_config(CacheConfig). Globally disabledwith
GGSQL_HYBRID_CACHE_DISABLED=1(read once atCacheConfig::default()).Reader::clear_cache()Trait default.
HybridReaderoverrides to callhybrid_cache::clear_all, which iterates the meta table and dropseach per-key cache table, then defensively deletes any leftover
meta rows.
Jupyter
-- @uncachemeta-commandMirrors the existing
-- @connect: <uri>pattern:META_UNCACHE_PREFIX = "-- @uncache"parse_uncache_meta_command(code)returnsSome(())iff thetrimmed code is the prefix followed only by whitespace.
QueryExecutor::executechecks this first; on match, callsself.reader.clear_cache()?and returns an empty DataFrame sothe cell renders cleanly.
Vega-Lite v5 + v6 mime
format_vegaliteindisplay.rsbuilds twoserde_json::Values:the original spec (emitted as
application/vnd.vegalite.v6+json)and a clone with the
$schemafield rewritten to the v5 schema URL(emitted as
application/vnd.vegalite.v5+json). The$schemarewrite is necessary because JupyterLab's vega5 extension validates
schema-URL-vs-mime-version. The HTML/vega-embed fallback and
text/plain remain as before.
A drive-by fix in the same function: the
output_location: "plot"hint (Positron Plot-pane routing) was previously placed at the top
level of the output object, which fails Jupyter's notebook-format
schema validation and causes JupyterLab to silently drop the
output. Moving it inside
metadataper the schema fixes thedropped-output case.
Testing
All offline, no external setup:
Cache (
hybrid_cache.rs)separator-injection collision resistance.
ensure_metacallable twice).cache_keyPK.last_accessed) → drop cycle.Cache (
hybrid.rs)All cache-hit assertions go through a
CountingReaderthat wrapsDuckDBReaderand shares anArc<AtomicUsize>call counter so thetest can read it after the reader is moved into
Box<dyn Reader>:enabled=true,ttl_secs=300.with_cache_configapplies custom TTL / byte budget / enabled.execute_sqlof the same query reaches the primaryexactly once (counter stops at 1).
ttl=0always misses — counter advances on every call (the strict<comparison guarantees this even within the same millisecond).inserted, so re-querying A increments the counter again.
clear_cache()resets the cache; the next call increments thecounter as if from scratch.
execute_sqlthat thecache memoizes — re-issuing one of the pipeline's sub-queries
(the schema-fetch SQL targeting the global temp table) is served
from cache and does NOT advance the data counter. This guarantees
that any pipeline that re-emits the same SQL string within the
TTL is memoized.
Jupyter
parse_uncache_meta_commandaccepts-- @uncache,-- @uncache \n; rejectsSELECT 1.QueryExecutor::execute("-- @uncache")returnsExecutionResult::DataFrame(...)against the defaultduckdb://memoryreader (a no-opclear_cacheon the traitdefault — proves the dispatch arm is wired).
format_vegaliteemits both v5 and v6 mime keys; the v5 payloadhas its
$schemarewritten to the v5 URL.Limitations
"hybrid-primary"placeholder for thereader URI portion (the
Readertrait doesn't expose a URI).Each
HybridReaderinstance has its own staging DuckDB namespace,so keys don't need to cross-collide between instances. If a future
trait extension exposes a real URI, the placeholder becomes the
actual URI and the test on key uniqueness across URIs starts
passing for free.
owned by a single
HybridReaderand dies with it.eprintln!) sincethe user's data is already returned at that point. If review
prefers structured logging, swapping to
tracing::warn!is atrivial follow-up.
r.execute(viz_query)twice and serving the second call entirelyfrom cache) is gated on upstream pipeline determinism — the
current pipeline issues schema/range/data sub-queries whose SQL
depends on HashMap iteration order, and the temp-table DDL is
uncacheable by design (zero-column DDL result). Each individual
sub-query is cached and replayed correctly when re-issued
verbatim. Improving end-to-end viz iteration deduplication is a
separate follow-up — the cache infrastructure here is the
prerequisite.
What's next
PR4 (
ggsql-pythonPyO3 bindings) depends on PR1+PR2+PR3.