Skip to content

Implement repeat directive + %rowIndex via a hybrid (staged) SQL emitter#34

Open
piotrszul wants to merge 13 commits into
masterfrom
issue/1_repeat_directive
Open

Implement repeat directive + %rowIndex via a hybrid (staged) SQL emitter#34
piotrszul wants to merge 13 commits into
masterfrom
issue/1_repeat_directive

Conversation

@piotrszul

@piotrszul piotrszul commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Replaces the legacy struct SQL emitter with the hybrid (staged-CTE) emitter, and adds the repeat directive and %rowIndex along the way. The hybrid emitter is now the sole emitter (code, templates, docs, tests). Closes #1.

Core migration

Landed as four reviewed sub-issue PRs into this branch, in sequence:

Stage PR Sub-issue What
1 #25 #24 Replace struct emitter with the staged/hybrid emitter (CHAIN/FORK, lazy fork keys, root-key modes); remove struct code/templates/specs/tests
2 #28 #27 %rowIndex for forEach/forEachOrNull (0-based, WITH ORDINALITY ordinal)
3 #30 #29 repeat operation — recursive-CTE descent + typed from_json bridge (CHAIN + FORK)
4 #33 #32 %rowIndex over repeat — pre-order descent index, nested-repeat chaining

Post-stage hardening

  • Internal identifier collisions — prefix every emitter-internal column (_rid, _ord{N}, _nord{N}, _node, _path, bridge var) with _, the one prefix a valid ViewDefinition column name can never use. A view may now legally name a column rid, node, path, or ord.
  • Test layout — split official SQL-on-FHIR conformance fixtures (tests/spec-tests/) from flatquack-specific ones (tests/custom-tests/); de-duplicate the staged harness behind a shared buildStaged + runFixtureSuite (test-util.js); route scratch output to a git-ignored tests/.tmp/.
  • Re-vendor spec-tests @ FHIR/sql-on-fhir.js 70c0566 (2026-06-16) — pulls newer official cases (repeat.json +4, row_index.json +1). Engine fix: join() over an empty collection yields NULL, not "".
  • On-demand typing (Typed column navigating a repeat-forced-JSON field emits JSON, breaking unionAll (repeat + non-repeat branches) #35Fix #35: hybrid on-demand typing across the repeat-forced JSON boundary #36) — re-type a repeat-forced JSON field back to a typed STRUCT inline at every navigation site (columns, where paths/lambdas, components, comparison operands, unionAll branches). This resolves the previously known-failing official case "unionAll with repeat and non-repeat branches", which is now unskipped and passing. Design recorded in docs/SPEC_ondemand_typing.md.
  • Docs — rewrote the hybrid spec as a standalone view-lowering document and moved specs/docs/ (SPEC_view_lowering.md, SPEC_ondemand_typing.md).

Verification

  • 445 tests pass / 0 fail (9 files); was 191/0 on the struct baseline. Includes the full official repeat.json and row_index.json suites (both root-key modes), with no skipped repeat/rowIndex cases.
  • End-to-end CLI verified on real DuckDB for forks, %rowIndex, chain repeat, fork repeat, and repeat-in-repeat indexing.
  • No struct-emitter remnants (tablesToSql/setRepeatContext/--backend/_forEach directives) remain in src/tests/templates.
  • --root-key <natural|uuid> (default natural) replaces --backend.

Follow-up (not in scope)

🤖 Generated with Claude Code

piotrszul and others added 4 commits June 15, 2026 20:25
* Replace struct emitter with hybrid (staged) emitter (#24)

Make the staged/hybrid SQL emitter (SPEC_hybrid) the sole emitter for all
non-repeat, non-%rowIndex features, removing the legacy struct emitter. Stage 1
of the migration tracked in #1.

What changed:
- src/staged-sql-builder.js (new): the SPEC_hybrid walker — collectScope,
  subtreeHasFork, emitScope (CHAIN vs FORK), emitUnion, unnestFrom, lazy fork
  keys (rid + WITH ORDINALITY ordinals), root-key modes (natural|uuid).
  repeat is rejected (later stage); no %rowIndex path.
- src/query-builder.js: staged-only; the `backend` flag is gone. The read
  schema is derived from a navigation-only path expression (view-parser
  `viewPaths`) instead of the struct flattening transform.
- src/view-parser.js: dropped struct flattening (`parseVd` path/tables);
  kept `validateVd` + `extractPathsFromAst`; added `viewPaths` for read-schema
  path coverage.
- src/ddb-sql-builder.js: kept all navigation/function/literal/comparison leaf
  cases, `_col`/`_col_collection`/`_invoke`, and `pathsToSchema` (shared by the
  staged engine); threaded the `rootVar` arg; removed struct-only directive
  cases (`_forEach`/`_forEachOrNull`/`_unionAll`) and `tablesToSql`.
- src/cli.js: removed `--backend`; added `--root-key` (default `natural`);
  still executes via the duckdb 1.4.1 binding.
- templates/{csv,parquet,ndjson,explore,dbt_model}.sql: promoted to the staged
  source/tail form; duck-macros.js gains `fhir_list`.
- tests: spec.staged.test.js is the spec runner (runs every reference suite in
  both root-key modes); staged-rootkey.test.js added; e2e/fp-custom migrated to
  the staged engine; removed struct-only spec.test.js + view.test.js and the
  struct-directive fp-custom cases. Added spec-tests/staged.json fork coverage.
- specs/SPEC_hybrid.md added; README documents `--root-key`.

Verification: `bun test` 279 pass / 0 fail. End-to-end `bun run src/cli.js`
on forEach and nested-fork ViewDefinitions produces correct output on duckdb
1.4.1; generated SQL uses only 1.4-valid syntax (UNNEST column, WITH
ORDINALITY). Stays on `duckdb` 1.4.1 (no @duckdb/node-api).

repeat and %rowIndex remain later migration stages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Simplify viewPaths to schema-coverage essentials (#24)

`viewPaths` was trimmed down from the removed struct `parseVd` and carried
vestigial structure that does not affect the extracted read-schema paths:
the `isRoot`/`inUnion` parameters and their conditional `_nav(...)` wrapping,
the `'e'` literal, and the `_col('name', ...)` wrapper. `extractPathsFromAst`
only collects type-resolved `nav` segments and recurses through any function's
args regardless of name, so these are inert. Collapse to a uniform inert
`_nav(...)` grouping wrapper that solely carries the forEach type context.
Produces byte-identical extracted paths; `bun test` 279 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Document natural-key getResourceKey() source + cross-type collision caveat (#26)

The natural root-fork key is sourced from getResourceKey() (as-is) so a
future fix to that function propagates to the key. getResourceKey() today
resolves to the bare id, so the natural key only disambiguates within a
single resource type; the mixed-type collision risk is tracked in #26.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Address code-review findings on the staged emitter (#24)

Correctness:
- Reject duplicate column names in validateVd: a name reused across sibling
  scopes previously emitted two same-named columns into a CTE, where DuckDB
  silently keeps the first and drops the other column's values. Now fails up
  front with a clear message. unionAll branches (which must share names) are
  still allowed.
- Reject `%rowIndex` with a clear "not supported by the staged backend yet"
  error instead of the misleading generic "define it with --var" hint (a
  constant --var cannot supply a per-row ordinal).

Cleanup (all behaviour-preserving — generated SQL byte-identical across every
reference view in both root-key modes):
- Memoise subtreeHasFork (WeakMap) so the recursive walk is not re-run on
  overlapping subtrees at every level.
- Compile each fan-out path once via prepareFanout, replacing childElemOf +
  arrayize which each re-parsed the same path.
- Unify UNNEST FROM-clause construction in a single unnestClause helper used by
  both unnestFrom and the unionAll emitter.
- Drop the unused fhir_list macro (forward-looking; belongs with the later
  repeat stage that will use it).

bun test: 283 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Add %rowIndex for forEach / forEachOrNull (staged emitter)

Implements the `%rowIndex` FHIRPath environment variable - the 0-based
position of an element within the collection its operator iterates,
scoped to the parent instance (per SPEC_hybrid §11), not a global
ROW_NUMBER over output.

- fhirpath-parser: `%rowIndex` parses to a `{segmentType:"rowIndex"}`
  segment (replacing the prior "not supported" rejection).
- ddb-sql-builder: `astToSql` threads a `rowIndexSql` binding; a
  `rowIndex` leaf emits it. Defaults to "0" (resource root / no
  enclosing iteration).
- staged-sql-builder: every forEach / forEachOrNull / unionAll fan-out
  binds a `WITH ORDINALITY` ordinal and sets the child scope's
  `rowIndexSql` to `CAST(ord - 1 AS INTEGER)` (the ordinal is BIGINT;
  %rowIndex is a FHIR integer). forEachOrNull over an empty collection
  yields index 0 (COALESCE the missing ordinal to 1) per the official
  fixture. The ordinal resets per driving row automatically.
- tests/spec-tests/row_index.json: official fixture (root, forEach,
  forEachOrNull incl. empty, nested forEach, unionAll variants,
  surrogate key). The `%rowIndex with repeat` sub-test is omitted -
  repeat (and %rowIndex over repeat) are later migration stages.
- spec runner: drop `row_index` from the exclude set; drop the now
  obsolete "%rowIndex is rejected" assertion (repeat rejection kept).

Scope: forEach / forEachOrNull / unionAll only. `repeat` stays
rejected (Stage 3) and %rowIndex-over-repeat is Stage 4.

Tests: 298 pass / 0 fail (baseline 283).

Refs #27, #1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Simplify %rowIndex ordinal plumbing (review cleanup)

- Fold bindRowIndex into unnestFrom (single source for chain/fork) and
  route the iterating unionAll branch through unnestFrom too, removing
  three duplicated bind sites and the cosmetic _b/_u alias split.
- Use the existing name("rn") helper instead of inline `rn${++counter}`,
  matching the file's other generated identifiers.
- Drop the redundant `|| "0"` in compilePath/compileColumn; astToSql
  already defaults rowIndexSql to "0".

No behaviour change; 298 pass / 0 fail.

Refs #27, #1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Address review: thread rowIndexSql into _invoke; reconcile spec §11

- ddb-sql-builder.js: thread rootVar/rowIndexSql into the _invoke macro-param
  astToSql call so %rowIndex resolves correctly if _invoke ever accepts path
  arguments (currently guarded by the literal-only validator, so latent).
- ddb-sql-builder.js: drop stray double blank line after the rowIndex case.
- SPEC_hybrid.md §11: rename %rowNumber -> %rowIndex, 1-based -> 0-based, and
  document the forEachOrNull empty-row result as 0 (COALESCE(ord,1)-1), matching
  the SoF spec and the implementation; repeat windows now subtract 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Stage 3: implement repeat operation in the staged emitter (#29, #1)

Implement the `repeat` directive on the hybrid (staged) backend per
SPEC_hybrid §5/§10, in both CHAIN and FORK modes.

- repeat-lowering.js: staged-relevant helpers (assertSimplePath, jsonFold
  descent, typedSeed, repeatStructure via a self-contained body-paths walk,
  childElemOf). Dropped the struct-only reduceSource accumulator.
- staged-sql-builder.js: remove the repeat rejection; add emitRepeat
  (WITH RECURSIVE descent over the listed paths only, excluding the seed +
  a typed from_json bridge so leaf columns / forEach inside the body compile
  via the existing typed engine), fq_cast_<N> cast-macro pooling, and FORK
  keying (carry the fork key + recombine via JOIN USING(rid)). A lone repeat
  stays a keyless CHAIN. Bridge element bound as `el_r` (not `el`) so an
  array-navigating body column's `list_transform(el -> ...)` lambda does not
  shadow the bridge struct.
- query-builder.js / view-parser.js: forced-JSON repeat seeds in the read
  schema (applyForcedJson + repeat-aware viewPaths), and the fq_staged_with /
  fq_staged_macros template vars.
- ddb-sql-builder.js: pathsToJsonStruct + forceJson handling (shared leafSqlType).
- duck-macros.js: the fhir_list macro (single-value-vs-array bridge).
- templates + test-util: WITH RECURSIVE keyword + cast-macro preamble placeholders.
- Fixtures: official repeat.json + staged-repeat.test.js (from_json bridge /
  choice-type / lambda-collision edge cases), wired into the staged spec runner.

%rowIndex over a repeat's own descent scope is Stage 4 (omitted here);
%rowIndex for forEach/forEachOrNull keeps working, including inside a repeat body.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Simplify: reuse viewPaths, single-compile seed, hoist bridge-var const

Apply /simplify findings:
- repeat-lowering.js: drop repeatBodyPaths (a verbatim duplicate of
  viewPaths' parseNode walker) and call viewPaths on the stripped body;
  typedSeed now arrayizes from its single compilePath result instead of
  recompiling each path via B.arrayize.
- ddb-sql-builder.js: factor the duplicated JSON object-field rendering in
  pathsToJsonStruct into a local objOf helper.
- staged-sql-builder.js: hoist the `el_r` bridge variable to a BRIDGE_VAR
  const used by bridgeElem and both bridge CTE emitters, so the alias and the
  leaf root-var can't silently drift.

No behaviour change: bun test 331 pass / 0 fail; chain + fork repeat e2e
unchanged (5 / 10 rows).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Review fixes: %rowIndex in shared-field forEach + applyForcedJson guard

Apply /review findings:
- HIGH: a `forEach` that names the same field as a sibling `repeat` is routed
  through emitJsonEach (from_json bridge). It bound %rowIndex on the discarded
  f.childElem instead of the bridge element used for the body, and passed
  ordName=null so no ORDINALITY was even emitted — %rowIndex was always 0.
  Now emitJsonEach always mints an ordinal, projects the 0-based %rowIndex as a
  named bridge column, and binds the body element's rowIndexSql to it.
  Verified: forEach beside repeat on the same field now reports 0,1,2.
- MEDIUM: applyForcedJson force-JSON'd a shallower ancestor on a partial path
  match (truncating a subtree a typed sibling may need). Guard with a `matched`
  flag so it only applies when the full path resolves.

bun test 331 pass / 0 fail; chain (5 rows) + fork (10 rows) repeat e2e unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix fork inside a repeat body; reject %rowIndex over a repeat scope

A repeat body that forks (>=2 fan-out children) recombined its branches via
JOIN USING(keyCols), where keyCols was constant per resource (rid for a CHAIN
repeat, the parent fork key for a FORK branch). With no per-node identity the
branches cross-produced across ALL descended nodes of the resource, emitting
wrong rows (e.g. item "a" paired with item "b"'s answers). The official
repeat.json suite missed this: its repeat+internal-fork cases all expect a
single row, which cannot expose the cross-product.

Mint a per-node descent key when the body forks: carry the integer descent
`path` (per-level WITH ORDINALITY ordinals from seed to node) through both
recursive legs, then assign a deterministic pre-order index
`(row_number() OVER (PARTITION BY keyCols ORDER BY path) - 1)` at the bridge
and thread it as an extra recombination key. ORDER BY path is a total order
(paths are unique per node), so the index survives every reference to the
bridge without MATERIALIZED. This is the same descent key Stage 4 binds for
%rowIndex over a repeat's own scope; here it serves only as a fork key.

Also reject %rowIndex resolving to a repeat's own descent scope (Stage 4,
unsupported) instead of silently binding it to the constant 0: the repeat
bridge element binds rowIndexSql=null and the leaf engine throws on a rowIndex
segment with no valid binding. %rowIndex for a forEach/forEachOrNull inside a
repeat body is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Stage 4: %rowIndex over repeat (pre-order descent index)

Implements `%rowIndex` indexing a `repeat`'s OWN descent scope: a depth-first
pre-order index over the recursive descent, completing the hybrid-only migration.

The index is assigned at the repeat's own scope (one row per visited node) by
`row_number() OVER (PARTITION BY {enclosing-scope key} ORDER BY path)` over the
int-list descent path the rCTE already carries, BEFORE any sibling recombination,
then materialised as a scalar that survives fork joins:
  - root repeat partitions by `rid` (forced even with no fork, via
    `subtreeHasRepeatUsingRowIndex`);
  - a repeat under a forEach carries the enclosing iteration ordinal as a key so
    its window restarts per enclosing element;
  - a repeat-in-repeat partitions the inner window by the outer repeat's own index
    (chaining via `(rid, outer_rn)`), never PARTITION BY a list.

The pre-order window unifies with the existing fork-recombination node key (same
value, same `nord{N}` column): computed when any of fork / `%rowIndex` /
nested-`%rowIndex`-repeat needs it.

Adds tests/spec-tests/row_index_repeat.json (staged-only extended suite, skip:true)
+ tests/row-index-repeat.test.js (dedicated runner, natural + uuid key modes).
Updates the now-obsolete Stage 3 "rejected" guard into a positive assertion.

Tests: 334 -> 344 pass, 0 fail (+10 from the new repeat-rowIndex suite).

Refs #32, #1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Simplify: share %rowIndex scope traversal + cache subtree detection

- Collapse the parallel raw-string / AST `%rowIndex` scope detectors onto one
  shared `scopeMentionsRowIndex` traversal parameterised by a per-column
  predicate, so the "a nested fan-out opens its own scope" rule lives in one
  place instead of being mirrored in four functions.
- Memoise `subtreeHasRepeatUsingRowIndex` with a WeakMap, matching the existing
  `forkCache` pattern (emitScope queries it per fan-out over overlapping subtrees).

No behaviour change; 344 pass / 0 fail.

Refs #32, #1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-project-automation github-project-automation Bot moved this to Backlog in Pathling Jun 15, 2026
@piotrszul piotrszul moved this from Backlog to In progress in Pathling Jun 15, 2026
piotrszul and others added 3 commits June 16, 2026 10:35
…lisions

A ViewDefinition column name is a validated SQL identifier starting with a
letter (`^[A-Za-z][A-Za-z0-9_]*$`). The staged emitter carried internal
bookkeeping columns (resource key `rid`, fork ordinals, a repeat's descent
`node`/`ord`/`path` and pre-order index `nord`) in the same CTEs as the user's
columns, so a view that legally named a column `rid`, `node`, `path`, etc.
collided: a hard `Binder Error` (duplicate column) inside a repeat's recursive
CTE, or silent value loss in `uuid` root-key mode.

Prefix every internal identifier with `_` — the smallest prefix a valid column
name can never start with — making the collision impossible rather than merely
unlikely. Covers the `name()`-generated relation/column names plus the fixed
names `_rid`, `_ord{N}`, `_nord{N}`, `_node`, `_ord`, `_path`, and BRIDGE_VAR.

Also move the unknown-type diagnostic in pathsToSchema from console.log
(stdout, which corrupts the compiled SQL the CLI prints there) to console.warn
(stderr), and document the enforcement mechanism in SPEC_hybrid §projection.

New tests/staged-reserved-names.test.js pins that a view may name a column
`rid`, `node`, `path`, or `ord` and still compile and return correct rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d test runners

tests/spec-tests/ now holds only the official SQL-on-FHIR conformance
fixtures (mirror of FHIR/sql-on-fhir.js /tests). Flatquack-specific
fixtures move to a new tests/custom-tests/ dir in the same JSON format:
staged.json (emitter fork handling) and row_index_repeat.json (extended
%rowIndex-over-repeat). repeat.json and row_index.json stay in spec-tests/
— they are official spec features.

De-duplicate the staged test harness:
- Extract a shared buildStaged helper and a runFixtureSuite runner into
  test-util.js (replacing the copy-pasted helper in 5 runner files). The
  runner gains a custom-only per-test `resources` override so one fixture
  can hold cases with differing input data.
- spec.staged.test.js + new custom.staged.test.js are now thin wrappers.
- Convert inline-view JS tests to JSON fixtures: reserved_names.json,
  repeat_extra.json, and a no_id_fork case in staged.json. The only
  non-fixture bits survive as staged-sql-shape.test.js (asserts emitted
  SQL) and view-parser.test.js (validateVd units, moved out of the spec
  runner). Delete the superseded row-index-repeat / staged-repeat /
  staged-reserved-names / staged-rootkey runners.

Route generated *.temp.json resource files to a git-ignored tests/.tmp/
scratch dir instead of polluting the fixture directories. Add tests/README.md
documenting the split.

Fix scripts/build-test-output.js (it imported a non-existent
testQueryTemplate and was broken): use the shared buildStaged, a single
shared db, and the scratch dir; regenerates an identical conformance report.

Coverage note: drops the 4 redundant natural-EXCEPT-ALL-uuid cross-mode
checks (both modes are still independently checked against ground truth).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-vendor tests/spec-tests/ from FHIR/sql-on-fhir.js tests/ at commit
70c05667b5fc8f82bfce16a3a2a0e5b142e5c6d2 (branch main, fetched 2026-06-16).
This pulls newer official cases, notably repeat.json (+4) and row_index.json
(+1, "%rowIndex with repeat"). The vendored revision is recorded in
tests/README.md. skip/ fixtures (constant, constant_types, fn_boundary) are
refreshed but stay skipped (unimplemented features).

Test harness: feed each case only the resources matching its view's declared
resource type. Upstream fhirpath.json now mixes an Organization (scalar name)
with Patients (array name) in one resource list, which broke flatquack's
type-coerced read_json_auto; filtering to the view's resourceType mirrors the
per-resource-type engine model and leaves results unchanged (recovers 16 tests).

Engine fix: join() over an empty collection now yields NULL, not "" — drop the
forced .ifnull2('') in ddb-sql-builder. Matches the new official fn_join cases.

Known failure (to be addressed separately): repeat.json "unionAll with repeat
and non-repeat branches". A unionAll branch that navigates the repeat-forced
JSON field (item.linkId.first()) evaluates to JSON, while the repeat branch
yields VARCHAR via its from_json bridge; UNION ALL then fails casting a plain
string to JSON. Root cause is forceJson typing not coercing to the column's
declared type; fix deferred pending design discussion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
piotrszul and others added 2 commits June 17, 2026 21:19
…ry (#36)

Re-type a repeat-forced JSON field back to a typed STRUCT inline at every site that
navigates across the boundary: columns (root / chain-forEach / unionAll branches),
where paths and where() lambdas, and components/comparison operands. Multi-segment
(dotted) seeds re-type at the divergence depth. forcedJsonPaths is the single source
of truth; from_json cast macros are pooled. No-op for non-repeat views.

Adds a systematic on-demand-typing test matrix + SQL-shape assertions and a design
spec (specs/SPEC_ondemand_typing.md). Full suite: 445 tests, 0 fail.
…ocs/

Make SPEC_hybrid.md a self-contained specification of the staged-CTE
lowering and drop all references to non-existent resources (Spec A/B/C,
benchmark/work SQL, measured timings). Replace the inline JSON leaf
machinery with the assumption of an available FHIRPath->SQL translation
layer, and focus on the view-to-relational mapping (chain vs fork, lazy
fork keys, repeat descent, %rowIndex, projection contract).

Rename specs/ -> docs/ and SPEC_hybrid.md -> SPEC_view_lowering.md, and
update all references (README, companion spec, source comments, tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@piotrszul

Copy link
Copy Markdown
Collaborator Author

master-fix vs cte-fix — full comparison

Benchmark of two flatquack lines compiling the same SQL-on-FHIR ViewDefinitions to DuckDB SQL:

  • master-fixmaster variant, non-staged emitter (benchmark.sql); branch staging/master-fix
  • cte-fixcte-staged variant, staged-CTE emitter (benchmark-staged.sql); branch staging/cte-fix

Same DuckDB engine in each cell, so every difference is the generated SQL. Speedup = master ÷ cte (>1 ⇒ cte faster).

Datasets

Synthea-generated FHIR R4 bulk-data (per-resource NDJSON, US Core profiles). Three patient cohorts — s1000 (~1k patients), s10000 (~10k), s100000 (~119k, patients-only export: Patient + Encounter only). questionnaire_response_flat runs on a separate QuestionnaireResponse dataset (Synthea doesn't emit QR): 1,000 QRs at s1000 and 10,000 QRs at s10000 (flattened to 10,268 / 102,470 output rows). Row counts are verified identical across variants.

Measurement

Per cell: flatquack compiles the ViewDefinition → DuckDB SQL, run through the DuckDB CLI in materialize mode (CREATE OR REPLACE TEMP TABLE _sink AS <query>), timed with .timer. 1 warmup + 5 measured iterations, warm (process + OS page cache); we report the mean of the 5 (± stderr omitted here, all < a few %). preserve_insertion_order = false is set for every run (recovers DuckDB 1.5 parallel result-sink; output carries no row order).

DuckDB 1.4.1 (mean ms)

case s1000 master / cte × s10000 master / cte ×
condition_flat 484 / 145 3.3× 4563 / 1158 3.9×
encounter_flat 760 / 230 3.3× 7064 / 1961 3.6×
us_core_blood_pressures 691 / 187 3.7× 6664 / 1656 4.0×
patient_addresses 6 / 6 1.0× 38 / 35 1.1×
patient_and_contact_addresses 6 / 7 1.0× 38 / 41 0.9×
patient_demographics 6 / 6 1.1× 37 / 35 1.0×
questionnaire_response_flat GEN_ERR / 15 GEN_ERR / 110

DuckDB 1.5.3 (mean ms)

case s1000 master / cte × s10000 master / cte ×
condition_flat 4781 / 154 31× 48393 / 1261 38×
encounter_flat 6628 / 243 27× 63651 / 2186 29×
us_core_blood_pressures 1643 / 188 8.7× 16887 / 1746 9.7×
patient_addresses 7 / 6 1.1× 49 / 45 1.1×
patient_and_contact_addresses 7 / 7 1.0× 47 / 41 1.2×
patient_demographics 6 / 6 1.0× 49 / 43 1.1×
questionnaire_response_flat GEN_ERR / 13 GEN_ERR / 93

s100000 — both engines (~119k patients)

case master 1.4.1 cte 1.4.1 master 1.5.3 cte 1.5.3 × @1.5.3
patient_addresses 336 ms 319 ms 434 ms 415 ms 1.0×
patient_and_contact_addresses 354 ms 225 ms 451 ms 224 ms 2.0×
patient_demographics 338 ms 338 ms 417 ms 393 ms 1.1×
encounter_flat (6.35M rows) 41.1 s 14.1 s 354.1 s 15.5 s 23×

Takeaways

  • Heavy flattens (condition / encounter / blood-pressures): cte is 3–4× faster on 1.4.1, and 9–38× faster on 1.5.3.
  • The 1.5 gap is a master regression, not a cte win. cte is engine-stable (encounter 14.1 → 15.5 s, +10% across 1.4→1.5); master collapses (41 → 354 s, 8.6×, a steady ~9× penalty at every size). This is a DuckDB-1.5 planning interaction with the non-staged SQL, not the preserve_insertion_order issue (already disabled in both).
  • Trivial Patient projections (patient_addresses, patient_demographics): essentially tied at every size/engine; both pick up the same ~20–30% 1.5 cost on small JSON scans.
  • patient_and_contact_addresses is the one Patient view where staging wins (2× at scale on both engines).
  • questionnaire_response_flat: master can't compile it (repeat directive → GEN_ERR); cte runs it. A capability gap, not just a speedup.

Net: cte-fix is equal-or-faster on every case, at every scale, on both engines — and engine-stable across the 1.4→1.5 jump that severely regresses master.

@piotrszul piotrszul requested a review from johngrimes June 18, 2026 06:02

@johngrimes johngrimes left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @piotrszul, it's looking good!

I managed to run it against my AU PIP QI project, but ran into the issue with the breaking change to the template variables. Is this breaking change necessary?

I've also added a few other comments.

Comment thread src/staged-sql-builder.js
Comment thread src/query-builder.js Outdated
Comment thread docs/SPEC_view_lowering.md Outdated
piotrszul and others added 4 commits June 23, 2026 18:41
Two claims in the spec did not match the staged emitter; soften the
wording to describe actual behavior:

- preserve_insertion_order = false: the lowering never emits it. Document
  it as a DuckDB 1.5+ session setting that recovers parallel UNNEST
  materialisation (a significant regression without it on 1.5+), and add
  a matching performance note to the README.
- shallow shared-array siblings: the emitter always FORKs at >=2 fan-outs;
  there is no CHAIN exception. Document always-FORK and record the
  shallow-CHAIN optimization as benchmark-gated in #39.

The third claim (WITH ORDINALITY 'zero extra cost') is addressed by a
follow-up code change in a separate commit.

Addresses review comment on PR #34.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously every forEach unnest emitted WITH ORDINALITY because an
ordinal was always minted to make %rowIndex available, even when no
column read it and no fork needed it as a key (the 'zero extra cost'
the spec claimed was actually an always-on cost).

Gate the ordinal on (needsOrd || usesRi): emit it only when a fork or a
nested %rowIndex repeat below needs it as a key, or when the scope itself
binds %rowIndex. When neither holds the unnest is a plain UNNEST(...) and
%rowIndex is left unbound (a stray reference is rejected, matching a
repeat's no-ordinal default). Reuses the existing scopeUsesRowIndex
predicate and the already-conditional unnestClause; applies uniformly to
the chain, fork, json-each, and unionAll-branch paths.

The unionAll-branch path computes needsOrd/usesRi in prepUnion (the same
predicates emitScope uses for chain/fork fan-outs); without them the
emitJsonEach/unnestFrom gate read undefined and minted no ordinal,
leaving a %rowIndex-reading forEach branch over a repeat-forced field
(jsonMode) unbound. The branch ordinal stays private (branches recombine
on the parent keyCols, never their own ordinal).

Verified: fork keys (_ord{N}) and %rowIndex views still emit ORDINALITY;
nested forks still join on (_rid, _ord1, ...). Full suite: 447 pass
(adds row_index_union_jsonmode.json covering the unionAll branch).

Reconciles the third SPEC_view_lowering claim (lines 139, 278) by making
the emitter match the doc rather than softening it.

Addresses review comment on PR #34.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntract

Replace the fq_staged_*/fq_where_filter template variables with one fq_sql_*
contract built around a named input CTE and a column-clean output CTE. The
template binds the data source as `_fq_input` (the resource's elements as its
columns) and projects the emitter's `_fq_output`; the emitter's flattening
pipeline is sealed between them. This drops the source-splicing hole and the
with/src/materialized/tail fragments in favour of:

  fq_sql_with, fq_sql_input, fq_sql_input_schema, fq_sql_where,
  fq_sql_pipeline, fq_sql_output, fq_sql_output_columns,
  fq_sql_macros, fq_sql_view_macros

Emitter (staged-sql-builder, query-builder): emit `_fq_src AS (… FROM _fq_input)`
reading the template-provided input, seal the MATERIALIZED hint onto `_fq_src`, and
add a column-clean `_fq_output` CTE; expose pipeline / input / output names and
the output column list (no edge commas, no terminal SELECT). The root pipeline CTE
is `_fq_`-prefixed (via the SRC_CTE constant) so it can never collide with a user
column, matching the seam CTEs and the SPEC_sql_template_contract §4.2 invariant.

Templates: rewrite csv/ndjson/parquet/explore/dbt_model/dbt_prehook. explore
materialises its limited input so a forked view stays deterministic. dbt routes
both macro blocks to the prehook so the model is a single, materializable
SELECT (fixing the prior inline-CREATE-MACRO model).

Docs: add docs/SPEC_sql_template_contract.md (the contract), rewrite the README
variable table + migration note, and cross-link from SPEC_view_lowering.

Tests: migrate the shared harness template to the new contract; add a
determinism guard that renders the shipped templates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Marks the fq_sql_ template-variable contract as a new minor release: the
fq_staged_*/fq_where_filter/fq_sql_flattening_* variables from v0.2.x are
replaced by the input/output CTE contract, a breaking change for custom
templates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@piotrszul piotrszul requested a review from johngrimes June 24, 2026 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Implement repeat directive

2 participants