Implement repeat directive + %rowIndex via a hybrid (staged) SQL emitter by piotrszul · Pull Request #34 · aehrc/flatquack

piotrszul · 2026-06-15T15:09:29Z

Replaces the legacy struct SQL emitter with the hybrid (staged-CTE) emitter, and adds the repeat directive and %rowIndex along the way. The hybrid emitter is now the sole emitter (code, templates, docs, tests). Closes #1.

Core migration

Landed as four reviewed sub-issue PRs into this branch, in sequence:

Stage	PR	Sub-issue	What
1	#25	#24	Replace struct emitter with the staged/hybrid emitter (CHAIN/FORK, lazy fork keys, root-key modes); remove struct code/templates/specs/tests
2	#28	#27	`%rowIndex` for `forEach`/`forEachOrNull` (0-based, `WITH ORDINALITY` ordinal)
3	#30	#29	`repeat` operation — recursive-CTE descent + typed `from_json` bridge (CHAIN + FORK)
4	#33	#32	`%rowIndex` over `repeat` — pre-order descent index, nested-repeat chaining

Post-stage hardening

Internal identifier collisions — prefix every emitter-internal column (_rid, _ord{N}, _nord{N}, _node, _path, bridge var) with _, the one prefix a valid ViewDefinition column name can never use. A view may now legally name a column rid, node, path, or ord.
Test layout — split official SQL-on-FHIR conformance fixtures (tests/spec-tests/) from flatquack-specific ones (tests/custom-tests/); de-duplicate the staged harness behind a shared buildStaged + runFixtureSuite (test-util.js); route scratch output to a git-ignored tests/.tmp/.
Re-vendor spec-tests @ FHIR/sql-on-fhir.js 70c0566 (2026-06-16) — pulls newer official cases (repeat.json +4, row_index.json +1). Engine fix: join() over an empty collection yields NULL, not "".
On-demand typing (Typed column navigating a repeat-forced-JSON field emits JSON, breaking unionAll (repeat + non-repeat branches) #35 → Fix #35: hybrid on-demand typing across the repeat-forced JSON boundary #36) — re-type a repeat-forced JSON field back to a typed STRUCT inline at every navigation site (columns, where paths/lambdas, components, comparison operands, unionAll branches). This resolves the previously known-failing official case "unionAll with repeat and non-repeat branches", which is now unskipped and passing. Design recorded in docs/SPEC_ondemand_typing.md.
Docs — rewrote the hybrid spec as a standalone view-lowering document and moved specs/ → docs/ (SPEC_view_lowering.md, SPEC_ondemand_typing.md).

Verification

445 tests pass / 0 fail (9 files); was 191/0 on the struct baseline. Includes the full official repeat.json and row_index.json suites (both root-key modes), with no skipped repeat/rowIndex cases.
End-to-end CLI verified on real DuckDB for forks, %rowIndex, chain repeat, fork repeat, and repeat-in-repeat indexing.
No struct-emitter remnants (tablesToSql/setRepeatContext/--backend/_forEach directives) remain in src/tests/templates.
--root-key <natural|uuid> (default natural) replaces --backend.

Follow-up (not in scope)

getResourceKey()/getReferenceKey() resolve to bare id — cross-resource-type key collision risk #26 — getResourceKey()/getReferenceKey() resolve to bare id (cross-resource-type key collision risk).
Investigate materializing the root _fq_src CTE in natural-key mode as a performance optimization #40 — investigate materializing the root _fq_src CTE in natural-key mode as a performance optimization.
Add test coverage for the dbt templates (dbt environment + harness integration) #41 — test the dbt templates in a dbt environment and integrate dbt coverage into the test harness.

🤖 Generated with Claude Code

* Replace struct emitter with hybrid (staged) emitter (#24) Make the staged/hybrid SQL emitter (SPEC_hybrid) the sole emitter for all non-repeat, non-%rowIndex features, removing the legacy struct emitter. Stage 1 of the migration tracked in #1. What changed: - src/staged-sql-builder.js (new): the SPEC_hybrid walker — collectScope, subtreeHasFork, emitScope (CHAIN vs FORK), emitUnion, unnestFrom, lazy fork keys (rid + WITH ORDINALITY ordinals), root-key modes (natural|uuid). repeat is rejected (later stage); no %rowIndex path. - src/query-builder.js: staged-only; the `backend` flag is gone. The read schema is derived from a navigation-only path expression (view-parser `viewPaths`) instead of the struct flattening transform. - src/view-parser.js: dropped struct flattening (`parseVd` path/tables); kept `validateVd` + `extractPathsFromAst`; added `viewPaths` for read-schema path coverage. - src/ddb-sql-builder.js: kept all navigation/function/literal/comparison leaf cases, `_col`/`_col_collection`/`_invoke`, and `pathsToSchema` (shared by the staged engine); threaded the `rootVar` arg; removed struct-only directive cases (`_forEach`/`_forEachOrNull`/`_unionAll`) and `tablesToSql`. - src/cli.js: removed `--backend`; added `--root-key` (default `natural`); still executes via the duckdb 1.4.1 binding. - templates/{csv,parquet,ndjson,explore,dbt_model}.sql: promoted to the staged source/tail form; duck-macros.js gains `fhir_list`. - tests: spec.staged.test.js is the spec runner (runs every reference suite in both root-key modes); staged-rootkey.test.js added; e2e/fp-custom migrated to the staged engine; removed struct-only spec.test.js + view.test.js and the struct-directive fp-custom cases. Added spec-tests/staged.json fork coverage. - specs/SPEC_hybrid.md added; README documents `--root-key`. Verification: `bun test` 279 pass / 0 fail. End-to-end `bun run src/cli.js` on forEach and nested-fork ViewDefinitions produces correct output on duckdb 1.4.1; generated SQL uses only 1.4-valid syntax (UNNEST column, WITH ORDINALITY). Stays on `duckdb` 1.4.1 (no @duckdb/node-api). repeat and %rowIndex remain later migration stages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Simplify viewPaths to schema-coverage essentials (#24) `viewPaths` was trimmed down from the removed struct `parseVd` and carried vestigial structure that does not affect the extracted read-schema paths: the `isRoot`/`inUnion` parameters and their conditional `_nav(...)` wrapping, the `'e'` literal, and the `_col('name', ...)` wrapper. `extractPathsFromAst` only collects type-resolved `nav` segments and recurses through any function's args regardless of name, so these are inert. Collapse to a uniform inert `_nav(...)` grouping wrapper that solely carries the forEach type context. Produces byte-identical extracted paths; `bun test` 279 pass / 0 fail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Document natural-key getResourceKey() source + cross-type collision caveat (#26) The natural root-fork key is sourced from getResourceKey() (as-is) so a future fix to that function propagates to the key. getResourceKey() today resolves to the bare id, so the natural key only disambiguates within a single resource type; the mixed-type collision risk is tracked in #26. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Address code-review findings on the staged emitter (#24) Correctness: - Reject duplicate column names in validateVd: a name reused across sibling scopes previously emitted two same-named columns into a CTE, where DuckDB silently keeps the first and drops the other column's values. Now fails up front with a clear message. unionAll branches (which must share names) are still allowed. - Reject `%rowIndex` with a clear "not supported by the staged backend yet" error instead of the misleading generic "define it with --var" hint (a constant --var cannot supply a per-row ordinal). Cleanup (all behaviour-preserving — generated SQL byte-identical across every reference view in both root-key modes): - Memoise subtreeHasFork (WeakMap) so the recursive walk is not re-run on overlapping subtrees at every level. - Compile each fan-out path once via prepareFanout, replacing childElemOf + arrayize which each re-parsed the same path. - Unify UNNEST FROM-clause construction in a single unnestClause helper used by both unnestFrom and the unionAll emitter. - Drop the unused fhir_list macro (forward-looking; belongs with the later repeat stage that will use it). bun test: 283 pass / 0 fail. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add %rowIndex for forEach / forEachOrNull (staged emitter) Implements the `%rowIndex` FHIRPath environment variable - the 0-based position of an element within the collection its operator iterates, scoped to the parent instance (per SPEC_hybrid §11), not a global ROW_NUMBER over output. - fhirpath-parser: `%rowIndex` parses to a `{segmentType:"rowIndex"}` segment (replacing the prior "not supported" rejection). - ddb-sql-builder: `astToSql` threads a `rowIndexSql` binding; a `rowIndex` leaf emits it. Defaults to "0" (resource root / no enclosing iteration). - staged-sql-builder: every forEach / forEachOrNull / unionAll fan-out binds a `WITH ORDINALITY` ordinal and sets the child scope's `rowIndexSql` to `CAST(ord - 1 AS INTEGER)` (the ordinal is BIGINT; %rowIndex is a FHIR integer). forEachOrNull over an empty collection yields index 0 (COALESCE the missing ordinal to 1) per the official fixture. The ordinal resets per driving row automatically. - tests/spec-tests/row_index.json: official fixture (root, forEach, forEachOrNull incl. empty, nested forEach, unionAll variants, surrogate key). The `%rowIndex with repeat` sub-test is omitted - repeat (and %rowIndex over repeat) are later migration stages. - spec runner: drop `row_index` from the exclude set; drop the now obsolete "%rowIndex is rejected" assertion (repeat rejection kept). Scope: forEach / forEachOrNull / unionAll only. `repeat` stays rejected (Stage 3) and %rowIndex-over-repeat is Stage 4. Tests: 298 pass / 0 fail (baseline 283). Refs #27, #1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Simplify %rowIndex ordinal plumbing (review cleanup) - Fold bindRowIndex into unnestFrom (single source for chain/fork) and route the iterating unionAll branch through unnestFrom too, removing three duplicated bind sites and the cosmetic _b/_u alias split. - Use the existing name("rn") helper instead of inline `rn${++counter}`, matching the file's other generated identifiers. - Drop the redundant `|| "0"` in compilePath/compileColumn; astToSql already defaults rowIndexSql to "0". No behaviour change; 298 pass / 0 fail. Refs #27, #1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Address review: thread rowIndexSql into _invoke; reconcile spec §11 - ddb-sql-builder.js: thread rootVar/rowIndexSql into the _invoke macro-param astToSql call so %rowIndex resolves correctly if _invoke ever accepts path arguments (currently guarded by the literal-only validator, so latent). - ddb-sql-builder.js: drop stray double blank line after the rowIndex case. - SPEC_hybrid.md §11: rename %rowNumber -> %rowIndex, 1-based -> 0-based, and document the forEachOrNull empty-row result as 0 (COALESCE(ord,1)-1), matching the SoF spec and the implementation; repeat windows now subtract 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Stage 3: implement repeat operation in the staged emitter (#29, #1) Implement the `repeat` directive on the hybrid (staged) backend per SPEC_hybrid §5/§10, in both CHAIN and FORK modes. - repeat-lowering.js: staged-relevant helpers (assertSimplePath, jsonFold descent, typedSeed, repeatStructure via a self-contained body-paths walk, childElemOf). Dropped the struct-only reduceSource accumulator. - staged-sql-builder.js: remove the repeat rejection; add emitRepeat (WITH RECURSIVE descent over the listed paths only, excluding the seed + a typed from_json bridge so leaf columns / forEach inside the body compile via the existing typed engine), fq_cast_<N> cast-macro pooling, and FORK keying (carry the fork key + recombine via JOIN USING(rid)). A lone repeat stays a keyless CHAIN. Bridge element bound as `el_r` (not `el`) so an array-navigating body column's `list_transform(el -> ...)` lambda does not shadow the bridge struct. - query-builder.js / view-parser.js: forced-JSON repeat seeds in the read schema (applyForcedJson + repeat-aware viewPaths), and the fq_staged_with / fq_staged_macros template vars. - ddb-sql-builder.js: pathsToJsonStruct + forceJson handling (shared leafSqlType). - duck-macros.js: the fhir_list macro (single-value-vs-array bridge). - templates + test-util: WITH RECURSIVE keyword + cast-macro preamble placeholders. - Fixtures: official repeat.json + staged-repeat.test.js (from_json bridge / choice-type / lambda-collision edge cases), wired into the staged spec runner. %rowIndex over a repeat's own descent scope is Stage 4 (omitted here); %rowIndex for forEach/forEachOrNull keeps working, including inside a repeat body. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Simplify: reuse viewPaths, single-compile seed, hoist bridge-var const Apply /simplify findings: - repeat-lowering.js: drop repeatBodyPaths (a verbatim duplicate of viewPaths' parseNode walker) and call viewPaths on the stripped body; typedSeed now arrayizes from its single compilePath result instead of recompiling each path via B.arrayize. - ddb-sql-builder.js: factor the duplicated JSON object-field rendering in pathsToJsonStruct into a local objOf helper. - staged-sql-builder.js: hoist the `el_r` bridge variable to a BRIDGE_VAR const used by bridgeElem and both bridge CTE emitters, so the alias and the leaf root-var can't silently drift. No behaviour change: bun test 331 pass / 0 fail; chain + fork repeat e2e unchanged (5 / 10 rows). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Review fixes: %rowIndex in shared-field forEach + applyForcedJson guard Apply /review findings: - HIGH: a `forEach` that names the same field as a sibling `repeat` is routed through emitJsonEach (from_json bridge). It bound %rowIndex on the discarded f.childElem instead of the bridge element used for the body, and passed ordName=null so no ORDINALITY was even emitted — %rowIndex was always 0. Now emitJsonEach always mints an ordinal, projects the 0-based %rowIndex as a named bridge column, and binds the body element's rowIndexSql to it. Verified: forEach beside repeat on the same field now reports 0,1,2. - MEDIUM: applyForcedJson force-JSON'd a shallower ancestor on a partial path match (truncating a subtree a typed sibling may need). Guard with a `matched` flag so it only applies when the full path resolves. bun test 331 pass / 0 fail; chain (5 rows) + fork (10 rows) repeat e2e unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Fix fork inside a repeat body; reject %rowIndex over a repeat scope A repeat body that forks (>=2 fan-out children) recombined its branches via JOIN USING(keyCols), where keyCols was constant per resource (rid for a CHAIN repeat, the parent fork key for a FORK branch). With no per-node identity the branches cross-produced across ALL descended nodes of the resource, emitting wrong rows (e.g. item "a" paired with item "b"'s answers). The official repeat.json suite missed this: its repeat+internal-fork cases all expect a single row, which cannot expose the cross-product. Mint a per-node descent key when the body forks: carry the integer descent `path` (per-level WITH ORDINALITY ordinals from seed to node) through both recursive legs, then assign a deterministic pre-order index `(row_number() OVER (PARTITION BY keyCols ORDER BY path) - 1)` at the bridge and thread it as an extra recombination key. ORDER BY path is a total order (paths are unique per node), so the index survives every reference to the bridge without MATERIALIZED. This is the same descent key Stage 4 binds for %rowIndex over a repeat's own scope; here it serves only as a fork key. Also reject %rowIndex resolving to a repeat's own descent scope (Stage 4, unsupported) instead of silently binding it to the constant 0: the repeat bridge element binds rowIndexSql=null and the leaf engine throws on a rowIndex segment with no valid binding. %rowIndex for a forEach/forEachOrNull inside a repeat body is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Stage 4: %rowIndex over repeat (pre-order descent index) Implements `%rowIndex` indexing a `repeat`'s OWN descent scope: a depth-first pre-order index over the recursive descent, completing the hybrid-only migration. The index is assigned at the repeat's own scope (one row per visited node) by `row_number() OVER (PARTITION BY {enclosing-scope key} ORDER BY path)` over the int-list descent path the rCTE already carries, BEFORE any sibling recombination, then materialised as a scalar that survives fork joins: - root repeat partitions by `rid` (forced even with no fork, via `subtreeHasRepeatUsingRowIndex`); - a repeat under a forEach carries the enclosing iteration ordinal as a key so its window restarts per enclosing element; - a repeat-in-repeat partitions the inner window by the outer repeat's own index (chaining via `(rid, outer_rn)`), never PARTITION BY a list. The pre-order window unifies with the existing fork-recombination node key (same value, same `nord{N}` column): computed when any of fork / `%rowIndex` / nested-`%rowIndex`-repeat needs it. Adds tests/spec-tests/row_index_repeat.json (staged-only extended suite, skip:true) + tests/row-index-repeat.test.js (dedicated runner, natural + uuid key modes). Updates the now-obsolete Stage 3 "rejected" guard into a positive assertion. Tests: 334 -> 344 pass, 0 fail (+10 from the new repeat-rowIndex suite). Refs #32, #1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Simplify: share %rowIndex scope traversal + cache subtree detection - Collapse the parallel raw-string / AST `%rowIndex` scope detectors onto one shared `scopeMentionsRowIndex` traversal parameterised by a per-column predicate, so the "a nested fan-out opens its own scope" rule lives in one place instead of being mirrored in four functions. - Memoise `subtreeHasRepeatUsingRowIndex` with a WeakMap, matching the existing `forkCache` pattern (emitScope queries it per fan-out over overlapping subtrees). No behaviour change; 344 pass / 0 fail. Refs #32, #1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lisions A ViewDefinition column name is a validated SQL identifier starting with a letter (`^[A-Za-z][A-Za-z0-9_]*$`). The staged emitter carried internal bookkeeping columns (resource key `rid`, fork ordinals, a repeat's descent `node`/`ord`/`path` and pre-order index `nord`) in the same CTEs as the user's columns, so a view that legally named a column `rid`, `node`, `path`, etc. collided: a hard `Binder Error` (duplicate column) inside a repeat's recursive CTE, or silent value loss in `uuid` root-key mode. Prefix every internal identifier with `_` — the smallest prefix a valid column name can never start with — making the collision impossible rather than merely unlikely. Covers the `name()`-generated relation/column names plus the fixed names `_rid`, `_ord{N}`, `_nord{N}`, `_node`, `_ord`, `_path`, and BRIDGE_VAR. Also move the unknown-type diagnostic in pathsToSchema from console.log (stdout, which corrupts the compiled SQL the CLI prints there) to console.warn (stderr), and document the enforcement mechanism in SPEC_hybrid §projection. New tests/staged-reserved-names.test.js pins that a view may name a column `rid`, `node`, `path`, or `ord` and still compile and return correct rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d test runners tests/spec-tests/ now holds only the official SQL-on-FHIR conformance fixtures (mirror of FHIR/sql-on-fhir.js /tests). Flatquack-specific fixtures move to a new tests/custom-tests/ dir in the same JSON format: staged.json (emitter fork handling) and row_index_repeat.json (extended %rowIndex-over-repeat). repeat.json and row_index.json stay in spec-tests/ — they are official spec features. De-duplicate the staged test harness: - Extract a shared buildStaged helper and a runFixtureSuite runner into test-util.js (replacing the copy-pasted helper in 5 runner files). The runner gains a custom-only per-test `resources` override so one fixture can hold cases with differing input data. - spec.staged.test.js + new custom.staged.test.js are now thin wrappers. - Convert inline-view JS tests to JSON fixtures: reserved_names.json, repeat_extra.json, and a no_id_fork case in staged.json. The only non-fixture bits survive as staged-sql-shape.test.js (asserts emitted SQL) and view-parser.test.js (validateVd units, moved out of the spec runner). Delete the superseded row-index-repeat / staged-repeat / staged-reserved-names / staged-rootkey runners. Route generated *.temp.json resource files to a git-ignored tests/.tmp/ scratch dir instead of polluting the fixture directories. Add tests/README.md documenting the split. Fix scripts/build-test-output.js (it imported a non-existent testQueryTemplate and was broken): use the shared buildStaged, a single shared db, and the scratch dir; regenerates an identical conformance report. Coverage note: drops the 4 redundant natural-EXCEPT-ALL-uuid cross-mode checks (both modes are still independently checked against ground truth). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Re-vendor tests/spec-tests/ from FHIR/sql-on-fhir.js tests/ at commit 70c05667b5fc8f82bfce16a3a2a0e5b142e5c6d2 (branch main, fetched 2026-06-16). This pulls newer official cases, notably repeat.json (+4) and row_index.json (+1, "%rowIndex with repeat"). The vendored revision is recorded in tests/README.md. skip/ fixtures (constant, constant_types, fn_boundary) are refreshed but stay skipped (unimplemented features). Test harness: feed each case only the resources matching its view's declared resource type. Upstream fhirpath.json now mixes an Organization (scalar name) with Patients (array name) in one resource list, which broke flatquack's type-coerced read_json_auto; filtering to the view's resourceType mirrors the per-resource-type engine model and leaves results unchanged (recovers 16 tests). Engine fix: join() over an empty collection now yields NULL, not "" — drop the forced .ifnull2('') in ddb-sql-builder. Matches the new official fn_join cases. Known failure (to be addressed separately): repeat.json "unionAll with repeat and non-repeat branches". A unionAll branch that navigates the repeat-forced JSON field (item.linkId.first()) evaluates to JSON, while the repeat branch yields VARCHAR via its from_json bridge; UNION ALL then fails casting a plain string to JSON. Root cause is forceJson typing not coercing to the column's declared type; fix deferred pending design discussion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ry (#36) Re-type a repeat-forced JSON field back to a typed STRUCT inline at every site that navigates across the boundary: columns (root / chain-forEach / unionAll branches), where paths and where() lambdas, and components/comparison operands. Multi-segment (dotted) seeds re-type at the divergence depth. forcedJsonPaths is the single source of truth; from_json cast macros are pooled. No-op for non-repeat views. Adds a systematic on-demand-typing test matrix + SQL-shape assertions and a design spec (specs/SPEC_ondemand_typing.md). Full suite: 445 tests, 0 fail.

…ocs/ Make SPEC_hybrid.md a self-contained specification of the staged-CTE lowering and drop all references to non-existent resources (Spec A/B/C, benchmark/work SQL, measured timings). Replace the inline JSON leaf machinery with the assumption of an available FHIRPath->SQL translation layer, and focus on the view-to-relational mapping (chain vs fork, lazy fork keys, repeat descent, %rowIndex, projection contract). Rename specs/ -> docs/ and SPEC_hybrid.md -> SPEC_view_lowering.md, and update all references (README, companion spec, source comments, tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul · 2026-06-18T05:57:38Z

master-fix vs cte-fix — full comparison

Benchmark of two flatquack lines compiling the same SQL-on-FHIR ViewDefinitions to DuckDB SQL:

master-fix — master variant, non-staged emitter (benchmark.sql); branch staging/master-fix
cte-fix — cte-staged variant, staged-CTE emitter (benchmark-staged.sql); branch staging/cte-fix

Same DuckDB engine in each cell, so every difference is the generated SQL. Speedup = master ÷ cte (>1 ⇒ cte faster).

Datasets

Synthea-generated FHIR R4 bulk-data (per-resource NDJSON, US Core profiles). Three patient cohorts — s1000 (~1k patients), s10000 (~10k), s100000 (~119k, patients-only export: Patient + Encounter only). questionnaire_response_flat runs on a separate QuestionnaireResponse dataset (Synthea doesn't emit QR): 1,000 QRs at s1000 and 10,000 QRs at s10000 (flattened to 10,268 / 102,470 output rows). Row counts are verified identical across variants.

Measurement

Per cell: flatquack compiles the ViewDefinition → DuckDB SQL, run through the DuckDB CLI in materialize mode (CREATE OR REPLACE TEMP TABLE _sink AS <query>), timed with .timer. 1 warmup + 5 measured iterations, warm (process + OS page cache); we report the mean of the 5 (± stderr omitted here, all < a few %). preserve_insertion_order = false is set for every run (recovers DuckDB 1.5 parallel result-sink; output carries no row order).

DuckDB 1.4.1 (mean ms)

case	s1000 master / cte	×	s10000 master / cte	×
condition_flat	484 / 145	3.3×	4563 / 1158	3.9×
encounter_flat	760 / 230	3.3×	7064 / 1961	3.6×
us_core_blood_pressures	691 / 187	3.7×	6664 / 1656	4.0×
patient_addresses	6 / 6	1.0×	38 / 35	1.1×
patient_and_contact_addresses	6 / 7	1.0×	38 / 41	0.9×
patient_demographics	6 / 6	1.1×	37 / 35	1.0×
questionnaire_response_flat	GEN_ERR / 15	—	GEN_ERR / 110	—

DuckDB 1.5.3 (mean ms)

case	s1000 master / cte	×	s10000 master / cte	×
condition_flat	4781 / 154	31×	48393 / 1261	38×
encounter_flat	6628 / 243	27×	63651 / 2186	29×
us_core_blood_pressures	1643 / 188	8.7×	16887 / 1746	9.7×
patient_addresses	7 / 6	1.1×	49 / 45	1.1×
patient_and_contact_addresses	7 / 7	1.0×	47 / 41	1.2×
patient_demographics	6 / 6	1.0×	49 / 43	1.1×
questionnaire_response_flat	GEN_ERR / 13	—	GEN_ERR / 93	—

s100000 — both engines (~119k patients)

case	master 1.4.1	cte 1.4.1	master 1.5.3	cte 1.5.3	× @1.5.3
patient_addresses	336 ms	319 ms	434 ms	415 ms	1.0×
patient_and_contact_addresses	354 ms	225 ms	451 ms	224 ms	2.0×
patient_demographics	338 ms	338 ms	417 ms	393 ms	1.1×
encounter_flat (6.35M rows)	41.1 s	14.1 s	354.1 s	15.5 s	23×

Takeaways

Heavy flattens (condition / encounter / blood-pressures): cte is 3–4× faster on 1.4.1, and 9–38× faster on 1.5.3.
The 1.5 gap is a master regression, not a cte win. cte is engine-stable (encounter 14.1 → 15.5 s, +10% across 1.4→1.5); master collapses (41 → 354 s, 8.6×, a steady ~9× penalty at every size). This is a DuckDB-1.5 planning interaction with the non-staged SQL, not the preserve_insertion_order issue (already disabled in both).
Trivial Patient projections (patient_addresses, patient_demographics): essentially tied at every size/engine; both pick up the same ~20–30% 1.5 cost on small JSON scans.
patient_and_contact_addresses is the one Patient view where staging wins (2× at scale on both engines).
questionnaire_response_flat: master can't compile it (repeat directive → GEN_ERR); cte runs it. A capability gap, not just a speedup.

Net: cte-fix is equal-or-faster on every case, at every scale, on both engines — and engine-stable across the 1.4→1.5 jump that severely regresses master.

johngrimes

Thanks @piotrszul, it's looking good!

I managed to run it against my AU PIP QI project, but ran into the issue with the breaking change to the template variables. Is this breaking change necessary?

I've also added a few other comments.

Two claims in the spec did not match the staged emitter; soften the wording to describe actual behavior: - preserve_insertion_order = false: the lowering never emits it. Document it as a DuckDB 1.5+ session setting that recovers parallel UNNEST materialisation (a significant regression without it on 1.5+), and add a matching performance note to the README. - shallow shared-array siblings: the emitter always FORKs at >=2 fan-outs; there is no CHAIN exception. Document always-FORK and record the shallow-CHAIN optimization as benchmark-gated in #39. The third claim (WITH ORDINALITY 'zero extra cost') is addressed by a follow-up code change in a separate commit. Addresses review comment on PR #34. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Previously every forEach unnest emitted WITH ORDINALITY because an ordinal was always minted to make %rowIndex available, even when no column read it and no fork needed it as a key (the 'zero extra cost' the spec claimed was actually an always-on cost). Gate the ordinal on (needsOrd || usesRi): emit it only when a fork or a nested %rowIndex repeat below needs it as a key, or when the scope itself binds %rowIndex. When neither holds the unnest is a plain UNNEST(...) and %rowIndex is left unbound (a stray reference is rejected, matching a repeat's no-ordinal default). Reuses the existing scopeUsesRowIndex predicate and the already-conditional unnestClause; applies uniformly to the chain, fork, json-each, and unionAll-branch paths. The unionAll-branch path computes needsOrd/usesRi in prepUnion (the same predicates emitScope uses for chain/fork fan-outs); without them the emitJsonEach/unnestFrom gate read undefined and minted no ordinal, leaving a %rowIndex-reading forEach branch over a repeat-forced field (jsonMode) unbound. The branch ordinal stays private (branches recombine on the parent keyCols, never their own ordinal). Verified: fork keys (_ord{N}) and %rowIndex views still emit ORDINALITY; nested forks still join on (_rid, _ord1, ...). Full suite: 447 pass (adds row_index_union_jsonmode.json covering the unionAll branch). Reconciles the third SPEC_view_lowering claim (lines 139, 278) by making the emitter match the doc rather than softening it. Addresses review comment on PR #34. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ntract Replace the fq_staged_*/fq_where_filter template variables with one fq_sql_* contract built around a named input CTE and a column-clean output CTE. The template binds the data source as `_fq_input` (the resource's elements as its columns) and projects the emitter's `_fq_output`; the emitter's flattening pipeline is sealed between them. This drops the source-splicing hole and the with/src/materialized/tail fragments in favour of: fq_sql_with, fq_sql_input, fq_sql_input_schema, fq_sql_where, fq_sql_pipeline, fq_sql_output, fq_sql_output_columns, fq_sql_macros, fq_sql_view_macros Emitter (staged-sql-builder, query-builder): emit `_fq_src AS (… FROM _fq_input)` reading the template-provided input, seal the MATERIALIZED hint onto `_fq_src`, and add a column-clean `_fq_output` CTE; expose pipeline / input / output names and the output column list (no edge commas, no terminal SELECT). The root pipeline CTE is `_fq_`-prefixed (via the SRC_CTE constant) so it can never collide with a user column, matching the seam CTEs and the SPEC_sql_template_contract §4.2 invariant. Templates: rewrite csv/ndjson/parquet/explore/dbt_model/dbt_prehook. explore materialises its limited input so a forked view stays deterministic. dbt routes both macro blocks to the prehook so the model is a single, materializable SELECT (fixing the prior inline-CREATE-MACRO model). Docs: add docs/SPEC_sql_template_contract.md (the contract), rewrite the README variable table + migration note, and cross-link from SPEC_view_lowering. Tests: migrate the shared harness template to the new contract; add a determinism guard that renders the shipped templates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Marks the fq_sql_ template-variable contract as a new minor release: the fq_staged_*/fq_where_filter/fq_sql_flattening_* variables from v0.2.x are replaced by the input/output CTE contract, a breaking change for custom templates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

piotrszul and others added 4 commits June 15, 2026 20:25

piotrszul added this to Pathling Jun 15, 2026

github-project-automation Bot moved this to Backlog in Pathling Jun 15, 2026

piotrszul moved this from Backlog to In progress in Pathling Jun 15, 2026

piotrszul and others added 3 commits June 16, 2026 10:35

piotrszul mentioned this pull request Jun 16, 2026

Typed column navigating a repeat-forced-JSON field emits JSON, breaking unionAll (repeat + non-repeat branches) #35

Open

piotrszul and others added 2 commits June 17, 2026 21:19

piotrszul requested a review from johngrimes June 18, 2026 06:02

johngrimes requested changes Jun 19, 2026

View reviewed changes

Comment thread src/staged-sql-builder.js

Comment thread src/query-builder.js Outdated

Comment thread docs/SPEC_view_lowering.md Outdated

johngrimes assigned piotrszul Jun 21, 2026

piotrszul mentioned this pull request Jun 23, 2026

Optimization: shallow shared-array siblings could stay a CHAIN instead of always FORK #39

Open

3 tasks

piotrszul and others added 4 commits June 23, 2026 18:41

piotrszul requested a review from johngrimes June 24, 2026 02:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement repeat directive + %rowIndex via a hybrid (staged) SQL emitter#34

Implement repeat directive + %rowIndex via a hybrid (staged) SQL emitter#34
piotrszul wants to merge 13 commits into
masterfrom
issue/1_repeat_directive

piotrszul commented Jun 15, 2026 •

edited

Loading

Uh oh!

piotrszul commented Jun 18, 2026

Uh oh!

johngrimes left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

piotrszul commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Core migration

Post-stage hardening

Verification

Follow-up (not in scope)

Uh oh!

piotrszul commented Jun 18, 2026

master-fix vs cte-fix — full comparison

Datasets

Measurement

DuckDB 1.4.1 (mean ms)

DuckDB 1.5.3 (mean ms)

s100000 — both engines (~119k patients)

Takeaways

Uh oh!

johngrimes left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

piotrszul commented Jun 15, 2026 •

edited

Loading