Skip to content

[Experimental] repeat and %rowIndex implementation (staged SPEC_hybrid emitter)#7

Open
piotrszul wants to merge 19 commits into
masterfrom
spike/1_repeat_mixed_v5
Open

[Experimental] repeat and %rowIndex implementation (staged SPEC_hybrid emitter)#7
piotrszul wants to merge 19 commits into
masterfrom
spike/1_repeat_mixed_v5

Conversation

@piotrszul

@piotrszul piotrszul commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

⚠️ Experimental

This is an experimental implementation of repeat and %rowIndex on the staged (SPEC_hybrid) SQL emitter, developed on a spike/ branch. It is exploratory work intended for review and discussion — not a finalized, production-bound change. The emitter specification it implements is vendored into this branch at docs/SPEC_hybrid.md.

Summary

Brings the staged (SPEC_hybrid) SQL emitter to full %rowIndex parity, including over repeat — the deferred follow-up (add-rowindex-repeat-preorder) from the earlier add-rowindex-staged change.

%rowIndex over a repeat now evaluates to the SQL-on-FHIR depth-first pre-order position (linkId 1, 1.1, 1.2, 2 → 0, 1, 2, 3), restarting per the repeat's enclosing-scope instance. Previously the staged repeat fan-out never threaded a row-index expression, so %rowIndex over a repeat compiled to the default 0 and the row_index.json repeat sub-test was left failing on the staged suite.

Note on scope/size: this is a long-lived spike/ branch — the PR spans the whole staged-emitter + repeat + %rowIndex lineage (18 commits) against master. The %rowIndex-over-repeat work is commit df4f6a9; commit 122d4fa vendors the emitter spec into docs/.

What changed

Mechanism follows docs/SPEC_hybrid.md §11 and the alternative flagged in the add-rowindex-staged design D2:

  • Pre-order numbering. When a repeat body uses %rowIndex, carry a materialized integer descent path ([ord]::BIGINT[] seed leg, list_append(path, ord) recursive leg, each ord from an added WITH ORDINALITY) through the WITH RECURSIVE descent, then assign (row_number() OVER (PARTITION BY <key> ORDER BY path) - 1)::INTEGER at the typed bridge stage and bind it as the body's row-index. Ordering the int[] path element-wise yields pre-order; the index is a scalar assigned before any fork recombination, so it survives joins.
  • Forced partition key. Generalize the lazy fork look-ahead (subtreeHasRepeatUsingRowIndex alongside subtreeHasFork) so the resource key (rid) and enclosing forEach ordinals are minted on the spine down to a %rowIndex repeat even in fork-free views. A repeat nested in a repeat chains its partition by appending the outer repeat's own index ((rid, outer_rn)) — never PARTITION BY a list.
  • Shared-field fixes. emitJsonEach now binds %rowIndex to its ordinal, and the fork jsonMode forEach branch passes its ordinal through.
  • Fully gated on usage. A repeat whose body does not reference %rowIndex emits no ordinality, no path, no window, and no forced rid — byte-for-byte unchanged.

Tests

  • New tests/spec-tests/row_index_repeat.json (skip:true so generic auto-discovery ignores it) + tests/row-index-repeat.test.js: a staged-only runner over five extended cases the official row_index.json doesn't reach — fork-free multi-resource, repeat-under-forEach, repeat-in-repeat, repeat-in-unionAll, repeat-at-a-fork — in both natural and uuid root-key modes. 10/10 pass.
  • The official row_index.json repeat sub-test now passes on the staged suite (was the documented red).
  • Full suite: 499 pass, 1 fail — the single fail is the struct backend's row_index.json repeat sub-test, which is pre-existing and out of scope (struct unchanged; verified it fails identically with this change stashed). Struct pre-order remains a separate documented follow-up.

Spec

  • docs/SPEC_hybrid.md: the staged ("Spec D") emitter specification, vendored into the repo (replacing the earlier docs/repeat-implementation-investigation.md exploratory note).
  • openspec/specs/staged-sql-emitter/spec.md: the %rowIndex requirement now covers repeat pre-order (materialized integer path + path-ordered window, forced enclosing-scope partition, repeat-in-repeat chaining); the prior "out of scope" carve-out is removed. Change archived under openspec/changes/archive/2026-06-06-add-rowindex-repeat-preorder.

🤖 Generated with Claude Code

piotrszul and others added 17 commits April 14, 2026 16:10
Adds a specialized skill for SQL on FHIR v2 specification guidance,
with modular reference files covering ViewDefinition structure,
FHIRPath subset, operations, and comprehensive examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the analysis of implementing the SQL on FHIR v2 repeat
element for recursive traversal of nested FHIR structures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds two failing test cases (repeat inside forEach, repeat inside repeat)
that surface the scope-leak limitations of the hoisted JSON CTE strategy,
and documents Solution G — inlining the recursive CTE inside the value
expression — which composes naturally inside lambdas and resolves those
limitations without template changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add Solution F: chained CTE with baked-in context for repeat

Document a new repeat strategy that bakes non-repeat columns and
forEach context into the recursive CTE anchor, eliminating the
correlated subquery and fixing forEach/repeat nesting issues from
Solution B. Verified working in DuckDB with unnest and list_transform
in the recursive member.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The old `duckdb` package has no pre-built binaries for Node 25+ and
was not released for DuckDB v1.5+. Migrates to `@duckdb/node-api`
(Node Neo) throughout src and tests, converting callbacks to async/await
and using getRowObjectsJS() for plain JS type coercion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The runtime was migrated to @duckdb/node-api in 0175209 but the benchmark
script still imported the legacy duckdb package (pinned at 1.4.1, where
nested correlated WITH RECURSIVE subqueries are buggy). Switch to the
same client the rest of the project uses so benchmarks reflect the
production engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lower a ViewDefinition into staged CTEs per SPEC_hybrid — carry-forward
chain stages with keyed recombination only at independent fan-out forks —
while reusing the existing typed FHIRPath engine for every leaf.

- ddb-sql-builder: astToSql gains an optional outer root-var (default `el`)
  so a staged scope element can be aliased `node` without colliding with
  internal list_transform `el` lambdas; struct emitter unchanged.
- staged-sql-builder (new): tree-walk + CHAIN/FORK classification, lazy
  integer keys (rid + WITH ORDINALITY ordinals at nested forks), forEach/
  forEachOrNull (LEFT JOIN UNNEST, no padding), unionAll, repeat rejection.
- query-builder: backend: 'staged' | 'struct' selector; `AS result` folded
  into the expanded transform variable (literal dropped from all struct
  templates: ndjson/csv/parquet/dbt_model/explore + test-util).
- tests: spec.staged.test.js runs the official reference suites unmodified
  through the staged backend (repeat excluded) + a repeat-rejection test;
  spec-tests/staged.json adds fork coverage (nested composite key, root
  cross product, forEachOrNull-in-fork) passing on both backends.

Staged backend: all non-repeat official suites pass (96 tests green).
Full suite: 290 pass / 15 fail (the 15 are repeat on the struct backend,
pre-existing baseline — no regression).

OpenSpec change add-staged-sql-emitter archived; capability spec synced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the always-on blocking row_number() OVER () window at root forks
in the staged (SPEC_hybrid) emitter with the natural resource key
(getResourceKey()/id) as the default recombination key, plus an opt-in
uuid() mode for inputs where resource-id uniqueness can't be assumed.

- staged-sql-builder: rootKey "natural"|"uuid" option; natural projects
  the resource key AS rid, uuid projects uuid() AS rid and signals
  srcMaterialized; lazily exposes resourceKeyAst so the typed read schema
  includes id even when the view doesn't project it.
- query-builder: thread rootKey through buildQuery/templateToQuery; fold
  resourceKeyAst into the schema paths; add fq_staged_src_materialized.
- cli: surface --backend and --root-key.
- templates (test-util, gen-staged): emit WITH src AS MATERIALIZED in
  uuid mode.
- tests: run the staged spec harness in both modes; new staged-rootkey
  test proves multiset identity vs the struct oracle (EXCEPT ALL = 0 both
  directions) and the no-id root-fork case.

Archives the root-fork-resource-key OpenSpec change. Tasks 4.3/5.1
(benchmark regen, SPEC_hybrid docs) deferred — those artifacts are not
present in this worktree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The staged backend previously rejected `repeat` and fell back to the
struct emitter. `repeat` is the one unbounded-depth directive, so it
cannot live on the typed substrate: lower it by oscillating the binding
between typed and JSON — the recursive descent runs in JSON via a
WITH RECURSIVE CTE (the fhir_list fold), while every column/forEach/where
is still evaluated typed against a per-scope from_json bridge element.

- JSON descent: WITH RECURSIVE CTE; seed = typed navigation of the listed
  paths off the enclosing element, recursive leg = fhir_list JSON fold
  (single step, multi-step flatten, list_concat across paths). New
  fhir_list macro.
- Typed bridge: from_json(node, <structure>) AS el (lenient — NULL-fills
  absent FHIR fields, keeps nested-repeat seeds raw as JSON[]); factored
  into shared fq_cast_* macros.
- One schema rule: parseVd surfaces a repeat seed childless -> JSON[];
  pathsToJsonStruct renders the from_json structure off the same tree;
  forceJson makes "repeat wins" over a sibling forEach (shared field).
- Chain/fork/union: lone repeat = chain; repeat among siblings = fork
  (JOIN USING the resource/composite key); repeat inside unionAll unions
  keyed branches; shared-field forEach reads via the bridge (json mode).
- WITH RECURSIVE lead threaded through both templates and gen-staged.

All 15 official repeat.json cases pass on the staged backend in both
rootKey modes, plus a new typed-bridge choice-type test. Unbounded depth
verified. Spec synced; change archived.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A repeat scope binds its element via the from_json typed bridge aliased
`AS el` (ref:"el"). The leaf compiler hardcoded the list_transform /
list_filter lambda parameter as `el` too, so a column *directly in a
repeat scope* that navigates an array compiled to

    (el.answer).list_transform(el -> el.valueString)

where the lambda `el` collides with the bridge column `el` — DuckDB
resolves `el.valueString` against the bridge struct (keys linkId/answer),
raising "Binder Error: Could not find key valuestring in struct". The
official repeat.json suite never hit it (its repeat-scope columns are
scalar, or reach arrays only through a forEach, which uses ref:"node").

Give every list_transform/list_filter lambda the reserved variable `__el`
(constant L in ddb-sql-builder.js), which no scope element ref can ever
be, so a lambda parameter can never collide with the element it is
applied to. The lambda bodies are compiled with rootVar=L to match, and
the inLambda element prefixes in _splitPath/_forEach now use rootVar
(correct for both backends; previously hardcoded "el" worked only because
the struct backend always uses "el" inside lambdas).

Regression test added in tests/staged-repeat.test.js (a collection column
navigating an array directly inside a repeat scope). Full suite: the only
remaining failures are the pre-existing struct-backend repeat.json cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lower `repeat` on the struct backend so the default emitter — the oracle the
staged emitter is validated against — passes all 15 official repeat.json cases.

A repeat becomes a forEach over a bounded list_reduce accumulator: the JSON
frontier is descended via the fhir_list fold, each frontier is bridged to typed
elements with an inline from_json, and the body is the existing two-phase
machinery (re-rooted at the repeat element type). The accumulator computes the
descent once per step (frontier in `cur`, `acc` collects the previous frontier),
accumulates truncated typed elements, and projects the body once after the
reduce. Recursion depth is bounded and configurable (--repeat-depth, default 10).

- src/repeat-lowering.js: shared helpers (assertSimplePath, jsonFold, typedSeed,
  repeatStructure, childElemOf) extracted from the staged emitter, plus the
  struct reduceSource builder.
- src/view-parser.js: structRepeat mode emits a _repeat fan-out (and _jsoneach
  for the repeat-beside-forEach shared-field case) instead of the _rseed shim;
  validateVd rejects repeat+forEach.
- src/ddb-sql-builder.js: _repeat/_jsoneach astToSql cases; repeat table type
  cross-joins; nav outputType carries schemaPath.
- src/query-builder.js: orchestrates the schema-aware repeat context, forces
  repeat seeds to JSON[], threads repeatDepth.
- src/cli.js: --repeat-depth option.
- tests/struct-repeat.test.js: struct/staged equivalence on repeat.json,
  choice-type ofType() and lambda-collision guards, bounded-depth coverage.
- openspec: new struct-sql-emitter capability spec; change archived.

Known limitation (tracked in #5): nested repeat builds a large per-resource
intermediate that spills to disk at scale; staged WITH RECURSIVE is the
streaming option. scripts/gen-struct.js generates struct SQL for benchmarking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the SQL-on-FHIR %rowIndex environment variable on the struct
backend: the 0-based position within the nearest enclosing iteration
(forEach/forEachOrNull/repeat), or 0 at the resource root and in
non-iterating unionAll branches.

- fhirpath-parser: recognize %rowIndex as a contextual segment rather than
  routing it through the --var lookup (no longer throws "not defined").
- ddb-sql-builder: thread a rowIndexSql expression; real forEach/repeat emit
  an indexed list_transform lambda binding ((__idx-1)::INTEGER); source-side
  ifnull2 padding for forEachOrNull so the null row reports index 0; new
  _project directive for column projections that pass the enclosing index
  through instead of opening a new scope.
- view-parser: emit _project (not _forEach) for projection wrappers (root,
  select, unionAll branch).
- tests: exclude row_index from the staged suite (staged parity is a
  follow-up); update view parser-output assertions for _project.

8/9 row_index sub-tests pass. The repeat sub-case is left failing: the struct
repeat lowering accumulates breadth-first while the spec pins depth-first
pre-order numbering (documented as a Non-Goal). Change archived under
openspec/changes/archive/2026-06-05-add-rowindex-struct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings the staged backend to parity with the struct backend on the
SQL-on-FHIR %rowIndex environment variable (the 0-based position within
the nearest enclosing iteration). The shared FHIRPath front end already
recognizes %rowIndex as a contextual segment, and astToSql already takes
a rowIndexSql parameter; the staged compiler just never supplied it, so
%rowIndex silently compiled to "0" everywhere (hence the suite exclusion).

The staged emitter realizes forEach as UNNEST(...) WITH ORDINALITY, which
already yields a 1-based ordinal, so %rowIndex is (ord - 1)::INTEGER.

- staged-sql-builder: thread rowIndexSql on the element descriptor (rootElem
  -> "0"); compilePath/compileColumn/arrayize pass it to astToSql. Emit the
  iteration ordinal when the scope forks OR uses %rowIndex, adding it to the
  recombination childKey only for forks (the index-only case reads it in the
  immediate stage). forEachOrNull binds (coalesce(ord,1)-1) so the null row
  reports 0. unionAll: forEach branches get their own ordinal (independent
  numbering); columns-only branches inherit the enclosing scope's index.
  scopeUsesRowIndex detects usage by parsing for the rowIndex segment.
- tests: un-exclude row_index.json from the staged suite.
- specs: add the %rowIndex requirement to the staged-sql-emitter capability.

8/9 row_index sub-tests pass on staged. The repeat sub-case is left failing
(WITH RECURSIVE descends breadth-first; the spec pins depth-first pre-order),
documented as a Non-Goal and a follow-up (add-rowindex-repeat-preorder), exactly
as the struct suite already carries it. Change archived under
openspec/changes/archive/2026-06-06-add-rowindex-staged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes the deferred follow-up (`add-rowindex-repeat-preorder`) from the
`add-rowindex-staged` change: `%rowIndex` over a `repeat` now evaluates to the
SQL-on-FHIR depth-first pre-order position (`linkId 1, 1.1, 1.2, 2 -> 0, 1, 2,
3`), restarting per the repeat's enclosing-scope instance. Previously the staged
`repeat` fan-out never threaded a row-index expression, so `%rowIndex` over a
repeat compiled to the default `0`, and the `row_index.json` repeat sub-test was
left failing on the staged suite (mirroring struct).

Mechanism (SPEC_hybrid section 11, and the alternative flagged in the
add-rowindex-staged design D2):

- staged-sql-builder: when a repeat body uses `%rowIndex`, carry a materialized
  integer descent `path` (`[ord]::BIGINT[]` seed leg, `list_append(path, ord)`
  recursive leg, each ord from an added WITH ORDINALITY) through the WITH
  RECURSIVE descent, then assign `(row_number() OVER (PARTITION BY <key> ORDER BY
  path) - 1)::INTEGER` at the typed bridge stage and bind it as the body's
  rowIndexSql. Ordering the int[] path element-wise yields pre-order; the index
  is a scalar assigned before any fork recombination, so it survives joins.
- Force the partition key: generalize the lazy fork look-ahead so the resource
  key (rid) and enclosing forEach ordinals are minted on the spine down to a
  `%rowIndex` repeat even in fork-free views (subtreeHasRepeatUsingRowIndex
  alongside subtreeHasFork). A repeat nested in a repeat chains its partition by
  appending the outer repeat's own index ((rid, outer_rn)) -- never PARTITION BY
  a list.
- emitJsonEach now binds `%rowIndex` to its ordinal, and the fork jsonMode forEach
  branch passes its ordinal through (the shared-field cases).
- Fully gated on usage: a repeat whose body does not reference `%rowIndex` emits
  no ordinality, no path, no window, and no forced rid -- byte-for-byte unchanged.

Tests:
- tests/spec-tests/row_index_repeat.json (new, skip:true so the generic
  auto-discovery suites ignore it) + tests/row-index-repeat.test.js: a staged-only
  runner over five extended cases the official row_index.json does not reach --
  fork-free multi-resource, repeat-under-forEach, repeat-in-repeat,
  repeat-in-unionAll, repeat-at-a-fork -- in both natural and uuid root-key modes.
- spec.staged.test.js: the official row_index.json repeat sub-test now passes
  (comment updated; EXCLUDE was already null).
- struct backend unchanged; its row_index.json repeat red remains a separate
  documented follow-up.
- specs/staged-sql-emitter: the `%rowIndex` requirement now covers repeat
  pre-order; the "out of scope" carve-out removed. Change archived under
  openspec/changes/archive/2026-06-06-add-rowindex-repeat-preorder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@piotrszul piotrszul changed the title %rowIndex for repeat on the staged (SPEC_hybrid) SQL emitter [Experimental] repeat and %rowIndex implementation (staged SPEC_hybrid emitter) Jun 6, 2026
piotrszul and others added 2 commits June 6, 2026 18:27
Vendor the staged (SPEC_hybrid / "Spec D") emitter specification into the repo
as `docs/SPEC_hybrid.md` so it is browsable on the PR and travels with the code
it specifies. Removes the superseded `docs/repeat-implementation-investigation.md`
exploratory note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bring the built-in template set to parity for the staged (SPEC_hybrid)
backend. Previously every shipped @template only referenced struct vars,
so --backend staged --template @csv could not emit a real output query.

- cli.js: resolve @name templates per --backend, preferring
  templates/staged/<name>.sql and falling back to templates/<name>.sql
- templates/staged/{csv,parquet,ndjson,explore,dbt_model}.sql: new
  parallel set wrapping the staged CTE skeleton in each output stage,
  emitting both base and view-specific macros; explore limits source
  rows inside the src CTE to match struct
- scripts/gen-staged.js: removed (superseded by the shipped templates)
- README.md: document --backend/--root-key/--repeat-depth, the Backends
  section, backend-aware resolution, and the fq_staged_* variables
- openspec: add the templates capability spec; archive the change

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant