Skip to content

Add configurable native JSON column type to the loader#15

Merged
johngrimes merged 5 commits into
mainfrom
feature/native-json-column-type
Jun 22, 2026
Merged

Add configurable native JSON column type to the loader#15
johngrimes merged 5 commits into
mainfrom
feature/native-json-column-type

Conversation

@johngrimes

@johngrimes johngrimes commented Jun 21, 2026

Copy link
Copy Markdown
Member

Adds an optional native JSON column type to the NDJSON loader, letting the resources table store its json column using SQL Server 2025's native JSON type instead of the default NVARCHAR(MAX).

The loader and the load CLI gain a resourceJsonDataType option accepting NVARCHAR(MAX) (default) or JSON. Omitting it leaves existing behaviour unchanged. The value is validated against a strict allowlist before any database connection is opened, and a mismatch against an existing table's column type emits a warning rather than altering the table.

The exists() and empty() FHIRPath translations previously detected an empty array by comparing the extracted value against the literal '[]'. That works over NVARCHAR(MAX), but over the native JSON column JSON_QUERY returns a json-typed value and the comparison raises "the data types json and varchar are incompatible". The compared operand is now wrapped in CAST(... AS NVARCHAR(MAX)) at each array emptiness site, which is a no-op over NVARCHAR(MAX) and makes the generated SQL valid over both column types.

Conformance is proven against the native type: the Vitest harness reads MSSQL_RESOURCE_JSON_DATA_TYPE, and the CI matrix gains a 2025-latest dimension that runs the full suite over both column types while 2017, 2019 and 2022 stay on NVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025 requirement, and a repeatable manual storage and timing benchmark procedure.

Unfortunately it doesn't seem to improve performance over the string column - if anything the performance is worse. But it may be useful for some users, as you can create indexes on JSON columns, which could be useful for some types of queries.

The exists() and empty() FHIRPath translations detect an empty array by
comparing the extracted value to the literal '[]'. Over an NVARCHAR(MAX)
column JSON_QUERY returns nvarchar and the comparison works, but over SQL
Server 2025's native JSON column it returns a json-typed value, so the
comparison raises "the data types json and varchar are incompatible".

Wrap the compared operand in CAST(... AS NVARCHAR(MAX)) at the five array
emptiness sites, matching the cast already used in one empty() branch. The
cast is a no-op over NVARCHAR(MAX), so generated results are unchanged, and
it makes the generated SQL valid over both column types. Add a database-free
regression test asserting no bare '[]' comparison survives.
Add an optional resourceJsonDataType to the NDJSON loader and the load CLI,
accepting NVARCHAR(MAX) (default) or JSON. When set to JSON the resources
table's json column is created with SQL Server 2025's native JSON type;
omitting it leaves behaviour unchanged. The value is validated against a
strict allowlist before any database connection is opened, and a mismatch
against an existing table's column type emits a warning without altering the
table.

Conformance is proven against the native type: the Vitest harness reads
MSSQL_RESOURCE_JSON_DATA_TYPE and the CI matrix gains a 2025-latest dimension
that runs the suite over both column types, while 2017, 2019 and 2022 stay on
NVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025
requirement, and a repeatable manual storage and timing benchmark procedure.
The loader integration tests added with the native JSON column type
connect through the loader's production connection path, which only
trusts the server certificate when MSSQL_TRUST_SERVER_CERTIFICATE is
set to "true". The CI SQL Server presents a self-signed certificate,
so without that variable the integration suites failed to connect.
Set it in the "Run tests" step of both workflows, matching the local
.env convention. The loader's secure default of not trusting the
certificate is left unchanged.
@johngrimes johngrimes marked this pull request as ready for review June 21, 2026 04:39
@johngrimes johngrimes added the enhancement New feature or request label Jun 21, 2026
@github-project-automation github-project-automation Bot moved this to Backlog in Pathling Jun 21, 2026
@johngrimes johngrimes moved this from Backlog to In progress in Pathling Jun 21, 2026
@johngrimes johngrimes requested a review from piotrszul June 21, 2026 04:39

@piotrszul piotrszul left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, here is minor suggestion:

resolveColumnJsonDataType silently accepts non-compliant existing column types (src/loader/tables.ts:73)

This classifier maps data_type === 'json'JSON and everything elseNVARCHAR(MAX):

return dataType.trim().toLowerCase() === "json" ? "JSON" : "NVARCHAR(MAX)";

The consequence is that an existing table whose json column is actually something incompatible — e.g. VARCHAR(100), NVARCHAR(64), TEXT — gets classified as NVARCHAR(MAX). So the mismatch check sees no mismatch (under the default request) and the loader proceeds to write FHIR resources into a column that can't hold them. At load time that surfaces as truncation / String or binary data would be truncated errors, or — worse, under non-Unicode VARCHAR — silent character corruption. The one safeguard we have for existing tables (FR-008) is blind to exactly the cases that actually lose data.

Related: the _characterMaximumLength parameter is fetched by getExistingJsonColumnType and passed in here, but never read — it's the field that would let us tell NVARCHAR(MAX) (length -1) from a bounded NVARCHAR(64), so the information needed to detect this is already on hand.

Suggestion: rather than coercing unknown types into NVARCHAR(MAX), fail fast — if an existing column is neither native JSON nor NVARCHAR(MAX), raise a clear error naming the offending type (e.g. "existing column [json] is VARCHAR(100); expected NVARCHAR(MAX) or JSON") before any rows are loaded. That turns a late, data-dependent truncation failure into an early, actionable configuration error, in keeping with Principle IV (reject invalid input at the boundary with a meaningful message).

resolveColumnJsonDataType previously classified every non-json data type
as NVARCHAR(MAX). An existing table whose json column was actually a
bounded or non-Unicode type (VARCHAR(100), NVARCHAR(64), TEXT) was
therefore seen as a match under the default request, so the mismatch
safeguard passed and the loader wrote FHIR resources into a column that
could not hold them - surfacing as truncation errors or silent character
corruption.

The classifier now uses the character_maximum_length it was already
given to distinguish NVARCHAR(MAX) from bounded forms, and fails fast
when the existing json column is neither native JSON nor NVARCHAR(MAX),
naming the offending type before any rows are loaded (Principle IV).
@johngrimes johngrimes merged commit 1d86b9a into main Jun 22, 2026
5 checks passed
@johngrimes johngrimes deleted the feature/native-json-column-type branch June 22, 2026 21:25
@github-project-automation github-project-automation Bot moved this from In progress to Done in Pathling Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants