Add configurable native JSON column type to the loader by johngrimes · Pull Request #15 · aehrc/sof-mssql

johngrimes · 2026-06-21T04:27:02Z

Adds an optional native JSON column type to the NDJSON loader, letting the resources table store its json column using SQL Server 2025's native JSON type instead of the default NVARCHAR(MAX).

The loader and the load CLI gain a resourceJsonDataType option accepting NVARCHAR(MAX) (default) or JSON. Omitting it leaves existing behaviour unchanged. The value is validated against a strict allowlist before any database connection is opened, and a mismatch against an existing table's column type emits a warning rather than altering the table.

The exists() and empty() FHIRPath translations previously detected an empty array by comparing the extracted value against the literal '[]'. That works over NVARCHAR(MAX), but over the native JSON column JSON_QUERY returns a json-typed value and the comparison raises "the data types json and varchar are incompatible". The compared operand is now wrapped in CAST(... AS NVARCHAR(MAX)) at each array emptiness site, which is a no-op over NVARCHAR(MAX) and makes the generated SQL valid over both column types.

Conformance is proven against the native type: the Vitest harness reads MSSQL_RESOURCE_JSON_DATA_TYPE, and the CI matrix gains a 2025-latest dimension that runs the full suite over both column types while 2017, 2019 and 2022 stay on NVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025 requirement, and a repeatable manual storage and timing benchmark procedure.

Unfortunately it doesn't seem to improve performance over the string column - if anything the performance is worse. But it may be useful for some users, as you can create indexes on JSON columns, which could be useful for some types of queries.

The exists() and empty() FHIRPath translations detect an empty array by comparing the extracted value to the literal '[]'. Over an NVARCHAR(MAX) column JSON_QUERY returns nvarchar and the comparison works, but over SQL Server 2025's native JSON column it returns a json-typed value, so the comparison raises "the data types json and varchar are incompatible". Wrap the compared operand in CAST(... AS NVARCHAR(MAX)) at the five array emptiness sites, matching the cast already used in one empty() branch. The cast is a no-op over NVARCHAR(MAX), so generated results are unchanged, and it makes the generated SQL valid over both column types. Add a database-free regression test asserting no bare '[]' comparison survives.

Add an optional resourceJsonDataType to the NDJSON loader and the load CLI, accepting NVARCHAR(MAX) (default) or JSON. When set to JSON the resources table's json column is created with SQL Server 2025's native JSON type; omitting it leaves behaviour unchanged. The value is validated against a strict allowlist before any database connection is opened, and a mismatch against an existing table's column type emits a warning without altering the table. Conformance is proven against the native type: the Vitest harness reads MSSQL_RESOURCE_JSON_DATA_TYPE and the CI matrix gains a 2025-latest dimension that runs the suite over both column types, while 2017, 2019 and 2022 stay on NVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025 requirement, and a repeatable manual storage and timing benchmark procedure.

The loader integration tests added with the native JSON column type connect through the loader's production connection path, which only trusts the server certificate when MSSQL_TRUST_SERVER_CERTIFICATE is set to "true". The CI SQL Server presents a self-signed certificate, so without that variable the integration suites failed to connect. Set it in the "Run tests" step of both workflows, matching the local .env convention. The loader's secure default of not trusting the certificate is left unchanged.

piotrszul

Looks good, here is minor suggestion:

resolveColumnJsonDataType silently accepts non-compliant existing column types (src/loader/tables.ts:73)

This classifier maps data_type === 'json' → JSON and everything else → NVARCHAR(MAX):

return dataType.trim().toLowerCase() === "json" ? "JSON" : "NVARCHAR(MAX)";

The consequence is that an existing table whose json column is actually something incompatible — e.g. VARCHAR(100), NVARCHAR(64), TEXT — gets classified as NVARCHAR(MAX). So the mismatch check sees no mismatch (under the default request) and the loader proceeds to write FHIR resources into a column that can't hold them. At load time that surfaces as truncation / String or binary data would be truncated errors, or — worse, under non-Unicode VARCHAR — silent character corruption. The one safeguard we have for existing tables (FR-008) is blind to exactly the cases that actually lose data.

Related: the _characterMaximumLength parameter is fetched by getExistingJsonColumnType and passed in here, but never read — it's the field that would let us tell NVARCHAR(MAX) (length -1) from a bounded NVARCHAR(64), so the information needed to detect this is already on hand.

Suggestion: rather than coercing unknown types into NVARCHAR(MAX), fail fast — if an existing column is neither native JSON nor NVARCHAR(MAX), raise a clear error naming the offending type (e.g. "existing column [json] is VARCHAR(100); expected NVARCHAR(MAX) or JSON") before any rows are loaded. That turns a late, data-dependent truncation failure into an early, actionable configuration error, in keeping with Principle IV (reject invalid input at the boundary with a meaningful message).

resolveColumnJsonDataType previously classified every non-json data type as NVARCHAR(MAX). An existing table whose json column was actually a bounded or non-Unicode type (VARCHAR(100), NVARCHAR(64), TEXT) was therefore seen as a match under the default request, so the mismatch safeguard passed and the loader wrote FHIR resources into a column that could not hold them - surfacing as truncation errors or silent character corruption. The classifier now uses the character_maximum_length it was already given to distinguish NVARCHAR(MAX) from bounded forms, and fails fast when the existing json column is neither native JSON nor NVARCHAR(MAX), naming the offending type before any rows are loaded (Principle IV).

johngrimes added 4 commits June 21, 2026 10:17

Update skills

7d2bb8a

johngrimes marked this pull request as ready for review June 21, 2026 04:39

johngrimes added the enhancement New feature or request label Jun 21, 2026

johngrimes added this to Pathling Jun 21, 2026

github-project-automation Bot moved this to Backlog in Pathling Jun 21, 2026

johngrimes moved this from Backlog to In progress in Pathling Jun 21, 2026

johngrimes requested a review from piotrszul June 21, 2026 04:39

johngrimes assigned piotrszul Jun 21, 2026

piotrszul reviewed Jun 22, 2026

View reviewed changes

johngrimes merged commit 1d86b9a into main Jun 22, 2026
5 checks passed

johngrimes deleted the feature/native-json-column-type branch June 22, 2026 21:25

github-project-automation Bot moved this from In progress to Done in Pathling Jun 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add configurable native JSON column type to the loader#15

Add configurable native JSON column type to the loader#15
johngrimes merged 5 commits into
mainfrom
feature/native-json-column-type

johngrimes commented Jun 21, 2026 •

edited

Loading

Uh oh!

piotrszul left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

johngrimes commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

piotrszul left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

johngrimes commented Jun 21, 2026 •

edited

Loading