Add configurable native JSON column type to the loader#15
Conversation
The exists() and empty() FHIRPath translations detect an empty array by comparing the extracted value to the literal '[]'. Over an NVARCHAR(MAX) column JSON_QUERY returns nvarchar and the comparison works, but over SQL Server 2025's native JSON column it returns a json-typed value, so the comparison raises "the data types json and varchar are incompatible". Wrap the compared operand in CAST(... AS NVARCHAR(MAX)) at the five array emptiness sites, matching the cast already used in one empty() branch. The cast is a no-op over NVARCHAR(MAX), so generated results are unchanged, and it makes the generated SQL valid over both column types. Add a database-free regression test asserting no bare '[]' comparison survives.
Add an optional resourceJsonDataType to the NDJSON loader and the load CLI, accepting NVARCHAR(MAX) (default) or JSON. When set to JSON the resources table's json column is created with SQL Server 2025's native JSON type; omitting it leaves behaviour unchanged. The value is validated against a strict allowlist before any database connection is opened, and a mismatch against an existing table's column type emits a warning without altering the table. Conformance is proven against the native type: the Vitest harness reads MSSQL_RESOURCE_JSON_DATA_TYPE and the CI matrix gains a 2025-latest dimension that runs the suite over both column types, while 2017, 2019 and 2022 stay on NVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025 requirement, and a repeatable manual storage and timing benchmark procedure.
The loader integration tests added with the native JSON column type connect through the loader's production connection path, which only trusts the server certificate when MSSQL_TRUST_SERVER_CERTIFICATE is set to "true". The CI SQL Server presents a self-signed certificate, so without that variable the integration suites failed to connect. Set it in the "Run tests" step of both workflows, matching the local .env convention. The loader's secure default of not trusting the certificate is left unchanged.
piotrszul
left a comment
There was a problem hiding this comment.
Looks good, here is minor suggestion:
resolveColumnJsonDataType silently accepts non-compliant existing column types (src/loader/tables.ts:73)
This classifier maps data_type === 'json' → JSON and everything else → NVARCHAR(MAX):
return dataType.trim().toLowerCase() === "json" ? "JSON" : "NVARCHAR(MAX)";The consequence is that an existing table whose json column is actually something incompatible — e.g. VARCHAR(100), NVARCHAR(64), TEXT — gets classified as NVARCHAR(MAX). So the mismatch check sees no mismatch (under the default request) and the loader proceeds to write FHIR resources into a column that can't hold them. At load time that surfaces as truncation / String or binary data would be truncated errors, or — worse, under non-Unicode VARCHAR — silent character corruption. The one safeguard we have for existing tables (FR-008) is blind to exactly the cases that actually lose data.
Related: the _characterMaximumLength parameter is fetched by getExistingJsonColumnType and passed in here, but never read — it's the field that would let us tell NVARCHAR(MAX) (length -1) from a bounded NVARCHAR(64), so the information needed to detect this is already on hand.
Suggestion: rather than coercing unknown types into NVARCHAR(MAX), fail fast — if an existing column is neither native JSON nor NVARCHAR(MAX), raise a clear error naming the offending type (e.g. "existing column [json] is VARCHAR(100); expected NVARCHAR(MAX) or JSON") before any rows are loaded. That turns a late, data-dependent truncation failure into an early, actionable configuration error, in keeping with Principle IV (reject invalid input at the boundary with a meaningful message).
resolveColumnJsonDataType previously classified every non-json data type as NVARCHAR(MAX). An existing table whose json column was actually a bounded or non-Unicode type (VARCHAR(100), NVARCHAR(64), TEXT) was therefore seen as a match under the default request, so the mismatch safeguard passed and the loader wrote FHIR resources into a column that could not hold them - surfacing as truncation errors or silent character corruption. The classifier now uses the character_maximum_length it was already given to distinguish NVARCHAR(MAX) from bounded forms, and fails fast when the existing json column is neither native JSON nor NVARCHAR(MAX), naming the offending type before any rows are loaded (Principle IV).
Adds an optional native JSON column type to the NDJSON loader, letting the resources table store its
jsoncolumn using SQL Server 2025's nativeJSONtype instead of the defaultNVARCHAR(MAX).The loader and the
loadCLI gain aresourceJsonDataTypeoption acceptingNVARCHAR(MAX)(default) orJSON. Omitting it leaves existing behaviour unchanged. The value is validated against a strict allowlist before any database connection is opened, and a mismatch against an existing table's column type emits a warning rather than altering the table.The
exists()andempty()FHIRPath translations previously detected an empty array by comparing the extracted value against the literal'[]'. That works overNVARCHAR(MAX), but over the nativeJSONcolumnJSON_QUERYreturns a json-typed value and the comparison raises "the data types json and varchar are incompatible". The compared operand is now wrapped inCAST(... AS NVARCHAR(MAX))at each array emptiness site, which is a no-op overNVARCHAR(MAX)and makes the generated SQL valid over both column types.Conformance is proven against the native type: the Vitest harness reads
MSSQL_RESOURCE_JSON_DATA_TYPE, and the CI matrix gains a2025-latestdimension that runs the full suite over both column types while 2017, 2019 and 2022 stay onNVARCHAR(MAX). Documentation covers the new option, the SQL Server 2025 requirement, and a repeatable manual storage and timing benchmark procedure.Unfortunately it doesn't seem to improve performance over the string column - if anything the performance is worse. But it may be useful for some users, as you can create indexes on
JSONcolumns, which could be useful for some types of queries.