Skip to content

Add parquet-variant transformation to/from JSON, including shredding #8391

Open
AdamGS wants to merge 5 commits into
developfrom
adamg/variant-json-handling
Open

Add parquet-variant transformation to/from JSON, including shredding #8391
AdamGS wants to merge 5 commits into
developfrom
adamg/variant-json-handling

Conversation

@AdamGS

@AdamGS AdamGS commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds two new expression to vortex-parquet-variant:

  1. VariantToJson converting Variant arrays to a JSON array, including "unshredding" them
  2. JsonToVariant - takes a JSON/Utf8 array and a shredding spec, returning a new ParquetVariant array, shredded according to the spec.

Notes

An option I have considered here is to introduce a third crate to include this logic - vortex-parquet-variant-json, or using a "json" feature for vortex-parquet-variant.

@AdamGS AdamGS mentioned this pull request Jun 12, 2026
@AdamGS AdamGS force-pushed the adamg/variant-json-handling branch from 860196c to b9ae199 Compare June 12, 2026 16:27
@codspeed-hq

codspeed-hq Bot commented Jun 12, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 10 improved benchmarks
❌ 9 regressed benchmarks
✅ 1525 untouched benchmarks
🆕 21 new benchmarks
⏩ 11 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 20.6 µs 35.1 µs -41.29%
Simulation chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] 120.7 µs 181.8 µs -33.61%
Simulation encode_varbin[(1000, 2)] 176.1 µs 234 µs -24.78%
Simulation decompress_rd[f64, (100000, 0.0)] 845.5 µs 1,025.5 µs -17.56%
Simulation bench_many_codes_few_values[1024] 393.2 µs 466.8 µs -15.75%
Simulation decompress_rd[f32, (100000, 0.0)] 499.1 µs 587.8 µs -15.08%
Simulation varbinview_large 112.2 µs 131 µs -14.41%
Simulation bitwise_not_vortex_buffer_mut[128] 186.1 ns 215.3 ns -13.55%
Simulation bitwise_not_vortex_buffer_mut[1024] 246.4 ns 275.6 ns -10.58%
Simulation sum_i32_nullable_all_valid 69.2 µs 35.5 µs +95.26%
Simulation null_count_run_end[(10000, 4, 0.01)] 125.4 µs 91.4 µs +37.24%
Simulation chunked_varbinview_opt_canonical_into[(1000, 10)] 214.3 µs 177.6 µs +20.62%
Simulation encode_varbinview[(1000, 2)] 189 µs 157.3 µs +20.12%
Simulation chunked_varbinview_opt_into_canonical[(1000, 10)] 229.3 µs 193.4 µs +18.55%
Simulation take_10k_contiguous 252.8 µs 218.2 µs +15.87%
Simulation decompress_rd[f64, (100000, 0.01)] 981.2 µs 846.8 µs +15.87%
Simulation decompress_rd[f64, (100000, 0.1)] 981.2 µs 846.9 µs +15.86%
Simulation and_bool_nullable 93.7 µs 82.6 µs +13.34%
Simulation baseline_lt[4, 1024] 78.5 µs 69.5 µs +12.98%
🆕 Simulation cast_i32_to_u32[65536] N/A 836.8 µs N/A
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing adamg/variant-json-handling (d78dcaa) with develop (679e2c5)2

Open in CodSpeed

Footnotes

  1. 11 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on develop (60d165d) during the generation of this report, so 679e2c5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@AdamGS AdamGS force-pushed the adamg/variant-json-handling branch 3 times, most recently from 3e823bc to 0bc0405 Compare June 17, 2026 11:30
@AdamGS AdamGS added the changelog/feature A new feature label Jun 17, 2026
@AdamGS AdamGS force-pushed the adamg/variant-json-handling branch 3 times, most recently from 67592eb to 6b3a236 Compare June 17, 2026 15:43
@AdamGS AdamGS marked this pull request as ready for review June 17, 2026 15:59
@AdamGS AdamGS requested a review from a team June 17, 2026 15:59
AdamGS added 5 commits June 17, 2026 16:59
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS force-pushed the adamg/variant-json-handling branch from 6b3a236 to d78dcaa Compare June 17, 2026 15:59
@AdamGS AdamGS changed the title Json-Variant exprs Add parquet-variant transformation to/from JSON, including shredding Jun 17, 2026
@AdamGS AdamGS requested review from mprammer and robert3005 June 17, 2026 15:59

@gatesn gatesn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One sec!

/// `variant_to_json` emits fields in a canonical order, not source order.
/// - Number formatting and precision change: e.g. `1.0` may render as `1`, exponent forms and
/// very large numbers are re-rendered, and floating-point values are re-encoded.
/// - Duplicate object keys are collapsed to a single entry.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the key value pair that gets removed well defined? Like is it whichever one comes first wins?

@gatesn

gatesn commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Do these functions mix DType (JSON) with encoding (ParquetVariant)? At least the docs seem to suggest that they do.

We probably want vortex-json to expose to_json and from_json functions (or maybe this just a cast kernel) and then implement them using execute_parent for ParquetVariant.

But it seems like you are actually defining a more general DType::JSON -> DType::Variant (any variant encoding) function? In which case, it cannot be a cast because you want the shredding options, but the definition should probably still live in the vortex-json crate, and implement general-purpose shredding (per the canonical Variant array), without touching the non-shredded data?

fn return_dtype(&self, _options: &Self::Options, arg_dtypes: &[DType]) -> VortexResult<DType> {
let input_dtype = &arg_dtypes[0];
vortex_ensure!(
matches!(input_dtype, DType::Variant(_)),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have an is_variant method now

Comment on lines +157 to +159
fn is_null_sensitive(&self, _options: &Self::Options) -> bool {
false
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the null semantics of internal nulls and nested nulls (afaik these are not the same things) documented anywhere for the json and variant conversions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between null and "VariantNull" are documented in a few places around variant and ParquetVariant.

@connortsui20 connortsui20 Jun 18, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I meant that it might be worth justifying why this is not null sensitive (Im guessing because variant null semantics are going to be different from the top-level null semantics, but that might not be super obvious to everyone), because I am guessing this function is definitely variant null sensitive

fn return_dtype(&self, _options: &Self::Options, arg_dtypes: &[DType]) -> VortexResult<DType> {
let input_dtype = &arg_dtypes[0];
vortex_ensure!(
input_dtype.is_utf8()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want Vortex to essentially "autocast" from UTF8 to JSON and then to Variant? IMO it makes more sense to wrap a UTF8 array in JSON (so the validation logic can live there) and then we are only allowed to shred the JSON after it has been validated (which fits nicely because the JSON storage already might already be shredded variant).

I guess it is kind of okay for now because json storage array is just a utf8 array, but in that case we should be more intentional with our conversions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Json is an extension type, I don't see a reason to force users to go through it if they already have strings.

@AdamGS

AdamGS commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@gatesn I've really tried to keep expressions that can't be evaluated directly from the core. I thought execute_parent rules were less flexible but I guess I can try and do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants