Add parquet-variant transformation to/from JSON, including shredding #8391
Add parquet-variant transformation to/from JSON, including shredding #8391AdamGS wants to merge 5 commits into
Conversation
860196c to
b9ae199
Compare
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
20.6 µs | 35.1 µs | -41.29% |
| ❌ | Simulation | chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] |
120.7 µs | 181.8 µs | -33.61% |
| ❌ | Simulation | encode_varbin[(1000, 2)] |
176.1 µs | 234 µs | -24.78% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.0)] |
845.5 µs | 1,025.5 µs | -17.56% |
| ❌ | Simulation | bench_many_codes_few_values[1024] |
393.2 µs | 466.8 µs | -15.75% |
| ❌ | Simulation | decompress_rd[f32, (100000, 0.0)] |
499.1 µs | 587.8 µs | -15.08% |
| ❌ | Simulation | varbinview_large |
112.2 µs | 131 µs | -14.41% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] |
186.1 ns | 215.3 ns | -13.55% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[1024] |
246.4 ns | 275.6 ns | -10.58% |
| ⚡ | Simulation | sum_i32_nullable_all_valid |
69.2 µs | 35.5 µs | +95.26% |
| ⚡ | Simulation | null_count_run_end[(10000, 4, 0.01)] |
125.4 µs | 91.4 µs | +37.24% |
| ⚡ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
214.3 µs | 177.6 µs | +20.62% |
| ⚡ | Simulation | encode_varbinview[(1000, 2)] |
189 µs | 157.3 µs | +20.12% |
| ⚡ | Simulation | chunked_varbinview_opt_into_canonical[(1000, 10)] |
229.3 µs | 193.4 µs | +18.55% |
| ⚡ | Simulation | take_10k_contiguous |
252.8 µs | 218.2 µs | +15.87% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.01)] |
981.2 µs | 846.8 µs | +15.87% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.1)] |
981.2 µs | 846.9 µs | +15.86% |
| ⚡ | Simulation | and_bool_nullable |
93.7 µs | 82.6 µs | +13.34% |
| ⚡ | Simulation | baseline_lt[4, 1024] |
78.5 µs | 69.5 µs | +12.98% |
| 🆕 | Simulation | cast_i32_to_u32[65536] |
N/A | 836.8 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing adamg/variant-json-handling (d78dcaa) with develop (679e2c5)2
Footnotes
-
11 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
-
No successful run was found on
develop(60d165d) during the generation of this report, so 679e2c5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩
3e823bc to
0bc0405
Compare
67592eb to
6b3a236
Compare
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
6b3a236 to
d78dcaa
Compare
| /// `variant_to_json` emits fields in a canonical order, not source order. | ||
| /// - Number formatting and precision change: e.g. `1.0` may render as `1`, exponent forms and | ||
| /// very large numbers are re-rendered, and floating-point values are re-encoded. | ||
| /// - Duplicate object keys are collapsed to a single entry. |
There was a problem hiding this comment.
Is the key value pair that gets removed well defined? Like is it whichever one comes first wins?
|
Do these functions mix DType (JSON) with encoding (ParquetVariant)? At least the docs seem to suggest that they do. We probably want vortex-json to expose to_json and from_json functions (or maybe this just a cast kernel) and then implement them using execute_parent for ParquetVariant. But it seems like you are actually defining a more general DType::JSON -> DType::Variant (any variant encoding) function? In which case, it cannot be a cast because you want the shredding options, but the definition should probably still live in the vortex-json crate, and implement general-purpose shredding (per the canonical Variant array), without touching the non-shredded data? |
| fn return_dtype(&self, _options: &Self::Options, arg_dtypes: &[DType]) -> VortexResult<DType> { | ||
| let input_dtype = &arg_dtypes[0]; | ||
| vortex_ensure!( | ||
| matches!(input_dtype, DType::Variant(_)), |
There was a problem hiding this comment.
we have an is_variant method now
| fn is_null_sensitive(&self, _options: &Self::Options) -> bool { | ||
| false | ||
| } |
There was a problem hiding this comment.
are the null semantics of internal nulls and nested nulls (afaik these are not the same things) documented anywhere for the json and variant conversions?
There was a problem hiding this comment.
The difference between null and "VariantNull" are documented in a few places around variant and ParquetVariant.
There was a problem hiding this comment.
sorry I meant that it might be worth justifying why this is not null sensitive (Im guessing because variant null semantics are going to be different from the top-level null semantics, but that might not be super obvious to everyone), because I am guessing this function is definitely variant null sensitive
| fn return_dtype(&self, _options: &Self::Options, arg_dtypes: &[DType]) -> VortexResult<DType> { | ||
| let input_dtype = &arg_dtypes[0]; | ||
| vortex_ensure!( | ||
| input_dtype.is_utf8() |
There was a problem hiding this comment.
Do we really want Vortex to essentially "autocast" from UTF8 to JSON and then to Variant? IMO it makes more sense to wrap a UTF8 array in JSON (so the validation logic can live there) and then we are only allowed to shred the JSON after it has been validated (which fits nicely because the JSON storage already might already be shredded variant).
I guess it is kind of okay for now because json storage array is just a utf8 array, but in that case we should be more intentional with our conversions.
There was a problem hiding this comment.
Json is an extension type, I don't see a reason to force users to go through it if they already have strings.
|
@gatesn I've really tried to keep expressions that can't be evaluated directly from the core. I thought execute_parent rules were less flexible but I guess I can try and do that. |
Summary
This PR adds two new expression to
vortex-parquet-variant:VariantToJsonconverting Variant arrays to a JSON array, including "unshredding" themJsonToVariant- takes a JSON/Utf8 array and a shredding spec, returning a newParquetVariantarray, shredded according to the spec.Notes
An option I have considered here is to introduce a third crate to include this logic -
vortex-parquet-variant-json, or using a"json"feature forvortex-parquet-variant.