Add RunEndBool encoding for efficient boolean array compression #8467
joseph-isaacs wants to merge 4 commits into
…election

Re-introduces the `vortex-runend-bool` encoding crate (encoding id `vortex.runend_bool`) specialized for boolean arrays. Boolean runs strictly alternate, so the array stores only the run `ends` plus a `start` flag and an optional validity child -- no values array.

- New crate `encodings/runend-bool` mirroring `vortex-runend`: VTable, prost metadata serde, canonicalize, scalar_at, take, filter, slice, invert, and is_constant / is_sorted / min_max aggregate kernels.
- Registered in the default session (file round-trip) and the workspace.
- New BtrBlocks `BoolRunEndScheme` so the compressor auto-selects run-end for run-heavy bool columns; the run `ends` child is cascaded for further compression.
- Unify shared run-end index logic (find_physical_index, find_slice_end_index, validate_ends, logical_len_from_ends) into `vortex-runend::shared` behind a `RunEndIndex` trait plus orphan-rule-safe free-function forms, reused by both encodings.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01SKPEzTk1Wme4y96bC7wps1
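To make the "ends plus a `start` flag" layout concrete, here is a minimal sketch of the encoding step over plain slices. `RunEndBoolSketch` and `encode_sketch` are hypothetical names for illustration only; the real crate works on Vortex array types, and validity and offsets are ignored here.

```rust
/// Hypothetical, simplified stand-in for the real array type (assumption:
/// no validity child and no slice offset).
#[derive(Debug, PartialEq)]
struct RunEndBoolSketch {
    /// Exclusive end position of each run, strictly increasing.
    ends: Vec<u64>,
    /// Value of the first run; later runs alternate from it.
    start: bool,
}

/// Run-end encode a boolean slice by recording where the value flips.
fn encode_sketch(values: &[bool]) -> RunEndBoolSketch {
    // An empty input has no first value; `false` is an arbitrary choice here.
    let start = values.first().copied().unwrap_or(false);
    let mut ends = Vec::new();
    for i in 1..values.len() {
        // A flip at position `i` closes the run that ends just before it.
        if values[i] != values[i - 1] {
            ends.push(i as u64);
        }
    }
    // The final run always ends at the logical length.
    if !values.is_empty() {
        ends.push(values.len() as u64);
    }
    RunEndBoolSketch { ends, start }
}
```

For the input `[true, true, false, false, false, true]` this sketch stores three ends and one flag instead of six values.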
- `encode_runend_bool` now takes `ArrayView<'_, Bool>`, mirroring how `runend_encode` takes `ArrayView<'_, Primitive>`, and drops the `&BoolArray` / `TypedArrayRef::to_owned` detour at every call site (including the BtrBlocks `BoolRunEndScheme`).
- The metadata test now uses the `check_metadata` goldenfile harness with `#[cfg_attr(miri, ignore)]`, matching `runend`; adds `goldenfiles/runend_bool.metadata`.
- Add a usage doctest to `RunEndBool::new`, mirroring `RunEnd`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01SKPEzTk1Wme4y96bC7wps1
`RunEndBool::filter` previously decoded every kept element with a per-index binary search (`find_physical_index`), which is O(kept * log runs). Mirror `runend`'s threshold dispatch: keep that path for sparse masks, but for dense masks scan the run ends once (O(runs + len)) and emit a `RunEndBool`, avoiding the binary searches and preserving the run-end encoding in the output.

Because boolean runs strictly alternate, dropping an entire run can leave two kept runs with the same value adjacent; these are merged so the result still alternates and round-trips through a single `start` flag.

Adds tests for the dense path and the run-merging case.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01SKPEzTk1Wme4y96bC7wps1
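The dense path and the run-merging rule described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `filter_dense_sketch` is a hypothetical name, the mask is a plain `&[bool]` whose length equals the last run end, and validity and slice offsets are ignored.

```rust
/// Filter a run-end bool array (given as `ends` + `start`) with a dense mask,
/// returning the `(ends, start)` pair of the filtered array.
fn filter_dense_sketch(ends: &[u64], start: bool, mask: &[bool]) -> (Vec<u64>, bool) {
    let mut out_ends: Vec<u64> = Vec::new();
    let mut out_start = start; // kept as-is if nothing survives the mask
    let mut last_value: Option<bool> = None;
    let mut kept: u64 = 0; // elements kept so far, i.e. the next output end
    let mut run_begin: usize = 0;

    for (run, &end) in ends.iter().enumerate() {
        let run_end = end as usize;
        // Even runs hold `start`, odd runs hold `!start`.
        let value = start ^ (run % 2 == 1);
        let kept_in_run = mask[run_begin..run_end].iter().filter(|&&k| k).count() as u64;
        run_begin = run_end;
        if kept_in_run == 0 {
            continue; // the whole run was dropped
        }
        kept += kept_in_run;
        match last_value {
            // Same value as the previous kept run: merge by extending its end,
            // so the output still strictly alternates.
            Some(prev) if prev == value => {
                *out_ends.last_mut().expect("a previous run exists") = kept;
            }
            _ => {
                if last_value.is_none() {
                    out_start = value; // first surviving run defines `start`
                }
                out_ends.push(kept);
            }
        }
        last_value = Some(value);
    }
    (out_ends, out_start)
}
```

With ends `[2, 5, 6]` and `start = true`, a mask that drops the whole middle run leaves two `true` runs adjacent, and the sketch merges them into a single run.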
Merging this PR will not alter performance
The shared `RunEndIndex` trait now provides `ends()`/`offset()`/`find_physical_index()` for `RunEnd` arrays, but Rust requires the defining trait in scope to call them. Consumer crates that import only `RunEndArrayExt` (vortex-cuda, vortex-duckdb, and a vortex bench) failed to compile with E0599. Bring `RunEndIndex` into scope alongside `RunEndArrayExt` in those files.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
https://claude.ai/code/session_01SKPEzTk1Wme4y96bC7wps1
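The scoping rule behind this fix can be reproduced in a few lines. The names below (`RunEndIndexSketch`, `RunEndSketch`, `last_end_sketch`) are hypothetical stand-ins, not the Vortex types; the point is only that trait methods resolve when the trait itself is imported.

```rust
mod shared {
    /// Hypothetical stand-in for a shared run-end indexing trait.
    pub trait RunEndIndexSketch {
        fn ends(&self) -> &[u64];

        /// Index of the run containing logical position `index`.
        fn find_physical_index(&self, index: u64) -> usize {
            self.ends().partition_point(|&end| end <= index)
        }
    }
}

/// Hypothetical array type that implements the trait by path only.
pub struct RunEndSketch {
    pub ends: Vec<u64>,
}

impl shared::RunEndIndexSketch for RunEndSketch {
    fn ends(&self) -> &[u64] {
        &self.ends
    }
}

// Implementing the trait does not bring it into scope. Without this import,
// the method calls below are rejected with E0599 (no method found).
use shared::RunEndIndexSketch;

fn last_end_sketch(arr: &RunEndSketch) -> Option<u64> {
    arr.ends().last().copied()
}
```

Removing the `use` line is enough to reproduce the compile failure in this sketch, which matches the symptom described for the consumer crates.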
Summary
This PR introduces a new `RunEndBool` encoding specialized for compressing boolean arrays with long runs. Since boolean values strictly alternate in run-end encoding, we can store only the run end positions plus a single `start` flag, rather than maintaining a separate values array like the generic `RunEnd` encoding.

Changes

New `vortex-runend-bool` crate with:
- `RunEndBoolArray` type and metadata serialization

Shared run-end indexing logic extracted to `encodings/runend/src/shared.rs`:
- `RunEndIndex` trait for common "ends" indexing operations
- `find_physical_index()` and `find_slice_end_index()` free functions
- `validate_ends()` validation helper
- `RunEnd` and `RunEndBool` now implement this trait

Integration with compression pipeline:
- `BoolRunEndScheme` in `vortex-btrblocks` for automatic bool array compression

File format support:
- `RunEndBool` arrays serialize/deserialize correctly in Vortex file format
- `vortex-file`

Design Notes
- Run `i` has value `value_at_index(i, start)`, where even indices equal `start` and odd indices equal `!start`
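The alternation rule and the lookup it enables can be sketched over plain slices. The function names `value_at_index` and `find_physical_index` come from the PR text, but the signatures and bodies below are assumptions made for illustration; `scalar_at_sketch` and `decode_sketch` are hypothetical helpers, and validity and offsets are ignored.

```rust
/// Value of run `run`: even runs hold `start`, odd runs hold `!start`.
/// (Signature assumed; only the name and rule appear in the PR description.)
fn value_at_index(run: usize, start: bool) -> bool {
    if run % 2 == 0 { start } else { !start }
}

/// Binary search for the run containing logical position `index`:
/// the first run whose exclusive end is greater than `index`.
fn find_physical_index(ends: &[u64], index: u64) -> usize {
    ends.partition_point(|&end| end <= index)
}

/// Look up a single element, or `None` if `index` is out of bounds.
fn scalar_at_sketch(ends: &[u64], start: bool, index: u64) -> Option<bool> {
    let run = find_physical_index(ends, index);
    (run < ends.len()).then(|| value_at_index(run, start))
}

/// Expand the runs back into a plain boolean vector.
fn decode_sketch(ends: &[u64], start: bool) -> Vec<bool> {
    let mut out = Vec::with_capacity(ends.last().copied().unwrap_or(0) as usize);
    let mut begin = 0u64;
    for (run, &end) in ends.iter().enumerate() {
        let value = value_at_index(run, start);
        out.extend(std::iter::repeat(value).take((end - begin) as usize));
        begin = end;
    }
    out
}
```

Because the value of a run depends only on its parity, a point lookup needs just the binary search over `ends`; no values child is read.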