Skip to content

feat[gpu]: arrow device array stream support#8483

Draft
0ax1 wants to merge 13 commits into
developfrom
ad/arrow-device-array-stream
Draft

feat[gpu]: arrow device array stream support#8483
0ax1 wants to merge 13 commits into
developfrom
ad/arrow-device-array-stream

Conversation

@0ax1

@0ax1 0ax1 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

No description provided.

0ax1 added 9 commits June 17, 2026 18:46
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
This reverts commit 52952d2.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
@0ax1 0ax1 added the changelog/feature A new feature label Jun 17, 2026
@0ax1

0ax1 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@claude review in depth

@github-actions

github-actions Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @0ax1's task in 3m 22s —— View job


Claude Code Review — Arrow Device Array Stream support

  • Gather PR diff and context
  • Review vortex-cuda/src/arrow/mod.rs (device array stream impl)
  • Review FFI changes (vortex-cuda/ffi, vortex-ffi)
  • Review e2e test harness changes
  • Post review feedback

Overall this is a careful, well-documented implementation of the Arrow C Device stream interface on top of the existing single-array device export. The unsafe FFI surface is handled with discipline: panics are caught at every callback boundary, double-release is guarded, Drop releases the buffered batch, and the underlying Arc<CudaStream> is retained per-batch so outstanding arrays survive stream release. The e2e coverage (sliced, multi-buffer, dict, decimal, list/FSL, temporal) is excellent. A few things worth considering below.


Findings

1. futures::executor::block_on won't drive a Tokio stream (medium)

vortex-cuda/src/arrow/mod.rs:299-305

impl Iterator for FuturesArrayStreamIterator {
    fn next(&mut self) -> Option<Self::Item> {
        futures::executor::block_on(self.stream.next())
    }
}

The public DeviceArrayStreamExt for SendableArrayStream polls the stream with futures::executor::block_on. That executor does not install a Tokio reactor, so a SendableArrayStream backed by async file/object-store IO (the normal vortex-file scan case) will fail or stall when it awaits IO. The FFI partition path sidesteps this correctly by going through RUNTIME.block_on_stream(...) in vx_partition_into_array_iter (vortex-ffi/src/scan.rs:97-103) before constructing the stream.

So today the trait impl is only safe for runtime-independent (in-memory) streams, which is all the tests exercise. Since this is a public API, I'd either route it through the Vortex runtime or document the constraint loudly on the trait. As written it's an easy footgun for the first caller who hands it a file-backed stream. Fix this →

Note export_batch also calls futures::executor::block_on(array.export_device_array_with_schema(...)) — that's fine and consistent with vx_cuda_array_export_arrow_device, assuming the export future never awaits Tokio primitives (which appears to be the case).

2. Schema is committed before device validation in export_batch (low)

vortex-cuda/src/arrow/mod.rs:385-414

Inside the validation closure, self.schema = Some(batch_schema) is set before the device_type / device_id checks run. If the very first batch exports onto the wrong device, the array is released and an error is returned — but self.schema is now permanently set from a rejected batch, so a later get_schema would report a schema for a stream that already errored. Consider validating device type/id first and only committing the schema once the batch is fully accepted.

3. Empty-stream schema can diverge from a non-empty stream's schema (low / worth a comment)

vortex-cuda/src/arrow/mod.rs:328-344 and from_dtype (265-283)

ensure_schema/export_batch derive the reported schema from the first concrete batch via from_ffi, which preserves nested encodings (e.g. dictionary → DataType::Dictionary). The empty-stream fallback from_dtype derives only from the logical DType, where a dict-encoded column's logical type is not Dictionary. So an empty stream of dictionary data reports a plain (non-dictionary) field, while a non-empty one reports a dictionary field. There's no data in the empty case so it's likely harmless, but a one-line comment noting the intentional difference would help future readers.

4. set_error silently drops the message on the (now-impossible) NUL path (nit)

vortex-cuda/src/arrow/mod.rs:322-326

let message = error.to_string().replace('\0', "\\0");
self.last_error = CString::new(message).ok();

After stripping all \0, CString::new can no longer fail, so .ok() will always be Some. It's harmless, but if you want defense-in-depth, fall back to a static message (unwrap_or_else(|| CString::new("error").unwrap())) so get_last_error never returns NULL while a non-zero status is reported.

5. Thread affinity of the exported stream (question / doc)

The ArrowDeviceArrayStream carries private_data holding a CudaExecutionCtx and Box<dyn Iterator> (no Send bound). The Arrow C stream contract is single-threaded, but consumers may legitimately move the stream to one other thread before driving it. Worth confirming the CudaExecutionCtx / cudarc stream is safe to use from a thread other than the one that created it, and documenting the "drive from a single thread" expectation on export_device_array_stream_from_iter.


Things I verified and liked

  • Double-release / release protocol: device_stream_release nulls all callbacks and guards re-entry; EOS is correctly signaled via an ArrowArray::empty() with release == None. The test exercises both release-twice and EOS.
  • Leak safety: release_schema/release_device_array are called on every error path in export_batch, and Drop for DeviceArrayStreamPrivateData releases a buffered pending_array.
  • Dictionary code widening (arrow_device_export_dictionary_codes_dtype, U8→I16 etc.) matches the cuDF signed-index requirement and is confirmed end-to-end by the DictionaryArray<Int16Type> expectation in validate_array.
  • FFI ownership docs in vortex_cuda.h and the Rust headers are precise about who releases what.

Checks I did not run

This review is static only — I did not build or run tests. The CUDA paths require GPU hardware (#[cuda_test] / compute-sanitizer) that isn't available here, and cargo/clippy were not executed. I'd recommend the author run cargo clippy --all-targets --all-features on the touched crates and the CUDA e2e harness per CLAUDE.md before merge.
· branch ad/arrow-device-array-stream

@codspeed-hq

codspeed-hq Bot commented Jun 17, 2026

Copy link
Copy Markdown

Merging this PR will improve performance by 15.43%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 6 improved benchmarks
❌ 1 regressed benchmark
✅ 1574 untouched benchmarks
⏩ 3 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_varbinview_into_canonical[(1000, 10)] 177.7 µs 213.9 µs -16.94%
Simulation take_10k_random 255.8 µs 197.8 µs +29.27%
Simulation take_10k_contiguous 276.3 µs 218.5 µs +26.46%
Simulation patched_take_10k_contiguous_patches 291 µs 232.3 µs +25.26%
Simulation patched_take_10k_random 303 µs 244.2 µs +24.07%
Simulation bitwise_not_vortex_buffer_mut[128] 215.3 ns 186.1 ns +15.67%
Simulation bitwise_not_vortex_buffer_mut[1024] 275.6 ns 246.4 ns +11.84%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ad/arrow-device-array-stream (3ae0d00) with develop (a4476f1)

Open in CodSpeed

Footnotes

  1. 3 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

0ax1 added 4 commits June 17, 2026 21:28
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant