GH-1179: Correct the size of var-width vector with >0 start offset during vector append #1180

Open
jordepic wants to merge 2 commits into apache:main from jordepic:fix-vector-appender-nonzero-start-offset

Conversation

@jordepic

@jordepic jordepic commented Jun 9, 2026

What's Changed

Fix VectorAppender data size computation for variable-width vectors with non-zero start offsets

When appending variable-width vectors in DataFusion Comet, I was receiving exceptions
caused by oversized memory allocations. Comet passes variable-width arrays back to Java
whose first offset buffer entry is greater than 0. Prior to this change, arrow-java
determined how many bytes to copy by looking only at the last entry in the offset buffer,
disregarding the value of the first. If first = 100 and last = 200, Java copied 200 bytes
instead of the 100 that are actually referenced. This change fixes that by using the
distance between the first and last offsets.
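
The first = 100, last = 200 case can be sketched in plain Java. This is a simplified model that uses an `int[]` in place of an Arrow offset buffer; the class and method names are hypothetical and are not part of arrow-java.

```java
// Simplified model of a variable-width vector's offset buffer.
// Hypothetical names for illustration; this is not the arrow-java API.
final class OffsetRangeSketch {

    /** Bytes actually referenced by the offsets: last offset minus first offset. */
    static int referencedDataSize(int[] offsets, int valueCount) {
        return offsets[valueCount] - offsets[0];
    }

    /** The pre-fix computation described above: the last offset alone. */
    static int lastOffsetOnly(int[] offsets, int valueCount) {
        return offsets[valueCount];
    }

    public static void main(String[] args) {
        // Two 50-byte values whose offset buffer starts at 100 instead of 0.
        int[] offsets = {100, 150, 200};
        System.out.println(lastOffsetOnly(offsets, 2));     // 200
        System.out.println(referencedDataSize(offsets, 2)); // 100
    }
}
```

The two results agree only when the first offset is zero, which is why the problem surfaces with sliced arrays.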

Closes #1179

@jordepic

jordepic commented Jun 9, 2026

Author

Could a maintainer please add the bug-fix label here?

@jordepic

Author

Hey @lidavidm, @laurentgo, @wgtmac, @jbonofre - would one of you guys mind taking a look here? It's a pretty nefarious bug :)

@jbonofre jbonofre left a comment

Member

LGTM.

I believe the same issue is present in ListVector and LargeListVector. It may be worth addressing in this PR or in a follow-up PR.

@jordepic

Author

@jbonofre would you be able to label the PR so that those checks aren't failing? I don't have permissions. I can also follow up here with the list vector changes. Thank you for your review :)

@jbonofre

Member

@jordepic absolutely

@jordepic

Author

I'm not sure the label took effect if you tried to add it. I appreciate all of your help here.

@jordepic jordepic force-pushed the fix-vector-appender-nonzero-start-offset branch from 94537fa to 03491a1 Compare June 11, 2026 13:57
@jordepic jordepic requested a review from jbonofre June 11, 2026 14:04
@jordepic

Author

@jbonofre I went ahead and amended this guy with the ListVector and LargeListVector fix. Changes are tested. Would greatly appreciate you labeling the PR as a bug fix as well as a review, thank you :)

@lidavidm lidavidm added the bug-fix PRs that fix a bug. label Jun 11, 2026
@github-actions github-actions Bot added this to the 20.0.0 milestone Jun 11, 2026
@jordepic

Author

Thanks @lidavidm !! I think the last failure here is transient if you want to give it a reroll

@jordepic

Author

Go big red, by the way ;)

@jordepic

Author

@jbonofre if you want to do a final sweep all checks are passing!

Jordan Epstein added 2 commits June 12, 2026 10:29
…width vectors with non-zero start offsets

VectorAppender computed the delta vector's data size as its last offset
value, which is only correct when the offset buffer starts at zero.
Vectors imported through the C data interface from sliced arrays can
have a non-zero first offset; appending them copied the unreferenced
data buffer prefix into the target, inflating it on every append until
allocation eventually failed with OversizedAllocationException.

Compute the data size as the distance between the first and last
offsets, copy from the first offset, and rebase appended offsets
accordingly.

Fixes apache#1179.
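
The three steps in the commit message above (size from the offset distance, copy from the first offset, rebase the appended offsets) can be sketched as follows. This is a minimal model that uses plain arrays instead of Arrow buffers; `VarWidthAppendSketch` and its members are hypothetical and do not reproduce the actual VectorAppender code.

```java
import java.util.Arrays;

// Simplified stand-in for a variable-width vector: an offset array plus a data array.
// Hypothetical model for illustration; this is not the arrow-java VectorAppender code.
final class VarWidthAppendSketch {
    final int[] offsets; // valueCount + 1 entries; offsets[0] may be non-zero
    final byte[] data;

    VarWidthAppendSketch(int[] offsets, byte[] data) {
        this.offsets = offsets;
        this.data = data;
    }

    int valueCount() {
        return offsets.length - 1;
    }

    /** Returns a new vector holding the target's values followed by the delta's values. */
    static VarWidthAppendSketch append(VarWidthAppendSketch target, VarWidthAppendSketch delta) {
        int targetCount = target.valueCount();
        int deltaCount = delta.valueCount();
        int targetEnd = target.offsets[targetCount];
        int deltaStart = delta.offsets[0];
        // Data size is the distance between the first and last offsets.
        int deltaSize = delta.offsets[deltaCount] - deltaStart;

        // Copy only the referenced bytes, starting at the delta's first offset.
        byte[] newData = Arrays.copyOf(target.data, targetEnd + deltaSize);
        System.arraycopy(delta.data, deltaStart, newData, targetEnd, deltaSize);

        // Rebase each appended offset so it continues from the target's last offset.
        int[] newOffsets = Arrays.copyOf(target.offsets, targetCount + deltaCount + 1);
        for (int i = 1; i <= deltaCount; i++) {
            newOffsets[targetCount + i] = delta.offsets[i] - deltaStart + targetEnd;
        }
        return new VarWidthAppendSketch(newOffsets, newData);
    }
}
```

For example, appending a delta with offsets `{3, 4, 6}` over the data `"xxxcde"` to a target with offsets `{0, 2}` over `"ab"` yields offsets `{0, 2, 3, 5}` over `"abcde"`: the three-byte unreferenced prefix of the delta is never copied.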
…ero start offsets

Apply the same fix to list vectors: compute the referenced child range
from the distance between the first and last offsets, rebase appended
offsets by the delta's start offset, and append only the referenced
range of the delta's data vector (via splitAndTransfer when the range
does not cover the whole vector).

Also fix LargeListVector appending, which was previously untested and
broken: the target vector was cast to ListVector (ClassCastException
for an actual LargeListVector target), and the offset buffer copy used
4-byte ListVector offset arithmetic on 8-byte offsets.
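
To illustrate why the offset width matters, here is a small sketch that reads the same 8-byte offset buffer with both 8-byte and 4-byte arithmetic. It assumes a little-endian `java.nio.ByteBuffer` as a stand-in for an Arrow offset buffer; `ListOffsetSketch` and its methods are hypothetical names, not arrow-java classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical model of list offset buffers; not the arrow-java implementation.
final class ListOffsetSketch {

    /** Builds a little-endian buffer of 8-byte offsets, as a large list would use. */
    static ByteBuffer largeOffsets(long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            buf.putLong(i * Long.BYTES, values[i]);
        }
        return buf;
    }

    /** Reads the offset at the given index using the stated offset width (4 or 8 bytes). */
    static long read(ByteBuffer offsets, int index, int width) {
        return width == Long.BYTES
                ? offsets.getLong(index * Long.BYTES)
                : offsets.getInt(index * Integer.BYTES);
    }

    /** Child-element range [start, end) referenced by the offsets: first and last entries. */
    static long[] referencedChildRange(ByteBuffer offsets, int valueCount, int width) {
        return new long[] {read(offsets, 0, width), read(offsets, valueCount, width)};
    }
}
```

With offsets `{3, 5, 9}` stored as 8-byte values, `referencedChildRange(buf, 2, 8)` gives the range 3 to 9, so only six child elements need appending. Reading index 2 with 4-byte arithmetic instead lands on the low half of the second entry and returns 5, which is the kind of mismatch the commit message describes.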
@jordepic jordepic force-pushed the fix-vector-appender-nonzero-start-offset branch from 03491a1 to 9bfd257 Compare June 12, 2026 15:30

Labels

bug-fix PRs that fix a bug.

Development

Successfully merging this pull request may close these issues.

[Java] Incorrect size computed for var-width vectors with non-zero start offsets when appending them

3 participants