HDDS-15388. K way merge iterator over native SST file reader to emit latest tombstone and latest value of a key.#10484
Conversation
…latest tombstone and latest value of a key.
smengcl
left a comment
There was a problem hiding this comment.
Thanks @SaketaChalamchala for the patch. I have a few comments inline
There was a problem hiding this comment.
Can this PR be simplified to a single-purpose raw-SST merge iterator? Right now it adds a bunch of generic interfaces that are only used by this like VersionedMergeComparators, VersionedMergeEmitter, VersionedMergeEntry, etc.
Something like one PriorityQueue, one byte[] comparator, and one private method to choose the latest value/tombstone should be much smaller and easier to review.
| public RawSstHeapHead next() { | ||
| ManagedRawSSTFileIterator.KeyValue keyValue = iterator.next(); | ||
| int type = keyValue.getType(); | ||
| CodecBuffer valueBuffer = type == ROCKS_TYPE_VALUE ? keyValue.getValue() : null; |
There was a problem hiding this comment.
This stores the CodecBuffer from ManagedRawSSTFileIterator in a RawSstHeapHead, but drainHeapForUserKey() can call entry.advance() before toMergedKeyValue() copies the value. The key is copied immediately, while the value is copied later, after the underlying iterator may have reused or released the buffer. Should RawSstIterator.next() copy the value bytes before creating RawSstHeapHead?
| current = iterator.next(); | ||
| } else { | ||
| current = null; | ||
| iterator.close(); |
There was a problem hiding this comment.
This closes the iterator when it's exhausted, but VersionedKWayMergeIterator.close() closes the same iterators again. RawSstIterator.close() calls native close paths and does not guard against repeated calls. Can we either make RawSstIterator.close() idempotent, or stop closing exhausted iterators a second time?
What changes were proposed in this pull request?
Developed with the help of Cursor AI.
This PR adds a k-way merge iterator over native RocksDB SST reader/iterator in
hadoop-hdds/rocks-nativemodule that emits the latest value and/or the latest tombstone per key.This PR intended as the foundation for a more efficient snapshot diff (HDDS-9154) where the diff can be calculated directly from the delta set of SST files instead of doing a random lookup from the target snapshot.
VersionedKWayMergeIteratorperforms a dual-heap k-way merge over multiple SST file iterators:VersionedMergeEmitter. Heap ordering and user-key grouping are controlled byVersionedMergeComparators.LatestVersionMergeComparatorsimplementsVersionedMergeComparators— user key ascending, sequence descending (newest rocksDB entry first)LatestVersionMergeEmitterimplementsVersionedMergeEmitter— emission rules:valueSeq > tombstoneSeq→ emit both (delete→recreate)tombstoneSeq >= valueSeq→ emit tombstone onlyWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15388
How was this patch tested?
Unit Tests.