Skip to content

HDDS-15388. K way merge iterator over native SST file reader to emit latest tombstone and latest value of a key.#10484

Draft
SaketaChalamchala wants to merge 1 commit into
apache:masterfrom
SaketaChalamchala:HDDS-15388
Draft

HDDS-15388. K way merge iterator over native SST file reader to emit latest tombstone and latest value of a key.#10484
SaketaChalamchala wants to merge 1 commit into
apache:masterfrom
SaketaChalamchala:HDDS-15388

Conversation

@SaketaChalamchala

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Developed with the help of Cursor AI.

This PR adds a k-way merge iterator over native RocksDB SST reader/iterator in hadoop-hdds/rocks-native module that emits the latest value and/or the latest tombstone per key.
This PR intended as the foundation for a more efficient snapshot diff (HDDS-9154) where the diff can be calculated directly from the delta set of SST files instead of doing a random lookup from the target snapshot.

  • VersionedKWayMergeIterator performs a dual-heap k-way merge over multiple SST file iterators:
    • A value heap for non-tombstone entries
    • A tombstone heap for delete/tombstone entries
    • For each user key, the iterator drains all versions from both heaps, tracks the highest-sequence value and tombstone, then delegates emission to a pluggable VersionedMergeEmitter. Heap ordering and user-key grouping are controlled by VersionedMergeComparators.
    • Deferred value copying: raw SST heap heads hold CodecBuffer references and copy on emit.
  • LatestVersionMergeComparators implements VersionedMergeComparators — user key ascending, sequence descending (newest rocksDB entry first)
  • LatestVersionMergeEmitter implements VersionedMergeEmitter — emission rules:
    • valueSeq > tombstoneSeq → emit both (delete→recreate)
    • tombstoneSeq >= valueSeq → emit tombstone only
    • value-only or tombstone-only otherwise

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15388

How was this patch tested?

Unit Tests.

@SaketaChalamchala SaketaChalamchala added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jun 10, 2026
@peterxcli peterxcli self-requested a review June 10, 2026 20:34

@smengcl smengcl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SaketaChalamchala for the patch. I have a few comments inline

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this PR be simplified to a single-purpose raw-SST merge iterator? Right now it adds a bunch of generic interfaces that are only used by this like VersionedMergeComparators, VersionedMergeEmitter, VersionedMergeEntry, etc.

Something like one PriorityQueue, one byte[] comparator, and one private method to choose the latest value/tombstone should be much smaller and easier to review.

public RawSstHeapHead next() {
ManagedRawSSTFileIterator.KeyValue keyValue = iterator.next();
int type = keyValue.getType();
CodecBuffer valueBuffer = type == ROCKS_TYPE_VALUE ? keyValue.getValue() : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stores the CodecBuffer from ManagedRawSSTFileIterator in a RawSstHeapHead, but drainHeapForUserKey() can call entry.advance() before toMergedKeyValue() copies the value. The key is copied immediately, while the value is copied later, after the underlying iterator may have reused or released the buffer. Should RawSstIterator.next() copy the value bytes before creating RawSstHeapHead?

current = iterator.next();
} else {
current = null;
iterator.close();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This closes the iterator when it's exhausted, but VersionedKWayMergeIterator.close() closes the same iterators again. RawSstIterator.close() calls native close paths and does not guard against repeated calls. Can we either make RawSstIterator.close() idempotent, or stop closing exhausted iterators a second time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants