SPANN: opt-in mmap of the input vector file #451
Open
fastio wants to merge 2 commits into
Add Helper::MemoryMappedFile (header-only RAII, read-only, Windows/POSIX) and a [Base] MmapVectors option (default false). When enabled, the DEFAULT-format vector file is memory-mapped instead of loaded into heap: the OS demand-pages it and reclaims clean pages under pressure, so peak RSS during SelectHead stays bounded on billion-scale inputs. The mapping's lifetime follows the returned VectorSet via an aliasing shared_ptr, and MADV_RANDOM suppresses readahead for the random access pattern of BKT clustering. Falls back to the regular in-memory read (with a warning) if mapping fails. Automatically disabled for Cosine with un-normalized input, since SelectHead normalizes vectors in place and the mapping is read-only; pre-normalize the input or use L2 to combine the two.
Make the non-TBB ConcurrentQueue/Set/Map fallbacks expose the TBB API surface used by ExtraFileController/ExtraDynamicSearcher (unsafe_size, empty, unsafe_begin/end, value_type, range iteration). Force the LoggerHolder pre-C++20 path so std::atomic<std::shared_ptr<Logger>> never instantiates under libc++. Drop an unused omp.h include and pass .c_str() to a SPTAGLIB_LOG variadic call.
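A minimal sketch of what such a fallback could look like for the queue case, assuming a mutex-guarded `std::deque`. The class name `FallbackConcurrentQueue` is hypothetical and this is not the PR's implementation. It only illustrates the member names listed above (`unsafe_size`, `empty`, `unsafe_begin`/`unsafe_end`, `value_type`), which mirror `tbb::concurrent_queue`.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a non-TBB fallback queue; not the PR's actual class.
template <typename T>
class FallbackConcurrentQueue {
public:
    using value_type = T;
    using const_iterator = typename std::deque<T>::const_iterator;

    void push(const T& v) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_items.push_back(v);
    }

    // Same contract as tbb::concurrent_queue::try_pop: false when the queue is empty.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_items.empty()) return false;
        out = m_items.front();
        m_items.pop_front();
        return true;
    }

    bool empty() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_items.empty();
    }

    // "unsafe" as in TBB: the value may be stale if other threads are modifying the queue.
    std::size_t unsafe_size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_items.size();
    }

    // Unlocked iteration: the caller must guarantee no concurrent push/pop.
    const_iterator unsafe_begin() const { return m_items.begin(); }
    const_iterator unsafe_end() const { return m_items.end(); }

private:
    mutable std::mutex m_mutex;
    std::deque<T> m_items;
};
```

Code written against the TBB names can then compile unchanged against the fallback, at the cost of a single lock instead of TBB's finer-grained concurrency.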
Motivation
SPANN BuildIndex currently reads the whole input vector file into heap before SelectHead. For billion-scale inputs (e.g. SPACEV1B) that means hundreds of GB
of resident memory just to hold the raw vectors, on top of the build's own
working set. BKT clustering touches vectors randomly by id and rarely needs
them all hot at once.
Design
Adds a [Base] MmapVectors option (default false). When enabled and the input is the DEFAULT binary format, DefaultVectorReader::GetVectorSet memory-maps the file read-only instead of reading it:
- The OS demand-pages the file and reclaims clean pages under memory pressure, so peak RSS stays bounded.
- The mapping is held by a header-only RAII wrapper (Helper::MemoryMappedFile, Windows + POSIX). Its lifetime follows the returned VectorSet via an aliasing shared_ptr, so the mapping is unmapped exactly once, after the last ByteArray copy is gone.
- MADV_RANDOM is applied on POSIX to suppress readahead, matching BKT's random access pattern.
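The lifetime rule can be illustrated with a small POSIX-only sketch. The names `ReadOnlyMapping` and `MapFileBytes` are hypothetical and stand in for the PR's Helper::MemoryMappedFile, whose actual interface is not shown here. The point is the aliasing `std::shared_ptr` constructor: the returned pointer addresses the mapped bytes but shares ownership of the mapping object, so `munmap` runs once, when the last copy is released.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper (POSIX only); not the PR's Helper::MemoryMappedFile.
class ReadOnlyMapping {
public:
    ReadOnlyMapping() = default;
    ReadOnlyMapping(const ReadOnlyMapping&) = delete;
    ReadOnlyMapping& operator=(const ReadOnlyMapping&) = delete;
    ~ReadOnlyMapping() {
        if (m_data != nullptr) ::munmap(m_data, m_size);
    }

    bool Open(const std::string& path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
            ::close(fd);
            return false;
        }
        const std::size_t size = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping remains valid after the descriptor is closed
        if (p == MAP_FAILED) return false;
        m_data = p;
        m_size = size;
        // Advisory hint only: random access, so the kernel should limit readahead.
        ::madvise(m_data, m_size, MADV_RANDOM);
        return true;
    }

    const std::uint8_t* Data() const { return static_cast<const std::uint8_t*>(m_data); }
    std::size_t Size() const { return m_size; }

private:
    void* m_data = nullptr;
    std::size_t m_size = 0;
};

// Returns a byte pointer that shares ownership with the mapping (aliasing constructor).
// Returns nullptr on failure so the caller can fall back to a regular read.
inline std::shared_ptr<const std::uint8_t> MapFileBytes(const std::string& path, std::size_t& size) {
    auto mapping = std::make_shared<ReadOnlyMapping>();
    if (!mapping->Open(path)) return nullptr;
    size = mapping->Size();
    return std::shared_ptr<const std::uint8_t>(mapping, mapping->Data());
}
```

Copies of the returned pointer can be handed around freely. None of them needs to know about the mapping object, and the file stays mapped until the last copy is destroyed.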
Fallback and safety
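- If mapping fails, the reader falls back to the regular in-memory read with a warning.
- The mapping is read-only, but SelectHead normalizes vectors in place when the metric is Cosine and the input is not pre-normalized. BuildIndex detects that combination, disables mmap, and logs a warning to pre-normalize or use L2. L2 and pre-normalized Cosine inputs use mmap unmodified.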
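The Cosine guard amounts to a small decision function. The sketch below is an assumption about its shape, not the PR's code: `Metric`, `MmapDecision`, `ResolveMmap`, and the warning text are all hypothetical names chosen for illustration.

```cpp
#include <string>

// Hypothetical names throughout; the PR's actual types and log messages may differ.
enum class Metric { L2, Cosine };

struct MmapDecision {
    bool useMmap;
    std::string warning;  // empty when nothing needs to be logged
};

// Decides whether the requested mmap can be honored for this metric/input combination.
inline MmapDecision ResolveMmap(bool requested, Metric metric, bool inputNormalized) {
    if (!requested) return {false, ""};
    if (metric == Metric::Cosine && !inputNormalized) {
        // In-place normalization would write to a read-only mapping, so mmap is turned off.
        return {false, "MmapVectors disabled: pre-normalize the input or use L2"};
    }
    return {true, ""};
}
```

Keeping the check in one place means the reader itself never has to reason about metrics. It only sees the final on/off flag, and the mapping-failure fallback stays a separate concern.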