Use mmap() for default IO #8451
Replies: 5 comments 1 reply
-
On performance, #8450 shows a 3x improvement for random access (obviously, memory pressure would be higher).
-
On the performance side, I think this is less about mmap specifically and more about avoiding heap allocations and a contended spawn_blocking for every request. I believe we still coalesce reads even against NVMe right now to amortize some of this overhead. On the mmap question, I tend to agree with Adam that engine integrations should typically use the engine's own I/O subsystem, so that all the configs users tweak are respected when reading Vortex. That said, as we do for DuckDB, we should provide configs to opt in to Vortex's own I/O. And any users hitting Vortex directly, e.g. via PyVortex or other language bindings, should have the option to plug in their own I/O or use ours. Given that the random access benchmarks run against Vortex Rust (and not via DataFusion or any other engine), I'd say it's reasonable for our Rust API to pick the best available I/O backend, including mmap.
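For readers unfamiliar with read coalescing, here is a minimal sketch of the idea, not Vortex's actual implementation. The function name `coalesce` and the `max_gap` parameter are assumptions made for illustration: nearby byte ranges are merged so that one larger read replaces several small ones, amortizing the per-request overhead.

```rust
use std::ops::Range;

/// Merge byte ranges whose gap is at most `max_gap` bytes, so that several
/// small reads can be served by a single larger read.
/// Hypothetical helper for illustration only; not taken from Vortex.
pub fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start offset so ranges adjacent in the file are adjacent here.
    ranges.sort_by_key(|r| r.start);

    let mut merged: Vec<Range<u64>> = Vec::with_capacity(ranges.len());
    for r in ranges {
        if let Some(prev) = merged.last_mut() {
            // Close enough to the previous range (or overlapping it):
            // extend the previous read instead of issuing a new one.
            if r.start <= prev.end.saturating_add(max_gap) {
                prev.end = prev.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

A larger `max_gap` trades some wasted bytes for fewer requests; the right value depends on the storage and on how expensive each request is.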
-
I'll just leave this here: on Linux you almost always end up better off with io_uring than with mmap.
-
Hi folks, I'm about to plug our io_uring backend into VortexReadAt and benchmark the results. If anybody is interested, I can share them later. Note that we were originally on mmap and explicitly moved away from it. I agree that io_uring requires very careful engineering and a huge amount of time spent fine-tuning it. Our design is to read into a ring buffer; how the memory is managed is a huge part of what makes it work or not. We then wrap the memory in a BufferHandle and decode directly from it or, in the worst case, memcpy it somewhere else, then release the ring buffer memory. This avoids an additional copy and is as close to zero-copy as I can think of at the moment. My overall feeling on mmap as a default: it works well enough in many environments and is a good default until you push concurrency and run heavy loads in production. Then it starts falling apart because of contention, and you have no control over it either.
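The acquire, decode, release lifecycle described above might look roughly like the following. This is a heavily simplified sketch and not the commenter's design: `SlotPool` and `Slot` are hypothetical names, and a plain pool of fixed-size buffers stands in for a real io_uring buffer ring. It shows only the ownership pattern: a handle borrows a slot while its bytes are in use, and dropping the handle returns the memory to the pool.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Fixed set of equally sized buffers that reads complete into.
/// Hypothetical stand-in for an io_uring buffer ring.
pub struct SlotPool {
    free: Rc<RefCell<Vec<Vec<u8>>>>,
}

/// Owns one slot while its bytes are being decoded. Hypothetical analogue
/// of wrapping the ring memory in a buffer handle.
pub struct Slot {
    buf: Option<Vec<u8>>,
    home: Rc<RefCell<Vec<Vec<u8>>>>,
}

impl SlotPool {
    pub fn new(slots: usize, slot_size: usize) -> Self {
        let free: Vec<Vec<u8>> = (0..slots).map(|_| vec![0u8; slot_size]).collect();
        SlotPool { free: Rc::new(RefCell::new(free)) }
    }

    /// Returns `None` when every slot is in flight; the caller has to wait,
    /// which gives natural back-pressure.
    pub fn acquire(&self) -> Option<Slot> {
        let buf = self.free.borrow_mut().pop()?;
        Some(Slot { buf: Some(buf), home: Rc::clone(&self.free) })
    }

    /// Number of slots currently free.
    pub fn available(&self) -> usize {
        self.free.borrow().len()
    }
}

impl Slot {
    /// Memory the I/O layer would fill.
    pub fn bytes_mut(&mut self) -> &mut [u8] {
        self.buf.as_mut().expect("slot holds a buffer until dropped").as_mut_slice()
    }

    /// Memory the decoder reads from, without copying.
    pub fn bytes(&self) -> &[u8] {
        self.buf.as_ref().expect("slot holds a buffer until dropped").as_slice()
    }
}

impl Drop for Slot {
    fn drop(&mut self) {
        // Releasing the handle hands the memory back to the pool.
        if let Some(buf) = self.buf.take() {
            self.home.borrow_mut().push(buf);
        }
    }
}
```

A production version would also need thread safety, buffer registration with the kernel, and a policy for reads larger than one slot, none of which is shown here.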
-
I think there are multiple cases here:
My main argument is that we should focus on providing APIs that make performance possible, and a core that doesn't hinder it. Maintaining high-performance io_uring or otherwise opinionated IO drivers is a non-trivial effort (see Lance's blog post on the matter), and users will almost always want to provide their own implementation, either to integrate with other system-specific components (caching, tracing, etc.) or because they are highly opinionated about performance or behavior. The DuckDB integration stands out here because it's provided as a binary: we can expose some configuration letting users tune or choose different implementations, but even here I think that letting DuckDB handle the IO can help us reduce the maintenance load. As for the benchmarks, having a good benchmark is just hard. If we have problems with the setup itself (which @myrrc says seems to be an issue here), or if we have obvious performance issues and/or bad APIs, we should improve them, but hacking on complex secondary components just to win a benchmark seems like the wrong priority IMO.
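To make the "users provide their own implementation" point concrete, here is a minimal sketch of a pluggable positional-read interface. Everything in it is an assumption for illustration: the `ReadAt` trait is a synchronous toy and does not reproduce the signature of the real `VortexReadAt`, and `InMemory`, `Counting`, and `read_footer` are hypothetical names.

```rust
use std::cell::Cell;
use std::io;

/// Hypothetical positional-read trait (a toy, not the real `VortexReadAt`).
pub trait ReadAt {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
}

/// Simple library-provided backend: bytes already in memory.
pub struct InMemory(pub Vec<u8>);

impl ReadAt for InMemory {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        // Reject reads that run past the end of the data.
        let end = offset
            .checked_add(len as u64)
            .filter(|&e| e <= self.0.len() as u64)
            .ok_or_else(|| io::Error::from(io::ErrorKind::UnexpectedEof))?;
        Ok(self.0[offset as usize..end as usize].to_vec())
    }
}

/// User-supplied wrapper adding tracing-style counters around any backend.
pub struct Counting<R> {
    inner: R,
    pub calls: Cell<u64>,
    pub bytes: Cell<u64>,
}

impl<R: ReadAt> Counting<R> {
    pub fn new(inner: R) -> Self {
        Counting { inner, calls: Cell::new(0), bytes: Cell::new(0) }
    }
}

impl<R: ReadAt> ReadAt for Counting<R> {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let out = self.inner.read_at(offset, len)?;
        // Only successful reads are counted.
        self.calls.set(self.calls.get() + 1);
        self.bytes.set(self.bytes.get() + out.len() as u64);
        Ok(out)
    }
}

/// Code written against the trait does not care which backend it gets.
pub fn read_footer<R: ReadAt>(src: &R, file_len: u64, footer_len: usize) -> io::Result<Vec<u8>> {
    let start = file_len
        .checked_sub(footer_len as u64)
        .ok_or_else(|| io::Error::from(io::ErrorKind::InvalidInput))?;
    src.read_at(start, footer_len)
}
```

With this shape, an engine integration passes its own `ReadAt` implementation (backed by its cache or I/O scheduler), while direct users of the library can take whatever default it ships.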
-
In an ideal world we'd have VortexReadAt exposed, and the engines would just read for us the bytes we hint at (#8433). However, we currently use our own IO, and sometimes this is an issue.
I propose using mmap for some short reads (random-access benchmarks), because there the cost of pread plus a blocking tokio spawn dominates. Worse, with a higher number of CPU cores we get more contention on these threads: for example, locally with 16 cores the difference with lance is 3x, but in CI it's 10x.
The argument against this (@AdamGS) is that, as library developers, we shouldn't move towards our own IO preferences; otherwise the comparison with e.g. lance would not be correct, and since we're not shipping a complete product like a database, this may harm some of our customers.
My argument is that the comparison is already incorrect (we're on par with lancev5; lancev6 uses io_uring, which is a different IO implementation), and since we provide a default IO, we should use faster options if they are available.