This benchmark suite uses publicly available NYSE TAQ data, with queries that are representative of common financial industry workloads.
The suite provides benchmarks to:
- Compare in-memory query engines (KDB-X, KDB-X Python, Polars, Pandas, and DuckDB).
- Evaluate the impact of KDB-X attributes and memory layout.
Running any benchmark involves four steps:
- Step 1: Select a data size to control how much data is downloaded and used during the benchmark.
- Step 2: Download the compressed PSV files from the NYSE FTP server.
- Step 3: Convert the files into kdb+ or Parquet format.
- Step 4: Select and run a benchmark.
A single day of NYSE TAQ data is substantial. To reduce execution time,
you can limit ingestion to a subset of the BBO split CSV files (the source
of the quote table).
Use the SIZE environment variable to balance execution time against data coverage:
export SIZE=small- In all modes except
full, only a subset of the BBO split CSV files is downloaded. - Only the corresponding trades are converted into the HDB (for example, only
symbols whose names start with
Z).
The following statistics are based on data from 2025-01-02:
SIZE |
Recommended for | Symbol first letters | HDB size (GB) | Nr of quote symbols | Nr of quotes |
|---|---|---|---|---|---|
small |
A quick test to get familiar with the benchmark suite | Z | 1 | 94 | 4 607 158 |
medium |
KDB-X Community Edition users | I | 13 | 555 | 180 827 332 |
large |
Users with an unlimited KDB-X license but limited memory | A–H | 52 | 4 849 | 707 738 295 |
full |
The most thorough testing | A–Z | 233 | 11 155 | 2 313 872 956 |
Use medium when running the benchmark with KDB-X Community Edition, which
enforces a memory limit.
Although you can download, decompress, and prepare the PSV files manually, we recommend using the getPSVs.sh script from the KDB-X taq module. The taq repository is included as a git submodule; initialize it with:
git submodule update --init --recursiveSet a directory for storing the PSV files. We use a DATA directory inside the
repository (it is listed in .gitignore, so the large downloads and generated
databases are never committed). Point NYSEBENCHMARKDIR elsewhere if you prefer
to keep the data on a different (e.g. faster or larger) filesystem:
export NYSEBENCHMARKDIR=$PWD/DATAFetch the latest available date from the NYSE FTP server and run getPSVs.sh:
export DATE=$(curl -s https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/| grep -oE 'EQY_US_ALL_TRADE_2[0-9]{7}' | grep -oE '2[0-9]{7}'|head -1)
./external/kx/taq/scripts/getPSVs.sh --csvdir ${NYSEBENCHMARKDIR}/${SIZE}/psv --dates ${DATE} --size ${SIZE}The getPSVs.sh script:
- Downloads the compressed PSV files using
curl -C(which supports resuming interrupted downloads). - Decompresses the files.
- Removes trailing lines.
- Adds the correct extension (
.psv).
The PSV files must be converted to a binary format that the query engines can read directly. Both kdb+ and Parquet formats are supported. Each benchmark has its own data format requirement, so example commands are only provided in Step 4.
The ./generateDB.sh script wraps the underlying TAQ parsers. Each parser has its own dependencies.
The kdb+ parser requires:
- KDB-X to be installed. The benchmark relies on modules, so KDB-X is required — it does not run on kdb+ versions prior to 5.0.
- The KDB-X taq module to be available. This module is included as a git submodule (
git submodule update --init --recursive), but its dependencies must be installed manually to the standard KX module path.
The Parquet parser uses Python and the PyArrow library. Install uv to manage your Python environment. The full list of required libraries is defined in the inline script metadata in pysrc/taqToParquet/main.py.
Exercise caution when running cleanup: downloading PSV files can be time-consuming. Delete the PSV files only when the binary data has been generated and you are sure that no other binary format will be required.
rm -rf ${NYSEBENCHMARKDIR}/${SIZE}/psvTwo benchmarks are available:
- In-memory query engine benchmark — compares query execution time across the KDB-X, KDB-X SQL, Polars, DuckDB, Pandas, and KDB-X Python (
pykx) engines. - In-memory KDB-X attribute and table format comparison — evaluates the impact of attributes and table dictionary formats.
Query engines read data into memory from Hive-partitioned Parquet or kdb+ format. The required format depends on the engine: the KDB-X engines read kdb+ data, while the Python dataframe/SQL engines read Parquet. If you run all engines (the default), both formats must be generated.
Engine (--engines value) |
Description | Required data format |
|---|---|---|
kdb |
KDB-X (q-sql) | kdb+ |
sql |
KDB-X SQL | kdb+ |
pykx |
KDB-X Python (pykx) |
kdb+ |
duckdb |
DuckDB | Parquet |
polars |
Polars | Parquet |
pandas |
Pandas | Parquet |
So you only need the kdb+ database if you restrict the run to kdb/sql/pykx (e.g. --engines kdb,sql), and only the Parquet database if you restrict it to duckdb/polars/pandas. Convert the TAQ PSV files to the format(s) you need using ./generateDB.sh:
# kdb+ format — needed for the kdb, sql, and pykx engines
DATAFORMAT=kdb ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/kdb ${DATE}
# Hive-partitioned Parquet — needed for the duckdb, polars, and pandas engines
SYMBOLSTOREDAS=ROWGROUP DATAFORMAT=parquet ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/parquet/rowgroup ${DATE}Once the on-disk data has been generated, you can start the benchmark. Python libraries are run via uv, so ensure uv is installed. To test the engines with 0, 4, 16, and 64 secondary threads, run:
export NUMANODE=0
./benchmarks/inmemory/queryEngines.sh --db-dir ${NYSEBENCHMARKDIR}/${SIZE} --param-dir ./artifacts/parameters/${SIZE} --date ${DATE} --threads "0 4 16 64" --results ./results/inmemory/${SIZE}/queryengines.psvThe script accepts the following mandatory parameters:
| Parameter | Description |
|---|---|
--db-dir |
Directory containing the generated databases. The script expects the kdb and parquet/rowgroup subdirectories created by ./generateDB.sh. |
-p, --param-dir |
Directory of the query parameters (e.g. ./artifacts/parameters/${SIZE}). |
-d, --date |
Target date to query, in the same format as ${DATE}. |
And the following optional parameters:
| Parameter | Description |
|---|---|
-t, --threads |
Space-separated list of secondary-thread counts to test, e.g. "0 4 16 64". Each engine runs once per value. Default: "1 4". |
-e, --engines |
Comma-separated subset of engines to run. Valid values: kdb, sql, duckdb, polars, pykx, pandas. Default: all of them. |
-s, --stats-dir |
Directory to save per-table statistics (one YAML file per table), environment information, plus OS time -v output. |
-i, --idx |
Filter queries by index: single (42), comma-separated list (32,42,50), or range (40-44). Default: run all queries. |
--results |
Single PSV file that all per-engine results are merged into. The individual per-engine files are written to a temporary directory and removed afterwards. Default: ./results/inmemory/queryengines.psv. |
-h, --help |
Show usage and exit. |
The NUMANODE environment variable is also honoured: when set, every engine is launched
under numactl -N ${NUMANODE} -m ${NUMANODE} to pin CPU and memory allocation to that NUMA node.
To pin a specific library version, edit the inline script metadata in pysrc/queryrunner/main.py. For example:
# "pykx==4.0.0",The script merges every engine's results into a single pipe-separated values (PSV) file
(set by --results), one row per query (plus a few rows for the data-loading steps).
The file starts with a header row. The columns are:
| Column | Description |
|---|---|
storagebackend |
Where the data is read from: inmemory or ondisk. |
compparam |
Compression parameter used for the data. |
threadcount |
Number of (secondary/worker) threads the engine was configured to use. 0 means no secondary threads. |
runner |
The harness driving the engine, e.g. KDB-X or Python. |
engine |
The query engine, e.g. pykx, duckdb_con, polars, pandas. |
format |
Data format. |
sortcols |
Columns the trade/quote tables were sorted by before querying, e.g. time or sym,time. Empty if unsorted. |
indexon |
Columns an index/attribute was applied to, e.g. sym. Empty if none. |
engineversion |
Version string of the engine library, e.g. 1.5.4. |
idx |
Query index. Positive integers are benchmark queries; non-positive values are setup steps: 0 = load a partition into memory, -1 = transform, -2 = sort, -3 = index. |
tags |
Comma-separated category tags for the query (e.g. timefilter,groupby,advanced). Setup rows are tagged load. |
query |
The query text that was executed (or a short description for setup rows). |
status |
Outcome: success, error (query raised an exception), idxfiltered (skipped by the --idx filter), or tagfiltered (skipped by the --tags filter). |
run1timeNS |
Execution time of run 1 (cold) in nanoseconds. Setup rows record their elapsed time here. |
run2timeNS |
Execution time of run 2 (warm) in nanoseconds. |
run3timeNS |
Execution time of run 3 (warm) in nanoseconds. |
run3memKB |
Peak memory of the query of run 3 in KB. |
run1ioKB |
Disk I/O during run 1 in KB. Should be zero for in-memory benchmarks. |
run2ioKB |
Disk I/O during run 2 in KB. Should be zero for in-memory benchmarks. |
run3ioKB |
Disk I/O during run 3 in KB. Should be zero for in-memory benchmarks. |
ressizeKB |
Size of the query result in KB. |
Each benchmark query is run three times (one cold run followed by two warm runs); columns are
empty when a value does not apply (e.g. timing/IO columns for an error row, or warm-run
columns for setup rows).
Data is read into memory from kdb+ format. Convert the TAQ PSV files to this format using ./generateDB.sh:
DATAFORMAT=kdb ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/kdb ${DATE}Once the on-disk data has been generated, you can start the benchmark. To test with 0, 4, 16, and 64 secondary threads, run:
export NUMANODE=0
./benchmarks/inmemory/kdbAttributes.sh --db-dir ${NYSEBENCHMARKDIR}/${SIZE} --param-dir ./artifacts/parameters/${SIZE} --date ${DATE} --threads "0 4 16 64" --results ./results/inmemory/${SIZE}/kdbattr.psvThe scripts write the results as pipe-separated values (PSV) files of the same format as queryEngines.sh
The suite is designed to be extended in two common ways: adding another query engine, and growing the query set. Both are described below.
Python engines live in pysrc/queryrunner/executors/inmemory/. Each engine is a single class that is driven by the shared runner pysrc/queryrunner/main.py. The runner handles flushing, timing (one cold run followed by two warm runs), result writing, and PSV output; your class only has to load the data and execute queries.
Use an existing executor as a template. polars.py and pandas.py read the Hive-partitioned Parquet database; pykx.py reads the kdb+ database instead.
-
Create the executor class. Implement the informal interface the runner expects (see main.py and the existing executors):
Method Responsibility __init__(self, param, sort_cols, ...)Stash parameters/options and build any engine-specific lookup tables (e.g. timeBuckets).load_resources(self, db_path, datadate, writer, row_start, ios)Load exnames/master/trade/quoteinto memory, then transform, sort bysort_cols, and (optionally) index. Emit one setup row per phase viawriter.writerow(row_start + [...]):idx0= load,-1= transform,-2= sort,-3= index.prepare_run(self)Reset any per-run state before each of the 3 timed runs. get_parameters(self, parameter)Pre-process the raw parameterstring into whateverexecute_queryexpects (excluded from the measured time).execute_query(self, idx, tags, query_str, params, runidx)Execute the query and return the result object. get_table_size(df)(static)Result/table size in KB, or Noneif unavailable.get_table_stats(self)Per-table stats dict written to the --stats-dirYAML files.write_csv(self, res, out_file)Serialize a result to CSV for cross-engine output comparison. The CSV must be in kdb+-loadable format, so values need special formatting: booleans as 1/0(nottrue/false), and temporal values as kdb+ literals (e.g. timespans like0D09:30:00.000000000). See thewrite_csvimplementations in polars.py and pandas.py for the duration/boolean conversions. -
Wire it into the runner. In main.py, add an
elif engine == "<name>":branch inside theinmemoryblock that imports and instantiates your class asrunnerand setsthreadnrandengineversion. Also add"<name>"to the-engineargument'schoiceslist inbuild_parser. -
Declare dependencies. Add any new library to the inline script metadata (the PEP 723
# /// scriptblock at the top ofmain.py) souv runinstalls it. -
Add a query file. Create
artifacts/queries/inmemory/<name>.psvwith the queries written in your engine's syntax. It must stay index-aligned withquerymeta.psv— see Extending the Query Set. -
Add it to the driver. In benchmarks/inmemory/queryEngines.sh, add an
engine_enabled <name>block that callsuv run pysrc/queryrunner/main.py ... -engine <name> -queryfile ./artifacts/queries/inmemory/<name>.psv ...followed byadd_nickname, and add<name>to the defaultENGINESlist. Optionally add a matching run inget_table_stats. Each engine is launched once per requested thread count; if the library is configured through an environment variable, set it inline as the existing engines do (e.g.POLARS_MAX_THREADS,DUCKDB_THREADS,OMP_NUM_THREADS).
Queries are defined per engine in PSV files under
artifacts/queries/inmemory/ (kdb.psv,
sql.psv, duckdb.psv, polars.psv, pandas.psv, pykx.psv, and the
attribute-benchmark variants kdb_noattr.psv, kdb_tabledict.psv). Each file
has the columns:
| Column | Meaning |
|---|---|
idx |
Query index. Must be identical, row for row, across every query file and querymeta.psv. |
tags |
Optional engine-specific extra tags (usually empty). |
query |
The query text in that engine's syntax. |
parameter |
Comma-separated names of parameters injected into the query (e.g. datadate, aFreqInstr, twentyInstrs, timeBuckets). Empty if the query takes none. |
Engine-independent metadata lives in artifacts/queries/inmemory/querymeta.psv
(idx|tags|description|comment). At runtime the runners join each query to its
meta row by idx and abort on any index mismatch between a query file and
querymeta.psv (see the checks in main.py and
src/runQueries.q). Consequently, every query you add must
appear — at the same row position and with the same index — in all engine files
you want to benchmark and in querymeta.psv.
Parameter names in the parameter column are resolved from the per-size files
in artifacts/parameters/${SIZE}/*.txt. To introduce a brand-new parameter, add
its .txt file to every size directory and load it in both
load_parameters (main.py) and
src/getQueryParameters.q.
Appending a query (no existing indices change):
- Add a row with the next free
idxto each engine query file, expressing the same logical query in that engine's syntax. - Add a matching row (same
idx) toquerymeta.psvwith adescriptionand tags.
Inserting a query in the middle (existing indices must shift): because indices are sequential, inserting renumbers every query after the insertion point. Rather than renumbering by hand, use artifacts/queries/reindex.sh:
- Insert the new row at the same position in each query file and in
querymeta.psv(theidxvalue can be left inconsistent for now). - Renumber the
idxcolumn of every affected file to1, 2, 3, …based on row order:The script rewrites each PSV in place (preserving the header) and numbers purely by row position, so indices stay aligned across files as long as the inserted row sits at the same position in each. Commit or back up first, and pass only the query/meta PSVs — not result files../artifacts/queries/reindex.sh artifacts/queries/inmemory/*.psv