KX NYSE TAQ Benchmarks

Overview

This benchmark suite uses publicly available NYSE TAQ data, with queries that are representative of common financial industry workloads.

The suite provides benchmarks to:

Compare in-memory query engines (KDB-X, KDB-X Python, Polars, Pandas, and DuckDB).
Evaluate the impact of KDB-X attributes and memory layout.

Running any benchmark involves four steps:

Step 1: Select a data size to control how much data is downloaded and used during the benchmark.
Step 2: Download the compressed PSV files from the NYSE FTP server.
Step 3: Convert the files into kdb+ or Parquet format.
Step 4: Select and run a benchmark.

Step 1: Selecting a Data Size

A single day of NYSE TAQ data is substantial. To reduce execution time, you can limit ingestion to a subset of the BBO split CSV files (the source of the quote table).

Use the SIZE environment variable to balance execution time against data coverage:

export SIZE=small

In all modes except full, only a subset of the BBO split CSV files is downloaded.
Only the corresponding trades are converted into the HDB (for example, only symbols whose names start with Z).

The following statistics are based on data from 2025-01-02:

`SIZE`	Recommended for	Symbol first letters	HDB size (GB)	Nr of quote symbols	Nr of quotes
`small`	A quick test to get familiar with the benchmark suite	Z	1	94	4 607 158
`medium`	KDB-X Community Edition users	I	13	555	180 827 332
`large`	Users with an unlimited KDB-X license but limited memory	A–H	52	4 849	707 738 295
`full`	The most thorough testing	A–Z	233	11 155	2 313 872 956

Use medium when running the benchmark with KDB-X Community Edition, which enforces a memory limit.

Step 2: Obtaining the PSV Files

Although you can download, decompress, and prepare the PSV files manually, we recommend using the getPSVs.sh script from the KDB-X taq module. The taq repository is included as a git submodule; initialize it with:

git submodule update --init --recursive

Set a directory for storing the PSV files. We use a DATA directory inside the repository (it is listed in .gitignore, so the large downloads and generated databases are never committed). Point NYSEBENCHMARKDIR elsewhere if you prefer to keep the data on a different (e.g. faster or larger) filesystem:

export NYSEBENCHMARKDIR=$PWD/DATA

Fetch the latest available date from the NYSE FTP server and run getPSVs.sh:

export DATE=$(curl -s https://ftp.nyse.com/Historical%20Data%20Samples/DAILY%20TAQ/| grep -oE 'EQY_US_ALL_TRADE_2[0-9]{7}' | grep -oE '2[0-9]{7}'|head -1)

./external/kx/taq/scripts/getPSVs.sh --csvdir ${NYSEBENCHMARKDIR}/${SIZE}/psv --dates ${DATE} --size ${SIZE}

The getPSVs.sh script:

Downloads the compressed PSV files using curl -C (which supports resuming interrupted downloads).
Decompresses the files.
Removes trailing lines.
Adds the correct extension (.psv).

Step 3: Converting PSV Files to Binary Data Formats

The PSV files must be converted to a binary format that the query engines can read directly. Both kdb+ and Parquet formats are supported. Each benchmark has its own data format requirement, so example commands are only provided in Step 4.

The ./generateDB.sh script wraps the underlying TAQ parsers. Each parser has its own dependencies.

kdb+ Parser

The kdb+ parser requires:

KDB-X to be installed. The benchmark relies on modules, so KDB-X is required — it does not run on kdb+ versions prior to 5.0.
The KDB-X taq module to be available. This module is included as a git submodule (git submodule update --init --recursive), but its dependencies must be installed manually to the standard KX module path.

Parquet Parser

The Parquet parser uses Python and the PyArrow library. Install uv to manage your Python environment. The full list of required libraries is defined in the inline script metadata in pysrc/taqToParquet/main.py.

PSV Cleanup

Exercise caution when running cleanup: downloading PSV files can be time-consuming. Delete the PSV files only when the binary data has been generated and you are sure that no other binary format will be required.

rm -rf ${NYSEBENCHMARKDIR}/${SIZE}/psv

Step 4: Selecting and Running a Benchmark

Two benchmarks are available:

In-memory query engine benchmark — compares query execution time across the KDB-X, KDB-X SQL, Polars, DuckDB, Pandas, and KDB-X Python (pykx) engines.
In-memory KDB-X attribute and table format comparison — evaluates the impact of attributes and table dictionary formats.

1. In-Memory Query Engine Benchmark — `benchmarks/inmemory/queryEngines.sh`

Query engines read data into memory from Hive-partitioned Parquet or kdb+ format. The required format depends on the engine: the KDB-X engines read kdb+ data, while the Python dataframe/SQL engines read Parquet. If you run all engines (the default), both formats must be generated.

Engine (`--engines` value)	Description	Required data format
`kdb`	KDB-X (q-sql)	kdb+
`sql`	KDB-X SQL	kdb+
`pykx`	KDB-X Python (`pykx`)	kdb+
`duckdb`	DuckDB	Parquet
`polars`	Polars	Parquet
`pandas`	Pandas	Parquet

So you only need the kdb+ database if you restrict the run to kdb/sql/pykx (e.g. --engines kdb,sql), and only the Parquet database if you restrict it to duckdb/polars/pandas. Convert the TAQ PSV files to the format(s) you need using ./generateDB.sh:

# kdb+ format — needed for the kdb, sql, and pykx engines
DATAFORMAT=kdb ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/kdb ${DATE}
# Hive-partitioned Parquet — needed for the duckdb, polars, and pandas engines
SYMBOLSTOREDAS=ROWGROUP DATAFORMAT=parquet ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/parquet/rowgroup ${DATE}

Once the on-disk data has been generated, you can start the benchmark. Python libraries are run via uv, so ensure uv is installed. To test the engines with 0, 4, 16, and 64 secondary threads, run:

export NUMANODE=0
./benchmarks/inmemory/queryEngines.sh --db-dir ${NYSEBENCHMARKDIR}/${SIZE} --param-dir ./artifacts/parameters/${SIZE} --date ${DATE}  --threads "0 4 16 64" --results ./results/inmemory/${SIZE}/queryengines.psv

The script accepts the following mandatory parameters:

Parameter	Description
`--db-dir`	Directory containing the generated databases. The script expects the `kdb` and `parquet/rowgroup` subdirectories created by `./generateDB.sh`.
`-p`, `--param-dir`	Directory of the query parameters (e.g. `./artifacts/parameters/${SIZE}`).
`-d`, `--date`	Target date to query, in the same format as `${DATE}`.

And the following optional parameters:

Parameter	Description
`-t`, `--threads`	Space-separated list of secondary-thread counts to test, e.g. `"0 4 16 64"`. Each engine runs once per value. Default: `"1 4"`.
`-e`, `--engines`	Comma-separated subset of engines to run. Valid values: `kdb`, `sql`, `duckdb`, `polars`, `pykx`, `pandas`. Default: all of them.
`-s`, `--stats-dir`	Directory to save per-table statistics (one YAML file per table), environment information, plus OS `time -v` output.
`-i`, `--idx`	Filter queries by index: single (`42`), comma-separated list (`32,42,50`), or range (`40-44`). Default: run all queries.
`--results`	Single PSV file that all per-engine results are merged into. The individual per-engine files are written to a temporary directory and removed afterwards. Default: `./results/inmemory/queryengines.psv`.
`-h`, `--help`	Show usage and exit.

The NUMANODE environment variable is also honoured: when set, every engine is launched under numactl -N ${NUMANODE} -m ${NUMANODE} to pin CPU and memory allocation to that NUMA node.

To pin a specific library version, edit the inline script metadata in pysrc/queryrunner/main.py. For example:

#   "pykx==4.0.0",

Results

The script merges every engine's results into a single pipe-separated values (PSV) file (set by --results), one row per query (plus a few rows for the data-loading steps).

The file starts with a header row. The columns are:

Column	Description
`storagebackend`	Where the data is read from: `inmemory` or `ondisk`.
`compparam`	Compression parameter used for the data.
`threadcount`	Number of (secondary/worker) threads the engine was configured to use. `0` means no secondary threads.
`runner`	The harness driving the engine, e.g. `KDB-X` or `Python`.
`engine`	The query engine, e.g. `pykx`, `duckdb_con`, `polars`, `pandas`.
`format`	Data format.
`sortcols`	Columns the `trade`/`quote` tables were sorted by before querying, e.g. `time` or `sym,time`. Empty if unsorted.
`indexon`	Columns an index/attribute was applied to, e.g. `sym`. Empty if none.
`engineversion`	Version string of the engine library, e.g. `1.5.4`.
`idx`	Query index. Positive integers are benchmark queries; non-positive values are setup steps: `0` = load a partition into memory, `-1` = transform, `-2` = sort, `-3` = index.
`tags`	Comma-separated category tags for the query (e.g. `timefilter,groupby,advanced`). Setup rows are tagged `load`.
`query`	The query text that was executed (or a short description for setup rows).
`status`	Outcome: `success`, `error` (query raised an exception), `idxfiltered` (skipped by the `--idx` filter), or `tagfiltered` (skipped by the `--tags` filter).
`run1timeNS`	Execution time of run 1 (cold) in nanoseconds. Setup rows record their elapsed time here.
`run2timeNS`	Execution time of run 2 (warm) in nanoseconds.
`run3timeNS`	Execution time of run 3 (warm) in nanoseconds.
`run3memKB`	Peak memory of the query of run 3 in KB.
`run1ioKB`	Disk I/O during run 1 in KB. Should be zero for in-memory benchmarks.
`run2ioKB`	Disk I/O during run 2 in KB. Should be zero for in-memory benchmarks.
`run3ioKB`	Disk I/O during run 3 in KB. Should be zero for in-memory benchmarks.
`ressizeKB`	Size of the query result in KB.

Each benchmark query is run three times (one cold run followed by two warm runs); columns are empty when a value does not apply (e.g. timing/IO columns for an error row, or warm-run columns for setup rows).

2. In-Memory KDB-X Attribute Benchmark — `benchmarks/inmemory/kdbAttributes.sh`

Data is read into memory from kdb+ format. Convert the TAQ PSV files to this format using ./generateDB.sh:

DATAFORMAT=kdb ./generateDB.sh ${NYSEBENCHMARKDIR}/${SIZE}/psv ${NYSEBENCHMARKDIR}/${SIZE}/kdb ${DATE}

Once the on-disk data has been generated, you can start the benchmark. To test with 0, 4, 16, and 64 secondary threads, run:

export NUMANODE=0
./benchmarks/inmemory/kdbAttributes.sh --db-dir ${NYSEBENCHMARKDIR}/${SIZE} --param-dir ./artifacts/parameters/${SIZE} --date ${DATE} --threads "0 4 16 64" --results ./results/inmemory/${SIZE}/kdbattr.psv

Results

The scripts write the results as pipe-separated values (PSV) files of the same format as queryEngines.sh

Extending the Benchmarks

The suite is designed to be extended in two common ways: adding another query engine, and growing the query set. Both are described below.

Adding a New Python-Based In-Memory Query Engine

Python engines live in pysrc/queryrunner/executors/inmemory/. Each engine is a single class that is driven by the shared runner pysrc/queryrunner/main.py. The runner handles flushing, timing (one cold run followed by two warm runs), result writing, and PSV output; your class only has to load the data and execute queries.

Use an existing executor as a template. polars.py and pandas.py read the Hive-partitioned Parquet database; pykx.py reads the kdb+ database instead.

Create the executor class. Implement the informal interface the runner expects (see main.py and the existing executors):

Method	Responsibility
`__init__(self, param, sort_cols, ...)`	Stash parameters/options and build any engine-specific lookup tables (e.g. `timeBuckets`).
`load_resources(self, db_path, datadate, writer, row_start, ios)`	Load `exnames`/`master`/`trade`/`quote` into memory, then transform, sort by `sort_cols`, and (optionally) index. Emit one setup row per phase via `writer.writerow(row_start + [...])`: `idx` `0` = load, `-1` = transform, `-2` = sort, `-3` = index.
`prepare_run(self)`	Reset any per-run state before each of the 3 timed runs.
`get_parameters(self, parameter)`	Pre-process the raw `parameter` string into whatever `execute_query` expects (excluded from the measured time).
`execute_query(self, idx, tags, query_str, params, runidx)`	Execute the query and return the result object.
`get_table_size(df)` (static)	Result/table size in KB, or `None` if unavailable.
`get_table_stats(self)`	Per-table stats dict written to the `--stats-dir` YAML files.
`write_csv(self, res, out_file)`	Serialize a result to CSV for cross-engine output comparison. The CSV must be in kdb+-loadable format, so values need special formatting: booleans as `1`/`0` (not `true`/`false`), and temporal values as kdb+ literals (e.g. timespans like `0D09:30:00.000000000`). See the `write_csv` implementations in polars.py and pandas.py for the duration/boolean conversions.

Wire it into the runner. In main.py, add an elif engine == "<name>": branch inside the inmemory block that imports and instantiates your class as runner and sets threadnr and engineversion. Also add "<name>" to the -engine argument's choices list in build_parser.
Declare dependencies. Add any new library to the inline script metadata (the PEP 723 # /// script block at the top of main.py) so uv run installs it.
Add a query file. Create artifacts/queries/inmemory/<name>.psv with the queries written in your engine's syntax. It must stay index-aligned with querymeta.psv — see Extending the Query Set.
Add it to the driver. In benchmarks/inmemory/queryEngines.sh, add an engine_enabled <name> block that calls uv run pysrc/queryrunner/main.py ... -engine <name> -queryfile ./artifacts/queries/inmemory/<name>.psv ... followed by add_nickname, and add <name> to the default ENGINES list. Optionally add a matching run in get_table_stats. Each engine is launched once per requested thread count; if the library is configured through an environment variable, set it inline as the existing engines do (e.g. POLARS_MAX_THREADS, DUCKDB_THREADS, OMP_NUM_THREADS).

Extending the Query Set with New Queries

Queries are defined per engine in PSV files under artifacts/queries/inmemory/ (kdb.psv, sql.psv, duckdb.psv, polars.psv, pandas.psv, pykx.psv, and the attribute-benchmark variants kdb_noattr.psv, kdb_tabledict.psv). Each file has the columns:

Column	Meaning
`idx`	Query index. Must be identical, row for row, across every query file and `querymeta.psv`.
`tags`	Optional engine-specific extra tags (usually empty).
`query`	The query text in that engine's syntax.
`parameter`	Comma-separated names of parameters injected into the query (e.g. `datadate`, `aFreqInstr`, `twentyInstrs`, `timeBuckets`). Empty if the query takes none.

Engine-independent metadata lives in artifacts/queries/inmemory/querymeta.psv (idx|tags|description|comment). At runtime the runners join each query to its meta row by idx and abort on any index mismatch between a query file and querymeta.psv (see the checks in main.py and src/runQueries.q). Consequently, every query you add must appear — at the same row position and with the same index — in all engine files you want to benchmark and in querymeta.psv.

Parameter names in the parameter column are resolved from the per-size files in artifacts/parameters/${SIZE}/*.txt. To introduce a brand-new parameter, add its .txt file to every size directory and load it in both load_parameters (main.py) and src/getQueryParameters.q.

Appending a query (no existing indices change):

Add a row with the next free idx to each engine query file, expressing the same logical query in that engine's syntax.
Add a matching row (same idx) to querymeta.psv with a description and tags.

Inserting a query in the middle (existing indices must shift): because indices are sequential, inserting renumbers every query after the insertion point. Rather than renumbering by hand, use artifacts/queries/reindex.sh:

Insert the new row at the same position in each query file and in querymeta.psv (the idx value can be left inconsistent for now).
Renumber the idx column of every affected file to 1, 2, 3, … based on row order:
```
./artifacts/queries/reindex.sh artifacts/queries/inmemory/*.psv
```
The script rewrites each PSV in place (preserving the header) and numbers purely by row position, so indices stay aligned across files as long as the inserted row sits at the same position in each. Commit or back up first, and pass only the query/meta PSVs — not result files.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

KX NYSE TAQ Benchmarks

Overview

Step 1: Selecting a Data Size

Step 2: Obtaining the PSV Files

Step 3: Converting PSV Files to Binary Data Formats

kdb+ Parser

Parquet Parser

PSV Cleanup

Step 4: Selecting and Running a Benchmark

1. In-Memory Query Engine Benchmark — `benchmarks/inmemory/queryEngines.sh`

Results

2. In-Memory KDB-X Attribute Benchmark — `benchmarks/inmemory/kdbAttributes.sh`

Results

Extending the Benchmarks

Adding a New Python-Based In-Memory Query Engine

Extending the Query Set with New Queries

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.claude/skills/run-nyse-taq-benchmarks		.claude/skills/run-nyse-taq-benchmarks
artifacts		artifacts
benchmarks/inmemory		benchmarks/inmemory
external/kx		external/kx
flush		flush
pysrc		pysrc
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
generateDB.sh		generateDB.sh

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

KX NYSE TAQ Benchmarks

Overview

Step 1: Selecting a Data Size

Step 2: Obtaining the PSV Files

Step 3: Converting PSV Files to Binary Data Formats

kdb+ Parser

Parquet Parser

PSV Cleanup

Step 4: Selecting and Running a Benchmark

1. In-Memory Query Engine Benchmark — benchmarks/inmemory/queryEngines.sh

Results

2. In-Memory KDB-X Attribute Benchmark — benchmarks/inmemory/kdbAttributes.sh

Results

Extending the Benchmarks

Adding a New Python-Based In-Memory Query Engine

Extending the Query Set with New Queries

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. In-Memory Query Engine Benchmark — `benchmarks/inmemory/queryEngines.sh`

2. In-Memory KDB-X Attribute Benchmark — `benchmarks/inmemory/kdbAttributes.sh`

Packages