perf: prefetch counts[] ahead in the AVX2 percentile scan (follow-up to #138) by fcostaoliveira · Pull Request #139 · HdrHistogram/HdrHistogram_c

fcostaoliveira · 2026-07-01T15:33:56Z

Summary

Software-prefetch counts[] ahead of the AVX2 percentile scan
(get_value_from_idx_up_to_count_avx2) to hide L2/L3 load latency:

_mm_prefetch((const char*)&h->counts[idx + 64], _MM_HINT_T0);   /* 512 B / 4 iters ahead */

After the vector-accumulator widening in #138, the scan is load-latency bound while it
streams the counts[] array (tens of KB), so an explicit prefetch a few iterations ahead helps.
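To make the idea concrete, here is a minimal scalar sketch of a block-summed scan that prefetches ahead. It is not the code in this PR: `scan_up_to_count`, `PREFETCH_READ`, `BLOCK` and `PREFETCH_DISTANCE` are hypothetical names, the AVX2 accumulation is replaced by a plain inner loop, and `__builtin_prefetch(p, 0, 3)` is used as a portable GCC/clang stand-in for `_mm_prefetch(p, _MM_HINT_T0)`. It assumes non-negative counts, and it guards the prefetch so the address is never formed past the end of the array.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for _mm_prefetch(p, _MM_HINT_T0); a no-op elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

#define BLOCK ((size_t)16)             /* int64 elements summed per iteration  */
#define PREFETCH_DISTANCE ((size_t)64) /* 64 * 8 B = 512 B = 4 blocks ahead    */

/*
 * Hypothetical helper: returns the first index i such that
 * counts[0] + ... + counts[i] >= target, or len if the total never
 * reaches target. Assumes every count is non-negative.
 */
static size_t scan_up_to_count(const int64_t *counts, size_t len, int64_t target)
{
    int64_t total = 0;
    size_t idx = 0;

    while (idx + BLOCK <= len) {
        /* Hint the line we will need four blocks from now. */
        if (idx + PREFETCH_DISTANCE < len) {
            PREFETCH_READ(&counts[idx + PREFETCH_DISTANCE]);
        }

        /* Sum one block; the real scan does this in vector registers. */
        int64_t block_sum = 0;
        for (size_t j = 0; j < BLOCK; j++) {
            block_sum += counts[idx + j];
        }

        /* The target falls inside this block: stop and resolve it below. */
        if (total + block_sum >= target) {
            break;
        }
        total += block_sum;
        idx += BLOCK;
    }

    /* Element-wise walk over the crossing block or the short tail. */
    for (; idx < len; idx++) {
        total += counts[idx];
        if (total >= target) {
            return idx;
        }
    }
    return len;
}
```

The prefetch only changes when cache lines arrive, not what is read, so the function returns the same index with or without it.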

Stacked on #138. This branch contains #138's widening commit plus this one, so the diff
currently shows both. Please review/merge after #138 — once #138 lands, this PR reduces to
the single one-line prefetch commit. (I kept them as separate commits so each is reviewable on
its own.)

Benchmark

test/hdr_percentile_bench — hdr_value_at_percentile throughput, core-pinned, measured
base (this repo's AVX2 scan) vs +prefetch back-to-back in the same session, on two Intel
microarchitectures and both compilers:

µarch	Compiler	Base	+prefetch	Δ
Cascade Lake (Xeon Gold 6248)	gcc 11.4	0.37 M q/s	0.40 M q/s	+8%
Cascade Lake	clang 14.0	0.43 M q/s	0.43 M q/s	neutral
Granite Rapids	gcc 11.4	0.52 M q/s	0.56 M q/s	+7.7%
Granite Rapids	clang 14.0	0.53 M q/s	0.56 M q/s	+5.7%

gcc is consistently ~+8%; clang ranges neutral → +5.7% and never regresses. The write path
(hdr_histogram_perf) is a flat control on both µarchs (this change only touches the read scan).
The prefetch distance (64) is a reasonable default and could be tuned further per target.
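The arithmetic behind the table and the prefetch distance can be checked with a few lines of C. This is only a sketch: the helper names are hypothetical, and the deltas are recomputed from the rounded throughputs in the table, so they can differ slightly from figures derived from unrounded measurements.

```c
#include <stddef.h>
#include <stdint.h>

/* Relative change of candidate over base, in percent. */
static double percent_delta(double base, double candidate)
{
    return (candidate - base) / base * 100.0;
}

/* Prefetch distance in bytes for a distance given in int64 elements. */
static size_t prefetch_distance_bytes(size_t elems)
{
    return elems * sizeof(int64_t);
}

/* Prefetch distance in loop iterations, given elements consumed per iteration. */
static size_t prefetch_distance_iters(size_t elems, size_t elems_per_iter)
{
    return elems / elems_per_iter;
}
```

With a distance of 64 elements and 16 int64 per iteration, this gives 512 B and 4 iterations, matching the comment on the prefetch line. `percent_delta(0.52, 0.56)` comes out near 7.7 and `percent_delta(0.53, 0.56)` near 5.7, in line with the Granite Rapids rows.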

Steps to reproduce

cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DHDR_HISTOGRAM_BUILD_PROGRAMS=ON -DHDR_HISTOGRAM_BUILD_BENCHMARK=ON
cmake --build build -j
taskset -c 8 ./build/test/hdr_percentile_bench     # base vs this branch: M queries/sec + sink

Correctness

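_mm_prefetch is a non-faulting hint: the instruction does not trap on an unmapped or out-of-range address, and it does not change the values read. If idx + 64 runs past the end of counts[] near the tail, the address is only hinted and never dereferenced; this is harmless in practice on the tested compilers, although forming a pointer that far past the end of an array is, strictly, outside what the C standard defines.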
ctest green (gcc and clang); the benchmark sink is byte-identical to base
(17401860284404480), i.e. every percentile query returns exactly the same value.
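The sink comparison can be illustrated with a small sketch. It assumes the sink is a running sum of every query result, which may not be how hdr_percentile_bench combines them; `scan_base`, `scan_with_prefetch`, `run_sink` and `PREFETCH_READ` are hypothetical names, and `__builtin_prefetch` stands in for `_mm_prefetch` with `_MM_HINT_T0`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)(p))
#endif

typedef size_t (*scan_fn)(const int64_t *counts, size_t len, int64_t target);

/* Reference scan: first index whose running total reaches target, else len. */
static size_t scan_base(const int64_t *counts, size_t len, int64_t target)
{
    int64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += counts[i];
        if (total >= target) return i;
    }
    return len;
}

/* Same scan with a guarded prefetch 64 elements ahead. */
static size_t scan_with_prefetch(const int64_t *counts, size_t len, int64_t target)
{
    int64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        if (i + 64 < len) PREFETCH_READ(&counts[i + 64]);
        total += counts[i];
        if (total >= target) return i;
    }
    return len;
}

/* Fold every query result into one value (assumed: a plain running sum). */
static uint64_t run_sink(scan_fn scan, const int64_t *counts, size_t len,
                         int64_t total_count)
{
    uint64_t sink = 0;
    for (int64_t target = 1; target <= total_count; target++) {
        sink += (uint64_t)scan(counts, len, target);
    }
    return sink;
}
```

Equal sinks from `run_sink(scan_base, ...)` and `run_sink(scan_with_prefetch, ...)` over the same input are a cheap regression check. A summed sink can in principle hide offsetting differences, so it complements the test suite rather than replacing it.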

…ator get_value_from_idx_up_to_count_avx2 summed 4 int64/iter and did a horizontal reduction + 2x _mm_extract_epi64 + target-cross branch every 4 elements. Accumulate 16 int64/iter (4x256) in a vector register and reduce to a scalar block sum once per 16, so the costly GPR extracts and the early-exit branch run 4x less often. Scalar fallback and uint64 overflow hardening unchanged; percentile results bit-identical. clx1 (Cascade Lake), core-pinned, same-session A/B: hdr_value_at_percentile +137% (gcc 0.16->0.38 Mq/s) / +144% (clang 0.18->0.44 Mq/s). Read sink byte-identical.

The widened AVX2 percentile scan is memory-load-latency bound over the ~10s-of-KB counts[] array. Prefetch 4 iterations (512 B) ahead with _MM_HINT_T0 to hide L2/L3 latency.

Read throughput (hdr_value_at_percentile), same-session core-pinned A/B:
  Cascade Lake (Xeon Gold 6248): gcc +8%, clang neutral
  Granite Rapids: gcc +7.7%, clang +5.7%

Write path unaffected (control flat on both); percentile results bit-identical. Stacked on perf/avx2-percentile-scan-widen16 (PR HdrHistogram#138).

…µarch data HdrHistogram/HdrHistogram_c#139 (fcostaoliveira:perf/avx2-scan-prefetch -> HdrHistogram:main, +19/-7, 2 commits, MERGEABLE). Clearly labeled stacked-on-#138; reduces to the one-line prefetch once #138 lands. Logs (EXPERIMENTS/SUMMARY/README/memory) synced.

Filipe Oliveira added 2 commits July 1, 2026 10:48

fcostaoliveira wants to merge 2 commits into HdrHistogram:main from fcostaoliveira:perf/avx2-scan-prefetch
