perf: prefetch counts[] ahead in the AVX2 percentile scan (follow-up to #138) #139

Open: fcostaoliveira wants to merge 2 commits into HdrHistogram:main from fcostaoliveira:perf/avx2-scan-prefetch
@fcostaoliveira

Contributor

Summary

Software-prefetch counts[] ahead of the AVX2 percentile scan
(get_value_from_idx_up_to_count_avx2) to hide L2/L3 load latency:

_mm_prefetch((const char*)&h->counts[idx + 64], _MM_HINT_T0);   /* 512 B / 4 iters ahead */

After the vector-accumulator widening in #138, the scan is load-latency bound as it streams
the counts[] array (tens of KB), so an explicit prefetch a few iterations ahead helps.
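
To make the idea concrete, here is a minimal scalar sketch of a block scan with a prefetch hint issued a fixed distance ahead. It is not the code in this PR: the function name `scan_up_to_count`, the `BLOCK` and `PREFETCH_AHEAD` macros, and the `SCAN_PREFETCH` wrapper are hypothetical, the AVX2 accumulation is replaced by a plain inner loop, and counts are assumed non-negative. The sketch also guards the prefetch address so it never points past the array, which is a conservative choice of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define SCAN_PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define SCAN_PREFETCH(p) ((void)(p)) /* no-op on non-x86 targets */
#endif

#define BLOCK 16          /* elements summed per iteration (hypothetical) */
#define PREFETCH_AHEAD 64 /* elements ahead: 64 * 8 B = 512 B = 4 blocks */

/* Returns the first index i with counts[0] + ... + counts[i] >= target,
 * or len if the total never reaches target. Assumes non-negative counts. */
static size_t scan_up_to_count(const int64_t *counts, size_t len, int64_t target)
{
    int64_t running = 0;
    size_t i = 0;

    assert(counts != NULL || len == 0);

    /* Block loop: one hint and one early-exit check per BLOCK elements. */
    while (i + BLOCK <= len) {
        if (i + PREFETCH_AHEAD < len) /* keep the hinted address in bounds */
            SCAN_PREFETCH(&counts[i + PREFETCH_AHEAD]);

        int64_t block_sum = 0;
        for (size_t k = 0; k < BLOCK; k++)
            block_sum += counts[i + k];

        if (running + block_sum >= target)
            break; /* the target is crossed inside this block */
        running += block_sum;
        i += BLOCK;
    }

    /* Scalar tail: resolve the crossing block or the leftover elements. */
    for (; i < len; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return len;
}
```

The hint only changes when cache lines arrive, so the returned index is the same with or without it.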

Stacked on #138. This branch contains #138's widening commit plus this one, so the diff
currently shows both. Please review/merge after #138 — once #138 lands, this PR reduces to
the single one-line prefetch commit. (I kept them as separate commits so each is reviewable on
its own.)

Benchmark

test/hdr_percentile_bench: hdr_value_at_percentile throughput, core-pinned. Base (this
repo's AVX2 scan) and +prefetch were measured back-to-back in the same session, on two Intel
microarchitectures, with both gcc and clang:

| µarch | Compiler | Base | +prefetch | Δ |
|---|---|---|---|---|
| Cascade Lake (Xeon Gold 6248) | gcc 11.4 | 0.37 M q/s | 0.40 M q/s | +8% |
| Cascade Lake | clang 14.0 | 0.43 M q/s | 0.43 M q/s | neutral |
| Granite Rapids | gcc 11.4 | 0.52 M q/s | 0.56 M q/s | +7.7% |
| Granite Rapids | clang 14.0 | 0.53 M q/s | 0.56 M q/s | +5.7% |

gcc is consistently ~+8%; clang ranges neutral → +5.7% and never regresses. The write path
(hdr_histogram_perf) is a flat control on both µarchs (this change only touches the read scan).
The prefetch distance (64) is a reasonable default and could be tuned further per target.
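
For anyone tuning the distance, the arithmetic behind "512 B / 4 iters ahead" can be sketched as below. The macro `HDR_SCAN_PREFETCH_ELEMS` is a hypothetical compile-time knob (the PR hard-codes 64), the helper names are made up for illustration, and the 64-byte cache line is an assumption about the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob; override with -DHDR_SCAN_PREFETCH_ELEMS=N. */
#ifndef HDR_SCAN_PREFETCH_ELEMS
#define HDR_SCAN_PREFETCH_ELEMS 64
#endif

#define SCAN_ELEMS_PER_ITER 16 /* 4 x 256-bit vectors of int64 per iteration */
#define CACHE_LINE_BYTES 64    /* assumed cache line size */

/* Distance in bytes for a distance given in int64 elements. */
static size_t prefetch_bytes_ahead(size_t elems)
{
    return elems * sizeof(int64_t);
}

/* Distance in loop iterations of the 16-element scan. */
static size_t prefetch_iters_ahead(size_t elems)
{
    return elems / SCAN_ELEMS_PER_ITER;
}

/* Distance in cache lines, under the assumed line size. */
static size_t prefetch_lines_ahead(size_t elems)
{
    return prefetch_bytes_ahead(elems) / CACHE_LINE_BYTES;
}

/* Bytes of counts[] consumed by one loop iteration. */
static size_t bytes_per_iter(void)
{
    return SCAN_ELEMS_PER_ITER * sizeof(int64_t);
}
```

With the default of 64 elements this gives 512 bytes, 4 iterations, and 8 cache lines ahead, while each iteration consumes 128 bytes.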

Steps to reproduce

cmake -S . -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DHDR_HISTOGRAM_BUILD_PROGRAMS=ON -DHDR_HISTOGRAM_BUILD_BENCHMARK=ON
cmake --build build -j
taskset -c 8 ./build/test/hdr_percentile_bench     # base vs this branch: M queries/sec + sink

Correctness

  • _mm_prefetch is a non-faulting hint — no bounds/UB implications, no change to the values read.
  • ctest green (gcc and clang); the benchmark sink is byte-identical to base
    (17401860284404480), i.e. every percentile query returns exactly the same value.
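
As an illustration of the sink idea, the sketch below folds the result of every query into one 64-bit value and compares a plain scan against a hinted one. It is an assumption-laden stand-in, not the benchmark's code: `scan_plain`, `scan_hinted`, `run_queries`, and the folding formula are hypothetical, and `__builtin_prefetch` (a gcc/clang builtin) is used in place of `_mm_prefetch` to keep the example portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference scan: first index whose running total reaches target, else len. */
static size_t scan_plain(const int64_t *counts, size_t len, int64_t target)
{
    int64_t running = 0;
    for (size_t i = 0; i < len; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return len;
}

/* Same scan with a read hint 64 elements ahead (guarded to stay in bounds). */
static size_t scan_hinted(const int64_t *counts, size_t len, int64_t target)
{
    int64_t running = 0;
    for (size_t i = 0; i < len; i++) {
        if ((i % 16) == 0 && i + 64 < len)
            __builtin_prefetch(&counts[i + 64], 0, 3); /* read, keep in cache */
        running += counts[i];
        if (running >= target)
            return i;
    }
    return len;
}

typedef size_t (*scan_fn)(const int64_t *, size_t, int64_t);

/* Fold every query result into one value so the compiler cannot drop the
 * queries and so two builds can be compared with a single number. */
static uint64_t run_queries(scan_fn scan, const int64_t *counts, size_t len,
                            int64_t total)
{
    uint64_t sink = 0;
    for (int64_t target = 1; target <= total; target += 7)
        sink = sink * 31u + (uint64_t)scan(counts, len, target);
    return sink;
}
```

Calling `run_queries` once with each scan over the same counts[] and comparing the two sinks mirrors the base vs +prefetch comparison described above.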

Filipe Oliveira added 2 commits July 1, 2026 10:48
…ator

get_value_from_idx_up_to_count_avx2 summed 4 int64/iter and did a horizontal
reduction + 2x _mm_extract_epi64 + target-cross branch every 4 elements. Accumulate
16 int64/iter (4x256) in a vector register and reduce to a scalar block sum once per
16, so the costly GPR extracts and the early-exit branch run 4x less often. Scalar
fallback and uint64 overflow hardening unchanged; percentile results bit-identical.
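
A rough sketch of the widening described above, for readers without the diff open: four 256-bit loads are added in vector registers and reduced to a scalar once per 16 elements, so the extract and the early-exit branch run once per block. This is not the code in get_value_from_idx_up_to_count_avx2. The names `block_sum16_avx2`, `scan_widen16_avx2`, `scan_scalar`, and `scan_dispatch` are hypothetical, the overflow hardening is omitted, counts are assumed non-negative, and gcc or clang on x86-64 is assumed for the `target` attribute and `__builtin_cpu_supports`.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference and fallback: first index reaching target, else len. */
static size_t scan_scalar(const int64_t *counts, size_t len, int64_t target)
{
    int64_t running = 0;
    for (size_t i = 0; i < len; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return len;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Sum 16 int64 values with four 256-bit loads, reducing to a scalar once. */
__attribute__((target("avx2")))
static int64_t block_sum16_avx2(const int64_t *p)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)(p + 0));
    __m256i b = _mm256_loadu_si256((const __m256i *)(p + 4));
    __m256i c = _mm256_loadu_si256((const __m256i *)(p + 8));
    __m256i d = _mm256_loadu_si256((const __m256i *)(p + 12));
    __m256i acc = _mm256_add_epi64(_mm256_add_epi64(a, b),
                                   _mm256_add_epi64(c, d));

    /* Horizontal reduction: fold 256 -> 128 bits, then add the two lanes. */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s = _mm_add_epi64(lo, hi);
    return (int64_t)_mm_cvtsi128_si64(s) + (int64_t)_mm_extract_epi64(s, 1);
}

__attribute__((target("avx2")))
static size_t scan_widen16_avx2(const int64_t *counts, size_t len, int64_t target)
{
    int64_t running = 0;
    size_t i = 0;

    /* One reduction and one early-exit check per 16 elements. */
    while (i + 16 <= len) {
        int64_t block = block_sum16_avx2(counts + i);
        if (running + block >= target)
            break; /* crossing happens inside this block */
        running += block;
        i += 16;
    }
    /* Scalar walk through the crossing block or the tail. */
    for (; i < len; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return len;
}
#endif

/* Use the AVX2 path only when the CPU reports support for it. */
static size_t scan_dispatch(const int64_t *counts, size_t len, int64_t target)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return scan_widen16_avx2(counts, len, target);
#endif
    return scan_scalar(counts, len, target);
}
```

Because the block sum is exact integer addition, the widened scan returns the same index as the scalar one for every target.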

clx1 (Cascade Lake), core-pinned, same-session A/B: hdr_value_at_percentile
+137% (gcc 0.16->0.38 Mq/s) / +144% (clang 0.18->0.44 Mq/s). Read sink byte-identical.

The widened AVX2 percentile scan is memory-load-latency bound over the ~10s-of-KB
counts[] array. Prefetch 4 iterations (512 B) ahead with _MM_HINT_T0 to hide L2/L3
latency. Read throughput (hdr_value_at_percentile), same-session core-pinned A/B:
  Cascade Lake (Xeon Gold 6248): gcc +8%, clang neutral
  Granite Rapids:                gcc +7.7%, clang +5.7%
Write path unaffected (control flat on both); percentile results bit-identical.

Stacked on perf/avx2-percentile-scan-widen16 (PR HdrHistogram#138).
fcostaoliveira pushed a commit to redis-performance/hdr-agent-workspace that referenced this pull request Jul 1, 2026
…µarch data

HdrHistogram/HdrHistogram_c#139 (fcostaoliveira:perf/avx2-scan-prefetch
-> HdrHistogram:main, +19/-7, 2 commits, MERGEABLE). Clearly labeled stacked-on-#138; reduces to
the one-line prefetch once #138 lands. Logs (EXPERIMENTS/SUMMARY/README/memory) synced.