perf(cd2pd): thread the structured cell->point interpolation loop (byte-exact) by akaszynski · Pull Request #149 · pyvista/fvtk

akaszynski · 2026-06-23T01:27:54Z

Summary

Threads the per-output-point interpolation loop in
vtkCellDataToPointData::InterpolatePointData(input, output). A plain
vtkImageData (and other structured datasets with no blanking) routes here from
RequestData, and the ptId loop was still serial even though the rest of
fvtk threads via the default STDThread SMP backend.

The loop is now a vtkSMPTools::For wrapped in fvtk::RunSafeFilterParallel
(the established bit-exact-safe opt-in, with the usual GetSingleThread()-guarded
UpdateProgress/CheckAbort and re-entrancy guard). cellIds and the weights[]
buffer are thread-local (vtkSMPThreadLocalObject<vtkIdList> + a per-thread stack
buffer).

The crux of thread-safety: every output point-data array is pre-sized to
numberOfPoints tuples up front. InterpolateAllocate() only reserves capacity
(MaxId == -1); after the presize, each InterpolateTuple(ptId,…) /
InsertTuple(ptId,…) / NullData(ptId) is a pure store into an already-existing
tuple, with no realloc and no MaxId bump on any thread. NullData() inserts into
every array in the output (not just the interpolated ones), so the presize also
covers the pass-through arrays copied from the input point data; those already
hold exactly numberOfPoints tuples, so resizing them is a no-op.
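
The presize argument can be illustrated outside VTK. The following is a minimal Python sketch, not the fvtk code: `interpolate_range` and `run_threaded` are hypothetical names, the 1D two-cell average is a simplification of the structured 3D case, and a plain list stands in for an output array that already holds numberOfPoints tuples.

```python
from concurrent.futures import ThreadPoolExecutor


def interpolate_range(cell_values, out, begin, end):
    """Fill out[begin:end]; every store targets a slot that already exists."""
    n_cells = len(cell_values)
    for pt_id in range(begin, end):
        # Simplified 1D stencil: point pt_id touches cells pt_id - 1 and pt_id.
        ids = [c for c in (pt_id - 1, pt_id) if 0 <= c < n_cells]
        out[pt_id] = sum(cell_values[c] for c in ids) / len(ids)


def run_threaded(cell_values, n_threads):
    n_points = len(cell_values) + 1
    out = [0.0] * n_points  # presize: no thread ever grows the container
    chunk = -(-n_points // n_threads)  # ceiling division
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(interpolate_range, cell_values, out,
                        begin, min(begin + chunk, n_points))
            for begin in range(0, n_points, chunk)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return out


cells = [float(i * i) for i in range(1000)]
serial = run_threaded(cells, 1)
threaded = run_threaded(cells, 4)
```

Because each worker owns a disjoint index range and only assigns to existing slots, the threaded result matches the single-threaded one element for element.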

Parity bucket: byte-exact, default-on

This is bucket 1 — byte-for-byte identical to stock VTK 9.6.2 (maxULP = 0,
same values AND same order), so it ships on by default.

Byte-exactness argument:

- The output is index-addressed by ptId. Threads get disjoint ptId
  sub-ranges, so they write to disjoint, pre-sized output tuples: zero write
  conflict, and emission order is preserved exactly.
- The per-point average sums the same (≤ 8) terms in the same index order
  regardless of how the range is partitioned across threads, so there is no
  floating-point reassociation across iterations. InterpolatePoint →
  InterpolateTuple iterates the same cellIds list (produced identically by the
  existing pure StructuredGetPointCells) in the same order.
- Reads are from the input cell data (processedCellData), a distinct object from
  the output; the per-thread scratch (cellIds, weights) is the only mutable
  state and it is thread-local.
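
The reassociation point can be checked in isolation. This sketch is illustrative only (`ordered_sum` and `as_bytes` are hypothetical helpers, and the sample terms are made up): it shows that reordering the terms of one sum can change the result bits, whereas splitting the outer loop over points never touches the order inside a single point's sum.

```python
import struct


def ordered_sum(terms):
    """Accumulate strictly left to right, mirroring a fixed per-point order."""
    acc = 0.0
    for term in terms:
        acc += term
    return acc


def as_bytes(values):
    """Raw little-endian float64 bytes, for a byte-for-byte comparison."""
    return struct.pack(f"<{len(values)}d", *values)


terms = [0.1, 0.2, 0.3]
forward = ordered_sum(terms)
backward = ordered_sum(reversed(terms))
# Reordering the terms of ONE sum changes the rounding.
reassociated_differs = as_bytes([forward]) != as_bytes([backward])

# Partitioning the OUTER loop leaves each per-point sum untouched.
points = [[0.1, 0.2, 0.3], [1e16, 1.0, -1e16], [0.7, 0.1]]
whole = [ordered_sum(p) for p in points]
split = [ordered_sum(p) for p in points[:1]] + [ordered_sum(p) for p in points[1:]]
partition_is_exact = as_bytes(whole) == as_bytes(split)
```

The explicit loop is used instead of the built-in `sum()` because the summation order is the property under test.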

The structured inputs that reach this path take the pure StructuredGetPointCells
traversal (no shared state). For the rare non-structured fallback, any lazy
incident-cell structure is primed once on the main thread before the parallel
region so the first GetPointCells() cannot race.
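
The priming step follows a general pattern: force any lazily built structure once, serially, before handing the object to worker threads. A minimal sketch, assuming a hypothetical `LazyPointCells` class in place of the real dataset (nothing here is VTK API):

```python
from concurrent.futures import ThreadPoolExecutor


class LazyPointCells:
    """Hypothetical dataset that builds its point-to-cell links on first use."""

    def __init__(self, cells):
        self._cells = cells  # each cell is a tuple of point ids
        self._links = None   # built lazily
        self.build_count = 0

    def _build(self):
        links = {}
        for cell_id, point_ids in enumerate(self._cells):
            for pt_id in point_ids:
                links.setdefault(pt_id, []).append(cell_id)
        self._links = links
        self.build_count += 1

    def point_cells(self, pt_id):
        # Unsynchronized check: racy if the FIRST call happens concurrently.
        if self._links is None:
            self._build()
        return self._links.get(pt_id, [])


dataset = LazyPointCells([(0, 1, 2), (1, 2, 3), (2, 3, 4)])
dataset.point_cells(0)  # prime once on the main thread

# After priming, workers only read the already-built links.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(dataset.point_cells, range(5)))
```

Priming avoids adding a lock to the hot path: once the links exist, every query in the parallel region is a read-only lookup.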

Expected win

An estimated 2–6× on large vtkImageData cell-data → point-data conversions,
scaling with point count; parallelism is capped at the fvtk default of 4
threads. This is an expectation, not a measurement.

Validation gate

- tests/bitexact/ops.py::op_cell2point drives vtkCellDataToPointData on a
  vtkImageData with cell-data scalars (exactly the image path modified here) and
  is in the modified gate group (float32/float64, sizes 20/32), so it is covered
  at maxULP = 0 against stock VTK 9.6.2.
- tests/bitexact/test_smp_determinism.py: added cell2point to THREADED_OPS,
  asserting byte-identical output at 1 / 4 / 8 threads (which holds by
  construction, since the writes go to disjoint indices).
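
For readers unfamiliar with a maxULP gate, the check can be sketched as follows. This is an assumption about what such a comparison looks like, not the code in tests/bitexact: `ordered_bits` and `max_ulp` are hypothetical helpers, only float64 is handled, and NaN inputs are not considered.

```python
import struct


def ordered_bits(x):
    """Map a float64 to an integer whose ordering matches the float ordering."""
    (raw,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set; mirror them below zero.
    return raw if raw >= 0 else -(raw & 0x7FFFFFFFFFFFFFFF)


def max_ulp(actual, expected):
    """Largest distance, in units in the last place, between two sequences."""
    if len(actual) != len(expected):
        raise ValueError("arrays must have the same length")
    return max(
        (abs(ordered_bits(a) - ordered_bits(e)) for a, e in zip(actual, expected)),
        default=0,
    )


reference = [0.0, 1.0, -2.5, 1e-300]
identical = list(reference)
nudged = [0.0, 1.0 + 2.0**-52, -2.5, 1e-300]  # one value moved by one ULP

gate_passes = max_ulp(identical, reference) == 0
gate_fails = max_ulp(nudged, reference) > 0
```

Note that this distance treats -0.0 and 0.0 as equal, so a strict byte-for-byte comparison is slightly stronger than maxULP = 0 computed this way.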

No local build was run (disk/time constrained); relying on CI, which installs the
built wheel and runs tests/bitexact at maxULP = 0.

perf(cd2pd): thread the structured cell->point interpolation loop (by…

a9159ae

akaszynski mentioned this pull request Jun 23, 2026

ci: build the aarch64 PR wheel LTO-off (PR-only de-opt; shipped wheel unchanged) #148

Closed