Optimize xml_path and xml_find_all for large node sets#478
Conversation
|
This makes it really more difficult to read the code. How big is the performance gain in real life scenarios? Can you explain the changes you made in |
|
Thanks, that is a fair concern. The The motivation came from a real workload in Local measurements:
Those end-to-end numbers include the surrounding package work, so I would not present them as pure xml2 speedups. The isolated For
The reason I think that direct external pointer is OK here is that these are borrowed In local materialization-heavy benchmarks, the I am happy to split this PR if that would make review easier: the |
|
I would have preferred a human answer. |
|
Sorry for the overly quick previous reply. I had asked my agent to first Adding a few more concrete numbers after re-running this locally in isolated The motivation is still the real-world FHIR/ Median timings from isolated xml2 benchmarks:
So the large improvement comes from the first commit, the For small inputs, I did not see evidence of a regression in local repeated-call For The change is only in materializing that node set into R
The related R-side change is that For large result sets, that second commit is smaller but still measurable:
I agree that the |
Closes #477
This PR improves performance for large node sets in two places:
xml_path()now caches reusable ancestor paths while processing anxml_nodeset, avoiding repeated parent walks and repeated construction of shared path prefixes.xml_find_all.xml_node()avoids an unnecessary duplicate pass for single-context XPath node sets and materializesxml_noderesults with less per-node overhead.The intended semantics are unchanged.
Tests added:
xml_path()output is identical to callingxml_path()on each node individuallyxml_find_all()XPath node sets remain unique even for union expressionsImplementation note:
Some repetition in the path occurrence logic is intentional. I tested a more compact shared abstraction for sibling occurrence counting, but it made the hot loop noticeably slower for large node sets. The current version keeps those cases explicit to preserve the performance gain while matching
xmlGetNodePath()semantics.Local checks:
testthat::test_dir("tests/testthat"): PASSR CMD build --no-manual --no-build-vignettes .: OK_R_CHECK_FORCE_SUGGESTS_=false R CMD check --no-manual --no-build-vignettes xml2_1.5.2.tar.gz: OK, with only vignette warnings due to--no-build-vignettes