Skip to content

Optimize xml_path and xml_find_all for large node sets#478

Open
astruebi wants to merge 2 commits into
r-lib:mainfrom
astruebi:codex/optimize-xml-path-nodesets
Open

Optimize xml_path and xml_find_all for large node sets#478
astruebi wants to merge 2 commits into
r-lib:mainfrom
astruebi:codex/optimize-xml-path-nodesets

Conversation

@astruebi

@astruebi astruebi commented Jun 5, 2026

Copy link
Copy Markdown

Closes #477

This PR improves performance for large node sets in two places:

  • xml_path() now caches reusable ancestor paths while processing an xml_nodeset, avoiding repeated parent walks and repeated construction of shared path prefixes.
  • xml_find_all.xml_node() avoids an unnecessary duplicate pass for single-context XPath node sets and materializes xml_node results with less per-node overhead.

The intended semantics are unchanged.

Tests added:

  • batch xml_path() output is identical to calling xml_path() on each node individually
  • single-context xml_find_all() XPath node sets remain unique even for union expressions

Implementation note:

Some repetition in the path occurrence logic is intentional. I tested a more compact shared abstraction for sibling occurrence counting, but it made the hot loop noticeably slower for large node sets. The current version keeps those cases explicit to preserve the performance gain while matching xmlGetNodePath() semantics.

Local checks:

  • testthat::test_dir("tests/testthat"): PASS
  • R CMD build --no-manual --no-build-vignettes .: OK
  • _R_CHECK_FORCE_SUGGESTS_=false R CMD check --no-manual --no-build-vignettes xml2_1.5.2.tar.gz: OK, with only vignette warnings due to --no-build-vignettes

@jeroen

jeroen commented Jun 22, 2026

Copy link
Copy Markdown
Member

This makes it really more difficult to read the code. How big is the performance gain in real life scenarios?

Can you explain the changes you made in xml2_xpath.cpp‎ please?

@astruebi

Copy link
Copy Markdown
Author

Thanks, that is a fair concern. The xml_path() part in particular is more code than I would like, because it has to preserve the existing xmlGetNodePath() behaviour for elements, attributes, text/CDATA, comments, processing instructions, namespaces, and positional predicates.

The motivation came from a real workload in fhircrackr, where after optimizing the package-level code, xml_path() became the dominant remaining hotspot. On FHIR bundles with many selected nodes from the same document, the same ancestor paths were being rebuilt many times.

Local measurements:

  • On large synthetic node sets with shared ancestors, xml_path(nodes) was about 8-11x faster than before.
  • On a real FHIR Consent bundle selection, the isolated xml_path() call went from roughly 2.4s to roughly 0.3s for about 415k selected nodes.
  • In the end-to-end fhircrackr hot path, using the optimized fhircrackr code and only swapping xml2, the full runtime improved substantially. Recent 5-run medians for count = 1000, ncores = 1 were approximately:
    • Consent: ~7.7s, down from ~20.3s with old xml2
    • Encounter: ~3.4s, down from ~11.7s
    • Condition: ~1.7s, down from ~5.5s
    • Observation: ~1.5s, down from ~4.8s
    • Patient: ~0.9s, down from ~3.2s

Those end-to-end numbers include the surrounding package work, so I would not present them as pure xml2 speedups. The isolated xml_path() benchmark is the cleaner measurement for this PR, but the FHIR case is the real-world workload that exposed the issue.

For xml2_xpath.cpp, the changes are smaller and are about result materialization, not changing XPath evaluation itself:

  1. xml_find_all.xml_node() now calls xml_nodeset(nodes, deduplicate = FALSE). This is only for the single-context-node method. libxml2 XPath node sets are already document-ordered unique node sets, including for union expressions, so the extra R-level duplicate pass is unnecessary there. The multi-node/flattening path still deduplicates as before, because combining results from multiple context nodes can introduce duplicates.

  2. In XmlSeeker::search(), the loop that converts the libxml2 node set to R objects now avoids some per-node overhead:

    • the "xml_node" class string is allocated once before the loop instead of once per returned node;
    • the external pointer is created directly with R_MakeExternalPtr() instead of constructing an XPtrNode wrapper for each XPath result node.

The reason I think that direct external pointer is OK here is that these are borrowed xmlNode* pointers owned by the document, and the returned R object keeps the document external pointer in its doc field, just like before. XPtrNode is still used when reading/checking incoming node pointers. The hot path here is only wrapping already-returned libxml2 nodes for R.

In local materialization-heavy benchmarks, the xml_find_all() part was roughly 1.5-2x faster for large result sets. The XPath evaluation itself was not the main cost in those cases; creating the R-side node objects was.

I am happy to split this PR if that would make review easier: the xml_find_all() materialization change is small and independent, while the xml_path() cache is the larger/readability-sensitive part.

@jeroen

jeroen commented Jun 23, 2026

Copy link
Copy Markdown
Member

I would have preferred a human answer.

@astruebi

Copy link
Copy Markdown
Author

Sorry for the overly quick previous reply. I had asked my agent to first
summarize local what we had done a few weeks ago, and it went ahead and posted
that before I had reviewed it properly.

Adding a few more concrete numbers after re-running this locally in isolated
temporary libraries for main, the first commit only, and the full PR.

The motivation is still the real-world FHIR/fhircrackr workload, but the
optimized pattern is general rather than FHIR-specific: large xml_nodeset
inputs where many nodes come from the same document and share ancestor paths.

Median timings from isolated xml2 benchmarks:

Case main first commit only full PR
Synthetic related nodes, xml_path() on 75k nodes 0.400s 0.032s 0.038s
FHIR Consent, xml_path() on 155k attributes 1.052s 0.148s 0.161s
FHIR Consent, xml_path() on 260k elements 1.717s 0.177s 0.182s
FHIR Encounter, xml_path() on 24k attributes 0.097s 0.025s 0.025s
FHIR Patient, xml_path() on 21k attributes 0.096s 0.019s 0.019s

So the large improvement comes from the first commit, the xml_path() cache. In
these local runs it is roughly 4x-10x faster for the large shared-ancestor node
sets that motivated the PR.

For small inputs, I did not see evidence of a regression in local repeated-call
benchmarks; timings were essentially unchanged and in the microsecond range.

For xml2_xpath.cpp, the XPath evaluation itself is unchanged. The code still
calls libxml2 through xmlXPathEval(), and libxml2 still produces the
xmlNodeSet.

The change is only in materializing that node set into R xml_node objects:

  • the "xml_node" class string is allocated once before the loop instead of
    once per returned node;
  • each returned xmlNode* is wrapped directly with R_MakeExternalPtr() instead
    of constructing a temporary XPtrNode wrapper for every result node.
    XPtrNode would create the same external pointer, but also preserve and
    release it for each temporary wrapper. In this path the nodes are borrowed
    from the owning document, which is still kept in the returned xml_node, so
    that extra per-result preserve/release cycle is unnecessary;
  • the returned xml_node object still stores the owning document pointer in its
    doc field, as before, so the lifetime model is unchanged.

The related R-side change is that xml_find_all.xml_node() now calls
xml_nodeset(..., deduplicate = FALSE). This is only for the single
context-node method. A libxml2 XPath node set is already unique and
document-ordered, including for union expressions. The flattening method for
multiple context nodes still deduplicates, because combining results from
multiple context nodes can introduce duplicates.

For large result sets, that second commit is smaller but still measurable:

Case main first commit only full PR
Synthetic xml_find_all("//@*"), 51k attributes 0.016s 0.015s 0.007s
FHIR Consent xml_find_all("//@*"), 155k attributes 0.043s 0.042s 0.018s

I agree that the xml_path() implementation is the main readability tradeoff.
The reason it is more than a small wrapper change is that the cached path
builder has to preserve the existing xmlGetNodePath() behaviour, including
namespaces, attributes, text/CDATA nodes, comments, processing instructions, and
the positional predicates for siblings. I tried a more compact shared helper for
some of this logic, but it made the hot loop noticeably slower, so the current
version keeps the node-type cases explicit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize xml_path() and xml_find_all() for large node sets

2 participants