Skip to content

OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817

Open
bhabegger wants to merge 15 commits into
apache:OAK-12089from
bhabegger:oak-12089-lucene9-core
Open

OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817
bhabegger wants to merge 15 commits into
apache:OAK-12089from
bhabegger:oak-12089-lucene9-core

Conversation

@bhabegger

@bhabegger bhabegger commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Introduces oak-search-luceneNg, a new Oak module that provides a Lucene 9 based index engine. Indexes opt in explicitly via type=lucene9; all existing indexes are unaffected.

  • New oak-search-luceneNg module: index editor, query index, tracker, index node, storage (Oak JCR node-based directory), and OSGi wiring
  • Full query parity with the legacy lucene engine: property restrictions, path/type filters, fulltext, sorting, and excerpts
  • Facet parity for all three ACL modes: insecure, statistical (TapeSampling), and secure (per-document access check) — Lucene 9 API adaptations and null-safe MatchingDocs.bits handling
  • LuceneNgFacetCommonTest extends the shared FacetCommonTest suite for end-to-end JCR-level facet coverage
  • AbstractIndexComparisonTest inlined into oak-search test-jar; oak-search-test module removed
  • README documents feature parity vs legacy Lucene and Elastic

@bhabegger bhabegger force-pushed the oak-12089-lucene9-core branch from e4bc5ad to dbccf97 Compare March 26, 2026 15:11
@bhabegger bhabegger marked this pull request as draft March 26, 2026 15:13
@bhabegger bhabegger force-pushed the oak-12089-lucene9-core branch from 28c304c to 66fb544 Compare March 27, 2026 16:59
@bhabegger bhabegger changed the title feat: add lucene9 index provider as safe opt-in target feat: add Lucene 9 index provider (oak-search-luceneNg) Mar 27, 2026
@bhabegger bhabegger marked this pull request as ready for review March 30, 2026 05:10
Comment thread oak-search-luceneNg/README.md Outdated
Comment thread oak-search-luceneNg/README.md Outdated

@thomasmueller thomasmueller left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some bikeshedding: I wonder if we should use the term "Lucene 9" at all.

Comment thread pom.xml Outdated
@@ -0,0 +1,26 @@
# oak-search-luceneNg

Lucene 9 index provider for Oak (`type="lucene9"`).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about:

Suggested change
Lucene 9 index provider for Oak (`type="lucene9"`).
Lucene NG index provider for Oak (`type="lucene-ng"`).

Hoping we can upgrade to Lucene 10 without having to change the type. I do assume the index storage version won't change, or that there is an option to add compatibility.

If not, and we do need to reindex for lucene 10, we can still add "lucene-ng-10" if that should be needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is that the code should be less sensitive to Lucene upgrades. However, I wasn't really sure about having a fixed name and the in some future having to have a lucene-ng-ng ;) Here we could interpret it as lucene since 9 and still use that for lucene 10, 11 up to and imaginary lucene 12 that breaks compatibility again (which might never happen as likely the APIs are more stable now).

Personally, I fine with lucene-ng as I myself had doubts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the NG in the code for now is fine since we do have 2 versions, but at some point my expectation is that the legacy code be removed and the NG no longer be new generation and be refactored with proper non NG names. So in the code I'm fine, in the type however, this sticks more. That's mostly what bother me with ng in the type.

@bhabegger bhabegger Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasmueller What about lucene2026 ? Less tied to 9 doesn't have explicit ng ?

@bhabegger bhabegger changed the title feat: add Lucene 9 index provider (oak-search-luceneNg) OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg) Apr 10, 2026
bhabegger added 5 commits June 5, 2026 09:59
Introduces oak-search-luceneNg, a new Oak module providing a Lucene 9
based index engine under type=lucene9, with full parity to the legacy
lucene implementation for property queries, fulltext, sorting, excerpts,
and facets (insecure, statistical, and secure ACL modes).

Key changes:
- New oak-search-luceneNg module: index editor, query index, tracker,
  index node, storage, and OSGi wiring
- Facet parity: LuceneNgSecure/StatisticalSortedSetDocValuesFacetCounts
  ported to Lucene 9 APIs with null-safe MatchingDocs.bits handling
- LuceneNgFacetCommonTest extends FacetCommonTest for JCR-level coverage
- AbstractIndexComparisonTest inlined into oak-search test-jar;
  oak-search-test module removed
- getRootBuilder removed from ContextAwareCallback and IndexUpdate
- leaf OSGi property removed from LuceneIndexProviderService
- README documents feature parity vs legacy Lucene and Elastic

Made-with: Cursor
When indexNodeName=true, the index editor writes the namespace-stripped
local name of each node into FieldNames.NODE_NAME. The query engine maps
LOCALNAME() equality and LIKE restrictions to TermQuery/WildcardQuery on
that field.

Function restrictions prefixed with "function*@" (e.g. "function*@:localname")
are generated alongside the dedicated ":localname" restriction by Oak's SQL2
parser; they are now silently dropped from plan evaluation, cost calculation,
and the Lucene query to prevent false negatives.

Adds NodeNameCommonTest (shared) and LuceneNgNodeNameCommonTest.

Made-with: Cursor
…ADME

Address PR review comments from thomasmueller:
- Rename "Multi-index queries" to "Composite node store queries" and add a
  footnote explaining the composite node store scenario.
- Add a footnote for "Index augmentors" describing the IndexFieldProvider /
  FulltextQueryTermsProvider extension points.

Made-with: Cursor
Follows Maven/Oak convention of lowercase hyphenated artifact names.
The Java package (org.apache.jackrabbit.oak.plugins.index.luceneNg) is
unchanged as it is an internal implementation detail.

Made-with: Cursor
…etCommonTest

The three tests (basic faceting, multiple dimensions, facet with filter)
are all already exercised by FacetCommonTest via the JCR API.
LuceneNgFacetCommonTest runs that suite against Lucene 9 and is the
canonical coverage. The ignored class added no value.

Made-with: Cursor
@bhabegger bhabegger force-pushed the oak-12089-lucene9-core branch from f0a7199 to de83bda Compare June 5, 2026 07:59
@bhabegger bhabegger changed the base branch from OAK-12089 to trunk June 5, 2026 08:12
@bhabegger bhabegger changed the base branch from trunk to OAK-12089 June 5, 2026 08:13
bhabegger and others added 10 commits June 8, 2026 16:51
LuceneNgIndexNode now mirrors the legacy LuceneIndexNodeManager pattern:
- acquire() takes a read lock, incRefs the DirectoryReader, and returns an
  AcquiredNode that the caller must release() when done
- close() (called by the tracker on eviction) takes the write lock, which
  blocks until all in-flight AcquiredNodes have been released, then closes
  the searcher holder
- LuceneNgCursor accepts an AcquiredNode and releases it on close()
- getCost() and getPlans() wrap acquire in try-finally
- IndexSearcherHolder now closes the OakDirectory it owns

Without this, a concurrent tracker refresh could close the DirectoryReader
while a cursor was still iterating hits, causing AlreadyClosedException.
Also fixes the OakDirectory leak in IndexSearcherHolder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xNode assertion in test

IndexSearcherHolder: private OakDirectory directory was inserted between the
constructor Javadoc and its declaration — moved it to the field block.

LuceneNgIndexTrackerTest.testGetIndexNode: test was changed to only verify
path tracking via getIndexPaths(), dropping the acquireIndexNode() call
entirely. Restored it: the test now asserts the path is tracked AND that
acquireIndexNode() returns null for an index with no data written yet,
which is the correct post-refactor behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The legacy OakBufferedIndexFile appends a random 16-byte key to every
blob it writes (OAK-7066). This key is stored on the file node as
PROP_UNIQUE_KEY and makes blob content unique across writes, preventing
the blob store GC from reclaiming blobs that are still referenced by a
live index file.

OakDirectory now generates and writes PROP_UNIQUE_KEY when creating a
new output file. OakBufferedIndexFile reads it back, subtracts the key
length from the reported file length, and appends it to each blob via
SequenceInputStream in flushBlob(), matching the legacy behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LuceneNgIndexNodeTest verifies:
- acquire() returns a non-null AcquiredNode when index data exists
- acquire() returns null after close()
- release() is idempotent
- close() blocks until all AcquiredNodes are released (write-lock guard)

OakDirectoryTest adds:
- PROP_UNIQUE_KEY is written to every new file node
- the key is exactly UNIQUE_KEY_SIZE bytes (32 hex chars)
- fileLength() excludes the key suffix; the raw blob includes it
- every file gets a distinct key

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StandardAnalyzer implements Closeable and holds a CloseableThreadLocal.
Both buildQuery() and generateExcerpts() were creating a new instance
per call and leaving it unclosed, leaking thread-local storage under
query load. Wrapping each in try-with-resources ensures timely cleanup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…IndexEditor

OakDirectory is a local variable created just before IndexWriter. If
IndexWriter(directory, config) throws, the directory goes out of scope
unclosed — PROP_DIR_LISTING is never written and any partially written
segments are left open. Close the directory in the catch block so the
storage is consistent regardless of whether the writer succeeds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ment

FacetsConfig was instantiated inside indexNode() on every document,
with setMultiValued determined by counting field occurrences in each
individual document. This is incorrect: the config must be consistent
across all documents in the same writer session, and multi-valued must
be set if the property can ever carry more than one value.

FacetsConfig is now built once in the constructor by scanning the index
definition's property rules, marking every facet dimension as
multi-valued (safe for single-valued properties too). Child editors
receive the same shared instance via their constructor. The per-document
dimension-counting Map and its HashMap/Map imports are removed.

Test: LuceneNgFacetsConfigTest verifies that multi-valued facet
properties across multiple documents produce correct facet counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, deleting an index file silently dropped its blobs from the
repository without telling the blob store GC. The GC could only reclaim
them during the next full mark-and-sweep scan.

Adds BlobDeletionCallback interface (mirrors legacy
ActiveDeletedBlobCollectorFactory.BlobDeletionCallback without the
oak-lucene dependency). OakDirectory.deleteFile() now iterates the
JCR_DATA blobs before removing the node and invokes the callback for
each blob with a non-null content identity (inlined blobs are ignored
since they are not tracked by the blob store GC).

The default 3-arg constructor uses BlobDeletionCallback.NOOP for
backward compatibility. A new 4-arg public constructor and a
package-private 5-arg constructor (BlobFactory + callback) allow callers
and tests to supply a real callback.

TODO: wire the callback from LuceneNgIndexEditorProvider through
LuceneNgIndexEditor to OakDirectory once the full
ActiveDeletedBlobCollectorFactory integration is added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…istical facet classes

Both LuceneNgSecure... and LuceneNgStatistical... contained identical
per-document accessibility checks (load stored fields, read PATH,
call filter.isAccessible). A security fix would have to be applied to
both classes independently.

Extracted the check as a package-private static isDocAccessible()
on LuceneNgSecureSortedSetDocValuesFacetCounts and updated the
statistical class to delegate to it. Also adds a null-guard on the
PATH field (getField returns null for documents without that field),
which was a latent NPE in both classes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uery builders

#9 — DefaultSortedSetDocValuesReaderState is expensive: it reads and
indexes all ordinals from a field's doc values. It was being reconstructed
on every query per facet field. Added a ConcurrentHashMap cache in
IndexSearcherHolder keyed by field name; the cache is scoped to the reader
lifetime so it is discarded on index refresh. Exposed via
AcquiredNode.getFacetReaderState() and used in LuceneNgIndex.query(IndexPlan).

#10 — createLongQuery and createDoubleQuery were structurally identical
(same equality/range/IN/NOT control flow) with only the type, Lucene
Point class, and exclusive-bound adjustment differing. Introduced a
package-private NumericPoint<T> interface with two static instances
(LONG_POINT, DOUBLE_POINT) and a shared createNumericQuery() helper.
createLongQuery/createDoubleQuery now delegate to it. Adding a future
numeric type (e.g. DECIMAL) requires only a new NumericPoint instance,
not a copy of the entire restriction-pattern decision tree.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants