OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg) by bhabegger · Pull Request #2817 · apache/jackrabbit-oak

bhabegger · 2026-03-26T15:07:12Z

Summary

Introduces oak-search-luceneNg, a new Oak module that provides a Lucene 9 based index engine. Indexes opt in explicitly via type=lucene9; all existing indexes are unaffected.

New oak-search-luceneNg module: index editor, query index, tracker, index node, storage (Oak JCR node-based directory), and OSGi wiring
Full query parity with the legacy lucene engine: property restrictions, path/type filters, fulltext, sorting, and excerpts
Facet parity for all three ACL modes: insecure, statistical (TapeSampling), and secure (per-document access check) — Lucene 9 API adaptations and null-safe MatchingDocs.bits handling
LuceneNgFacetCommonTest extends the shared FacetCommonTest suite for end-to-end JCR-level facet coverage
AbstractIndexComparisonTest inlined into oak-search test-jar; oak-search-test module removed
README documents feature parity vs legacy Lucene and Elastic

thomasmueller

Just some bikeshedding: I wonder if we should use the term "Lucene 9" at all.

thomasmueller · 2026-03-30T14:06:43Z

@@ -0,0 +1,26 @@
+# oak-search-luceneNg
+
+Lucene 9 index provider for Oak (`type="lucene9"`).


What about:

Suggested change

Lucene 9 index provider for Oak (`type="lucene9"`).

Lucene NG index provider for Oak (`type="lucene-ng"`).

Hoping we can upgrade to Lucene 10 without having to change the type. I do assume the index storage version won't change, or that there is an option to add compatibility.

If not, and we do need to reindex for lucene 10, we can still add "lucene-ng-10" if that should be needed.

So the idea is that the code should be less sensitive to Lucene upgrades. However, I wasn't really sure about having a fixed name and the in some future having to have a lucene-ng-ng ;) Here we could interpret it as lucene since 9 and still use that for lucene 10, 11 up to and imaginary lucene 12 that breaks compatibility again (which might never happen as likely the APIs are more stable now).

Personally, I fine with lucene-ng as I myself had doubts.

Also the NG in the code for now is fine since we do have 2 versions, but at some point my expectation is that the legacy code be removed and the NG no longer be new generation and be refactored with proper non NG names. So in the code I'm fine, in the type however, this sticks more. That's mostly what bother me with ng in the type.

@thomasmueller What about lucene2026 ? Less tied to 9 doesn't have explicit ng ?

Introduces oak-search-luceneNg, a new Oak module providing a Lucene 9 based index engine under type=lucene9, with full parity to the legacy lucene implementation for property queries, fulltext, sorting, excerpts, and facets (insecure, statistical, and secure ACL modes). Key changes: - New oak-search-luceneNg module: index editor, query index, tracker, index node, storage, and OSGi wiring - Facet parity: LuceneNgSecure/StatisticalSortedSetDocValuesFacetCounts ported to Lucene 9 APIs with null-safe MatchingDocs.bits handling - LuceneNgFacetCommonTest extends FacetCommonTest for JCR-level coverage - AbstractIndexComparisonTest inlined into oak-search test-jar; oak-search-test module removed - getRootBuilder removed from ContextAwareCallback and IndexUpdate - leaf OSGi property removed from LuceneIndexProviderService - README documents feature parity vs legacy Lucene and Elastic Made-with: Cursor

When indexNodeName=true, the index editor writes the namespace-stripped local name of each node into FieldNames.NODE_NAME. The query engine maps LOCALNAME() equality and LIKE restrictions to TermQuery/WildcardQuery on that field. Function restrictions prefixed with "function*@" (e.g. "function*@:localname") are generated alongside the dedicated ":localname" restriction by Oak's SQL2 parser; they are now silently dropped from plan evaluation, cost calculation, and the Lucene query to prevent false negatives. Adds NodeNameCommonTest (shared) and LuceneNgNodeNameCommonTest. Made-with: Cursor

…ADME Address PR review comments from thomasmueller: - Rename "Multi-index queries" to "Composite node store queries" and add a footnote explaining the composite node store scenario. - Add a footnote for "Index augmentors" describing the IndexFieldProvider / FulltextQueryTermsProvider extension points. Made-with: Cursor

Follows Maven/Oak convention of lowercase hyphenated artifact names. The Java package (org.apache.jackrabbit.oak.plugins.index.luceneNg) is unchanged as it is an internal implementation detail. Made-with: Cursor

…etCommonTest The three tests (basic faceting, multiple dimensions, facet with filter) are all already exercised by FacetCommonTest via the JCR API. LuceneNgFacetCommonTest runs that suite against Lucene 9 and is the canonical coverage. The ignored class added no value. Made-with: Cursor

LuceneNgIndexNode now mirrors the legacy LuceneIndexNodeManager pattern: - acquire() takes a read lock, incRefs the DirectoryReader, and returns an AcquiredNode that the caller must release() when done - close() (called by the tracker on eviction) takes the write lock, which blocks until all in-flight AcquiredNodes have been released, then closes the searcher holder - LuceneNgCursor accepts an AcquiredNode and releases it on close() - getCost() and getPlans() wrap acquire in try-finally - IndexSearcherHolder now closes the OakDirectory it owns Without this, a concurrent tracker refresh could close the DirectoryReader while a cursor was still iterating hits, causing AlreadyClosedException. Also fixes the OakDirectory leak in IndexSearcherHolder. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…xNode assertion in test IndexSearcherHolder: private OakDirectory directory was inserted between the constructor Javadoc and its declaration — moved it to the field block. LuceneNgIndexTrackerTest.testGetIndexNode: test was changed to only verify path tracking via getIndexPaths(), dropping the acquireIndexNode() call entirely. Restored it: the test now asserts the path is tracked AND that acquireIndexNode() returns null for an index with no data written yet, which is the correct post-refactor behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The legacy OakBufferedIndexFile appends a random 16-byte key to every blob it writes (OAK-7066). This key is stored on the file node as PROP_UNIQUE_KEY and makes blob content unique across writes, preventing the blob store GC from reclaiming blobs that are still referenced by a live index file. OakDirectory now generates and writes PROP_UNIQUE_KEY when creating a new output file. OakBufferedIndexFile reads it back, subtracts the key length from the reported file length, and appends it to each blob via SequenceInputStream in flushBlob(), matching the legacy behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LuceneNgIndexNodeTest verifies: - acquire() returns a non-null AcquiredNode when index data exists - acquire() returns null after close() - release() is idempotent - close() blocks until all AcquiredNodes are released (write-lock guard) OakDirectoryTest adds: - PROP_UNIQUE_KEY is written to every new file node - the key is exactly UNIQUE_KEY_SIZE bytes (32 hex chars) - fileLength() excludes the key suffix; the raw blob includes it - every file gets a distinct key Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

StandardAnalyzer implements Closeable and holds a CloseableThreadLocal. Both buildQuery() and generateExcerpts() were creating a new instance per call and leaving it unclosed, leaking thread-local storage under query load. Wrapping each in try-with-resources ensures timely cleanup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…IndexEditor OakDirectory is a local variable created just before IndexWriter. If IndexWriter(directory, config) throws, the directory goes out of scope unclosed — PROP_DIR_LISTING is never written and any partially written segments are left open. Close the directory in the catch block so the storage is consistent regardless of whether the writer succeeds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ment FacetsConfig was instantiated inside indexNode() on every document, with setMultiValued determined by counting field occurrences in each individual document. This is incorrect: the config must be consistent across all documents in the same writer session, and multi-valued must be set if the property can ever carry more than one value. FacetsConfig is now built once in the constructor by scanning the index definition's property rules, marking every facet dimension as multi-valued (safe for single-valued properties too). Child editors receive the same shared instance via their constructor. The per-document dimension-counting Map and its HashMap/Map imports are removed. Test: LuceneNgFacetsConfigTest verifies that multi-valued facet properties across multiple documents produce correct facet counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without this, deleting an index file silently dropped its blobs from the repository without telling the blob store GC. The GC could only reclaim them during the next full mark-and-sweep scan. Adds BlobDeletionCallback interface (mirrors legacy ActiveDeletedBlobCollectorFactory.BlobDeletionCallback without the oak-lucene dependency). OakDirectory.deleteFile() now iterates the JCR_DATA blobs before removing the node and invokes the callback for each blob with a non-null content identity (inlined blobs are ignored since they are not tracked by the blob store GC). The default 3-arg constructor uses BlobDeletionCallback.NOOP for backward compatibility. A new 4-arg public constructor and a package-private 5-arg constructor (BlobFactory + callback) allow callers and tests to supply a real callback. TODO: wire the callback from LuceneNgIndexEditorProvider through LuceneNgIndexEditor to OakDirectory once the full ActiveDeletedBlobCollectorFactory integration is added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…istical facet classes Both LuceneNgSecure... and LuceneNgStatistical... contained identical per-document accessibility checks (load stored fields, read PATH, call filter.isAccessible). A security fix would have to be applied to both classes independently. Extracted the check as a package-private static isDocAccessible() on LuceneNgSecureSortedSetDocValuesFacetCounts and updated the statistical class to delegate to it. Also adds a null-guard on the PATH field (getField returns null for documents without that field), which was a latent NPE in both classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…uery builders #9 — DefaultSortedSetDocValuesReaderState is expensive: it reads and indexes all ordinals from a field's doc values. It was being reconstructed on every query per facet field. Added a ConcurrentHashMap cache in IndexSearcherHolder keyed by field name; the cache is scoped to the reader lifetime so it is discarded on index refresh. Exposed via AcquiredNode.getFacetReaderState() and used in LuceneNgIndex.query(IndexPlan). #10 — createLongQuery and createDoubleQuery were structurally identical (same equality/range/IN/NOT control flow) with only the type, Lucene Point class, and exclusive-bound adjustment differing. Introduced a package-private NumericPoint<T> interface with two static instances (LONG_POINT, DOUBLE_POINT) and a shared createNumericQuery() helper. createLongQuery/createDoubleQuery now delegate to it. Adding a future numeric type (e.g. DECIMAL) requires only a new NumericPoint instance, not a copy of the entire restriction-pattern decision tree. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This was referenced Mar 26, 2026

OAK-12089: add multi-target routing and per-target catch-up lifecycle #2818

Draft

feat: Add Lucene 9 integration - Phase 1 (Write Path) + Phase 2 Step 1 (Query Support) bhabegger/jackrabbit-oak#51

Closed

bhabegger force-pushed the oak-12089-lucene9-core branch from e4bc5ad to dbccf97 Compare March 26, 2026 15:11

bhabegger marked this pull request as draft March 26, 2026 15:13

bhabegger force-pushed the oak-12089-lucene9-core branch from 28c304c to 66fb544 Compare March 27, 2026 16:59

bhabegger changed the title ~~feat: add lucene9 index provider as safe opt-in target~~ feat: add Lucene 9 index provider (oak-search-luceneNg) Mar 27, 2026

bhabegger marked this pull request as ready for review March 30, 2026 05:10

thomasmueller reviewed Mar 30, 2026

View reviewed changes

Comment thread oak-search-luceneNg/README.md Outdated

Comment thread oak-search-luceneNg/README.md Outdated

thomasmueller reviewed Mar 30, 2026

View reviewed changes

bhabegger changed the title ~~feat: add Lucene 9 index provider (oak-search-luceneNg)~~ OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg) Apr 10, 2026

bhabegger added 5 commits June 5, 2026 09:59

refactor: rename module directory and artifactId to oak-search-lucene-ng

95eb6e9

Follows Maven/Oak convention of lowercase hyphenated artifact names. The Java package (org.apache.jackrabbit.oak.plugins.index.luceneNg) is unchanged as it is an internal implementation detail. Made-with: Cursor

bhabegger force-pushed the oak-12089-lucene9-core branch from f0a7199 to de83bda Compare June 5, 2026 07:59

bhabegger changed the base branch from OAK-12089 to trunk June 5, 2026 08:12

bhabegger changed the base branch from trunk to OAK-12089 June 5, 2026 08:13

bhabegger and others added 10 commits June 8, 2026 16:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817

OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817
bhabegger wants to merge 15 commits into
apache:OAK-12089from
bhabegger:oak-12089-lucene9-core

bhabegger commented Mar 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

thomasmueller left a comment

Uh oh!

Uh oh!

thomasmueller Mar 30, 2026

Uh oh!

bhabegger Mar 30, 2026

Uh oh!

bhabegger Mar 30, 2026

Uh oh!

bhabegger Apr 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,26 @@
		# oak-search-luceneNg

		Lucene 9 index provider for Oak (`type="lucene9"`).

	Lucene 9 index provider for Oak (`type="lucene9"`).
	Lucene NG index provider for Oak (`type="lucene-ng"`).

Uh oh!

Conversation

bhabegger commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Uh oh!

thomasmueller left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

thomasmueller Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

bhabegger Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

bhabegger Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

bhabegger Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bhabegger commented Mar 26, 2026 •

edited

Loading

bhabegger Apr 1, 2026 •

edited

Loading