OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817
OAK-12089: Add Lucene 9 index provider (oak-search-luceneNg)#2817bhabegger wants to merge 15 commits into
Conversation
e4bc5ad to
dbccf97
Compare
28c304c to
66fb544
Compare
thomasmueller
left a comment
There was a problem hiding this comment.
Just some bikeshedding: I wonder if we should use the term "Lucene 9" at all.
| @@ -0,0 +1,26 @@ | |||
| # oak-search-luceneNg | |||
|
|
|||
| Lucene 9 index provider for Oak (`type="lucene9"`). | |||
There was a problem hiding this comment.
What about:
| Lucene 9 index provider for Oak (`type="lucene9"`). | |
| Lucene NG index provider for Oak (`type="lucene-ng"`). |
Hoping we can upgrade to Lucene 10 without having to change the type. I do assume the index storage version won't change, or that there is an option to add compatibility.
If not, and we do need to reindex for lucene 10, we can still add "lucene-ng-10" if that should be needed.
There was a problem hiding this comment.
So the idea is that the code should be less sensitive to Lucene upgrades. However, I wasn't really sure about having a fixed name and the in some future having to have a lucene-ng-ng ;) Here we could interpret it as lucene since 9 and still use that for lucene 10, 11 up to and imaginary lucene 12 that breaks compatibility again (which might never happen as likely the APIs are more stable now).
Personally, I fine with lucene-ng as I myself had doubts.
There was a problem hiding this comment.
Also the NG in the code for now is fine since we do have 2 versions, but at some point my expectation is that the legacy code be removed and the NG no longer be new generation and be refactored with proper non NG names. So in the code I'm fine, in the type however, this sticks more. That's mostly what bother me with ng in the type.
There was a problem hiding this comment.
@thomasmueller What about lucene2026 ? Less tied to 9 doesn't have explicit ng ?
Introduces oak-search-luceneNg, a new Oak module providing a Lucene 9 based index engine under type=lucene9, with full parity to the legacy lucene implementation for property queries, fulltext, sorting, excerpts, and facets (insecure, statistical, and secure ACL modes). Key changes: - New oak-search-luceneNg module: index editor, query index, tracker, index node, storage, and OSGi wiring - Facet parity: LuceneNgSecure/StatisticalSortedSetDocValuesFacetCounts ported to Lucene 9 APIs with null-safe MatchingDocs.bits handling - LuceneNgFacetCommonTest extends FacetCommonTest for JCR-level coverage - AbstractIndexComparisonTest inlined into oak-search test-jar; oak-search-test module removed - getRootBuilder removed from ContextAwareCallback and IndexUpdate - leaf OSGi property removed from LuceneIndexProviderService - README documents feature parity vs legacy Lucene and Elastic Made-with: Cursor
When indexNodeName=true, the index editor writes the namespace-stripped local name of each node into FieldNames.NODE_NAME. The query engine maps LOCALNAME() equality and LIKE restrictions to TermQuery/WildcardQuery on that field. Function restrictions prefixed with "function*@" (e.g. "function*@:localname") are generated alongside the dedicated ":localname" restriction by Oak's SQL2 parser; they are now silently dropped from plan evaluation, cost calculation, and the Lucene query to prevent false negatives. Adds NodeNameCommonTest (shared) and LuceneNgNodeNameCommonTest. Made-with: Cursor
…ADME Address PR review comments from thomasmueller: - Rename "Multi-index queries" to "Composite node store queries" and add a footnote explaining the composite node store scenario. - Add a footnote for "Index augmentors" describing the IndexFieldProvider / FulltextQueryTermsProvider extension points. Made-with: Cursor
Follows Maven/Oak convention of lowercase hyphenated artifact names. The Java package (org.apache.jackrabbit.oak.plugins.index.luceneNg) is unchanged as it is an internal implementation detail. Made-with: Cursor
…etCommonTest The three tests (basic faceting, multiple dimensions, facet with filter) are all already exercised by FacetCommonTest via the JCR API. LuceneNgFacetCommonTest runs that suite against Lucene 9 and is the canonical coverage. The ignored class added no value. Made-with: Cursor
f0a7199 to
de83bda
Compare
LuceneNgIndexNode now mirrors the legacy LuceneIndexNodeManager pattern: - acquire() takes a read lock, incRefs the DirectoryReader, and returns an AcquiredNode that the caller must release() when done - close() (called by the tracker on eviction) takes the write lock, which blocks until all in-flight AcquiredNodes have been released, then closes the searcher holder - LuceneNgCursor accepts an AcquiredNode and releases it on close() - getCost() and getPlans() wrap acquire in try-finally - IndexSearcherHolder now closes the OakDirectory it owns Without this, a concurrent tracker refresh could close the DirectoryReader while a cursor was still iterating hits, causing AlreadyClosedException. Also fixes the OakDirectory leak in IndexSearcherHolder. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xNode assertion in test IndexSearcherHolder: private OakDirectory directory was inserted between the constructor Javadoc and its declaration — moved it to the field block. LuceneNgIndexTrackerTest.testGetIndexNode: test was changed to only verify path tracking via getIndexPaths(), dropping the acquireIndexNode() call entirely. Restored it: the test now asserts the path is tracked AND that acquireIndexNode() returns null for an index with no data written yet, which is the correct post-refactor behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The legacy OakBufferedIndexFile appends a random 16-byte key to every blob it writes (OAK-7066). This key is stored on the file node as PROP_UNIQUE_KEY and makes blob content unique across writes, preventing the blob store GC from reclaiming blobs that are still referenced by a live index file. OakDirectory now generates and writes PROP_UNIQUE_KEY when creating a new output file. OakBufferedIndexFile reads it back, subtracts the key length from the reported file length, and appends it to each blob via SequenceInputStream in flushBlob(), matching the legacy behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LuceneNgIndexNodeTest verifies: - acquire() returns a non-null AcquiredNode when index data exists - acquire() returns null after close() - release() is idempotent - close() blocks until all AcquiredNodes are released (write-lock guard) OakDirectoryTest adds: - PROP_UNIQUE_KEY is written to every new file node - the key is exactly UNIQUE_KEY_SIZE bytes (32 hex chars) - fileLength() excludes the key suffix; the raw blob includes it - every file gets a distinct key Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
StandardAnalyzer implements Closeable and holds a CloseableThreadLocal. Both buildQuery() and generateExcerpts() were creating a new instance per call and leaving it unclosed, leaking thread-local storage under query load. Wrapping each in try-with-resources ensures timely cleanup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…IndexEditor OakDirectory is a local variable created just before IndexWriter. If IndexWriter(directory, config) throws, the directory goes out of scope unclosed — PROP_DIR_LISTING is never written and any partially written segments are left open. Close the directory in the catch block so the storage is consistent regardless of whether the writer succeeds. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ment FacetsConfig was instantiated inside indexNode() on every document, with setMultiValued determined by counting field occurrences in each individual document. This is incorrect: the config must be consistent across all documents in the same writer session, and multi-valued must be set if the property can ever carry more than one value. FacetsConfig is now built once in the constructor by scanning the index definition's property rules, marking every facet dimension as multi-valued (safe for single-valued properties too). Child editors receive the same shared instance via their constructor. The per-document dimension-counting Map and its HashMap/Map imports are removed. Test: LuceneNgFacetsConfigTest verifies that multi-valued facet properties across multiple documents produce correct facet counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, deleting an index file silently dropped its blobs from the repository without telling the blob store GC. The GC could only reclaim them during the next full mark-and-sweep scan. Adds BlobDeletionCallback interface (mirrors legacy ActiveDeletedBlobCollectorFactory.BlobDeletionCallback without the oak-lucene dependency). OakDirectory.deleteFile() now iterates the JCR_DATA blobs before removing the node and invokes the callback for each blob with a non-null content identity (inlined blobs are ignored since they are not tracked by the blob store GC). The default 3-arg constructor uses BlobDeletionCallback.NOOP for backward compatibility. A new 4-arg public constructor and a package-private 5-arg constructor (BlobFactory + callback) allow callers and tests to supply a real callback. TODO: wire the callback from LuceneNgIndexEditorProvider through LuceneNgIndexEditor to OakDirectory once the full ActiveDeletedBlobCollectorFactory integration is added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…istical facet classes Both LuceneNgSecure... and LuceneNgStatistical... contained identical per-document accessibility checks (load stored fields, read PATH, call filter.isAccessible). A security fix would have to be applied to both classes independently. Extracted the check as a package-private static isDocAccessible() on LuceneNgSecureSortedSetDocValuesFacetCounts and updated the statistical class to delegate to it. Also adds a null-guard on the PATH field (getField returns null for documents without that field), which was a latent NPE in both classes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uery builders #9 — DefaultSortedSetDocValuesReaderState is expensive: it reads and indexes all ordinals from a field's doc values. It was being reconstructed on every query per facet field. Added a ConcurrentHashMap cache in IndexSearcherHolder keyed by field name; the cache is scoped to the reader lifetime so it is discarded on index refresh. Exposed via AcquiredNode.getFacetReaderState() and used in LuceneNgIndex.query(IndexPlan). #10 — createLongQuery and createDoubleQuery were structurally identical (same equality/range/IN/NOT control flow) with only the type, Lucene Point class, and exclusive-bound adjustment differing. Introduced a package-private NumericPoint<T> interface with two static instances (LONG_POINT, DOUBLE_POINT) and a shared createNumericQuery() helper. createLongQuery/createDoubleQuery now delegate to it. Adding a future numeric type (e.g. DECIMAL) requires only a new NumericPoint instance, not a copy of the entire restriction-pattern decision tree. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Introduces
oak-search-luceneNg, a new Oak module that provides a Lucene 9 based index engine. Indexes opt in explicitly viatype=lucene9; all existing indexes are unaffected.oak-search-luceneNgmodule: index editor, query index, tracker, index node, storage (Oak JCR node-based directory), and OSGi wiringluceneengine: property restrictions, path/type filters, fulltext, sorting, and excerptsMatchingDocs.bitshandlingLuceneNgFacetCommonTestextends the sharedFacetCommonTestsuite for end-to-end JCR-level facet coverageAbstractIndexComparisonTestinlined intooak-searchtest-jar;oak-search-testmodule removed