feat: HierarchicalTextSplitter — header-aware chunking with verbless sentence filtering#548
Open
jexp wants to merge 4 commits into
jexp added a commit to jexp/neo4j-graphrag-python that referenced this pull request (Jun 24, 2026):
Remove hierarchical_splitter.py and its tests from this branch (now in PR neo4j#548). Update example script to reference the companion PR.
This was referenced Jun 24, 2026
…_line/spacy_verbless header strategies

Implements HierarchicalTextSplitter(TextSplitter) based on the chunking strategy from Min et al. (arXiv:2507.03226). Splits documents at section headers first, then recursively at a configurable character limit with overlap, and optionally drops verb-free sentences as a cheap pre-filter.

Four header detection strategies:
- markdown: lines starting with # (for MarkdownLoader / LiteParseLoader markdown output)
- capitalization: short ALL_CAPS or Title Case lines without terminal punctuation
- blank_line: short lines surrounded by blank lines
- spacy_verbless: SpaCy-detected verbless sentences

Includes 20 unit tests (spacy mocked) and 2 integration tests (skipped when en_core_web_sm is not installed).
…ration tests" This reverts commit dfa93b4.
7f5f96e to 450cab9
Summary
Adds HierarchicalTextSplitter, a new TextSplitter based on the chunking strategy from Min et al., "Towards Practical GraphRAG: Efficient KG Construction and Hybrid Retrieval at Scale" (arXiv:2507.03226).

Splits documents at section headers first, then recursively at a configurable character limit with overlap, and optionally drops verb-free sentences as a cheap pre-filter before chunking.

Related: the companion PR #547 (SpacyEntityRelationExtractor) uses this splitter in its example pipeline.

Header detection strategies
- "markdown" (default): lines starting with #; suits LiteParseLoader(output_format="markdown") output
- "capitalization": short ALL_CAPS or Title Case lines without terminal punctuation
- "blank_line": short lines surrounded by blank lines
- "spacy_verbless": spaCy-detected verbless sentences

Constructor
Usage
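The header-first idea can be illustrated without the library. The following is a minimal, self-contained sketch, not this PR's implementation or its API: every function name, the 60-character header cutoff, and the default sizes are assumptions made for illustration. It uses a plain sliding window where the PR describes recursive splitting, and it leaves out the verbless-sentence filter.

```python
from typing import Callable, List


def is_markdown_header(line: str) -> bool:
    """'markdown' strategy as described in the PR: a line starting with '#'."""
    return line.startswith("#")


def is_capitalized_header(line: str, max_len: int = 60) -> bool:
    """'capitalization' strategy: a short ALL_CAPS or Title Case line without
    terminal punctuation. max_len=60 is an arbitrary illustrative cutoff."""
    stripped = line.strip()
    if not stripped or len(stripped) > max_len:
        return False
    if stripped[-1] in ".!?:;,":
        return False
    return stripped.isupper() or stripped.istitle()


def split_sections(text: str, is_header: Callable[[str], bool]) -> List[str]:
    """Start a new section at every line the predicate flags as a header."""
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if is_header(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return [s for s in sections if s.strip()]


def window(text: str, size: int, overlap: int) -> List[str]:
    """Fixed-size character windows; consecutive windows share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks


def hierarchical_split(
    text: str,
    size: int = 200,  # hypothetical default, not taken from the PR
    overlap: int = 20,  # hypothetical default, not taken from the PR
    is_header: Callable[[str], bool] = is_markdown_header,
) -> List[str]:
    """Split at headers first, then window any section longer than `size`."""
    chunks: List[str] = []
    for section in split_sections(text, is_header):
        if len(section) > size:
            chunks.extend(window(section, size, overlap))
        else:
            chunks.append(section)
    return chunks
```

Passing is_header=is_capitalized_header switches the sketch to the capitalization heuristic. Keeping header detection as a plain predicate means the windowing step does not need to know which strategy produced the sections.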
Pairs naturally with LiteParseLoader(output_format="markdown") from PR #534.

Test plan
- uv run pytest tests/unit/experimental/components/text_splitters/test_hierarchical_splitter.py -v (20 unit tests, spacy mocked, no model download)
- python -m spacy download en_core_web_sm && uv run pytest tests/unit/experimental/components/text_splitters/test_hierarchical_splitter_integration.py -v (2 integration tests)
- uv run python -c "from neo4j_graphrag.experimental.components.text_splitters import HierarchicalTextSplitter; print('ok')"
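
The integration tests are described as skipped when en_core_web_sm is not installed. One way to get that behaviour is sketched below, assuming nothing about how this PR's tests actually do it: the class and test names are hypothetical, and unittest is used so the snippet needs no third-party test runner (pytest also collects unittest.TestCase classes).

```python
import importlib.util
import unittest

MODEL = "en_core_web_sm"
# spaCy models install as importable packages, so find_spec can detect them
# without importing spaCy itself.
MODEL_INSTALLED = importlib.util.find_spec(MODEL) is not None


@unittest.skipUnless(MODEL_INSTALLED, f"{MODEL} is not installed")
class VerblessIntegrationSketch(unittest.TestCase):
    """Hypothetical test case; runs only when the model is available."""

    def test_sentence_with_verb_is_detected(self) -> None:
        import spacy  # imported lazily so the module loads without spaCy

        nlp = spacy.load(MODEL)
        doc = nlp("The document is long.")
        # A sentence counts as non-verbless if any token is a verb or auxiliary.
        self.assertTrue(any(tok.pos_ in {"VERB", "AUX"} for tok in doc))
```

Checking for the model at import time keeps the skip reason visible in the test report, so a missing model shows up as a skip, not a failure.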