feat: HierarchicalTextSplitter — header-aware chunking with verbless sentence filtering#548

Open
jexp wants to merge 4 commits into neo4j:main from jexp:feat/hierarchical-splitter

jexp (Member) commented Jun 24, 2026

Summary

Adds HierarchicalTextSplitter, a new TextSplitter based on the chunking strategy from Min et al., "Towards Practical GraphRAG: Efficient KG Construction and Hybrid Retrieval at Scale" (arXiv:2507.03226).

The splitter splits documents at section headers first, then recursively splits at a configurable character limit with overlap. It can also drop verb-free sentences as a cheap pre-filter before chunking; this is on by default (drop_verbless_sentences=True).
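
To make the two-stage idea concrete, here is a minimal sketch that is not taken from the PR's implementation. It assumes markdown-style headers (lines starting with #) and swaps the recursive size split for a plain sliding window, which is simpler to read. The function names split_sections, split_with_overlap and hierarchical_split are hypothetical.

```python
def split_sections(text: str) -> list[str]:
    """Start a new section at every line that begins with '#'."""
    sections: list[list[str]] = [[]]
    for line in text.splitlines():
        # Open a new section on a header, unless the current one is still empty.
        if line.startswith("#") and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in sections)
    return [section for section in joined if section]


def split_with_overlap(section: str, max_chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut one section into windows of at most max_chunk_size characters.

    Assumption: a sliding window stands in for the recursive split in the PR.
    """
    if not 0 <= chunk_overlap < max_chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than max_chunk_size")
    step = max_chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(section[start:start + max_chunk_size])
        if start + max_chunk_size >= len(section):
            return chunks
        start += step


def hierarchical_split(text: str, max_chunk_size: int = 2048, chunk_overlap: int = 200) -> list[str]:
    """Split at headers first, then enforce the size limit inside each section."""
    return [
        chunk
        for section in split_sections(text)
        for chunk in split_with_overlap(section, max_chunk_size, chunk_overlap)
    ]
```

Splitting at headers first means a chunk never spans two sections, so the overlap only repeats text from within the same section.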

Related: the companion PR #547 (SpacyEntityRelationExtractor) uses this splitter in its example pipeline.

Header detection strategies

| Strategy | Description | Best for |
| --- | --- | --- |
| "markdown" (default) | Lines starting with # | MarkdownLoader, LiteParseLoader(output_format="markdown") |
| "capitalization" | Short ALL_CAPS or Title Case lines without terminal punctuation | Plain-text loaders |
| "blank_line" | Short lines surrounded by blank lines | Plain-text loaders |
| "spacy_verbless" | SpaCy-detected verbless sentences | Mixed-format documents |
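
As an illustration of the three strategies that do not need spaCy, the predicates below are one possible reading of the descriptions in the table. They are a sketch, not the PR's code. The 80-character threshold for "short", the set of terminal punctuation marks, the requirement that every word be capitalized, and the treatment of document boundaries as blank lines are all assumptions.

```python
MAX_HEADER_LEN = 80  # hypothetical cutoff for what counts as a "short" line
TERMINAL_PUNCTUATION = ".!?"  # assumed set of sentence-ending marks


def is_markdown_header(line: str) -> bool:
    """markdown: the line starts with '#'."""
    return line.startswith("#")


def is_capitalized_header(line: str) -> bool:
    """capitalization: short ALL_CAPS or Title Case line without terminal punctuation."""
    stripped = line.strip()
    if not stripped or len(stripped) > MAX_HEADER_LEN:
        return False
    if stripped[-1] in TERMINAL_PUNCTUATION:
        return False
    # Keep only words that contain at least one letter (skips "2.1", "-", ...).
    words = [w for w in stripped.split() if any(c.isalpha() for c in w)]
    if not words:
        return False
    # Every word's first letter is uppercase; this covers ALL_CAPS and strict Title Case.
    return all(next(c for c in w if c.isalpha()).isupper() for w in words)


def is_blank_line_header(lines: list[str], i: int) -> bool:
    """blank_line: short line with a blank line (or document edge) on both sides."""
    line = lines[i].strip()
    if not line or len(line) > MAX_HEADER_LEN:
        return False
    prev_blank = i == 0 or not lines[i - 1].strip()
    next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
    return prev_blank and next_blank
```

The strict Title Case check rejects headers such as "History of Graphs" because "of" is lowercase; a real implementation would probably allow a list of small words.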

Constructor

HierarchicalTextSplitter(
    max_chunk_size: int = 2048,
    chunk_overlap: int = 200,
    header_strategy: str = "markdown",  # "markdown" | "capitalization" | "blank_line" | "spacy_verbless"
    model: str = "en_core_web_sm",      # only loaded for spacy_verbless or drop_verbless_sentences=True
    drop_verbless_sentences: bool = True,
)
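
The drop_verbless_sentences flag can be pictured as a filter over part-of-speech tags. The sketch below works on tags that were computed beforehand so that it runs without a model; in a real pipeline they would come from token.pos_ on the sentences of a spaCy Doc. Counting AUX as a verb is an assumption, and has_verb and drop_verbless are hypothetical names, not the PR's API.

```python
# Universal POS tags, as spaCy exposes them via token.pos_.
# Assumption: auxiliaries ("is", "has") count as verbs for this filter.
VERB_TAGS = {"VERB", "AUX"}


def has_verb(pos_tags: list[str]) -> bool:
    """Return True if any token in the sentence is tagged as a verb."""
    return any(tag in VERB_TAGS for tag in pos_tags)


def drop_verbless(sentences: list[tuple[str, list[str]]]) -> list[str]:
    """Keep only the sentences whose tag list contains a verb.

    Each input item is (sentence_text, pos_tags_for_its_tokens).
    """
    return [text for text, tags in sentences if has_verb(tags)]
```

For example, a fragment like "Table of contents" (NOUN ADP NOUN) would be dropped, while "It is fast." (PRON AUX ADJ PUNCT) would be kept under the AUX assumption.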

Usage

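import asyncio

from neo4j_graphrag.experimental.components.text_splitters import HierarchicalTextSplitter

splitter = HierarchicalTextSplitter(header_strategy="markdown", max_chunk_size=2048, chunk_overlap=200)
text = "# Intro\n\nGraph databases store relationships."  # any document text
chunks = asyncio.run(splitter.run(text))  # run() is a coroutine; inside async code use `await splitter.run(text)`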

Pairs naturally with LiteParseLoader(output_format="markdown") from PR #534.

Test plan

  • uv run pytest tests/unit/experimental/components/text_splitters/test_hierarchical_splitter.py -v (20 unit tests, spacy mocked — no model download)
  • python -m spacy download en_core_web_sm && uv run pytest tests/unit/experimental/components/text_splitters/test_hierarchical_splitter_integration.py -v (2 integration tests)
  • uv run python -c "from neo4j_graphrag.experimental.components.text_splitters import HierarchicalTextSplitter; print('ok')"
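
The integration tests depend on en_core_web_sm being installed. One way to guard such tests is sketched below, using only the standard library; the PR itself uses pytest and its actual skip mechanism is not shown here. spaCy models are installed as regular Python packages, so importlib can detect them. The helper spacy_model_installed and the test class are hypothetical.

```python
import importlib.util
import unittest


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if the spaCy model package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


@unittest.skipUnless(spacy_model_installed(), "en_core_web_sm is not installed")
class HierarchicalSplitterIntegrationTest(unittest.TestCase):
    """Hypothetical placeholder; real tests would load the model and run the splitter."""

    def test_model_is_available(self) -> None:
        self.assertTrue(spacy_model_installed())
```

pytest collects unittest.TestCase classes and honors the skip decorator, so a guard like this works under the pytest commands listed above.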

jexp requested a review from a team as a code owner June 24, 2026 08:19
jexp added a commit to jexp/neo4j-graphrag-python that referenced this pull request Jun 24, 2026
Remove hierarchical_splitter.py and its tests from this branch (now in PR neo4j#548).
Update example script to reference the companion PR.
jexp added 3 commits June 24, 2026 12:19
…_line/spacy_verbless header strategies

Implements HierarchicalTextSplitter(TextSplitter) based on the chunking strategy
from Min et al. (arXiv:2507.03226). Splits documents at section headers first, then
recursively at a configurable character limit with overlap, and optionally drops
verb-free sentences as a cheap pre-filter.

Four header detection strategies:
- markdown: lines starting with # (for MarkdownLoader / LiteParseLoader markdown output)
- capitalization: short ALL_CAPS or Title Case lines without terminal punctuation
- blank_line: short lines surrounded by blank lines
- spacy_verbless: SpaCy-detected verbless sentences

Includes 20 unit tests (spacy mocked) and 2 integration tests (skipped when
en_core_web_sm is not installed).
jexp force-pushed the feat/hierarchical-splitter branch from 7f5f96e to 450cab9 June 24, 2026 10:19