feat: extend meta classes #617

Open
ceberam wants to merge 5 commits into docling-project:main from ana-daniele:feat/extend-meta-classes

Conversation

@ceberam
Member

@ceberam ceberam commented May 19, 2026

No description provided.

@github-actions
Contributor

github-actions Bot commented May 19, 2026

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented May 19, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
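
The title rule above can be checked locally before pushing. The following is a minimal sketch, assuming the pattern is evaluated with Python `re` semantics; the helper name `is_conventional` is hypothetical and not part of Mergify or of this repository.

```python
import re

# Pattern copied verbatim from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


print(is_conventional("feat: extend meta classes"))         # type only
print(is_conventional("feat(meta)!: extend meta classes"))  # type, scope, breaking marker
print(is_conventional("extend meta classes"))               # no type prefix
```

The optional `(?:\(.+\))?` group accepts a scope such as `(meta)`, and `(!)?` accepts the breaking-change marker.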

@ana-daniele ana-daniele force-pushed the feat/extend-meta-classes branch from 58dfa93 to eeb96d5 Compare May 22, 2026 08:54
@ana-daniele

ana-daniele commented May 22, 2026

Changes

  1. Backed up the rich version → feat/extend-meta-classes-backup (branch only, no PR)
  2. Hard-reset feat/extend-meta-classes to main
  3. Re-implemented the minimal scope:
    • KeywordsMetaField and TopicsMetaField, each with values: list[str], deduplicated (order-preserving)
    • Wired into BaseMeta (node-level) and new DocumentMeta (document-level, keywords + topics only)
    • DoclingDocument.meta field added
    • HTML / Markdown / Doclang serializers updated to render new fields as comma-separated values
    • Dropped: Statement*, PropertyValue*, all confidence levels on new fields, title / authors / publication_date
    • No schema version bump (additive, backward-compatible)
  4. Added 8 tests covering round-trip, dedup, has_content, HTML escaping, and Markdown rendering
  5. Squashed to 2 conventional commits (feat: + test:)
  6. Force-pushed

Diff: +292 / -0 across 7 files (was +726 / -1019 in the rich version).

Tests: 22/22 pass in test_metadata.py; all 389 in the wider suite are green.
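
For readers unfamiliar with the "deduplicated (order-preserving)" and "comma-separated values" behaviour described in the change list, here is a minimal standard-library sketch. The function names are hypothetical and the PR's actual implementation is not shown in this thread; this only illustrates the intended semantics.

```python
def dedupe_keep_order(values: list[str]) -> list[str]:
    """Drop repeated values while keeping the first occurrence of each."""
    # dict preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(values))


def render_comma_separated(values: list[str]) -> str:
    """Join the deduplicated values the way the serializers are described to."""
    return ", ".join(dedupe_keep_order(values))


print(dedupe_keep_order(["nlp", "vision", "nlp"]))
print(render_comma_separated(["ibm", "zurich", "ibm"]))
```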

@ceberam ceberam marked this pull request as ready for review June 1, 2026 07:31
@PeterStaar-IBM PeterStaar-IBM self-requested a review June 1, 2026 08:23
PeterStaar-IBM
PeterStaar-IBM previously approved these changes Jun 1, 2026
ana-daniele and others added 4 commits June 1, 2026 10:54
Introduce two simple, additive meta fields:
- KeywordsMetaField — unique list of keyword strings
- TopicsMetaField — unique list of topic / subject strings

Both attach to BaseMeta (node-level) and to the new DocumentMeta
(document-level), which is wired into DoclingDocument.meta. HTML,
Markdown, and Doclang serializers render the new fields as comma-
separated values. Schema bump is intentionally avoided since the
fields are optional and backward-compatible.

Signed-off-by: Ana Daniele <ana.daniele@ibm.com>

Add round-trip, dedup, has_content, HTML escape, and Markdown
rendering tests for KeywordsMetaField, TopicsMetaField, and
DocumentMeta. Document-level meta absent-by-default is also covered.

Signed-off-by: Ana Daniele <ana.daniele@ibm.com>

Convert tuple-form isinstance checks to PEP 604 unions and
reorder a misplaced import flagged by ruff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
Post-rebase onto main, the HTML metadata serializer now emits
`data-meta-name="<field>"` instead of valueless `data-meta-<field>`
and applies a single trailing html.escape pass. Drop the per-type
escape from the Keywords/Topics branch (was producing double
escaping after rebase) and update the two test assertions to the
new attribute shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
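
The double-escaping problem described in the commit above is easy to reproduce with the standard library. In this sketch the `<span>` element and the `render_keywords` helper are assumptions made for illustration; only the `data-meta-name="<field>"` attribute shape and the single `html.escape` pass come from the commit message.

```python
import html

raw = "R&D, Q<A"
once = html.escape(raw)    # "R&amp;D, Q&lt;A"
twice = html.escape(once)  # "R&amp;amp;D, Q&amp;lt;A" (the entities get escaped again)
print(once)
print(twice)


def render_keywords(values: list[str]) -> str:
    """Hypothetical renderer: join first, then escape exactly once."""
    text = html.escape(", ".join(values))
    return f'<span data-meta-name="keywords">{text}</span>'


print(render_keywords(["R&D", "ai"]))
```

Escaping per value and then again on the joined string would turn `&amp;` into `&amp;amp;`, which is why a single trailing pass is sufficient.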
Per discussion, DocLang is a standard whose extension is out of
scope for this PR. Drop the KeywordsMetaField/TopicsMetaField
branches from DoclangMetaSerializer and their imports; HTML and
Markdown serialization remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
@codecov

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment on lines +1451 to +1452
KEYWORDS = "keywords" # unique keywords / keyphrases for the node content
TOPICS = "topics" # unique topics / subjects for the node content
Member Author

@ceberam ceberam Jun 1, 2026

Could you please use docstrings in the MetaFieldName class instead of inline comments? It would be great if you could fix the other attributes as well. Check TableFormerMode as an example of how we document enumerations.
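
One common way to document enumeration members with docstrings is a string literal placed directly after each member, which documentation tools and IDEs pick up. The sketch below assumes that convention; whether TableFormerMode uses exactly this layout is not shown in this thread, and the class name and `str, Enum` base are illustrative.

```python
from enum import Enum


class MetaFieldNameSketch(str, Enum):
    """Names of metadata fields (illustrative subset, not the real class)."""

    KEYWORDS = "keywords"
    """Unique keywords / keyphrases for the node content."""

    TOPICS = "topics"
    """Unique topics / subjects for the node content."""


# The string literals are plain expression statements, so they do not
# create extra enum members.
print([member.value for member in MetaFieldNameSketch])
```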

class KeywordsMetaField(_ExtraAllowingModel):
    """Container for a list of unique keywords / keyphrases."""

    values: Annotated[list[str], Field(min_length=1)]
Member Author

You could reuse a type that we created for unique lists: UniqueLists instead of list[str].
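
To illustrate the idea of a reusable unique-list type, here is a sketch built on pydantic v2's `AfterValidator`. The alias `UniqueStrListSketch` and the model name are hypothetical stand-ins, not the project's own type, and whether the project's type rejects duplicates or silently drops them is not stated in this thread. This sketch drops them, matching the order-preserving dedup described in the PR.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, Field


def _dedupe_keep_order(values: list[str]) -> list[str]:
    # Assumption: duplicates are dropped rather than rejected.
    return list(dict.fromkeys(values))


# Hypothetical reusable alias: any field annotated with it gets deduplicated.
UniqueStrListSketch = Annotated[list[str], AfterValidator(_dedupe_keep_order)]


class KeywordsFieldSketch(BaseModel):
    """Stand-in for KeywordsMetaField, for illustration only."""

    values: Annotated[UniqueStrListSketch, Field(min_length=1)]


field = KeywordsFieldSketch(values=["nlp", "vision", "nlp"])
print(field.values)
```

Keeping the uniqueness rule in one alias means both meta field classes share it instead of repeating a validator.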

class TopicsMetaField(_ExtraAllowingModel):
    """Container for a list of unique topics / subjects."""

    values: Annotated[list[str], Field(min_length=1)]
Member Author

Same as above

Comment on lines +1524 to +1525
keywords: Optional[KeywordsMetaField] = None
topics: Optional[TopicsMetaField] = None
Member Author

@ceberam ceberam Jun 1, 2026

Can we take advantage of this PR and add a good description of each metadata type? For instance, it would be helpful to make a clear distinction between entities, keywords, and topics. You can use the Annotated pattern and pydantic's Field function with the description and examples arguments.
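
The Annotated pattern mentioned above could look roughly like the following sketch, assuming pydantic v2. The class names are stand-ins and the description wording is illustrative, loosely based on the inline comments in this PR; the agreed definitions of entities, keywords, and topics are not given in this thread.

```python
from typing import Annotated, Optional

from pydantic import BaseModel, Field


class KeywordsFieldSketch(BaseModel):
    """Stand-in for KeywordsMetaField."""

    values: list[str]


class BaseMetaSketch(BaseModel):
    """Stand-in for BaseMeta, showing only one documented field."""

    keywords: Annotated[
        Optional[KeywordsFieldSketch],
        Field(
            # Illustrative wording, not the project's agreed definition.
            description="Unique keywords / keyphrases for the node content.",
            examples=[{"values": ["llm", "rag"]}],
        ),
    ] = None


info = BaseMetaSketch.model_fields["keywords"]
print(info.description)
print(info.examples)
```

The description and examples also flow into the generated JSON schema, so the distinction between field types becomes visible to schema consumers.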

Comment on lines +2787 to +2793
class DocumentMeta(_ExtraAllowingModel):
    """Document-level metadata for DoclingDocument."""

    keywords: Optional[KeywordsMetaField] = None
    topics: Optional[TopicsMetaField] = None


Member Author

Please remove this class. We discussed document-level metadata in the past and decided to use the body field of DoclingDocument to attach it.

# This is optional, e.g. a DoclingDocument could also be entirely
# generated from synthetic data.
)
meta: Optional[DocumentMeta] = None
Member Author

Please remove this field, since it is not necessary. Users should use the body field for document-level metadata.
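
The suggestion amounts to treating the document's root node as the carrier of document-level metadata, so no separate document-level class or field is needed. The sketch below uses plain dataclasses as stand-ins; every class name here is hypothetical and only the shape (a document with a `body` node that has a `meta` attribute) follows the review comments.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetaSketch:
    """Stand-in for node-level metadata such as BaseMeta."""

    keywords: Optional[list[str]] = None
    topics: Optional[list[str]] = None


@dataclass
class NodeSketch:
    """Stand-in for a document node; any node can carry metadata."""

    meta: Optional[MetaSketch] = None


@dataclass
class DocumentSketch:
    """Stand-in for the document; body is simply the root node."""

    name: str
    body: NodeSketch = field(default_factory=NodeSketch)


doc = DocumentSketch(name="doc-meta")
# Document-level metadata is attached to the root node.
doc.body.meta = MetaSketch(keywords=["llm", "rag"], topics=["ai"])
print(doc.body.meta)
```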

Comment thread test/test_metadata.py
assert field.values == ["nlp", "vision"]


def test_keywords_and_topics_in_base_meta_roundtrip() -> None:
Member Author

I would just extend the existing test_semantic_base_meta_fields_roundtrip_and_html_rendering, since it was built for testing any type of meta fields.

Comment thread test/test_metadata.py
Comment on lines +367 to +377
doc = DoclingDocument(
    name="doc-meta",
    meta=DocumentMeta(
        keywords=KeywordsMetaField(values=["llm", "rag"]),
        topics=TopicsMetaField(values=["ai"]),
    ),
)
roundtrip = DoclingDocument.model_validate(doc.model_dump(mode="json"))
assert roundtrip.meta is not None
assert roundtrip.meta.keywords is not None and roundtrip.meta.keywords.values == ["llm", "rag"]
assert roundtrip.meta.topics is not None and roundtrip.meta.topics.values == ["ai"]
Member Author

I am not sure these lines test anything beyond what the previous test already covers.

Comment thread test/test_metadata.py
Comment on lines +416 to +417
assert "[Keywords] ibm, zurich" in md
assert "[Topics] business" in md
Member Author

Just FYI for a future PR: brackets should be escaped in Markdown, since they are used to create links. This is how we have serialized other metadata so far, so I would leave the implementation as it is in this PR. We can later think of a nice way to add metadata in Markdown and ensure it follows the right syntax.
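
For that follow-up, backslash-escaping the square brackets is one possible approach. This is a minimal sketch with a hypothetical helper name; it is not what the serializer in this PR does.

```python
import re


def escape_brackets(text: str) -> str:
    """Prefix every '[' and ']' with a backslash so Markdown treats them literally."""
    return re.sub(r"([\[\]])", r"\\\1", text)


print(escape_brackets("[Keywords] ibm, zurich"))
```

With the brackets escaped, a Markdown renderer no longer has the chance to interpret `[Keywords]` as the start of a link.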


3 participants