feat: extend meta classes#617
Conversation
|
✅ DCO Check Passed Thanks @ceberam, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesWaiting for
This rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
58dfa93 to
eeb96d5
Compare
Changes
Diff: +292 / -0 across 7 files (was +726 / -1019 in the rich version). Tests: 22/22 in |
Introduce two simple, additive meta fields: - KeywordsMetaField — unique list of keyword strings - TopicsMetaField — unique list of topic / subject strings Both attach to BaseMeta (node-level) and to the new DocumentMeta (document-level), which is wired into DoclingDocument.meta. HTML, Markdown, and Doclang serializers render the new fields as comma- separated values. Schema bump is intentionally avoided since the fields are optional and backward-compatible. Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
Add round-trip, dedup, has_content, HTML escape, and Markdown rendering tests for KeywordsMetaField, TopicsMetaField, and DocumentMeta. Document-level meta absent-by-default is also covered. Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
Convert tuple-form isinstance checks to PEP 604 unions and reorder a misplaced import flagged by ruff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
Post-rebase onto main, the HTML metadata serializer now emits `data-meta-name="<field>"` instead of valueless `data-meta-<field>` and applies a single trailing html.escape pass. Drop the per-type escape from the Keywords/Topics branch (was producing double escaping after rebase) and update the two test assertions to the new attribute shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
2b193e6 to
22839e2
Compare
Per discussion, DocLang is a standard whose extension is out of scope for this PR. Drop the KeywordsMetaField/TopicsMetaField branches from DoclangMetaSerializer and their imports; HTML and Markdown serialization remain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ana Daniele <ana.daniele@ibm.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| KEYWORDS = "keywords" # unique keywords / keyphrases for the node content | ||
| TOPICS = "topics" # unique topics / subjects for the node content |
There was a problem hiding this comment.
Could you please leverage docstrings in MetaFieldName class instead of inline comments? It would be create if you could fix the other attributes. Check TableFormerMode as an example of how we document enumerations.
| class KeywordsMetaField(_ExtraAllowingModel): | ||
| """Container for a list of unique keywords / keyphrases.""" | ||
|
|
||
| values: Annotated[list[str], Field(min_length=1)] |
There was a problem hiding this comment.
You could reuse a type that we created for unique lists: UniqueLists instead of list[str].
| class TopicsMetaField(_ExtraAllowingModel): | ||
| """Container for a list of unique topics / subjects.""" | ||
|
|
||
| values: Annotated[list[str], Field(min_length=1)] |
| keywords: Optional[KeywordsMetaField] = None | ||
| topics: Optional[TopicsMetaField] = None |
There was a problem hiding this comment.
Can we take advantage of this PR and add a good description of each metadata type? For instance, it would be helpful to make a clear distinction between entities, keywords, and topics. You can use the Annotated pattern and pydantic's Field function with the description and examples arguments.
| class DocumentMeta(_ExtraAllowingModel): | ||
| """Document-level metadata for DoclingDocument.""" | ||
|
|
||
| keywords: Optional[KeywordsMetaField] = None | ||
| topics: Optional[TopicsMetaField] = None | ||
|
|
||
|
|
There was a problem hiding this comment.
Please, remove this class. In the past we discussed about metadata at document level and we decided to use the body field of DoclingDocument to attach the document-level metadata.
| # This is optional, e.g. a DoclingDocument could also be entirely | ||
| # generated from synthetic data. | ||
| ) | ||
| meta: Optional[DocumentMeta] = None |
There was a problem hiding this comment.
Please, remove since not necessary. Users should use the body field for document-level metadata.
| assert field.values == ["nlp", "vision"] | ||
|
|
||
|
|
||
| def test_keywords_and_topics_in_base_meta_roundtrip() -> None: |
There was a problem hiding this comment.
I would just extend the existing test_semantic_base_meta_fields_roundtrip_and_html_rendering, since it was built for testing any type of meta fields.
| doc = DoclingDocument( | ||
| name="doc-meta", | ||
| meta=DocumentMeta( | ||
| keywords=KeywordsMetaField(values=["llm", "rag"]), | ||
| topics=TopicsMetaField(values=["ai"]), | ||
| ), | ||
| ) | ||
| roundtrip = DoclingDocument.model_validate(doc.model_dump(mode="json")) | ||
| assert roundtrip.meta is not None | ||
| assert roundtrip.meta.keywords is not None and roundtrip.meta.keywords.values == ["llm", "rag"] | ||
| assert roundtrip.meta.topics is not None and roundtrip.meta.topics.values == ["ai"] |
There was a problem hiding this comment.
Not sure if these lines test anything different with respect to the previous test.
| assert "[Keywords] ibm, zurich" in md | ||
| assert "[Topics] business" in md |
There was a problem hiding this comment.
Just FYI for a next PR: brackets should be escaped in markdown, since they are used to create links. This is the way we have serialized other metadata so far and therefore I would leave the implementation as it is in this PR. We can later think a nice way to add metadata in markdown and ensure it follows the right syntax.
No description provided.