feat(DocLang): add hyperlink support #578
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
✅ DCO Check Passed: thanks @vagenas, all your commits are properly signed off. 🎉

Merge Protections: your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates: this rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
        hyperlink = AnyUrl(uri_text)
    except Exception:
        # Invalid URL, skip
        return
This will reject any relative hyperlink. Is this behavior intended?
In the examples like test/data/doc/content_all.gt.dclg.xml I can see that <uri> also supports relative hyperlinks, so we should be able to deserialize it.
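One possible direction, as a minimal sketch only: the helper below is hypothetical (`parse_hyperlink` is not part of the PR or of docling-core), and it uses the standard library's `urlsplit` in place of `AnyUrl` so that it runs on its own. It keeps a scheme-less reference as a path instead of returning early.

```python
from pathlib import PurePosixPath
from typing import Optional, Union
from urllib.parse import urlsplit


def parse_hyperlink(uri_text: str) -> Optional[Union[str, PurePosixPath]]:
    """Hypothetical helper: keep relative references instead of dropping them."""
    uri_text = uri_text.strip()
    if not uri_text:
        return None  # empty <uri>, nothing to link to
    if urlsplit(uri_text).scheme:
        # Absolute URI such as https://example.com: keep it as a URL string.
        return uri_text
    # No scheme: treat the value as a relative reference and keep it as a path.
    return PurePosixPath(uri_text)
```

Note that `PurePosixPath("/blog/")` normalizes to `/blog`, so a roundtrip test that compares serialized strings would need to account for the dropped trailing slash.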
For instance, deserializing the following text will discard the hyperlink:
    <doclang version="1.0.0">
      <text>
        <hyperlink>
          <uri>/blog/</uri>
          hyperlink
        </hyperlink>
      </text>
    </doclang>

Also, the roundtrip test in test_roundtrip_hyperlink will fail if we add a relative hyperlink like this:

    doc.add_text(
        label=DocItemLabel.TEXT,
        text="simple link",
        hyperlink=Path("/blog/")
    )


    content_elements = [
        node
        for node in el.childNodes
        if isinstance(node, Element)
        and node.tagName not in {DoclangToken.URI.value, DoclangToken.LOCATION.value, DoclangToken.LAYER.value}
    ]
This collects only element children, so the surrounding text nodes are dropped.
For instance, if you deserialize:

    <hyperlink>
      <uri>https://example.com</uri>
      pre
      <bold>mid</bold>
      post
    </hyperlink>

you will get:

    TextItem(text="mid", hyperlink="https://example.com/")

This is the example I used:

    from docling_core.experimental.doclang import DoclangDeserializer

    xml = """<doclang version="1.0.0">
    <text>
    <hyperlink>
    <uri>https://example.com</uri>
    pre
    <bold>mid</bold>
    post
    </hyperlink>
    </text>
    </doclang>"""

    doc = DoclangDeserializer().deserialize(text=xml)
    # Use a separate loop variable so that `doc` is not shadowed.
    for item in doc.texts:
        print(item.text)

You will get just "mid", without "pre" or "post".
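
To illustrate the difference, here is a standalone sketch using only `xml.dom.minidom`. It is not the deserializer's code: `collect_text` and `hyperlink_text` are hypothetical helpers, the `location` and `layer` tag names are assumed values for the corresponding `DoclangToken` members, and joining the pieces with a single space is an assumption as well. It walks all child nodes, text nodes included, and skips only the metadata elements.

```python
from xml.dom.minidom import Element, Text, parseString

# "uri" appears in the examples above; "location" and "layer" are assumed names.
SKIPPED_TAGS = {"uri", "location", "layer"}


def collect_text(node) -> str:
    """Recursively join the character data found under a node."""
    if isinstance(node, Text):
        return node.data
    return "".join(collect_text(child) for child in node.childNodes)


def hyperlink_text(el: Element) -> str:
    """Hypothetical helper: gather text from text nodes and element children."""
    parts = []
    for node in el.childNodes:
        if isinstance(node, Element) and node.tagName in SKIPPED_TAGS:
            continue  # metadata child, not part of the visible link text
        text = collect_text(node).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


xml = (
    "<hyperlink><uri>https://example.com</uri>\n"
    "pre\n<bold>mid</bold>\npost\n</hyperlink>"
)
el = parseString(xml).documentElement
print(hyperlink_text(el))  # pre mid post
```

This sketch flattens everything to plain text, so the bold formatting on "mid" is lost; a real fix would presumably need to keep per-fragment formatting while still including the text nodes.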