feat(DocLang): add hyperlink support #578

Open
vagenas wants to merge 2 commits into main from add-doclang-hyperlinks

@vagenas
Member

@vagenas vagenas commented Apr 7, 2026

No description provided.

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

DCO Check Passed

Thanks @vagenas, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Apr 7, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@vagenas vagenas mentioned this pull request Apr 7, 2026
52 tasks
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
@codecov

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 84.52381% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/experimental/doclang.py 84.52% 13 Missing ⚠️

    hyperlink = AnyUrl(uri_text)
except Exception:
    # Invalid URL, skip
    return
Member

This rejects every relative hyperlink. Is that behavior intended?
Examples such as test/data/doc/content_all.gt.dclg.xml show that <uri> also accepts relative hyperlinks, so we should be able to deserialize them.

For instance, deserializing the following text will discard the hyperlink:

<doclang version="1.0.0">
  <text>
    <hyperlink>
      <uri>/blog/</uri>
      hyperlink
    </hyperlink>
  </text>
</doclang>

Also, the roundtrip test in test_roundtrip_hyperlink will fail if we add a relative hyperlink like this:

doc.add_text(
    label=DocItemLabel.TEXT,
    text="simple link",
    hyperlink=Path("/blog/")
)
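
One possible direction, sketched with the standard library only: treat a <uri> value with a scheme as an absolute URL and fall back to a path for anything else, instead of skipping it. The helper name parse_hyperlink is hypothetical and not part of docling_core; in the real deserializer the absolute branch would presumably still build an AnyUrl, and the plain string returned here is only a stand-in for that.

```python
from pathlib import PurePosixPath
from typing import Optional, Union
from urllib.parse import urlsplit


def parse_hyperlink(uri_text: str) -> Optional[Union[str, PurePosixPath]]:
    """Hypothetical helper: keep absolute URLs, fall back to a path otherwise."""
    uri_text = uri_text.strip()
    if not uri_text:
        # Empty <uri>: there is nothing to link to.
        return None
    if urlsplit(uri_text).scheme:
        # Absolute URL; the real code would construct an AnyUrl here (assumption).
        return uri_text
    # No scheme, e.g. "/blog/": keep it as a relative reference instead of dropping it.
    return PurePosixPath(uri_text)
```

Note that a path object normalizes away the trailing slash of "/blog/", so a byte-exact roundtrip of the <uri> text may need the original string to be preserved separately.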

Comment on lines +2681 to +2686
content_elements = [
    node
    for node in el.childNodes
    if isinstance(node, Element)
    and node.tagName not in {DoclangToken.URI.value, DoclangToken.LOCATION.value, DoclangToken.LAYER.value}
]
Member

This collects only element children, so the text nodes around them are dropped.

For instance, if you deserialize:

<hyperlink>
  <uri>https://example.com</uri>
  pre
  <bold>mid</bold>
  post
</hyperlink>

you will get:

TextItem(text="mid", hyperlink="https://example.com/")

This is the example I used:

from docling_core.experimental.doclang import DoclangDeserializer

xml = """<doclang version="1.0.0">
  <text>
    <hyperlink>
      <uri>https://example.com</uri>
      pre
      <bold>mid</bold>
      post
    </hyperlink>
  </text>
</doclang>"""

doc = DoclangDeserializer().deserialize(text=xml)

for item in doc.texts:
    print(item.text)

(you will get just mid, without pre or post)
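
A minimal sketch of a filter that keeps the text nodes as well, written against xml.dom.minidom on the assumption that the deserializer works with that DOM API (the childNodes and tagName usage suggests so). The literal tag names in SKIPPED_TAGS stand in for the DoclangToken values and are assumptions, and content_nodes and flatten_text are hypothetical helpers, not existing functions in docling_core.

```python
from xml.dom.minidom import Element, Text, parseString

# Assumed tag names standing in for DoclangToken.URI / LOCATION / LAYER values.
SKIPPED_TAGS = {"uri", "location", "layer"}


def content_nodes(el: Element) -> list:
    """Return element children and non-blank text children of el, in document order."""
    nodes = []
    for node in el.childNodes:
        if isinstance(node, Element) and node.tagName not in SKIPPED_TAGS:
            nodes.append(node)
        elif isinstance(node, Text) and node.data.strip():
            # Keep bare text such as "pre" and "post"; skip whitespace-only nodes.
            nodes.append(node)
    return nodes


def flatten_text(node) -> str:
    """Concatenate the stripped text of a node and its descendants."""
    if isinstance(node, Text):
        return node.data.strip()
    pieces = (flatten_text(child) for child in node.childNodes)
    return " ".join(piece for piece in pieces if piece)


xml = """<hyperlink>
  <uri>https://example.com</uri>
  pre
  <bold>mid</bold>
  post
</hyperlink>"""

link = parseString(xml).documentElement
parts = [flatten_text(node) for node in content_nodes(link)]
print(parts)  # ['pre', 'mid', 'post']
```

How the three pieces are then mapped to items (one TextItem per run, or a single inline group) is a separate design choice that this sketch does not address.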

3 participants