Ignore invalid HTML (or self closed tags) #251

@titospeakap

Up to version 2.2.0, the following HTML is fully parsed:

<H1>Heading 1</h1>
<p>Paragraph
<b>Second</b> line.</p>
<ul><li>List item 1</li><li>List item 2<ul><li>List item 2.1</li><li>List item 2.2</li></ul></li><li>List item 3</ul>
<p>Paragraph 2</p>
<h2>Heading 2</h2>
<p>Paragraph 3</p>
<p><img alt="image" width="100" height="20"></p>
<audio />
<video />
<p><a data-rel="attachment">attachment</a></p>
<p>Another paragraph. <a href="http://url.to.link">Hyperlink</a>.</p>
<ol><li>List item 1</li><li>List item 2<ol><li>List item 2.1</li><li>List item 2.2</li></ol></li><li>List item 3</ol>

In more recent versions, parsing stops at the <audio /> tag (if I change it to <audio></audio>, it works), but no errors are reported (->hasErrors() returns false).

Is this behaviour intentional? And is there a way in more recent versions to replicate what happens in version 2.2.0 and below?

For the HTML shared above, here is the code I'm running:

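use Masterminds\HTML5;

$html5 = new HTML5();
// loadHTMLFragment() returns the parsed fragment; keep it for iteration
$fragment = $html5->loadHTMLFragment($html);
foreach ($fragment->childNodes as $child) {
    echo $child->nodeName . "\n";
}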

And the respective output in version 2.9.0:

h1
#text
p
#text
ul
#text
p
#text
h2
#text
p
#text
p
#text
audio

But for version 2.2.0, I get:

h1
#text
p
#text
ul
#text
p
#text
h2
#text
p
#text
p
#text
audio
#text
video
#text
p
#text
p
#text
ol
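
Since replacing <audio /> with <audio></audio> makes the parser continue, one stopgap I can think of is to expand self-closing syntax on non-void elements before loading the fragment. Below is a minimal sketch of that idea. The helper name expandSelfClosingTags is hypothetical and not part of the library, and the regex is deliberately naive: it assumes no ">" inside attribute values and does not skip comments or script content.

```php
<?php
// Hypothetical helper (not part of the library): rewrite "<tag ... />" as
// "<tag ...></tag>" for elements that are not HTML void elements.
function expandSelfClosingTags(string $html): string
{
    // Void elements never take a closing tag, so they are left untouched.
    // "param" is a legacy entry kept here for safety.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'];

    $result = preg_replace_callback(
        // Group 1: tag name. Group 2: optional attributes (naive match).
        '~<([a-zA-Z][a-zA-Z0-9-]*)((?:\s[^<>]*?)?)\s*/>~',
        function (array $m) use ($void): string {
            if (in_array(strtolower($m[1]), $void, true)) {
                return $m[0]; // e.g. <br /> or <img ... /> stays as written
            }
            return '<' . $m[1] . rtrim($m[2]) . '></' . $m[1] . '>';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

With this in place, the call in my snippet would become $html5->loadHTMLFragment(expandSelfClosingTags($html)). This only works around the symptom; I would still like to know whether the change in behaviour is intended.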
