Add article to node search tags #865

Open
8W9aG wants to merge 1 commit into codelucas:master from 8W9aG:include-article-tag
8W9aG commented Dec 17, 2020

  • The `<article>` tag often denotes exactly where the article begins and ends in HTML5.

I noticed that the wrong text was being pulled from articles on bbc.co.uk. This change includes whole `<article>` elements among the nodes that are searched, rather than focusing only on good paragraphs or table rows.
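
To illustrate the idea, here is a minimal sketch of how adding `article` to the list of searched tags changes the set of candidate nodes. It is not the library's actual code: `BASE_TAGS`, `TAGS_WITH_ARTICLE`, and `candidate_nodes` are hypothetical names, the `p`/`pre`/`td` baseline is an assumption based on the "paragraphs or table rows" wording above, and the standard library's `xml.etree.ElementTree` stands in for the project's real HTML parser so the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag lists, for illustration only (not copied from the library).
BASE_TAGS = ("p", "pre", "td")
TAGS_WITH_ARTICLE = BASE_TAGS + ("article",)


def candidate_nodes(doc, tags):
    """Collect every element whose tag is in `tags`, grouped by tag."""
    nodes = []
    for tag in tags:
        nodes.extend(doc.iter(tag))
    return nodes


# A small, well-formed page: one promo paragraph outside the story,
# and the story itself wrapped in <article>.
html = """
<html><body>
  <div class="promo"><p>Sign up for our newsletter.</p></div>
  <article>
    <h1>Headline</h1>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </article>
</body></html>
""".strip()

doc = ET.fromstring(html)

before = [node.tag for node in candidate_nodes(doc, BASE_TAGS)]
after = [node.tag for node in candidate_nodes(doc, TAGS_WITH_ARTICLE)]

# The <article> element now appears as a candidate in its own right,
# and its text covers the whole story, headline included.
article = candidate_nodes(doc, ("article",))[0]
article_text = " ".join("".join(article.itertext()).split())

print(before)
print(after)
print(article_text)
```

With the extra tag, the `<article>` element competes with individual paragraphs as a candidate, which is the behaviour this pull request aims for on pages that wrap the story in `<article>`. How the candidates are then scored is outside the scope of this sketch.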

* The <article> tag often denotes exactly where the article begins and ends in HTML5.
dc-harsh added a commit to dc-harsh/newspaper that referenced this pull request May 29, 2026
- requirements: add lxml-html-clean (fixes `import newspaper` breaking on lxml>=5)
- nlp: NLTK punkt_tab fallback in split_sentences (codelucas#1023) + collapse
  intra-sentence newlines (codelucas#873); download_corpora adds punkt_tab (codelucas#1023)
- cleaners: itemprop contains("articleBody") keeps multi-token itemprop nodes (codelucas#953)
- extractors: include <article> in scoring candidate nodes (codelucas#865)
- tests: expect precise meta article:published_time (intended fork behavior)
- bump version to 0.4.0