fix: update world bank document handling to use description as full c…#153

Open
lpi-tn wants to merge 4 commits into main from Fix/world-bank-switch-on-abstract

Conversation

lpi-tn (Collaborator) commented Jun 25, 2026

This pull request changes how full content is handled for World Bank Open Knowledge Repository documents. Instead of extracting the full text or PDF content, which is restricted, the system now uses the document's description (abstract) as the full content. The metadata flags in the details dictionary are updated to reflect the new content source.

Content extraction logic changes:

  • The _update_welearn_document method in world_bank_okr.py now sets doc.full_content to the document's description instead of extracted text or PDF content, since scraping full content is not permitted.
  • The details dictionary now sets content_from_pdf and content_from_txt to False and adds content_from_description, set to True, to reflect the actual source of the content.
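
For illustration, the change described above could be sketched roughly as follows. This is not the code from world_bank_okr.py: `DocSketch` and `update_welearn_document_sketch` are hypothetical stand-ins for the real document model and the `_update_welearn_document` method, and only the three flag names and the `full_content`/`description` attributes come from the PR description.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DocSketch:
    """Hypothetical stand-in for the real WeLearn document model."""

    description: str
    full_content: str = ""
    details: dict[str, Any] = field(default_factory=dict)


def update_welearn_document_sketch(doc: DocSketch) -> DocSketch:
    """Assumed shape of the new behavior: the abstract becomes the full content."""
    # Full text and PDF scraping are not permitted, so reuse the description.
    doc.full_content = doc.description
    # Record where the content actually came from.
    doc.details.update(
        {
            "content_from_pdf": False,
            "content_from_txt": False,
            "content_from_description": True,
        }
    )
    return doc
```

Keeping the two older flags explicitly set to False, rather than dropping them, presumably lets downstream consumers that already read those keys keep working unchanged.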

Test updates:

  • The test test__update_welearn_document is updated to check that ret.full_content matches the description, and to verify the new details flags (content_from_description is True, others are False).
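
A minimal sketch of what such assertions might look like, assuming a `unittest`-style test. The function `update_document_stub` is a hypothetical stand-in for the method under test, and the sample description string is invented for the example; the actual test in test_world_bank_okr.py may be structured differently.

```python
import unittest
from types import SimpleNamespace


def update_document_stub(doc):
    """Hypothetical stand-in for _update_welearn_document."""
    doc.full_content = doc.description
    doc.details = {
        "content_from_pdf": False,
        "content_from_txt": False,
        "content_from_description": True,
    }
    return doc


class TestUpdateWelearnDocumentSketch(unittest.TestCase):
    def test__update_welearn_document(self):
        doc = SimpleNamespace(
            description="Abstract text", full_content="", details={}
        )
        ret = update_document_stub(doc)
        # Full content should now mirror the description (abstract).
        self.assertEqual(ret.full_content, "Abstract text")
        # Only the description flag should be set.
        self.assertTrue(ret.details["content_from_description"])
        self.assertFalse(ret.details["content_from_pdf"])
        self.assertFalse(ret.details["content_from_txt"])
```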

Minor code cleanup:

  • An extra line break is added after a WrapperRetrieveDocument instantiation for readability.

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the World Bank Open Knowledge Repository collector to avoid restricted full-text/PDF scraping by using the record’s description (abstract) as the document’s full_content, and adjusts metadata flags + tests to reflect that new content source.

Changes:

  • Set doc.full_content from doc.description (abstract) instead of extracted TXT/PDF content.
  • Update details flags to indicate content comes from description (content_from_description=True, others False).
  • Update unit test expectations for full_content and new details flags.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| welearn_datastack/plugins/rest_requesters/world_bank_okr.py | Switches full content source to abstract/description and updates content-source flags accordingly. |
| tests/document_collector_hub/plugins_test/test_world_bank_okr.py | Updates assertions to validate the new full-content behavior and details flags. |

Comment thread welearn_datastack/plugins/rest_requesters/world_bank_okr.py Outdated
Comment thread tests/document_collector_hub/plugins_test/test_world_bank_okr.py
lpi-tn and others added 3 commits June 25, 2026 16:01
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

3 participants