fix: update world bank document handling to use description as full c…#153
Open
lpi-tn wants to merge 4 commits into
Pull request overview
This PR updates the World Bank Open Knowledge Repository collector to avoid restricted full-text/PDF scraping by using the record’s description (abstract) as the document’s full_content, and adjusts metadata flags + tests to reflect that new content source.
Changes:
- Set `doc.full_content` from `doc.description` (abstract) instead of extracted TXT/PDF content.
- Update `details` flags to indicate content comes from description (`content_from_description=True`, others `False`).
- Update unit test expectations for `full_content` and new `details` flags.
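The change listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Doc` dataclass and the `apply_description_as_content` helper are hypothetical stand-ins, and only the attribute names (`full_content`, `description`, `details`) and the three flag names are taken from the overview.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the collector's document model."""

    description: str = ""
    full_content: str = ""
    details: dict = field(default_factory=dict)


def apply_description_as_content(doc: Doc) -> Doc:
    """Use the abstract as full content and record where it came from."""
    # The abstract replaces any extracted TXT/PDF text.
    doc.full_content = doc.description
    # Flag names follow the overview; the surrounding structure is assumed.
    doc.details.update(
        {
            "content_from_pdf": False,
            "content_from_txt": False,
            "content_from_description": True,
        }
    )
    return doc
```

Keeping the three flags side by side lets downstream consumers tell an abstract-only record apart from one that carries extracted full text.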
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| welearn_datastack/plugins/rest_requesters/world_bank_okr.py | Switches full content source to abstract/description and updates content-source flags accordingly. |
| tests/document_collector_hub/plugins_test/test_world_bank_okr.py | Updates assertions to validate the new full-content behavior and details flags. |
sandragjacinto
approved these changes
Jun 25, 2026
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This pull request updates how full content is handled for World Bank Open Knowledge Repository documents. Instead of attempting to extract and use the full text or PDF content (which is restricted), the system now uses the document's description (abstract) as the full content. Additionally, the metadata flags in the details are updated to accurately reflect the new content source.
Content extraction logic changes:
- The `_update_welearn_document` method in `world_bank_okr.py` now sets `doc.full_content` to the document's description instead of extracted text or PDF content, since scraping full content is not permitted.
- The `details` dictionary now sets `content_from_pdf` and `content_from_txt` to `False`, and introduces `content_from_description` as `True` to reflect the actual source of content.

Test updates:
- `test__update_welearn_document` is updated to check that `ret.full_content` matches the description, and to verify the new details flags (`content_from_description` is `True`, others are `False`).

Minor code cleanup:
- `WrapperRetrieveDocument` instantiation adjusted for readability.
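
The updated test expectations can be pictured along these lines. This is a hedged sketch only: `update_document` is a hypothetical stand-in for `_update_welearn_document` (whose real signature is not shown here), and the `SimpleNamespace` object stands in for the real document model. The assertions mirror the checks described in the test updates.

```python
import unittest
from types import SimpleNamespace


def update_document(doc):
    """Hypothetical stand-in for the collector's update step."""
    doc.full_content = doc.description
    doc.details = {
        "content_from_pdf": False,
        "content_from_txt": False,
        "content_from_description": True,
    }
    return doc


class TestUpdateDocument(unittest.TestCase):
    def test_full_content_comes_from_description(self):
        # Assumed minimal document shape, not the project's real model.
        doc = SimpleNamespace(
            description="Sample abstract.", full_content="", details={}
        )
        ret = update_document(doc)
        # Full content must equal the abstract.
        self.assertEqual(ret.full_content, "Sample abstract.")
        # Only the description flag is set.
        self.assertTrue(ret.details["content_from_description"])
        self.assertFalse(ret.details["content_from_pdf"])
        self.assertFalse(ret.details["content_from_txt"])
```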