PaperFetch.jl helps validate BibTeX bibliographies by checking entries against source metadata and writing human-readable review reports. It is designed for small and medium bibliography checks, usually 10-100 references, where traceable evidence matters more than bulk harvesting.
PaperFetch.jl does not edit your .bib file. It reports what looks correct,
what looks suspicious, and what source metadata it found so that a person, script,
or separate AI-assisted editing task can improve the bibliography deliberately.
- Parses BibTeX with BibParser.jl, plus simple plain-text DOI/URL lists.
- Extracts identifiers from normal and misplaced fields, including DOI, arXiv,
PMID, ISBN, and URL values found in fields such as note and howpublished.
- Looks up metadata from deterministic fixtures or optional online providers:
Crossref, OpenAlex, Unpaywall, DataCite, arXiv, Semantic Scholar, PubMed,
CORE, Figshare, Open Library, Google Books, and URL landing-page metadata.
Book lookups can search by title and creator to discover an ISBN, then retry
ISBN-specific metadata paths when the input BibTeX has no ISBN.
Software repository URLs can use CITATION.cff metadata when available.
- Compares bibliography fields with cautious normalization for title case, page ranges, Unicode/LaTeX accents, DOI URL variants, author initials, and similar harmless differences. URL paths and queries keep their case, and reordered author lists are flagged for review rather than silently accepted.
- Uses entry-type-aware comparisons: proceedings and chapter entries compare
booktitle, and edited books can use editor instead of author.
- Matches URLs found in url, note, or howpublished, including LaTeX \url{...} forms, when checking URL-backed references.
- Resolves competing source candidates conservatively. A provider result must have enough identity evidence, such as matching DOI, title and creator, title and year, or URL, before PaperFetch uses it as the source of truth. When a journal article and an arXiv preprint both plausibly match the same entry and the BibTeX does not distinguish them, the journal article is preferred.
- Writes Markdown and INC reports; INC is a spreadsheet-friendly CSV-like format handled by IncCSV.jl.
- Optionally downloads PDFs from explicit PDF candidate URLs and writes fetch manifests.
- It does not rewrite or auto-correct the input BibTeX file.
- It does not ask for, store, or manage library passwords.
- It does not scrape publisher pages when a suitable API or landing-page metadata route is available.
- It does not assume a provider is right whenever its metadata disagrees with your entry. Reports are evidence for review, not automatic authority.
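The identifier extraction and DOI URL normalization listed among the features above can be illustrated with a minimal sketch. The names `DOI_PATTERN`, `normalize_doi`, and `find_doi` are hypothetical and are not part of the PaperFetch.jl API; the sketch only assumes that an entry is available as a dictionary of field names to string values.

```julia
# Hypothetical helpers for illustration; these are not PaperFetch.jl functions.

# A DOI starts with "10.", a numeric registrant prefix, and "/", followed by a
# suffix. Here the suffix stops at whitespace, braces, or angle brackets.
const DOI_PATTERN = r"10\.\d{4,9}/[^\s{}<>]+"

# Reduce DOI variants (resolver URLs, "doi:" prefixes, mixed case, trailing
# punctuation) to one comparable form. DOIs are case-insensitive.
function normalize_doi(raw::AbstractString)
    doi = lowercase(strip(raw))
    doi = replace(doi, r"^https?://(dx\.)?doi\.org/" => "")
    doi = replace(doi, r"^doi:\s*" => "")
    return rstrip(doi, ['.', ',', ';'])
end

# Search the usual field first, then fields where a DOI is often misplaced.
function find_doi(fields::AbstractDict)
    for name in ("doi", "url", "note", "howpublished")
        m = match(DOI_PATTERN, get(fields, name, ""))
        m === nothing || return normalize_doi(m.match)
    end
    return nothing
end

entry = Dict("note" => "Available at \\url{https://doi.org/10.1000/ABC.123}.")
println(find_doi(entry))
```

Searching the doi field before the others means an explicit identifier takes priority over one embedded in free text.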
PaperFetch.jl currently targets Julia 1.11 or newer. Install the registered package with Julia's package manager:
using Pkg
Pkg.add("PaperFetch")

To use the latest development version directly from GitHub:
using Pkg
Pkg.add(url="https://github.com/mroughan/PaperFetch.jl")

For package development from a local checkout:
git clone https://github.com/mroughan/PaperFetch.jl.git
cd PaperFetch.jl
julia --project=. -e 'using Pkg; Pkg.instantiate()'

Run a deterministic offline check with the included example fixture:
julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
check examples/01_exact_article.bib \
--fixture examples/metadata_fixture.json \
--outdir paperfetch_out

A fixture is a small JSON file containing known source metadata for examples or tests. It lets PaperFetch.jl exercise the same comparison and reporting logic without making live API requests, so results are deterministic and repeatable.
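The idea of a fixture can be sketched with an in-memory table. This is an illustration only: the real fixture JSON schema is not described here and may differ, and `FIXTURE`, `lookup_fixture`, and `field_matches` are hypothetical names.

```julia
# Hypothetical in-memory stand-in for a fixture file; the JSON schema used by
# PaperFetch.jl may differ from this shape.
const FIXTURE = Dict(
    "10.1000/example.1" => Dict("title" => "An Example Article", "year" => "2020"),
)

# Look up source metadata by DOI without any network access.
lookup_fixture(fixture, doi) = get(fixture, lowercase(doi), nothing)

# Compare one field of a BibTeX entry with the fixture record, ignoring case
# and surrounding whitespace.
function field_matches(entry, record, field)
    haskey(entry, field) && haskey(record, field) || return false
    return lowercase(strip(entry[field])) == lowercase(strip(record[field]))
end

record = lookup_fixture(FIXTURE, "10.1000/EXAMPLE.1")
entry = Dict("title" => "an example article ", "year" => "2021")
println(field_matches(entry, record, "title"))
println(field_matches(entry, record, "year"))
```

Because the lookup reads only local data, repeated runs give the same answer regardless of provider availability.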
The command-line tool prints progress to stderr as it reads, checks, and
fetches entries. Pass --quiet to suppress progress messages in scripted runs.
This writes:
- paperfetch_out/01_exact_article.md
- paperfetch_out/01_exact_article.inc
The Markdown report is meant for direct reading. The INC report is meant for
spreadsheets and downstream tooling. CLI report names default to the input file
stem; pass --report-basename NAME to choose a different basename.
Each Markdown entry keeps the original BibTeX key, then shows general flags for
source discovery, provider errors, required fields, PDF candidates, and
confidence. Field-by-field comparisons include a Flag column so green, amber,
red, and ignored review signals are visible next to the relevant value.
Entry notes include the selected source, the source-resolution confidence, and
the identity evidence used to choose it.
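How identity evidence might drive source selection can be sketched as follows. This is not the package's resolver: the `Candidate` type, the evidence labels, and `choose_source` are hypothetical, and URL evidence is left out for brevity.

```julia
# Hypothetical candidate record; field names are illustrative only.
struct Candidate
    kind::Symbol      # e.g. :journal or :arxiv
    doi::String
    title::String
    creator::String
    year::String
end

fold(s) = lowercase(strip(s))

# Collect the kinds of identity evidence linking an entry to a candidate.
function identity_evidence(entry::AbstractDict, c::Candidate)
    same(field, value) = !isempty(value) && fold(get(entry, field, "")) == fold(value)
    evidence = Symbol[]
    same("doi", c.doi) && push!(evidence, :doi)
    title_ok = same("title", c.title)
    title_ok && same("author", c.creator) && push!(evidence, :title_creator)
    title_ok && same("year", c.year) && push!(evidence, :title_year)
    return evidence
end

# Keep candidates with at least one kind of evidence, then prefer a journal
# article over an arXiv preprint when both remain.
function choose_source(entry, candidates)
    accepted = [c for c in candidates if !isempty(identity_evidence(entry, c))]
    isempty(accepted) && return nothing
    journal = findfirst(c -> c.kind == :journal, accepted)
    return journal === nothing ? first(accepted) : accepted[journal]
end

entry = Dict("title" => "Graph Sampling", "year" => "2021")
preprint = Candidate(:arxiv, "", "Graph Sampling", "", "2021")
article = Candidate(:journal, "10.1000/j.1", "graph sampling", "", "2021")
println(choose_source(entry, [preprint, article]).kind)
```

Returning the evidence list, rather than a bare true or false, is what lets a report show why a source was chosen.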
Live provider lookup is opt-in:
julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
check references.bib \
--email your.email@example.edu \
--use-apis \
--cache-dir .paperfetch_cache \
--rate-limit-seconds 0.05 \
--outdir paperfetch_out

Use a real contact email for scholarly APIs. --cache-dir keeps repeat runs
faster and gentler on providers. --rate-limit-seconds is a light per-run
throttle between uncached live requests; increase it if a provider asks you to
slow down.
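The interaction between caching and throttling can be sketched like this. It is a simplified in-memory model, not the package's implementation: `ThrottledCache` and `cached_get!` are hypothetical, and `fetch_fn` stands in for a real HTTP request.

```julia
# Hypothetical throttled, cached fetcher; `fetch_fn` stands in for an HTTP call.
mutable struct ThrottledCache
    cache::Dict{String,String}
    min_interval::Float64   # minimum seconds between uncached requests
    last_request::Float64   # time of the most recent uncached request
end

ThrottledCache(min_interval) = ThrottledCache(Dict{String,String}(), min_interval, -Inf)

function cached_get!(fetch_fn, tc::ThrottledCache, url::AbstractString)
    # Cache hit: no waiting and no request.
    haskey(tc.cache, url) && return tc.cache[url]
    # Cache miss: wait until the minimum interval has passed, then fetch.
    wait_for = tc.min_interval - (time() - tc.last_request)
    wait_for > 0 && sleep(wait_for)
    tc.last_request = time()
    return tc.cache[url] = fetch_fn(url)
end

tc = ThrottledCache(0.05)
for url in ("https://example.org/a", "https://example.org/a", "https://example.org/b")
    body = cached_get!(tc, url) do u
        "response for $u"   # placeholder for a live lookup
    end
    println(body)
end
```

Only the third call in the loop waits, because the second is answered from the cache.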
Fetch mode first checks the bibliography, then downloads only explicit PDF candidate URLs discovered in source metadata:
julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
fetch references.bib \
--email your.email@example.edu \
--use-apis \
--cache-dir .paperfetch_cache \
--rate-limit-seconds 0.05 \
--outdir paperfetch_out

Outputs include:
- paperfetch_out/references.md
- paperfetch_out/references.inc
- paperfetch_out/manifest.md
- paperfetch_out/manifest.inc
- downloaded *.pdf files when candidate URLs are available and reachable
manifest.md is the human-readable fetch table. manifest.inc is the
spreadsheet/tooling manifest. Entries without PDF candidates are recorded as
skipped, not as validation failures.
The manifest records the reference key, compact title, fetch status, local file when downloaded, source URL, and a short diagnostic such as "no PDF candidate", "downloaded from ...", or a failed HTTP/content-type reason.
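The status and diagnostic columns could be derived along these lines. This is a guess at the shape of the logic, not the package's code: `FetchOutcome`, `classify_fetch`, the status symbols other than "skipped", and the exact failure wording are all hypothetical.

```julia
# Hypothetical outcome record for one manifest row.
struct FetchOutcome
    status::Symbol        # :skipped, :downloaded, or :failed
    diagnostic::String
end

# Classify one fetch attempt from the candidate URL and the HTTP response.
function classify_fetch(url::Union{Nothing,AbstractString};
                        http_status::Int=0, content_type::AbstractString="")
    # No candidate is a skip, not a validation failure.
    url === nothing && return FetchOutcome(:skipped, "no PDF candidate")
    http_status == 200 || return FetchOutcome(:failed, "HTTP status $http_status")
    startswith(lowercase(content_type), "application/pdf") ||
        return FetchOutcome(:failed, "unexpected content type: $content_type")
    return FetchOutcome(:downloaded, "downloaded from $url")
end

println(classify_fetch(nothing))
println(classify_fetch("https://example.org/paper.pdf";
                       http_status=200, content_type="application/pdf"))
println(classify_fetch("https://example.org/landing";
                       http_status=200, content_type="text/html"))
```

Checking the content type guards against saving an HTML login or landing page under a .pdf name.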
Credential-assisted fetching is local and opt-in. PaperFetch.jl never asks for your username or password.
Supported runtime inputs:
- an EZproxy URL template, for example https://proxy.example.edu/login?url={url};
- a local browser-exported Netscape-format cookies.txt file.
Example:
julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
fetch references.bib \
--email your.email@example.edu \
--use-apis \
--cache-dir .paperfetch_cache \
--rate-limit-seconds 0.05 \
--ezproxy 'https://proxy.example.edu/login?url={url}' \
--cookie-file /path/to/cookies.txt \
--outdir paperfetch_out

Treat cookie files as login tokens. Do not commit, upload, email, or share them. Check your library and publisher terms before downloading.
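To make the two inputs concrete, here is a minimal sketch of how a {url} template and a Netscape-format cookie file might be interpreted. The helper names are hypothetical and not part of PaperFetch.jl, and the sketch assumes the target URL is substituted as-is, without percent-encoding. The cookie format itself is seven tab-separated fields per line.

```julia
# Hypothetical helpers; not part of the PaperFetch.jl API.

# Substitute the target URL into an EZproxy-style template containing "{url}".
# Assumption: the URL is inserted as-is, without percent-encoding.
function apply_proxy_template(template::AbstractString, url::AbstractString)
    occursin("{url}", template) || throw(ArgumentError("template lacks {url}"))
    return replace(template, "{url}" => url)
end

# Parse Netscape-format cookie text: seven tab-separated fields per line
# (domain, subdomain flag, path, secure flag, expiry, name, value).
function parse_netscape_cookies(text::AbstractString)
    cookies = NamedTuple{(:domain, :path, :name, :value),NTuple{4,String}}[]
    for line in eachline(IOBuffer(text))
        isempty(strip(line)) && continue
        # "#HttpOnly_" marks a real cookie line; other "#" lines are comments.
        if startswith(line, "#HttpOnly_")
            line = line[length("#HttpOnly_")+1:end]
        elseif startswith(line, "#")
            continue
        end
        parts = split(line, '\t')
        length(parts) == 7 || continue   # ignore malformed lines
        push!(cookies, (domain=String(parts[1]), path=String(parts[3]),
                        name=String(parts[6]), value=String(parts[7])))
    end
    return cookies
end

println(apply_proxy_template("https://proxy.example.edu/login?url={url}",
                             "https://doi.org/10.1000/example.1"))
```

A real tool would also honour the expiry and secure fields before sending a cookie, which this sketch ignores.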
using PaperFetch
reports = check_bibliography("references.bib";
email = "you@example.edu",
use_apis = true,
cache_dir = ".paperfetch_cache",
rate_limit_seconds = 0.05,
)
paths = write_reports(reports, "paperfetch_out")
paths[:markdown]
paths[:inc]
results, manifest = fetch_pdfs(reports, "paperfetch_out")

This writes both paperfetch_out/manifest.md and paperfetch_out/manifest.inc.
For deterministic offline runs, pass a fixture instead of live APIs:
reports = check_bibliography("examples/01_exact_article.bib";
fixture = "examples/metadata_fixture.json",
check = :none,
)

check_bibliography skips the key anon by default because that key is often
used for anonymized review placeholders. Pass ignore_keys=nothing to keep every
entry, or provide a custom set/list of keys to skip.
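The three ignore_keys behaviours described above can be modelled with a small filter. This is a standalone illustration with a hypothetical `keep_entries` function; it does not call or reproduce the package's internals.

```julia
# Hypothetical filter mirroring the described behaviour; not the package's code.
function keep_entries(keys; ignore_keys=("anon",))
    # `nothing` disables skipping, so every key is kept.
    ignore_keys === nothing && return collect(keys)
    skipped = Set(ignore_keys)
    return [k for k in keys if k ∉ skipped]
end

keys_in_file = ["smith2020", "anon", "draft1"]
println(keep_entries(keys_in_file))                            # default skips "anon"
println(keep_entries(keys_in_file; ignore_keys=nothing))       # keeps everything
println(keep_entries(keys_in_file; ignore_keys=["draft1"]))    # custom skip list
```

Note that in this model a custom list replaces the default rather than extending it, so "anon" is kept in the last call; whether the package behaves the same way is an assumption.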
The examples/ directory contains small cases used by the test suite, covering
exact metadata, normalized text differences, missing/conflicting DOI fields,
web references, datasets, arXiv preprints, book chapters, online reports, and
plain DOI lists.
Run the default offline tests:
julia --project=. -e 'using Pkg; Pkg.test()'

Manual online field tests live in examples/online/ and are not run by default:
PAPERFETCH_ONLINE=true \
PAPERFETCH_EMAIL=your.email@example.edu \
julia --project=. test/online/runtests.jl

The documentation includes a quickstart, examples, API reference, and notes on live providers, report formats, fetch manifests, caching, rate limiting, and building a stand-alone executable:
https://mroughan.github.io/PaperFetch.jl/dev
Build docs locally with:
julia --project=docs -e '
using Pkg
Pkg.develop(PackageSpec(path=pwd()))
Pkg.instantiate()
'
julia --project=docs docs/make.jl

See CONTRIBUTING.md for development setup, test expectations, provider guidelines, and pull request notes.
See SECURITY.md. In short:
- do not put usernames or passwords in command-line arguments;
- keep cookie files local and private;
- do not commit API caches, downloaded PDFs, or private bibliographies;
- retrieve only material you are entitled to access.
If PaperFetch.jl helps your work, please cite it using the metadata in CITATION.cff.
This project has been built with help from AI coding agents. The package structure and implementation were developed under user supervision with user-provided architecture and guardrail instructions.