mroughan/PaperFetch.jl

PaperFetch.jl

License: MIT

PaperFetch.jl helps validate BibTeX bibliographies by checking entries against source metadata and writing human-readable review reports. It is designed for small and medium-sized bibliographies, typically 10 to 100 references, where traceable evidence matters more than bulk harvesting.

PaperFetch.jl does not edit your .bib file. It reports what looks correct, what looks suspicious, and what source metadata it found so that a person, script, or separate AI-assisted editing task can improve the bibliography deliberately.

What PaperFetch Does

  • Parses BibTeX with BibParser.jl, plus simple plain-text DOI/URL lists.
  • Extracts identifiers from normal and misplaced fields, including DOI, arXiv, PMID, ISBN, and URL values found in fields such as note and howpublished.
  • Looks up metadata from deterministic fixtures or optional online providers: Crossref, OpenAlex, Unpaywall, DataCite, arXiv, Semantic Scholar, PubMed, CORE, Figshare, Open Library, Google Books, and URL landing-page metadata. Book lookups can search by title and creator to discover an ISBN, then retry ISBN-specific metadata paths when the input BibTeX has no ISBN. Software repository URLs can use CITATION.cff metadata when available.
  • Compares bibliography fields with cautious normalization for title case, page ranges, Unicode/LaTeX accents, DOI URL variants, author initials, and similar harmless differences. URL paths and queries keep their case, and reordered author lists are flagged for review rather than silently accepted.
  • Uses entry-type-aware comparisons: proceedings and chapter entries compare booktitle, and edited books can use editor instead of author.
  • Matches URLs found in url, note, or howpublished, including LaTeX \url{...} forms, when checking URL-backed references.
  • Resolves competing source candidates conservatively. A provider result must have enough identity evidence, such as matching DOI, title and creator, title and year, or URL, before PaperFetch uses it as the source of truth. When a journal article and an arXiv preprint both plausibly match the same entry and the BibTeX does not distinguish them, the journal article is preferred.
  • Writes Markdown and INC reports; INC is a spreadsheet-friendly CSV-like format handled by IncCSV.jl.
  • Optionally downloads PDFs from explicit PDF candidate URLs and writes fetch manifests.
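
The identifier extraction and cautious normalization described above can be pictured with a small sketch. It uses only Julia's standard library, and every name in it (DOI_PATTERN, find_doi, normalize_pages, normalize_title) is hypothetical: it illustrates the kind of comparison involved and is not PaperFetch.jl's implementation.

```julia
# Hypothetical helpers for illustration; none of these names come from PaperFetch.jl.

# A DOI can appear in any field, for example inside \url{...} in a `note`.
const DOI_PATTERN = r"10\.\d{4,9}/[^\s{}<>]+"i

"Return the first DOI found in `text`, lowercased, or `nothing` if there is none."
function find_doi(text::AbstractString)
    m = match(DOI_PATTERN, text)
    m === nothing && return nothing
    # Trailing sentence punctuation is assumed not to be part of the DOI.
    return lowercase(rstrip(m.match, ['.', ',', ';']))
end

"Collapse `12--34`, `12 - 34`, and en/em-dash variants to the single form `12-34`."
normalize_pages(p::AbstractString) =
    replace(strip(p), r"\s*[-\x{2013}\x{2014}]+\s*" => "-")

"Drop BibTeX braces, collapse whitespace, and lowercase a title for comparison."
function normalize_title(t::AbstractString)
    no_braces = replace(t, r"[{}]" => "")
    return lowercase(join(split(no_braces), " "))
end

entry_doi  = find_doi("Available at \\url{https://doi.org/10.1000/XYZ.42}")
source_doi = find_doi("doi:10.1000/xyz.42")
println(entry_doi == source_doi)   # prints true
```

Differences that survive this kind of normalization are the ones worth a human look. The sketch deliberately leaves URL paths alone, in line with the rule that URL paths and queries keep their case.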

What PaperFetch Does Not Do

  • It does not rewrite or auto-correct the input BibTeX file.
  • It does not ask for, store, or manage library passwords.
  • It does not scrape publisher pages when a suitable API or landing-page metadata route is available.
  • It does not treat provider metadata as automatically correct when it disagrees with your entry. Reports are evidence for review, not automatic authority.

Installation

PaperFetch.jl currently targets Julia 1.11 or newer. Install the registered package with Julia's package manager:

using Pkg
Pkg.add("PaperFetch")

To use the latest development version directly from GitHub:

using Pkg
Pkg.add(url="https://github.com/mroughan/PaperFetch.jl")

For package development from a local checkout:

git clone https://github.com/mroughan/PaperFetch.jl.git
cd PaperFetch.jl
julia --project=. -e 'using Pkg; Pkg.instantiate()'

Quickstart

Run a deterministic offline check with the included example fixture:

julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
  check examples/01_exact_article.bib \
  --fixture examples/metadata_fixture.json \
  --outdir paperfetch_out

A fixture is a small JSON file containing known source metadata for examples or tests. It lets PaperFetch.jl exercise the same comparison and reporting logic without making live API requests, so results are deterministic and repeatable.
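
As a rough mental model, a fixture behaves like a lookup table from identifier to known metadata. The sketch below uses an in-memory Dict with an assumed layout (identifier to field map) and a hypothetical lookup helper. The real schema of examples/metadata_fixture.json may differ, so treat this as an illustration of why fixture runs are repeatable rather than as a description of the file format.

```julia
# Assumed layout: identifier => metadata fields. The real fixture schema may differ.
fixture = Dict(
    "10.1000/xyz.42" => Dict("title" => "An Example Article", "year" => "2020"),
)

# Hypothetical helper: resolve source metadata with no network access at all.
lookup(fixture, doi) = get(fixture, lowercase(doi), nothing)

meta = lookup(fixture, "10.1000/XYZ.42")      # found, regardless of DOI case
missing_meta = lookup(fixture, "10.9999/nope") # `nothing`: no source metadata
```

Because the answer depends only on the table, two runs over the same input always produce the same report.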

The command-line tool prints progress to stderr as it reads, checks, and fetches entries. Pass --quiet to suppress progress messages in scripted runs.

This writes:

  • paperfetch_out/01_exact_article.md
  • paperfetch_out/01_exact_article.inc

The Markdown report is meant for direct reading. The INC report is meant for spreadsheets and downstream tooling. CLI report names default to the input file stem; pass --report-basename NAME to choose a different basename.
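
The naming rule can be sketched in a few lines of Julia. The report_paths helper and its report_basename keyword are hypothetical names chosen to mirror the CLI flag; they are not part of the PaperFetch.jl API.

```julia
# Hypothetical helper showing how report names follow from the input file stem.
function report_paths(input::AbstractString, outdir::AbstractString;
                      report_basename::Union{Nothing,AbstractString} = nothing)
    # Default to the input file name without its directory or extension.
    stem = report_basename === nothing ? first(splitext(basename(input))) : report_basename
    return (markdown = joinpath(outdir, stem * ".md"),
            inc      = joinpath(outdir, stem * ".inc"))
end

default_paths = report_paths("examples/01_exact_article.bib", "paperfetch_out")
custom_paths  = report_paths("references.bib", "paperfetch_out"; report_basename = "thesis")
```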

Each Markdown entry keeps the original BibTeX key, then shows general flags for source discovery, provider errors, required fields, PDF candidates, and confidence. Field-by-field comparisons include a Flag column so green, amber, red, and ignored review signals are visible next to the relevant value. Entry notes include the selected source, the source-resolution confidence, and the identity evidence used to choose it.

Live API Checks

Live provider lookup is opt-in:

julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
  check references.bib \
  --email your.email@example.edu \
  --use-apis \
  --cache-dir .paperfetch_cache \
  --rate-limit-seconds 0.05 \
  --outdir paperfetch_out

Use a real contact email for scholarly APIs. --cache-dir keeps repeat runs faster and gentler on providers. --rate-limit-seconds is a light per-run throttle between uncached live requests; increase it if a provider asks you to slow down.
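
To see why caching and throttling combine well, here is a minimal sketch of a throttle that applies only to uncached requests. It assumes an in-memory Dict as the cache and a caller-supplied fetch function. The Throttle type and cached_get! are hypothetical names, and the real tool keeps its cache on disk under --cache-dir.

```julia
# Hypothetical sketch; `fetch` stands in for a real HTTP request.
mutable struct Throttle
    min_gap::Float64       # minimum seconds between uncached requests
    last_request::Float64  # time() of the previous uncached request
end
Throttle(min_gap) = Throttle(min_gap, 0.0)

function cached_get!(fetch, cache::Dict{String,String}, throttle::Throttle, url::String)
    # Cache hit: return immediately, with no delay and no request.
    haskey(cache, url) && return cache[url]
    # Cache miss: wait until at least `min_gap` seconds have passed.
    wait_for = throttle.min_gap - (time() - throttle.last_request)
    wait_for > 0 && sleep(wait_for)
    throttle.last_request = time()
    return cache[url] = fetch(url)
end

calls = Ref(0)
fake_fetch(url) = (calls[] += 1; "body of " * url)   # counts live requests

cache = Dict{String,String}()
throttle = Throttle(0.05)
first_body  = cached_get!(fake_fetch, cache, throttle, "https://api.example.org/a")
second_body = cached_get!(fake_fetch, cache, throttle, "https://api.example.org/a")
```

The second call is served from the cache, so fake_fetch runs only once. A repeat run over the same bibliography therefore costs the provider nothing for entries that are already cached.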

Fetch PDFs

Fetch mode first checks the bibliography, then downloads only explicit PDF candidate URLs discovered in source metadata:

julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
  fetch references.bib \
  --email your.email@example.edu \
  --use-apis \
  --cache-dir .paperfetch_cache \
  --rate-limit-seconds 0.05 \
  --outdir paperfetch_out

Outputs include:

  • paperfetch_out/references.md
  • paperfetch_out/references.inc
  • paperfetch_out/manifest.md
  • paperfetch_out/manifest.inc
  • downloaded *.pdf files when candidate URLs are available and reachable

manifest.md is the human-readable fetch table. manifest.inc is the spreadsheet/tooling manifest. Entries without PDF candidates are recorded as skipped, not as validation failures.

The manifest records the reference key, compact title, fetch status, local file when downloaded, source URL, and a short diagnostic such as "no PDF candidate", "downloaded from ...", or a failed HTTP/content-type reason.

Credential-Assisted Fetching

Credential-assisted fetching is local and opt-in. PaperFetch.jl never asks for your username or password.

Supported runtime inputs:

  • an EZproxy URL template, for example https://proxy.example.edu/login?url={url};
  • a local browser-exported Netscape-format cookies.txt file.

Example:

julia --project=. -e 'using PaperFetch; PaperFetch.main()' -- \
  fetch references.bib \
  --email your.email@example.edu \
  --use-apis \
  --cache-dir .paperfetch_cache \
  --rate-limit-seconds 0.05 \
  --ezproxy 'https://proxy.example.edu/login?url={url}' \
  --cookie-file /path/to/cookies.txt \
  --outdir paperfetch_out

Treat cookie files as login tokens. Do not commit, upload, email, or share them. Check your library and publisher terms before downloading.
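
For readers unfamiliar with the two inputs, the sketch below shows what each one amounts to. Both helpers are hypothetical and are not PaperFetch.jl functions. The template expansion inserts the target URL unchanged (whether it should be percent-encoded depends on the proxy), and the cookie parser assumes the usual seven tab-separated fields of a Netscape-format file.

```julia
# Hypothetical helpers for illustration only.

# Substitute the target URL for the {url} placeholder in an EZproxy template.
proxied_url(template::AbstractString, target::AbstractString) =
    replace(template, "{url}" => target)

# A data line has seven tab-separated fields:
# domain, include-subdomains flag, path, secure flag, expiry, name, value.
function parse_cookie_line(line::AbstractString)
    line = chomp(line)
    # Some exporters (curl, for example) mark HttpOnly cookies with this prefix.
    startswith(line, "#HttpOnly_") && (line = chopprefix(line, "#HttpOnly_"))
    # Blank lines and ordinary comments carry no cookie.
    (isempty(strip(line)) || startswith(line, "#")) && return nothing
    fields = split(line, '\t')
    length(fields) == 7 || return nothing   # malformed line: ignore it
    return (domain = fields[1], path = fields[3], secure = fields[4] == "TRUE",
            name = fields[6], value = fields[7])
end

url = proxied_url("https://proxy.example.edu/login?url={url}",
                  "https://doi.org/10.1000/xyz.42")
cookie = parse_cookie_line(".example.edu\tTRUE\t/\tTRUE\t0\tsession\tabc123")
```

Note that the parsed value is the login token itself, which is why the file deserves the same care as a password.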

Julia API

using PaperFetch

reports = check_bibliography("references.bib";
    email              = "you@example.edu",
    use_apis           = true,
    cache_dir          = ".paperfetch_cache",
    rate_limit_seconds = 0.05,
)

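# Write the Markdown and INC reports; `paths` is indexed by report kind.
paths = write_reports(reports, "paperfetch_out")
paths[:markdown]   # path of the Markdown report
paths[:inc]        # path of the INC report

# Download explicit PDF candidates found during the check.
results, manifest = fetch_pdfs(reports, "paperfetch_out")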

The fetch_pdfs call writes both paperfetch_out/manifest.md and paperfetch_out/manifest.inc.

For deterministic offline runs, pass a fixture instead of live APIs:

reports = check_bibliography("examples/01_exact_article.bib";
    fixture = "examples/metadata_fixture.json",
    check = :none,
)

check_bibliography skips entries with the key anon by default because that key is often used for anonymized review placeholders. Pass ignore_keys=nothing to keep every entry, or provide a custom set or list of keys to skip.
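
The three behaviours of ignore_keys (default, nothing, custom collection) can be modelled with a short filter. The drop_ignored function and the named-tuple entries are stand-ins invented for this sketch; only the ignore_keys keyword and the anon default come from the description above.

```julia
# Hypothetical stand-in for the filtering step; entries are simple named tuples here.
function drop_ignored(entries; ignore_keys = ["anon"])
    # `nothing` disables filtering, so every entry is kept.
    ignore_keys === nothing && return collect(entries)
    ignored = Set(ignore_keys)   # accepts a list or a set of keys
    return [e for e in entries if !(e.key in ignored)]
end

entries = [
    (key = "anon",      title = "Placeholder for anonymized review"),
    (key = "smith2020", title = "An Example Article"),
]

default_kept = drop_ignored(entries)                                    # skips "anon"
all_kept     = drop_ignored(entries; ignore_keys = nothing)             # keeps both
custom_kept  = drop_ignored(entries; ignore_keys = Set(["smith2020"]))  # skips "smith2020"
```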

Examples And Tests

The examples/ directory contains small cases used by the test suite, covering exact metadata, normalized text differences, missing/conflicting DOI fields, web references, datasets, arXiv preprints, book chapters, online reports, and plain DOI lists.

Run the default offline tests:

julia --project=. -e 'using Pkg; Pkg.test()'

Manual online field tests live in examples/online/ and are not run by default:

PAPERFETCH_ONLINE=true \
PAPERFETCH_EMAIL=your.email@example.edu \
julia --project=. test/online/runtests.jl

Documentation

The documentation includes a quickstart, examples, API reference, and notes on live providers, report formats, fetch manifests, caching, rate limiting, and building a stand-alone executable:

https://mroughan.github.io/PaperFetch.jl/dev

Build docs locally with:

julia --project=docs -e '
  using Pkg
  Pkg.develop(PackageSpec(path=pwd()))
  Pkg.instantiate()
'
julia --project=docs docs/make.jl

Contributing

See CONTRIBUTING.md for development setup, test expectations, provider guidelines, and pull request notes.

Security

See SECURITY.md. In short:

  • do not put usernames or passwords in command-line arguments;
  • keep cookie files local and private;
  • do not commit API caches, downloaded PDFs, or private bibliographies;
  • retrieve only material you are entitled to access.

Citation

If PaperFetch.jl helps your work, please cite it using the metadata in CITATION.cff.

AI Disclosure

This project has been built with help from AI coding agents. The package structure and implementation were developed under user supervision with user-provided architecture and guardrail instructions.
