v3.0.0: Document, image, audio & YouTube conversion + frontmatter-first output #30
Merged
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ody size limit Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ractWeb Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…length trim Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds extractFile(buffer, options) to lib/web.js — a sibling to extractHtml() that converts uploaded document bytes (PDF/Office/EPUB/…) via the markitdown sidecar with no URL in the output header (filename shown instead). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce/title fallback Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accepts raw binary document bytes (PDF, Office, EPUB, etc.) and converts via extractFile() (markitdown sidecar). Same privacy model as /api/html: no cache.put, telemetry logs constant 'local-file' placeholder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fy 502 + 413 comments Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds markitdown boolean (true when MARKITDOWN_URL env var is set) to GET /api/config so the PWA can conditionally advertise document upload. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ccept Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… status; tidy .env sidecar URLs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… HTML fallback) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkitdown flag Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ription Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, clarify model fallback docs Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document types (PDF, DOCX, PPTX, etc.) always route to markitdown. Image and audio content-types only route when MARKITDOWN_MEDIA env var is set, keeping media-to-markdown opt-in for self-hosters. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ump to 2.8.0 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p md shadow Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t opts When MARKITDOWN_YOUTUBE is set, extractWeb() now detects YouTube/youtu.be URLs and dispatches them to the youtube sidecar client (convertYoutubeViaSidecar). The branch runs after decodeBody and the Cloudflare short-circuit, before convertWithReadability; a null return from the sidecar falls through to the normal HTML pipeline with no hard failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
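A minimal sketch of the dispatch order described in this commit, assuming hostname-based detection. `isYoutubeUrl` and `extractWithYoutubeFallback` are hypothetical names, and the injected `convertYoutube` / `convertHtml` callbacks stand in for the real sidecar client and HTML pipeline, which are not shown here.

```javascript
// Hypothetical helper: the actual detection logic in extractWeb() is not
// part of this PR text, so hostname matching is an assumption.
function isYoutubeUrl(rawUrl) {
  let host;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false; // not a parseable absolute URL
  }
  // Accept youtu.be, youtube.com, and subdomains such as www. or m.
  return host === 'youtu.be' || host === 'youtube.com' || host.endsWith('.youtube.com');
}

// Try the sidecar first; a null result means "not handled", so the caller
// continues with the normal HTML pipeline instead of failing.
async function extractWithYoutubeFallback(url, { youtubeEnabled, convertYoutube, convertHtml }) {
  if (youtubeEnabled && isYoutubeUrl(url)) {
    const result = await convertYoutube(url);
    if (result !== null) return result;
  }
  return convertHtml(url);
}
```

Treating `null` as a soft miss keeps the YouTube tier optional: an unreachable or transcript-less sidecar response costs one extra attempt rather than an error.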
…ions Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… field Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… url test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k to markitdown Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…OCR fallback Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nfig flag; pdf_pages frontmatter Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odel frontmatter Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n note Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The OCR provider default endpoint (Mistral) differs from the chat LLM (default OpenAI), so sharing the chat key is a footgun. Vision/STT keep the shared fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "What's new in v3" section (intro note + highlights with deep links) near the top of the README, and a bilingual "Neu in v3 / New in v3" card as the first /help section. Covers clean body, document conversion, PDF-OCR, image/audio, YouTube, and richer frontmatter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fully + YAML CR escape
Three correctness/security fixes from code review:
- web: wrap caption/transcribe calls in try/catch so a failing (not just unconfigured) vision/STT provider falls back to normal extraction instead of surfacing as HTTP 502. Covers extractWeb (URL) and extractFile (upload).
- cache+frontmatter: persist extraction metadata on the cache row (new metadata JSON column + migration) and re-inject media/LLM fields on every serve path via a single mergeMediaFrontmatter() helper. Fixes cached youtube/image/pdf-ocr entries silently dropping duration/views/image_size/pdf_pages/llm_* when served with frontmatter=true. Replaces the 3x duplicated merge block in server.js + the partial copy in mcp.js (MCP now emits the full media field set, not just duration/views).
- frontmatter: quoteYamlString now neutralizes carriage returns (\r) in addition to \n, closing a YAML line-injection gap via attacker-controlled titles/descriptions.
+12 tests (frontmatter CR + mergeMediaFrontmatter, cache metadata round-trip, provider-throw fallback, cache-hit media frontmatter end-to-end). 702/702 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
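To illustrate the line-injection gap closed by the frontmatter fix, here is a small sketch. `quoteYamlStringSketch` is a hypothetical stand-in, not the project's `quoteYamlString`; it assumes "neutralizes" means escaping CR and LF inside a double-quoted YAML scalar, which the commit message does not spell out.

```javascript
// Illustrative only: a stand-in for quoteYamlString, assuming escape-based
// neutralization inside a double-quoted YAML scalar.
function quoteYamlStringSketch(value) {
  const escaped = String(value)
    .replace(/\\/g, '\\\\') // escape backslashes first so later escapes stay intact
    .replace(/"/g, '\\"')
    .replace(/\r/g, '\\r') // a raw CR could otherwise end the line in some parsers
    .replace(/\n/g, '\\n');
  return `"${escaped}"`;
}

// An attacker-controlled title that tries to smuggle in a second key.
const hostileTitle = 'Harmless title\rinjected_key: true';
console.log(`title: ${quoteYamlStringSketch(hostileTitle)}`);
```

With the CR escaped, the whole value stays on one physical line inside the quotes, so `injected_key` cannot be parsed as a separate frontmatter field.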
…dpoint derivation
- convertYoutubeViaSidecar now checks opts.signal.aborted before attaching the listener (matching convertViaMarkitdown), so an already-cancelled request no longer fires a full 30s fetch to the sidecar.
- Derive the /youtube endpoint robustly: swap a trailing /convert, else append /youtube to the base. Previously a MARKITDOWN_URL without a /convert suffix silently POSTed to the wrong path and dropped transcripts.
+3 tests. 705/705 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
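The two fixes in this commit can be sketched as follows. `deriveYoutubeEndpoint` and `isAlreadyAborted` are hypothetical names, and the `markitdown:8000` host is an assumed example value for MARKITDOWN_URL; the real code in convertYoutubeViaSidecar may differ in detail.

```javascript
// Swap a trailing /convert for /youtube, otherwise append /youtube to the base.
function deriveYoutubeEndpoint(markitdownUrl) {
  const url = new URL(markitdownUrl);
  const path = url.pathname.replace(/\/+$/, ''); // ignore trailing slashes
  url.pathname = path.endsWith('/convert')
    ? path.slice(0, -'/convert'.length) + '/youtube'
    : path + '/youtube';
  return url.toString();
}

// Check the signal before doing any work, so an already-cancelled request
// never starts a fetch to the sidecar.
function isAlreadyAborted(signal) {
  return Boolean(signal && signal.aborted);
}

console.log(deriveYoutubeEndpoint('http://markitdown:8000/convert')); // http://markitdown:8000/youtube
console.log(deriveYoutubeEndpoint('http://markitdown:8000'));         // http://markitdown:8000/youtube
```

Both spellings of the base URL resolve to the same endpoint, which is the property the commit restores for configurations without a /convert suffix.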
…port
Two remaining code-review findings:
- DoS hardening: each markitdown conversion now runs in a disposable child process (lib limits.run_guarded) with a wall-clock timeout (+ optional RLIMIT_AS cap). A decompression bomb or pathological document can no longer pin CPU or OOM the long-lived uvicorn process: the child is killed and the server stays up. Uses the 'spawn' start method to avoid fork-in-thread deadlocks with markitdown's lazy imports. New env knobs (MARKITDOWN_CONVERT_TIMEOUT, MARKITDOWN_MEM_LIMIT_MB) + a container mem_limit on the sidecar service as the recommended hard memory bound.
- MCP read_url gains a pdf_ocr boolean that forwards pdfOcr to extractWeb and bypasses the cache, mirroring ?pdf=ocr on the HTTP API.
Tests: +1 Node (MCP pdf_ocr forwarding + cache bypass), 706/706 pass; new standalone Python harness test (markitdown-sidecar/test_limits.py) for the timeout/memory/exception guard, 4/4 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e support "URL in, Markdown out" no longer fits now that uploads and non-web sources are supported. New hero (DE default + de/en i18n dicts): Anything in, Markdown out. / Alles rein, Markdown raus. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
"Webseiten als Markdown extrahieren" → "Alles als Markdown extrahieren — Webseiten, Dokumente, Bilder, Audio, YouTube", consistent with the new hero. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion Relative img src/srcset, links, and media URLs survived extraction verbatim (linkedom gives Readability no baseURI; Trafilatura leaves image srcs relative even with url=), so rendered shares resolved them against the PullMD origin and showed broken images. Absolutify all URL-bearing attributes once, right after parsing, and re-serialize so both extractors plus the fallback/comments paths see absolute URLs. data:, mailto:, javascript:, tel:, and #fragment values stay untouched, as does HTML uploaded without a source URL. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
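A minimal sketch of the per-attribute resolution rule described above, using the standard WHATWG `URL` constructor. `absolutifyUrl` is a hypothetical name, and this covers single-URL attributes only; `srcset` needs per-candidate splitting, which is omitted here.

```javascript
// Schemes that must pass through untouched, per the commit message.
const PASS_THROUGH = /^(?:data|mailto|javascript|tel):/i;

function absolutifyUrl(value, baseUrl) {
  const trimmed = value.trim();
  // No source URL (e.g. uploaded HTML), empty values, fragments, and
  // pass-through schemes are returned exactly as they came in.
  if (!baseUrl || trimmed === '' || trimmed.startsWith('#') || PASS_THROUGH.test(trimmed)) {
    return value;
  }
  try {
    return new URL(trimmed, baseUrl).href;
  } catch {
    return value; // leave unparseable values alone rather than dropping them
  }
}

const base = 'https://example.com/blog/post-1';
console.log(absolutifyUrl('/img/a.png', base));   // https://example.com/img/a.png
console.log(absolutifyUrl('figure.png', base));   // https://example.com/blog/figure.png
console.log(absolutifyUrl('#comments', base));    // #comments
```

Resolving once, right after parsing, means every later stage sees the same absolute URLs regardless of which extractor runs.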
"… oder Datei ziehen oder öffnen" stacked two "oder"s and used "öffnen" for what is a file-picker click. Now "Alternativ: Datei hierher ziehen oder auswählen" (EN: "Alternatively: drag a file here or browse"), with matching no-drag variants. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…parity The pullmd service never forwarded PULLMD_PDF_OCR_*, PULLMD_SOURCE_HEADER (the v3 breaking-change opt-out), PULLMD_FRONTMATTER_FIELDS, OAUTH_JWT_SECRET, PULLMD_USER_AGENT/UA_FEED_URL, or PULLMD_SITE_RECIPES, so .env values silently had no effect in Docker deployments. docker-compose.traefik.yml was still on the v2.6 layout — add the markitdown sidecar service and the same env pass-throughs. Document PULLMD_SITE_RECIPES in .env.example (the one var missing there). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
List the OpenAI/Mistral/Groq/Gemini/OpenRouter/Ollama base URLs with their vision/STT capability next to the media-tier vars, plus the implicit defaults (OpenAI for vision/STT, Mistral for PDF OCR) so nobody has to dig through provider docs for the _BASE_URL value. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…examples https://api.anthropic.com/v1 serves OpenAI-style chat completions incl. image_url content, so the vision tier works with a Claude model; no /audio/transcriptions endpoint, hence vision only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… default
- Sidecar URLs: per-sidecar unset behavior (Readability-only, no JS rendering, documents 502 + YouTube tier disabled since /youtube is derived from MARKITDOWN_URL) instead of one misplaced comment
- PULLMD_SOURCE_HEADER: show =true, frame as the v2 compat switch, link MIGRATION.md
- PULLMD_FRONTMATTER_FIELDS: list the known field names, document the ignore-with-warning / safe-fallback behavior and the per-request ?frontmatter=true opt-in
- MARKITDOWN_YT_LANGS: was uncommented de,en (silent German preference for anyone copying the file); now empty default with format example
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
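The ignore-with-warning / safe-fallback behavior for PULLMD_FRONTMATTER_FIELDS could look roughly like this. `resolveFrontmatterFields` is a hypothetical name, the comma-separated format is an assumption, and the field list passed in below is an illustrative subset rather than the project's full known-fields list.

```javascript
// Parse an allowlist value, warn on unknown names, and fall back to all
// known fields when nothing valid remains (assumed reading of "safe fallback").
function resolveFrontmatterFields(rawValue, knownFields, warn = console.warn) {
  if (!rawValue || rawValue.trim() === '') return [...knownFields];
  const requested = rawValue.split(',').map((name) => name.trim()).filter(Boolean);
  const selected = [];
  for (const name of requested) {
    if (!knownFields.includes(name)) {
      warn(`PULLMD_FRONTMATTER_FIELDS: ignoring unknown field "${name}"`);
    } else if (!selected.includes(name)) {
      selected.push(name);
    }
  }
  return selected.length > 0 ? selected : [...knownFields];
}

// Illustrative subset of field names mentioned in this PR.
const known = ['title', 'source', 'duration', 'views', 'pdf_pages'];
console.log(resolveFrontmatterFields('title, source, bogus', known));
```

Falling back to the full set on an entirely invalid value keeps a typo in .env from silently producing empty frontmatter.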
- Embedded compose block was missing CACHE_DB=/data/cache.db: anyone copying it instead of curling the file got the SQLite DB outside the mounted volume (lost on rebuild). Also add the markitdown mem_limit and an explicit "abridged" note pointing at the repo file.
- comment_limit default corrected: no cap (code passes null), not 15
- PULLMD_AUTH_TOKEN: "removed in v3.0" was false (still supported); now "slated for removal in a future major release"
- OAuth sections updated to post-v2.3 reality: client-compat table shows OAuth ✅ for Claude Desktop / claude.ai, Caddy workaround reframed as OAuth-disabled fallback, "closes on v2.1.0" → shipped
- Version-pinning note updated: :latest tracks v3, pin :2 to stay on the v2 output format (also fixes the aeternalabs/ typo)
- Session cookie TTL corrected to 90 days (v2.5 change)
- /api param table: extractor, pdf=ocr, yt_* rows; X-Source lists now include the v3 sources (markitdown/youtube/image-caption/audio-transcript/pdf-ocr); /api/html + /api/file in auth boundary
- HOST_DOMAIN marked Traefik-only (quickstart promises no .env)
- Architecture: lib/llm, lib/youtube.js, lib/frontmatter.js entries
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pullmd The MCP read_url description and the Claude Code skill still described the v2 web-only pipeline — an agent holding a PDF or YouTube URL had no signal that the tool handles it. read_url now lists documents, YouTube transcripts, and media captioning/transcription (config-dependent), and the MCP server reports the real package version instead of 1.0.0. The skill bundle is renamed web-reader → pullmd: zip served at /pullmd.zip (old /web-reader.zip 301-redirects), entries under pullmd/, skill+plugin named pullmd. SKILL.md rewritten for v3: per-type routing (documents/YouTube/media), corrected comment_limit default (no cap), extractor/pdf=ocr/yt_* params, full X-Source list, /api/file example. README and /help updated to the new name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
….0.0 Installing pullmd.zip does not replace an existing web-reader install - Claude Code would load both side by side. Add the remove-first step (rm -rf ~/.claude/skills/web-reader) to README, MIGRATION.md, and /help (DE+EN). CHANGELOG: fold the [Unreleased] section (PDF-OCR tier, media moved into pullmd) into [3.0.0] - none of it was ever released separately - and add the missing Changed/Fixed entries (skill rename, v3-aware MCP descriptions, per-modality source labels, relative-URL resolution, sidecar sandboxing, media frontmatter cache persistence). Release date 2026-06-10. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Result area order was result-header → permalink-bar → markdown, which put the share-copy button right next to the markdown and made it easy to copy the share URL when you meant the markdown. Now: permalink bar first (right under the input area), then the result header with the markdown actions, then the output. Bump SW cache to v28 so installed PWAs pick it up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inline meta line (**r/sub** · u/user · N ↑ · age · date + url)
contradicted the v3 clean-body promise. It is now emitted only with
PULLMD_SOURCE_HEADER=true (same opt-out as the web source header);
subreddit, author, publish date, and upvotes land in the frontmatter
instead (new known fields: subreddit, upvotes).
extractPost gains an opt-in withMeta option returning { markdown,
meta } — the default string return is unchanged, so existing callers
and test doubles keep working. All serve paths (api, stream, MCP,
share-refresh) persist the meta in the cache metadata column, so
cached responses carry the same fields.
Docs: CHANGELOG/MIGRATION/README breaking-change sections extended to
Reddit; .env.example known-fields list updated; remaining stale
"removed in v3" AUTH_TOKEN claims softened.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
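The backward-compatible return shape described for extractPost can be sketched like this. `extractPostSketch` and the `post` object layout are hypothetical; only the `withMeta` option, the `{ markdown, meta }` shape, and the subreddit/author/upvotes fields come from the commit message.

```javascript
// Default call returns a plain string, as before; the richer object is
// returned only when the caller opts in with withMeta.
function extractPostSketch(post, options = {}) {
  const markdown = `# ${post.title}\n\n${post.body}`;
  if (!options.withMeta) return markdown;
  const meta = {
    subreddit: post.subreddit,
    author: post.author,
    upvotes: post.upvotes,
  };
  return { markdown, meta };
}

// Assumed example input; real Reddit payloads carry many more fields.
const post = { title: 'Hello', body: 'Body text', subreddit: 'selfhosted', author: 'someone', upvotes: 42 };
console.log(typeof extractPostSketch(post));                  // string
console.log(extractPostSketch(post, { withMeta: true }).meta); // metadata for frontmatter
```

Because the default branch is untouched, existing callers and test doubles that expect a string need no changes.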
The field lists still showed only the base web fields. /help now lists all fields grouped by source (Reddit, YouTube, media, PDF OCR, MCP extras) plus the allowlist pointer; the README LLM-usage section is retitled "Source-specific frontmatter fields" and gains the Reddit rows; the skill tip mentions the Reddit meta fields. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The OCR tier was reachable only via ?pdf=ocr on the API — the PWA had no way to use it, so a configured PULLMD_PDF_OCR_API_KEY never fired for dragged-in PDFs. New toggle in the controls row, shown only when /api/config reports pdfOcr, persisted like the other switches. On URL pulls it appends pdf=ocr; on file uploads only for *.pdf. SW cache bumped to v29. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
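A rough sketch of the toggle's two rules, under stated assumptions: `buildPullUrl` and `shouldRequestOcrForUpload` are hypothetical names, and the `url` query parameter on /api is assumed rather than confirmed by this PR text. Only `pdf=ocr` and the *.pdf restriction come from the commit message.

```javascript
// URL pulls: append pdf=ocr whenever the toggle is on.
function buildPullUrl(targetUrl, { pdfOcr = false } = {}) {
  const params = new URLSearchParams({ url: targetUrl }); // "url" param name is an assumption
  if (pdfOcr) params.set('pdf', 'ocr');
  return `/api?${params.toString()}`;
}

// File uploads: request OCR only when the toggle is on and the file is a PDF.
function shouldRequestOcrForUpload(filename, pdfOcr) {
  return Boolean(pdfOcr) && /\.pdf$/i.test(filename);
}

console.log(buildPullUrl('https://example.com/a.pdf', { pdfOcr: true }));
console.log(shouldRequestOcrForUpload('notes.docx', true)); // false
```

Restricting uploads to *.pdf avoids sending an OCR flag for document types the OCR adapter would not handle.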
Summary
This release turns PullMD from a URL-to-Markdown service into a general content-to-Markdown service: documents, images, audio, and YouTube transcripts join web pages as first-class inputs — and the output format moves to a clean, frontmatter-first style (the one breaking change that makes this v3.0.0).
Document tier (markitdown sidecar)
- markitdown sidecar converts PDF, DOCX, PPTX, XLSX, EPUB, CSV, JSON, XML, and ZIP
- extractWeb routes non-HTML content types through the sidecar automatically
- POST /api/file endpoint for direct document uploads (raw bytes, 25 MB cap)
Media tier (opt-in, multi-provider)
- …lib/llm/{providers,vision,stt}): the sidecar stays docs-only
- PULLMD_VISION_* / PULLMD_STT_* with PULLMD_LLM_* as shared fallback; off by default
- source labels (image-caption, audio-transcript) and LLM usage (model, tokens, audio seconds, image size) reported in frontmatter
PDF OCR (opt-in)
- ?pdf=ocr query param (or recipe fetch.pdf: ocr) routes PDFs through an OCR adapter (Mistral OCR API shape, pluggable via PULLMD_PDF_OCR_BASE_URL)
- PULLMD_PDF_OCR_API_KEY (deliberately no shared-key fallback), falls back to markitdown when unavailable
- source: pdf-ocr + pdf_pages in frontmatter; supported on /api, /api/stream, /api/file, and the MCP read_url tool (pdf_ocr param)
YouTube transcripts (opt-in, keyless)
- yt_timecodes / yt_chunk options on /api and the MCP read_url tool
Breaking: frontmatter-first output (v3.0.0)
- Source:/date header line moved into YAML frontmatter
- PULLMD_SOURCE_HEADER=true for consumers that relied on the body header
- PULLMD_FRONTMATTER_FIELDS allowlist to trim frontmatter to selected fields (safe fallback to all, startup warning on unknown names)
Robustness & fixes along the way
- …metadata JSON column); provider errors degrade to plain extraction instead of failing the request
Test plan
- …node --test), including new coverage for every tier
- …markitdown-sidecar/test_limits.py)
🤖 Generated with Claude Code