v3.0.0: Document, image, audio & YouTube conversion + frontmatter-first output #30

Merged
syswave-dev merged 95 commits into main from feat/markitdown-document-conversion
Jun 10, 2026

@syswave-dev

Summary

This release turns PullMD from a URL-to-Markdown service into a general content-to-Markdown service: documents, images, audio, and YouTube transcripts join web pages as first-class inputs — and the output format moves to a clean, frontmatter-first style (the one breaking change that makes this v3.0.0).

Document tier (markitdown sidecar)

  • New markitdown sidecar converts PDF, DOCX, PPTX, XLSX, EPUB, CSV, JSON, XML, and ZIP
  • extractWeb routes non-HTML content types through the sidecar automatically
  • New POST /api/file endpoint for direct document uploads (raw bytes, 25 MB cap)
  • PWA drag-and-drop & file picker extended from HTML-only to all supported document types
  • Conversions are sandboxed: subprocess isolation with timeout and memory caps (DoS hardening)
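
A direct upload to the new endpoint might be assembled as in the sketch below. The raw-bytes body and the 25 MB cap come from the list above; the helper name `buildFileUpload`, the `application/octet-stream` content type, and the example base URL are assumptions for illustration, and the server remains the authority on the actual limit.

```javascript
// Hypothetical client-side helper; not part of PullMD itself.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024; // 25 MB cap (exact byte count on the server is an assumption)

function buildFileUpload(baseUrl, bytes) {
  if (!(bytes instanceof Uint8Array)) {
    throw new TypeError('bytes must be a Uint8Array or Buffer');
  }
  if (bytes.byteLength > MAX_UPLOAD_BYTES) {
    // Fail fast locally rather than waiting for the server to reject the body.
    throw new RangeError(`file is ${bytes.byteLength} bytes; cap is ${MAX_UPLOAD_BYTES}`);
  }
  return {
    url: new URL('/api/file', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/octet-stream' }, // assumed content type
      body: bytes, // raw bytes, no multipart wrapper
    },
  };
}

// Usage (needs a running instance):
//   const { url, init } = buildFileUpload('http://localhost:3000', pdfBytes);
//   const res = await fetch(url, init);
```

Checking the size before sending avoids pushing a large body over the wire only to have it rejected.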

Media tier (opt-in, multi-provider)

  • Image captioning and audio transcription via OpenAI-compatible providers, running in Node (lib/llm/{providers,vision,stt}) — the sidecar stays docs-only
  • Configured via PULLMD_VISION_* / PULLMD_STT_* with PULLMD_LLM_* as shared fallback; off by default
  • Per-modality source labels (image-caption, audio-transcript) and LLM usage (model, tokens, audio seconds, image size) reported in frontmatter
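
The fallback order described above (modality-specific variables first, then the shared `PULLMD_LLM_*` ones) could be resolved roughly as follows. This is a sketch, not the code in `lib/llm`: the `_API_KEY`, `_BASE_URL`, and `_MODEL` suffixes on the vision/STT prefixes and the "no key means disabled" rule are assumptions.

```javascript
// Hypothetical resolver; variable suffixes are assumed, not taken from the PR.
function resolveProvider(env, modality) {
  const prefix = `PULLMD_${modality.toUpperCase()}_`; // e.g. PULLMD_VISION_
  const pick = (suffix) => env[prefix + suffix] || env[`PULLMD_LLM_${suffix}`] || null;
  const config = {
    apiKey: pick('API_KEY'),
    baseUrl: pick('BASE_URL'),
    model: pick('MODEL'),
  };
  // Off by default: in this sketch a tier without any key stays disabled.
  return config.apiKey ? config : null;
}

const env = {
  PULLMD_LLM_API_KEY: 'shared-key',
  PULLMD_STT_MODEL: 'my-stt-model',
};
console.log(resolveProvider(env, 'stt'));   // key from the shared fallback, model from the STT variable
console.log(resolveProvider({}, 'vision')); // null: nothing configured
```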

PDF OCR (opt-in)

  • ?pdf=ocr query param (or recipe fetch.pdf: ocr) routes PDFs through an OCR adapter (Mistral OCR API shape, pluggable via PULLMD_PDF_OCR_BASE_URL)
  • Requires its own PULLMD_PDF_OCR_API_KEY (deliberately no shared-key fallback); when OCR is unavailable, the request falls back to markitdown
  • source: pdf-ocr + pdf_pages in frontmatter; supported on /api, /api/stream, /api/file, and the MCP read_url tool (pdf_ocr param)
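
The routing rule for PDFs can be summarised in a few lines. The sketch below uses a hypothetical `choosePdfConverter` helper; only the `pdf=ocr` value, the dedicated key variable, and the two source labels come from the description.

```javascript
// Hypothetical routing helper; the function name and shape are illustrative.
function choosePdfConverter({ pdfParam, env }) {
  const ocrRequested = pdfParam === 'ocr';
  // Deliberately no fallback to a shared LLM key for the OCR tier.
  const ocrConfigured = Boolean(env.PULLMD_PDF_OCR_API_KEY);
  if (ocrRequested && ocrConfigured) return 'pdf-ocr';
  return 'markitdown'; // default path, and the fallback when OCR is unavailable
}

console.log(choosePdfConverter({ pdfParam: 'ocr', env: { PULLMD_PDF_OCR_API_KEY: 'key' } })); // pdf-ocr
console.log(choosePdfConverter({ pdfParam: 'ocr', env: { PULLMD_LLM_API_KEY: 'key' } }));     // markitdown
```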

YouTube transcripts (opt-in, keyless)

  • Dedicated sidecar endpoint fetches transcripts without an API key
  • Per-request yt_timecodes / yt_chunk options on /api and the MCP read_url tool
  • Channel/duration/views land in frontmatter, not the body

Breaking: frontmatter-first output (v3.0.0)

  • The markdown body is clean by default: the Source:/date header line moved into YAML frontmatter
  • Opt-out via PULLMD_SOURCE_HEADER=true for consumers that relied on the body header
  • New PULLMD_FRONTMATTER_FIELDS allowlist to trim frontmatter to selected fields (safe fallback to all, startup warning on unknown names)
  • See MIGRATION.md for the upgrade path
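
To make the allowlist behaviour concrete, here is a minimal sketch of how such a filter might work. It assumes a comma-separated value, treats unknown names as ignored with a warning, and falls back to all fields when nothing valid remains; the helper names and the example field list are hypothetical.

```javascript
// Hypothetical helpers; field names and fallback details are assumptions.
function parseAllowlist(raw, knownFields, warn = console.warn) {
  const requested = (raw || '').split(',').map((s) => s.trim()).filter(Boolean);
  const unknown = requested.filter((name) => !knownFields.includes(name));
  if (unknown.length > 0) {
    warn(`PULLMD_FRONTMATTER_FIELDS: ignoring unknown field(s): ${unknown.join(', ')}`);
  }
  const allowed = requested.filter((name) => knownFields.includes(name));
  // Safe fallback: an empty or fully invalid allowlist keeps every field.
  return allowed.length > 0 ? allowed : knownFields.slice();
}

function trimFrontmatter(fields, allowed) {
  return Object.fromEntries(Object.entries(fields).filter(([key]) => allowed.includes(key)));
}

const known = ['title', 'source', 'date', 'pdf_pages'];
const allowed = parseAllowlist('title, source, bogus', known, () => {});
console.log(trimFrontmatter({ title: 'Report', source: 'pdf-ocr', pdf_pages: 12 }, allowed));
// → { title: 'Report', source: 'pdf-ocr' }
```

Falling back to the full set on a bad value keeps a typo in the environment from silently stripping all metadata.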

Robustness & fixes along the way

  • Media frontmatter survives the cache (new metadata JSON column); provider errors degrade to plain extraction instead of failing the request
  • Relative image/link URLs are now resolved against the source page before extraction (fixes broken images on share pages)
  • YAML escaping for carriage returns, YouTube sidecar pre-abort guard, PWA upload-affordance copy polish
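
The "degrade instead of fail" behaviour for provider errors amounts to wrapping the provider call and falling back on any throw. The sketch below illustrates the pattern only; `captionImage` and `extractPlain` are stand-ins passed in by the caller, and the `'plain'` label is hypothetical.

```javascript
// Hypothetical wrapper; the injected functions stand in for the real calls.
async function extractWithMediaFallback(input, { captionImage, extractPlain }) {
  try {
    const caption = await captionImage(input);
    return { source: 'image-caption', markdown: caption };
  } catch (err) {
    // A failing provider degrades to plain extraction instead of failing the request.
    return { source: 'plain', markdown: await extractPlain(input), degraded: err.message };
  }
}

// Usage with a provider that always fails:
extractWithMediaFallback('photo.png', {
  captionImage: async () => { throw new Error('provider returned 503'); },
  extractPlain: async (name) => `![${name}](${name})`,
}).then((result) => console.log(result.source, result.degraded));
```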

Test plan

  • 711 Node tests pass (node --test), including new coverage for every tier
  • Python sidecar limit tests pass (markitdown-sidecar/test_limits.py)
  • All tiers verified live on a staging deployment: PDF/DOCX conversion, image caption, audio transcription, YouTube transcripts, PDF OCR, LLM-usage frontmatter, clean body + frontmatter allowlist

🤖 Generated with Claude Code

syswave-dev and others added 30 commits June 8, 2026 13:43
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ody size limit

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ractWeb

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…length trim

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds extractFile(buffer, options) to lib/web.js — a sibling to extractHtml()
that converts uploaded document bytes (PDF/Office/EPUB/…) via the markitdown
sidecar with no URL in the output header (filename shown instead).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ce/title fallback

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Accepts raw binary document bytes (PDF, Office, EPUB, etc.) and converts
via extractFile() (markitdown sidecar). Same privacy model as /api/html:
no cache.put, telemetry logs constant 'local-file' placeholder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fy 502 + 413 comments

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds markitdown boolean (true when MARKITDOWN_URL env var is set) to
GET /api/config so the PWA can conditionally advertise document upload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ccept

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… status; tidy .env sidecar URLs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… HTML fallback)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkitdown flag

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ription

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, clarify model fallback docs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document types (PDF, DOCX, PPTX, etc.) always route to markitdown.
Image and audio content-types only route when MARKITDOWN_MEDIA env var
is set, keeping media-to-markdown opt-in for self-hosters.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
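
The routing rule in this commit (documents always, media only on opt-in) might look like the sketch below. It reflects the state at this commit; per the PR summary, media handling later moved into Node. The helper name and the abbreviated MIME list are assumptions.

```javascript
// Hypothetical router; the MIME list is abbreviated and assumed.
const DOCUMENT_TYPES = new Set([
  'application/pdf',
  'application/epub+zip',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document',   // DOCX
  'application/vnd.openxmlformats-officedocument.presentationml.presentation', // PPTX
]);

function routesToSidecar(contentTypeHeader, env) {
  // Strip parameters such as "; charset=utf-8" and normalise case.
  const type = String(contentTypeHeader || '').split(';')[0].trim().toLowerCase();
  if (DOCUMENT_TYPES.has(type)) return true; // documents always route
  const isMedia = type.startsWith('image/') || type.startsWith('audio/');
  return isMedia && Boolean(env.MARKITDOWN_MEDIA); // media only when opted in
}

console.log(routesToSidecar('application/pdf', {}));                    // true
console.log(routesToSidecar('image/png', {}));                          // false
console.log(routesToSidecar('image/png', { MARKITDOWN_MEDIA: 'true' })); // true
```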
…ump to 2.8.0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p md shadow

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t opts

When MARKITDOWN_YOUTUBE is set, extractWeb() now detects YouTube/youtu.be
URLs and dispatches them to the youtube sidecar client (convertYoutubeViaSidecar).
The branch runs after decodeBody and the Cloudflare short-circuit, before
convertWithReadability; a null return from the sidecar falls through to the
normal HTML pipeline with no hard failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
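
The dispatch described in this commit can be pictured as a host check followed by a soft fall-through. In the sketch below the host list, `isYoutubeUrl`, and the injected `convertYoutube` / `convertHtml` callbacks are assumptions; only the `MARKITDOWN_YOUTUBE` gate and the null fall-through come from the commit message.

```javascript
// Hypothetical detector; the host list is an assumption.
const YOUTUBE_HOSTS = new Set(['youtube.com', 'www.youtube.com', 'm.youtube.com', 'youtu.be']);

function isYoutubeUrl(raw) {
  try {
    return YOUTUBE_HOSTS.has(new URL(raw).hostname);
  } catch {
    return false; // not a parseable absolute URL
  }
}

async function extractWithYoutubeBranch(url, env, { convertYoutube, convertHtml }) {
  if (env.MARKITDOWN_YOUTUBE && isYoutubeUrl(url)) {
    const transcript = await convertYoutube(url);
    if (transcript !== null) return transcript;
    // null from the sidecar: fall through to the normal HTML pipeline.
  }
  return convertHtml(url);
}

console.log(isYoutubeUrl('https://youtu.be/abc123'));             // true
console.log(isYoutubeUrl('https://example.com/?u=youtube.com')); // false
```

Matching on the parsed hostname rather than a substring keeps look-alike URLs on the regular pipeline.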
…ions

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
syswave-dev and others added 28 commits June 9, 2026 14:17
… field

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… url test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…k to markitdown

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…OCR fallback

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nfig flag; pdf_pages frontmatter

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odel frontmatter

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n note

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The OCR provider default endpoint (Mistral) differs from the chat LLM
(default OpenAI), so sharing the chat key is a footgun. Vision/STT keep
the shared fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "What's new in v3" section (intro note + highlights with deep links)
near the top of the README, and a bilingual "Neu in v3 / New in v3" card as
the first /help section. Covers clean body, document conversion, PDF-OCR,
image/audio, YouTube, and richer frontmatter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fully + YAML CR escape

Three correctness/security fixes from code review:

- web: wrap caption/transcribe calls in try/catch so a failing (not just
  unconfigured) vision/STT provider falls back to normal extraction instead
  of surfacing as HTTP 502. Covers extractWeb (URL) and extractFile (upload).

- cache+frontmatter: persist extraction metadata on the cache row (new
  metadata JSON column + migration) and re-inject media/LLM fields on every
  serve path via a single mergeMediaFrontmatter() helper. Fixes cached
  youtube/image/pdf-ocr entries silently dropping duration/views/image_size/
  pdf_pages/llm_* when served with frontmatter=true. Replaces the 3x
  duplicated merge block in server.js + the partial copy in mcp.js (MCP now
  emits the full media field set, not just duration/views).

- frontmatter: quoteYamlString now neutralizes carriage returns (\r) in
  addition to \n, closing a YAML line-injection gap via attacker-controlled
  titles/descriptions.

+12 tests (frontmatter CR + mergeMediaFrontmatter, cache metadata round-trip,
provider-throw fallback, cache-hit media frontmatter end-to-end). 702/702 pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
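
The carriage-return fix is easiest to see with a hostile title. The following is a re-implementation sketch under the assumption that frontmatter strings are emitted as YAML double-quoted scalars; the real `quoteYamlString` may differ in detail.

```javascript
// Hypothetical re-implementation; not the function from lib/frontmatter.js.
function quoteYamlString(value) {
  const escaped = String(value)
    .replace(/\\/g, '\\\\') // backslashes first, so later escapes are not doubled
    .replace(/"/g, '\\"')
    .replace(/\r/g, '\\r')  // carriage return: the gap closed by this fix
    .replace(/\n/g, '\\n');
  return `"${escaped}"`;
}

const hostile = 'Nice title\r\nsource: attacker';
console.log(`title: ${quoteYamlString(hostile)}`);
// → title: "Nice title\r\nsource: attacker"   (a single line; the escapes are literal text)
```

Without the `\r` replacement, a parser that treats a bare carriage return as a line break could read the second half as a separate key.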
…dpoint derivation

- convertYoutubeViaSidecar now checks opts.signal.aborted before attaching the
  listener (matching convertViaMarkitdown), so an already-cancelled request no
  longer fires a full 30s fetch to the sidecar.
- Derive the /youtube endpoint robustly: swap a trailing /convert, else append
  /youtube to the base. Previously a MARKITDOWN_URL without a /convert suffix
  silently POSTed to the wrong path and dropped transcripts.

+3 tests. 705/705 pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
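
Both fixes in this commit are small enough to sketch. The functions below are hypothetical re-implementations of the stated rules (swap a trailing `/convert`, else append `/youtube`; check `signal.aborted` before doing any work), and the example host and port are assumptions.

```javascript
// Hypothetical re-implementation of the derivation rule described above.
function deriveYoutubeEndpoint(markitdownUrl) {
  const url = new URL(markitdownUrl);
  const path = url.pathname.replace(/\/+$/, ''); // drop trailing slashes
  url.pathname = path.endsWith('/convert')
    ? path.slice(0, -'/convert'.length) + '/youtube'
    : path + '/youtube';
  return url.toString();
}

function shouldSkipFetch(signal) {
  // Pre-abort guard: an already-cancelled request never reaches the sidecar.
  return Boolean(signal && signal.aborted);
}

console.log(deriveYoutubeEndpoint('http://markitdown:8000/convert')); // http://markitdown:8000/youtube
console.log(deriveYoutubeEndpoint('http://markitdown:8000'));         // http://markitdown:8000/youtube
```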
…port

Two remaining code-review findings:

- DoS hardening: each markitdown conversion now runs in a disposable child
  process (lib limits.run_guarded) with a wall-clock timeout (+ optional
  RLIMIT_AS cap). A decompression bomb or pathological document can no longer
  pin CPU or OOM the long-lived uvicorn process — the child is killed and the
  server stays up. Uses the 'spawn' start method to avoid fork-in-thread
  deadlocks with markitdown's lazy imports. New env knobs
  (MARKITDOWN_CONVERT_TIMEOUT, MARKITDOWN_MEM_LIMIT_MB) + a container mem_limit
  on the sidecar service as the recommended hard memory bound.

- MCP read_url gains a pdf_ocr boolean that forwards pdfOcr to extractWeb and
  bypasses the cache, mirroring ?pdf=ocr on the HTTP API.

Tests: +1 Node (MCP pdf_ocr forwarding + cache bypass), 706/706 pass; new
standalone Python harness test (markitdown-sidecar/test_limits.py) for the
timeout/memory/exception guard, 4/4 pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
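
The sidecar implements this guard in Python; as a rough Node analogue (a swapped-in technique, not the sidecar's code), the same idea of a disposable child with a wall-clock timeout can be shown with `child_process.spawnSync`. The `runGuarded` name and the inline script jobs are hypothetical.

```javascript
const { spawnSync } = require('child_process');

// Node analogue of the guard described above (the sidecar itself is Python).
function runGuarded(script, { timeoutMs = 5000 } = {}) {
  const result = spawnSync(process.execPath, ['-e', script], {
    timeout: timeoutMs,     // wall-clock limit; the child is killed when it expires
    encoding: 'utf8',
    maxBuffer: 1024 * 1024, // also cap how much output the parent will buffer
  });
  if (result.error && result.error.code === 'ETIMEDOUT') {
    return { ok: false, reason: 'timeout' };
  }
  if (result.status !== 0) {
    return { ok: false, reason: 'failed' };
  }
  return { ok: true, output: result.stdout.trim() };
}

console.log(runGuarded('console.log(6 * 7)'));                  // finishes normally
console.log(runGuarded('while (true) {}', { timeoutMs: 300 })); // killed once the timeout expires
```

Because the work runs in a separate process, a runaway job costs one child rather than the long-lived server.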
…e support

"URL in, Markdown out" no longer fits now that uploads and non-web sources are
supported. New hero (DE default + de/en i18n dicts):
  Anything in, Markdown out. / Alles rein, Markdown raus.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
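"Webseiten als Markdown extrahieren" ("Extract web pages as Markdown") → "Alles als Markdown extrahieren — Webseiten, Dokumente, Bilder, Audio, YouTube" ("Extract anything as Markdown: web pages, documents, images, audio, YouTube"), consistent with the new hero.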

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion

Relative img src/srcset, links, and media URLs survived extraction
verbatim (linkedom gives Readability no baseURI; Trafilatura leaves
image srcs relative even with url=), so rendered shares resolved them
against the PullMD origin and showed broken images.

Absolutify all URL-bearing attributes once, right after parsing, and
re-serialize so both extractors plus the fallback/comments paths see
absolute URLs. data:, mailto:, javascript:, tel:, and #fragment values
stay untouched, as does HTML uploaded without a source URL.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
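
The per-attribute rule described here maps directly onto the standard `URL` constructor. The sketch below handles a single attribute value; `absolutify` is a hypothetical name, and `srcset` (which holds several candidates) would need its own splitting step that is left out.

```javascript
// Hypothetical helper mirroring the rule described above.
const UNTOUCHED = /^(data:|mailto:|javascript:|tel:|#)/i;

function absolutify(value, baseUrl) {
  if (!baseUrl || typeof value !== 'string') return value; // e.g. HTML uploaded without a source URL
  const trimmed = value.trim();
  if (trimmed === '' || UNTOUCHED.test(trimmed)) return value;
  try {
    return new URL(trimmed, baseUrl).toString();
  } catch {
    return value; // leave unparseable values alone
  }
}

console.log(absolutify('/img/a.png', 'https://example.com/posts/1')); // https://example.com/img/a.png
console.log(absolutify('b.png', 'https://example.com/posts/1'));      // https://example.com/posts/b.png
console.log(absolutify('#top', 'https://example.com/posts/1'));       // #top
```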
"… oder Datei ziehen oder öffnen" ("… or drag or open a file") stacked two "oder"s ("or") and used "öffnen" ("open")
for what is a file-picker click. Now "Alternativ: Datei hierher ziehen
oder auswählen" (EN: "Alternatively: drag a file here or browse"), with
matching no-drag variants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…parity

The pullmd service never forwarded PULLMD_PDF_OCR_*, PULLMD_SOURCE_HEADER
(the v3 breaking-change opt-out), PULLMD_FRONTMATTER_FIELDS,
OAUTH_JWT_SECRET, PULLMD_USER_AGENT/UA_FEED_URL, or PULLMD_SITE_RECIPES,
so .env values silently had no effect in Docker deployments.

docker-compose.traefik.yml was still on the v2.6 layout — add the
markitdown sidecar service and the same env pass-throughs. Document
PULLMD_SITE_RECIPES in .env.example (the one var missing there).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
List the OpenAI/Mistral/Groq/Gemini/OpenRouter/Ollama base URLs with
their vision/STT capability next to the media-tier vars, plus the
implicit defaults (OpenAI for vision/STT, Mistral for PDF OCR) so
nobody has to dig through provider docs for the _BASE_URL value.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…examples

https://api.anthropic.com/v1 serves OpenAI-style chat completions incl.
image_url content, so the vision tier works with a Claude model; no
/audio/transcriptions endpoint, hence vision only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… default

- Sidecar URLs: per-sidecar unset behavior (Readability-only, no JS
  rendering, documents 502 + YouTube tier disabled since /youtube is
  derived from MARKITDOWN_URL) instead of one misplaced comment
- PULLMD_SOURCE_HEADER: show =true, frame as the v2 compat switch, link
  MIGRATION.md
- PULLMD_FRONTMATTER_FIELDS: list the known field names, document the
  ignore-with-warning / safe-fallback behavior and the per-request
  ?frontmatter=true opt-in
- MARKITDOWN_YT_LANGS: was uncommented de,en (silent German preference
  for anyone copying the file); now empty default with format example

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Embedded compose block was missing CACHE_DB=/data/cache.db — anyone
  copying it instead of curling the file got the SQLite DB outside the
  mounted volume (lost on rebuild). Also add the markitdown mem_limit
  and an explicit "abridged" note pointing at the repo file.
- comment_limit default corrected: no cap (code passes null), not 15
- PULLMD_AUTH_TOKEN: "removed in v3.0" was false — still supported;
  now "slated for removal in a future major release"
- OAuth sections updated to post-v2.3 reality: client-compat table
  shows OAuth ✅ for Claude Desktop / claude.ai, Caddy workaround
  reframed as OAuth-disabled fallback, "closes on v2.1.0" → shipped
- Version-pinning note updated: :latest tracks v3, pin :2 to stay on
  the v2 output format (also fixes the aeternalabs/ typo)
- Session cookie TTL corrected to 90 days (v2.5 change)
- /api param table: extractor, pdf=ocr, yt_* rows; X-Source lists now
  include the v3 sources (markitdown/youtube/image-caption/
  audio-transcript/pdf-ocr); /api/html + /api/file in auth boundary
- HOST_DOMAIN marked Traefik-only (quickstart promises no .env)
- Architecture: lib/llm, lib/youtube.js, lib/frontmatter.js entries

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pullmd

The MCP read_url description and the Claude Code skill still described
the v2 web-only pipeline — an agent holding a PDF or YouTube URL had no
signal that the tool handles it. read_url now lists documents, YouTube
transcripts, and media captioning/transcription (config-dependent), and
the MCP server reports the real package version instead of 1.0.0.

The skill bundle is renamed web-reader → pullmd: zip served at
/pullmd.zip (old /web-reader.zip 301-redirects), entries under pullmd/,
skill+plugin named pullmd. SKILL.md rewritten for v3: per-type routing
(documents/YouTube/media), corrected comment_limit default (no cap),
extractor/pdf=ocr/yt_* params, full X-Source list, /api/file example.
README and /help updated to the new name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
….0.0

Installing pullmd.zip does not replace an existing web-reader install -
Claude Code would load both side by side. Add the remove-first step
(rm -rf ~/.claude/skills/web-reader) to README, MIGRATION.md, and /help
(DE+EN).

CHANGELOG: fold the [Unreleased] section (PDF-OCR tier, media moved
into pullmd) into [3.0.0] - none of it was ever released separately -
and add the missing Changed/Fixed entries (skill rename, v3-aware MCP
descriptions, per-modality source labels, relative-URL resolution,
sidecar sandboxing, media frontmatter cache persistence). Release date
2026-06-10.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Result area order was result-header → permalink-bar → markdown, which
put the share-copy button right next to the markdown and made it easy
to copy the share URL when you meant the markdown. Now: permalink bar
first (right under the input area), then the result header with the
markdown actions, then the output. Bump SW cache to v28 so installed
PWAs pick it up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inline meta line (**r/sub** · u/user · N ↑ · age · date + url)
contradicted the v3 clean-body promise. It is now emitted only with
PULLMD_SOURCE_HEADER=true (same opt-out as the web source header);
subreddit, author, publish date, and upvotes land in the frontmatter
instead (new known fields: subreddit, upvotes).

extractPost gains an opt-in withMeta option returning { markdown,
meta } — the default string return is unchanged, so existing callers
and test doubles keep working. All serve paths (api, stream, MCP,
share-refresh) persist the meta in the cache metadata column, so
cached responses carry the same fields.

Docs: CHANGELOG/MIGRATION/README breaking-change sections extended to
Reddit; .env.example known-fields list updated; remaining stale
"removed in v3" AUTH_TOKEN claims softened.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
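
The backward-compatible option described above follows a common pattern: keep the old return type by default and return a richer object only on request. The sketch below shows the shape with a hypothetical `renderPost` working on an in-memory object; the real `extractPost` and its rendering are not reproduced.

```javascript
// Hypothetical shape only; post fields and the Markdown layout are illustrative.
function renderPost(post, { withMeta = false } = {}) {
  const markdown = `# ${post.title}\n\n${post.body}`;
  if (!withMeta) return markdown; // default: plain string, as before
  const meta = { subreddit: post.subreddit, author: post.author, upvotes: post.upvotes };
  return { markdown, meta };
}

const post = { title: 'Hello', body: 'First post.', subreddit: 'example', author: 'someone', upvotes: 7 };
console.log(typeof renderPost(post));                   // string
console.log(renderPost(post, { withMeta: true }).meta); // fields destined for frontmatter
```

Callers that never pass the option see no change, which is why existing test doubles keep working.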
The field lists still showed only the base web fields. /help now lists
all fields grouped by source (Reddit, YouTube, media, PDF OCR, MCP
extras) plus the allowlist pointer; the README LLM-usage section is
retitled "Source-specific frontmatter fields" and gains the Reddit
rows; the skill tip mentions the Reddit meta fields.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The OCR tier was reachable only via ?pdf=ocr on the API — the PWA had
no way to use it, so a configured PULLMD_PDF_OCR_API_KEY never fired
for dragged-in PDFs. New toggle in the controls row, shown only when
/api/config reports pdfOcr, persisted like the other switches. On URL
pulls it appends pdf=ocr; on file uploads only for *.pdf. SW cache
bumped to v29.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
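
The toggle logic reduces to two small decisions, sketched below. The helper names, the `url` query parameter on `/api`, and the example hosts are assumptions; the `pdf=ocr` value, the `/api/config` capability gate, and the `*.pdf` restriction on uploads come from the commit message.

```javascript
// Hypothetical PWA helpers; names and signatures are illustrative.
function buildPullUrl(apiBase, targetUrl, { ocrToggle, pdfOcrAvailable }) {
  const url = new URL('/api', apiBase);
  url.searchParams.set('url', targetUrl); // parameter name is an assumption
  if (ocrToggle && pdfOcrAvailable) url.searchParams.set('pdf', 'ocr');
  return url.toString();
}

function uploadWantsOcr(filename, { ocrToggle, pdfOcrAvailable }) {
  // On uploads the flag only applies to PDFs.
  return Boolean(ocrToggle && pdfOcrAvailable && /\.pdf$/i.test(filename));
}

const state = { ocrToggle: true, pdfOcrAvailable: true };
console.log(buildPullUrl('https://pullmd.example', 'https://a.test/scan.pdf', state));
console.log(uploadWantsOcr('notes.docx', state)); // false
```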
@syswave-dev syswave-dev merged commit 00d7d91 into main Jun 10, 2026
4 checks passed
@syswave-dev syswave-dev deleted the feat/markitdown-document-conversion branch June 10, 2026 11:49