hekuli/mycast

mycast

Daily news transcript → podcast pipeline. Generates an MP3 from a text transcript using a Samantha voice clone (Qwen3-TTS via mlx-audio), updates an RSS 2.0 / Podcasting 2.0 feed, syncs to Cloudflare R2, and sends a Telegram status message.

Quick start

uv sync                    # one-time: install deps
uv run mycast new-podcast "My Daily News"   # one-time: create output/feed.xml
# drop a transcript at ./incoming/2026-04-29.txt
uv run mycast run          # tts -> feed -> sync -> notify

For one-time setup (Python env, mlx-audio, ffmpeg, rclone, direnv, optional auto-watcher), see SETUP.md.

Pipeline

incoming/<date>.txt  ──tts──▶  output/<date>.mp3
                              output/<date>.txt   (copied)
                              output/<date>.vtt   ──┐
                                                    ├──feed──▶  output/feed.xml
                                                    │
                              output/*            ──sync──▶  R2
                                                  ──notify──▶  Telegram

Each step is independently runnable and idempotent.

Commands

All commands run via uv run mycast <command>. A summary:

| Command | Purpose |
| --- | --- |
| run [--all] [--force] | Full pipeline: tts → feed → sync → notify (default: only the latest incoming file) |
| tts [files...] [--all] [--force] | Generate MP3 audio from incoming/*.txt (default: only the latest, skips already-processed) |
| feed | Refresh output/feed.xml with every output/*.mp3 |
| sync [--dry-run] | Push ./output/ to R2 via rclone copy |
| notify <message> | Send a one-off Telegram message |
| new-podcast <title> [-d ...] [-o ...] [--force] | Create a new feed.xml template |

Add -v / --verbose for debug logging. Logs always go to logs/mycast.log.

mycast run — full orchestration

uv run mycast run                # only the latest incoming/*.txt
uv run mycast run --all          # every incoming/*.txt (skips already-processed)
uv run mycast run --force        # reprocess the latest even if its mp3 exists
uv run mycast run --all --force  # reprocess everything
  1. tts: By default, picks the lexicographically last incoming/*.txt (the newest date, given the YYYY-MM-DD naming), generates output/<stem>.mp3 if it doesn't already exist, and copies the transcript to output/<stem>.txt. --all processes every file in incoming/; --force regenerates the audio even if the output mp3 already exists.
  2. feed: Rebuilds RSS items in output/feed.xml for every output/*.mp3. Same-date entries are replaced, so re-running is safe.
  3. sync: rclone copy ./output r2:mycast (configurable via MYCAST_R2_REMOTE).
  4. notify: Sends one Telegram message summarizing each step's outcome (success or failure).
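
The file selection in step 1 can be sketched roughly as follows. This is a minimal illustration, not mycast's actual code: the function name `select_transcripts` and its parameters are hypothetical, and the only behavior taken from this README is "latest by name, skip if the mp3 exists, unless --all / --force".

```python
from pathlib import Path


def select_transcripts(incoming: Path, output: Path,
                       all_files: bool = False, force: bool = False) -> list[Path]:
    """Pick which transcripts to process (hypothetical helper)."""
    # Sorting by name is enough because YYYY-MM-DD sorts chronologically.
    candidates = sorted(incoming.glob("*.txt"))
    if not all_files:
        candidates = candidates[-1:]  # lexicographically last = newest date
    if force:
        return candidates
    # Skip transcripts whose MP3 already exists in the output directory.
    return [p for p in candidates if not (output / f"{p.stem}.mp3").exists()]
```

With `force=True` the existence check is bypassed, which mirrors what the --force flag is described as doing.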

A flock (.mycast.lock) prevents concurrent runs from oversubscribing the GPU. If a run is already in progress, the new invocation exits immediately.
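
For readers unfamiliar with the pattern, a non-blocking file lock can look like the sketch below. It uses the standard-library `fcntl.flock` call; the helper name `acquire_lock` is hypothetical, and how mycast itself takes the lock is not shown in this README.

```python
import fcntl
from pathlib import Path


def acquire_lock(lock_path: Path):
    """Return an open, locked file handle, or None if the lock is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The caller must keep the handle open for the whole run;
    # closing it (or exiting the process) releases the lock.
    return handle
```

A caller would check for `None` right after startup and exit before loading the model, so a second invocation never touches the GPU.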

Exit code is 0 on full success, 1 if any step failed (the Telegram message indicates which).

mycast tts — generate audio only

uv run mycast tts                                # only the latest incoming/*.txt
uv run mycast tts --all                          # every unprocessed file in incoming/
uv run mycast tts --force                        # re-run the latest even if mp3 exists
uv run mycast tts incoming/2026-04-29.txt        # process specific file(s)

Calls the mlx_audio Python API directly (mlx_audio.tts.utils.load_model) — no shell-out. The model is loaded once per process and reused across all input files in a single invocation.

The transcript is chunked into ~400-character pieces (sentence-aligned, with --- separator lines stripped) before being fed to the TTS model. Each chunk gets a fresh in-context-learning (ICL) voice-clone prefill from custom-voices/<voice>.{wav,txt}. Audio segments are concatenated and written as a single output/<stem>.mp3 via mlx_audio.audio_io.write (which uses ffmpeg internally).
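
The chunking step can be approximated as below. This is only a sketch under stated assumptions: the sentence splitter is a naive regex, a sentence longer than the limit is kept whole, and the name `chunk_transcript` is hypothetical. The real splitter may handle abbreviations, paragraphs, and oversized sentences differently.

```python
import re


def chunk_transcript(text: str, max_chars: int = 400) -> list[str]:
    """Split text into sentence-aligned chunks of at most max_chars characters."""
    # Drop separator lines that consist only of '---'.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip() != "---"]
    body = " ".join(ln for ln in lines if ln)
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body) if s]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunks sentence-aligned avoids cutting a sentence in half, which matters here because each chunk is synthesized independently.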

The transcript is also copied to output/<stem>.txt so the R2 sync includes it alongside the audio.

Tunables: MYCAST_MAX_TOKENS (per-chunk codec budget, default 4096), MYCAST_CHUNK_CHARS (max characters per chunk, default 400).

mycast feed — update the RSS feed

uv run mycast feed

Walks output/*.mp3, parses the date from the filename (YYYY-MM-DD), reads the matching output/<stem>.txt, and writes/replaces the RSS <item> for that date. Generates output/<stem>.vtt (WebVTT timestamps distributed proportionally by sentence length across the audio duration) and links it via <podcast:transcript>.
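
To illustrate the proportional timing, the sketch below gives each sentence a share of the audio duration equal to its share of the total character count. The function names are hypothetical and the real implementation may split or weight sentences differently; only the "proportional by sentence length" idea comes from this README.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vtt(sentences: list[str], duration: float) -> str:
    """Assign each sentence a time span proportional to its character count."""
    sentences = [s for s in sentences if s]  # empty strings carry no time
    total_chars = sum(len(s) for s in sentences)
    lines = ["WEBVTT", ""]
    elapsed_chars = 0
    for sentence in sentences:
        start = duration * elapsed_chars / total_chars
        elapsed_chars += len(sentence)
        end = duration * elapsed_chars / total_chars
        lines += [f"{format_timestamp(start)} --> {format_timestamp(end)}", sentence, ""]
    return "\n".join(lines)
```

For example, two sentences of 4 and 8 characters over a 12-second file get the spans 0 to 4 s and 4 to 12 s. The timings are estimates, not forced alignment, so they drift wherever the speaking rate is uneven.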

The episode description is whatever appears before the first --- line in the transcript.
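
In code, that rule amounts to something like the following. The helper name is hypothetical, and falling back to the whole transcript when no --- line exists is an assumption, since this README does not say what happens in that case.

```python
def episode_description(transcript: str) -> str:
    """Return the text that precedes the first '---' separator line."""
    before: list[str] = []
    for line in transcript.splitlines():
        if line.strip() == "---":
            break
        before.append(line)
    # Assumption: with no separator, the whole transcript is returned.
    return "\n".join(before).strip()
```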

mycast sync — push to R2

uv run mycast sync
uv run mycast sync --dry-run    # preview without uploading

Shells out to rclone copy ./output $MYCAST_R2_REMOTE. rclone copy skips files that are already identical on the destination, so only new or changed files transfer.
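
A rough equivalent of the shell-out, for scripting the same step yourself, is sketched below. The function names are hypothetical; forwarding --dry-run to rclone's own --dry-run flag is an assumption about how mycast implements its preview mode.

```python
import os
import subprocess


def rclone_command(dry_run: bool = False) -> list[str]:
    """Build the rclone argv; the remote falls back to the documented default."""
    remote = os.environ.get("MYCAST_R2_REMOTE", "r2:mycast")
    cmd = ["rclone", "copy", "./output", remote]
    if dry_run:
        cmd.append("--dry-run")  # rclone reports what it would copy, uploads nothing
    return cmd


def sync(dry_run: bool = False) -> int:
    """Run rclone and return its exit code (requires rclone on PATH)."""
    return subprocess.run(rclone_command(dry_run)).returncode
```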

mycast notify — Telegram status message

uv run mycast notify "feed updated manually"

Sends a plain-text message to the chat configured by TELEGRAM_CHAT_ID. Useful in scripts.
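
If you want to send the same kind of message from your own script without going through mycast, the Telegram Bot API's sendMessage method is enough. The sketch below uses only the standard library; the helper names are hypothetical, and which HTTP client mycast itself uses is not stated here.

```python
import os
import urllib.parse
import urllib.request


def build_send_message(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=data, method="POST")


def notify(text: str) -> None:
    """Send a plain-text message using the two environment variables above."""
    request = build_send_message(
        os.environ["TELEGRAM_BOT_TOKEN"], os.environ["TELEGRAM_CHAT_ID"], text
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```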

mycast new-podcast — create feed.xml

uv run mycast new-podcast "My Daily News" -d "Personal news roundup, read aloud."
uv run mycast new-podcast "Other" -o other-feed.xml --force

One-time setup. Default output is output/feed.xml. After creating, edit the file to fill in podcast details (link, image, category, etc.).

Configuration

All configuration is done via environment variables, loaded from .envrc by direnv. Copy .envrc.example to .envrc, fill in values, and run direnv allow.

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| TELEGRAM_BOT_TOKEN | yes (for notify) | | Bot token from BotFather |
| TELEGRAM_CHAT_ID | yes (for notify) | | Target chat ID |
| MYCAST_BASE_URL | no | https://mycast.hekuli.com/ | Public URL prefix used in feed.xml enclosure URLs |
| MYCAST_R2_REMOTE | no | r2:mycast | rclone remote:bucket for sync |
| MYCAST_VOICE | no | samantha | Voice clone reference (custom-voices/<name>.{wav,txt}) |
| MYCAST_TTS_BACKEND | no | qwen3 | TTS engine: qwen3 (Qwen3-TTS, ICL voice cloning) or chatterbox (Resemble Chatterbox, caches speaker conditionals once for cross-chunk consistency) |
| MYCAST_MODEL | no | depends on backend | mlx-audio model id (default: Qwen3-TTS-12Hz-1.7B-Base-bf16 for qwen3, chatterbox-fp16 for chatterbox) |
| MYCAST_MAX_TOKENS | no | 4096 | Per-chunk codec-token budget for TTS (12.5 Hz, so 4096 ≈ 5.5 min per chunk) |
| MYCAST_CHUNK_CHARS | no | 400 | Max characters per text chunk fed to the TTS model |
| MYCAST_SPEED | no | 0.9 (qwen3) / 1.0 (chatterbox) | Playback speed multiplier (ffmpeg atempo, preserves pitch). 1.0 = no change |
| MYCAST_EXAGGERATION | no | 0.5 | Chatterbox-only: emotion/prosody intensity (0=flat, 0.5=natural, 1=very expressive). Ignored by qwen3 |
| MYCAST_CFG_WEIGHT | no | 0.5 | Chatterbox-only: classifier-free guidance weight for voice cloning fidelity. Ignored by qwen3 |
| MYCAST_TEMPERATURE | no | 0.8 | Sampling temperature. Lower = consistent/flat, higher = expressive/drift |
| MYCAST_TOP_P | no | 0.9 | Nucleus sampling cutoff |
| MYCAST_NORMALIZE_RMS | no | 0.1 | Per-chunk loudness target (RMS) for equalizing volume across chunks |
| MYCAST_LANG_CODE | no | auto | TTS language hint. auto runs per-chunk detection (English vs German). Force a specific code with english, german, french, italian, portuguese, spanish, russian, chinese, japanese, korean |
| MYCAST_SEED | no | 42 | Random seed reset before each chunk's TTS call. Pins voice consistency across chunks; sweep different seeds to find one whose voice you like best |
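
The "4096 ≈ 5.5 min" figure and the backend-dependent speed default can be checked with a few lines. This is an illustrative sketch: the helper names are hypothetical, and only the 12.5 Hz rate and the default values come from the table above.

```python
import os

CODEC_HZ = 12.5  # codec tokens per second, as stated for MYCAST_MAX_TOKENS


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw else default


def default_speed(backend: str) -> float:
    """Documented MYCAST_SPEED default for each backend."""
    return 0.9 if backend == "qwen3" else 1.0


def max_chunk_minutes(max_tokens: int) -> float:
    """Upper bound on audio per chunk implied by the token budget."""
    return max_tokens / CODEC_HZ / 60
```

With the default budget, 4096 / 12.5 = 327.68 seconds, or about 5.46 minutes. A 400-character chunk needs far less than that, so the budget mainly acts as a safety cap on runaway generation.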

Voice cloning

Voice clones are stored as custom-voices/<name>.wav + custom-voices/<name>.txt pairs. The .txt file must be an exact transcript of what's spoken in the .wav (used by Qwen3-TTS; Chatterbox ignores the transcript). The default voice samantha is shipped in this repo.

To add a new voice: drop custom-voices/myvoice.wav (~10s of clean speech) and custom-voices/myvoice.txt (its exact transcript), then export MYCAST_VOICE=myvoice. Both backends read from the same custom-voices/ pair (Chatterbox uses only the WAV), so switching MYCAST_TTS_BACKEND between qwen3 and chatterbox doesn't require any other changes.

Comparing TTS backends

Switch backends by setting MYCAST_TTS_BACKEND:

MYCAST_TTS_BACKEND=qwen3      uv run mycast tts incoming/2026-04-29.txt    # default
MYCAST_TTS_BACKEND=chatterbox uv run mycast tts incoming/2026-04-29.txt

Quick comparison:

| Feature | Qwen3-TTS Base (qwen3) | Chatterbox (chatterbox) |
| --- | --- | --- |
| Speaker conditioning | rebuilt every chunk (drift-prone) | cached once at load (consistent) |
| Languages | english, german, french, italian, portuguese, spanish, russian, chinese, japanese, korean | en, de, fr, es, it, pt, ru, zh, ja, ko + ar/da/el/fi/he/hi/ms/nl/no/pl/sv/sw/tr |
| Reference transcript needed | yes (.txt exact transcript) | no (only .wav) |
| Per-chunk language tag | yes (lang_code arg) | yes (lang_code arg, 2-letter codes) |
| Built-in loudness normalization | no | no (Turbo variant has it; classic doesn't) |
| Natural-language instruct | no | no |

Both honor the same [GERMAN]...[/GERMAN] markup in input transcripts and the same env-var tunables (seed, temperature, top_p, max_tokens, chunk_chars, speed, normalize_rms).

German content

For longer stretches of German content (a quoted statement, a German headline, a paragraph), wrap them in [GERMAN] ... [/GERMAN] markers in the transcript. Both backends parse these and route the content to the model's German language code, fixing the American-accented German pronunciation that otherwise occurs with English-cloned voices. Single German names embedded inline don't need markers.

The minister released a statement yesterday.

[GERMAN]
Die Lage ist ernst, aber wir haben einen klaren Plan für die kommenden Wochen.
[/GERMAN]

Translated, it says the situation is serious but they have a clear plan.
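
One way to picture the marker handling is a splitter that tags each stretch of text with a language before it reaches the TTS backend. The sketch below is an assumption-laden illustration: the name `split_by_language` is hypothetical, and labeling everything outside the markers as "english" simplifies the auto-detection described under MYCAST_LANG_CODE.

```python
import re

GERMAN_BLOCK = re.compile(r"\[GERMAN\](.*?)\[/GERMAN\]", re.DOTALL)


def split_by_language(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (language, text) segments around [GERMAN] blocks."""
    segments: list[tuple[str, str]] = []
    pos = 0
    for match in GERMAN_BLOCK.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            segments.append(("english", before))  # assumption: unmarked text is English
        inner = match.group(1).strip()
        if inner:
            segments.append(("german", inner))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("english", tail))
    return segments
```

Applied to the example above, this yields three segments: the English lead-in, the German quote, and the English translation.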

Auto-watcher (optional)

A macOS LaunchAgent can watch ./incoming/ and run mycast run automatically whenever a new transcript appears. See SETUP.md → Step 5 for installation.

Notes

  • Re-running anything is always safe: TTS skips processed files; feed replaces entries; rclone diffs.
  • Output filenames must contain a YYYY-MM-DD date — the feed step parses it from the basename.
  • Audio output is MP3 directly from mlx-audio (which uses ffmpeg under the hood).
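
To check a filename against the date requirement before dropping it into the pipeline, something like this is enough. The helper name and the choice to raise ValueError are assumptions; the README does not say how the feed step itself reacts to a missing date.

```python
import re
from datetime import date
from pathlib import Path

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def episode_date(path: str) -> date:
    """Extract the YYYY-MM-DD date from a file's basename."""
    match = DATE_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"no YYYY-MM-DD date in filename: {path}")
    # fromisoformat also rejects impossible dates such as 2026-13-40.
    return date.fromisoformat(match.group(0))
```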

About

A utility to generate an audio podcast from text files.
