A modular, memory-safe Ruby web crawler that discovers images and videos from web pages, downloads media, extracts video frames via FFmpeg, and performs OCR on images/frames using Tesseract (RTesseract). The crawler is selector-driven via a config.yaml file so you can target specific HTML tags or attributes.
docker compose up --build
then
docker compose run --rm ocr_crawler rake "run[https://another-site.com,3]"
or
docker compose run --rm ocr_crawler rake "run[,2,/app/config.yaml]"
- Overview
- Requirements
- Install (macOS / Linux / Windows)
- Setup (project)
- Configuration (config.yaml)
- Running the crawler
- Example end-to-end run (download + frames + OCR)
- Output layout
- Testing & linting
- Troubleshooting
- Extending / Notes
- License
This project performs a crawl starting from one or more start URLs, finds images and videos using configurable CSS selectors, records discovered items, downloads media, extracts frames using FFmpeg, and runs Tesseract OCR on images/frames. It is designed for long-running crawls: it uses threads, mutexes, and a MemoryManager to periodically trigger GC.
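The periodic GC trigger described above can be sketched as follows. This is a minimal illustration and not the project's actual MemoryManager: the class name `SketchMemoryManager` and the methods `page_processed` and `gc_runs` are assumptions made for the example.

```ruby
# Hypothetical sketch of a thread-safe, interval-based GC trigger.
# The project's real MemoryManager may be structured differently.
class SketchMemoryManager
  attr_reader :gc_runs

  def initialize(gc_interval)
    @gc_interval = gc_interval
    @processed = 0
    @gc_runs = 0
    @mutex = Mutex.new
  end

  # Called by a worker thread after each page. The mutex keeps the
  # counter consistent when several workers report at the same time.
  def page_processed
    @mutex.synchronize do
      @processed += 1
      if (@processed % @gc_interval).zero?
        GC.start
        @gc_runs += 1
      end
    end
  end
end

manager = SketchMemoryManager.new(5)

# Four workers each report five processed pages (20 pages in total).
workers = Array.new(4) do
  Thread.new { 5.times { manager.page_processed } }
end
workers.each(&:join)

puts manager.gc_runs # 20 pages at an interval of 5 gives 4 GC runs
```

Counting inside the mutex means the GC trigger fires exactly once per interval regardless of how the worker threads interleave.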
- Ruby 3.1 or newer
- Bundler
- Tesseract (system binary available in PATH) - required for OCR
- FFmpeg (system binary available in PATH) - required for video frame extraction
- Optional: Homebrew (macOS), apt (Debian/Ubuntu), dnf/yum (Fedora/CentOS), winget/choco (Windows) to install system packages
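To check from Ruby that the two system binaries are reachable, a small PATH lookup along these lines may help. The helper name `find_executable` is made up for this sketch and is not part of the project.

```ruby
# Hypothetical helper: returns the full path of an executable found on
# PATH, or nil if none is found. PATHEXT is honoured so that names such
# as tesseract.exe resolve on Windows.
def find_executable(name)
  exts = ENV["PATHEXT"] ? ENV["PATHEXT"].split(";") : [""]
  dirs = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).reject(&:empty?)
  dirs.each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

%w[tesseract ffmpeg].each do |tool|
  path = find_executable(tool)
  puts(path ? "#{tool}: #{path}" : "#{tool}: not found on PATH")
end
```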
General (applies to all platforms):
- Clone the repository:
  git clone <repo-url> && cd ruby_web_scraper
- Install gems:
  bundle install
Platform-specific system package install:
- macOS (Homebrew)
  # install Homebrew if needed: https://brew.sh/
  brew install tesseract ffmpeg
- Debian / Ubuntu
  sudo apt update
  sudo apt install -y ruby-full build-essential tesseract-ocr ffmpeg
- Fedora / RHEL / CentOS
  sudo dnf install -y ruby rubygems tesseract ffmpeg
- Windows (PowerShell) - using winget (Windows 10/11)
  winget install --id=Gyan.FFmpeg
  # For Tesseract use an available package or installer; ensure tesseract.exe is on PATH.
Verify installation:
tesseract --version
ffmpeg -version
- Ensure dependencies are installed (see above).
- Create or edit config.yaml at the project root. A sample example.config.yaml is included in the repository and can be adapted.
Example fields:
- start_urls: array of seed URLs to crawl.
- threads: number of worker threads.
- output_dir: where downloaded media, frames and results are stored.
- frame_rate: frames per second used by FFmpeg when extracting frames from video.
- gc_interval: number of pages processed between GC triggers (MemoryManager).
- max_depth: maximum crawl depth from the start URL.
- keep_files: true or false; whether the analysed files are kept on disk or deleted.
- selectors.images: CSS selectors used to find images (nodes should have src, data-src, content, or href).
- selectors.videos: CSS selectors used to find video resources (nodes should have src, data-src, poster, etc).
- user_agent: optional HTTP User-Agent header used by downloads.
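As an illustration of how these fields might fit together, the sketch below parses a sample configuration with Ruby's standard YAML library. The key names come from the list above; every value, the nesting of images and videos under selectors, and the REQUIRED_KEYS check are assumptions made for the example, so defer to example.config.yaml in the repository for the real layout.

```ruby
require "yaml"

# Illustrative values only; the key names follow the field list above.
SAMPLE_CONFIG = <<~YAML
  start_urls:
    - https://example.com
  threads: 4
  output_dir: output
  frame_rate: 1
  gc_interval: 50
  max_depth: 2
  keep_files: true
  selectors:
    images:
      - img
      - 'meta[property="og:image"]'
    videos:
      - video
      - video source
  user_agent: "ocr-crawler-sketch/0.1"
YAML

config = YAML.safe_load(SAMPLE_CONFIG)

# Hypothetical sanity check; the project's own loader may validate differently.
REQUIRED_KEYS = %w[start_urls output_dir selectors].freeze
missing = REQUIRED_KEYS.reject { |key| config.key?(key) }
raise ArgumentError, "missing config keys: #{missing.join(', ')}" unless missing.empty?

puts "threads: #{config['threads']}"
puts "image selectors: #{config.dig('selectors', 'images').inspect}"
```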
There are two primary ways:
- Rake task:
  # Run crawler; provide optional start URL and max depth:
  rake "run[https://example.com,2]"
  The run task accepts optional arguments: the first is a start URL (overrides config) and the second is max_depth. The Docker example at the top also passes a config file path as a third argument.
- Direct script:
  ruby bin/run.rb path/to/config.yaml
  Or provide a URL and optional max depth directly:
  ruby bin/run.rb https://example.com 2
The script loads config.yaml (or merges CLI-provided URL), initializes the environment, runs the crawler to discover media, then downloads media, extracts frames for videos, runs OCR on images/frames, and writes processed results.
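One way the script could tell a config path from a start URL is sketched below. This is a guess at the behaviour described above, not the contents of bin/run.rb; the helper `parse_cli` and the shape of the hash it returns are hypothetical.

```ruby
require "uri"

# Hypothetical helper: treat the first argument as a start URL when it
# parses as http(s), otherwise as a path to a config file.
def parse_cli(argv, default_config: "config.yaml")
  first, second = argv
  return { config_path: default_config, overrides: {} } if first.nil?

  uri = begin
    URI.parse(first)
  rescue URI::InvalidURIError
    nil
  end

  if uri.is_a?(URI::HTTP) && uri.host
    # A URL on the command line overrides start_urls from the config file.
    overrides = { "start_urls" => [first] }
    overrides["max_depth"] = Integer(second) if second
    { config_path: default_config, overrides: overrides }
  else
    { config_path: first, overrides: {} }
  end
end

p parse_cli(["path/to/config.yaml"])
p parse_cli(["https://example.com", "2"])
```

The overrides hash could then be merged over the loaded YAML, so that values given on the command line take precedence.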
- Ensure config.yaml is properly configured.
- Run:
  ruby bin/run.rb config.yaml
  Or:
  ruby bin/run.rb https://example.com 2
After the run completes:
- Initial discovered resources are saved to output/results.json.
- Downloaded media and OCR outputs are saved, and a final processed results file is written to output/processed_results.json.
By default output/ (or your configured output_dir) contains:
- output/images/ - downloaded images
- output/videos/ - downloaded video files
- output/video_frames/ - extracted frames for videos (organized per-video)
- output/results.json - JSON file of discovered resources (pre-processing)
- output/processed_results.json - JSON file including download paths and OCR text
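The per-video frame layout could be produced by an FFmpeg invocation like the one built below, using FFmpeg's fps video filter. This is a sketch under assumptions: the helper `frame_extraction_command`, the `frame_%04d.png` naming pattern, and the sample paths are illustrative and may not match what ffmpeg_helper.rb does.

```ruby
require "fileutils"

# Hypothetical helper: builds the directory and argv for extracting
# frames from one video into its own subdirectory.
def frame_extraction_command(video_path, frames_root, frame_rate)
  name = File.basename(video_path, ".*")
  frames_dir = File.join(frames_root, name)
  pattern = File.join(frames_dir, "frame_%04d.png")
  [frames_dir, ["ffmpeg", "-i", video_path, "-vf", "fps=#{frame_rate}", pattern]]
end

frames_dir, command = frame_extraction_command(
  "output/videos/clip.mp4", "output/video_frames", 1
)
puts command.join(" ")

# To actually run it (requires FFmpeg on PATH):
#   FileUtils.mkdir_p(frames_dir)
#   system(*command)
```

Passing the command as an argument array to system avoids shell quoting problems with file names that contain spaces.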
results.json entries follow the format produced by ResultRecorder.build:
{
"source_page": "https://example.com",
"type": "image",
"url": "https://example.com/assets/img.jpg",
"path": null,
"text": null
}
After processing, processed_results.json will have path filled with local file paths and text with OCR output.
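The difference between the two files can be sketched with Ruby's JSON library. The function `build_record` below is a stand-in written for this example, not the real ResultRecorder.build, and the path and text values are invented.

```ruby
require "json"

# Hypothetical stand-in for ResultRecorder.build: a discovered resource
# starts with no local path and no OCR text.
def build_record(source_page, type, url)
  { "source_page" => source_page, "type" => type, "url" => url,
    "path" => nil, "text" => nil }
end

discovered = build_record(
  "https://example.com", "image", "https://example.com/assets/img.jpg"
)

# After download and OCR, the same entry carries a path and text
# (both values here are made up for illustration).
processed = discovered.merge(
  "path" => "output/images/img.jpg", "text" => "Hello"
)

puts JSON.pretty_generate([processed])

# Keep only entries where OCR produced some text.
with_text = [discovered, processed].reject { |r| r["text"].to_s.strip.empty? }
puts with_text.length
```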
- Run tests (RSpec):
  bundle exec rspec
  # or via rake
  rake spec
- Lint with RuboCop:
  bundle exec rubocop
  # or via rake
  rake lint
- "tesseract: command not found" - install Tesseract and ensure PATH updated.
- "ffmpeg: command not found" - install FFmpeg and ensure PATH updated.
- Downloads failing - adjust user_agent in config.yaml.
- If output/ is missing - verify write permissions and the output_dir setting in config.yaml.
- Add selectors in config.yaml to target other tags or attributes.
- Extend managers to perform additional processing, filtering, or remote storage.
- Swap/override MemoryManager for a different GC strategy by dependency-injecting into Crawler.
- Core files:
  - lib/ocr_crawler/config.rb - YAML config loader
  - lib/ocr_crawler/crawler.rb - main orchestration
  - lib/ocr_crawler/link_manager.rb - link extraction
  - lib/ocr_crawler/image_manager.rb & video_manager.rb - selector-driven extraction
  - lib/ocr_crawler/downloader.rb - HTTP resource downloader
  - lib/ocr_crawler/ffmpeg_helper.rb - frame extraction helper (calls FFmpeg)
  - lib/ocr_crawler/ocr_executor.rb - RTesseract wrapper
  - lib/ocr_crawler/result_recorder.rb - save JSON results
All modules/classes include short docstring comments.
MIT - see LICENSE file.