Ruby OCR Crawler

A modular, memory-safe Ruby web crawler that discovers images and videos from web pages, downloads media, extracts video frames via FFmpeg, and performs OCR on images/frames using Tesseract (RTesseract). The crawler is selector-driven via a config.yaml file so you can target specific HTML tags or attributes.

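With Docker

Build and start the container:

docker compose up --build

Then run the crawler with a start URL and a max depth:

docker compose run --rm ocr_crawler rake "run[https://another-site.com,3]"

Or leave the start URL empty (so the config file supplies it) and pass a max depth and a config path:

docker compose run --rm ocr_crawler rake "run[,2,/app/config.yaml]"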

Overview

This project performs a crawl starting from one or more start URLs, finds images and videos using configurable CSS selectors, records discovered items, downloads media, extracts frames using FFmpeg, and runs Tesseract OCR on images/frames. It is designed for long-running crawls: it uses threads, mutexes, and a MemoryManager to periodically trigger GC.
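The periodic GC idea can be sketched as follows. This is a minimal illustration, not the project's actual MemoryManager: the class name `SketchMemoryManager`, the method `page_processed`, and the `gc_runs` counter are all assumptions made for the example.

```ruby
# Hypothetical sketch; the project's real MemoryManager may look different.
class SketchMemoryManager
  attr_reader :gc_runs

  def initialize(gc_interval:)
    raise ArgumentError, "gc_interval must be positive" unless gc_interval.positive?

    @gc_interval = gc_interval
    @pages_seen = 0
    @gc_runs = 0
    @lock = Mutex.new
  end

  # Called by a worker after each page; safe to call from many threads.
  def page_processed
    @lock.synchronize do
      @pages_seen += 1
      next unless (@pages_seen % @gc_interval).zero?

      GC.start
      @gc_runs += 1
    end
  end
end

manager = SketchMemoryManager.new(gc_interval: 10)
workers = 4.times.map do
  Thread.new { 25.times { manager.page_processed } }
end
workers.each(&:join)
puts manager.gc_runs # 100 pages in total, one GC trigger per 10 pages
```

The mutex guards the shared page counter, so the GC trigger fires once per interval regardless of how many worker threads report pages.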

Requirements

Ruby 3.1 or newer
Bundler
Tesseract (system binary available in PATH) - required for OCR
FFmpeg (system binary available in PATH) - required for video frame extraction
Optional: a package manager for installing the system packages, such as Homebrew (macOS), apt (Debian/Ubuntu), dnf/yum (Fedora/CentOS), or winget/choco (Windows)

Install

General (applies to all platforms):

Clone the repository:

git clone <repo-url> && cd ruby_web_scraper

Install gems:
```
bundle install
```

Platform-specific system package install:

macOS (Homebrew)

# install Homebrew if needed: https://brew.sh/
brew install tesseract ffmpeg

Debian / Ubuntu

sudo apt update
sudo apt install -y ruby-full build-essential tesseract-ocr ffmpeg

Fedora / RHEL / CentOS

sudo dnf install -y ruby rubygems tesseract ffmpeg

Windows (PowerShell) - using winget (Windows 10/11)

winget install --id=Gyan.FFmpeg
# For Tesseract use an available package or installer; ensure tesseract.exe is on PATH.

Verify installation:

tesseract --version
ffmpeg -version

Setup (project)

Ensure dependencies are installed (see above).
Create or edit config.yaml at project root. A sample example.config.yaml is included in the repository and can be adapted.

Example fields:

start_urls: array of seed URLs to crawl.
threads: number of worker threads.
output_dir: where downloaded media, frames and results are stored.
frame_rate: fps used by FFmpeg when extracting frames from video.
gc_interval: number of pages processed between GC triggers (MemoryManager).
max_depth: maximum crawl depth from the start URL.
keep_files: true or false; whether the analysed files are kept on disk or deleted after processing.
selectors.images: CSS selectors used to find images (nodes should have src, data-src, content, or href).
selectors.videos: CSS selectors used to find video resources (nodes should have src, data-src, poster, etc).
user_agent: optional HTTP User-Agent header used by downloads.
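As a rough illustration of how these fields fit together, the sketch below parses a sample config with Ruby's standard YAML library. All values, the selector strings, the `load_config` helper, and the choice of which keys count as required are assumptions for the example; consult example.config.yaml for the real layout.

```ruby
require "yaml"

# Sample values are illustrative assumptions, not the project's defaults.
SAMPLE_CONFIG = <<~YAML
  start_urls:
    - https://example.com
  threads: 4
  output_dir: output
  frame_rate: 1
  gc_interval: 50
  max_depth: 2
  keep_files: true
  user_agent: "OcrCrawlerSketch/0.1"
  selectors:
    images:
      - "img"
      - "meta[property='og:image']"
    videos:
      - "video"
      - "video source"
YAML

# Assumed set of mandatory keys; the real loader may check differently.
REQUIRED_KEYS = %w[start_urls threads output_dir selectors].freeze

# Hypothetical helper: parse the YAML text and fail fast on missing keys.
def load_config(yaml_text)
  config = YAML.safe_load(yaml_text)
  missing = REQUIRED_KEYS.reject { |key| config.key?(key) }
  raise ArgumentError, "missing config keys: #{missing.join(', ')}" unless missing.empty?

  config
end

config = load_config(SAMPLE_CONFIG)
puts config["selectors"]["images"].inspect
```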

Running the crawler

There are two primary ways to run it:

Rake task:
```
# Run crawler; provide optional start URL and max depth:
rake "run[https://example.com,2]"
```
The run task accepts optional arguments: the first is a start URL (overrides the config), the second is max_depth. The Docker examples above also pass a config file path as a third argument.

Direct script:

ruby bin/run.rb path/to/config.yaml

Or provide a URL and optional max depth directly:

ruby bin/run.rb https://example.com 2

The script loads config.yaml (or merges in a CLI-provided URL), initializes the environment, and runs the crawler to discover media. It then downloads the media, extracts frames from videos, runs OCR on images and frames, and writes the processed results.
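One way a script could tell the two invocation forms apart is to check whether the first argument parses as an HTTP(S) URL. The helper below is a hypothetical sketch, not the code in bin/run.rb; the name `parse_cli_args`, the returned hash keys, and the fallback to `config.yaml` when no argument is given are assumptions.

```ruby
require "uri"

# Hypothetical helper: classify bin/run.rb-style arguments as either
# a config path or a start URL with an optional max depth.
def parse_cli_args(argv)
  first, second = argv
  return { config_path: "config.yaml" } if first.nil?

  uri = begin
    URI.parse(first)
  rescue URI::InvalidURIError
    nil
  end

  if uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
    options = { start_url: first }
    options[:max_depth] = Integer(second) if second # raises on non-numeric depth
    options
  else
    { config_path: first }
  end
end

puts parse_cli_args(["https://example.com", "2"]).inspect
puts parse_cli_args(["path/to/config.yaml"]).inspect
```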

Example end-to-end run (download + frames + OCR)

Ensure config.yaml is properly configured.

Run:

ruby bin/run.rb config.yaml

Or:

ruby bin/run.rb https://example.com 2

After the run completes:

Initial discovered resources are saved to output/results.json.
Downloaded media and OCR outputs are saved and a final processed results file is written to output/processed_results.json.

Output layout

By default output/ (or your configured output_dir) contains:

output/images/ - downloaded images
output/videos/ - downloaded video files
output/video_frames/ - extracted frames for videos (organized per-video)
output/results.json - JSON file of discovered resources (pre-processing)
output/processed_results.json - JSON file including download paths and OCR text
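To make the per-video frame layout concrete, here is a sketch that builds (but does not run) an FFmpeg command using the standard `-vf fps=N` filter. The helper name `frame_extraction_command`, the `frame_%04d.png` naming pattern, and the subdirectory-per-video scheme are assumptions; the project's ffmpeg_helper.rb may organize frames differently.

```ruby
# Hypothetical helper: returns the argument list for extracting frames
# from one video into its own subdirectory under frames_root.
def frame_extraction_command(video_path, frames_root:, frame_rate:)
  video_name = File.basename(video_path, ".*") # "clip.mp4" -> "clip"
  frame_dir = File.join(frames_root, video_name)
  pattern = File.join(frame_dir, "frame_%04d.png")
  ["ffmpeg", "-i", video_path, "-vf", "fps=#{frame_rate}", pattern]
end

command = frame_extraction_command(
  "output/videos/clip.mp4",
  frames_root: "output/video_frames",
  frame_rate: 1
)
puts command.join(" ")
# To execute it, create the frame directory first (FileUtils.mkdir_p)
# and pass the array to system(*command) so no shell quoting is needed.
```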

results.json entries follow the format produced by ResultRecorder.build:

{
  "source_page": "https://example.com",
  "type": "image",
  "url": "https://example.com/assets/img.jpg",
  "path": null,
  "text": null
}

After processing, processed_results.json will have path filled with local file paths and text with OCR output.
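A processed results file can then be post-filtered with a few lines of Ruby. The sketch below assumes the file is a JSON array of entries in the format shown above; the sample data and the `entries_with_text` helper are invented for illustration.

```ruby
require "json"

# Invented sample data in the documented entry format.
SAMPLE_RESULTS = <<~JSON
  [
    {"source_page": "https://example.com", "type": "image",
     "url": "https://example.com/assets/img.jpg",
     "path": "output/images/img.jpg", "text": "SALE 50% OFF"},
    {"source_page": "https://example.com", "type": "image",
     "url": "https://example.com/assets/logo.png",
     "path": "output/images/logo.png", "text": ""},
    {"source_page": "https://example.com", "type": "video",
     "url": "https://example.com/assets/clip.mp4",
     "path": null, "text": null}
  ]
JSON

# Hypothetical helper: keep only entries where OCR produced some text.
def entries_with_text(json_text)
  JSON.parse(json_text).reject { |entry| entry["text"].to_s.strip.empty? }
end

entries_with_text(SAMPLE_RESULTS).each do |entry|
  puts "#{entry['url']}: #{entry['text']}"
end
```

To run it against a real file, pass `File.read("output/processed_results.json")` instead of the sample string.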

Testing & Linting

Run tests (RSpec):

bundle exec rspec
# or via rake
rake spec

Lint with RuboCop:

bundle exec rubocop
# or via rake
rake lint

Troubleshooting

"tesseract: command not found" - install Tesseract and make sure its binary is on your PATH.
"ffmpeg: command not found" - install FFmpeg and make sure its binary is on your PATH.
Downloads failing - some sites reject unknown clients; try setting a different user_agent in config.yaml.
output/ is missing - verify write permissions and the output_dir value in config.yaml.
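The PATH checks can also be done from Ruby before starting a long crawl. This is a small standalone sketch; `find_executable` is a hypothetical helper written for this example and is not part of the project.

```ruby
# Hypothetical helper: return the full path of an executable found on PATH,
# or nil when it is not available.
def find_executable(name, path_env: ENV.fetch("PATH", ""))
  extensions = ENV.fetch("PATHEXT", "").split(";") # set on Windows, e.g. ".EXE"
  extensions = [""] if extensions.empty?

  path_env.split(File::PATH_SEPARATOR).each do |dir|
    extensions.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

%w[tesseract ffmpeg].each do |tool|
  location = find_executable(tool)
  puts location ? "#{tool}: #{location}" : "#{tool}: not found on PATH"
end
```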

Extending

Add selectors in config.yaml to target other tags or attributes.
Extend managers to perform additional processing, filtering, or remote storage.
Swap/override MemoryManager for a different GC strategy by dependency-injecting into Crawler.
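The dependency-injection idea can be sketched as below. Every name here (`NullMemoryManager`, `HeapLimitMemoryManager`, `SketchCrawler`, the `memory_manager:` keyword, and the `page_processed` hook) is an assumption for illustration; check lib/ocr_crawler/crawler.rb for the actual constructor and interface.

```ruby
# Hypothetical strategy: never forces a GC run.
class NullMemoryManager
  def page_processed; end
end

# Hypothetical strategy: forces GC once the live heap exceeds a limit.
class HeapLimitMemoryManager
  def initialize(max_live_slots:)
    @max_live_slots = max_live_slots
  end

  def page_processed
    GC.start if GC.stat(:heap_live_slots) > @max_live_slots
  end
end

# Stand-in for the real Crawler, reduced to the injection point.
class SketchCrawler
  attr_reader :pages_done

  def initialize(memory_manager: NullMemoryManager.new)
    @memory_manager = memory_manager
    @pages_done = 0
  end

  def process(urls)
    urls.each do |_url|
      @pages_done += 1 # real fetching and parsing would happen here
      @memory_manager.page_processed
    end
  end
end

crawler = SketchCrawler.new(
  memory_manager: HeapLimitMemoryManager.new(max_live_slots: 500_000)
)
crawler.process(%w[https://example.com/a https://example.com/b])
puts crawler.pages_done
```

Because the crawler only calls one method on the injected object, any class that responds to that method can replace the default strategy.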

Development notes

Core files:
- lib/ocr_crawler/config.rb - YAML config loader
- lib/ocr_crawler/crawler.rb - main orchestration
- lib/ocr_crawler/link_manager.rb - link extraction
- lib/ocr_crawler/image_manager.rb & video_manager.rb - selector-driven extraction
- lib/ocr_crawler/downloader.rb - HTTP resource downloader
- lib/ocr_crawler/ffmpeg_helper.rb - frame extraction helper (calls FFmpeg)
- lib/ocr_crawler/ocr_executor.rb - RTesseract wrapper
- lib/ocr_crawler/result_recorder.rb - save JSON results

All modules/classes include short docstring comments.

License

MIT - see LICENSE file.
