Auto-labelling pipeline for vision datasets. Takes raw frames from an ingress directory, deduplicates them, auto-labels with SAM3, curates the results, and exports COCO-format annotation JSON ready for training.
dedupe → label → curate
- dedupe: walks an ingress directory, runs dhash + ORB deduplication, and copies accepted frames into a content-addressable image store
- label: runs SAM3 over accepted images and writes bounding boxes + RLE masks to labels.jsonl
- curate: filters low-signal frames, splits into train/val, and exports coco/train.json + coco/val.json
Everything is resumable — pipeline.db tracks every image through all three stages so interrupted runs pick up where they left off.
# 1. Copy .env.example and fill in your paths
cp .env.example .env
# 2. Build the Docker image
./docker/build.sh
# 3. Deduplicate frames from ingress
./docker/run.sh python dedupe.py
# 4. Auto-label (needs GPU)
./docker/run.sh python label.py
# 5. Curate and export COCO JSON
./docker/run.sh python curate.py

All paths are set via environment variables. Copy .env.example to .env (gitignored) and edit:
DATASET_DIR=/path/to/my-dataset # where pipeline.db, images/, labels.jsonl etc. live
GRABBY_WORKDIR=/path/to/ingress # source frames (read-only in Docker)
MODEL_CACHE=/path/to/model-cache # HuggingFace model cache (read-only in Docker)
HF_TOKEN=hf_... # optional, for gated models

DATASET_DIR is picked up automatically by all scripts. You can also pass --dataset-dir to any script to override it for a single run.
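
The override order described above (command-line flag first, then the environment variable) could be sketched like this. It is a minimal illustration, not the scripts' actual code: the helper name `resolve_dataset_dir` is hypothetical, and only `DATASET_DIR` and `--dataset-dir` come from this README.

```python
import argparse
import os
from pathlib import Path


def resolve_dataset_dir(argv=None, environ=None):
    """Return the dataset directory: --dataset-dir wins, else DATASET_DIR.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--dataset-dir", default=None)
    # parse_known_args ignores flags that belong to the calling script
    args, _unknown = parser.parse_known_args(argv)

    chosen = args.dataset_dir or environ.get("DATASET_DIR")
    if not chosen:
        raise SystemExit("Set DATASET_DIR in .env or pass --dataset-dir")
    return Path(chosen)
```

Passing `argv` and `environ` explicitly keeps the function easy to test; called with no arguments it falls back to `sys.argv` and `os.environ`.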
{DATASET_DIR}/
  pipeline.db        ← SQLite pipeline state (WAL mode)
  images/            ← accepted frames, named {dhash}.jpg (content-addressable)
  labels.jsonl       ← SAM3 output: boxes (XYXY), category_ids, RLE masks, scores
  dataset.jsonl      ← curated subset of labels.jsonl
  coco/
    train.json       ← COCO format (XYWH boxes, RLE segmentation)
    val.json         ← COCO format
  coco-spec.json     ← category definitions (required by label.py and curate.py)
  sam-queries.yaml   ← per-category text query overrides for SAM3
coco-spec.json and sam-queries.yaml are dataset-specific config that live alongside the data.
coco-spec.json defines the category taxonomy. Supercategories group related classes — SAM3 queries are built per-supercategory. Category IDs can be any integers; leaving gaps between supercategories makes it easy to add new classes later.
{
"info": {
"year": 2025,
"version": "1.0",
"description": "My Dataset",
"contributor": "you"
},
"licenses": [{ "id": 1, "name": "", "url": "" }],
"categories": [
{
"id": 1,
"name": "person",
"supercategory": "person",
"keypoints": ["nose", "left_eye", "right_eye", "left_shoulder", "right_shoulder",
"left_elbow", "right_elbow", "left_wrist", "right_wrist",
"left_hip", "right_hip", "left_knee", "right_knee",
"left_ankle", "right_ankle"],
"skeleton": [[1,2],[1,3],[2,4],[3,5],[4,6],[5,7],[6,8],[7,9],[8,10],
[9,11],[10,12],[11,13],[12,14],[13,15],[14,16]]
},
{ "id": 100, "name": "car", "supercategory": "vehicle" },
{ "id": 101, "name": "truck", "supercategory": "vehicle" },
{ "id": 102, "name": "airplane", "supercategory": "vehicle" },
{ "id": 200, "name": "dog", "supercategory": "animal" },
{ "id": 201, "name": "cat", "supercategory": "animal" },
{ "id": 202, "name": "bird", "supercategory": "animal" }
],
"images": [],
"annotations": []
}

sam-queries.yaml lets you override the text prompt SAM3 uses for any category. By default the category name is used as the query. Only add entries where the default produces poor results:
vehicle:
  airplane: commercial jet aircraft
animal:
  bird: small bird perched or in flight

dedupe.py walks the ingress directory ({GRABBY_WORKDIR}/{source}/{category}/*.jpg), discovers all frames, and runs two-pass deduplication:
- dhash: perceptual hash; images within a Hamming distance threshold of each other are treated as near-duplicates, checked globally across the whole dataset
- ORB: per-video feature matching; catches similar frames from the same video that dhash misses (slight motion, lighting changes)
Accepted images are copied to {DATASET_DIR}/images/{dhash}.jpg.
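
The dhash pass above can be illustrated with a small pure-Python sketch. It assumes the frame has already been converted to grayscale and resized to a 9×8 grid (for example with Pillow); the 64-bit hash size and the threshold of 4 are illustrative assumptions, not the values dedupe.py necessarily uses, and the function names are hypothetical.

```python
def dhash_bits(gray_rows):
    """Difference hash of a grid of grayscale values.

    gray_rows is a list of rows, each one pixel wider than the hash width
    (8 rows of 9 values give a 64-bit hash). Each bit records whether a
    pixel is brighter than its right-hand neighbour.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(candidate, accepted_hashes, threshold=4):
    """True if candidate is within `threshold` bits of any accepted hash.

    threshold=4 is an illustrative value, not the pipeline's setting.
    """
    return any(hamming(candidate, h) <= threshold for h in accepted_hashes)
```

Because the hash only encodes brightness gradients, small changes in exposure or compression tend to flip few bits, which is why a Hamming threshold works as a near-duplicate test.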
# Run everything (paths from .env)
./docker/run.sh python dedupe.py
# Filter to one category
./docker/run.sh python dedupe.py --category gym
# dhash only, no ORB (faster)
./docker/run.sh python dedupe.py --no-orb
# Dry run — count without writing
./docker/run.sh python dedupe.py --dry-run

label.py reads accepted images from pipeline.db, runs SAM3 inference in batches, and appends results to labels.jsonl. Each image is marked done or failed in the DB, so runs are resumable.
Requires coco-spec.json and sam-queries.yaml in DATASET_DIR.
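
As a rough sketch of how the two config files combine, the function below groups categories by supercategory and applies any override, falling back to the category name. The function name `build_queries` is hypothetical, and the inputs are plain dicts standing in for the parsed coco-spec.json and sam-queries.yaml; how label.py actually builds its prompts may differ.

```python
def build_queries(spec, overrides):
    """Map supercategory -> {category name: text query}.

    spec is the parsed coco-spec.json; overrides is the parsed
    sam-queries.yaml (supercategory -> {category: query}).
    """
    queries = {}
    for cat in spec["categories"]:
        name = cat["name"]
        group = cat.get("supercategory", name)
        # A supercategory may be absent from the overrides or map to None
        override = (overrides.get(group) or {}).get(name)
        queries.setdefault(group, {})[name] = override or name
    return queries


spec = {
    "categories": [
        {"id": 100, "name": "car", "supercategory": "vehicle"},
        {"id": 102, "name": "airplane", "supercategory": "vehicle"},
        {"id": 202, "name": "bird", "supercategory": "animal"},
    ]
}
overrides = {"vehicle": {"airplane": "commercial jet aircraft"}}
queries = build_queries(spec, overrides)
```

Here `queries["vehicle"]["airplane"]` is the override text, while `car` and `bird` keep their category names as queries.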
# Label all pending images
./docker/run.sh python label.py
# Filter to one category, retry previously failed
./docker/run.sh python label.py --category animal --retry-failed
# Custom batch size and confidence threshold
./docker/run.sh python label.py --batch-size 32 --score-thresh 0.6

curate.py reads labels.jsonl, filters out low-quality records (empty detections, person-only frames), splits the rest into train/val, and writes coco/train.json + coco/val.json. It also updates curate_status in pipeline.db for each image.
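
The two default filters named above amount to a small predicate over each record. The sketch below assumes each labels.jsonl record carries a `category_ids` list (the field name is taken from the layout listing, but the exact record schema is an assumption) and that the caller supplies the set of person category IDs from coco-spec.json. `keep_record` is a hypothetical name.

```python
def keep_record(record, person_ids, keep_only_person=False):
    """Decide whether a labelled frame survives curation.

    Drops frames with no detections, and frames whose detections are all
    people unless keep_only_person is set.
    """
    ids = record.get("category_ids", [])
    if not ids:
        return False  # empty detection
    if not keep_only_person and all(i in person_ids for i in ids):
        return False  # person-only frame
    return True
```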
# Standard run (85/15 split)
./docker/run.sh python curate.py
# Custom split ratio and seed
./docker/run.sh python curate.py --train 80 --seed 1
# Keep only specific categories
./docker/run.sh python curate.py --categories animal vehicle
# Include person-only frames (excluded by default)
./docker/run.sh python curate.py --keep-only-person

stats.py prints a summary of pipeline state and annotation counts. Useful for checking imports and dataset balance before training.
# Full report (pipeline DB + annotation stats from labels.jsonl)
./docker/run.sh python stats.py
# Stats from a specific file
./docker/run.sh python stats.py /path/to/labels.jsonl
# Include filtered-out images in counts
./docker/run.sh python stats.py --keep-empty --keep-only-person

split.py splits any JSONL file into train/val (or train/val/test) by percentage. Useful for re-splitting without re-running the full curation step.
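
A seeded percentage split of this kind can be sketched in a few lines. This is an illustration under assumptions, not split.py's actual logic: the function name `split_records`, the rounding (floor), and the rule that the remainder becomes the last split are all choices made for the example.

```python
import random


def split_records(records, train_pct, val_pct=None, seed=0):
    """Shuffle deterministically, then split by percentage.

    With val_pct=None the remainder after train goes to val and test is
    empty; otherwise the remainder after train and val goes to test.
    """
    if not 0 <= train_pct <= 100:
        raise ValueError("train_pct must be between 0 and 100")
    if val_pct is not None and not 0 <= val_pct <= 100 - train_pct:
        raise ValueError("train_pct + val_pct must not exceed 100")

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed, same order

    n_train = len(shuffled) * train_pct // 100
    if val_pct is None:
        return shuffled[:n_train], shuffled[n_train:], []
    n_val = len(shuffled) * val_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the split reproducible regardless of other random calls in the process.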
python split.py dataset.jsonl train.jsonl val.jsonl --train 80
python split.py dataset.jsonl train.jsonl val.jsonl test.jsonl --train 75 --val 15

to_coco.py converts any labels JSONL file to COCO annotation format. Useful for one-off conversions outside the main pipeline.
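
One concrete part of that conversion is the box format: labels.jsonl stores boxes as XYXY, while COCO expects XYWH. A minimal sketch, with hypothetical function names and with `area` approximated by the box area (COCO tooling typically derives it from the mask instead):

```python
def xyxy_to_xywh(box):
    """Convert [x_min, y_min, x_max, y_max] to COCO [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def to_coco_annotation(ann_id, image_id, category_id, box_xyxy):
    """Build a COCO-style annotation dict for one detection.

    Segmentation is omitted here; area uses the box as an approximation.
    """
    x, y, w, h = xyxy_to_xywh(box_xyxy)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
    }
```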
python to_coco.py labels.jsonl output.json
python to_coco.py labels.jsonl output.json --spec /path/to/coco-spec.json

If you have data from an older separate-directory pipeline layout (01-selection/, 02-labels/, 03-curation/), import it into pipeline.db without re-running anything:
python import_existing.py --data-root /path/to/old-dataset
# Dry run first to verify counts
python import_existing.py --data-root /path/to/old-dataset --dry-run

Then run curate pointing at the existing labels file:
./docker/run.sh python curate.py --labels-jsonl /path/to/old-dataset/02-labels/labels.jsonl

pipeline.db is a SQLite database tracking every image through all three stages:
| Stage | Column | Values |
|---|---|---|
| select | select_status | pending / accepted / rejected |
| label | label_status | pending / done / failed |
| curate | curate_status | pending / included / excluded |
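
Status columns like these are what make a stage resumable: a run selects rows still pending, processes them, and records the outcome. The sketch below shows the idea with Python's `sqlite3`. The `images` table and the column names appear in this README, but the schema details (types, defaults, primary key) and both function names are assumptions made for the example.

```python
import sqlite3

# Assumed minimal schema; the real pipeline.db may have more columns.
SCHEMA = """
CREATE TABLE images (
    source_path   TEXT PRIMARY KEY,
    select_status TEXT NOT NULL DEFAULT 'pending',
    label_status  TEXT NOT NULL DEFAULT 'pending',
    curate_status TEXT NOT NULL DEFAULT 'pending',
    label_error   TEXT
)
"""


def pending_for_label(conn):
    """Source paths accepted by the select stage but not yet labelled."""
    rows = conn.execute(
        "SELECT source_path FROM images "
        "WHERE select_status = 'accepted' AND label_status = 'pending' "
        "ORDER BY source_path"
    )
    return [row[0] for row in rows]


def mark_labelled(conn, source_path, error=None):
    """Record the outcome for one image: 'done', or 'failed' with a reason."""
    status = "failed" if error else "done"
    conn.execute(
        "UPDATE images SET label_status = ?, label_error = ? "
        "WHERE source_path = ?",
        (status, error, source_path),
    )
    conn.commit()
```

Committing after each image (or each batch) means an interrupted run loses at most the work in flight.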
Useful queries:
# What failed labelling and why?
sqlite3 $DATASET_DIR/pipeline.db \
"SELECT source_path, label_error FROM images WHERE label_status='failed' LIMIT 20"
# Which videos contributed the most accepted frames?
sqlite3 $DATASET_DIR/pipeline.db \
"SELECT video_id, COUNT(*) n FROM images WHERE select_status='accepted'
GROUP BY video_id ORDER BY n DESC LIMIT 10"
# Accepted but not yet labelled (work remaining before curate)
sqlite3 $DATASET_DIR/pipeline.db \
"SELECT COUNT(*) FROM images WHERE select_status='accepted' AND label_status != 'done'"

Frames are sourced from grabby, which organises output by source and category:
{GRABBY_WORKDIR}/
  {source}/          e.g. youtube/
    {category}/      e.g. gym/
      *.jpg          frames named {video_id}_{timestamp_ms}.jpg
Set GRABBY_WORKDIR in .env to point at grabby's output directory.
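
The layout above encodes four pieces of metadata in each frame's path. A minimal parsing sketch follows; `FrameInfo` and `parse_frame_path` are hypothetical names, and splitting on the last underscore is an assumption made so that video IDs containing underscores still parse.

```python
from pathlib import Path
from typing import NamedTuple


class FrameInfo(NamedTuple):
    source: str
    category: str
    video_id: str
    timestamp_ms: int


def parse_frame_path(path, workdir):
    """Parse {workdir}/{source}/{category}/{video_id}_{timestamp_ms}.jpg."""
    rel = Path(path).relative_to(workdir)
    if len(rel.parts) != 3 or rel.suffix.lower() != ".jpg":
        raise ValueError(f"unexpected frame path: {path}")
    source, category, _filename = rel.parts
    # Split on the last underscore: the video ID itself may contain one
    video_id, _sep, timestamp = rel.stem.rpartition("_")
    if not video_id or not timestamp.isdigit():
        raise ValueError(f"unexpected frame name: {rel.name}")
    return FrameInfo(source, category, video_id, int(timestamp))
```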
docker/run.sh loads .env, expands paths, and mounts:
| Host path | Container path | Mode |
|---|---|---|
| DATASET_DIR | /dataset | read/write |
| GRABBY_WORKDIR | /grabby-workdir | read-only |
| MODEL_CACHE | /model-cache | read-only |
| project root | /workspace | read/write |
./docker/build.sh # build image tagged 'samantics'
./docker/run.sh python label.py # run a script
./docker/run.sh python stats.py # check pipeline state
./docker/run.sh bash # interactive shell

GPU is used automatically if the nvidia Docker runtime is present.