samantics

Auto-labelling pipeline for vision datasets. Takes raw frames from an ingress directory, deduplicates them, auto-labels with SAM3, curates the results, and exports COCO-format annotation JSON ready for training.

dedupe → label → curate

dedupe — walks an ingress directory, runs dhash + ORB deduplication, copies accepted frames into a content-addressable image store
label — runs SAM3 over accepted images, writes bounding boxes + RLE masks to labels.jsonl
curate — filters low-signal frames, splits into train/val, exports coco/train.json + coco/val.json

Everything is resumable — pipeline.db tracks every image through all three stages so interrupted runs pick up where they left off.

Quick start

# 1. Copy .env.example and fill in your paths
cp .env.example .env

# 2. Build the Docker image
./docker/build.sh

# 3. Deduplicate frames from ingress
./docker/run.sh python dedupe.py

# 4. Auto-label (needs GPU)
./docker/run.sh python label.py

# 5. Curate and export COCO JSON
./docker/run.sh python curate.py

Configuration

All paths are set via environment variables. Copy .env.example to .env (gitignored) and edit:

DATASET_DIR=/path/to/my-dataset     # where pipeline.db, images/, labels.jsonl etc. live
GRABBY_WORKDIR=/path/to/ingress     # source frames (read-only in Docker)
MODEL_CACHE=/path/to/model-cache    # HuggingFace model cache (read-only in Docker)
HF_TOKEN=hf_...                     # optional, for gated models

DATASET_DIR is picked up automatically by all scripts. You can also pass --dataset-dir to any script to override it for a single run.
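
The precedence described above (command-line flag over environment variable) could look roughly like this. It is a minimal sketch, not the scripts' actual argument handling; `resolve_dataset_dir` is a hypothetical helper name.

```python
import argparse
import os
from pathlib import Path


def resolve_dataset_dir(argv=None, env=None):
    """Return the dataset directory: --dataset-dir wins, else DATASET_DIR."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-dir", default=None)
    # parse_known_args ignores any other flags a real script might define
    args, _unknown = parser.parse_known_args(argv)
    value = args.dataset_dir or env.get("DATASET_DIR")
    if not value:
        raise SystemExit("Set DATASET_DIR in .env or pass --dataset-dir")
    return Path(value)
```

Passing `argv` and `env` explicitly keeps the helper easy to test; with both left as `None` it reads the real command line and environment.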

Dataset directory layout

{DATASET_DIR}/
  pipeline.db          ← SQLite pipeline state (WAL mode)
  images/              ← accepted frames, named {dhash}.jpg (content-addressable)
  labels.jsonl         ← SAM3 output: boxes (XYXY), category_ids, RLE masks, scores
  dataset.jsonl        ← curated subset of labels.jsonl
  coco/
    train.json         ← COCO format (XYWH boxes, RLE segmentation)
    val.json           ← COCO format
  coco-spec.json       ← category definitions (required by label.py and curate.py)
  sam-queries.yaml     ← per-category text query overrides for SAM3

coco-spec.json and sam-queries.yaml are dataset-specific config that live alongside the data.

coco-spec.json defines the category taxonomy. Supercategories group related classes — SAM3 queries are built per-supercategory. Category IDs can be any integers; leaving gaps between supercategories makes it easy to add new classes later.

{
  "info": {
    "year": 2025,
    "version": "1.0",
    "description": "My Dataset",
    "contributor": "you"
  },
  "licenses": [{ "id": 1, "name": "", "url": "" }],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "person",
      "keypoints": ["nose", "left_eye", "right_eye", "left_shoulder", "right_shoulder",
                    "left_elbow", "right_elbow", "left_wrist", "right_wrist",
                    "left_hip", "right_hip", "left_knee", "right_knee",
                    "left_ankle", "right_ankle"],
      "skeleton": [[1,2],[1,3],[2,4],[3,5],[4,6],[5,7],[6,8],[7,9],[8,10],
                   [9,11],[10,12],[11,13],[12,14],[13,15],[14,16]]
    },
    { "id": 100, "name": "car",       "supercategory": "vehicle" },
    { "id": 101, "name": "truck",     "supercategory": "vehicle" },
    { "id": 102, "name": "airplane",  "supercategory": "vehicle" },
    { "id": 200, "name": "dog",       "supercategory": "animal" },
    { "id": 201, "name": "cat",       "supercategory": "animal" },
    { "id": 202, "name": "bird",      "supercategory": "animal" }
  ],
  "images": [],
  "annotations": []
}
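
Since queries are built per supercategory, a loader would probably group the categories first. The sketch below shows one way to do that with a trimmed copy of the spec above; `categories_by_supercategory` is a hypothetical helper, not a function from this repository.

```python
import json
from collections import defaultdict


def categories_by_supercategory(spec):
    """Map each supercategory to its {category_id: name} entries."""
    groups = defaultdict(dict)
    for cat in spec["categories"]:
        groups[cat["supercategory"]][cat["id"]] = cat["name"]
    return dict(groups)


# Trimmed version of the coco-spec.json example
spec = json.loads("""
{"categories": [
  {"id": 1,   "name": "person", "supercategory": "person"},
  {"id": 100, "name": "car",    "supercategory": "vehicle"},
  {"id": 101, "name": "truck",  "supercategory": "vehicle"},
  {"id": 200, "name": "dog",    "supercategory": "animal"}
]}
""")
groups = categories_by_supercategory(spec)
print(groups["vehicle"])  # {100: 'car', 101: 'truck'}
```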

sam-queries.yaml lets you override the text prompt SAM3 uses for any category. By default the category name is used as the query. Only add entries where the default produces poor results:

vehicle:
  airplane: commercial jet aircraft
animal:
  bird: small bird perched or in flight
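
The fallback rule (use the override if present, otherwise the category name) can be sketched as follows. The dict stands in for the parsed YAML above; `resolve_query` is a hypothetical name used only for illustration.

```python
def resolve_query(overrides, supercategory, name):
    """Return the override for a category if present, else the category name."""
    return (overrides.get(supercategory) or {}).get(name, name)


# Parsed content of the sam-queries.yaml example
overrides = {
    "vehicle": {"airplane": "commercial jet aircraft"},
    "animal": {"bird": "small bird perched or in flight"},
}
print(resolve_query(overrides, "vehicle", "airplane"))  # commercial jet aircraft
print(resolve_query(overrides, "vehicle", "car"))       # car
```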

Scripts

`dedupe.py` — deduplicate and ingest frames

Walks the ingress directory ({GRABBY_WORKDIR}/{source}/{category}/*.jpg), discovers all frames, and runs two-pass deduplication:

dhash: a perceptual difference hash. Two images whose hashes differ by no more than a Hamming-distance threshold are treated as near-duplicates; this check runs globally across the whole dataset
ORB: per-video feature matching that catches similar frames from the same video that dhash misses (slight motion, lighting changes)

Accepted images are copied to {DATASET_DIR}/images/{dhash}.jpg.
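
To make the dhash pass concrete, here is a dependency-free sketch of a difference hash and the Hamming-distance check. It assumes the image has already been converted to grayscale and downscaled to a 9x8 grid; the threshold of 4 and the linear scan over seen hashes are illustrative choices, not values taken from dedupe.py.

```python
def dhash_bits(gray):
    """Difference hash of a grayscale grid given as a list of rows.

    Each bit records whether a pixel is brighter than its right neighbour.
    A grid 9 pixels wide and 8 tall yields the usual 64-bit hash.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(h, seen, threshold=4):
    """True if h is within `threshold` bits of any previously accepted hash."""
    return any(hamming(h, s) <= threshold for s in seen)


ramp = [[x * 10 for x in range(9)] for _ in range(8)]  # brightness rises left to right
noisy = [row[:] for row in ramp]
noisy[0][0] = 255                                      # one altered pixel flips one bit
print(hamming(dhash_bits(ramp), dhash_bits(noisy)))    # 1
print(is_near_duplicate(dhash_bits(noisy), {dhash_bits(ramp)}))  # True
```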

# Run everything (paths from .env)
./docker/run.sh python dedupe.py

# Filter to one category
./docker/run.sh python dedupe.py --category gym

# dhash only, no ORB (faster)
./docker/run.sh python dedupe.py --no-orb

# Dry run — count without writing
./docker/run.sh python dedupe.py --dry-run

`label.py` — SAM3 auto-labelling

Reads accepted images from pipeline.db, runs SAM3 inference in batches, and appends results to labels.jsonl. Each image is marked done or failed in the DB, so an interrupted run picks up with the images that are still pending.

Requires coco-spec.json and sam-queries.yaml in DATASET_DIR.
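
The resumable pattern can be sketched as below. This is an assumption-laden illustration, not label.py itself: it processes one image at a time rather than in batches, `infer` stands in for the SAM3 call, and the `id` column is assumed (the other column names appear in the queries under Pipeline DB).

```python
import json
import sqlite3


def label_pending(conn, labels_path, infer):
    """Label every accepted image still pending; record done/failed per image."""
    rows = conn.execute(
        "SELECT id, source_path FROM images "
        "WHERE select_status='accepted' AND label_status='pending'"
    ).fetchall()
    with open(labels_path, "a", encoding="utf-8") as out:
        for image_id, path in rows:
            try:
                record = infer(path)  # placeholder for model inference
                out.write(json.dumps(record) + "\n")
                out.flush()
                conn.execute(
                    "UPDATE images SET label_status='done' WHERE id=?",
                    (image_id,),
                )
            except Exception as exc:  # keep going; the error is stored for later
                conn.execute(
                    "UPDATE images SET label_status='failed', label_error=? "
                    "WHERE id=?",
                    (str(exc), image_id),
                )
            conn.commit()  # commit per image so a crash loses at most one
```

Because only `pending` rows are selected, calling the function again after an interruption skips everything already marked `done` or `failed`.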

# Label all pending images
./docker/run.sh python label.py

# Filter to one category, retry previously failed
./docker/run.sh python label.py --category animal --retry-failed

# Custom batch size and confidence threshold
./docker/run.sh python label.py --batch-size 32 --score-thresh 0.6

`curate.py` — filter, split, export COCO

Reads labels.jsonl, filters out low-quality records (empty detections, person-only frames), splits into train/val, and writes coco/train.json + coco/val.json. Updates curate_status in pipeline.db for each image.

# Standard run (85/15 split)
./docker/run.sh python curate.py

# Custom split ratio and seed
./docker/run.sh python curate.py --train 80 --seed 1

# Keep only specific categories
./docker/run.sh python curate.py --categories animal vehicle

# Include person-only frames (excluded by default)
./docker/run.sh python curate.py --keep-only-person
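
The two default filters (empty detections and person-only frames) amount to a small predicate over each record. The sketch below assumes each labels.jsonl record carries a `category_ids` list; the function and its parameters are hypothetical and only mirror the flags shown above.

```python
def keep_record(record, person_ids, keep_empty=False, keep_only_person=False):
    """Decide whether a labels.jsonl record survives curation."""
    ids = record.get("category_ids", [])
    if not ids:                                # no detections at all
        return keep_empty
    if all(i in person_ids for i in ids):      # every detection is a person
        return keep_only_person
    return True


person_ids = {1}
print(keep_record({"category_ids": []}, person_ids))        # False
print(keep_record({"category_ids": [1, 1]}, person_ids))    # False
print(keep_record({"category_ids": [1, 100]}, person_ids))  # True
```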

`stats.py` — pipeline and annotation statistics

Prints a full picture of pipeline state and annotation counts. Useful for checking imports and dataset balance before training.

# Full report (pipeline DB + annotation stats from labels.jsonl)
./docker/run.sh python stats.py

# Stats from a specific file
./docker/run.sh python stats.py /path/to/labels.jsonl

# Include filtered-out images in counts
./docker/run.sh python stats.py --keep-empty --keep-only-person
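
For a quick balance check outside the script, per-category annotation counts can be tallied directly from the JSONL. This sketch assumes each record has a `category_ids` list; `annotation_counts` is a hypothetical helper, not part of stats.py.

```python
import json
from collections import Counter


def annotation_counts(lines):
    """Count annotations per category id across JSONL lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts.update(json.loads(line).get("category_ids", []))
    return counts


sample = ['{"category_ids": [1, 100]}', "", '{"category_ids": [100]}']
print(annotation_counts(sample).most_common())  # [(100, 2), (1, 1)]
```

With a real file, pass the open file handle instead of the sample list.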

`split.py` — re-split an existing JSONL

Splits any JSONL file into train/val (or train/val/test) by percentage. Useful for re-splitting without re-running the full curation step.

python split.py dataset.jsonl train.jsonl val.jsonl --train 80
python split.py dataset.jsonl train.jsonl val.jsonl test.jsonl --train 75 --val 15
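
A seeded percentage split can be sketched in a few lines. This is an illustration of the idea, not the code in split.py: the default seed of 0 and the rule that the remainder goes to the last split are assumptions.

```python
import random


def split_records(records, train=85, val=None, seed=0):
    """Shuffle deterministically, then cut into train/val(/test) by percentage."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train // 100
    if val is None:                      # two-way split: the rest is val
        return shuffled[:n_train], shuffled[n_train:]
    n_val = len(shuffled) * val // 100   # three-way split: the rest is test
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


records = list(range(100))
train_set, val_set, test_set = split_records(records, train=75, val=15, seed=1)
print(len(train_set), len(val_set), len(test_set))  # 75 15 10
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.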

`to_coco.py` — convert JSONL to COCO JSON

Converts any labels JSONL file to COCO annotation format. Useful for one-off conversions outside the main pipeline.

python to_coco.py labels.jsonl output.json
python to_coco.py labels.jsonl output.json --spec /path/to/coco-spec.json
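
The core of the conversion is the box format change from XYXY (as stored in labels.jsonl) to COCO's XYWH. The sketch below assumes records with `boxes` and `category_ids` fields and leaves out the RLE segmentation; it uses the box area as a stand-in for the mask area. Function names are hypothetical.

```python
def xyxy_to_xywh(box):
    """Convert [x_min, y_min, x_max, y_max] to COCO's [x, y, width, height]."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1 - x0, y1 - y0]


def to_coco_annotations(record, image_id, start_id=1):
    """Build COCO annotation dicts from one JSONL record (assumed field names)."""
    annotations = []
    pairs = zip(record["boxes"], record["category_ids"])
    for offset, (box, cat_id) in enumerate(pairs):
        x, y, w, h = xyxy_to_xywh(box)
        annotations.append({
            "id": start_id + offset,
            "image_id": image_id,
            "category_id": cat_id,
            "bbox": [x, y, w, h],
            "area": w * h,   # box area; a real export would use the mask area
            "iscrowd": 0,
        })
    return annotations


record = {"boxes": [[10, 20, 50, 80]], "category_ids": [100]}
print(to_coco_annotations(record, image_id=7)[0]["bbox"])  # [10, 20, 40, 60]
```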

`import_existing.py` — migrate from a previous pipeline

If you have data from an older separate-directory pipeline layout (01-selection/, 02-labels/, 03-curation/), import it into pipeline.db without re-running anything:

python import_existing.py --data-root /path/to/old-dataset

# Dry run first to verify counts
python import_existing.py --data-root /path/to/old-dataset --dry-run

Then run curate pointing at the existing labels file:

./docker/run.sh python curate.py --labels-jsonl /path/to/old-dataset/02-labels/labels.jsonl

Pipeline DB

pipeline.db is a SQLite database that tracks every image through all three stages. The select stage is the one written by dedupe.py:

| Stage  | Column          | Values                              |
|--------|-----------------|-------------------------------------|
| select | `select_status` | `pending` / `accepted` / `rejected` |
| label  | `label_status`  | `pending` / `done` / `failed`       |
| curate | `curate_status` | `pending` / `included` / `excluded` |
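
The three status columns can also be summarised from Python with the standard sqlite3 module. A minimal sketch, assuming the `images` table used elsewhere in this section; `status_summary` is a hypothetical helper.

```python
import sqlite3


def status_summary(conn):
    """Return {(select, label, curate): count} for every status combination."""
    rows = conn.execute(
        "SELECT select_status, label_status, curate_status, COUNT(*) "
        "FROM images "
        "GROUP BY select_status, label_status, curate_status"
    ).fetchall()
    return {(sel, lab, cur): n for sel, lab, cur, n in rows}


# Usage against a real dataset (path is an example):
#   conn = sqlite3.connect("/path/to/my-dataset/pipeline.db")
#   for statuses, count in sorted(status_summary(conn).items()):
#       print(statuses, count)
```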

Useful queries:

# What failed labelling and why?
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT source_path, label_error FROM images WHERE label_status='failed' LIMIT 20"

# Which videos contributed the most accepted frames?
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT video_id, COUNT(*) n FROM images WHERE select_status='accepted'
   GROUP BY video_id ORDER BY n DESC LIMIT 10"

# Accepted but not yet labelled (work remaining before curate)
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT COUNT(*) FROM images WHERE select_status='accepted' AND label_status != 'done'"

Ingress layout

Frames are sourced from grabby, which organises output by source and category:

{GRABBY_WORKDIR}/
  {source}/              e.g. youtube/
    {category}/          e.g. gym/
      *.jpg              frames named {video_id}_{timestamp_ms}.jpg

Set GRABBY_WORKDIR in .env to point at grabby's output directory.
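
Given this layout, the source, category, video ID and timestamp can all be recovered from a frame's path. The sketch below splits on the last underscore, on the assumption that video IDs may themselves contain underscores; `parse_frame_path` is a hypothetical helper.

```python
from pathlib import Path


def parse_frame_path(path):
    """Split an ingress path into (source, category, video_id, timestamp_ms)."""
    p = Path(path)
    # rpartition splits on the LAST underscore, so IDs containing "_" survive
    video_id, _, timestamp = p.stem.rpartition("_")
    if not video_id or not timestamp.isdigit():
        raise ValueError(f"unexpected frame name: {p.name}")
    return p.parent.parent.name, p.parent.name, video_id, int(timestamp)


print(parse_frame_path("/ingress/youtube/gym/abc_DEF-123_45000.jpg"))
# ('youtube', 'gym', 'abc_DEF-123', 45000)
```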

Docker

docker/run.sh loads .env, expands paths, and mounts:

| Host path        | Container path    | Mode       |
|------------------|-------------------|------------|
| `DATASET_DIR`    | `/dataset`        | read/write |
| `GRABBY_WORKDIR` | `/grabby-workdir` | read-only  |
| `MODEL_CACHE`    | `/model-cache`    | read-only  |
| project root     | `/workspace`      | read/write |

./docker/build.sh                        # build image tagged 'samantics'
./docker/run.sh python label.py          # run a script
./docker/run.sh python stats.py          # check pipeline state
./docker/run.sh bash                     # interactive shell

The GPU is used automatically if the NVIDIA Docker runtime is present.
