broomhead/samantics

samantics

Auto-labelling pipeline for vision datasets. Takes raw frames from an ingress directory, deduplicates them, auto-labels with SAM3, curates the results, and exports COCO-format annotation JSON ready for training.

dedupe → label → curate
  • dedupe — walks an ingress directory, runs dhash + ORB deduplication, copies accepted frames into a content-addressable image store
  • label — runs SAM3 over accepted images, writes bounding boxes + RLE masks to labels.jsonl
  • curate — filters low-signal frames, splits into train/val, exports coco/train.json + coco/val.json

Everything is resumable — pipeline.db tracks every image through all three stages so interrupted runs pick up where they left off.


Quick start

# 1. Copy .env.example and fill in your paths
cp .env.example .env

# 2. Build the Docker image
./docker/build.sh

# 3. Deduplicate frames from ingress
./docker/run.sh python dedupe.py

# 4. Auto-label (needs GPU)
./docker/run.sh python label.py

# 5. Curate and export COCO JSON
./docker/run.sh python curate.py

Configuration

All paths are set via environment variables. Copy .env.example to .env (gitignored) and edit:

DATASET_DIR=/path/to/my-dataset     # where pipeline.db, images/, labels.jsonl etc. live
GRABBY_WORKDIR=/path/to/ingress     # source frames (read-only in Docker)
MODEL_CACHE=/path/to/model-cache    # HuggingFace model cache (read-only in Docker)
HF_TOKEN=hf_...                     # optional, for gated models

DATASET_DIR is picked up automatically by all scripts. You can also pass --dataset-dir to any script to override it for a single run.
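One way the env-var-plus-override behaviour could be wired up is sketched below. This is not taken from the scripts themselves: `resolve_dataset_dir` is a hypothetical helper, and only the `DATASET_DIR` variable and the `--dataset-dir` flag come from this README.

```python
import argparse
import os
from pathlib import Path


def resolve_dataset_dir(argv=None, environ=None):
    """Return the dataset directory: --dataset-dir wins, else DATASET_DIR."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # The environment value becomes the default, so the flag only overrides it.
    parser.add_argument("--dataset-dir", default=environ.get("DATASET_DIR"))
    # Ignore any other flags; a real script would declare them on the same parser.
    args, _unknown = parser.parse_known_args(argv)
    if not args.dataset_dir:
        parser.error("set DATASET_DIR in .env or pass --dataset-dir")
    return Path(args.dataset_dir)
```

Calling `resolve_dataset_dir()` with no arguments reads `sys.argv` and `os.environ`; passing both explicitly keeps the function easy to test.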


Dataset directory layout

{DATASET_DIR}/
  pipeline.db          ← SQLite pipeline state (WAL mode)
  images/              ← accepted frames, named {dhash}.jpg (content-addressable)
  labels.jsonl         ← SAM3 output: boxes (XYXY), category_ids, RLE masks, scores
  dataset.jsonl        ← curated subset of labels.jsonl
  coco/
    train.json         ← COCO format (XYWH boxes, RLE segmentation)
    val.json           ← COCO format
  coco-spec.json       ← category definitions (required by label.py and curate.py)
  sam-queries.yaml     ← per-category text query overrides for SAM3

coco-spec.json and sam-queries.yaml are dataset-specific config that live alongside the data.

coco-spec.json defines the category taxonomy. Supercategories group related classes — SAM3 queries are built per-supercategory. Category IDs can be any integers; leaving gaps between supercategories makes it easy to add new classes later.

{
  "info": {
    "year": 2025,
    "version": "1.0",
    "description": "My Dataset",
    "contributor": "you"
  },
  "licenses": [{ "id": 1, "name": "", "url": "" }],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "person",
      "keypoints": ["nose", "left_eye", "right_eye", "left_shoulder", "right_shoulder",
                    "left_elbow", "right_elbow", "left_wrist", "right_wrist",
                    "left_hip", "right_hip", "left_knee", "right_knee",
                    "left_ankle", "right_ankle"],
      "skeleton": [[1,2],[1,3],[2,4],[3,5],[4,6],[5,7],[6,8],[7,9],[8,10],
                   [9,11],[10,12],[11,13],[12,14],[13,15]]
    },
    { "id": 100, "name": "car",       "supercategory": "vehicle" },
    { "id": 101, "name": "truck",     "supercategory": "vehicle" },
    { "id": 102, "name": "airplane",  "supercategory": "vehicle" },
    { "id": 200, "name": "dog",       "supercategory": "animal" },
    { "id": 201, "name": "cat",       "supercategory": "animal" },
    { "id": 202, "name": "bird",      "supercategory": "animal" }
  ],
  "images": [],
  "annotations": []
}
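Since queries are built per supercategory, a loader would probably group the categories first. The sketch below shows one way to do that; `group_by_supercategory` is a hypothetical name, and the duplicate-ID check is an added safeguard rather than documented behaviour.

```python
import json
from collections import defaultdict


def group_by_supercategory(spec):
    """Map each supercategory to its (id, name) pairs, in file order."""
    groups = defaultdict(list)
    seen_ids = set()
    for cat in spec["categories"]:
        # IDs may be any integers, but each one must be unique.
        if cat["id"] in seen_ids:
            raise ValueError(f"duplicate category id {cat['id']}")
        seen_ids.add(cat["id"])
        groups[cat["supercategory"]].append((cat["id"], cat["name"]))
    return dict(groups)


# A trimmed version of the example spec above.
spec = json.loads("""
{"categories": [
  {"id": 1,   "name": "person", "supercategory": "person"},
  {"id": 100, "name": "car",    "supercategory": "vehicle"},
  {"id": 101, "name": "truck",  "supercategory": "vehicle"},
  {"id": 200, "name": "dog",    "supercategory": "animal"}
]}
""")
groups = group_by_supercategory(spec)
```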

sam-queries.yaml lets you override the text prompt SAM3 uses for any category. By default the category name is used as the query. Only add entries where the default produces poor results:

vehicle:
  airplane: commercial jet aircraft
animal:
  bird: small bird perched or in flight
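The fallback rule (use the override if present, otherwise the category name) can be sketched as below. The overrides are written as a plain dict mirroring the YAML above so the example has no dependencies; `resolve_queries` is a hypothetical helper, not a function from label.py.

```python
def resolve_queries(categories, overrides):
    """Return {category name: text query}, falling back to the name itself."""
    queries = {}
    for cat in categories:
        # A supercategory may be missing from the overrides, or present but empty.
        per_super = overrides.get(cat["supercategory"]) or {}
        queries[cat["name"]] = per_super.get(cat["name"], cat["name"])
    return queries


# Same content as the sam-queries.yaml example, after parsing.
overrides = {
    "vehicle": {"airplane": "commercial jet aircraft"},
    "animal": {"bird": "small bird perched or in flight"},
}
categories = [
    {"id": 100, "name": "car", "supercategory": "vehicle"},
    {"id": 102, "name": "airplane", "supercategory": "vehicle"},
    {"id": 202, "name": "bird", "supercategory": "animal"},
    {"id": 1, "name": "person", "supercategory": "person"},
]
queries = resolve_queries(categories, overrides)
```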

Scripts

dedupe.py — deduplicate and ingest frames

Walks the ingress directory ({GRABBY_WORKDIR}/{source}/{category}/*.jpg), discovers all frames, and runs two-pass deduplication:

  1. dhash — perceptual hash; images whose hashes fall within a Hamming distance threshold of each other are treated as near-duplicates, and the comparison is global across the whole dataset
  2. ORB — per-video feature matching; catches similar frames from the same video that dhash misses (slight motion, lighting changes)

Accepted images are copied to {DATASET_DIR}/images/{dhash}.jpg.
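For readers unfamiliar with dhash, the first pass can be sketched in pure Python as below. It assumes the image has already been converted to grayscale and resized to 8 rows of 9 pixels (a real run would use an image library for that step). The hex file name and the `max_distance` parameter are assumptions; the README does not state the naming format or the threshold dedupe.py uses.

```python
def dhash(gray_rows):
    """Difference hash of a grayscale grid with 8 rows of 9 pixels each.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, giving 8 * 8 = 64 bits.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(h, accepted_hashes, max_distance):
    """True if h is within max_distance bits of any accepted hash."""
    return any(hamming(h, other) <= max_distance for other in accepted_hashes)


def store_name(h):
    """Content-addressable file name (hex encoding is an assumption)."""
    return f"{h:016x}.jpg"


# A left-to-right brightness ramp, and a copy with one pixel changed.
ramp = [list(range(9, 0, -1)) for _ in range(8)]
tweaked = [row[:] for row in ramp]
tweaked[0][0] = 0
```

Here `hamming(dhash(ramp), dhash(tweaked))` is 1, so with any threshold of 1 or more the second frame would be rejected as a near-duplicate of the first.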

# Run everything (paths from .env)
./docker/run.sh python dedupe.py

# Filter to one category
./docker/run.sh python dedupe.py --category gym

# dhash only, no ORB (faster)
./docker/run.sh python dedupe.py --no-orb

# Dry run — count without writing
./docker/run.sh python dedupe.py --dry-run

label.py — SAM3 auto-labelling

Reads accepted images from pipeline.db, runs SAM3 inference in batches, appends results to labels.jsonl. Each image is marked done or failed in the DB so runs are resumable.

Requires coco-spec.json and sam-queries.yaml in DATASET_DIR.

# Label all pending images
./docker/run.sh python label.py

# Filter to one category, retry previously failed
./docker/run.sh python label.py --category animal --retry-failed

# Custom batch size and confidence threshold
./docker/run.sh python label.py --batch-size 32 --score-thresh 0.6
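The done/failed bookkeeping that makes label.py resumable might look roughly like the sketch below. The column names (`select_status`, `label_status`, `label_error`, `source_path`) appear in the queries elsewhere in this README; the table definition, the `label_pending` function, and the one-image-at-a-time loop are simplifications made for illustration (the real script works in batches).

```python
import sqlite3


def label_pending(conn, run_model, retry_failed=False):
    """Label every accepted image still pending; return how many were tried."""
    statuses = ["pending", "failed"] if retry_failed else ["pending"]
    placeholders = ", ".join("?" for _ in statuses)
    rows = conn.execute(
        "SELECT source_path FROM images "
        "WHERE select_status = 'accepted' "
        f"AND label_status IN ({placeholders})",
        statuses,
    ).fetchall()
    for (path,) in rows:
        try:
            run_model(path)  # stand-in for SAM3 inference and the labels.jsonl append
        except Exception as exc:
            conn.execute(
                "UPDATE images SET label_status = 'failed', label_error = ? "
                "WHERE source_path = ?",
                (str(exc), path),
            )
        else:
            conn.execute(
                "UPDATE images SET label_status = 'done', label_error = NULL "
                "WHERE source_path = ?",
                (path,),
            )
        conn.commit()  # commit per image so an interrupted run keeps its progress
    return len(rows)


# Demo on an in-memory database with an assumed, minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (source_path TEXT PRIMARY KEY, select_status TEXT, "
    "label_status TEXT DEFAULT 'pending', label_error TEXT)"
)
conn.executemany(
    "INSERT INTO images (source_path, select_status) VALUES (?, ?)",
    [("a.jpg", "accepted"), ("b.jpg", "accepted"), ("c.jpg", "rejected")],
)


def flaky_model(path):
    if path == "b.jpg":
        raise RuntimeError("simulated inference failure")


first = label_pending(conn, flaky_model)    # tries a.jpg and b.jpg
second = label_pending(conn, flaky_model)   # nothing pending; b.jpg stays failed
third = label_pending(conn, lambda p: None, retry_failed=True)  # retries b.jpg
```

A second plain run does no work because failed images are skipped until `retry_failed` is set, which matches the role of `--retry-failed` above.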

curate.py — filter, split, export COCO

Reads labels.jsonl, filters out low-quality records (empty detections, person-only frames), splits into train/val, and writes coco/train.json + coco/val.json. Updates curate_status in pipeline.db for each image.

# Standard run (85/15 split)
./docker/run.sh python curate.py

# Custom split ratio and seed
./docker/run.sh python curate.py --train 80 --seed 1

# Keep only specific categories
./docker/run.sh python curate.py --categories animal vehicle

# Include person-only frames (excluded by default)
./docker/run.sh python curate.py --keep-only-person
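The filtering rule can be illustrated with a small predicate. This is a sketch under assumptions: the record is taken to be a dict with a `category_ids` list (the field is listed in the dataset layout, but the exact record shape is not documented here), and person is assumed to be category ID 1 as in the example coco-spec.json. `keep_record` is a hypothetical name.

```python
def keep_record(record, person_ids=frozenset({1}),
                keep_empty=False, keep_only_person=False):
    """Decide whether one labels.jsonl record survives curation."""
    ids = set(record.get("category_ids", []))
    if not ids:
        # No detections at all.
        return keep_empty
    if ids <= person_ids:
        # Every detection is a person.
        return keep_only_person
    return True


records = [
    {"image": "a.jpg", "category_ids": []},
    {"image": "b.jpg", "category_ids": [1, 1]},
    {"image": "c.jpg", "category_ids": [1, 100]},
    {"image": "d.jpg", "category_ids": [200]},
]
kept = [r["image"] for r in records if keep_record(r)]
kept_with_people = [r["image"] for r in records
                    if keep_record(r, keep_only_person=True)]
```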

stats.py — pipeline and annotation statistics

Prints a full picture of pipeline state and annotation counts. Useful for checking imports and dataset balance before training.

# Full report (pipeline DB + annotation stats from labels.jsonl)
./docker/run.sh python stats.py

# Stats from a specific file
./docker/run.sh python stats.py /path/to/labels.jsonl

# Include filtered-out images in counts
./docker/run.sh python stats.py --keep-empty --keep-only-person

split.py — re-split an existing JSONL

Splits any JSONL file into train/val (or train/val/test) by percentage. Useful for re-splitting without re-running the full curation step.

python split.py dataset.jsonl train.jsonl val.jsonl --train 80
python split.py dataset.jsonl train.jsonl val.jsonl test.jsonl --train 75 --val 15
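A percentage split with a fixed seed can be sketched as follows. This illustrates the general technique and is not the code in split.py; `split_records` and its rounding rule (floor division, with the remainder going to the last split) are assumptions.

```python
import random


def split_records(records, train_pct, val_pct=None, seed=0):
    """Shuffle deterministically, then cut by percentage.

    With val_pct=None the remainder after train goes to val (two-way split).
    Otherwise the remainder after train and val goes to test.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed, same order
    n = len(shuffled)
    n_train = n * train_pct // 100
    if val_pct is None:
        return shuffled[:n_train], shuffled[n_train:]
    n_val = n * val_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


records = list(range(100))
train, val = split_records(records, 80)
train3, val3, test3 = split_records(records, 75, 15)
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.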

to_coco.py — convert JSONL to COCO JSON

Converts any labels JSONL file to COCO annotation format. Useful for one-off conversions outside the main pipeline.

python to_coco.py labels.jsonl output.json
python to_coco.py labels.jsonl output.json --spec /path/to/coco-spec.json
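The core of such a conversion is turning XYXY boxes into COCO's XYWH boxes and assigning image and annotation IDs. The sketch below shows that step only. The record field names `file_name`, `width`, and `height` are hypothetical, masks are left out, and `area` is filled with the box area as a stand-in (COCO normally stores the mask area there).

```python
import json


def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] corners to COCO's [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def records_to_coco(lines, categories):
    """Build a COCO-style dict from JSONL lines (one record per image)."""
    coco = {"images": [], "annotations": [], "categories": categories}
    ann_id = 1
    for image_id, line in enumerate(lines, start=1):
        rec = json.loads(line)
        coco["images"].append({
            "id": image_id,
            "file_name": rec["file_name"],
            "width": rec["width"],
            "height": rec["height"],
        })
        for box, cat_id, score in zip(rec["boxes"], rec["category_ids"], rec["scores"]):
            x, y, w, h = xyxy_to_xywh(box)
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_id,
                "bbox": [x, y, w, h],
                "area": w * h,  # box area as a stand-in for mask area
                "iscrowd": 0,
                "score": score,
            })
            ann_id += 1
    return coco


line = json.dumps({
    "file_name": "abc.jpg", "width": 640, "height": 480,
    "boxes": [[10, 20, 110, 70]], "category_ids": [100], "scores": [0.9],
})
coco = records_to_coco([line], [{"id": 100, "name": "car", "supercategory": "vehicle"}])
```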

import_existing.py — migrate from a previous pipeline

If you have data from an older separate-directory pipeline layout (01-selection/, 02-labels/, 03-curation/), import it into pipeline.db without re-running anything:

python import_existing.py --data-root /path/to/old-dataset

# Dry run first to verify counts
python import_existing.py --data-root /path/to/old-dataset --dry-run

Then run curate.py, pointing it at the existing labels file:

./docker/run.sh python curate.py --labels-jsonl /path/to/old-dataset/02-labels/labels.jsonl

Pipeline DB

pipeline.db is a SQLite database that tracks every image through all three stages. The select stage is the one dedupe.py performs:

Stage    Column          Values
select   select_status   pending / accepted / rejected
label    label_status    pending / done / failed
curate   curate_status   pending / included / excluded

Useful queries:

# What failed labelling and why?
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT source_path, label_error FROM images WHERE label_status='failed' LIMIT 20"

# Which videos contributed the most accepted frames?
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT video_id, COUNT(*) n FROM images WHERE select_status='accepted'
   GROUP BY video_id ORDER BY n DESC LIMIT 10"

# Accepted but not yet labelled (work remaining before curate)
sqlite3 $DATASET_DIR/pipeline.db \
  "SELECT COUNT(*) FROM images WHERE select_status='accepted' AND label_status != 'done'"
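The same kind of check can be done from Python with the standard sqlite3 module. The sketch below counts images per status for each stage column. It runs against an in-memory table with an assumed minimal schema; pointing `sqlite3.connect` at `$DATASET_DIR/pipeline.db` instead should work as long as the `images` table has the three status columns shown above.

```python
import sqlite3

STATUS_COLUMNS = ("select_status", "label_status", "curate_status")


def status_counts(conn):
    """Return {column: {status: count}} for the three stage columns."""
    summary = {}
    for column in STATUS_COLUMNS:  # fixed names, so safe to interpolate
        rows = conn.execute(
            f"SELECT {column}, COUNT(*) FROM images GROUP BY {column}"
        ).fetchall()
        summary[column] = dict(rows)
    return summary


# Demo data; the real table has more columns than these three.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (select_status TEXT, label_status TEXT, curate_status TEXT)"
)
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        ("accepted", "done", "included"),
        ("accepted", "failed", "pending"),
        ("rejected", "pending", "pending"),
    ],
)
summary = status_counts(conn)
```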

Ingress layout

Frames are sourced from grabby, which organises output by source and category:

{GRABBY_WORKDIR}/
  {source}/              e.g. youtube/
    {category}/          e.g. gym/
      *.jpg              frames named {video_id}_{timestamp_ms}.jpg
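Given this layout, the source, category, video ID, and timestamp can all be recovered from a frame's path. The sketch below is one way to do it; `Frame` and `parse_frame_path` are hypothetical names. It splits the file name at the last underscore, on the assumption that video IDs may themselves contain underscores.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Frame(NamedTuple):
    source: str
    category: str
    video_id: str
    timestamp_ms: int


def parse_frame_path(path, workdir):
    """Split {workdir}/{source}/{category}/{video_id}_{timestamp_ms}.jpg."""
    rel = PurePosixPath(path).relative_to(workdir)
    # Raises ValueError if the path is not exactly three levels deep.
    source, category, filename = rel.parts
    # Split at the last underscore: everything before it is the video ID.
    video_id, _, timestamp = PurePosixPath(filename).stem.rpartition("_")
    return Frame(source, category, video_id, int(timestamp))


frame = parse_frame_path("/ingress/youtube/gym/abc_DEF123_15000.jpg", "/ingress")
```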

Set GRABBY_WORKDIR in .env to point at grabby's output directory.


Docker

docker/run.sh loads .env, expands paths, and mounts:

Host path        Container path     Mode
DATASET_DIR      /dataset           read/write
GRABBY_WORKDIR   /grabby-workdir    read-only
MODEL_CACHE      /model-cache       read-only
project root     /workspace         read/write

./docker/build.sh                        # build image tagged 'samantics'
./docker/run.sh python label.py          # run a script
./docker/run.sh python stats.py          # check pipeline state
./docker/run.sh bash                     # interactive shell

GPU is used automatically if the nvidia Docker runtime is present.
