Add push_workers_to_ecr.sh script for ECR image deployment#141
Add push_workers_to_ecr.sh script for ECR image deployment#141arjunrajlab wants to merge 11 commits into
Conversation
Local-only workflow for building one or more worker images and pushing them to AWS ECR after sourcing AWS credentials. Discovers workers from the standard workers/annotations/<name> and workers/properties/<sub>/ <name> layouts (piscis is intentionally skipped), tags each image with both the short git SHA and 'latest', uses a per-worker local buildx cache, and creates ECR repos on first push (with a confirmation prompt). Includes --list, --all, --dry-run, --region, --prefix, and -y flags. Also adds a top-level .gitignore covering the buildx cache and the existing build.log, and documents the workflow in CLAUDE.md alongside the other build commands.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1ff5dd26a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| -t "$img_sha" \ | ||
| -t "$img_latest" \ | ||
| --push \ | ||
| . 2> >(tee "$stderr_file" >&2) |
There was a problem hiding this comment.
Build worker-context Dockerfiles from their directories
When pushing any discovered worker whose Dockerfile was written for its own directory as the build context (for example workers/annotations/cellpose/Dockerfile:35 and workers/annotations/random_point/Dockerfile:7 use COPY ./environment.yml, and there is no repo-root environment.yml), this repo-root build context makes Docker look in the wrong place and the build fails before anything can be pushed. --all includes many of these Dockerfiles, and named pushes for those workers are broken unless the context is switched to the worker directory or those files are excluded/rewritten.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Resolved in d938a07 (before this review's commit). detect_context() now picks the build context per worker by checking where each COPY/ADD source resolves — repo-root for compose-style workers, worker-dir for ones that COPY ./environment.yml (cellpose, random_point, …). Verified: cellpose → worker dir, blob_metrics_worker → repo root; both build and push.
push_workers_to_ecr.sh always used the repo root as the build context and had no concept of base images, so: - Workers whose Dockerfile expects its own directory as context (cellpose, stardist, random_point, ...) failed to build under --all (Codex P1: no repo-root environment.yml). Now detect_context() picks repo-root vs worker-dir per worker by checking where each COPY/ADD source resolves. - Worker `FROM nimbusimage/<base>:latest` images aren't on any registry and the local copies are the host arch (arm64), so amd64 cross-builds couldn't resolve them. ensure_base_image() now builds the needed base for the target platform, pushes it to nimbus/base/<base>, and redirects FROM to the ECR copy via `docker buildx --build-context` (no worker Dockerfile edits). Add --skip-base to reuse a base already in ECR. Validated on this Mac: built+pushed amd64 nimbus/base/worker-base and nimbus/properties/blob_metrics_worker (both confirmed linux/amd64). Docs (CLAUDE.md): correct creds path, note worker names are directory names, document base-image build/redirect and --skip-base. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the full ECR push flow in scripts/PUSH_WORKERS_TO_ECR.md: auth via ../AWSDeploy/aws_credentials_prod.sh, prerequisites (bash 4+ on macOS), all options, image naming/tags, automatic base-image build + FROM redirect, build-context auto-detection, the per-base-image worker groups, and gotchas (QEMU emulation on macOS is slow; GPU/ML workers should be built natively on an x86_64 machine via build_machine_learning_workers.sh; test/sample and _M1 workers are not deployed). Link it from CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ECR repository names and Docker image tags must be lowercase, but some worker directory names aren't (blob_point_count_3D_projection_worker, the *_M1 variants). Lowercase the repo component when forming the repo and tags so these build instead of failing with "repository name must be lowercase". Also fix ensure_repo() to check the create-repository exit code and return non-zero on failure, instead of always logging "Created repo" even when the AWS call errored. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Building on a native amd64 machine" section covering (A) running the existing script on an amd64 Linux box you already have (no emulation, fast, GPU workers build normally) and (B) spinning up an ephemeral EC2 instance (amd64 or GPU) with an ECR IAM instance role to build + push everything, then terminate. Notes that GPU/ML workers go through build_machine_learning_workers.sh and that AWSDeploy Terraform could template the build instance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a821bed23a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| while IFS= read -r f; do | ||
| name="$(basename "$(dirname "$f")")" | ||
| printf '%s\tannotations\t%s\n' "$name" "$f" | ||
| done < <(find workers/annotations -mindepth 2 -maxdepth 2 -name Dockerfile -type f 2>/dev/null | sort) |
There was a problem hiding this comment.
Exclude M1-only worker Dockerfiles from amd64 pushes
When --all is used, this find includes directories such as workers/annotations/annulus_generator_M1, random_point_annotation_M1, and test_multiple_annotation_M1; those Dockerfiles download Miniconda3-latest-Linux-aarch64.sh, so the script's default --platform linux/amd64 build will fail once it reaches them. Filter out the _M1 worker directories (or only include the production workers) so --all can complete for the documented amd64 ECR workflow.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in b2696bf. *_M1 directories are now excluded from discovery (they hardcode the aarch64 Miniconda installer and can't build linux/amd64), so --all no longer breaks on them. Also added a denylist so --all skips the dev/test workers (random_squares, sample_interface, test_multiple_annotation, test_file_creation_worker); those stay pushable by explicit name. Confirmed --all now resolves to 49 buildable workers with no _M1.
| if ! git -C "$REPO_ROOT" diff --quiet 2>/dev/null || \ | ||
| ! git -C "$REPO_ROOT" diff --staged --quiet 2>/dev/null; then |
There was a problem hiding this comment.
Include untracked files in dirty SHA tagging
This dirty check only looks at tracked unstaged/staged diffs, so an untracked file that is copied into the Docker build context (for example a newly added worker helper or entrypoint dependency) can be baked into the image while the tag remains the clean git SHA. That breaks the provenance guarantee documented for -dirty; include untracked files in the check before deciding the image corresponds to HEAD.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in b2696bf. The dirty check now uses git status --porcelain, which catches tracked changes, staged changes, and untracked (non-gitignored) files. Since the build context is usually the repo root, an untracked file can be COPYed into an image, so it must flip the tag to -dirty. .gitignored artifacts (.cache/, etc.) are still excluded.
…k untracked Address Codex review on PR #141: - Exclude `*_M1` worker dirs from discovery. Their Dockerfiles hardcode the aarch64 Miniconda installer (Apple-Silicon dev only) and cannot build for linux/amd64, so `--all` previously broke when it reached them. - `--all` now skips dev/test/sample workers (random_squares, sample_interface, test_multiple_annotation, test_file_creation_worker) per the deployment policy; they remain pushable by explicit name. - Dirty SHA check now uses `git status --porcelain`, which also catches untracked (non-gitignored) files. With the repo-root build context an untracked file can be baked into an image, so the `-dirty` provenance tag must account for it. Docs updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…elism push_workers_to_ecr.sh now covers GPU/ML workers (they build FROM public nvidia/cuda images, so no base redirect is needed) and piscis, which is emitted as two pseudo-workers piscis_predict/piscis_train with the build context pinned to the piscis dir (matching its docker-compose). Adds --skip-ml/--only-ml, a GPU/ML column in --list, and a QEMU emulation warning before cross-building any CUDA worker on a non-amd64 host. Make COMPOSE_PARALLEL_LIMIT in build_workers.sh env-overridable (default 1). Even an EC2 GPU build instance can't handle many concurrent heavy CUDA/torch builds, so 2-3 is the practical ceiling, not higher. Update PUSH_WORKERS_TO_ECR.md and CLAUDE.md: a single --all run on a native amd64/GPU host now builds+pushes everything (standard + GPU/ML + piscis); build_machine_learning_workers.sh remains the local-dev (no-ECR) build path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regenerated by the update-worker-docs Claude Code hook after PR creation.
Resolve the target registry from $ECR_REGISTRY when set, defaulting to the URL derived from the authenticated account + region (which still respects --region). This is the shared env-var contract with the AWSDeploy consume side (Celery workers pull prebuilt images from the same registry); the full contract is documented in AWSDeploy's doc/Build_and_Push_Worker_Images_to_ECR.md. No new flag, no behavior change when $ECR_REGISTRY is unset. The override flows to worker SHA/latest tags, the base-image push, and the FROM --build-context redirect. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cripts Capture the deferred consume-side link (make build_workers.sh and build_machine_learning_workers.sh pull prebuilt images from ECR when $ECR_REGISTRY is set, falling back to a source build) along with the two blockers that make it non-trivial: the unknown local image tag Girder discovery expects, and the lack of a single name mapping across worker dir / ECR repo / ML-script local tag / compose service name. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…hanges generate_worker_docs.py: prefer the concise Docker `description` label for the REGISTRY one-liner (fall back to the in-app notes only when there's no label), and make _strip_html replace tags with a space + drop a trailing "Learn more" link label. This removes the verbose/truncated rows, the leaked "Learn more" anchor text, and fused sentences (e.g. "z-slices.It") that the regenerated registry had picked up from the notes fields. update-worker-docs.sh: only regenerate + commit docs when the branch diff vs. the base actually touches workers/ or the generator, so an unrelated PR (e.g. a build-script change) no longer picks up a registry-churn commit. REGISTRY.md regenerated with the fixed generator (idempotent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
scripts/push_workers_to_ecr.shbuilds NimbusImage worker Docker images forlinux/amd64and pushes them to AWS ECR. Each worker gets two tags (short gitSHA +
latest); needed base images are built for amd64 and each worker'sFROMis redirected to the ECR copy viadocker buildx --build-context(noDockerfile edits). Full reference:
scripts/PUSH_WORKERS_TO_ECR.md.Where to run it
Two supported paths:
linux/amd64under QEMU. Fine for standardCPU workers; needs bash 4+ (
brew install bash, run with/opt/homebrew/bin/bash).GPU/ML workers are slow/fragile under emulation — use
--skip-mlfor a quickstandard-only run.
amd64/GPU) host there is no emulation — builds are fast and GPU/ML workers
build normally. Spin up an ephemeral instance with an IAM instance profile
that has ECR push permissions (so
aws_credentials_prod.shisn't needed),run the script, then terminate it (a spot instance keeps the one-off cost
low). See "Building on a native amd64 machine" in the doc for the existing-box
and ephemeral-EC2 recipes.
GPU/ML + piscis coverage
sam/sam2 family, stardist, condensatenet, deconwolf, deepcell, cellori). They
build
FROMpublicnvidia/cudaimages, so they need no base redirect andpush like any other worker.
predict/+train/layout) is handled as twopseudo-workers,
piscis_predictandpiscis_train, both built with thepiscis dir as context (matching its
docker-compose.yaml).--skip-ml(excludenvidia/cudaworkers incl. piscis) and--only-ml(build only those, e.g. on a GPU box).--listnow shows aGPU/ML column, and the script warns before cross-building any CUDA worker
under QEMU on a non-amd64 host.
./scripts/push_workers_to_ecr.sh --allon a nativeamd64/GPU EC2 host builds + pushes everything (standard + GPU/ML +
piscis).
build_machine_learning_workers.shremains the local-dev build path(local image tags, no ECR push).
Build parallelism
COMPOSE_PARALLEL_LIMITinbuild_workers.shis now env-overridable (default1, safe on a memory-constrained laptop). Even a beefy EC2 GPU build instancecan't handle many concurrent heavy CUDA/torch builds, so 2–3 is the practical
ceiling, not higher:
Key implementation details
workers/annotations/andworkers/properties/;*_M1arm64 dev variants are excluded entirely, and test/sample workers areskipped by
--all(still pushable by explicit name).context is pinned via an override map since it can't be inferred.
sts get-caller-identity); appends-dirtyto the SHA tag on a dirty working tree; auto-creates ECR repos on first push
(with a prompt;
-yto skip); re-logs-in + retries once on a push auth error;keeps a per-worker buildx cache under
.cache/buildx/(--no-cachetodisable).
Image naming / tags
Prod registry: account
677276105390, regionus-east-1, prefixnimbus.🤖 Generated with Claude Code