GRPO on Sagemaker using TRL Tutorial#2596
Conversation
Signed-off-by: DWarez <dario.salvati@huggingface.co>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cc6dcec. Configure here.
| "elif \"trainer\" in globals() and hasattr(trainer, \"latest_training_job\"):\n", | ||
| " job = trainer.latest_training_job.name\n", | ||
| "elif \"base_job_name\" in globals():\n", | ||
| " job = base_job_name\n", |
There was a problem hiding this comment.
Wrong SageMaker job name
Medium Severity
When resolving the training job, the notebook sets job to base_job_name if trainer.latest_training_job is missing. SageMaker assigns a unique suffix to the actual TrainingJobName, so describe_training_job with only base_job_name fails even right after launch.
Reviewed by Cursor Bugbot for commit cc6dcec. Configure here.
| " if e.response.get(\"Error\", {}).get(\"Code\") != \"AccessDeniedException\":\n", | ||
| " raise\n", | ||
| " print(\"CloudWatch Logs access denied; parsing saved notebook output instead.\")\n", | ||
| " records = notebook_metric_records()\n", |
There was a problem hiding this comment.
Wrong notebook fallback path
Low Severity
The CloudWatch fallback calls notebook_metric_records() with default path notebook.ipynb, but this tutorial file is sagemaker-notebook.ipynb. When S3 and CloudWatch both fail, the cell raises FileNotFoundError instead of parsing saved outputs.
Reviewed by Cursor Bugbot for commit cc6dcec. Configure here.
| "MODEL_ID = \"HuggingFaceTB/SmolLM3-3B\" # SmolLM3, HF's 3B instruct model\n", | ||
| "INSTANCE_TYPE = \"ml.g6.12xlarge\" # 4xL4 24GB\n", | ||
| "\n", | ||
| "TRAINING_IMAGE = \"754289655784.dkr.ecr.us-east-1.amazonaws.com/hf-trl-grpo:sagemaker-trl-dev-e63f67e\"" |
There was a problem hiding this comment.
this must be changed before merging
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
change: using zero3 instead of zero2 Signed-off-by: DWarez <dario.salvati@huggingface.co>
The workaround I did before is totally unnecessary for ml.p4d.24xlarge therefore I think it was due to some hardware related issue (like high memory pressure or comms bandwidth). Since we're keeping ml.p4d.24xlarge as instance type in order to also run the reference model, we can safely revert back to not use the workaround Signed-off-by: DWarez <dario.salvati@huggingface.co>
alvarobartt
left a comment
There was a problem hiding this comment.
It's crazy that in big 2026 we cannot review / comment on Jupyter Notebooks in GitHub without third-parties...
So here's the review with minor nits:
- In the snippet below, I'd rather use
from huggingface_hub import notebook_login, see https://huggingface.co/docs/huggingface_hub/v1.21.0.rc0/en/package_reference/authentication#huggingface_hub.notebook_login, and checkget_tokenbefore i.e.,if not hf_token := get_token(): ...then login.
import os
from huggingface_hub import login, get_token
if not os.environ.get("HF_TOKEN"):
login()
HF_TOKEN = get_token()
assert HF_TOKEN, "No HF token found — set HF_TOKEN or run login()"
print("HF token loaded")-
# The training image lives in us-east-1, so keep the job, the S3 bucket and the image in one region.AFAIK AWS SageMaker AI is the one creating the bucket so no need for this comment IMO -
TRAINING_IMAGE = "754289655784.dkr.ecr.us-east-1.amazonaws.com/hf-trl-grpo:sagemaker-trl-dev-e63f67e"is this container public at the end? Or temporarily until the official one is released? -
For the following cell, either remove the
printor keep the output in the Jupyter Notebook cell IMO
import json
from datasets import load_dataset
raw = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
print(raw)
print(json.dumps({k: raw[0][k] for k in ("query", "tools", "answers")}, indent=2)[:1200])-
With the following
Generation uses the Transformers backend.you mean instead of TRL's default vLLM backend? If so, I'd add the clarification. -
ZeRO-3 shards the model, ...withDeepSpeed ZeRO-3 shards the model, ...to make it explicit. -
There are some occurrences of
eight A100which I'd replace with8 x A100 40GBfor consistency i.e., as A100 comes in both 40GB and 80GB flavors I'd write it that way + using the "N x A100" notation as I feel like it's more readable than "eight A100". -
Not sure if not displayed because Jupyter Notebook rendering is broken on GitHub, but I don't see the plots, so it'd be nice to include those too for the reader to understand how you interpret those metrics.
-
Maybe add a
## Referencessection at the end?
Great work @dwarez, I'd maybe review the nits / comments above and add a simple explanation (or even diagram) on what GRPO is and how it works, with maybe mentions to DeepSeek R1? Nonetheless the example is great, and I really think that extending our collection on training will be great 🔥
add: references Signed-off-by: DWarez <dario.salvati@huggingface.co>
|
hi @alvarobartt, thanks a lot for the feedback 🤗 I applied your suggestions, here some other points:
|


The purpose of this notebook is to guide users in using Sagemaker to run a GRPO training with verifiable rewards using TRL.
Note:
Salesforce/xlam-function-calling-60k) therefore the advantages computed during GRPO would be mostly nulllauncher.shworkaround you can find in the notebook. I guess it's somewhat connected to the fact that I didn't manage to make it work using vLLM as generation backend (one of the issues GRPO Trainer crashes with Could not infer dtype of NoneType when vLLM returns a NaN token logprob trl#6166)cc @alvarobartt
Note
Low Risk
Documentation-only addition (notebook); no changes to library or runtime code paths.
Overview
Adds a new SageMaker SDK v3 tutorial notebook that walks through verifiable-reward GRPO on
HuggingFaceTB/SmolLM3-3Bfor single-turn tool calling onSalesforce/xlam-function-calling-60k.The notebook materializes the full job bundle via
%%writefile:rewards.py(exact-match + format rewards on<tool_call>JSON),train.py(GRPOTrainer, DAPO loss, collapse early-stop, JSONL metrics),launch.sh(torchrunper GPU as the SageMaker entry), and DeepSpeed ZeRO-2 config. It covers dataset prep (chat prompts + hiddenanswers), S3 upload,ModelTrainerlaunch onml.g6.12xlargewith a temporary TRL GRPO ECR image, and post-job reward-curve plotting from S3/CloudWatch/notebook output.The documented run is intentionally short (smoke / wiring validation);
DEBUG_NO_UPDATEsupports a no-gradient smoke test.Reviewed by Cursor Bugbot for commit cc6dcec. Bugbot is set up for automated code reviews on this repo. Configure here.