fep(sig-framework): Add megatron-lm-fl/te-fl v0.2.0 new features #26

Open: zhaoyinglia wants to merge 10 commits into flagos-ai:main from zhaoyinglia:train_plugin

Conversation

@zhaoyinglia

No description provided.

@JosephNew JosephNew left a comment

Contributor

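Image issue

In the Megatron-LM-FL Test Plan, the Image Acquisition section says:

Base image: FlagOS 2.1 training image (CUDA variant or MetaX variant as applicable).
Source: Internal container registry or docker pull from FlagOS CI.

This is not a concrete, pullable image name, so the SVT test team cannot reproduce the environment from it. By contrast, the TE-FL section of the same FEP gives nvcr.io/nvidia/pytorch:24.07-py3, which can be pulled directly.

Please add a specific image address (for example nvcr.io/nvidia/pytorch:24.07-py3, or a concrete Harbor path plus tag); otherwise the test environment for this FEP is not reproducible.

Also, P0 issues #52 and #53 were created 6 days ago and have received no reply. Please look into them as soon as possible.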

|--------|-------------|-----------------|
| All features | `python transformer_engine/plugin/tests/run_all_tests.py` | All tests pass |

### Performance Verification

Please provide relevant commands and operation procedures for performance testing.

| Platform | Base Image | Source |
|----------|-----------|--------|
| CUDA (NVIDIA) | `nvcr.io/nvidia/pytorch:24.07-py3` | NVIDIA NGC |
| MetaX MACA | FlagCICD MetaX runner (pre-configured) | FlagCICD platform |

Please provide the specific link for the MetaX image.

Author

Removed the performance test and provided an end-to-end functional test instead.

zhaoyinglia and others added 8 commits June 11, 2026 18:14
Added installation command for flash-attn package.
Added installation of wandb and tensorboard to the setup instructions.
Added instructions for METAX setup and clarified the process for CUDA and METAX.

3 participants