
Error fine-tuning InternVL3.5-1B with XTuner on Ascend910B #1407

@JeffDing


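Fine-tuning InternVL3.5-1B with XTuner on Ascend910B fails with the error below. Is this caused by a CANN version that is too old, or by some other environment problem?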

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/xtuner_config/vl.py", line 72, in <module>
[rank0]:     trainer = Trainer.from_config(trainer)
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 407, in from_config
[rank0]:     self = cls(
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 313, in __init__
[rank0]:     self._init_dist(backend)
[rank0]:   File "/root/xtuner/xtuner/v1/train/trainer.py", line 902, in _init_dist
[rank0]:     torch.accelerator.set_device_index(int(os.environ["LOCAL_RANK"]))
[rank0]:   File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/accelerator/__init__.py", line 133, in set_device_index
[rank0]:     torch._C._accelerator_setDeviceIndex(device_index)
[rank0]:   File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch_npu/npu/__init__.py", line 251, in _lazy_init
[rank0]:     torch_npu._C._npu_init()
[rank0]: RuntimeError: SetPrecisionMode:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:175 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
[rank0]: [ERROR] 2026-01-03-18:55:33 (PID:8511, Device:0, RankID:0) ERR00100 PTA call acl api failed
[rank0]: [Error]: The internal ACL of the system is incorrect.
[rank0]:         Rectify the fault based on the error information in the ascend log.
[rank0]: EC0010: [PID: 8511] 2026-01-03-18:55:33.631.880 Failed to import Python module [AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead..].
[rank0]:         Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
[rank0]:         TraceBack (most recent call last):
[rank0]:         AOE Failed to call InitCannKB[FUNC:Initialize][FILE:python_adapter_manager.cc][LINE:47]
[rank0]:         Failed to initialize TeConfigInfo.
[rank0]:         [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeTeFusion][FILE:tbe_op_store_adapter.cc][LINE:1921]
[rank0]:         [GraphOpt][InitializeInner][InitTeFusion]: Failed to initialize TeFusion.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1888]
[rank0]:         [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:79]
[rank0]:         [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:120]
[rank0]:         [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:115]
[rank0]:         PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:83]
[rank0]:         OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:239]
[rank0]:         GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:164]
[rank0]:         GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:382]
[rank0]:         [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[rank0]:         [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
[rank0]:         [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]

E0103 18:55:39.548000 8415 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 8511) of binary: /root/.conda/envs/xtuner_npu/bin/python3.10
Traceback (most recent call last):
  File "/root/.conda/envs/xtuner_npu/bin/torchrun", line 7, in <module>
    sys.exit(main())
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.conda/envs/xtuner_npu/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
vl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-03_18:55:39
  host      : aide-20251118-e18b94f-0005395-84b6c4d979-tvvbf
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8511)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Configuration code

from xtuner.v1.model import InternVL3P5Dense1BConfig
from xtuner.v1.train import Trainer, TrainerConfig
from xtuner.v1.config import AdamWConfig, LRConfig
from xtuner.v1.datasets import InternS1VLTokenizeFnConfig, DataloaderConfig, DatasetConfig
from xtuner.v1.loss import CELossConfig
import sys
# model config - enable gradient checkpointing
model_cfg = InternVL3P5Dense1BConfig(
    use_gradient_checkpointing=True, freeze_vision=True, freeze_projector=False, freeze_language=False
)
# dataset and dataloader config
sample_max_length = 8000
pack_max_length = 8000

dataset_config = [
    {
        "dataset": DatasetConfig(
            name="formula_recognition",
            anno_path="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/train_mini_xt.jsonl",
            media_root="/home/ma-user/work/dataset/VLM-formula-recognition-dataset_intern_camp/train/",
            sample_ratio=1.0,
            class_name="VLMJsonlDataset",
        ),
        # Use the InternVL3.5 template so the prompt stays aligned with the vision tokens
        "tokenize_fn": InternS1VLTokenizeFnConfig(
            model_cfg=model_cfg,
            max_length=sample_max_length,
            template_name="internvl-3.5",
        ),
    }
]
dataloader_config = DataloaderConfig(
    dataset_config_list=dataset_config,
    pack_max_length=pack_max_length,
    num_workers=16,
    pack_level="soft",
    collator="intern_s1_vl_sft_collator",
)

# Tuned learning-rate config - raise the learning rate to speed up convergence
optim_cfg = AdamWConfig(
    lr=3e-5,
    weight_decay=0.01, # add weight decay to prevent overfitting
    betas=(0.9, 0.95), # tuned Adam betas
    foreach=False
)
lr_cfg = LRConfig(
    lr_type="cosine",
    warmup_ratio=0.1,  # larger warmup ratio so training starts more stably
    min_lr_ratio=0.1   # set a minimum learning-rate ratio
)

load_from = "/home/ma-user/work/model/InternVL3_5-1B-HF"
tokenizer = "/home/ma-user/work/model/InternVL3_5-1B-HF"

# trainer config
trainer = TrainerConfig(
    load_from=load_from,
    model_cfg=model_cfg,
    optim_cfg=optim_cfg,
    dataloader_cfg=dataloader_config,
    lr_cfg=lr_cfg,
    tokenizer_path=tokenizer,
    global_batch_size=8,
    gradient_accumulation_steps=4,
    total_epoch=5,
    work_dir="/root/data/xtuner_workdir/vl_1031/",
    loss_cfg=CELossConfig(mode="chunk", chunk_size=1024),
    hf_interval=50,
    hf_max_keep=2,
)
trainer = Trainer.from_config(trainer)
# Check that the model loaded the pretrained weights correctly
print(f"Model device: {next(trainer._engine.model.parameters()).device}")
print(f"Model dtype: {next(trainer._engine.model.parameters()).dtype}")
# sys.exit(0)
trainer.fit()

Environment versions

CANN=8.2.RC2
torch==2.8.0
torch-npu==2.8.0
transformers==4.57.0
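
The list above does not include NumPy, yet the EC0010 entry in the log reports that `np.float_` was removed in NumPy 2.0. Below is a minimal sketch for checking which NumPy version is installed in the `xtuner_npu` environment. The helper names (`major_version`, `np_float_alias_removed`, `installed_numpy_version`) are hypothetical and only for illustration; the sketch assumes the failing import comes from a component that still references `np.float_`, which the log suggests but does not prove.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.1.3'."""
    return int(version.split(".")[0])


def np_float_alias_removed(version: str) -> bool:
    """np.float_ was removed in NumPy 2.0, so any 2.x or later release lacks it."""
    return major_version(version) >= 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("NumPy is not installed in this environment")
    elif np_float_alias_removed(found):
        print(f"NumPy {found}: np.float_ is gone; code that still uses it will fail")
    else:
        print(f"NumPy {found}: np.float_ is still available")
```

If this reports a 2.x release, pinning NumPy below 2.0 (for example `pip install "numpy<2"`) is one thing to try. Whether that alone clears the ACL error 500001 on this setup is untested.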

XTuner installation commands

git clone https://gh.llkk.cc/https://github.com/InternLM/xtuner.git
cd xtuner
git checkout 4990d05c5a5416fbfd51fee9e6cf502c66947099
pip install -e .
