name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
pangu/dbnet_mobilenetv3/crnn_vgg7 networks report an HCCL error when training on the 910B3 environment
Model repository: https://gitee.com/mindspore/models/tree/master/official/nlp/Pangu_alpha
Hardware environment (Ascend/GPU/CPU):
/device ascend910B3
Failing versions:
MindSpore (newest) version: r2.3.0 master_20240506061517_d8802c69d
CANN (newest) version: 20240425_000122620_newest
Passing versions:
MindSpore (newest) version: r2.3.0 master_20240506061517_d8802c69d
CANN (newest) version: 20240424_000121416_newest
Execution mode (PyNative/Graph):
/mode graph
Test case repository: solution_test/cases/02network/02nlp/pangu_alpha/train/
Test cases:
test_ms_pangu_alpha_train_910_8p_0002
test_ms_dbnet_mobilenetv3_icdar2015_train_check_loss_910_gpu_8p_0001
test_ms_crnn_vgg7_data_lmdb_release_check_loss_910_gpu_8p_0001
Expected result: the networks train successfully.
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-70.h40.eulerosv2r11.aarch64 libgomp-10.3.1-10.h13.eulerosv2r11.aarch64 libxcrypt-4.4.26-2.h2.eulerosv2r11.aarch64 ncurses-libs-6.3-2.h2.eulerosv2r11.aarch64 sssd-2.6.1-1.h9.eulerosv2r11.aarch64
--Type <RET> for more, q to quit, c to continue without paging--c
Core was generated by `python tools/train.py --config /home/jenkins/workspace/TDT_deployment/solution_'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000ffff79b5b5d8 in std::string::_Rep::_M_is_leaked (this=this@entry=0xffffffffffffffe8) at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516829809/work/libstdc++-v3/include/bits/cow_string.h:212
212 /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516829809/work/libstdc++-v3/include/bits/cow_string.h: No such file or directory.
[Current thread is 1 (Thread 0xfff2e1fbef20 (LWP 79253))]
(gdb) bt
#0 0x0000ffff79b5b5d8 in std::string::_Rep::_M_is_leaked (this=this@entry=0xffffffffffffffe8)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516829809/work/libstdc++-v3/include/bits/cow_string.h:212
#1 0x0000ffff79b5ca04 in std::string::_Rep::_M_grab (this=0xffffffffffffffe8, __alloc1=..., __alloc2=...)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516829809/work/libstdc++-v3/include/bits/cow_string.h:263
#2 0x0000ffff79b5ca58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (this=0xfff2e1fbb6b0, __str=...)
at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1666516829809/work/libstdc++-v3/include/bits/cow_string.h:544
#3 0x0000ffff6a7bba2c in hccl::HcclCommBase::GetIdentifier() () from /usr/local/Ascend/latest/lib64/libhccl.so
#4 0x0000ffff6a7cb614 in HcclCommGraphGetIdentifier_cpu(long long, std::string&) () from /usr/local/Ascend/latest/lib64/libhccl.so
#5 0x0000ffff6a7d3fe4 in GenerateCclOpTag () from /usr/local/Ascend/latest/lib64/libhccl.so
#6 0x0000ffff373d9c48 in hccl::HcomOpsKernelInfoStore::GenerateOpTagFromTaskInfo(ge::GETaskInfo const&, std::string const&, std::string&, unsigned int&) ()
from /usr/local/Ascend/latest/lib64/libhcom_graph_adaptor.so
#7 0x0000ffff37457a4c in hccl::HcomOpsKernelInfoStore::GetTagVectorInfo(ge::GETaskInfo const&, std::string const&, std::vector<std::string, std::allocator<std::string> >&) ()
from /usr/local/Ascend/latest/lib64/libhcom_graph_adaptor.so
#8 0x0000ffff3745a338 in hccl::HcomOpsKernelInfoStore::LoadTask(ge::GETaskInfo&) () from /usr/local/Ascend/latest/lib64/libhcom_graph_adaptor.so
#9 0x0000ffff672b0330 in ge::HcclTaskInfo::Distribute() () from /usr/local/Ascend/latest/lib64/libdavinci_executor.so
#10 0x0000ffff67177254 in ge::DavinciModel::DistributeTask(domi::ModelTaskDef const&) () from /usr/local/Ascend/latest/lib64/libdavinci_executor.so
#11 0x0000ffff671a60ec in ge::DavinciModel::DoTaskSink() () from /usr/local/Ascend/latest/lib64/libdavinci_executor.so
#12 0x0000ffff671a9918 in ge::DavinciModel::Init(ge::ModelParam const&, void*) () from /usr/local/Ascend/latest/lib64/libdavinci_executor.so
#13 0x0000ffff67952234 in ge::ModelManager::LoadModelOnline(unsigned int&, std::shared_ptr<ge::GeRootModel> const&, std::shared_ptr<ge::GraphNode> const&, unsigned int, long, void*) ()
from /usr/local/Ascend/latest/lib64/libge_executor.so
#14 0x0000ffff67936418 in ge::GraphLoader::LoadModelOnline(unsigned int&, std::shared_ptr<ge::GeRootModel> const&, std::shared_ptr<ge::GraphNode> const&, unsigned int, error_message::Context const&, long, void*) () from /usr/local/Ascend/latest/lib64/libge_executor.so
#15 0x0000ffff6791c388 in ge::ModelExecutor::ModelLoad(std::shared_ptr<ge::FlowModel> const&, std::shared_ptr<ge::GraphNode> const&, void*) () from /usr/local/Ascend/latest/lib64/libge_executor.so
#16 0x0000ffff6791cf40 in ge::ModelExecutor::LoadGraph(std::shared_ptr<ge::FlowModel> const&, std::shared_ptr<ge::GraphNode> const&, void*) () from /usr/local/Ascend/latest/lib64/libge_executor.so
#17 0x0000ffff6897c2c0 in ge::GraphManager::LoadGraph(std::shared_ptr<ge::FlowModel> const&, std::shared_ptr<ge::GraphNode> const&, void*) () from /usr/local/Ascend/latest/lib64/libge_compiler.so
#18 0x0000ffff689b6db4 in ge::GraphManager::StartForRunGraph(std::shared_ptr<ge::GraphNode> const&, std::vector<ge::GeTensor, std::allocator<ge::GeTensor> > const&, std::shared_ptr<ge::FlowModel>&, unsigned long, void*) () from /usr/local/Ascend/latest/lib64/libge_compiler.so
#19 0x0000ffff689cfcb8 in ge::GraphManager::RunGraphWithStreamAsync(unsigned int const&, void*, unsigned long, std::vector<ge::GeTensor, std::allocator<ge::GeTensor> > const&, std::vector<ge::GeTensor, std::allocator<ge::GeTensor> >&) () from /usr/local/Ascend/latest/lib64/libge_compiler.so
#20 0x0000ffff72f9c6d4 in ge::InnerSession::RunGraphWithStreamAsync(unsigned int, void*, std::vector<ge::Tensor, std::allocator<ge::Tensor> > const&, std::vector<ge::Tensor, std::allocator<ge::Tensor> >&) () from /usr/local/Ascend/latest/lib64/libge_runner.so
#21 0x0000ffff72f7b254 in ge::Session::RunGraphWithStreamAsync(unsigned int, void*, std::vector<ge::Tensor, std::allocator<ge::Tensor> > const&, std::vector<ge::Tensor, std::allocator<ge::Tensor> >&)
() from /usr/local/Ascend/latest/lib64/libge_runner.so
#22 0x0000ffff76ee95fc in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.2
#23 0x0000ffff77c7335c in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.2
#24 0x0000ffff767e3018 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.2
#25 0x0000ffff767e8b18 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_ascend.so.2
#26 0x0000ffff84a5bba8 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#27 0x0000ffff84a5b538 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#28 0x0000ffff84a5e098 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#29 0x0000ffff84933fdc in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#30 0x0000ffff84930d8c in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#31 0x0000ffff84932a7c in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#32 0x0000ffff84933230 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#33 0x0000ffff849d37e4 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#34 0x0000ffff849d2c48 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#35 0x0000ffff849d2328 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#36 0x0000ffff84931cb0 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#37 0x0000ffff84a0f6ec in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#38 0x0000ffff849385c0 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#39 0x0000ffff8493520c in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#40 0x0000ffff7e816dfc in mindspore::ActorBase::Run() () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_core.so
#41 0x0000ffff7e835e38 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_core.so
#42 0x0000ffff7e837074 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_core.so
#43 0x0000ffff79b4f000 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#44 0x0000ffff90113200 in ?? () from /usr/lib64/libc.so.6
#45 0x0000ffff9017971c in ?? () from /usr/lib64/libc.so.6
(gdb) q
Assigning this to Fan Rui (樊瑞).
Please assign maintainer to check this issue.
@huangjing
This core dump occurs in the HCCL module, in GenerateCclOpTag () from /usr/local/Ascend/latest/lib64/libhccl.so, and was introduced by CANN.
A DTS ticket is already tracking it: https://dts-szv.clouddragon.huawei.com/DTSPortal/ticket/DTS2024050818073
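For anyone triaging a similar dump, the attribution above can be reproduced mechanically from the `bt` output. The following is a minimal sketch, not part of any MindSpore or CANN tooling: it assumes gdb's usual `#N 0xADDR in SYMBOL from LIBRARY` frame layout, and the helper names `parse_frames` and `first_frame_with_library` are hypothetical. The sample frames are copied (abbreviated) from the backtrace in this issue.

```python
import re

# A few frames copied from the backtrace in this issue (abbreviated).
BACKTRACE = """\
#0 0x0000ffff79b5b5d8 in std::string::_Rep::_M_is_leaked (this=this@entry=0xffffffffffffffe8)
#3 0x0000ffff6a7bba2c in hccl::HcclCommBase::GetIdentifier() () from /usr/local/Ascend/latest/lib64/libhccl.so
#5 0x0000ffff6a7d3fe4 in GenerateCclOpTag () from /usr/local/Ascend/latest/lib64/libhccl.so
#8 0x0000ffff3745a338 in hccl::HcomOpsKernelInfoStore::LoadTask(ge::GETaskInfo&) () from /usr/local/Ascend/latest/lib64/libhcom_graph_adaptor.so
#9 0x0000ffff672b0330 in ge::HcclTaskInfo::Distribute() () from /usr/local/Ascend/latest/lib64/libdavinci_executor.so
"""

# Assumed frame layout: "#N 0xADDR in SYMBOL [from LIBRARY]".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (.+?)(?: from (\S+))?$")


def parse_frames(text):
    """Return (index, symbol, library) tuples; library is None when gdb prints no 'from'."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, symbol, library = match.groups()
            frames.append((int(index), symbol, library))
    return frames


def first_frame_with_library(frames):
    """Return the innermost frame that gdb attributes to a shared library, or None."""
    for frame in frames:
        if frame[2] is not None:
            return frame
    return None


if __name__ == "__main__":
    frames = parse_frames(BACKTRACE)
    print(first_frame_with_library(frames))
```

On the frames shown, the innermost frame attributed to a shared library is in libhccl.so, which is consistent with the conclusion that the crash originates on the HCCL/CANN side rather than in MindSpore. This only locates the frame; it says nothing about the root cause.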