MindSpore / mindspore

[ST][MS][NET][pangu-alpha][910A 32p] Network training log contains error logs

DONE
Bug-Report
Created on 2024-03-21 12:06

Describe the current behavior (Mandatory)

[pangu-alpha][910 32p] The network training log contains error logs.

Network script path: https://e.gitee.com/mind_spore/repos/mindspore/models/tree/master/official/nlp/Pangu_alpha

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device /ascend910A/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Failing version:
run package: Milan_C17/20240315
MindSpore version: 2.3.0/B080 r2.3.q1_20240320000457_1b2cb8cd14

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository: MindFormers_Test/cases/pangu/train
Test case: test_mf_pangu_2_6b_train_check_loss_910_pangudata_32p_0001

Steps to reproduce the issue (Mandatory)

1. Get the code from the models repository.
2. cd models/official/nlp/Pangu_alpha
3. node1:
bash scripts/run_distributed_train.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 2.6B 1 1 16 0 8
node2:
bash scripts/run_distributed_train.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 2.6B 1 1 16 8 8
node3:
bash scripts/run_distributed_train.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 2.6B 1 1 16 16 8
node4:
bash scripts/run_distributed_train.sh ./pangu-data/pangu_30_step_bs64 ./hccl_32p.json 32 fp32 2.6B 1 1 16 24 8
4. Verify that the network training succeeds.
5. Verify that the training log is normal.
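
The four per-node commands in step 3 differ only in their ninth argument (0, 8, 16, 24). A minimal Python sketch that regenerates them is shown here. It assumes, without confirmation from the script itself, that the ninth positional argument is the first rank on the node and the tenth is the number of devices on that node. The helper name `launch_commands` is hypothetical.

```python
# Hypothetical helper that rebuilds the per-node commands from step 3.
# Assumption (not confirmed by the issue): the ninth positional argument is
# the first rank on the node, and the tenth is the device count per node.

def launch_commands(total_devices=32, devices_per_node=8,
                    dataset="./pangu-data/pangu_30_step_bs64",
                    rank_table="./hccl_32p.json"):
    """Return one launch command per node, in node order."""
    if total_devices % devices_per_node != 0:
        raise ValueError("total_devices must be a multiple of devices_per_node")
    commands = []
    # Each node starts at a rank offset that is a multiple of devices_per_node.
    for rank_start in range(0, total_devices, devices_per_node):
        commands.append(
            f"bash scripts/run_distributed_train.sh {dataset} {rank_table} "
            f"{total_devices} fp32 2.6B 1 1 16 {rank_start} {devices_per_node}"
        )
    return commands
```

Calling `launch_commands()` with the defaults returns four strings, one per node, which can be compared against the commands listed in step 3 before launching.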

Describe the expected behavior (Mandatory)

The pangu-alpha network trains normally and the training log is normal.

Related log / screenshot (Mandatory)

[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.646 [ascend][curpid: 170385, 176108][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=29; gid=16; tid=1; timeout=5000ms; subevent_id=3269)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.646 [ascend][curpid: 170385, 176105][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=30; gid=16; tid=2; timeout=5000ms; subevent_id=2054)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.723 [ascend][curpid: 170385, 176108][drv][queuemng][QueueSendEventSync 487]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.732 [ascend][curpid: 170385, 176105][drv][queuemng][QueueSendEventSync 487]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.737 [ascend][curpid: 170385, 176108][drv][queuemng][QueueUnsubscribe 852]QueueSendEventSync failed. (ret=7; devid=0; qid=5)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.744 [ascend][curpid: 170385, 176105][drv][queuemng][QueueUnsubscribe 852]QueueSendEventSync failed. (ret=7; devid=0; qid=2)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.651 [ascend][curpid: 170385, 176106][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=31; gid=16; tid=3; timeout=5000ms; subevent_id=1095)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.634 [ascend][curpid: 170385, 176104][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=32; gid=16; tid=4; timeout=5000ms; subevent_id=397)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.639 [ascend][curpid: 170385, 176107][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=28; gid=16; tid=0; timeout=5000ms; subevent_id=1394)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.689 [ascend][curpid: 170385, 176106][drv][queuemng][QueueSendEventSync 487]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.721 [ascend][curpid: 170385, 176104][drv][queuemng][QueueSendEventSync 487]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.734 [ascend][curpid: 170385, 176107][drv][queuemng][QueueSendEventSync 487]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.743 [ascend][curpid: 170385, 176106][drv][queuemng][QueueUnsubscribe 852]QueueSendEventSync failed. (ret=7; devid=0; qid=3)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.752 [ascend][curpid: 170385, 176104][drv][queuemng][QueueUnsubscribe 852]QueueSendEventSync failed. (ret=7; devid=0; qid=1)
[ERROR] DRV(170385,python):2024-03-20-17:19:06.645.759 [ascend][curpid: 170385, 176107][drv][queuemng][QueueUnsubscribe 852]QueueSendEventSync failed. (ret=7; devid=0; qid=4)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.604 [ascend][curpid: 170385, 176104][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=31; gid=16; tid=3; timeout=5000ms; subevent_id=1096)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.618 [ascend][curpid: 170385, 176107][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=32; gid=16; tid=4; timeout=5000ms; subevent_id=398)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.626 [ascend][curpid: 170385, 176108][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=29; gid=16; tid=1; timeout=5000ms; subevent_id=3270)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.624 [ascend][curpid: 170385, 176105][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=30; gid=16; tid=2; timeout=5000ms; subevent_id=2055)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.631 [ascend][curpid: 170385, 176106][drv][event-sche][halEschedSubmitEventSync 220]Failed. (ret=16; event_id=28; gid=16; tid=0; timeout=5000ms; subevent_id=1395)
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.660 [ascend][curpid: 170385, 176104][drv][queuemng][QueueSendEventSyncTimeout 979]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.702 [ascend][curpid: 170385, 176107][drv][queuemng][QueueSendEventSyncTimeout 979]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.717 [ascend][curpid: 170385, 176108][drv][queuemng][QueueSendEventSyncTimeout 979]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.725 [ascend][curpid: 170385, 176105][drv][queuemng][QueueSendEventSyncTimeout 979]halEschedSubmitEventSync failed, ret(16).
[ERROR] DRV(170385,python):2024-03-20-17:19:11.761.733 [ascend][curpid: 170385, 176106][drv][queuemng][QueueSendEventSyncTimeout 979]halEschedSubmitEventSync failed, ret(16).
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.826 [npu_driver.cc:3658]176107 MemQueuePeek:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.836 [npu_driver.cc:3658]176104 MemQueuePeek:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.854 [npu_driver.cc:3658]176107 MemQueuePeek:[drv api] halQueuePeek failed: device_id=0, qid=4, timeout=800, drvRetCode=7.
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.858 [npu_driver.cc:3658]176104 MemQueuePeek:[drv api] halQueuePeek failed: device_id=0, qid=1, timeout=800, drvRetCode=7.
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.859 [npu_driver.cc:3658]176106 MemQueuePeek:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.863 [npu_driver.cc:3658]176105 MemQueuePeek:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.867 [npu_driver.cc:3658]176108 MemQueuePeek:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.881 [npu_driver.cc:3658]176106 MemQueuePeek:[drv api] halQueuePeek failed: device_id=0, qid=3, timeout=800, drvRetCode=7.
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.886 [npu_driver.cc:3658]176105 MemQueuePeek:[drv api] halQueuePeek failed: device_id=0, qid=2, timeout=800, drvRetCode=7.
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.890 [npu_driver.cc:3658]176108 MemQueuePeek:[drv api] halQueuePeek failed: device_id=0, qid=5, timeout=800, drvRetCode=7.
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.977 [api_c.cc:4182]176107 rtMemQueuePeek:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.977 [api_c.cc:4182]176104 rtMemQueuePeek:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.993 [error_message_manage.cc:53]176107 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.761.997 [error_message_manage.cc:53]176104 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.008 [error_message_manage.cc:53]176107 FuncErrorReason:rtMemQueuePeek execute failed, reason=[driver error:internal error]
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.010 [error_message_manage.cc:53]176104 FuncErrorReason:rtMemQueuePeek execute failed, reason=[driver error:internal error]
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.041 [api_c.cc:4182]176108 rtMemQueuePeek:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.041 [api_c.cc:4182]176105 rtMemQueuePeek:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] ASCENDCL(170385,python):2024-03-20-17:19:11.762.048 [tensor_data_transfer.cpp:588]176107 acltdtReceiveTensorV2: peek queue [4] failed
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.056 [error_message_manage.cc:53]176108 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.058 [api_c.cc:4182]176106 rtMemQueuePeek:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.061 [error_message_manage.cc:53]176105 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] ASCENDCL(170385,python):2024-03-20-17:19:11.762.063 [tensor_data_transfer.cpp:588]176104 acltdtReceiveTensorV2: peek queue [1] failed
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.081 [error_message_manage.cc:53]176108 FuncErrorReason:rtMemQueuePeek execute failed, reason=[driver error:internal error]
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.099 [error_message_manage.cc:53]176106 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.104 [error_message_manage.cc:53]176105 FuncErrorReason:rtMemQueuePeek execute failed, reason=[driver error:internal error]
[ERROR] RUNTIME(170385,python):2024-03-20-17:19:11.762.124 [error_message_manage.cc:53]176106 FuncErrorReason:rtMemQueuePeek execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(170385,python):2024-03-20-17:19:11.762.140 [tensor_data_transfer.cpp:588]176108 acltdtReceiveTensorV2: peek queue [5] failed
[ERROR] ASCENDCL(170385,python):2024-03-20-17:19:11.762.147 [tensor_data_transfer.cpp:588]176106 acltdtReceiveTensorV2: peek queue [3] failed
[ERROR] ASCENDCL(170385,python):2024-03-20-17:19:11.762.148 [tensor_data_transfer.cpp:588]176105 acltdtReceiveTensorV2: peek queue [2] failed
[ERROR] DEVICE(170385,fffcd0ee21e0,python):2024-03-20-17:19:11.762.113 [mindspore/ccsrc/plugin/device/ascend/hal/device/mbuf_receive_manager.cc:145] ReceiveAndProcessData] Channel ms_scalar_summary failed to receive tensor. Error code is 507899
[ERROR] DEVICE(170385,fffcd9dbc1e0,python):2024-03-20-17:19:11.762.126 [mindspore/ccsrc/plugin/device/ascend/hal/device/mbuf_receive_manager.cc:145] ReceiveAndProcessData] Channel ms_tensor_dump failed to receive tensor. Error code is 507899
[ERROR] DEVICE(170385,fffcd8dba1e0,python):2024-03-20-17:19:11.762.189 [mindspore/ccsrc/plugin/device/ascend/hal/device/mbuf_receive_manager.cc:145] ReceiveAndProcessData] Channel ms_image_summary failed to receive tensor. Error code is 507899
[ERROR] DEVICE(170385,fffcd95bb1e0,python):2024-03-20-17:19:11.762.189 [mindspore/ccsrc/plugin/device/ascend/hal/device/mbuf_receive_manager.cc:145] ReceiveAndProcessData] Channel ms_tensor_summary failed to receive tensor. Error code is 507899
[ERROR] DEVICE(170385,fffcbbfff1e0,python):2024-03-20-17:19:11.762.189 [mindspore/ccsrc/plugin/device/ascend/hal/device/mbuf_receive_manager.cc:145] ReceiveAndProcessData] Channel ms_histogram_summary failed to receive tensor. Error code is 507899
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.837.731 [engine.cc:1648]176800 ReportExceptProc:[COMP][DEFAULT]Real task exception! device_id=0, stream_id=102, task_id=2, task_type=1 (KERNEL_AICPU)
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.837.823 [engine.cc:1653]176800 ReportExceptProc:[COMP][DEFAULT]Task exception! device_id=0, stream_id=1, task_id=12, type=13(MODEL_EXECUTE), failuremode =1, retCode=0x91, [the model stream execute failed]
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.356 [stream.cc:3385]176800 EnterFailureAbort:[COMP][DEFAULT]stream_id=1 enter failure abort.
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.382 [stream.cc:1509]176800 GetError:[COMP][DEFAULT]Stream Synchronize failed, stream_id=1, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.404 [stream.cc:1509]176800 GetError:[COMP][DEFAULT]Stream Synchronize failed, stream_id=1, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.414 [task_info.cc:4924]176800 ReportErrorInfoForModelExecuteTask:[COMP][DEFAULT]model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.448 [task_info.cc:4897]176800 PrintErrorInfoForModelExecuteTask:[COMP][DEFAULT]model execute task failed, device_id=0, model stream_id=1, model task_id=12, flip_num=0, model_id=1, first_task_id=65535
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.468 [task_info.cc:1557]176800 PrintAicpuErrorInfo:[COMP][DEFAULT]report error module_type=0, module_name=E39999
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.483 [task_info.cc:1557]176800 PrintAicpuErrorInfo:[COMP][DEFAULT]Aicpu kernel execute failed, device_id=0, stream_id=102, task_id=2, errorCode=91.
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:38.839.724 [task_info.cc:1586]176800 PrintAicpuErrorInfo:[COMP][DEFAULT]Aicpu kernel execute failed, device_id=0, stream_id=102, task_id=2, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=(info_type:4, info_len:252, msg_info:Default/network-PanguAlphaTrainOneStepWithLossScaleCell/network-_VirtualDatasetCell/_backbone-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/embedding-EmbeddingLayer/dropout-Dropout/DropoutGenMask-op0).
[ERROR] DEVICE(170385,fffcb8ff91e0,python):2024-03-20-17:33:38.840.002 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:231] TaskExceptionCallback] Run Task failed, task_id: 2, stream_id: 102, tid: 170385, device_id: 0, retcode: 507011 ( model execute failed)
[ERROR] IDEDD(170385,python):2024-03-20-17:33:38.840.131 [dump_args.cpp:42][tid:176800]>>> [Dump][Exception] in infoAddr[(nil)]|atomicIndex[0]|argAddr[(nil)]|argsize[0] has invalid attribute.
[ERROR] GE(170385,python):2024-03-20-17:33:38.840.553 [error_tracking.cc:101]176800 ErrorTrackingCallback: ErrorNo: 4294967295(failed) [COMP][DEFAULT]Error happened, origin_op_name [Default/network-PanguAlphaTrainOneStepWithLossScaleCell/network-_VirtualDatasetCell/_backbone-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/embedding-EmbeddingLayer/dropout-Dropout/DropoutGenMask-op0;recompute_Default/network-PanguAlphaTrainOneStepWithLossScaleCell/network-_VirtualDatasetCell/_backbone-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/blocks-CellList/0-TransformerEncoderLayer/attention-MultiHeadAttention/prob_dropout-Dropout/DropoutGenMask-op0;Default/network-PanguAlphaTrainOneStepWithLossScaleCell/network-_VirtualDatasetCell/_backbone-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/embedding-EmbeddingLayer/dropout-Dropout/DropoutGenMask-op1;recompute_Default/ne
[ERROR] GE(170385,python):2024-03-20-17:33:38.840.640 [error_manager.cc:122]176800 ReportInnerError: [COMP][DEFAULT][Check][Param] FormatErrorMessage failed, ret:-1, file:error_tracking.cc, line:105
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.615 [engine.cc:2039]176800 SyncTask:[COMP][DEFAULT]Task Wait:stream_id=1 is ABORT.
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.647 [stream.cc:1509]176800 GetError:[COMP][DEFAULT]Stream Synchronize failed, stream_id=1, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.662 [stream.cc:1512]176800 GetError:[COMP][DEFAULT]report error module_type=0, module_name=E39999
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.673 [stream.cc:1512]176800 GetError:[COMP][DEFAULT]Aicpu kernel execute failed, device_id=0, stream_id=102, task_id=2, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=(info_type:4, info_len:252, msg_info:Default/network-PanguAlphaTrainOneStepWithLossScaleCell/network-_VirtualDatasetCell/_backbone-MicroBatchInterleaved/network-PanGUAlphaWithLoss/network-PanguAlphaModel/backbone-PanguAlpha_Model/embedding-EmbeddingLayer/dropout-Dropout/DropoutGenMask-op0).
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.716 [logger.cc:480]176800 StreamSynchronize:[COMP][DEFAULT]Stream synchronize failed, stream_id=1
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.769 [api_c.cc:803]176800 rtStreamSynchronizeWithTimeout:[COMP][DEFAULT]ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.782 [error_message_manage.cc:53]176800 FuncErrorReason:[COMP][DEFAULT]report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:33:43.953.798 [error_message_manage.cc:53]176800 FuncErrorReason:[COMP][DEFAULT]rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed]
[ERROR] ASCENDCL(170385,python):2024-03-20-17:33:43.953.837 [stream.cpp:143]176800 aclrtSynchronizeStreamWithTimeout: [COMP][DEFAULT]synchronize stream failed, runtime result = 507011
[ERROR] DEVICE(170385,fffcb8ff91e0,python):2024-03-20-17:33:43.953.883 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:257] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
[ERROR] DEVICE(170385,fffcb8ff91e0,python):2024-03-20-17:33:43.953.934 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:270] SyncAllStreams] SyncStream for stream id 0 failed.
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:01.270.682 [device_msg_handler.cc:159]170385 HandleMsgInHostBuf:[COMP][DEFAULT]
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.672 [stream.cc:2532]170385 WaitForTask:Task Wait: device_id=0, stream_id=1 is Abort, RunningState=0
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.719 [stream.cc:1509]170385 GetError:Stream Synchronize failed, stream_id=1, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.732 [logger.cc:480]170385 StreamSynchronize:Stream synchronize failed, stream_id=1
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.751 [api_c.cc:803]170385 rtStreamSynchronizeWithTimeout:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.760 [error_message_manage.cc:53]170385 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.348.773 [error_message_manage.cc:53]170385 FuncErrorReason:rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed]
[ERROR] ASCENDCL(170385,python):2024-03-20-17:39:02.348.863 [stream.cpp:143]170385 aclrtSynchronizeStreamWithTimeout: synchronize stream failed, runtime result = 507011
[ERROR] DEVICE(170385,ffffb96290b0,python):2024-03-20-17:39:02.348.900 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_stream_manager.cc:257] SyncStream] Call runtime aclrtSynchronizeStreamWithTimeout error.
[ERROR] DEVICE(170385,ffffb96290b0,python):2024-03-20-17:39:02.348.938 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:575] SyncStream] Sync default stream failed.
[ERROR] DEVICE(170385,ffffb96290b0,python):2024-03-20-17:39:02.348.988 [mindspore/ccsrc/runtime/device/kernel_runtime_manager.cc:134] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.349.020 [stream.cc:2532]170385 WaitForTask:Task Wait: device_id=0, stream_id=1 is Abort, RunningState=0
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.349.034 [stream.cc:1509]170385 GetError:Stream Synchronize failed, stream_id=1, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.349.044 [logger.cc:480]170385 StreamSynchronize:Stream synchronize failed, stream_id=1
[ERROR] RUNTIME(170385,python):2024-03-20-17:39:02.349.055 [api_c.cc:803]170385 rtStreamSynchronizeWithTimeout:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
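
The log above mixes errors from several modules (DRV, RUNTIME, ASCENDCL, DEVICE, IDEDD, GE), and lines are not always in timestamp order. As a reading aid, here is a minimal Python sketch that counts `[ERROR]` lines per module and records the earliest one. The regular expression is inferred from the lines shown in this issue only, and the function name `summarize` is hypothetical. It is not a general parser for these logs.

```python
import re
from collections import OrderedDict

# Prefix format inferred from the lines above, e.g.
# "[ERROR] DRV(170385,python):2024-03-20-17:19:06.641.646 ..."
# DEVICE lines carry an extra hex thread id between the pid and "python".
LINE_RE = re.compile(
    r"^\[ERROR\] (?P<module>\w+)\((?P<pid>\d+),(?:[0-9a-f]+,)?python\):"
    r"(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}\.\d{3}) (?P<msg>.*)$"
)


def summarize(lines):
    """Return {module: (count, earliest_timestamp, earliest_message)}.

    Modules appear in the order in which they are first seen.
    """
    summary = OrderedDict()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an [ERROR] line in the expected format
        module, ts, msg = match["module"], match["ts"], match["msg"]
        if module not in summary:
            summary[module] = (1, ts, msg)
            continue
        count, first_ts, first_msg = summary[module]
        # Timestamps are fixed-width, so string comparison orders them.
        if ts < first_ts:
            first_ts, first_msg = ts, msg
        summary[module] = (count + 1, first_ts, first_msg)
    return summary
```

Passing the lines of a device log to `summarize` yields one entry per module, which makes it easier to see which module reported first and how many lines each contributed.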

Special notes for this issue (Optional)

Assign to 张银霞.

Comments (7)

zhongjicheng created the Bug-Report
zhongjicheng added the label sig/ascend
zhongjicheng added the label attr/function
zhongjicheng added the label stage/func-debug
zhongjicheng added the label kind/bug
zhongjicheng added the label device/ascend
zhongjicheng added the label v2.3.0
zhongjicheng added the label v2.3.0.alpha

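Please assign a maintainer to check this issue.
@zhongjicheng

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  3. If you run into a dynamic graph problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.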
zhongjicheng edited the description
zhongjicheng set the assignee to zhangyinxia
zhangyinxia added the label rct/cann
zhangyinxia added the label rct/cann
zhangyinxia added collaborator zhangyinxia
zhangyinxia changed the assignee from zhangyinxia to zhongjicheng
zhongjicheng changed the assignee from zhongjicheng to zhangyinxia
zhangyinxia added collaborator zhangyinxia
zhangyinxia changed the assignee from zhangyinxia to zhongjicheng

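According to the system logs, the first point of failure is in HCCP.
DTS ticket: https://dts-szv.clouddragon.huawei.com/DTSPortal/ticket/DTS2024032508696

@zhangyinxia These are HiSilicon error logs and do not affect network training. I confirmed with colleagues on the HiSilicon driver team that this was already fixed on CANN master in January, but our 330 release still uses the C15 packages 1 and 2, so the error logs still appear. The HiSilicon CCB decided not to merge the fix back into C15 (see the DTS ticket). The problem should be resolved once packages 1 and 2 are upgraded in Q2.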

i-robot added the label dts-szv
zhongjicheng added the label master

This version also shows the issue:
runpkg_version: Milan_C17/20240414
MindSpore: 2.3.0.B521:master_20240517221756_2f410aa8a72d8e9b2cf3cd2fd05903cf307e3768

zhangyinxia added collaborator zhangyinxia
zhangyinxia changed the assignee from zhangyinxia to zhongjicheng

The problem is resolved in version br_base_20240601192830_01c2bce3e8dc6b9997f149798fb96b02a608a3e1; closing the ticket.

zhangyinxia changed the task status from TODO to VALIDATION

Regression version:
mindspore: r2.3.br_base_20240601192830_01c2bce3e8dc6b9997f149798fb96b02a608a3e1
runpkg_version: Milan_C18/20240522

Regression steps: follow the reproduction steps in this issue
Basic functionality: the problem is resolved

[jenkins0@xxx test_mf_pangu_2_6b_train_check_loss_910_pangudata_32p_0001]$ grep -rn ERROR /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/pangu/train/test_mf_pangu_2_6b_train_check_loss_910_pangudata_32p_0001/device0/log0.log
[jenkins0@xxx test_mf_pangu_2_6b_train_check_loss_910_pangudata_32p_0001]$

Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2024-05-3
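
The regression check above amounts to confirming that `log0.log` contains no line with the string `ERROR`. A minimal Python sketch of the same check is given here for cases where it needs to run inside a test script rather than a shell. The function name `find_error_lines` is hypothetical and is not part of the test suite referenced in this issue.

```python
from pathlib import Path


def find_error_lines(log_path, marker="ERROR"):
    """Return (line_number, text) pairs for lines containing `marker`.

    This mirrors `grep -n ERROR <file>` for a single file.
    """
    hits = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if marker in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

An empty list corresponds to the empty grep output shown in the regression run; a non-empty list gives the line numbers to inspect.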

zhongjicheng changed the task status from VALIDATION to DONE
zhongjicheng changed the milestone from B-SIG-ASCEND to B-SolutionTest
