openEuler / kernel

GPT2 with batch_size=4 run through all steps: system intermittently panics or prints errors

Accepted
Bug
Opened this issue 2023-09-15 18:16

【Environment】

6.4.0-8.0.0.16.oe2309.aarch64

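【Steps to reproduce】 Please describe the specific steps
Train GPT2 with batch_size=4 and let the run go through all steps.
【Actual result】 Please describe the faulty result and its impact
The system panics when training finishes.
【Other relevant attachments】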

Comments (5)

Hi zhangxiaofeng-melody, welcome to the openEuler Community.
I'm the Bot here serving you. You can find the instructions on how to interact with me at Here.
If you have any questions, please contact the SIG: Kernel, and any of the maintainers.

openeuler-ci-bot added the sig/Kernel label
zhangxiaofeng_melody set assignee to Fcc
zhangxiaofeng_melody set priority to Main

This issue could not be reproduced on several machines; after analysis, the impact is considered controllable.

zhangxiaofeng_melody changed issue state from To Do to Accepted
zhangxiaofeng_melody changed description
zhangxiaofeng_melody changed issue state from Accepted to Suspended

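The handling of VMA synchronization was flawed: huge pages were not enabled, so huge pages and regular pages ended up coexisting, which caused problems in use.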
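When trying to reproduce this, it can help to record the huge page state of the machine before the training run. The comment above does not say which huge page mechanism is involved, so the following is only a sketch that assumes the standard Linux interfaces `/sys/kernel/mm/transparent_hugepage/enabled` and `/proc/meminfo` are the relevant ones. The helper names `parse_thp_mode`, `parse_meminfo`, and `report` are made up for illustration.

```python
import re
from pathlib import Path

# Standard Linux paths; whether they are the ones relevant to this bug is an assumption.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
MEMINFO = Path("/proc/meminfo")


def parse_thp_mode(text):
    """Return the active (bracketed) mode from text like 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def parse_meminfo(text, keys=("AnonHugePages", "HugePages_Total", "Hugepagesize")):
    """Extract the numeric value of selected /proc/meminfo fields."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            values[name] = int(rest.split()[0])
    return values


def report():
    """Print the huge page state of the current machine, if the files exist."""
    if THP_ENABLED.exists():
        print("THP mode:", parse_thp_mode(THP_ENABLED.read_text()))
    if MEMINFO.exists():
        for name, value in parse_meminfo(MEMINFO.read_text()).items():
            print(f"{name}: {value}")


if __name__ == "__main__":
    report()
```

Comparing this output between a machine that panics and one that does not may show whether the two differ in huge page configuration.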

zhangxiaofeng_melody changed issue state from Suspended to Done
zhangxiaofeng_melody changed issue state from Done to Accepted

2023-09-25 18:57:19,128 - mindformers - INFO - .........Build Callbacks For Train..........
2023-09-25 18:57:19,129 - mindformers - INFO - .........Build Callbacks for Train From Config..........
2023-09-25 18:57:19,130 - mindformers - INFO - .........Build Running Wrapper From Config For Train..........
2023-09-25 18:57:19,130 - mindformers - INFO - .........Build Model Wrapper for Train From Config..........
2023-09-25 18:57:19,141 - mindformers - INFO - .........Starting Init Train Model..........
2023-09-25 18:57:19,142 - mindformers - INFO - .........Starting Training Model..........
2023-09-25 18:57:19,142 - mindformers - INFO - .........Model Compiling, Please Wait a Moment...........
2023-09-25 20:02:17,571 - mindformers - INFO - Epoch:[ 1/ 2], step:[ 591/ 591], loss:[6.594/6.594], time:3890915.021 ms, lr:5.5152283e-05, overflow cond: False, loss_scale: 16384.0
2023-09-25 20:02:51,406 - mindformers - INFO - Epoch time: 3932263.041 ms, per step time: 6653.575 ms, avg loss: 6.594
2023-09-25 21:07:08,583 - mindformers - INFO - Epoch:[ 2/ 2], step:[ 591/ 591], loss:[6.702/6.702], time:3857161.517 ms, lr:1.0152285e-05, overflow cond: False, loss_scale: 16384.0
2023-09-25 21:07:41,340 - mindformers - INFO - Epoch time: 3889930.756 ms, per step time: 6581.947 ms, avg loss: 6.702
2023-09-25 21:07:41,346 - mindformers - INFO - .........Training Over!.............
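The log above can be sanity-checked mechanically. The sketch below assumes the mindformers line format is exactly as shown in this log; the regular expressions and the helper names `check_epochs` and `training_finished` are illustrative and not part of mindformers. It confirms that the run reached "Training Over!" and that each reported per-step time equals the epoch time divided by the step count.

```python
import re

# Patterns written against the log lines quoted in this issue (an assumption
# about the format, not a documented mindformers interface).
EPOCH_RE = re.compile(r"Epoch time: ([\d.]+) ms, per step time: ([\d.]+) ms")
STEP_RE = re.compile(r"step:\[\s*(\d+)/\s*(\d+)\]")


def check_epochs(log_text):
    """Return (total_steps, epoch_ms, reported_step_ms, computed_step_ms) per epoch."""
    results = []
    total_steps = None
    for line in log_text.splitlines():
        step = STEP_RE.search(line)
        if step:
            total_steps = int(step.group(2))
        epoch = EPOCH_RE.search(line)
        if epoch and total_steps:
            epoch_ms = float(epoch.group(1))
            reported_ms = float(epoch.group(2))
            results.append((total_steps, epoch_ms, reported_ms, epoch_ms / total_steps))
    return results


def training_finished(log_text):
    """True if the log contains the final 'Training Over!' marker."""
    return "Training Over!" in log_text


if __name__ == "__main__":
    sample = (
        "INFO - Epoch:[ 1/ 2], step:[ 591/ 591], loss:[6.594/6.594]\n"
        "INFO - Epoch time: 3932263.041 ms, per step time: 6653.575 ms, avg loss: 6.594\n"
        "INFO - .........Training Over!.............\n"
    )
    for steps, epoch_ms, reported, computed in check_epochs(sample):
        print(f"steps={steps} reported={reported:.3f} computed={computed:.3f}")
    print("finished:", training_finished(sample))
```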
