kdump trigger test fails on a Taishan four-socket server

Done
Task
Opened this issue  
2020-05-28 22:32

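During a kdump test on a four-socket Taishan server, the second (capture) kernel hit an OOM after kdump switched into it.
Based on the error messages, the following has been tried so far:
1. The OOM was first reported by systemd-udev. Following earlier experience, the udev.children-max=1 (and udev.children-max=4) parameter was added, but the OOM was still reported.
2. crashkernel was raised to 1024M, but the OOM was still reported.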
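
Before retrying with a different size, it may be worth confirming which crashkernel value the first kernel actually booted with. The helper below is only a sketch: the function name crashkernel_value is made up for illustration, while /proc/cmdline and /sys/kernel/kexec_crash_size are standard kernel interfaces.

```shell
#!/bin/sh
# Print the crashkernel= value found in a kernel command line string.
# Prints nothing if the parameter is absent. (Hypothetical helper name.)
crashkernel_value() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | head -n 1
}

# Usage on a live system:
#   crashkernel_value "$(cat /proc/cmdline)"   # value requested at boot
#   cat /sys/kernel/kexec_crash_size           # bytes actually reserved
```

If the reserved size reads 0, the reservation itself failed and the capture kernel has no memory to boot into, which is a different problem from an OOM inside the capture kernel.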

Comments (8)

Alex_Chao created the task
Alex_Chao set related repository to openEuler/kernel

Hey @Alex_Chao , Welcome to openEuler Community.
All of the projects in openEuler Community are maintained by @openeuler-ci-bot .
That means the developers can comment below every pull request or issue to trigger Bot Commands.
Please follow instructions at https://gitee.com/openeuler/community/blob/master/en/sig-infrastructure/command.md to find the details.

Alex_Chao assigned collaborator XieXiuQi
Alex_Chao set assignee to Alex_Chao
Alex_Chao added kind/failing-test label

You could first try removing the network-related drivers so that they are not loaded.
In /etc/kdump.conf, change the following option to:
dracut_args --omit-drivers "mdio-gpi usb_8dev et1011c rt2x00usb bcm-phy-lib mac80211_hwsim rtl8723be rndis_host hns3_cae amd vrf rtl8192cu mt76x02-lib int51x1 ppp_deflate team_mode_loadbalance smsc911x aweth bonding mwifiex_usb hnae dnet rt2x00pci vaser_pci hdlc_ppp marvell rtl8xxxu mlxsw_i2c ath9k_htc rtl8150 smc91x cortina at803x rockchip cxgb4 spi_ks8995 mt76x2u smsc9420 mdio-cavium bnxt_en ch9200 dummy macsec ice mt7601u rtl8188ee ixgbevf net1080 liquidio_vf be2net mlxsw_switchx2 gl620a xilinx_gmii2rgmii ppp_generic rtl8192de sja1000_platform ath10k_core cc770_platform realte igb c_can_platform c_can ethoc dm9601 smsc95xx lg-vl600 ifb enic ath9 mdio-octeon ppp_mppe ath10k_pci cc770 team_mode_activebackup marvell10g hinic rt2x00lib mlx4_en iavf broadcom igc c_can_pci alx rtl8192se rtl8723ae microchip lan78xx atl1c rtl8192c-common almia ax88179_178a qed netxen_nic brcmsmac rt2800usb e1000 qla3xxx mdio-bitbang qsemi mdio-mscc-miim plx_pci ipvlan r8152 cx82310_eth slhc mt76x02-usb ems_pci xen-netfront usbnet pppoe mlxsw_minimal mlxsw_spectrum cdc_ncm rt2800lib rtl_usb hnae3 ath9k_common ath9k_hw catc mt76 hns_enet_drv ppp_async huawei_cdc_ncm i40e rtl8192ce dl2 qmi_wwan mii peak_usb plusb can-dev slcan amd-xgbe team_mode_roundrobin ste10Xp thunder_xcv pptp thunder_bgx ixgbe davicom icplus tap tun smsc75xx smsc dlci hns_dsaf mlxsw_core rt2800mmi softing uPD60620 vaser_usb dp83867 brcmfmac mwifiex_pcie mlx4_core micrel team macvlan bnx2 virtio_net rtl_pci zaurus hns_mdi libcxgb hv_netvsc nicvf mt76x0u teranetics mlxfw cdc_eem qcom-emac pppox mt76-usb sierra_net i40evf bcm87xx mwifiex pegasus rt2x00mmi sja1000 ena hclgevf cnic cxgb4vf ppp_synctty iwlmvm team_mode_broadcast vxlan vsockmon hdlc_cisc rtl8723-common bsd_comp fakelb dp83822 dp83tc811 cicada fm10 8139t sfc hs geneve hclge xgene-enet-v2 cdc_mbim hdlc asix netdevsim rt2800pci team_mode_random lxt ems_usb mlxsw_pci sr9700 mdio-thunder mlxsw_switchib macvtap atlantic cdc_ether mcs7830 nicpf mdi peak_pci atl1e 
cdc_subset ipvtap btcoexist mt76x0-common veth slip iwldvm bcm7xxx vitesse netconsole epic100 myri10ge r8169 qede microchip_t1 liquidi bnx2x brcmutil mwifiex_sdi mlx5_core rtlwifi vmxnet3 nlmon hns3 hdlc_raw esd_usb2 atl2 mt76x2-common iwlwifi mdio-bcm-unimac national ath rtwpci rtw88 nfp rtl8821ae fjes thunderbolt-net 8139cp atl1 mscc vcan dp83848 dp83640 hdlc_fr e1000e ipheth net_failover aquantia rtl8192ee igbvf rocker intel-xway tg3" --omit "ramdisk network ifcfg qemu-net" --install "chmod findmnt du gzip gunzip export_kbox_img_to_txt awk set_reboot_timer.sh" --nofscks

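Then delete the kdump initrd and restart the kdump service.
We have seen network-related drivers consume a large amount of memory before; in the commercial release we removed the network drivers and the network dump feature.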
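
The "delete the kdump initrd, restart the service" step could look roughly like this. It is a sketch under assumptions: the /boot/initramfs-<release>kdump.img naming is the RHEL-style default and may differ on a given install, and both function names are made up for illustration.

```shell
#!/bin/sh
# Path of the kdump initramfs for a kernel release (defaults to the running kernel).
# Assumption: the image follows the /boot/initramfs-<release>kdump.img naming.
kdump_initrd_path() {
    printf '/boot/initramfs-%skdump.img\n' "${1:-$(uname -r)}"
}

# Remove the stale image, then restart kdump so the image is regenerated
# from the current /etc/kdump.conf. Must be run as root.
rebuild_kdump_initrd() {
    rm -f "$(kdump_initrd_path "$1")"
    systemctl restart kdump
}

# Usage:
#   rebuild_kdump_initrd            # for the running kernel
```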

The method from the first reply did not solve the problem. Still investigating.

Please post the kdump settings and the boot parameters of the capture kernel so we can take a look.

XieXiuQi changed issue state from To do to In progress
XieXiuQi changed issue state from In progress to To do
XieXiuQi added discussion label

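After investigation, two causes of the dump generation failure have been identified:
1. The tester manually upgraded the NIC outbox driver, so the original --omit-drivers argument for the second kernel no longer took effect. Possible fixes: (1) set crashkernel=2G (the 1822 NIC driver consumes a very large amount of memory); or (2) unpack the second kernel's initrd with cpio, delete the outbox driver from it, and repack it with cpio/gzip.

2. The onboard SAS driver cannot run on a single CPU. The workaround is to edit /etc/sysconfig/kdump, add the hisi_sas_v3_hw.user_ctl_irq=1 parameter to KDUMP_COMMANDLINE_APPEND, and restart the kdump service.
HiSilicon has already merged the patch for this problem into the Euler 2.8 kernel 4.19.90-2001.1.0~3975 (commit id: e3b9140), but why it does not take effect on openEuler 20.03 LTS still needs further investigation.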
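
The /etc/sysconfig/kdump workaround can be scripted along these lines. This is a minimal sketch, assuming KDUMP_COMMANDLINE_APPEND is defined on a single line as KDUMP_COMMANDLINE_APPEND="..." and that GNU sed is available; the function name append_kdump_arg is made up for illustration.

```shell
#!/bin/sh
# Append one kernel parameter to KDUMP_COMMANDLINE_APPEND in a kdump sysconfig file.
# Assumes a single-line, double-quoted assignment and a parameter without '|' or '&'.
append_kdump_arg() {
    conf="$1"   # e.g. /etc/sysconfig/kdump
    arg="$2"    # e.g. hisi_sas_v3_hw.user_ctl_irq=1

    # Leave the file alone if the parameter is already on that line.
    if grep '^KDUMP_COMMANDLINE_APPEND=' "$conf" | grep -qF -- "$arg"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(KDUMP_COMMANDLINE_APPEND=\"[^\"]*\)\"|\1 ${arg}\"|" "$conf"
}

# Usage (as root):
#   append_kdump_arg /etc/sysconfig/kdump hisi_sas_v3_hw.user_ctl_irq=1
#   systemctl restart kdump
```

The presence check makes the function safe to run more than once, so it can sit in a provisioning script without duplicating the parameter.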

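@Alex_Chao To summarize: on the Taishan four-socket server running openEuler 20.03 LTS, set crashkernel=2G, edit /etc/sysconfig/kdump to add the hisi_sas_v3_hw.user_ctl_irq=1 parameter to KDUMP_COMMANDLINE_APPEND, and then restart the kdump service. Does that resolve the problem?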

XieXiuQi changed issue state from To do to Done
