[openEuler 21.09] raid1 long-term stability test hits BUG_ON: NULL pointer dereference

Completed
Task
Created on 2021-11-26 15:07

[ 5020.221265] md/raid1:md1: Disk failure on sda, disabling device.
[ 5020.221265] md/raid1:md1: Operation continuing on 1 devices.
[ 5030.437252] VFS: Open an exclusive opened block device for write sda [88421 mdadm].
[ 5030.471848] VFS: Open an exclusive opened block device for write sda [88421 mdadm].
[ 5030.764511] md: recovery of RAID array md1
[ 5040.036344] sd 5:0:16:0: [sdf] tag#2915 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[ 5040.045871] sd 5:0:16:0: Power-on or device reset occurred
[ 5037.167756] postfix/postdrop[88307]: warning: unable to look up public/pickup: No such file or directory
[ 5070.034669] sd 5:0:17:0: [sdg] tag#717 BRCM Debug mfi stat 0x2d, data len requested/completed 0x20000/0x0
[ 5070.044200] sd 5:0:17:0: Power-on or device reset occurred
[ 5075.053983] md/raid1:md0: Disk failure on sdf, disabling device.
[ 5075.053983] md/raid1:md0: Operation continuing on 1 devices.
[ 5100.047383] sd 5:0:16:0: [sdf] tag#740 BRCM Debug mfi stat 0x2d, data len requested/completed 0x100000/0x0
[ 5100.057028] sd 5:0:16:0: Power-on or device reset occurred
[ 5100.062304] md: md0: recovery interrupted.
[ 5130.031484] sd 5:0:17:0: [sdg] tag#625 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[ 5130.040925] sd 5:0:17:0: Power-on or device reset occurred
[ 5126.953001] postfix/postdrop[89034]: warning: unable to look up public/pickup: No such file or directory
[ 5130.146647] md: recovery of RAID array md0
[ 5160.029920] sd 5:0:16:0: [sdf] tag#904 BRCM Debug mfi stat 0x2d, data len requested/completed 0x6d000/0x0
[ 5160.039461] sd 5:0:16:0: Power-on or device reset occurred
[ 5190.041709] sd 5:0:17:0: Power-on or device reset occurred
[ 5195.400563] md: md1: recovery interrupted.
[ 5195.708504] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[ 5195.717254] Mem abort info:
[ 5195.720040] ESR = 0x96000006
[ 5195.723086] Exception class = DABT (current EL), IL = 32 bits
[ 5195.728982] SET = 0, FnV = 0
[ 5195.732028] EA = 0, S1PTW = 0
[ 5195.735159] Data abort info:
[ 5195.738028] ISV = 0, ISS = 0x00000006
[ 5195.741849] CM = 0, WnR = 0
[ 5195.744808] user pgtable: 4k pages, 48-bit VAs, pgdp = 000000005cc8cdce
[ 5195.760064] Internal error: Oops: 96000006 [#1] SMP
[ 5195.835238] ip_tables realtek hns3 hclge hibmc_drm megaraid_sas hnae3 hisi_sas_v3_hw hisi_sas_main ipmi_si ipmi_devintf ipmi_msghandler
[ 5195.847446] Process fio (pid: 78987, stack limit = 0x0000000086ac644b)
[ 5195.853942] CPU: 39 PID: 78987 Comm: fio Kdump: loaded Not tainted 4.19.59+ #1
[ 5195.861130] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 0.10 04/27/2019
[ 5195.868491] pstate: 60400009 (nZCv daif +PAN -UAO)
[ 5195.873262] pc : raid1_write_request+0x2a0/0xa88
[ 5195.877858] lr : raid1_write_request+0x258/0xa88
[ 5195.882452] sp : ffff00001a98b6d0
[ 5195.885753] x29: ffff00001a98b6d0 x28: ffff803da4988680
[ 5195.891040] x27: 0000000000000000 x26: ffff803db2ef5200
[ 5195.896326] x25: ffff000008b9f000 x24: ffff000008b9c000
[ 5195.901613] x23: ffff8027c1203c00 x22: 0000000000000001
[ 5195.906900] x21: 0000000000000004 x20: ffff803dab543600
[ 5195.912188] x19: ffff802732268000 x18: 0000000000000000
[ 5195.917474] x17: 0000000000000000 x16: 0000000000000000
[ 5195.922763] x15: 0000000000000000 x14: 0000000000000000
[ 5195.928050] x13: 0000000000000000 x12: ffffa03db4079f10
[ 5195.933336] x11: 0000000000000800 x10: ffff803da4988698
[ 5195.938625] x9 : ffff803da49886d0 x8 : 0000000000000000
[ 5195.943914] x7 : 0000000000000000 x6 : ffff803db2ef4a00
[ 5195.949200] x5 : 0000000000025d11 x4 : ffff803dbfb28e40
[ 5195.954487] x3 : 0000000000010000 x2 : 0000000023f77800
[ 5195.959774] x1 : ffff80273133d000 x0 : 0000000000000000
[ 5195.965061] Call trace:
[ 5195.967496] raid1_write_request+0x2a0/0xa88
[ 5195.971745] raid1_make_request+0xc8/0x120
[ 5195.975823] md_handle_request+0x11c/0x1b8
[ 5195.979901] md_make_request+0x90/0x1e0
[ 5195.983720] generic_make_request+0x174/0x350
[ 5195.988057] submit_bio+0x5c/0x198
[ 5195.991445] __blockdev_direct_IO+0x195c/0x1ad0
[ 5195.995957] ext4_direct_IO+0x28c/0x7e0
[ 5195.999777] generic_file_direct_write+0x94/0x1a0
[ 5196.004460] __generic_file_write_iter+0xb0/0x1c8
[ 5196.009143] ext4_file_write_iter+0x120/0x3e8
[ 5196.013480] __vfs_write+0x11c/0x190
[ 5196.017040] vfs_write+0xac/0x1c0
[ 5196.020340] ksys_pwrite64+0x8c/0xd0
[ 5196.023900] __arm64_sys_pwrite64+0x28/0x38
[ 5196.028064] el0_svc_common+0xac/0x1e8
[ 5196.031797] el0_svc_handler+0x38/0x78
[ 5196.035529] el0_svc+0x8/0xc
[ 5196.038398] Code: f94006e0 f9400782 f9400741 f8686800 (f9402c00)
[ 5196.044464] ---[ end trace 3097304ac513d089 ]---
[ 5196.049059] Kernel panic - not syncing: Fatal exception
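
The faulting access can be cross-checked against the `Code:` line of the oops, where the word in parentheses is the instruction at `pc`. Below is a minimal sketch, assuming only the standard A64 encoding of `LDR Xt, [Xn, #imm]` (64-bit, unsigned offset). The struct and function names (`ldr_x_imm`, `decode_ldr_x_imm`, `print_faulting_insn`) are hypothetical helpers written for this note and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an A64 "LDR Xt, [Xn, #imm]" (64-bit, unsigned-offset form). */
struct ldr_x_imm {
	unsigned rt;     /* destination register number */
	unsigned rn;     /* base register number */
	unsigned offset; /* byte offset = imm12 * 8 */
};

/* Returns 1 and fills *out if insn is LDR Xt, [Xn, #imm]; otherwise 0. */
static int decode_ldr_x_imm(uint32_t insn, struct ldr_x_imm *out)
{
	assert(out != NULL);
	/* Bits 31..22 must be 1111100101 for this instruction form. */
	if ((insn & 0xFFC00000u) != 0xF9400000u)
		return 0;
	out->rt = insn & 0x1Fu;
	out->rn = (insn >> 5) & 0x1Fu;
	out->offset = ((insn >> 10) & 0xFFFu) * 8u;
	return 1;
}

/* Print the decoded form of the faulting word from the oops "Code:" line. */
static void print_faulting_insn(void)
{
	struct ldr_x_imm d;

	if (decode_ldr_x_imm(0xf9402c00u, &d))
		printf("ldr x%u, [x%u, #0x%x]\n", d.rt, d.rn, d.offset);
}
```

Calling `print_faulting_insn()` from a `main` prints `ldr x0, [x0, #0x58]`. With `x0 = 0` in the register dump, that load targets address 0x58, which matches the reported fault address and is consistent with reading a field at offset 0x58 through a NULL pointer. Which structure field sits at that offset is not shown by the decode itself.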

Comments (4)

Qiuuuuu created the defect

Hi qiuuuuu, welcome to the openEuler Community.
I'm the Bot here serving you. You can find the instructions on how to interact with me at
https://gitee.com/openeuler/community/blob/master/en/sig-infrastructure/command.md.
If you have any questions, please contact the SIG: Kernel, and any of the maintainers: @Xie XiuQi, @YangYingliang, @成坚 (CHENG Jian).

openeuler-ci-bot added the sig/Kernel label

Steps to reproduce:
(1) Add delays to the kernel:

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 54010675df9a..dcb6d3bd2468 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -55,6 +55,7 @@
  */
 #define NR_RAID1_BIOS 256
 
+int debug_read = 0;
 /* when we get a read error on a read-only array, we redirect to another
  * device without failing the first device, or trying to over-write to
  * correct the read error.  To keep track of bad blocks on a per-bio
@@ -1377,8 +1378,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			continue;
 		}
+		if (i == 0) {
+			printk("%s %d i=%d bio wait 5s to nr_pending++\n", __FUNCTION__, __LINE__, i);
+			debug_read = 1;
+			msleep(5000);
+		}
 
 		atomic_inc(&rdev->nr_pending);
+		if (i == 0)
+			printk("%s %d wait end\n", __FUNCTION__, __LINE__);
 		if (test_bit(WriteErrorSeen, &rdev->flags)) {
 			sector_t first_bad;
 			int bad_sectors;
@@ -1490,8 +1498,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 		r1_bio->bios[i] = mbio;
+		if (i == 0) {
+			printk("%s %d read conf->mirrors[0].rdev\n", __FUNCTION__, __LINE__);
+		}
 		mbio->bi_iter.bi_sector = (r1_bio->sector +
 				   conf->mirrors[i].rdev->data_offset);
+		}
+		bio_set_dev(mbio, conf->mirrors[i].rdev->bdev);
 		mbio->bi_end_io = raid1_end_write_request;
 		mbio->bi_opf = bio_op(bio) | (bio->bi_opf & (REQ_SYNC | REQ_FUA));
@@ -1804,6 +1817,12 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			goto abort;
 		}
 		p->rdev = NULL;
+		while (debug_read == 1) {
+			printk("%s %d wait 2s\n", __FUNCTION__, __LINE__);
+			msleep(2000);
+		}
+		printk("%s %d wait all 2s end\n", __FUNCTION__, __LINE__);
+		debug_read = 0;
 		if (!test_bit(RemoveSynchronized, &rdev->flags)) {
 			synchronize_rcu();
 			if (atomic_read(&rdev->nr_pending)) {

(2) Create the RAID array and operate on it as follows:
mdadm -CR /dev/md1 -l 1 -n 2 /dev/sd[ab] --assume-clean
mdadm /dev/md1 -f /dev/sda
mdadm /dev/md1 -r /dev/sda
mdadm /dev/md1 -a /dev/sda # start recovery

dd if=/dev/zero of=/dev/md1 bs=4k count=1 oflag=direct

mdadm /dev/md1 -f /dev/sdb

raid1 race scenario

raid1_write_request                             md_check_recovery                               mdadm set(/dev/sdb) faulty

rcu_read_lock()
rdev != NULL
!test_bit(Faulty, &rdev->flags)
                                                                                                conf->recovery_disabled = mddev->recovery_disabled;
                                                                                                return busy;
                                                remove_and_add_spares
                                                raid1_remove_disk
                                                p->rdev = NULL

atomic_inc(&rdev->nr_pending);

rcu_read_unlock()

mbio->bi_iter.bi_sector = (r1_bio->sector +
        conf->mirrors[i].rdev->data_offset);
NULL pointer dereference

                                                if (!test_bit(RemoveSynchronized, &rdev->flags))
                                                    synchronize_rcu();
                                                    p->rdev = rdev

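
The interleaving above can be replayed in a small userspace model. This is only a sketch under stated assumptions: it is single-threaded and deterministic, so it reproduces the ordering of the steps, not real concurrency or RCU. The types `model_rdev`, `model_mirror` and `replay_result`, the function `replay_race`, and the `data_offset` value 2048 are all hypothetical stand-ins and are not the kernel's `struct md_rdev` or `conf->mirrors[]`.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the rdev and the mirror slot (not kernel types). */
struct model_rdev {
	int nr_pending;
	int faulty;
	unsigned long data_offset; /* arbitrary illustrative value */
};

struct model_mirror {
	struct model_rdev *rdev;
};

struct replay_result {
	int writer_saw_null; /* writer's second read of the slot returned NULL */
	int nr_pending;      /* count taken through the writer's local pointer */
	int slot_restored;   /* remover put the pointer back afterwards */
};

/* Replay the steps of the race table in the order they are listed. */
static struct replay_result replay_race(void)
{
	struct model_rdev sda = { 0, 0, 2048 };
	struct model_mirror slot = { &sda };
	struct replay_result res = { 0, 0, 0 };

	/* Writer, inside rcu_read_lock(): sample the slot, check Faulty. */
	struct model_rdev *rdev = slot.rdev;
	assert(rdev != NULL && !rdev->faulty);

	/* Remover: raid1_remove_disk() clears the slot (p->rdev = NULL). */
	struct model_rdev *removed = slot.rdev;
	slot.rdev = NULL;

	/* Writer: nr_pending++ via the local pointer, then rcu_read_unlock(). */
	rdev->nr_pending++;

	/* Writer: reads the slot again instead of using the local pointer. */
	if (slot.rdev == NULL)
		res.writer_saw_null = 1; /* the kernel dereferences NULL here */

	/* Remover: after synchronize_rcu(), nr_pending != 0, so it backs out. */
	if (removed->nr_pending != 0) {
		slot.rdev = removed;
		res.slot_restored = (slot.rdev == &sda);
	}

	res.nr_pending = sda.nr_pending;
	return res;
}
```

In this replay the reference count is taken successfully, yet the second read of the slot still sees NULL, because the remover only restores `p->rdev` after `synchronize_rcu()` returns. The writer's own `rdev` pointer stays valid throughout, and only the re-read of the slot is exposed to the window.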
wupeng changed the task type from Defect to Task
zhengzengkai changed the task status from To Do to Completed via src-openeuler/kernel Pull Request !461
