【Title】[sp1] The LTP test case ioctl_sg01 crashes the kernel
【Environment】
Hardware:
1) CPU : Intel(R) Xeon(R) Gold 5218 CPU @ 2.3
Software:
1) kernel: 4.19.90-2102.3.0.0058.up2.uel20.x86_64
【Steps to reproduce】
LTP version:
cd ltp-20210121/testcases/kernel/syscalls/ioctl/
for i in $(seq 1 10); do ./ioctl_sg01; done
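The problem also reproduces with the latest kernel, 4.19.90-2103.2.0.0060.
【Expected result】
The system keeps running normally.
【Actual result】
The kernel crashes and the system reboots automatically.
【Attachments】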
This issue crashes the kernel and makes the system reboot automatically, which affects server stability. It is a fatal issue; please prioritize it.
@weidongkl Hi, a colleague is already looking into this.
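@成坚 (CHENG Jian) Hello, how is this issue progressing? Could your team share each day's analysis results and methods here on the issue? That would help the community more, and later we could submit PRs to fix such problems ourselves.
We have a preliminary conclusion, and the analysis notes are being written up. I will post them today.
Hi, I am working on the patch and will push it shortly.
As for the analysis, I will write it up and post it later. If it is urgent, you can add me on WeChat (gatieme) first.
While running the LTP test case ioctl_sg01, the kernel crashed.
Because kdump had been enabled beforehand, a vmcore was captured when the kernel crashed. The vmcore is a snapshot of the kernel state at the time of the crash, and it contains a great deal of information that helps locate the problem.
We analyze the vmcore with the crash tool. First, start the debugging session: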
[root@localhost crash]# crash vmcore vmlinux
crash 7.2.6-3.oe1
Copyright (C) 2002-2019 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [22MB]: patching 80941 gdb minimal_symbol values
WARNING: kernel version inconsistency between vmlinux and dumpfile
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 64
DATE: Thu Mar 11 21:04:06 2021
UPTIME: 07:57:53
LOAD AVERAGE: 1.35, 2.08, 15.43
TASKS: 670
NODENAME: ltptest
RELEASE: 4.19.90-2102.3.0.0058.up2.uel20.x86_64
VERSION: #1 SMP Tue Mar 9 15:57:56 UTC 2021
MACHINE: x86_64 (2300 Mhz)
MEMORY: 510.7 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
PID: 387
COMMAND: "kswapd1"
TASK: ffff97763d725ac0 [THREAD_INFO: ffff97763d725ac0]
CPU: 20
STATE: TASK_RUNNING (PANIC)
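We can see that the kernel panicked on a NULL pointer dereference, and that it happened in the kswapd1 process. Next, let's print the stack and registers from just before the crash: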
crash> bt
PID: 387 TASK: ffff97763d725ac0 CPU: 20 COMMAND: "kswapd1"
#0 [ffff97763d71fa08] machine_kexec at ffffffff826442c2
#1 [ffff97763d71fa50] __crash_kexec at ffffffff82721c09
#2 [ffff97763d71fb08] crash_kexec at ffffffff827229e8
#3 [ffff97763d71fb20] oops_end at ffffffff826199ba
#4 [ffff97763d71fb40] no_context at ffffffff826513ed
#5 [ffff97763d71fb90] __do_page_fault at ffffffff82651b3d
#6 [ffff97763d71fbf8] do_page_fault at ffffffff82651f7a
#7 [ffff97763d71fc20] page_fault at ffffffff8300116e
[exception RIP: deferred_split_scan+286]
RIP: ffffffff82812a7e RSP: ffff97763d71fcd8 RFLAGS: 00010046
RAX: ffffe2843fa00118 RBX: ffff97763d71fd90 RCX: 0000000000000000
RDX: ffffe2843fa00080 RSI: 0000000000000286 RDI: 0000000000000001
RBP: ffff97763d71fce0 R8: 0000000000000000 R9: ffff977741413478
R10: 0000000000000040 R11: 0000000000000000 R12: ffff97b63c701008
R13: ffff97b63c701000 R14: ffff97b63c701018 R15: ffffffff83900ae0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#8 [ffff97763d71fd20] do_shrink_slab at ffffffff827a7893
#9 [ffff97763d71fd70] shrink_slab at ffffffff827a7cfb
#10 [ffff97763d71fde0] shrink_node at ffffffff827abc73
#11 [ffff97763d71fe58] kswapd at ffffffff827aca36
#12 [ffff97763d71ff10] kthread at ffffffff826b7289
#13 [ffff97763d71ff50] ret_from_fork at ffffffff830001ef
From this we can see that the panic point is deferred_split_scan+286, at address RIP=ffffffff82812a7e. Let's check which instruction that is:
crash> dis -l ffffffff82812a7e
/usr/src/debug/kernel-4.19.90-2102.3.0.0058.up2.uel20.x86_64/linux-4.19.90-2102.3.0.0058.up2.uel20.x86_64/./include/linux/list.h: 105
0xffffffff82812a7e <deferred_split_scan+286>: mov %rdi,0x8(%r8)
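The culprit is clearly the value in r8, and the register dump above confirms R8=0000000000000000. Let's disassemble deferred_split_scan to see which variable r8 holds and what happened along the way: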
crash> dis deferred_split_scan
... // part of the code omitted
0xffffffff82812a67 <deferred_split_scan+263>: jmp 0xffffffff82812a59 <deferred_split_scan+249>
0xffffffff82812a69 <deferred_split_scan+265>: mov 0x98(%rdx),%r8
0xffffffff82812a70 <deferred_split_scan+272>: mov 0xa0(%rdx),%rdi
0xffffffff82812a77 <deferred_split_scan+279>: lea 0x98(%rdx),%rax
0xffffffff82812a7e <deferred_split_scan+286>: mov %rdi,0x8(%r8)
0xffffffff82812a82 <deferred_split_scan+290>: mov %r8,(%rdi)
0xffffffff82812a85 <deferred_split_scan+293>: mov %rax,0x98(%rdx)
0xffffffff82812a8c <deferred_split_scan+300>: mov %rax,0xa0(%rdx)
0xffffffff82812a93 <deferred_split_scan+307>: subq $0x1,(%r14)
0xffffffff82812a97 <deferred_split_scan+311>: subq $0x1,0x8(%rbx)
... // part of the code omitted
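The instructions from +265 to +300 have the shape of an inlined list_del_init(): load next and prev, cross-link the two neighbours, then point the entry back at itself. The following is a minimal userspace sketch of that sequence, not the kernel source. The names `list_del_init_model` and `first_store_address` are hypothetical, and a 64-bit target is assumed. It illustrates why a NULL `next` makes the first store touch address 0x8, the address reported in the PANIC line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Assumption: pointers are 8 bytes, as on the x86_64 machine in this report. */
_Static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit target");

/* The same four stores as the disassembly at +286 .. +300. */
static void list_del_init_model(struct list_head *entry)
{
    struct list_head *next = entry->next;   /* mov 0x98(%rdx),%r8  */
    struct list_head *prev = entry->prev;   /* mov 0xa0(%rdx),%rdi */

    next->prev = prev;                      /* mov %rdi,0x8(%r8)   */
    prev->next = next;                      /* mov %r8,(%rdi)      */
    entry->next = entry;                    /* mov %rax,0x98(%rdx) */
    entry->prev = entry;                    /* mov %rax,0xa0(%rdx) */
}

/* Address the first store would write to; nothing is dereferenced here. */
static uintptr_t first_store_address(const struct list_head *entry)
{
    return (uintptr_t)entry->next + offsetof(struct list_head, prev);
}
```

With a healthy two-node list, `list_del_init_model` unlinks the entry and leaves it self-linked. With `entry->next == NULL`, `first_store_address` evaluates to 0x8, so the very first store faults before any list state is changed.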
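We can see that r8 is loaded relative to RDX. Disassembling vmlinux with objdump shows the same fault site, mov %rdi,0x8(%r8): r8 is 0x0000000000000000, and r8 was loaded from rdx + 0x98. Compare this with the implementation of deferred_split_scan in the kernel source: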
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
...
list_for_each_safe(pos, next, ds_queue.split_queue) {
page = list_entry((void *)pos, struct page, mapping);
page = compound_head(page);
if (get_page_unless_zero(page)) {
list_move(page_deferred_list(page), &list);
} else {
/* We lost race with put_compound_page() */
list_del_init(page_deferred_list(page));
(*ds_queue.split_queue_len)--;
}
if (!--sc->nr_to_scan)
break;
}
...
}
static inline struct list_head *page_deferred_list(struct page *page)
{
/*
* Global or memcg deferred list in the second tail pages is
* occupied by compound_head.
*/
return &page[2].deferred_list;
}
So rdx holds the page pointer, with value RDX=ffffe2843fa00080, and r8 holds page[2].deferred_list.next. Let's print the layout of struct page and dump this region of memory:
crash> struct page -o
struct page {
[0] unsigned long flags;
union {
struct { // used by page[0] of a compound page
[8] struct list_head lru;
[24] struct address_space *mapping;
[32] unsigned long index;
[40] unsigned long private;
};
...
struct { // used by page[1] of a compound page
[8] unsigned long compound_head;
[16] unsigned char compound_dtor;
[17] unsigned char compound_order;
[20] atomic_t compound_mapcount;
};
struct { // used by page[2] of a compound page
[8] unsigned long _compound_pad_1;
[16] unsigned long _compound_pad_2;
[24] struct list_head deferred_list;
};
...
};
union {
[48] atomic_t _mapcount;
[48] unsigned int page_type;
[48] unsigned int active;
[48] int units;
};
[52] atomic_t _refcount;
[56] struct mem_cgroup *mem_cgroup;
}
SIZE: 64
crash> rd -s ffffe2843fa00000 -e ffffe2843fa00140
ffffe2843fa00000: 0017ffffc0000000 ffffe283abe00008
ffffe2843fa00010: ffffe2840ae30008 0000000000000000
ffffe2843fa00020: 00000007fc730600 000000000000000a
ffffe2843fa00030: 00000000ffffff7f 0000000000000000
ffffe2843fa00040: 0017ffffc0000000 0000000000000000
ffffe2843fa00050: ffffffff00000903 0000000000000000
ffffe2843fa00060: 0000000000000001 0000000000000000
ffffe2843fa00070: 00000000ffffffff 0000000000000000
ffffe2843fa00080: 0017ffffc0000000 0000000000000000 // address held in rdx
ffffe2843fa00090: dead000000000200 0000000000000000
ffffe2843fa000a0: dead000000000200 0000000000000000
ffffe2843fa000b0: 00000000ffffffff 0000000000000000
ffffe2843fa000c0: 0017ffffc0000000 0000000000000000
ffffe2843fa000d0: dead000000000200 0000000000000000
ffffe2843fa000e0: 0000000000000001 0000000000000000
ffffe2843fa000f0: 00000000ffffffff 0000000000000000
ffffe2843fa00100: 0017ffffc0000000 0000000000000000
ffffe2843fa00110: dead000000000200 0000000000000000
ffffe2843fa00120: 0000000000000001 0000000000000000
ffffe2843fa00130: 00000000ffffffff 0000000000000000
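Looking at the memory above, each page occupies 64 bytes, and together they should form a compound page. Checked against the struct page layout above, however, the values are clearly not what a live compound page would hold. In particular, the 32-bit field at offset 48 (_mapcount / page_type) reads 0xffffffff (-1) in most of these pages and 0xffffff7f in the first one, while _refcount at offset 52 reads 0 in all of them.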
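Each line of the `rd` output holds two 64-bit words, so the word at offset 0x30 of a page carries two 32-bit fields at once. The sketch below splits such a word the way a little-endian x86_64 machine lays it out: the low half is the field at offset 48 (_mapcount / page_type) and the high half is the field at offset 52 (_refcount). The struct and function names are hypothetical helpers for reading the dump, not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit fields that share the 64-bit word at offset 48 of a page. */
struct page_counters {
    int32_t mapcount;   /* offset 48: _mapcount / page_type */
    int32_t refcount;   /* offset 52: _refcount */
};

/* Split one 64-bit word as printed by `rd`, assuming little-endian layout. */
static struct page_counters decode_counters(uint64_t word)
{
    struct page_counters c;

    c.mapcount = (int32_t)(uint32_t)(word & 0xffffffffu);  /* low 32 bits  */
    c.refcount = (int32_t)(uint32_t)(word >> 32);          /* high 32 bits */
    return c;
}
```

Feeding it the words from the dump, `00000000ffffff7f` decodes to a field value of -129 at offset 48 with a zero at offset 52, and `00000000ffffffff` decodes to -1 and zero.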
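Running compound_head() on the page at ffffe2843fa00080 still returns that page itself, so the code treats it as a head page and page_deferred_list() resolves to ffffe2843fa00080 + 0x98 = ffffe2843fa00118, which is the value in RAX. At that address the dump shows deferred_list.next=0000000000000000 (R8) and, at ffffe2843fa00120, deferred_list.prev=0000000000000001 (RDI), which is not a valid list node. We therefore have reason to suspect that this memory no longer belongs to the compound page that was queued: ffffe2843fa00000 looks like the real page[0], and the pages after it appear to have been split off from the compound page. In that case these offsets now hold unrelated fields such as mapping, and a value of 0000000000000000 makes sense. Next, let's look at where deferred_list is modified.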
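The address arithmetic can be cross-checked against the register dump. The sketch below models compound_head() (bit 0 of the word at offset 8 marks a tail page) and page_deferred_list() (the address of &page[2].deferred_list) on plain integers. It assumes the 64-byte struct page and the offset 24 for deferred_list reported by `struct page -o`. The function names with the `_model` suffix and `deferred_list_address` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_STRUCT_SIZE      64u  /* SIZE reported by `struct page -o` */
#define DEFERRED_LIST_OFFSET  24u  /* offset of deferred_list within page[2] */

/* compound_head(): a set bit 0 in the word at offset 8 marks a tail page,
 * and the remaining bits are the address of the head page. */
static uint64_t compound_head_model(uint64_t page_addr, uint64_t word_at_8)
{
    if (word_at_8 & 1)
        return word_at_8 - 1;
    return page_addr;
}

/* page_deferred_list(): address of &page[2].deferred_list. */
static uint64_t deferred_list_address(uint64_t head_addr)
{
    return head_addr + 2 * PAGE_STRUCT_SIZE + DEFERRED_LIST_OFFSET;
}
```

For the page in rdx, the word at offset 8 is 0, so the model returns ffffe2843fa00080 unchanged, and the list node lands at ffffe2843fa00118. That is 0x98 past rdx and equal to RAX in the backtrace.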
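Every access to page.deferred_list goes through page_deferred_list(). Tracing the callers of page_deferred_list(), we found that all of these operations are protected by a lock. The lock is obtained through page->mem_cgroup or pglist_data, so we wondered whether the page's mem_cgroup could change during execution while its deferred_list stays on the original mem_cgroup. We therefore switched to tracing changes of mem_cgroup, and eventually found that the move_active_pages_to_lru path calls mem_cgroup_uncharge(page) while the page's deferred_list has not been updated. Calling mem_cgroup_uncharge() here ends up operating on deferred_list and then freeing the page, which corrupts the list.
Further analysis showed that mem_cgroup_uncharge() should not be called here at all, because free_compound_page() already performs the uncharge. After the fix we re-ran the test and confirmed that the kernel no longer crashes.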
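A corrupted node like the one in this dump can be recognised before it is unlinked. The following is a minimal userspace sketch of such a consistency check, loosely in the spirit of the kernel's CONFIG_DEBUG_LIST checks. It is not the fix that was merged, and it is not kernel code. The poison constants are assumed to match the dead000000000200-style values visible in the dump, the alignment test is an extra assumption of this sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Assumed poison values left behind by list_del() on this kernel. */
#define LIST_POISON1_MODEL ((uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2_MODEL ((uintptr_t)0xdead000000000200ULL)

/* Reject NULL, misaligned, and poisoned link values without dereferencing. */
static bool pointer_is_plausible(const struct list_head *p)
{
    uintptr_t v = (uintptr_t)p;

    return v != 0 &&
           (v & (sizeof(void *) - 1)) == 0 &&
           v != LIST_POISON1_MODEL &&
           v != LIST_POISON2_MODEL;
}

/* An entry is sane only if both neighbours link back to it. */
static bool list_entry_is_sane(const struct list_head *entry)
{
    if (!pointer_is_plausible(entry->next) ||
        !pointer_is_plausible(entry->prev))
        return false;
    return entry->next->prev == entry && entry->prev->next == entry;
}
```

A node with next = 0 and prev = 1, as seen at ffffe2843fa00118, fails the first test and is rejected without being dereferenced.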
branch | commit | tag |
---|---|---|
openEuler-1.0-LTS | edc25b118985 | 4.19.90-2103.4.0 |
kernel-4.19 | 2377894b0491 | NA |
Confirmed: with the patch merged, the problem is fixed.
@weidongkl The problem has been fixed, so I am closing this issue.