[openEuler 20.03 LTS] pread returns -EIO while testing the openGauss database

Opened this issue  
2021-06-21 09:45

Environment
Architecture: arm64
Memory page size: 64k
File system: xfs
Kernel version: 4.19.90
Problem

Breakpoint debugging with gdb confirms that the pread system call is what fails; the error code is -5, i.e. -EIO.
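The same check can be made from user space without a debugger. The following is only a minimal sketch: `read_at_checked` is a hypothetical helper name, not something from the issue, and it relies solely on the standard POSIX `pread` call. For the failure described above it would be expected to report EIO (5 on Linux).

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* make sure pread() is declared */
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the issue): read up to `len` bytes at
 * offset `off`. Returns 0 if pread() did not fail, otherwise the positive
 * errno value, and logs the failure to stderr.
 */
int read_at_checked(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n;

	assert(buf != NULL);
	n = pread(fd, buf, len, off);
	if (n < 0) {
		int err = errno;	/* save errno before calling anything else */

		fprintf(stderr, "pread(fd=%d, off=%lld) failed: %s (errno=%d)\n",
			fd, (long long)off, strerror(err), err);
		return err;
	}
	return 0;
}
```

A caller could compare the return value against `EIO` to tell this failure apart from other read errors.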

成坚 (CHENG Jian) created the task
成坚 (CHENG Jian) set related repository to openEuler/kernel

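1 Kernel tracing

Add kprobe tracing on the kernel functions along the read path, from pread down to the block layer (the relative paths below assume the commands are run from the tracefs directory, e.g. /sys/kernel/debug/tracing):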

echo 'r:pread64_ret ksys_pread64 ret=$retval:s64' > kprobe_events
echo 'ret < 0' > events/kprobes/pread64_ret/filter
echo 1 > events/kprobes/pread64_ret/enable

echo 'r:xfs_file_read_iter xfs_file_read_iter ret=$retval:s64' >> kprobe_events
echo 'ret < 0' > events/kprobes/xfs_file_read_iter/filter
echo 1 > events/kprobes/xfs_file_read_iter/enable

echo 'r:generic_file_buffered_read generic_file_buffered_read ret=$retval:s64' >> kprobe_events
echo 'ret < 0' > events/kprobes/generic_file_buffered_read/filter
echo 1 > events/kprobes/generic_file_buffered_read/enable

echo 'r:rw_verify_area rw_verify_area ret=$retval:s32' >> kprobe_events
echo 'ret < 0' > events/kprobes/rw_verify_area/filter
echo 1 > events/kprobes/rw_verify_area/enable

echo 'error != 0' > events/block/block_rq_complete/filter
echo 1 > events/block/block_rq_complete/enable

echo 'r:ondemand_readahead ondemand_readahead ret=$retval:s64' >> kprobe_events
echo 'ret < 0' > events/kprobes/ondemand_readahead/filter
echo 1 > events/kprobes/ondemand_readahead/enable

echo 'r:read_pages read_pages ret=$retval:s32' >> kprobe_events
echo 'ret < 0' > events/kprobes/read_pages/filter
echo 1 > events/kprobes/read_pages/enable

echo 'r:iomap_apply iomap_apply ret=$retval:s64' >> kprobe_events
echo 'ret < 0' > events/kprobes/iomap_apply/filter
echo 1 > events/kprobes/iomap_apply/enable

echo 'r:iomap_readpage_actor iomap_readpage_actor ret=$retval:s64' >> kprobe_events
echo 'ret < 0' > events/kprobes/iomap_readpage_actor/filter
echo 1 > events/kprobes/iomap_readpage_actor/enable

After reproducing the problem again, the result is as follows:

[trace output screenshot not available]

The trace shows that generic_file_buffered_read returned -5, while the probes on read_pages, iomap_readpage_actor and the others printed nothing. Reading the code, and given that reading the page itself reported no error, there is only one place where generic_file_buffered_read returns -EIO:

static ssize_t generic_file_buffered_read(struct kiocb *iocb,
		struct iov_iter *iter, ssize_t written)
{
	......
readpage:
		/*
		 * A previous I/O error may have been due to temporary
		 * failures, eg. multipath errors.
		 * PG_error will be set again if readpage fails.
		 */
		ClearPageError(page);
		/* Start the actual read. The read will unlock the page. */
		error = mapping->a_ops->readpage(filp, page);	/* --> reading the page succeeds */
		if (unlikely(error)) {
			if (error == AOP_TRUNCATED_PAGE) {
				put_page(page);
				error = 0;
				goto find_page;
			}
			goto readpage_error;
		}
		if (!PageUptodate(page)) {	/* --> the page has no uptodate flag */
			error = lock_page_killable(page);	/* --> lock the page */
			if (unlikely(error))
				goto readpage_error;
			if (!PageUptodate(page)) {
				if (page->mapping == NULL) {	/* --> the page's mapping is still there */
					/*
					 * invalidate_mapping_pages got it
					 */
					unlock_page(page);
					put_page(page);
					goto find_page;
				}
				unlock_page(page);
				shrink_readahead_size_eio(filp, ra);
				error = -EIO;	/* --> returns -EIO */
				goto readpage_error;
			}
			unlock_page(page);
		}
		goto page_ok;
	......
}

The suspected cause is therefore that xfs_vm_readpage reported no error but the uptodate flag was never set on the page. Next, look at where the uptodate flag is set:

iomap_set_range_uptodate walks all sub-pages of the page and sets the uptodate bit on each sub-page inside the range (off to off+len-1). At the end, if every sub-page is uptodate, it marks the page itself uptodate.
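The walk described above can be modelled in user space roughly as follows. This is a simplified sketch, not the kernel function: `struct page_model` and `model_set_range_uptodate` are hypothetical names, and the 4k block size is an assumption chosen so that a 64k page holds 16 sub-pages.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE  65536u	/* 64k page, as in the environment above */
#define MODEL_BLOCK_SIZE 4096u	/* assumed block size: gives 16 sub-pages */
#define MODEL_SUBPAGES   (MODEL_PAGE_SIZE / MODEL_BLOCK_SIZE)

/* Simplified stand-in for the per-page state (not the kernel struct). */
struct page_model {
	bool sub_uptodate[MODEL_SUBPAGES];
	bool page_uptodate;
};

/*
 * Mark every sub-page covering bytes off .. off+len-1 as uptodate.
 * While walking, remember whether any sub-page outside the range is
 * still not uptodate; only if none is, mark the whole page uptodate.
 */
void model_set_range_uptodate(struct page_model *p, unsigned off, unsigned len)
{
	unsigned first = off / MODEL_BLOCK_SIZE;
	unsigned last = (off + len - 1) / MODEL_BLOCK_SIZE;
	bool uptodate = true;
	unsigned i;

	for (i = 0; i < MODEL_SUBPAGES; i++) {
		if (i >= first && i <= last)
			p->sub_uptodate[i] = true;
		else if (!p->sub_uptodate[i])
			uptodate = false;
	}
	if (uptodate)
		p->page_uptodate = true;
}
```

Called sequentially for the two halves of a page, the second call finds every sub-page uptodate and marks the page. The set and the check happen in a single unlocked walk, so that result depends on the calls not overlapping.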
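Problem
When page_size > block_size, the calls to iomap_set_range_uptodate made at bio completion can race. For example, suppose there are 16 sub-pages in total: bio1 sets the first 8 uptodate when it completes, and bio2 sets the last 8. In theory the page should be marked uptodate once both bios have completed, but when the two iomap_set_range_uptodate calls run concurrently, the following sequence can occur:

  1. bio2 walks up to the 9th sub-page and finds that the first 8 sub-pages are not uptodate
  2. bio1 sets the first 8 sub-pages uptodate, continues walking, and finds that the last 8 sub-pages are not uptodate
  3. bio2 continues walking the last 8 sub-pages and sets them uptodate

At this point both bio1 and bio2 have walked all 16 sub-pages, and each concluded that some sub-page is still not uptodate, so the page is never marked uptodate.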
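The three steps above can be replayed deterministically in a single thread by making each bio's walk resumable. This is an illustrative sketch only: `struct race_page`, `struct race_walker` and `race_replay` are hypothetical names, and the interleaving is forced by hand rather than produced by real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

#define RACE_SUBPAGES 16u

struct race_page {
	bool sub_uptodate[RACE_SUBPAGES];
	bool page_uptodate;
};

/* One in-progress walk over the sub-pages on behalf of one bio. */
struct race_walker {
	unsigned first, last;	/* sub-page range this bio covers */
	unsigned next;		/* next sub-page index to visit */
	bool all_uptodate;	/* what this walk has concluded so far */
};

/* Advance the walk by at most `steps` sub-pages; finish it if at the end. */
static void race_walk(struct race_walker *w, struct race_page *p, unsigned steps)
{
	unsigned done;

	for (done = 0; done < steps && w->next < RACE_SUBPAGES; done++) {
		unsigned i = w->next++;

		if (i >= w->first && i <= w->last)
			p->sub_uptodate[i] = true;
		else if (!p->sub_uptodate[i])
			w->all_uptodate = false;
	}
	if (w->next == RACE_SUBPAGES && w->all_uptodate)
		p->page_uptodate = true;
}

/* Replay steps 1-3 from the text; returns the final page-level flag. */
bool race_replay(struct race_page *p)
{
	struct race_walker bio1 = { 0, 7, 0, true };
	struct race_walker bio2 = { 8, 15, 0, true };

	race_walk(&bio2, p, 8);			/* 1. bio2 sees sub-pages 0-7 not uptodate */
	race_walk(&bio1, p, RACE_SUBPAGES);	/* 2. bio1 sets 0-7, sees 8-15 not uptodate */
	race_walk(&bio2, p, 8);			/* 3. bio2 sets 8-15 */
	return p->page_uptodate;
}
```

After the replay every sub-page is uptodate, yet the page-level flag stays false, which matches the -EIO path in generic_file_buffered_read.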

Solution
Mainline already has a patch that fixes this by adding a spinlock:

iomap: fix sub-page uptodate handling
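
The idea of serializing the update with a lock can be sketched in user space as follows. This is not the kernel patch itself, which should be consulted for the real change. All names here (`struct locked_page`, `locked_set_range_uptodate`) are hypothetical, and a C11 `atomic_flag` spin loop stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_SUBPAGES 16u

struct locked_page {
	atomic_flag lock;	/* user-space stand-in for a spinlock */
	bool sub_uptodate[LOCKED_SUBPAGES];
	bool page_uptodate;
};

void locked_page_init(struct locked_page *p)
{
	unsigned i;

	atomic_flag_clear(&p->lock);
	for (i = 0; i < LOCKED_SUBPAGES; i++)
		p->sub_uptodate[i] = false;
	p->page_uptodate = false;
}

static void locked_page_lock(struct locked_page *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void locked_page_unlock(struct locked_page *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Set sub-pages first..last uptodate and check the whole page inside one
 * critical section, so a concurrent caller cannot observe a half-done walk.
 */
void locked_set_range_uptodate(struct locked_page *p, unsigned first, unsigned last)
{
	bool all = true;
	unsigned i;

	locked_page_lock(p);
	for (i = first; i <= last && i < LOCKED_SUBPAGES; i++)
		p->sub_uptodate[i] = true;
	for (i = 0; i < LOCKED_SUBPAGES; i++)
		if (!p->sub_uptodate[i])
			all = false;
	if (all)
		p->page_uptodate = true;
	locked_page_unlock(p);
}
```

Because the set and the check happen under the same lock, whichever caller runs second sees the other's bits and marks the page, regardless of completion order.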

Branch             Commit                                    Tag
openEuler-1.0-LTS  312e5cdf3dcd77a9ad0485f12819f28d7fa32704  tags/4.19.90-2008.2.0~31
kernel-4.19        cd67f09799e9f4e894000ab3896bc450b64ed532  tags/4.19.138-2008.1.0~113
openEuler-20.09    cd67f09799e9f4e894000ab3896bc450b64ed532  tags/4.19.138-2008.1.0~113

成坚 (CHENG Jian) changed issue state from To do to Done
