openEuler / kernel
The total memory shown by free is smaller than the total_rss value in /sys/fs/cgroup/memory/memory.stat

Completed
Bug
Created on 2021-04-13 10:54

Kernel version: 4.19.90
A Kubernetes (k8s) cluster was installed with 7 nodes: 3 servers act as both master and worker nodes, and the remaining 4 servers are worker nodes.
After the first master node had been running for 2 months, the pods on it were evicted. Inspecting the statistics of the cgroup memory controller showed that the value of /sys/fs/cgroup/memory/memory.usage_in_bytes was larger than the total memory reported by free. The total_rss value in /sys/fs/cgroup/memory/memory.stat was likewise larger than the total physical memory.
Because k8s judges from total_rss that memory is over the limit, the pods on the master node were evicted and the workload could not run normally.

/sys/fs/cgroup/memory/memory.stat reports total_rss = 319230640128 bytes (about 297 GiB)

free -m reports a total of 261016 MiB (about 255 GiB)

We need an explanation of why total_rss can be larger than the actual physical memory.
By analyzing total_rss, stop the pods on the affected nodes from being evicted, so that the workload is no longer impacted.

Comments (7)

jpzhang187 created the issue
jpzhang187 set the related repository to openEuler/kernel

Hey jpzhang187, Welcome to openEuler Community.
All of the projects in openEuler Community are maintained by @openeuler-ci-bot.
That means the developers can comment below every pull request or issue to trigger Bot Commands.
Please follow instructions at https://gitee.com/openeuler/community/blob/master/en/sig-infrastructure/command.md to find the details.

jpzhang187 set the assignee to 成坚 (CHENG Jian)
jpzhang187 modified the description

The values of total_rss and memory.usage_in_bytes are obtained by traversing the whole cgroup tree. If a cgroup is deleted during the traversal, or a cgroup's usage grows while the traversal runs, the two values will be inaccurate. This problem has been fixed in the latest version.

@jing-xiangfeng Could you post a link to the patch?

Our colleagues at Kylin have already root-caused this issue. The patch and analysis are as follows:

From ed778542a7f921cfd0e26ee697e62c290afa6f28 Mon Sep 17 00:00:00 2001
From: jiangfeng <jiangfeng@kylinos.cn>
Date: Mon, 21 Jun 2021 15:13:14 +0800
Subject: [PATCH] KYLIN: mm/memcontrol: fix wrong vmstats for dying memcg

Mainline: NA
From: NA
Category: Bugfix
CVE: NA
KABI: Checked

At present, only when the absolute value of vmstats_percpu exceeds
MEMCG_CHARGE_BATCH will it be updated to vmstats, so there will always
be a certain lag difference between vmstats and the correct value.

In addition, since the partially deleted memcg is still referenced, it
will not be freed immediately after it is offline. Although the
remaining memcg has released the page, it and the parent's vmstats will
still be not 0 or too large due to the update lag, which leads to the
abnormality of the total_<count> parameter in the memory.stat file.

This patch mainly solves the problem of synchronization between
memcg's vmstats and the correct value during the destruction process
from two aspects:
1) Perform a flush synchronization operation when memcg is offline
2) For memcg in the process of being destroyed, bypass the threshold
   judgment when updating vmstats

Reproduction method:
1) Use ssh to log in and log out of the system repeatedly with root
   account for more than 3000 times
2) Enter the /sys/fs/cgroup/memory/user.slice/user-0.slice directory,
   you can find that the total_rss of this layer is much larger than
   the sum of the rss of all sublayers


task: #25019

Cc: nh <nh@tj.kylinos.cn> #kylinos-next
Signed-off-by: xieming <xieming@kylinos.cn>
Signed-off-by: jiangfeng <jiangfeng@kylinos.cn>
Change-Id: Ic8325b9d0defaaea52d1b7f03aa172439cd48692
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9e93c05..a4d791c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -632,7 +632,7 @@
 		return;
 
 	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
+	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH || memcg->css.flags & CSS_DYING)) {
 		struct mem_cgroup *mi;
 
 		/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff9169d..3c3ddb8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3115,8 +3115,10 @@
 		stat[i] = 0;
 
 	for_each_online_cpu(cpu)
-		for (i = 0; i < MEMCG_NR_STAT; i++)
+		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
+			per_cpu(memcg->vmstats_percpu->stat[i], cpu) = 0;
+		}
 
 	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
 		for (i = 0; i < MEMCG_NR_STAT; i++)
@@ -3130,9 +3132,11 @@
 			stat[i] = 0;
 
 		for_each_online_cpu(cpu)
-			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 				stat[i] += per_cpu(
 					pn->lruvec_stat_cpu->count[i], cpu);
+				per_cpu(pn->lruvec_stat_cpu->count[i], cpu) = 0;
+			}
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
 			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
@@ -3150,9 +3154,11 @@
 		events[i] = 0;
 
 	for_each_online_cpu(cpu)
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
+		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
 			events[i] += per_cpu(memcg->vmstats_percpu->events[i],
 					     cpu);
+			per_cpu(memcg->vmstats_percpu->events[i], cpu) = 0;
+		}
 
 	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
@@ -4760,6 +4766,9 @@
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
 
+	memcg_flush_percpu_vmstats(memcg);
+	memcg_flush_percpu_vmevents(memcg);
+
 	mem_cgroup_id_put(memcg);
 }

Related fixes on the kernel-4.19 branch:

a28bf2c54996 mm: memcg/slab: fix percpu slab vmstats flushing
4838eceb0195 mm: memcontrol: fix NULL-ptr deref in percpu stats flush
b64e646fe6c7 mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
b00f9cc2e230 mm: memcontrol: fix percpu vmstats and vmevents flush
388cef144d6b mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
e65fecbb2825 mm: memcontrol: flush percpu slab vmstats on kmem offlining
6319bfe538fe mm: memcontrol: flush percpu vmevents before releasing memcg
ae954054c114 mm: memcontrol: flush percpu vmstats before releasing memcg
984a62cb05ab mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones
a096202fab07 mm/memcontrol: fix wrong statistics in memory.stat
223957a0bd0b mm: memcontrol: don't batch updates of local VM stats and events
80de86dcabb6 mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
2bce8ff6af01 mm: memcontrol: fix recursive statistics correctness & scalabilty
daf410eb7d0f mm: memcontrol: move stat/event counting functions out-of-line
3ff5e0442c96 mm: memcontrol: make cgroup stats and events query API explicitly local
2d26d2d62888 mm, memcg: rename ambiguously named memory.stat counters and functions
dfe8d0adf961 mm/memcontrol.c: fix memory.stat item ordering
a79b33fbd920 mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
7e42f408beb8 mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
7b115f3400da mm/memory.c: fix a huge pud insertion race during faulting
701ad0619ed3 percpu: update free path with correct new free region
5c44571bf03f percpu: use nr_groups as check condition
9e8456391337 mm: update get_user_pages_longterm to migrate pages allocated from CMA region
deaa4f3b0378 mm/cma: add PF flag to force non cma alloc
518b56956ff0 mm: move the backup x_devmap() functions to asm-generic/pgtable.h
526a26145a51 mm, oom: remove 'prefer children over parent' heuristic
18aa65252bc8 mm/vmstat.c: assert that vmstat_text is in sync with stat_items_size
175b32d828e8 mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release
f88a8298a7fb mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly
5f20b17d2159 mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() API
d7ae0faa24e8 mm: memcontrol: push down mem_cgroup_nr_lru_pages()
0c8a6fa1e4be mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()
50c31015a35d mm: memcontrol: replace node summing with memcg_page_state()
9af85687bc73 mm: memcontrol: replace zone summing with lruvec_page_state()
a76b094c78d2 mm: memcontrol: track LRU counts in the vmstats array
86f6eb7c0be8 mm: memcontrol: expose THP events on a per-memcg basis
408011e887fe mm, memcg: extract memcg maxable seq_file logic to seq_show_memcg_tunable
03c169feac16 mm, memcg: create mem_cgroup_from_seq
4ab385e78820 mm/oom_kill.c: fix uninitialized oc->constraint
514506c1074a mm, oom: add oom victim's memcg to the oom context information
948a1eddee15 mm, oom: reorganize the oom report in dump_header

Related fixes on the openEuler-1.0-LTS branch:

f40af09ad893 memcg: fix kabi broken when memory cgroup enhance
213e410e44ff mm: memcontrol: fix NULL-ptr deref in percpu stats flush
82ee7790611e mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size
648dfe059c35 mm: memcontrol: fix percpu vmstats and vmevents flush
f721b4181442 mm, memcg: partially revert "mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones"
ba4831a8bbd6 mm/memcontrol.c: keep local VM counters in sync with the hierarchical ones
c058ea679749 mm: memcontrol: flush percpu vmevents before releasing memcg
224f3a2f6594 mm: memcontrol: flush percpu vmstats before releasing memcg
bebc7577a763 mm/memcontrol: fix wrong statistics in memory.stat
bba51e8ec00d mm: memcontrol: don't batch updates of local VM stats and events
d7e2fc69cba5 mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
622ad3079dcd mm: memcontrol: fix recursive statistics correctness & scalabilty
498a689afdfb mm: memcontrol: move stat/event counting functions out-of-line
c06ec704e74a mm: memcontrol: make cgroup stats and events query API explicitly local
69a33553df02 mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() API
c6e111395005 mm, memcg: rename ambiguously named memory.stat counters and functions
406a91dc8ed5 mm/memcontrol.c: fix memory.stat item ordering
9d5425af85ab mm: memcontrol: expose THP events on a per-memcg basis
1a7d152bdf5b mm: memcontrol: track LRU counts in the vmstats array
c2fbd9f353eb mm: memcontrol: push down mem_cgroup_nr_lru_pages()
e5693929d555 mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()
9c18f1d2cb2d mm: workingset: don't drop refault information prematurely
34149d721493 mm: memcontrol: replace zone summing with lruvec_page_state()
fd604ae0d0a9 mm: memcontrol: replace node summing with memcg_page_state()
48ad87db08ab mm, oom: add oom victim's memcg to the oom context information
58b55f1fedc2 mm/oom_kill.c: fix uninitialized oc->constraint
0201217cd6d4 mm, oom: reorganize the oom report in dump_header
