Commit 2fba5961 authored May 06, 2025 by Shakeel Butt Committed by Andrew Morton May 13, 2025
memcg: simplify consume_stock

Patch series "memcg: decouple memcg and objcg stocks", v3.

The per-cpu memcg charge cache and objcg charge cache are coupled in a
single struct memcg_stock_pcp and a single local lock is used to protect
both of the caches.  This makes memcg charging and objcg charging nmi safe
challenging.  Decoupling memcg and objcg stocks would allow us to make
them nmi safe and even work without disabling irqs independently.  This
series completely decouples memcg and objcg stocks.

To evaluate the impact of this series with and without PREEMPT_RT config,
we ran varying number of netperf clients in different cgroups on a 72 CPU
machine.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

PREEMPT_RT config:
------------------
number of clients | Without series | With series
  6               | 38559.1 Mbps   | 38652.6 Mbps
  12              | 37388.8 Mbps   | 37560.1 Mbps
  18              | 30707.5 Mbps   | 31378.3 Mbps
  24              | 25908.4 Mbps   | 26423.9 Mbps
  30              | 22347.7 Mbps   | 22326.5 Mbps
  36              | 20235.1 Mbps   | 20165.0 Mbps

!PREEMPT_RT config:
-------------------
number of clients | Without series | With series
  6               | 50235.7 Mbps   | 51415.4 Mbps
  12              | 49336.5 Mbps   | 49901.4 Mbps
  18              | 46306.8 Mbps   | 46482.7 Mbps
  24              | 38145.7 Mbps   | 38729.4 Mbps
  30              | 30347.6 Mbps   | 31698.2 Mbps
  36              | 26976.6 Mbps   | 27364.4 Mbps

No performance regression was observed.


This patch (of 4):

consume_stock() does not need to check gfp_mask for spinning and can
simply trylock the local lock to decide to proceed or fail.  No need to
spin at all for local lock.

One of the concern raised was that on PREEMPT_RT kernels, this trylock can
fail more often due to tasks having lock_lock can be preempted.  This can
potentially cause the task which have preempted the task having the
local_lock to take the slow path of memcg charging.

However this behavior will only impact the performance if memcg charging
slowpath is worse than two context switches and possibly scheduling delay
behavior of current code.  From the network intensive workload experiment
it does not seem like the case.

We ran varying number of netperf clients in different cgroups on a 72 CPU
machine for PREEMPT_RT config.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

number of clients | Without series | With series
  6               | 38559.1 Mbps   | 38652.6 Mbps
  12              | 37388.8 Mbps   | 37560.1 Mbps
  18              | 30707.5 Mbps   | 31378.3 Mbps
  24              | 25908.4 Mbps   | 26423.9 Mbps
  30              | 22347.7 Mbps   | 22326.5 Mbps
  36              | 20235.1 Mbps   | 20165.0 Mbps

We don't see any significant performance difference for the network
intensive workload with this series.

Link: https://lkml.kernel.org/r/20250506225533.2580386-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20250506225533.2580386-2-shakeel.butt@linux.dev


Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumaze <edumazet@google.com>
Cc: Jakub Kacinski <kuba@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent 0cad6736
Show whitespace changes
Inline Side-by-side
Please to comment