sched/psi: Optimize psi_group_change() cpu_clock() usage
[ Upstream commit 570c8efd ] Dietmar reported that commit 3840cbe2 ("sched: psi: fix bogus pressure spikes from aggregation race") caused a regression for him on a high context switch rate benchmark (schbench) due to the now repeating cpu_clock() calls. In particular the problem is that get_recent_times() will extrapolate the current state to 'now'. But if an update uses a timestamp from before the start of the update, it is possible to get two reads with inconsistent results. It is effectively back-dating an update. (note that this all hard-relies on the clock being synchronized across CPUs -- if this is not the case, all bets are off). Combine this problem with the fact that there are per-group-per-cpu seqcounts, the commit in question pushed the clock read into the group iteration, causing tree-depth cpu_clock() calls. On architectures where cpu_clock() has appreciable overhead, this hurts. Instead move to a per-cpu seqcount, which allows us to have a single clock read for all group updates, increasing internal consistency and lowering update overhead. This comes at the cost of a longer update side (proportional to the tree depth) which can cause the read side to retry more often. Fixes: 3840cbe2 ("sched: psi: fix bogus pressure spikes from aggregation race") Reported-by:Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Tested-by:
Dietmar Eggemann <dietmar.eggemann@arm.com>,> Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net Signed-off-by:
Sasha Levin <sashal@kernel.org>
Loading