|
Do a spell-checking pass.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull bitmap updates from Yury Norov:
- topology_span_sane() optimization from Kyle Meyer
- fns() rework from Kuan-Wei Chiu (used in cpumask_local_spread() and
other places)
- headers cleanup from Andy
- add a MAINTAINERS record for bitops API
* tag 'bitmap-for-6.10v2' of https://github.com/norov/linux:
usercopy: Don't use "proxy" headers
bitops: Move aligned_byte_mask() to wordpart.h
MAINTAINERS: add BITOPS API record
bitmap: relax find_nth_bit() limitation on return value
lib: make test_bitops compilable into the kernel image
bitops: Optimize fns() for improved performance
lib/test_bitops: Add benchmark test for fns()
Compiler Attributes: Add __always_used macro
sched/topology: Optimize topology_span_sane()
cpumask: Add for_each_cpu_from()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix a sched_balance_newidle setting bug
- Fix bug in the setting of /sys/fs/cgroup/test/cpu.max.burst
- Fix variable-shadowing build warning
- Extend sched-domains debug output
- Fix documentation
- Fix comments
* tag 'sched-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix incorrect initialization of the 'burst' parameter in cpu_max_write()
sched/fair: Remove stale FREQUENCY_UTIL comment
sched/fair: Fix initial util_avg calculation
docs: cgroup-v1: Clarify that domain levels are system-specific
sched/debug: Dump domains' level
sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
arch/topology: Fix variable naming to avoid shadowing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove sentinel elements from ctl_table structs in kernel/*
Removing sentinels in ctl_table arrays reduces the build time size
and runtime memory consumed by ~64 bytes per array. Removals for
net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline
through their respective subsystems making the next release the most
likely place where the final series that removes the check for
proc_name == NULL will land.
This adds to removals already in arch/, drivers/ and fs/.
- Adjust ctl_table definitions and references to allow constification
- Remove unused ctl_table function arguments
- Move non-const elements from ctl_table to ctl_table_header
- Make ctl_table pointers const in ctl_table_root structure
Making the static ctl_table structs const will increase safety by
keeping the pointers to proc_handler functions in .rodata. Though no
ctl_tables were made const in this PR, the groundwork for making
that possible has started with these changes sent by Thomas
Weißschuh.
* tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: drop now unnecessary out-of-bounds check
sysctl: move sysctl type to ctl_table_header
sysctl: drop sysctl_is_perm_empty_ctl_table
sysctl: treewide: constify argument ctl_table_root::permissions(table)
sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)
bpf: Remove the now superfluous sentinel elements from ctl_table array
delayacct: Remove the now superfluous sentinel elements from ctl_table array
kprobes: Remove the now superfluous sentinel elements from ctl_table array
printk: Remove the now superfluous sentinel elements from ctl_table array
scheduler: Remove the now superfluous sentinel elements from ctl_table array
seccomp: Remove the now superfluous sentinel elements from ctl_table array
timekeeping: Remove the now superfluous sentinel elements from ctl_table array
ftrace: Remove the now superfluous sentinel elements from ctl_table array
umh: Remove the now superfluous sentinel elements from ctl_table array
kernel misc: Remove the now superfluous sentinel elements from ctl_table array
|
|
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.
This matches the behavior described in the documentation:
  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).
"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.
Fixes: 1d3504fcf560 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
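The level semantics described in this changelog can be modeled in a few lines. This is a user-space Python sketch, not the kernel code: the function name `newidle_levels` and its `level_max` and `default` arguments are hypothetical, and treating "no request and no default" as "every level stays enabled" is an assumption made for illustration.

```python
def newidle_levels(request, level_max, default=-1):
    """Return the domain levels that keep newidle balancing enabled.

    request   -- relax_domain_level value (-1 means "no request")
    level_max -- highest domain level on this (hypothetical) system
    default   -- system default used when there is no request
    """
    if request < 0:
        request = default
    if request < 0:
        # Assumption: with no request and no default, nothing is excluded.
        return list(range(level_max + 1))
    # Levels below the request stay enabled; levels at or above it are excluded.
    return [level for level in range(level_max + 1) if level < request]
```

With `level_max = 3`, a request of 2 keeps levels 0 and 1, a request of 3 excludes only the last level, and a request of 4 keeps all levels, which matches the wording of the changelog.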
|
|
Optimize topology_span_sane() by removing duplicate comparisons.
Since topology_span_sane() is called inside of for_each_cpu(), each
previous CPU has already been compared against every other CPU. The
current CPU only needs to be compared against higher-numbered CPUs.
The total number of comparisons is reduced from N * (N - 1) to
N * (N - 1) / 2 on each non-NUMA scheduling domain level.
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
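The reduction in comparisons claimed in this changelog can be checked with a small counting model. This is an illustrative Python sketch rather than the kernel implementation; `count_comparisons` and its `only_higher` flag are hypothetical names.

```python
def count_comparisons(n_cpus, only_higher):
    """Count pairwise checks made while iterating over every CPU.

    only_higher=False models comparing each CPU against every other CPU;
    only_higher=True models comparing only against higher-numbered CPUs.
    """
    count = 0
    for cpu in range(n_cpus):
        start = cpu + 1 if only_higher else 0
        for other in range(start, n_cpus):
            if other != cpu:
                count += 1
    return count
```

For 8 CPUs the first mode performs 8 * 7 = 56 comparisons and the second 8 * 7 / 2 = 28.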
|
|
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel. Further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/
Remove the sentinel element from ctl_table arrays.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
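The difference between a sentinel-terminated table and a table whose size is known up front can be illustrated as follows. This is a loose Python analogy for the C arrays discussed in the changelog; the dictionaries, the `procname` key and both function names are assumptions made for illustration, not kernel structures.

```python
def count_until_sentinel(table):
    """Old style: walk entries until the empty (sentinel) element."""
    n = 0
    while table[n]["procname"] is not None:
        n += 1
    return n


def count_with_size(table):
    """New style: the number of entries is known, so no sentinel is needed."""
    return len(table)
```

Dropping the trailing empty element saves one table entry per array, which is where the per-sentinel saving quoted in the changelog comes from.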
|
|
The value is no longer used as we now keep track of max_allowed_capacity
for each task instead.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-4-qyousef@layalina.io
|
|
So that we can use it to iterate through available capacities in the
system. Sort asym_cap_list in descending order as expected users are
likely to be interested in the highest capacity first.
Make the list RCU protected to allow for cheap access in hot paths.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240324004552.999936-2-qyousef@layalina.io
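Keeping a capacity list in descending order amounts to a sorted insert. The following Python sketch shows the idea under stated assumptions: the helper name `asym_cap_insert` is hypothetical, duplicates are assumed to be tracked once, and the RCU protection mentioned in the changelog is not modeled.

```python
def asym_cap_insert(cap_list, capacity):
    """Insert capacity into a list kept in descending order, without duplicates."""
    for i, existing in enumerate(cap_list):
        if existing == capacity:
            return cap_list              # already tracked
        if existing < capacity:
            cap_list.insert(i, capacity)  # first smaller entry: insert before it
            return cap_list
    cap_list.append(capacity)             # smallest so far: goes to the tail
    return cap_list
```

A reader of such a list sees the highest capacity first and can stop iterating as soon as a capacity is too small for its purpose.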
|
|
SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that
it's sharing all 'package resources' - while in reality it's specifically
for sharing the LLC only.
Rename it to SD_SHARE_LLC to reduce confusion.
[ mingo: Rewrote the confusing changelog as well. ]
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org
|
|
These flags are already documented in include/linux/sched/sd_flags.h.
Also, add missing SD_CLUSTER and keep the comment on SD_ASYM_PACKING
as it is a special case.
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-1-alexs@kernel.org
|
|
For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle CPU within the cluster of the
target CPU before scanning the whole LLC to gain lower latency. This
is implemented in 2 steps in select_idle_sibling():
1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
if they share a cluster with the target CPU. Otherwise, try to
scan for an idle CPU in the target's cluster.
2. Scan the cluster prior to the LLC of the target CPU for an
idle CPU to wake up.
Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and to
two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each cluster
has 4 CPUs. With this patch, we noticed improvements on tbench and netperf
within one NUMA node or across two NUMA nodes on top of tip-sched-core commit
9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")
tbench results (node 0):
baseline patched
1: 327.2833 372.4623 ( 13.80%)
4: 1320.5933 1479.8833 ( 12.06%)
8: 2638.4867 2921.5267 ( 10.73%)
16: 5282.7133 5891.5633 ( 11.53%)
32: 9810.6733 9877.3400 ( 0.68%)
64: 7408.9367 7447.9900 ( 0.53%)
128: 6203.2600 6191.6500 ( -0.19%)
tbench results (node 0-1):
baseline patched
1: 332.0433 372.7223 ( 12.25%)
4: 1325.4667 1477.6733 ( 11.48%)
8: 2622.9433 2897.9967 ( 10.49%)
16: 5218.6100 5878.2967 ( 12.64%)
32: 10211.7000 11494.4000 ( 12.56%)
64: 13313.7333 16740.0333 ( 25.74%)
128: 13959.1000 14533.9000 ( 4.12%)
netperf results TCP_RR (node 0):
baseline patched
1: 76546.5033 90649.9867 ( 18.42%)
4: 77292.4450 90932.7175 ( 17.65%)
8: 77367.7254 90882.3467 ( 17.47%)
16: 78519.9048 90938.8344 ( 15.82%)
32: 72169.5035 72851.6730 ( 0.95%)
64: 25911.2457 25882.2315 ( -0.11%)
128: 10752.6572 10768.6038 ( 0.15%)
netperf results TCP_RR (node 0-1):
baseline patched
1: 76857.6667 90892.2767 ( 18.26%)
4: 78236.6475 90767.3017 ( 16.02%)
8: 77929.6096 90684.1633 ( 16.37%)
16: 77438.5873 90502.5787 ( 16.87%)
32: 74205.6635 88301.5612 ( 19.00%)
64: 69827.8535 71787.6706 ( 2.81%)
128: 25281.4366 25771.3023 ( 1.94%)
netperf results UDP_RR (node 0):
baseline patched
1: 96869.8400 110800.8467 ( 14.38%)
4: 97744.9750 109680.5425 ( 12.21%)
8: 98783.9863 110409.9637 ( 11.77%)
16: 99575.0235 110636.2435 ( 11.11%)
32: 95044.7250 97622.8887 ( 2.71%)
64: 32925.2146 32644.4991 ( -0.85%)
128: 12859.2343 12824.0051 ( -0.27%)
netperf results UDP_RR (node 0-1):
baseline patched
1: 97202.4733 110190.1200 ( 13.36%)
4: 95954.0558 106245.7258 ( 10.73%)
8: 96277.1958 105206.5304 ( 9.27%)
16: 97692.7810 107927.2125 ( 10.48%)
32: 79999.6702 103550.2999 ( 29.44%)
64: 80592.7413 87284.0856 ( 8.30%)
128: 27701.5770 29914.5820 ( 7.99%)
Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
in the code has not been tested but it is supposed to work.
Chen Yu also noticed this will improve the performance of tbench and
netperf on a 24-CPU Jacobsville machine, where 4 CPUs in one cluster
share the L2 cache.
[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
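The "cluster first, then the rest of the LLC" scan order described in this changelog can be sketched as follows. This is a simplified Python model, not the kernel's select_idle_sibling(): the function name `pick_idle_cpu`, the `cluster_of` and `llc_of` dictionaries and the `None` return value are all assumptions, and the prev_cpu/recent_used_cpu shortcuts of step 1 are left out.

```python
def pick_idle_cpu(target, idle, cluster_of, llc_of):
    """Pick an idle CPU near target: same cluster first, then the rest of the LLC.

    idle       -- set of currently idle CPU ids
    cluster_of -- dict mapping CPU id to cluster id
    llc_of     -- dict mapping CPU id to last-level-cache id
    """
    llc = sorted(c for c in llc_of if llc_of[c] == llc_of[target])
    cluster = [c for c in llc if cluster_of[c] == cluster_of[target]]
    for cpu in cluster:                  # cheapest candidates: same cluster
        if cpu in idle:
            return cpu
    for cpu in llc:                      # fall back to the remaining LLC CPUs
        if cpu in idle and cpu not in cluster:
            return cpu
    return None                          # no idle CPU found in this LLC
```

With 8 CPUs in one LLC and clusters of 4, a wakeup targeting CPU 1 prefers idle CPU 3 over idle CPU 5 because CPU 3 shares the cluster.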
|
|
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.
On a machine with clusters cpus_share_resources() will test whether
two CPUs are within the same cluster. On a non-cluster machine it
will behave the same as cpus_share_cache(). So we use "resources"
here for cache resources.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
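The behavior described for cpus_share_resources() can be modeled as a lookup that falls back to the LLC when the machine has no clusters. This Python sketch is illustrative only: the function name `share_resources` and the `cluster_of` / `llc_of` dictionaries are assumptions, and passing `None` for `cluster_of` stands in for "no cluster level on this machine".

```python
def share_resources(a, b, cluster_of, llc_of):
    """True if CPUs a and b share a cluster, or share the LLC when there are no clusters."""
    if cluster_of is not None:
        return cluster_of[a] == cluster_of[b]
    # Non-cluster machine: same answer as a cache-sharing check.
    return llc_of[a] == llc_of[b]
```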
|
|
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)
Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.
[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurrences. ]
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
|
|
Move it out of the .c file into the shared scheduler-internal header file,
to gain type-checking.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
|
|
The 'sched_energy_aware' sysctl is available for the admin to disable/enable
energy aware scheduling (EAS). EAS is enabled only if a few conditions are
met by the platform: asymmetric CPU capacity, no SMT, the schedutil CPUfreq
governor, frequency invariant load tracking, etc.
A platform may boot without EAS capability, but could gain such
capability at runtime, for example by changing/registering the cpufreq
governor to schedutil.
At present, even when the platform doesn't support EAS, this sysctl returns 1,
and it ends up calling build_perf_domains on a write of 1 and is a
NOP on a write of 0. That is confusing and unnecessary.
The desired behavior is for this sysctl to enable/disable EAS
on supported platforms. On a non-supported platform, a write to the sysctl
returns a not-supported error and a read of the sysctl returns
empty. So:
sched_energy_aware returns empty - EAS is not possible at this moment.
This includes EAS capable platforms which have at least one EAS
condition false during startup, e.g. not using the schedutil cpufreq governor.
sched_energy_aware returns 0 - EAS is supported but disabled by admin.
sched_energy_aware returns 1 - EAS is supported and enabled.
The user can find out the reason why EAS is not possible by checking
info messages. sched_is_eas_possible returns true if the platform
can do EAS at this moment.
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
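The three observable states of the sysctl can be captured in a small model. This is a Python sketch of the behavior the changelog describes, not the kernel handler: the class name `SchedEnergyAware` and its methods are hypothetical, and the choice of `EOPNOTSUPP` for the "not supported" error is an assumption.

```python
import errno


class SchedEnergyAware:
    """Toy model of the sched_energy_aware sysctl semantics."""

    def __init__(self, eas_possible, enabled=True):
        self.eas_possible = eas_possible  # platform can do EAS right now
        self.enabled = enabled            # admin setting

    def read(self):
        if not self.eas_possible:
            return ""                     # empty: EAS is not possible at this moment
        return "1" if self.enabled else "0"

    def write(self, value):
        if not self.eas_possible:
            # Assumed errno for the "not supported" error.
            raise OSError(errno.EOPNOTSUPP, "EAS not supported")
        self.enabled = value == "1"
```

A platform that later satisfies all EAS conditions would flip `eas_possible` to true, after which reads return "0" or "1" and writes take effect.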
|
|
The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.
To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.
The current complexity limit was set in:
b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")
... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:
390031e4c309 ("sched/fair: Introduce an energy estimation helper function")
... but the complexity of EAS was reduced in:
eb92692b2544d ("sched/fair: Speed-up energy-aware wake-ups")
... and find_energy_efficient_cpu() (feec) algorithm was updated in:
3e8c6c9aac42 ("sched/fair: Remove task_util from effective utilization in feec()")
find_energy_efficient_cpu() (feec) is now doing:
feec()
\_ for_each_pd(pd) [0]
   // get max_spare_cap_cpu and compute_prev_delta
   \_ for_each_cpu(pd) [1]
   \_ eenv_pd_busy_time(pd) [2]
      \_ for_each_cpu(pd)
   // compute_energy(pd) without the task
   \_ eenv_pd_max_util(pd, -1) [3.0]
      \_ for_each_cpu(pd)
   \_ em_cpu_energy(pd, -1)
      \_ for_each_ps(pd)
   // compute_energy(pd) with the task on prev_cpu
   \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
      \_ for_each_cpu(pd)
   \_ em_cpu_energy(pd, prev_cpu)
      \_ for_each_ps(pd)
   // compute_energy(pd) with the task on max_spare_cap_cpu
   \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
      \_ for_each_cpu(pd)
   \_ em_cpu_energy(pd, max_spare_cap_cpu)
      \_ for_each_ps(pd)
[3.1] happens only once since prev_cpu is unique. With the same
definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:
nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
+ ([nr_cpus in pd] + [nr_ps in pd])
[0] * ( [1] + [2] + [3.0] + [3.2] )
+ [3.1]
= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
+ [nr_cpus in prev pd] + nr_ps
The complexity limit was set to 2048 in:
b68a4c0dba3b1 ("sched/topology: Disable EAS on inappropriate platforms")
... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:
16 * (4 + 2 * 7) + 1 + 7 = 296
Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.
For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:
112 * (4 + 2 * 7) + 1 + 7 = 2024
To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.
Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.
[ mingo: Updates to the changelog. ]
Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
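The complexity formula from this changelog can be evaluated directly to reproduce the two worked figures. This Python sketch assumes every performance domain has the same number of CPUs and performance states; the function name `feec_complexity` and its parameter names are hypothetical.

```python
def feec_complexity(nr_pd, cpus_per_pd, ps_per_pd):
    """Evaluate nr_pd * (4 * nr_cpus + 2 * nr_ps) + nr_cpus + nr_ps.

    Assumes uniform performance domains (same CPU and state count in each).
    """
    per_pd = 4 * cpus_per_pd + 2 * ps_per_pd  # terms [1] + [2] + [3.0] + [3.2]
    prev_pd = cpus_per_pd + ps_per_pd         # term [3.1], paid once
    return nr_pd * per_pd + prev_pd
```

With per-CPU DVFS (one CPU per domain) and 7 performance states, 16 domains give 296 and 112 domains give 2024, the two values quoted in the changelog.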
|
|
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.
The scheduler uses 3 methods to get access to a CPU's max compute capacity:
- arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.
- cpu_capacity_orig field which is periodically updated with
arch_scale_cpu_capacity().
- capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.
There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:
- either a per_cpu variable.
- or a const value for systems which have only one capacity.
Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.
No functional changes.
Some performance tests on Arm64:
- small SMP device (hikey): no noticeable changes
- HMP device (RB5): hackbench shows minor improvement (1-2%)
- large SMP (thx2): hackbench and tbench show minor improvement (1%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
|
|
Reword sched_numa_find_nth_cpu() comment and make it kernel-doc compatible.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-7-yury.norov@gmail.com
|
|
sched_numa_find_nth_cpu() doesn't handle NUMA_NO_NODE properly, and
may crash the kernel if passed it. On the other hand, the only user
of sched_numa_find_nth_cpu() has to check the NUMA_NO_NODE case explicitly.
It would be easier for users if this logic were moved into
sched_numa_find_nth_cpu().
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-6-yury.norov@gmail.com
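Moving the NUMA_NO_NODE check into the lookup itself can be pictured as an early return before any per-node data is touched. This is a Python sketch under stated assumptions, not the kernel function: `find_nth_cpu`, the `hop_masks` layout (a list of CPU sets per node, nearest hop first) and the `None` result for an out-of-range index are all hypothetical.

```python
NUMA_NO_NODE = -1


def find_nth_cpu(cpus, n, node, hop_masks):
    """Return the n-th CPU (0-based) of cpus, preferring CPUs near node."""
    if node == NUMA_NO_NODE:
        # No locality preference: plain n-th CPU, no per-node lookup at all.
        ordered = sorted(cpus)
        return ordered[n] if n < len(ordered) else None
    seen = []
    for mask in hop_masks[node]:          # nearest hop first
        for cpu in sorted(mask & cpus):
            if cpu not in seen:
                seen.append(cpu)
    return seen[n] if n < len(seen) else None
```

With this shape a caller no longer needs its own NUMA_NO_NODE special case before calling the helper.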
|
|
When the node provided by the user is CPU-less, the corresponding record in
sched_domains_numa_masks is not set. Trying to dereference it in the
following code leads to a kernel crash.
To avoid it, start searching from the nearest node with CPUs.
Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
Reported-by: Yicong Yang <yangyicong@hisilicon.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-4-yury.norov@gmail.com
Closes: https://lore.kernel.org/lkml/CAAH8bW8C5humYnfpW3y5ypwx0E-09A3QxFE1JFzR66v+mO4XfA@mail.gmail.com/T/
Closes: https://lore.kernel.org/lkml/ZMHSNQfv39HN068m@yury-ThinkPad/T/#mf6431cb0b7f6f05193c41adeee444bc95bf2b1c4
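Starting the search from the nearest node that has CPUs can be sketched as a small selection step. This Python model is illustrative only: `nearest_node_with_cpus`, the `cpus_of_node` mapping and the `distance` table are hypothetical, and ties are broken by node id as an arbitrary assumption.

```python
def nearest_node_with_cpus(node, cpus_of_node, distance):
    """Return node itself if it has CPUs, else the closest node that does.

    cpus_of_node -- dict mapping node id to its set of CPUs (may be empty)
    distance     -- dict of dicts: distance[a][b] is the distance from a to b
    """
    if cpus_of_node[node]:
        return node
    candidates = [n for n in cpus_of_node if cpus_of_node[n]]
    # Closest populated node; ties go to the lower node id (assumption).
    return min(candidates, key=lambda n: (distance[node][n], n))
```

The sketch raises `ValueError` if no node has any CPUs, a case a real system cannot be in.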
|
|
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.
The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6beb ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]
Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
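The flag alignment described in this changelog can be illustrated with a toy domain chain. This is a Python sketch, not the kernel data structures: the `Domain` class, the `link` and `remove_degenerate` helpers and the numeric flag values are hypothetical, and clearing the flags when no child remains is an assumption.

```python
SD_SHARE_CPUCAPACITY = 0x1   # illustrative bit values, not the kernel's
SD_SHARE_LLC = 0x2


class Domain:
    def __init__(self, name, flags):
        self.name, self.flags = name, flags
        self.parent = self.child = None
        self.local_group_flags = 0


def link(child, parent):
    """Stack parent on top of child; the local group inherits the child's flags."""
    child.parent, parent.child = parent, child
    parent.local_group_flags = child.flags


def remove_degenerate(domain):
    """Unlink domain and realign its parent's local group flags with the new child."""
    child, parent = domain.child, domain.parent
    if child:
        child.parent = parent
    if parent:
        parent.child = child
        # Assumption: with no child left, the group flags are cleared.
        parent.local_group_flags = child.flags if child else 0
```

Copying the whole flag word, rather than a single bit, is what lets later users rely on any child flag being visible in the local group.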
|
|
When balancing sibling domains that have different numbers of cores,
the number of tasks in each sibling domain should be proportional to the
number of cores in that domain. In preparation for implementing such a
policy, record the number of cores in a scheduling group.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
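Counting cores rather than CPUs means counting each set of SMT siblings once. The following Python sketch shows one way to do that under stated assumptions: `count_cores` and the `smt_siblings` mapping (CPU id to the set of CPUs on the same core) are hypothetical names, not kernel interfaces.

```python
def count_cores(group_cpus, smt_siblings):
    """Count cores in a group: each set of SMT siblings counts once.

    group_cpus   -- iterable of CPU ids in the scheduling group
    smt_siblings -- dict mapping CPU id to the set of CPUs sharing its core
    """
    remaining = set(group_cpus)
    cores = 0
    while remaining:
        cpu = min(remaining)
        cores += 1
        # Drop this CPU and all of its siblings so the core is counted once.
        remaining -= smt_siblings.get(cpu, set()) | {cpu}
    return cores
```

A group with CPUs {0, 1, 2, 4, 5}, where 0/4 and 1/5 are sibling pairs and 2 has no sibling, has 3 cores.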
|
|
When using a cpufreq governor that uses
cpufreq_add_update_util_hook(), it is possible to trigger a missing
update_rq_clock() warning for the CPU hotplug path:
rq_attach_root()
  set_rq_offline()
    rq_offline_rt()
      __disable_runtime()
        sched_rt_rq_enqueue()
          enqueue_top_rt_rq()
            cpufreq_update_util()
              data->func(data, rq_clock(rq), flags)
Move update_rq_clock() from sched_cpu_deactivate() (one of its
callers) into set_rq_offline() such that it covers all
set_rq_offline() usage.
Additionally change rq_attach_root() to use rq_lock_irqsave() so that
it will properly manage the runqueue clock flags.
Suggested-by: Ben Segall <bsegall@google.com>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-2-jiahao.os@bytedance.com
|
|
All callers of set_sched_topology() are within __init section. Mark
it __init too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
|
|
When a degenerate cluster domain for core with SMT CPUs is removed,
the SD_SHARE_CPUCAPACITY flag in the local child sched group was not
propagated to the new parent. We need this flag to properly determine
whether the local sched group is SMT. Set the flag in the local
child sched group of the new parent sched domain.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/73cf0959eafa53c02e7ef6bf805d751d9190e55d.1683156492.git.tim.c.chen@linux.intel.com
|
|
smatch reports
kernel/sched/topology.c:212:1: warning:
symbol 'sched_energy_mutex' was not declared. Should it be static?
kernel/sched/topology.c:213:6: warning:
symbol 'sched_energy_update' was not declared. Should it be static?
These variables are only used in topology.c, so should be static
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230314144818.1453523-1-trix@redhat.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existence
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
|
|
Although prev_hop is used only when cur_hop is not the first hop,
it is initialized unconditionally.
Because the initialization implies dereferencing, the code may
dereference uninitialized memory, which has been
spotted by KASAN. Fix it by reorganizing the hop_cmp() logic.
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()")
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Link: https://lore.kernel.org/r/Y+7avK6V9SyAWsXi@yury-laptop/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tariq has pointed out that drivers allocating IRQ vectors would benefit
from having smarter NUMA-awareness - cpumask_local_spread() only knows
about the local node and everything outside is in the same bucket.
sched_domains_numa_masks is pretty much what we want to hand out (a cpumask
of CPUs reachable within a given distance budget), introduce
sched_numa_hop_mask() to export those cpumasks.
Link: http://lore.kernel.org/r/20220728191203.4055-1-tariqt@nvidia.com
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
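The idea of "a cpumask of CPUs reachable within a given distance budget" can be modelled roughly as follows. This is a userspace sketch, not the kernel implementation: the helper name hop_masks(), the three-node distance table and the node-to-CPU map are all hypothetical, and Python sets stand in for cpumasks.

```python
def hop_masks(node, distances, node_cpus):
    """Return one CPU set per hop for `node`, nearest hop first.

    distances: square matrix of node distances (hypothetical values).
    node_cpus: list of CPU sets, indexed by node (hypothetical layout).
    """
    # One hop level per distinct distance found anywhere in the table.
    levels = sorted({d for row in distances for d in row})
    masks = []
    for budget in levels:
        cpus = set()
        for other, dist in enumerate(distances[node]):
            if dist <= budget:
                cpus |= node_cpus[other]
        masks.append(cpus)
    return masks


# Hypothetical 3-node line topology: node 0 - node 1 - node 2.
distances = [
    [10, 20, 30],
    [20, 10, 20],
    [30, 20, 10],
]
node_cpus = [{0, 1}, {2, 3}, {4, 5}]

masks_node0 = hop_masks(0, distances, node_cpus)
masks_node1 = hop_masks(1, distances, node_cpus)
```

Each mask is a superset of the previous one, so a driver walking the hops from nearest to farthest sees the local CPUs first and only then the more remote ones.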
|
|
The function finds the Nth set CPU in a given cpumask, starting from a
given node.
Because each hop in sched_domains_numa_masks includes the same or a
greater number of CPUs than the previous one, we can use a binary
search over hops instead of a linear walk, which brings the overall
complexity to O(log n) in terms of the number of cpumask_weight() calls.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
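The binary search over hops described in this commit can be sketched in userspace as below. This is an illustration under stated assumptions, not the kernel code: find_nth_cpu() is a hypothetical name, Python sets stand in for cpumasks, len() stands in for cpumask_weight(), N is 0-based, and None is returned when fewer than N+1 CPUs are available. The previous hop is only looked at when the selected hop is not the first one.

```python
def find_nth_cpu(hop_masks, allowed, n):
    """Return the n-th (0-based) allowed CPU, nearest hops first, or None.

    hop_masks: list of CPU sets, each a superset of the previous one.
    allowed:   set of CPUs the caller may use.
    """
    if not hop_masks:
        return None
    lo, hi = 0, len(hop_masks) - 1
    # Even the widest hop does not contain n + 1 allowed CPUs.
    if len(hop_masks[hi] & allowed) <= n:
        return None
    # Find the first hop whose allowed weight exceeds n.
    while lo < hi:
        mid = (lo + hi) // 2
        if len(hop_masks[mid] & allowed) > n:
            hi = mid
        else:
            lo = mid + 1
    if lo == 0:
        # First hop: there is no previous hop to subtract.
        return sorted(hop_masks[0] & allowed)[n]
    prev = hop_masks[lo - 1] & allowed
    # CPUs newly reachable at this hop, after those counted in earlier hops.
    new = sorted((hop_masks[lo] & allowed) - prev)
    return new[n - len(prev)]


hops = [{0, 1}, {0, 1, 2, 3}, {0, 1, 2, 3, 4, 5}]
allowed = {1, 2, 3, 5}
order = [find_nth_cpu(hops, allowed, i) for i in range(5)]
```

The loop evaluates the weight of O(log n) hops rather than every hop, which is where the complexity claim in the message comes from.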
|
|
sched_init_domains() is only used in initialization
Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230105014943.9857-1-huangbing775@126.com
|
|
init_defrootdomain is only used in initialization
Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/20221118034208.267330-1-huangbing775@126.com
|
|
For a single LLC per node, a NUMA imbalance is allowed up until 25%
of CPUs sharing a node could be active. One intent of the cut-off is
to avoid an imbalance of memory channels but there is no topological
information based on active memory channels. Furthermore, there can
be differences between nodes depending on the number of populated
DIMMs.
A cut-off of 25% was arbitrary but generally worked. It does have a severe
corner case though: a parallel workload using 25% of all available
CPUs can over-saturate memory channels. This can happen due to the initial
forking of tasks that get pulled more to one node after early wakeups
(e.g. a barrier synchronisation) that is not quickly corrected by the
load balancer. The LB may fail to act quickly as the parallel tasks are
considered to be poor migrate candidates due to locality or cache hotness.
On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
assuming all memory channels are populated and is used as the new cut-off
point. A minimum of 1 is specified to allow a communicating pair to
remain local even for CPUs with low numbers of cores. Modern AMD CPUs
have multiple LLCs per node and are not affected.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
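The arithmetic behind the new cut-off can be sketched as follows for the single-LLC-per-node case. The function name and the `fraction_shift` parameter are hypothetical; only the 25% and 12.5% cut-offs and the minimum of 1 come from the message above.

```python
def allowed_numa_imbalance(cpus_per_node, fraction_shift=3):
    """Number of tasks allowed before NUMA balancing kicks in.

    fraction_shift=3 models the 12.5% (1/8) cut-off, fraction_shift=2 the
    older 25% (1/4) one. The result is clamped to a minimum of 1 so that a
    communicating pair can stay on the same node.
    """
    return max(1, cpus_per_node >> fraction_shift)


new_cutoff_48 = allowed_numa_imbalance(48)       # 12.5% of 48 CPUs
old_cutoff_48 = allowed_numa_imbalance(48, 2)    # 25% of 48 CPUs
small_node = allowed_numa_imbalance(4)           # clamped to the minimum
```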
|
|
Move the energy_aware sysctls to topology.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Merge the scheduler build speedup of the fast-headers tree.
Cumulative scheduler (kernel/sched/) build time speedup on a
Linux distribution's config, which enables all scheduler features,
compared to the vanilla kernel:
_____________________________________________________________________________
|
| Vanilla kernel (v5.13-rc7):
|_____________________________________________________________________________
|
| Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
| 126,975,564,374 instructions # 1.45 insn per cycle ( +- 0.00% )
| 87,637,847,671 cycles # 3.959 GHz ( +- 0.30% )
| 22,136.96 msec cpu-clock # 7.499 CPUs utilized ( +- 0.29% )
|
| 2.9520 +- 0.0169 seconds time elapsed ( +- 0.57% )
|_____________________________________________________________________________
|
| Patched kernel:
|_____________________________________________________________________________
|
| Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
| 50,420,496,914 instructions # 1.47 insn per cycle ( +- 0.00% )
| 34,234,322,038 cycles # 3.946 GHz ( +- 0.31% )
| 8,675.81 msec cpu-clock # 3.053 CPUs utilized ( +- 0.45% )
|
| 2.8420 +- 0.0181 seconds time elapsed ( +- 0.64% )
|_____________________________________________________________________________
Summary:
- CPU time used to build the scheduler dropped by 60.9%, a reduction
from 22.1 clock-seconds to 8.7 clock-seconds.
- Wall-clock time to build the scheduler dropped by 3.9%, a reduction
from 2.95 seconds to 2.84 seconds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
build_sched_domains
While investigating the sparse warning reported by the LKP bot [1],
observed that we have a redundant variable "top" in the function
build_sched_domains that was introduced in the recent commit
e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when
SD_NUMA spans multiple LLCs")
The existing variable "sd" suffices which allows us to remove the
redundant variable "top" while annotating the other variable "top_p"
with the "__rcu" annotation to silence the sparse warning.
[1] https://lore.kernel.org/lkml/202202170853.9vofgC3O-lkp@intel.com/
Fixes: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20220218162743.1134-1-kprateek.nayak@amd.com
|
|
files there
Collect all utility functionality source code files into a single kernel/sched/build_utility.c file,
via #include-ing the .c files:
kernel/sched/clock.c
kernel/sched/completion.c
kernel/sched/loadavg.c
kernel/sched/swait.c
kernel/sched/wait_bit.c
kernel/sched/wait.c
CONFIG_CPU_FREQ:
kernel/sched/cpufreq.c
CONFIG_CPU_FREQ_GOV_SCHEDUTIL:
kernel/sched/cpufreq_schedutil.c
CONFIG_CGROUP_CPUACCT:
kernel/sched/cpuacct.c
CONFIG_SCHED_DEBUG:
kernel/sched/debug.c
CONFIG_SCHEDSTATS:
kernel/sched/stats.c
CONFIG_SMP:
kernel/sched/cpupri.c
kernel/sched/stop_task.c
kernel/sched/topology.c
CONFIG_SCHED_CORE:
kernel/sched/core_sched.c
CONFIG_PSI:
kernel/sched/psi.c
CONFIG_MEMBARRIER:
kernel/sched/membarrier.c
CONFIG_CPU_ISOLATION:
kernel/sched/isolation.c
CONFIG_SCHED_AUTOGROUP:
kernel/sched/autogroup.c
The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into
a single build unit.
The build time of build_utility.c also roughly matches the build time of core.c and
fair.c - allowing better load-balancing of scheduler-only rebuilds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
|
Refer to housekeeping APIs using single feature types instead of flags.
This prevents passing multiple isolation features at once to
housekeeping interfaces, which soon won't be possible anymore as each
isolation feature will have its own cpumask.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
|
|
The NUMA topology parameters (sched_numa_topology_type,
sched_domains_numa_levels, sched_max_numa_distance, etc.)
identified by the scheduler may be wrong for systems with CPU-less nodes.
For example, the ACPI SLIT of a system with CPU-less persistent
memory (Intel Optane DCPMM) nodes is as follows,
[000h 0000 4] Signature : "SLIT" [System Locality Information Table]
[004h 0004 4] Table Length : 0000042C
[008h 0008 1] Revision : 01
[009h 0009 1] Checksum : 59
[00Ah 0010 6] Oem ID : "XXXX"
[010h 0016 8] Oem Table ID : "XXXXXXX"
[018h 0024 4] Oem Revision : 00000001
[01Ch 0028 4] Asl Compiler ID : "INTL"
[020h 0032 4] Asl Compiler Revision : 20091013
[024h 0036 8] Localities : 0000000000000004
[02Ch 0044 4] Locality 0 : 0A 15 11 1C
[030h 0048 4] Locality 1 : 15 0A 1C 11
[034h 0052 4] Locality 2 : 11 1C 0A 1C
[038h 0056 4] Locality 3 : 1C 11 1C 0A
While the `numactl -H` output is as follows,
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 64136 MB
node 0 free: 5981 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 64466 MB
node 1 free: 10415 MB
node 2 cpus:
node 2 size: 253952 MB
node 2 free: 253920 MB
node 3 cpus:
node 3 size: 253952 MB
node 3 free: 253951 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10
In this system, there are only 2 sockets. In each memory controller,
both DRAM and PMEM DIMMs are installed. Although the physical NUMA
topology is simple, the logical NUMA topology becomes a little
complex. Because both distance(0, 1) and distance(1, 3) are less
than distance(0, 3), it appears that node 1 sits between node 0
and node 3, and the whole system appears to be a glueless mesh NUMA
topology type. But it definitely is not: there is not even a CPU in node 3.
This isn't a practical problem yet, because the PMEM nodes (node
2 and node 3 in the example system) are offlined by default during system
boot. So init_numa_topology_type() called during system boot will
ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
init_numa_topology_type() is only called at runtime when a CPU of a
never-onlined-before node gets plugged in. And there's no CPU in the
PMEM nodes. But it appears better to fix this to make the code more
robust.
To test the potential problem, we used a debug patch to call
init_numa_topology_type() when the PMEM node is onlined (in
__set_migration_target_nodes()). With that, the NUMA parameters
identified by the scheduler are as follows,
sched_numa_topology_type: NUMA_GLUELESS_MESH
sched_domains_numa_levels: 4
sched_max_numa_distance: 28
To fix the issue, the CPU-less nodes are ignored when the NUMA topology
parameters are identified. Because a node may become CPU-less or not
at run time because of CPU hotplug, the NUMA topology parameters need
to be re-initialized at runtime for CPU hotplug too.
With the patch, the NUMA parameters identified for the example system
above are as follows,
sched_numa_topology_type: NUMA_DIRECT
sched_domains_numa_levels: 2
sched_max_numa_distance: 21
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
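The before/after parameters quoted in this message can be reproduced from the distance table with a small userspace model. This is a sketch under assumptions: the function names are hypothetical, and the classification is only loosely modelled on a "find a pair at maximum distance, then look for an intermediary node" check, not copied from the kernel.

```python
# Distances from the `numactl -H` output above (decimal form of the SLIT).
DISTANCES = [
    [10, 21, 17, 28],
    [21, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]


def topology_type(distances, nodes, nr_levels, max_distance):
    """Classify the topology over the given subset of nodes."""
    if nr_levels <= 2:
        return "NUMA_DIRECT"
    for a in nodes:
        for b in nodes:
            if distances[a][b] < max_distance:
                continue
            # a and b are farthest apart; look for a node sitting between them.
            for c in nodes:
                if (distances[a][c] < max_distance
                        and distances[b][c] < max_distance):
                    return "NUMA_GLUELESS_MESH"
            return "NUMA_BACKPLANE"
    return "NUMA_DIRECT"


def numa_parameters(distances, nodes):
    """Return (levels, max distance, type) considering only `nodes`."""
    levels = sorted({distances[a][b] for a in nodes for b in nodes})
    max_distance = levels[-1]
    kind = topology_type(distances, nodes, len(levels), max_distance)
    return len(levels), max_distance, kind


all_nodes = numa_parameters(DISTANCES, [0, 1, 2, 3])   # CPU-less nodes included
cpu_nodes = numa_parameters(DISTANCES, [0, 1])         # CPU-less nodes ignored
```

Restricting the node set to the nodes that have CPUs is the whole change in this model: the distinct distances shrink from {10, 17, 21, 28} to {10, 21}.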
|
|
In some places, kernel/sched code calls cpumask_weight() to check if
any bit of a given cpumask is set. We can do it more efficiently with
cpumask_empty() because cpumask_empty() stops traversing the cpumask as
soon as it finds the first set bit, while cpumask_weight() counts all bits
unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220210224933.379149-23-yury.norov@gmail.com
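The difference between the two checks can be illustrated with a toy bitmap stored as a list of integer words. The helper names mask_weight() and mask_empty() are hypothetical stand-ins for the kernel functions, not their implementation.

```python
def mask_weight(words):
    """Count every set bit; always visits all words."""
    return sum(bin(w).count("1") for w in words)


def mask_empty(words):
    """Return True if no bit is set; stops at the first non-zero word."""
    for w in words:
        if w:
            return False
    return True


mask = [0b1000, 0, 0, 0b1]      # hypothetical 4-word cpumask
slow_check = mask_weight(mask) == 0
fast_check = mask_empty(mask)
```

Both checks give the same answer, but mask_empty() returns after looking at the first word here, while mask_weight() has to walk the whole bitmap.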
|
|
Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.
On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are
                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.
Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch:
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v5
Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%)
Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%*
Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%)
Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%)
CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%)
It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.
5.17.0-rc0 5.17.0-rc0
vanilla sched-numaimb-v6
Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%)
Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%)
Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%)
Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%)
Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%)
Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%*
Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%)
Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%)
Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%)
Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%)
Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%)
Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%)
Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%)
Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%)
Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%)
Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%)
Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%)
Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%)
Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%)
Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%)
Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%)
Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%)
Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%)
Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%)
Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%)
Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%)
Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%)
Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%)
Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%)
Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%)
Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%)
Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%)
Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%)
Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%)
Similarly, for embarrassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.
                        vanilla   sched-numaimb-v6
Min      ep.D   31.79 (  0.00%)    26.11 ( 17.87%)
Amean    ep.D   31.86 (  0.00%)    26.17 * 17.86%*
Stddev   ep.D    0.07 (  0.00%)     0.05 ( 24.41%)
CoeffVar ep.D    0.22 (  0.00%)     0.20 (  7.97%)
Max      ep.D   31.93 (  0.00%)    26.21 ( 17.91%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
|
|
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
|
|
Patch series "Fix NUMA without SMP".
SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).
This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.
This patch (of 2):
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.
Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes: a55c7454a8c887b2 ("sched/topology: Improve load balancing on AMD EPYC systems")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Decay max_newidle_lb_cost only when it has not been updated for a while,
and make sure not to decay a recently changed value.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
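The rule stated in this message can be sketched as a small state machine. Everything concrete here is an assumption made for illustration: the class and function names, the decay interval of 1000 ticks and the 253/256 decay factor are not taken from the message.

```python
DECAY_INTERVAL = 1000   # assumed: ticks that must pass before a decay
DECAY_NUM, DECAY_DEN = 253, 256   # assumed decay factor (about 1%)


class Domain:
    """Minimal stand-in for the per-domain bookkeeping."""

    def __init__(self, now=0):
        self.max_newidle_lb_cost = 0
        self.last_decay = now


def update_newidle_cost(sd, cost, now):
    """Record a new maximum, or decay a maximum that has gone stale."""
    if cost > sd.max_newidle_lb_cost:
        # A fresh maximum also restarts the decay timer, so a recently
        # changed value is never decayed.
        sd.max_newidle_lb_cost = cost
        sd.last_decay = now
    elif now - sd.last_decay > DECAY_INTERVAL:
        sd.max_newidle_lb_cost = sd.max_newidle_lb_cost * DECAY_NUM // DECAY_DEN
        sd.last_decay = now


sd = Domain()
update_newidle_cost(sd, 2560, now=10)     # new maximum recorded
update_newidle_cost(sd, 100, now=500)     # too recent: no decay
recent = sd.max_newidle_lb_cost
update_newidle_cost(sd, 100, now=1011)    # stale: decay once
decayed = sd.max_newidle_lb_cost
```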
|
|
The push-IPI logic for RT tasks expects to be invoked from hardirq
context. One reason is that an RT task on the remote CPU would block the
softirq processing on PREEMPT_RT and so prevent pulling / balancing the RT
tasks as intended.
Annotate root_domain::rto_push_work as IRQ_WORK_HARD_IRQ.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211006111852.1514359-2-bigeasy@linutronix.de
|
|
This patch adds a scheduler level for clusters and automatically enables
load balancing among clusters. It will directly benefit many
workloads that want more resources such as memory bandwidth and caches.
Testing has widely been done in two different hardware configurations of
Kunpeng920:
24 cores in one NUMA node (6 clusters in each NUMA node);
32 cores in one NUMA node (8 clusters in each NUMA node)
Workload is running on either one NUMA node or four NUMA nodes, thus,
this can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.
* Stream benchmark:
4threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%)
MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%)
MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%)
MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%)
6threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%)
MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%)
MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%)
MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%)
12threads stream (on 1NUMA * 24cores = 24cores)
stream stream
w/o patch w/ patch
MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%)
MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%)
MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%)
MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%)
Thus, it could help memory-bound workloads, especially under medium load.
Similar improvement is also seen in lkp-pbzip2:
* lkp-pbzip2 benchmark
2-96 threads (on 4NUMA * 24cores = 96cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%*
Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%*
Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%*
Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%*
Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%*
Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%*
Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%*
Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%*
Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%*
2-24 threads (on 1NUMA * 24cores = 24cores)
lkp-pbzip2 lkp-pbzip2
w/o patch w/ patch
Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%*
Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%*
Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%*
Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%*
Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%*
Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%)
Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%*
In the case of 6 threads and 8 threads, we see the greatest performance
improvement.
Similar improvement can be seen on lkp-pixz though the improvement is
smaller:
* lkp-pixz benchmark
2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
lkp-pixz lkp-pixz
w/o patch w/ patch
Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%*
Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%)
Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%*
Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%*
Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%*
Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%*
Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%*
* SPECrate benchmark
4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
Base Base
Run Time Rate
------- ---------
4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3)
8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%)
16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6)
32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 584 87.2 *
502.gcc_r 32 503 90.2 *
505.mcf_r 32 745 69.4 *
520.omnetpp_r 32 1031 40.7 *
523.xalancbmk_r 32 597 56.6 *
525.x264_r 1 -- CE
531.deepsjeng_r 32 336 109 *
541.leela_r 32 556 95.4 *
548.exchange2_r 32 513 163 *
557.xz_r 32 530 65.2 *
Est. SPECrate2017_int_base 80.3
[w/ patch]
Base Base Base
Benchmarks Copies Run Time Rate
--------------- ------- --------- ---------
500.perlbench_r 32 580 87.8 (+0.688%) *
502.gcc_r 32 477 95.1 (+5.432%) *
505.mcf_r 32 644 80.3 (+13.574%) *
520.omnetpp_r 32 942 44.6 (+9.58%) *
523.xalancbmk_r 32 560 60.4 (+6.714%) *
525.x264_r 1 -- CE
531.deepsjeng_r 32 337 109 (+0.000%) *
541.leela_r 32 554 95.6 (+0.210%) *
548.exchange2_r 32 515 163 (+0.000%) *
557.xz_r 32 524 66.0 (+1.227%) *
Est. SPECrate2017_int_base 83.7 (+4.062%)
On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:
* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
kernbench kernbench
w/o cluster w/ cluster
Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%)
Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%)
Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%)
Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%)
Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%)
Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%)
Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%)
Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%)
Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%)
Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%*
Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%*
Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%)
Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%*
Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%*
Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%*
Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%)
Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%*
Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%)
Note that this patch isn't a universal win; it might hurt workloads
that benefit from packing. Tasks that want to take
advantage of the lower communication latency within one cluster are not
necessarily packed into one cluster while the kernel is not aware of
clusters, but they have some chance of being packed by accident. This
patch will make them more likely to be spread.
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
numa_distance in cpu_attach_domain() was introduced in
commit b5b217346de8 ("sched/topology: Warn when NUMA diameter > 2")
to warn the user when the NUMA diameter > 2, as we would misrepresent
the scheduler topology structures in that case. This was
fixed by Barry in commit 585b6d2723dc ("sched/topology: fix the issue
groups don't span domain->span for NUMA diameter > 2") and
numa_distance is unused now. So remove it.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20210915063158.80639-1-yangyicong@hisilicon.com
|
|
There exist situations in which the load balancer needs to know the
properties of the CPUs in a scheduling group. When using asymmetric
packing, for instance, the load balancer needs to know not only the
state of dst_cpu but also of its SMT siblings, if any.
Use the flags of the child scheduling domains to initialize scheduling
group flags. This will reflect the properties of the CPUs in the
group.
A subsequent changeset will make use of these new flags. No functional
changes are introduced.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com
|
|
The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).
Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.
Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.
Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com
|