author     Linus Torvalds <torvalds@linux-foundation.org>  2023-10-30 13:12:15 -1000
committer  Linus Torvalds <torvalds@linux-foundation.org>  2023-10-30 13:12:15 -1000
commit     63ce50fff9240d66cf3b59663f458f55ba6dcfcc (patch)
tree       c2613289116d6f80a50f534e9c5e4afc26064574 /Documentation
parent     3cf3fabccb9dc821ffaec3ad6bf0cd6b278bd012 (diff)
parent     984ffb6a4366752c949f7b39640aecdce222607f (diff)
Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster
wakeup
NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more
consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded
performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no
state changes
Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler
headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock
guards
- Fix sched_setaffinity() vs. CPU hotplug race
Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
... and misc cleanups & fixes"
* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
sched/fair: Remove SIS_PROP
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add cpus_share_resources API
sched/core: Fix RQCF_ACT_SKIP leak
sched/fair: Remove unused 'curr' argument from pick_next_entity()
sched/nohz: Update comments about NEWILB_KICK
sched/fair: Remove duplicate #include
sched/psi: Update poll => rtpoll in relevant comments
sched: Make PELT acronym definition searchable
sched: Fix stop_one_cpu_nowait() vs hotplug
sched/psi: Bail out early from irq time accounting
sched/topology: Rename 'DIE' domain to 'PKG'
sched/psi: Delete the 'update_total' function parameter from update_triggers()
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
sched/numa: Complete scanning of partial VMAs regardless of PID activity
sched/numa: Move up the access pid reset logic
sched/numa: Trace decisions related to skipping VMAs
...
Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/admin-guide/pm/intel_idle.rst  | 17
-rw-r--r--  Documentation/admin-guide/sysctl/kernel.rst  |  3
-rw-r--r--  Documentation/scheduler/sched-capacity.rst   | 13
-rw-r--r--  Documentation/scheduler/sched-energy.rst     | 29
-rw-r--r--  Documentation/scheduler/sched-rt-group.rst   | 40
5 files changed, 49 insertions, 53 deletions
diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst
index b799a43da62e..39bd6ecce7de 100644
--- a/Documentation/admin-guide/pm/intel_idle.rst
+++ b/Documentation/admin-guide/pm/intel_idle.rst
@@ -170,7 +170,7 @@ and ``idle=nomwait``. If any of them is present in the kernel command line, the
 ``MWAIT`` instruction is not allowed to be used, so the initialization of
 ``intel_idle`` will fail.
 
-Apart from that there are four module parameters recognized by ``intel_idle``
+Apart from that there are five module parameters recognized by ``intel_idle``
 itself that can be set via the kernel command line (they cannot be updated via
 sysfs, so that is the only way to change their values).
 
@@ -216,6 +216,21 @@ are ignored).
 The idle states disabled this way can be enabled (on a per-CPU basis) from user
 space via ``sysfs``.
 
+The ``ibrs_off`` module parameter is a boolean flag (defaults to
+false). If set, it is used to control if IBRS (Indirect Branch Restricted
+Speculation) should be turned off when the CPU enters an idle state.
+This flag does not affect CPUs that use Enhanced IBRS which can remain
+on with little performance impact.
+
+For some CPUs, IBRS will be selected as mitigation for Spectre v2 and Retbleed
+security vulnerabilities by default. Leaving the IBRS mode on while idling may
+have a performance impact on its sibling CPU. The IBRS mode will be turned off
+by default when the CPU enters into a deep idle state, but not in some
+shallower ones. Setting the ``ibrs_off`` module parameter will force the IBRS
+mode to off when the CPU is in any one of the available idle states. This may
+help performance of a sibling CPU at the expense of a slightly higher wakeup
+latency for the idle CPU.
+
 .. _intel-idle-core-and-package-idle-states:
 
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index cf33de56da27..d89ac2bd8dc4 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1182,7 +1182,8 @@ automatically on platforms where it can run (that is,
 platforms with asymmetric CPU topologies and having an Energy
 Model available). If your platform happens to meet the
 requirements for EAS but you do not want to use it, change
-this value to 0.
+this value to 0. On Non-EAS platforms, write operation fails and
+read doesn't return anything.
 
 task_delayacct
 ===============
diff --git a/Documentation/scheduler/sched-capacity.rst b/Documentation/scheduler/sched-capacity.rst
index e2c1cf743158..de414b33dd2a 100644
--- a/Documentation/scheduler/sched-capacity.rst
+++ b/Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
 -------------------
 
 Two different capacity values are used within the scheduler. A CPU's
-``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
-attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
-which some loss of available performance (e.g. time spent handling IRQs) is
-subtracted.
+``original capacity`` is its maximum attainable capacity, i.e. its maximum
+attainable performance level. This original capacity is returned by
+the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
+capacity`` to which some loss of available performance (e.g. time spent
+handling IRQs) is subtracted.
 
 Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
-while ``capacity_orig`` is class-agnostic. The rest of this document will use
-the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
+while ``original capacity`` is class-agnostic. The rest of this document will use
+the term ``capacity`` interchangeably with ``original capacity`` for the sake of
 brevity.
 
 1.3 Platform examples
diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
index fc853c8cc346..70e2921ef725 100644
--- a/Documentation/scheduler/sched-energy.rst
+++ b/Documentation/scheduler/sched-energy.rst
@@ -359,32 +359,9 @@ in milli-Watts or in an 'abstract scale'.
 6.3 - Energy Model complexity
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-The task wake-up path is very latency-sensitive. When the EM of a platform is
-too complex (too many CPUs, too many performance domains, too many performance
-states, ...), the cost of using it in the wake-up path can become prohibitive.
-The energy-aware wake-up algorithm has a complexity of:
-
-    C = Nd * (Nc + Ns)
-
-with: Nd the number of performance domains; Nc the number of CPUs; and Ns the
-total number of OPPs (ex: for two perf. domains with 4 OPPs each, Ns = 8).
-
-A complexity check is performed at the root domain level, when scheduling
-domains are built. EAS will not start on a root domain if its C happens to be
-higher than the completely arbitrary EM_MAX_COMPLEXITY threshold (2048 at the
-time of writing).
-
-If you really want to use EAS but the complexity of your platform's Energy
-Model is too high to be used with a single root domain, you're left with only
-two possible options:
-
-  1. split your system into separate, smaller, root domains using exclusive
-     cpusets and enable EAS locally on each of them. This option has the
-     benefit to work out of the box but the drawback of preventing load
-     balance between root domains, which can result in an unbalanced system
-     overall;
-  2. submit patches to reduce the complexity of the EAS wake-up algorithm,
-     hence enabling it to cope with larger EMs in reasonable time.
+EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
+restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
+the energy estimation.
 
 
 6.4 - Schedutil governor
diff --git a/Documentation/scheduler/sched-rt-group.rst b/Documentation/scheduler/sched-rt-group.rst
index 655a096ec8fb..d685609ed3d7 100644
--- a/Documentation/scheduler/sched-rt-group.rst
+++ b/Documentation/scheduler/sched-rt-group.rst
@@ -39,10 +39,10 @@ Most notable:
 1.1 The problem
 ---------------
 
-Realtime scheduling is all about determinism, a group has to be able to rely on
+Real-time scheduling is all about determinism, a group has to be able to rely on
 the amount of bandwidth (eg. CPU time) being constant. In order to schedule
-multiple groups of realtime tasks, each group must be assigned a fixed portion
-of the CPU time available. Without a minimum guarantee a realtime group can
+multiple groups of real-time tasks, each group must be assigned a fixed portion
+of the CPU time available. Without a minimum guarantee a real-time group can
 obviously fall short. A fuzzy upper limit is of no use since it cannot be
 relied upon. Which leaves us with just the single fixed portion.
 
@@ -50,14 +50,14 @@ relied upon. Which leaves us with just the single fixed portion.
 ----------------
 
 CPU time is divided by means of specifying how much time can be spent running
-in a given period. We allocate this "run time" for each realtime group which
-the other realtime groups will not be permitted to use.
+in a given period. We allocate this "run time" for each real-time group which
+the other real-time groups will not be permitted to use.
 
-Any time not allocated to a realtime group will be used to run normal priority
+Any time not allocated to a real-time group will be used to run normal priority
 tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
 SCHED_OTHER.
 
-Let's consider an example: a frame fixed realtime renderer must deliver 25
+Let's consider an example: a frame fixed real-time renderer must deliver 25
 frames a second, which yields a period of 0.04s per frame. Now say it will also
 have to play some music and respond to input, leaving it with around 80% CPU
 time dedicated for the graphics. We can then give this group a run time of 0.8
@@ -70,7 +70,7 @@ needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
 of 0.00015s.
 
 The remaining CPU time will be used for user input and other tasks. Because
-realtime tasks have explicitly allocated the CPU time they need to perform
+real-time tasks have explicitly allocated the CPU time they need to perform
 their tasks, buffer underruns in the graphics or audio can be eliminated.
 
 NOTE: the above example is not fully implemented yet. We still
@@ -87,18 +87,20 @@ lack an EDF scheduler to make non-uniform periods usable.
 The system wide settings are configured under the /proc virtual file system:
 
 /proc/sys/kernel/sched_rt_period_us:
-  The scheduling period that is equivalent to 100% CPU bandwidth
+  The scheduling period that is equivalent to 100% CPU bandwidth.
 
 /proc/sys/kernel/sched_rt_runtime_us:
-  A global limit on how much time realtime scheduling may use. Even without
-  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
-  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
-  available to all realtime groups.
+  A global limit on how much time real-time scheduling may use. This is always
+  less or equal to the period_us, as it denotes the time allocated from the
+  period_us for the real-time tasks. Even without CONFIG_RT_GROUP_SCHED enabled,
+  this will limit time reserved to real-time processes. With
+  CONFIG_RT_GROUP_SCHED=y it signifies the total bandwidth available to all
+  real-time groups.
 
  * Time is specified in us because the interface is s32. This gives an
   operating range from 1us to about 35 minutes.
 * sched_rt_period_us takes values from 1 to INT_MAX.
- * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+ * sched_rt_runtime_us takes values from -1 to sched_rt_period_us.
 * A run time of -1 specifies runtime == period, ie. no limit.
 
 
@@ -108,7 +110,7 @@ The system wide settings are configured under the /proc virtual file system:
 The default values for sched_rt_period_us (1000000 or 1s) and
 sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by
 SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
-realtime tasks will not lock up the machine but leave a little time to recover
+real-time tasks will not lock up the machine but leave a little time to recover
 it. By setting runtime to -1 you'd get the old behaviour back.
 
 By default all bandwidth is assigned to the root group and new groups get the
@@ -116,10 +118,10 @@ period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
 want to assign bandwidth to another group, reduce the root group's bandwidth
 and assign some or all of the difference to another group.
 
-Realtime group scheduling means you have to assign a portion of total CPU
-bandwidth to the group before it will accept realtime tasks. Therefore you will
-not be able to run realtime tasks as any user other than root until you have
-done that, even if the user has the rights to run processes with realtime
+Real-time group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept real-time tasks. Therefore you will
+not be able to run real-time tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with real-time
 priority!
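
As a quick illustration of the global RT-bandwidth interface documented in sched-rt-group.rst above, here is a minimal userspace sketch. It is not part of this series; only the two /proc paths, the "-1 means runtime == period" convention, and the default values come from the documentation, everything else is illustrative.

    /* rt_bandwidth.c - print the global RT-bandwidth settings described above. */
    #include <stdio.h>
    #include <stdlib.h>

    static long read_long(const char *path)
    {
            FILE *f = fopen(path, "r");
            long val;

            if (!f || fscanf(f, "%ld", &val) != 1) {
                    perror(path);
                    exit(EXIT_FAILURE);
            }
            fclose(f);
            return val;
    }

    int main(void)
    {
            long period  = read_long("/proc/sys/kernel/sched_rt_period_us");
            long runtime = read_long("/proc/sys/kernel/sched_rt_runtime_us");

            printf("sched_rt_period_us  = %ld\n", period);
            printf("sched_rt_runtime_us = %ld\n", runtime);

            if (runtime == -1)      /* -1 means runtime == period, i.e. no limit */
                    printf("RT tasks may use 100%% of every period\n");
            else                    /* e.g. defaults: 950000 / 1000000 = 95% */
                    printf("RT tasks may use %.1f%% of every period\n",
                           100.0 * runtime / period);
            return 0;
    }

With the documented defaults (period 1000000 us, runtime 950000 us) this reports a 95.0% RT share, matching the 0.05s per second left for SCHED_OTHER tasks.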