summaryrefslogtreecommitdiff
path: root/kernel/rcu
AgeCommit message (Collapse)Author
2021-08-30Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
2021-08-17locking/rtmutex: Split out the inner parts of 'struct rtmutex'Peter Zijlstra
RT builds substitutions for rwsem, mutex, spinlock and rwlock around rtmutexes. Split the inner working out so each lock substitution can use them with the appropriate lockdep annotations. This avoids having an extra unused lockdep map in the wrapped rtmutex. No functional change. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211302.784739994@linutronix.de
2021-08-10Merge branches 'doc.2021.07.20c', 'fixes.2021.08.06a', 'nocb.2021.07.20c', ↵Paul E. McKenney
'nolibc.2021.07.20c', 'tasks.2021.07.20c', 'torture.2021.07.27a' and 'torturescript.2021.07.27a' into HEAD doc.2021.07.20c: Documentation updates. fixes.2021.08.06a: Miscellaneous fixes. nocb.2021.07.20c: Callback-offloading (NOCB CPU) updates. nolibc.2021.07.20c: Tiny userspace library updates. tasks.2021.07.20c: Tasks RCU updates. torture.2021.07.27a: In-kernel torture-test updates. torturescript.2021.07.27a: Torture-test scripting updates.
2021-08-10rcu: Replace deprecated CPU-hotplug functionsSebastian Andrzej Siewior
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: rcu@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Print human-readable message for schedule() in RCU readerPaul E. McKenney
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of rcu_note_context_switch() triggers when there is a voluntary context switch in an RCU read-side critical section, but there is quite a gap between the output of that WARN_ON_ONCE() and this RCU-usage error. This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE() that explicitly describes the problem in its message. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Use per_cpu_ptr to get the pointer of per_cpu variableLiu Song
There are a few remaining locations in kernel/rcu that still use "&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does not introduce any functional change. Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Remove useless "ret" update in rcu_gp_fqs_loop()Liu Song
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the return value from swait_event_idle_timeout_exclusive(), but "ret" is unconditionally overwritten later in the code. This commit therefore removes this useless assignment. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Mark accesses in tree_stall.hPaul E. McKenney
This commit marks the accesses in tree_stall.h so as to both avoid undesirable compiler optimizations and to keep KCSAN focused on the accesses of the core algorithm. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Make rcu_gp_init() and rcu_gp_fqs_loop noinline to conserve stackPaul E. McKenney
The kbuild test project found an oversized stack frame in rcu_gp_kthread() for some kernel configurations. This oversizing was due to a very large amount of inlining, which is unnecessary due to the fact that this code executes infrequently. This commit therefore marks rcu_gp_init() and rcu_gp_fqs_loop noinline_for_stack to conserve stack space. Reported-by: kernel test robot <lkp@intel.com> Tested-by: Rong Chen <rong.a.chen@intel.com> [ paulmck: noinline_for_stack per Nathan Chancellor. ] Reviewed-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Mark lockless ->qsmask read in rcu_check_boost_fail()Paul E. McKenney
Accesses to ->qsmask are normally protected by ->lock, but there is an exception in the diagnostic code in rcu_check_boost_fail(). This commit therefore applies data_race() to this access to avoid KCSAN complaining about the C-language writes protected by ->lock. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06srcutiny: Mark read-side data racesPaul E. McKenney
This commit marks some interrupt-induced read-side data races in __srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Start timing stall repetitions after warning completePaul E. McKenney
Systems with low-bandwidth consoles can have very large printk() latencies, and on such systems it makes no sense to have the next RCU CPU stall warning message start output before the prior message completed. This commit therefore sets the time of the next stall only after the prints have completed. While printing, the time of the next stall message is set to ULONG_MAX/2 jiffies into the future. Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()Sergey Senozhatsky
rcu_cpu_stall_reset() is one of the functions virtual CPUs execute during VM resume in order to handle jiffies skew that can trigger false positive stall warnings. Paul has pointed out that this approach is problematic because rcu_cpu_stall_reset() disables RCU grace period stall-detection virtually forever, while in fact it can just restart the stall-detection timeout. Suggested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu/tree: Handle VM stoppage in stall detectionSergey Senozhatsky
The soft watchdog timer function checks if a virtual machine was suspended and hence what looks like a lockup in fact is a false positive. This is what kvm_check_and_clear_guest_paused() does: it tests guest PVCLOCK_GUEST_STOPPED (which is set by the host) and if it's set then we need to touch all watchdogs and bail out. Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED check works fine. There is, however, one more watchdog that runs from IRQ, so watchdog timer fn races with it, and that watchdog is not aware of PVCLOCK_GUEST_STOPPED - RCU stall detector. apic_timer_interrupt() smp_apic_timer_interrupt() hrtimer_interrupt() __hrtimer_run_queues() tick_sched_timer() tick_sched_handle() update_process_times() rcu_sched_clock_irq() This triggers RCU stalls on our devices during VM resume. If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU before watchdog_timer_fn()->kvm_check_and_clear_guest_paused() then there is nothing on this VCPU that touches watchdogs and RCU reads stale gp stall timestamp and new jiffies value, which makes it think that RCU has stalled. Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and don't report RCU stalls when we resume the VM. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Mark accesses to ->rcu_read_lock_nestingPaul E. McKenney
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but in the past, the overhead of marked accesses was excessive. However, that was long ago, and much has changed since then, both in terms of hardware and of compilers. Here is data taken on an eight-core laptop using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built using gcc version 9.3.0, with all data in nanoseconds. Unmarked accesses (status quo), measured by three refscale runs: Minimum reader duration: 3.286 2.851 3.395 Median reader duration: 3.698 3.531 3.4695 Maximum reader duration: 4.481 5.215 5.157 Marked accesses, also measured by three refscale runs: Minimum reader duration: 3.501 3.677 3.580 Median reader duration: 4.053 3.723 3.895 Maximum reader duration: 7.307 4.999 5.511 This focused microbenhmark shows only sub-nanosecond differences which are unlikely to be visible at the system level. This commit therefore marks data-racing accesses to ->rcu_read_lock_nesting. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Weaken ->dynticks accesses and updatesPaul E. McKenney
Accesses to the rcu_data structure's ->dynticks field have always been fully ordered because it was not possible to prove that weaker ordering was safe. However, with the removal of the rcu_eqs_special_set() function and the advent of the Linux-kernel memory model, it is now easy to show that two of the four original full memory barriers can be weakened to acquire and release operations. The remaining pair must remain full memory barriers. This change makes the memory ordering requirements more evident, and it might well also speed up the to-idle and from-idle fastpaths on some architectures. The following litmus test, adapted from one supplied off-list by Frederic Weisbecker, models the RCU grace-period kthread detecting an idle CPU that is concurrently transitioning to non-idle: C dynticks-from-idle { DYNTICKS=0; (* Initially idle. *) } P0(int *X, int *DYNTICKS) { int dynticks; int x; // Idle. dynticks = READ_ONCE(*DYNTICKS); smp_store_release(DYNTICKS, dynticks + 1); smp_mb(); // Now non-idle x = READ_ONCE(*X); } P1(int *X, int *DYNTICKS) { int dynticks; WRITE_ONCE(*X, 1); smp_mb(); dynticks = smp_load_acquire(DYNTICKS); } exists (1:dynticks=0 /\ 0:x=1) Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies this transition, namely, showing that if the RCU grace-period kthread (P1) sees another CPU as idle (P0), then any memory access prior to the start of the grace period (P1's write to X) will be seen by any RCU read-side critical section following the to-non-idle transition (P0's read from X). This is a straightforward use of full memory barriers to force ordering in a store-buffering (SB) litmus test. The following litmus test, also adapted from the one supplied off-list by Frederic Weisbecker, models the RCU grace-period kthread detecting a non-idle CPU that is concurrently transitioning to idle: C dynticks-into-idle { DYNTICKS=1; (* Initially non-idle. *) } P0(int *X, int *DYNTICKS) { int dynticks; // Non-idle. WRITE_ONCE(*X, 1); dynticks = READ_ONCE(*DYNTICKS); smp_store_release(DYNTICKS, dynticks + 1); smp_mb(); // Now idle. } P1(int *X, int *DYNTICKS) { int x; int dynticks; smp_mb(); dynticks = smp_load_acquire(DYNTICKS); x = READ_ONCE(*X); } exists (1:dynticks=2 /\ 1:x=0) Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies this transition, namely, showing that if the RCU grace-period kthread (P1) sees another CPU as newly idle (P0), then any pre-idle memory access (P0's write to X) will be seen by any code following the grace period (P1's read from X). This is a simple release-acquire pair forcing ordering in a message-passing (MP) litmus test. Of course, if the grace-period kthread detects the CPU as non-idle, it will refrain from reporting a quiescent state on behalf of that CPU, so there are no ordering requirements from the grace-period kthread in that case. However, other subsystems call rcu_is_idle_cpu() to check for CPUs being non-idle from an RCU perspective. That case is also verified by the above litmus tests with the proviso that the sense of the low-order bit of the DYNTICKS counter be inverted. Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic increment. This commit therefore weakens only the read from ->dynticks. However, the updates are abstracted into a rcu_dynticks_inc() function to ease any future changes that might be needed. [ paulmck: Apply Linus Torvalds feedback. ] Link: https://lore.kernel.org/lkml/20210721202127.2129660-4-paulmck@kernel.org/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Remove special bit at the bottom of the ->dynticks counterJoel Fernandes (Google)
Commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of ->dynticks counter") reserved a bit at the bottom of the ->dynticks counter to defer flushing of TLBs, but this facility never has been used. This commit therefore removes this capability along with the rcu_eqs_special_set() function used to trigger it. Link: https://lore.kernel.org/linux-doc/CALCETrWNPOOdTrFabTDd=H7+wc6xJ9rJceg6OL1S0rTV5pfSsA@mail.gmail.com/ Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> [ paulmck: Forward-port to v5.13-rc1. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lockYanfei Xu
If rcu_print_task_stall() is invoked on an rcu_node structure that does not contain any tasks blocking the current grace period, it takes an early exit that fails to release that rcu_node structure's lock. This results in a self-deadlock, which is detected by lockdep. To reproduce this bug: tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0" This will also result in other complaints, including RCU's scheduler hook complaining about blocking rather than preemption and an rcutorture writer stall. Only a partial RCU CPU stall warning message will be printed because of the self-deadlock. This commit therefore releases the lock on the rcu_print_task_stall() function's early exit path. Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled") Tested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-08-06rcu: Fix to include first blocked task in stall warningYanfei Xu
The for loop in rcu_print_task_stall() always omits ts[0], which points to the first task blocking the stalled grace period. This in turn fails to count this first task, which means that ndetected will be equal to zero when all CPUs have passed through their quiescent states and only one task is blocking the stalled grace period. This zero value for ndetected will in turn result in an incorrect "All QSes seen" message: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23): (detected by 15, t=6504 jiffies, g=164777, q=9011209) rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2 BUG: sleeping function called from invalid context at include/linux/uaccess.h:156 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04 INFO: lockdep is turned off. Preemption disabled at: [<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0 CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted 5.12.2-yoctodev-standard #1 Hardware name: Marvell OcteonTX CN96XX board (DT) Call trace: dump_backtrace+0x0/0x2cc show_stack+0x24/0x30 dump_stack+0x110/0x188 ___might_sleep+0x214/0x2d0 __might_sleep+0x7c/0xe0 This commit therefore fixes the loop to include ts[0]. Fixes: c583bcb8f5ed ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled") Tested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27rcuscale: Console output claims too few grace periodsJiangong.Han
The rcuscale console output claims N grace periods, numbered from zero to N, which means that there were really N+1 grace periods. The root cause of this bug is that rcu_scale_writer() stores the number of the last grace period (numbered from zero) into writer_n_durations[me] instead of the number of grace periods. This commit therefore assigns the actual number of grace periods to writer_n_durations[me], and also makes the corresponding adjustment to the loop outputting per-grace-period measurements. Sample of old console output: rcu-scale: writer 0 gps: 133 ...... rcu-scale: 0 writer-duration: 0 44003961 rcu-scale: 0 writer-duration: 1 32003582 ...... rcu-scale: 0 writer-duration: 132 28004391 rcu-scale: 0 writer-duration: 133 27996410 Sample of new console output: rcu-scale: writer 0 gps: 134 ...... rcu-scale: 0 writer-duration: 0 44003961 rcu-scale: 0 writer-duration: 1 32003582 ...... rcu-scale: 0 writer-duration: 132 28004391 rcu-scale: 0 writer-duration: 133 27996410 Signed-off-by: Jiangong.Han <jiangong.han@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27rcutorture: Preempt rather than block when testing task stallsPaul E. McKenney
Currently, rcu_torture_stall() does a one-jiffy timed wait when stall_cpu_block is set. This works, but emits a pointless splat in CONFIG_PREEMPT=y kernels. This commit avoids this splat by instead invoking preempt_schedule() in CONFIG_PREEMPT=y kernels. This uses an admittedly ugly #ifdef, but abstracted approaches just looked worse. A prettier approach would provide a preempt_schedule() definition with a WARN_ON() for CONFIG_PREEMPT=n kernels, but this seems quite silly. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-27refscale: Add measurement of clock readoutPaul E. McKenney
This commit adds a "clock" type to refscale, which checks the performance of ktime_get_real_fast_ns(). Use the "clocksource=" kernel boot parameter to select the underlying clock source. [ paulmck: Work around compiler false positive per kernel test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu: Fix macro name CONFIG_TASKS_RCU_TRACEZhouyi Zhou
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from memory-ordering bugs that could result in false-positive quiescent states and too-short grace periods. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu-tasks: Fix synchronize_rcu_rude() typo in commentPaul E. McKenney
This commit replaces the fictitious synchronize_rcu_rude() function with its real-world synchronize_rcu_tasks_rude() counterpart. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu-tasks: Mark ->trc_reader_special.b.need_qs data racesPaul E. McKenney
There are several ->trc_reader_special.b.need_qs data races that are too low-probability for KCSAN to notice, but which will happen sooner or later. This commit therefore marks these accesses. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu-tasks: Mark ->trc_reader_nesting data racesPaul E. McKenney
There are several ->trc_reader_nesting data races that are too low-probability for KCSAN to notice, but which will happen sooner or later. This commit therefore marks these accesses, and comments one that cannot race. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu-tasks: Add comments explaining task_struct strategyPaul E. McKenney
Accesses to task_struct structures must be either protected by RCU or by get_task_struct(). Tasks trace RCU uses these in a non-obvious combination, in conjunction with an IPI handler. This commit therefore adds comments explaining this usage. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu/nocb: Remove NOCB deferred wakeup from rcutree_dead_cpu()Frederic Weisbecker
At CPU offline time, we must handle any pending wakeup for the nocb_gp kthread linked to the outgoing CPU. Now we are making sure of that twice: 1) From rcu_report_dead() when the outgoing CPU makes the very last local cleanups by itself before switching offline. 2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly now offline. Another CPU takes care of post-portem cleaning up and check if the offline CPU had pending wakeup. Both ways are fine but we have to choose one or the other because we don't need to repeat that action. Simply benefit from cache locality and keep only the first solution. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-20rcu/nocb: Start moving nocb code to its own plugin fileFrederic Weisbecker
The kernel/rcu/tree_plugin.h file contains not only the plugins for preemptible RCU, but also many other features including rcu_nocbs callback offloading. This offloading has become large and complex, so it is time to put it in its own file. This commit starts that process. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Rename to tree_nocb.h, add Frederic as author. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06rcu: Fix pr_info() formats and values in show_rcu_gp_kthreads()Paul E. McKenney
This commit changes from "%lx" to "%x" and from "0x1ffffL" to "0x1ffff" to match the change in type between the old field ->state (unsigned long) and the new field ->__state (unsigned int). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()Paul E. McKenney
Invoking trc_del_holdout() from within trc_wait_for_one_reader() is only a performance optimization because the RCU Tasks Trace grace-period kthread will eventually do this within check_all_holdout_tasks_trace(). But it is not a particularly important performance optimization because it only applies to the grace-period kthread, of which there is but one. This commit therefore removes this invocation of trc_del_holdout() in favor of the one in check_all_holdout_tasks_trace() in the grace-period kthread. Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06rcu-tasks: Don't delete holdouts within trc_inspect_reader()Paul E. McKenney
As Yanfei pointed out, although invoking trc_del_holdout() is safe from the viewpoint of the integrity of the holdout list itself, the put_task_struct() invoked by trc_del_holdout() can result in use-after-free errors due to later accesses to this task_struct structure by the RCU Tasks Trace grace-period kthread. This commit therefore removes this call to trc_del_holdout() from trc_inspect_reader() in favor of the grace-period thread's existing call to trc_del_holdout(), thus eliminating that particular class of use-after-free errors. Reported-by: "Xu, Yanfei" <yanfei.xu@windriver.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-06refscale: Avoid false-positive warnings in ref_scale_reader()Paul E. McKenney
If the call to set_cpus_allowed_ptr() in ref_scale_reader() fails, a later WARN_ONCE() complains. But with the advent of 570a752b7a9b ("lib/smp_processor_id: Use is_percpu_thread() instead of nr_cpus_allowed"), this complaint can be drowned out by complaints from smp_processor_id(). The rationale for this change is that refscale's kthreads are not marked with PF_NO_SETAFFINITY, which means that a system administrator could change affinity at any time. However, refscale is a performance/stress test, and the system administrator might well have a valid test-the-test reason for changing affinity. This commit therefore changes to raw_smp_processor_id() in order to avoid the noise, and also adds a WARN_ON_ONCE() to the call to set_cpus_allowed_ptr() in order to directly detect immediate failure. There is no WARN_ON_ONCE() within the test loop, allowing human-reflex-based affinity resetting, if desired. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-07-04Merge branch 'core-rcu-2021.07.04' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Bitmap parsing support for "all" as an alias for all bits - Documentation updates - Miscellaneous fixes, including some that overlap into mm and lockdep - kvfree_rcu() updates - mem_dump_obj() updates, with acks from one of the slab-allocator maintainers - RCU NOCB CPU updates, including limited deoffloading - SRCU updates - Tasks-RCU updates - Torture-test updates * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits) tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states rcu: Add missing __releases() annotation rcu: Remove obsolete rcu_read_unlock() deadlock commentary rcu: Improve comments describing RCU read-side critical sections rcu: Create an unrcu_pointer() to remove __rcu from a pointer srcu: Early test SRCU polling start rcu: Fix various typos in comments rcu/nocb: Unify timers rcu/nocb: Prepare for fine-grained deferred wakeup rcu/nocb: Only cancel nocb timer if not polling rcu/nocb: Delete bypass_timer upon nocb_gp wakeup rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup rcu/nocb: Allow de-offloading rdp leader rcu/nocb: Directly call __wake_nocb_gp() from bypass timer rcu: Don't penalize priority boosting when there is nothing to boost rcu: Point to documentation of ordering guarantees rcu: Make rcu_gp_cleanup() be noinline for tracing rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP ...
2021-07-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...
2021-07-01kernel.h: split out panic and oops helpersAndy Shevchenko
kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out panic and oops helpers. There are several purposes of doing this: - dropping dependency in bug.h - dropping a loop by moving out panic_notifier.h - unload kernel.h from something which has its own domain At the same time convert users tree-wide to use new headers, although for the time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [akpm@linux-foundation.org: thread_info.h needs limits.h] [andriy.shevchenko@linux.intel.com: ia64 fix] Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Co-developed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-18sched: Change task_struct::statePeter Zijlstra
Change the type and name of task_struct::state. Drop the volatile and shrink it to an 'unsigned int'. Rename it in order to find all uses such that we can use READ_ONCE/WRITE_ONCE as appropriate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Thompson <daniel.thompson@linaro.org> Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18sched: Introduce task_is_running()Peter Zijlstra
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
2021-05-18Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', ↵Paul E. McKenney
'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD bitmaprange.2021.05.10c: Allow "all" for bitmap ranges. doc.2021.05.10c: Documentation updates. fixes.2021.05.13a: Miscellaneous fixes. kvfree_rcu.2021.05.10c: kvfree_rcu() updates. mmdumpobj.2021.05.10c: mem_dump_obj() updates. nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading. srcu.2021.05.12a: SRCU updates. tasks.2021.05.18a: Tasks-RCU updates. torture.2021.05.10c: Torture-test updates.
2021-05-18tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inlinePaul E. McKenney
In some architectures, the no-op variant of show_rcu_tasks_gp_kthreads() get "no previous prototype" compiler warnings. These are false positives given that kernel/rcu/tasks.h is included only once. But why put up with the compiler noise? This commit therefore adds "static inline" to this definition to force the compiler to accept this situation, while also moving it to its proper place in kernel/rcu/rcu.h. Reported-by: kernel test robot <lkp@intel.com> [ paulmck: Update per Stephen Rothwell feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-18rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent statesPaul E. McKenney
Heavy networking load can cause a CPU to execute continuously and indefinitely within ksoftirqd, in which case there will be no voluntary task switches and thus no RCU-tasks quiescent states. This commit therefore causes the exiting rcu_softirq_qs() to provide an RCU-tasks quiescent state. This of course means that __do_softirq() and its callers cannot be invoked from within a tracing trampoline. Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org>
2021-05-13rcu: Add missing __releases() annotationJules Irenge
Sparse reports a warning at rcu_print_task_stall(): "warning: context imbalance in rcu_print_task_stall - unexpected unlock" The root cause is a missing annotation on rcu_print_task_stall(). This commit therefore adds the missing __releases(rnp->lock) annotation. Signed-off-by: Jules Irenge <jbi.octave@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-13rcu: Improve comments describing RCU read-side critical sectionsPaul E. McKenney
There are a number of places that call out the fact that preempt-disable regions of code now act as RCU read-side critical sections, where preempt-disable regions of code include irq-disable regions of code, bh-disable regions of code, hardirq handlers, and NMI handlers. However, someone relying solely on (for example) the call_rcu() header comment might well have no idea that preempt-disable regions of code have RCU semantics. This commit therefore updates the header comments for call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and rcu_dereference_sched_check() to call out these new(ish) forms of RCU readers. Reported-by: Michel Lespinasse <michel@lespinasse.org> [ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12srcu: Early test SRCU polling startFrederic Weisbecker
Place an early call to start_poll_synchronize_srcu() before the invocation of call_srcu() on the same srcu_struct structure. After the later call to srcu_barrier(), the completion of the first grace period should be visible to a subsequent invocation of poll_state_synchronize_srcu(), and if not, warn. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu: Fix various typos in commentsIngo Molnar
Fix ~12 single-word typos in RCU code comments. [ paulmck: Apply feedback from Randy Dunlap. ] Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Unify timersFrederic Weisbecker
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar, this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wake ups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early. The timer also unconditionally executes a full barrier so as to order timer_pending() and callback enqueue although the path performing RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also test against the rdp leader instead of the current rdp. This unconditional full barrier shouldn't bring visible overhead since these timers almost never fire. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Prepare for fine-grained deferred wakeupFrederic Weisbecker
Tuning the deferred wakeup level must be done from a safe wakeup point. Currently those sites are: * ->nocb_timer * user/idle/guest entry * CPU down * softirq/rcuc All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE. In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early. To prepare for that, this commit specifies the per-callsite wakeup level/limit. Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Only cancel nocb timer if not pollingFrederic Weisbecker
This commit refrains deleting the ->nocb_timer if rcu_nocb is polling because it should not ever have been queued in the polling case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Delete bypass_timer upon nocb_gp wakeupFrederic Weisbecker
A NOCB-gp wake p can safely delete the ->nocb_bypass_timer because nocb_gp_wait() will recheck again the bypass state and rearm the bypass timer if necessary. This commit therefore deletes this timer. Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-05-12rcu/nocb: Cancel nocb_timer upon nocb_gp wakeupFrederic Weisbecker
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer around because this function will traverse the whole rdp list. Any update performed before the timer was armed will now be visible after the ->nocb_gp_lock acquire. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>