From d2af1ad73e7a22ed3e04374896fee0eb300c05c3 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Tue, 20 Jan 2015 23:54:59 -0800
Subject: documentation: Update rcutree.kthread_prio for grace-period kthread use

Now that the rcutree.kthread_prio kernel boot parameter also controls
the priority of the grace-period kthreads, update the documentation
to reflect this change.

Signed-off-by: Paul E. McKenney
---
 Documentation/kernel-parameters.txt | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..d913e3b4bf0d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2991,11 +2991,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			value is one, and maximum value is HZ.
 
 	rcutree.kthread_prio=    [KNL,BOOT]
-			Set the SCHED_FIFO priority of the RCU
-			per-CPU kthreads (rcuc/N). This value is also
-			used for the priority of the RCU boost threads
-			(rcub/N). Valid values are 1-99 and the default
-			is 1 (the least-favored priority).
+			Set the SCHED_FIFO priority of the RCU per-CPU
+			kthreads (rcuc/N). This value is also used for
+			the priority of the RCU boost threads (rcub/N)
+			and for the RCU grace-period kthreads (rcu_bh,
+			rcu_preempt, and rcu_sched). If RCU_BOOST is
+			set, valid values are 1-99 and the default is 1
+			(the least-favored priority). Otherwise, when
+			RCU_BOOST is not set, valid values are 0-99 and
+			the default is zero (non-realtime operation).
 
 	rcutree.rcu_nocb_leader_stride= [KNL]
 			Set the number of NOCB kthread groups, which
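To see which priority is actually in effect on a running system, the
parameter can be read back through sysfs.  The sketch below is an
illustration, not part of the patch above; it assumes a kernel with the
tree-RCU implementation built in, so that the
/sys/module/rcutree/parameters/ directory exists:

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/module/rcutree/parameters/kthread_prio", "r");
		int prio;

		if (f == NULL) {
			perror("kthread_prio");
			return 1;
		}
		if (fscanf(f, "%d", &prio) == 1)
			printf("rcutree.kthread_prio = %d (%s)\n", prio,
			       prio ? "SCHED_FIFO" : "non-realtime");
		fclose(f);
		return 0;
	}

Booting with "rcutree.kthread_prio=50" should then report 50, and
"chrt -p" on one of the rcuc/N kthreads should agree.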
From 89bf5d82ed451f02329bbbb06ac365e96b18804d Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Sat, 24 Jan 2015 22:24:14 -0800
Subject: documentation: Update based on on-demand vmstat workers

Now that the on-demand vmstat workers commit is in mainline, it is
possible to eliminate vmstat_update()-induced OS jitter.  This commit
updates the documentation accordingly.

Signed-off-by: Paul E. McKenney
---
 Documentation/kernel-per-CPU-kthreads.txt | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index f3cd299fcc41..81fe051c4447 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -190,14 +190,16 @@ To reduce its OS jitter, do any of the following:
 		on each CPU, including cs_dbs_timer() and od_dbs_timer().
 		WARNING: Please check your CPU specifications to make
 		sure that this is safe on your particular system.
-	d.	It is not possible to entirely get rid of OS jitter
-		from vmstat_update() on CONFIG_SMP=y systems, but you
-		can decrease its frequency by writing a large value
-		to /proc/sys/vm/stat_interval.  The default value is
-		HZ, for an interval of one second.  Of course, larger
-		values will make your virtual-memory statistics update
-		more slowly.  Of course, you can also run your workload
-		at a real-time priority, thus preempting vmstat_update(),
+	d.	As of v3.18, Christoph Lameter's on-demand vmstat workers
+		commit prevents OS jitter due to vmstat_update() on
+		CONFIG_SMP=y systems.  Before v3.18, it is not possible
+		to entirely get rid of the OS jitter, but you can
+		decrease its frequency by writing a large value to
+		/proc/sys/vm/stat_interval.  The default value is HZ,
+		for an interval of one second.  Of course, larger values
+		will make your virtual-memory statistics update more
+		slowly.  Of course, you can also run your workload at
+		a real-time priority, thus preempting vmstat_update(),
 		but if your workload is CPU-bound, this is a bad idea.
 		However, there is an RFC patch from Christoph Lameter
 		(based on an earlier one from Gilad Ben-Yossef) that
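On kernels predating v3.18, the stat_interval tuning described above
can be applied from a small privileged helper; a minimal sketch, not
part of the patch (must run as root, and 120 seconds is just an
example value):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/stat_interval", "w");

		if (f == NULL) {	/* Requires root. */
			perror("stat_interval");
			return 1;
		}
		fprintf(f, "120\n");	/* Update VM statistics every 120 seconds. */
		fclose(f);
		return 0;
	}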
From c25197841efe53258abb22cfd894a729a272edf9 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Sun, 25 Jan 2015 11:28:28 -0800
Subject: documentation: Update NO_HZ_FULL interaction with POSIX timers

POSIX timers are no longer starved on adaptive-ticks CPUs.  Instead,
they prevent affected CPUs from entering adaptive-ticks mode.  This
commit therefore updates the NO_HZ.txt documentation.

Signed-off-by: Paul E. McKenney
---
 Documentation/timers/NO_HZ.txt | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt
index cca122f25120..6eaf576294f3 100644
--- a/Documentation/timers/NO_HZ.txt
+++ b/Documentation/timers/NO_HZ.txt
@@ -158,13 +158,9 @@ not come for free:
 	to the need to inform kernel subsystems (such as RCU) about
 	the change in mode.
 
-3.	POSIX CPU timers on adaptive-tick CPUs may miss their deadlines
-	(perhaps indefinitely) because they currently rely on
-	scheduling-tick interrupts.  This will likely be fixed in
-	one of two ways: (1) Prevent CPUs with POSIX CPU timers from
-	entering adaptive-tick mode, or (2) Use hrtimers or other
-	adaptive-ticks-immune mechanism to cause the POSIX CPU timer to
-	fire properly.
+3.	POSIX CPU timers prevent CPUs from entering adaptive-tick mode.
+	Real-time applications needing to take actions based on CPU time
+	consumption need to use other means of doing so.
 
 4.	If there are more perf events pending than the hardware can
 	accommodate, they are normally round-robined so as to collect
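One alternative means of acting on CPU-time consumption is to poll the
consumed CPU time directly instead of arming a POSIX CPU timer, which
leaves the CPU eligible for adaptive-tick mode.  A sketch, not part of
the patch, using the standard clock_gettime() interface (the two-second
budget is an arbitrary example):

	#include <stdio.h>
	#include <time.h>

	/*
	 * Return nonzero once this thread has consumed more than
	 * budget_secs of CPU time.  Polling this from the application's
	 * main loop avoids POSIX CPU timers entirely.
	 */
	static int cpu_budget_exceeded(double budget_secs)
	{
		struct timespec ts;

		if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
			return 0;
		return ts.tv_sec + ts.tv_nsec / 1e9 > budget_secs;
	}

	int main(void)
	{
		while (!cpu_budget_exceeded(2.0))
			;	/* CPU-bound work would go here. */
		printf("CPU-time budget consumed.\n");
		return 0;
	}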
From f1360570f420b8b122e7f1cccf456ff7133a3007 Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Sun, 25 Jan 2015 11:48:18 -0800
Subject: documentation: Update per-CPU kthreads documentation

Signed-off-by: Paul E. McKenney
---
 Documentation/kernel-per-CPU-kthreads.txt | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 81fe051c4447..f4cbfe0ba108 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -205,7 +205,9 @@ To reduce its OS jitter, do any of the following:
 		(based on an earlier one from Gilad Ben-Yossef) that
 		reduces or even eliminates vmstat overhead for some
 		workloads at https://lkml.org/lkml/2013/9/4/379.
-	e.	If running on high-end powerpc servers, build with
+	e.	Boot with "elevator=noop" to avoid workqueue use by
+		the block layer.
+	f.	If running on high-end powerpc servers, build with
 		CONFIG_PPC_RTAS_DAEMON=n.  This prevents the RTAS
 		daemon from running on each CPU every second or so.
 		(This will require editing Kconfig files and will defeat
@@ -213,12 +215,12 @@ To reduce its OS jitter, do any of the following:
 		due to the rtas_event_scan() function.
 		WARNING: Please check your CPU specifications to make
 		sure that this is safe on your particular system.
-	f.	If running on Cell Processor, build your kernel with
+	g.	If running on Cell Processor, build your kernel with
 		CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
 		spu_gov_work().
 		WARNING: Please check your CPU specifications to make
 		sure that this is safe on your particular system.
-	g.	If running on PowerMAC, build your kernel with
+	h.	If running on PowerMAC, build your kernel with
 		CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
 		avoiding OS jitter from rackmeter_do_timer().
 
@@ -260,8 +262,12 @@ Purpose: Detect software lockups on each CPU.
 To reduce its OS jitter, do at least one of the following:
 1.	Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
 	kthreads from being created in the first place.
-2.	Echo a zero to /proc/sys/kernel/watchdog to disable the
+2.	Boot with "nosoftlockup=0", which will also prevent these kthreads
+	from being created.  Other related watchdog and softlockup boot
+	parameters may be found in Documentation/kernel-parameters.txt
+	and Documentation/watchdog/watchdog-parameters.txt.
+3.	Echo a zero to /proc/sys/kernel/watchdog to disable the
 	watchdog timer.
-3.	Echo a large number of /proc/sys/kernel/watchdog_thresh in
+4.	Echo a large number to /proc/sys/kernel/watchdog_thresh in
 	order to reduce the frequency of OS jitter due to the watchdog
 	timer down to a level that is acceptable for your workload.
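Items 3 and 4 of the list above can also be applied at run time rather
than at boot.  A sketch, not part of the patch (run as root; the value
60 is an example, and is the largest value the watchdog_thresh sysctl
accepts):

	#include <stdio.h>

	static int write_int(const char *path, int val)
	{
		FILE *f = fopen(path, "w");

		if (f == NULL)
			return -1;
		fprintf(f, "%d\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* Item 3: disable the watchdog timer outright... */
		if (write_int("/proc/sys/kernel/watchdog", 0) != 0)
			perror("watchdog");
		/* ...or item 4: stretch the soft-lockup threshold instead. */
		if (write_int("/proc/sys/kernel/watchdog_thresh", 60) != 0)
			perror("watchdog_thresh");
		return 0;
	}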
McKenney" Date: Tue, 17 Feb 2015 10:00:06 -0800 Subject: documentation: Clarify control-dependency pairing This commit explicitly states that control dependencies pair normally with other barriers, and gives an example of such pairing. Reported-by: Peter Zijlstra Signed-off-by: Paul E. McKenney Acked-by: Peter Zijlstra (Intel) --- Documentation/memory-barriers.txt | 42 +++++++++++++++++++++++++++------------ 1 file changed, 29 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index ca2387ef27ab..6974f1c2b4e1 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -592,9 +592,9 @@ See also the subsection on "Cache Coherency" for a more thorough example. CONTROL DEPENDENCIES -------------------- -A control dependency requires a full read memory barrier, not simply a data -dependency barrier to make it work correctly. Consider the following bit of -code: +A load-load control dependency requires a full read memory barrier, not +simply a data dependency barrier to make it work correctly. Consider the +following bit of code: q = ACCESS_ONCE(a); if (q) { @@ -615,14 +615,15 @@ case what's actually required is: } However, stores are not speculated. This means that ordering -is- provided -in the following example: +for load-store control dependencies, as in the following example: q = ACCESS_ONCE(a); if (q) { ACCESS_ONCE(b) = p; } -Please note that ACCESS_ONCE() is not optional! Without the +Control dependencies pair normally with other types of barriers. +That said, please note that ACCESS_ONCE() is not optional! Without the ACCESS_ONCE(), might combine the load from 'a' with other loads from 'a', and the store to 'b' with other stores to 'b', with possible highly counterintuitive effects on ordering. @@ -813,6 +814,8 @@ In summary: barrier() can help to preserve your control dependency. Please see the Compiler Barrier section for more information. + (*) Control dependencies pair normally with other types of barriers. + (*) Control dependencies do -not- provide transitivity. If you need transitivity, use smp_mb(). @@ -823,14 +826,14 @@ SMP BARRIER PAIRING When dealing with CPU-CPU interactions, certain types of memory barrier should always be paired. A lack of appropriate pairing is almost certainly an error. -General barriers pair with each other, though they also pair with -most other types of barriers, albeit without transitivity. An acquire -barrier pairs with a release barrier, but both may also pair with other -barriers, including of course general barriers. A write barrier pairs -with a data dependency barrier, an acquire barrier, a release barrier, -a read barrier, or a general barrier. Similarly a read barrier or a -data dependency barrier pairs with a write barrier, an acquire barrier, -a release barrier, or a general barrier: +General barriers pair with each other, though they also pair with most +other types of barriers, albeit without transitivity. An acquire barrier +pairs with a release barrier, but both may also pair with other barriers, +including of course general barriers. A write barrier pairs with a data +dependency barrier, a control dependency, an acquire barrier, a release +barrier, a read barrier, or a general barrier. 
From ff382810590e7182a1482a225965d6943e61699c Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Tue, 17 Feb 2015 10:00:06 -0800
Subject: documentation: Clarify control-dependency pairing

This commit explicitly states that control dependencies pair normally
with other barriers, and gives an example of such pairing.

Reported-by: Peter Zijlstra
Signed-off-by: Paul E. McKenney
Acked-by: Peter Zijlstra (Intel)
---
 Documentation/memory-barriers.txt | 42 +++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index ca2387ef27ab..6974f1c2b4e1 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -592,9 +592,9 @@ See also the subsection on "Cache Coherency" for a more thorough example.
 CONTROL DEPENDENCIES
 --------------------
 
-A control dependency requires a full read memory barrier, not simply a data
-dependency barrier to make it work correctly.  Consider the following bit of
-code:
+A load-load control dependency requires a full read memory barrier, not
+simply a data dependency barrier to make it work correctly.  Consider the
+following bit of code:
 
 	q = ACCESS_ONCE(a);
 	if (q) {
@@ -615,14 +615,15 @@ case what's actually required is:
 	}
 
 However, stores are not speculated.  This means that ordering -is- provided
-in the following example:
+for load-store control dependencies, as in the following example:
 
 	q = ACCESS_ONCE(a);
 	if (q) {
 		ACCESS_ONCE(b) = p;
 	}
 
-Please note that ACCESS_ONCE() is not optional!  Without the
+Control dependencies pair normally with other types of barriers.
+That said, please note that ACCESS_ONCE() is not optional!  Without the
 ACCESS_ONCE(), the compiler might combine the load from 'a' with other
 loads from 'a', and the store to 'b' with other stores to 'b', with
 possible highly counterintuitive effects on ordering.
@@ -813,6 +814,8 @@ In summary:
       barrier() can help to preserve your control dependency.  Please
       see the Compiler Barrier section for more information.
 
+  (*) Control dependencies pair normally with other types of barriers.
+
   (*) Control dependencies do -not- provide transitivity.  If you
       need transitivity, use smp_mb().
 
@@ -823,14 +826,14 @@ SMP BARRIER PAIRING
 When dealing with CPU-CPU interactions, certain types of memory barrier
 should always be paired.  A lack of appropriate pairing is almost certainly
 an error.
 
-General barriers pair with each other, though they also pair with
-most other types of barriers, albeit without transitivity.  An acquire
-barrier pairs with a release barrier, but both may also pair with other
-barriers, including of course general barriers.  A write barrier pairs
-with a data dependency barrier, an acquire barrier, a release barrier,
-a read barrier, or a general barrier.  Similarly a read barrier or a
-data dependency barrier pairs with a write barrier, an acquire barrier,
-a release barrier, or a general barrier:
+General barriers pair with each other, though they also pair with most
+other types of barriers, albeit without transitivity.  An acquire barrier
+pairs with a release barrier, but both may also pair with other barriers,
+including of course general barriers.  A write barrier pairs with a data
+dependency barrier, a control dependency, an acquire barrier, a release
+barrier, a read barrier, or a general barrier.  Similarly a read barrier,
+control dependency, or a data dependency barrier pairs with a write
+barrier, an acquire barrier, a release barrier, or a general barrier:
 
 	CPU 1		      CPU 2
 	===============	      ===============
@@ -850,6 +853,19 @@ Or:
 			      <data dependency barrier>
 			      y = *x;
 
+Or even:
+
+	CPU 1		      CPU 2
+	===============	      ===============================
+	r1 = ACCESS_ONCE(y);
+	<general barrier>
+	ACCESS_ONCE(x) = 1;   if (r2 = ACCESS_ONCE(x)) {
+			         <implicit control dependency>
+			         ACCESS_ONCE(y) = 1;
+			      }
+
+	assert(r1 == 0 || r2 == 0);
+
 Basically, the read barrier always has to be there, even though it can be
 of the "weaker" type.
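Written out as code, the pairing in the new example might look as
follows.  This is a sketch, not part of the patch: ACCESS_ONCE() is
spelled out for self-containedness, and smp_mb() stands in for the
kernel's general barrier:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	int x, y;	/* Both initially zero. */

	void cpu1(int *r1)
	{
		*r1 = ACCESS_ONCE(y);
		smp_mb();	/* The <general barrier> above. */
		ACCESS_ONCE(x) = 1;
	}

	void cpu2(int *r2)
	{
		if ((*r2 = ACCESS_ONCE(x))) {
			/*
			 * The implicit control dependency orders the
			 * load from x before the dependent store to y.
			 */
			ACCESS_ONCE(y) = 1;
		}
	}

	/* After both have run: assert(*r1 == 0 || *r2 == 0) must hold. */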
From 37745d281069682d901f00c0121949a7d224195f Mon Sep 17 00:00:00 2001
From: "Paul E. McKenney"
Date: Thu, 22 Jan 2015 18:24:08 -0800
Subject: rcu: Provide diagnostic option to slow down grace-period initialization

Grace-period initialization normally proceeds quite quickly, so that
it is very difficult to reproduce races against grace-period
initialization.  This commit therefore allows grace-period
initialization to be artificially slowed down, increasing
race-reproduction probability.  A pair of new Kconfig parameters are
provided, CONFIG_RCU_TORTURE_TEST_SLOW_INIT to enable the slowdowns,
and CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY to specify the number of
jiffies of slowdown to apply.  A boot-time parameter named
rcutree.gp_init_delay allows boot-time delay to be specified.  By
default, no delay will be applied even if
CONFIG_RCU_TORTURE_TEST_SLOW_INIT is set.

Signed-off-by: Paul E. McKenney
---
 Documentation/kernel-parameters.txt |  6 ++++++
 kernel/rcu/tree.c                   | 10 ++++++++++
 lib/Kconfig.debug                   | 24 ++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..94de410ec341 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2968,6 +2968,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			Set maximum number of finished RCU callbacks to
 			process in one batch.
 
+	rcutree.gp_init_delay=   [KNL]
+			Set the number of jiffies to delay each step of
+			RCU grace-period initialization.  This only has
+			effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT is
+			set.
+
 	rcutree.rcu_fanout_leaf= [KNL]
 			Increase the number of CPUs assigned to each
 			leaf rcu_node structure.  Useful for very large
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3b7e4133ca99..b42001fd55fb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -160,6 +160,12 @@ static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp);
 static int kthread_prio = CONFIG_RCU_KTHREAD_PRIO;
 module_param(kthread_prio, int, 0644);
 
+/* Delay in jiffies for grace-period initialization delays. */
+static int gp_init_delay = IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT)
+			   ? CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY
+			   : 0;
+module_param(gp_init_delay, int, 0644);
+
 /*
  * Track the rcutorture test sequence number and the update version
  * number within a given test.  The rcutorture_testseq is incremented
@@ -1769,6 +1775,10 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		raw_spin_unlock_irq(&rnp->lock);
 		cond_resched_rcu_qs();
 		ACCESS_ONCE(rsp->gp_activity) = jiffies;
+		if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT) &&
+		    gp_init_delay > 0 &&
+		    !(rsp->gpnum % (rcu_num_nodes * 10)))
+			schedule_timeout_uninterruptible(gp_init_delay);
 	}
 
 	mutex_unlock(&rsp->onoff_mutex);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c5cefb3c009c..feee8dab441e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1257,6 +1257,30 @@ config RCU_TORTURE_TEST_RUNNABLE
 	  Say N here if you want the RCU torture tests to start only
 	  after being manually enabled via /proc.
 
+config RCU_TORTURE_TEST_SLOW_INIT
+	bool "Slow down RCU grace-period initialization to expose races"
+	depends on RCU_TORTURE_TEST
+	help
+	  This option makes grace-period initialization block for a
+	  few jiffies between initializing each pair of consecutive
+	  rcu_node structures.  This helps to expose races involving
+	  grace-period initialization, in other words, it makes your
+	  kernel less stable.  It can also greatly increase grace-period
+	  latency, especially on systems with large numbers of CPUs.
+	  This is useful when torture-testing RCU, but in almost no
+	  other circumstance.
+
+	  Say Y here if you want your system to crash and hang more often.
+	  Say N if you want a sane system.
+
+config RCU_TORTURE_TEST_SLOW_INIT_DELAY
+	int "How much to slow down RCU grace-period initialization"
+	range 0 5
+	default 0
+	help
+	  This option specifies the number of jiffies to wait between
+	  each rcu_node structure initialization.
+
 config RCU_CPU_STALL_TIMEOUT
 	int "RCU CPU stall timeout in seconds"
 	depends on RCU_STALL_COMMON
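Because gp_init_delay is declared via module_param() with mode 0644,
the delay can also be adjusted on a running kernel through sysfs.  Note
that, per the rcu_gp_init() test above, a nonzero value takes effect
only on kernels built with CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y.  A
sketch, not part of the patch (run as root):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/module/rcutree/parameters/gp_init_delay", "w");

		if (f == NULL) {
			perror("gp_init_delay");
			return 1;
		}
		fprintf(f, "3\n");	/* Delay each init step by three jiffies. */
		fclose(f);
		return 0;
	}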