Age | Commit message (Collapse) | Author |
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-01-27
The following pull-request contains BPF updates for your *net-next* tree.
We've added 20 non-merge commits during the last 5 day(s) which contain
a total of 24 files changed, 433 insertions(+), 104 deletions(-).
The main changes are:
1) Make BPF trampolines and dispatcher aware for the stack unwinder, from Jiri Olsa.
2) Improve handling of failed CO-RE relocations in libbpf, from Andrii Nakryiko.
3) Several fixes to BPF sockmap and reuseport selftests, from Lorenz Bauer.
4) Various cleanups in BPF devmap's XDP flush code, from John Fastabend.
5) Fix BPF flow dissector when used with port ranges, from Yoshiki Komachi.
6) Fix bpffs' map_seq_next callback to always inc position index, from Vasily Averin.
7) Allow overriding LLVM tooling for runqslower utility, from Andrey Ignatov.
8) Silence false-positive lockdep splats in devmap hash lookup, from Amol Grover.
9) Fix fentry/fexit selftests to initialize a variable before use, from John Sperbeck.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
* pm-cpufreq:
cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
cpufreq: s3c: fix unbalances of cpufreq policy refcount
cpufreq: imx-cpufreq-dt: Add i.MX8MP support
cpufreq: Use imx-cpufreq-dt for i.MX8MP's speed grading
cpufreq: tegra186: convert to devm_platform_ioremap_resource
cpufreq: kirkwood: convert to devm_platform_ioremap_resource
cpufreq: CPPC: put ACPI table after using it
cpufreq : CPPC: Break out if HiSilicon CPPC workaround is matched
* pm-sleep:
PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
PM: hibernate: Add more logging on hibernation failure
PM: hibernate: improve arithmetic division in preallocate_highmem_fraction()
PM: wakeup: Show statistics for deleted wakeup sources again
PM: sleep: Switch to rtc_time64_to_tm()/rtc_tm_to_time64()
|
|
Now that we depend on rcu_call() and synchronize_rcu() to also wait
for preempt_disabled region to complete the rcu read critical section
in __dev_map_flush() is no longer required. Except in a few special
cases in drivers that need it for other reasons.
These originally ensured the map reference was safe while a map was
also being free'd. And additionally that bpf program updates via
ndo_bpf did not happen while flush updates were in flight. But flush
by new rules can only be called from preempt-disabled NAPI context.
The synchronize_rcu from the map free path and the rcu_call from the
delete path will ensure the reference there is safe. So lets remove
the rcu_read_lock and rcu_read_unlock pair to avoid any confusion
around how this is being protected.
If the rcu_read_lock was required it would mean errors in the above
logic and the original patch would also be wrong.
Now that we have done above we put the rcu_read_lock in the driver
code where it is needed in a driver dependent way. I think this
helps readability of the code so we know where and why we are
taking read locks. Most drivers will not need rcu_read_locks here
and further XDP drivers already have rcu_read_locks in their code
paths for reading xdp programs on RX side so this makes it symmetric
where we don't have half of rcu critical sections define in driver
and the other half in devmap.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
|
|
Now that we rely on synchronize_rcu and call_rcu waiting to
exit perempt-disable regions (NAPI) lets update the comments
to reflect this.
Fixes: 0536b85239b84 ("xdp: Simplify devmap cleanup")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-2-git-send-email-john.fastabend@gmail.com
|
|
If seq_file .next fuction does not change position index,
read after some lseek can generate an unexpected output.
See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283
v1 -> v2: removed missed increment in end of function
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
|
|
This patch fixes the following sparse errors by annotating the
sighand_struct with __rcu
kernel/fork.c:1511:9: error: incompatible types in comparison expression
kernel/exit.c:100:19: error: incompatible types in comparison expression
kernel/signal.c:1370:27: error: incompatible types in comparison expression
This fix introduces the following sparse error in signal.c due to
checking the sighand pointer without rcu primitives:
kernel/signal.c:1386:21: error: incompatible types in comparison expression
This new sparse error is also fixed in this patch.
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200124045908.26389-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
Minor conflict in mlx5 because changes happened to code that has
moved meanwhile.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Boot-time processing often loops in the kernel longer than one might
prefer, which can prevent expedited grace periods from completing in
a timely manner. This in turn triggers a splat In nohz_full CPUs One
could argue that long-looping code should be fixed, but on the other hand,
boot time is a bit special.
This commit therefore removes the splat. Later commits will add the
splat back in, but in a way that removes false positives.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
As warnings can trigger panics, especially when "panic_on_warn" is set,
memory failure warnings can cause panics and fail fuzz testers that are
stressing memory.
Create a MEM_FAIL() macro to use instead of WARN() in the tracing code
(perhaps this should be a kernel wide macro?), and use that for memory
failure issues. This should stop failing fuzz tests due to warnings.
Link: https://lore.kernel.org/r/CACT4Y+ZP-7np20GVRu3p+eZys9GPtbu+JpfV+HtsufAzvTgJrg@mail.gmail.com
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
When unwinding the stack we need to identify each address
to successfully continue. Adding latch tree to keep trampolines
for quick lookup during the unwind.
The patch uses first 48 bytes for latch tree node, leaving 4048
bytes from the rest of the page for trampoline or dispatcher
generated code.
It's still enough not to affect trampoline and dispatcher progs
maximum counts.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200123161508.915203-3-jolsa@kernel.org
|
|
When accessing the context we allow access to arguments with
scalar type and pointer to struct. But we deny access for
pointer to scalar type, which is the case for many functions.
Alexei suggested to take conservative approach and allow
currently only string pointer access, which is the case
for most functions now:
Adding check if the pointer is to string type and allow access to it.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200123161508.915203-2-jolsa@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:
- Expedited grace-period updates
- kfree_rcu() updates
- RCU list updates
- Preemptible RCU updates
- Torture-test updates
- Miscellaneous fixes
- Documentation updates
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
The trace_array_get_by_name() creates a ftrace instance and
trace_array_put() is used to remove the reference. Even though the
trace_array_get_by_name() creates the instance, it also adds a reference
count to it, that prevents user space from removing it.
As the bootconfig just creates the instance on boot up, it should still be
used where it can be deleted by user space after boot. A trace_array_put()
is required to let that happen.
Also, change the documentation on trace_array_get_by_name() to make this not
be so confusing.
Link: https://lore.kernel.org/r/20200124205927.76128804@rorschach.local.home
Fixes: 4f712a4d04a4e ("tracing/boot: Add instance node support")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
We checked "iter->trace" earlier so there is no need to check here.
Link: http://lkml.kernel.org/r/20141122183012.GB6994@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
I noticed when trying to use the trace-cmd python interface that reading the raw
buffer wasn't working for kernel_stack events. This is because it uses a
stubbed version of __dynamic_array that doesn't do the __data_loc trick and
encode the length of the array into the field. Instead it just shows up as a
size of 0. So change this to __array and set the len to FTRACE_STACK_ENTRIES
since this is what we actually do in practice and matches how user_stack_trace
works.
Link: http://lkml.kernel.org/r/1411589652-1318-1-git-send-email-jbacik@fb.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The stubbed version of alarmtimer_get_rtcdev() is not exported.
so this won't work if this function is used in a module when
CONFIG_RTC_CLASS=n.
Move the stub function to the header file and make it inline so that
callers don't have to worry about linking against this symbol.
rtcdev isn't used outside of this ifdef so it's not required to be
redefined to NULL. Drop that while touching this area.
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200124055849.154411-4-swboyd@chromium.org
|
|
Use the wakeup source that can be associated with the 'alarmtimer'
platform device instead of registering another one by hand.
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200124055849.154411-3-swboyd@chromium.org
|
|
The alarmtimer_suspend() function will fail if an RTC device is on a bus
such as SPI or i2c and that RTC device registers and probes after
alarmtimer_init() registers and probes the 'alarmtimer' platform device.
This is because system wide suspend suspends devices in the reverse order
of their probe. When alarmtimer_suspend() attempts to program the RTC for a
wakeup it will try to program an RTC device on a bus that has already been
suspended.
Move the alarmtimer device registration to happen when the RTC which is
used for wakeup is registered. Register the 'alarmtimer' platform device as
a child of the RTC device too, so that it can be guaranteed that the RTC
device won't be suspended when alarmtimer_suspend() is called.
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200124055849.154411-2-swboyd@chromium.org
|
|
This function doesn't do anything like this comment says when an RTC device
hasn't been chosen. It looks like we used to do something like that before
commit 8bc0dafb5cf3 ("alarmtimers: Rework RTC device selection using class
interface") but that's long gone now. Remove this sentence to avoid
confusing the reader.
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200124055849.154411-5-swboyd@chromium.org
|
|
The allocation mask is no longer used by on_each_cpu_cond() and
on_each_cpu_cond_mask() and can be removed.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de
|
|
on_each_cpu_cond_mask() allocates a new CPU mask. The newly allocated
mask is a subset of the provided mask based on the conditional function.
This memory allocation can be avoided by extending smp_call_function_many()
with the conditional function and performing the remote function call based
on the mask and the conditional function.
Rename smp_call_function_many() to smp_call_function_many_cond() and add
the smp_cond_func_t argument. If smp_cond_func_t is provided then it is
used before invoking the function. Provide smp_call_function_many() with
cond_func set to NULL. Let on_each_cpu_cond_mask() use
smp_call_function_many_cond().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-3-bigeasy@linutronix.de
|
|
Use a typdef for the conditional function instead defining it each time in
the function prototype.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-2-bigeasy@linutronix.de
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- Conversion of the SiFive PLIC to hierarchical domains
- New SiFive GPIO irqchip driver
- New Aspeed SCI irqchip driver
- New NXP INTMUX irqchip driver
- Additional support for the Meson A1 GPIO irqchip
- First part of the GICv4.1 support
- Assorted fixes
|
|
'kfree_rcu.2020.01.24a', 'list.2020.01.10a', 'preempt.2020.01.24a' and 'torture.2019.12.09a' into HEAD
doc.2019.12.10a: Documentations updates
exp.2019.12.09a: Expedited grace-period updates
fixes.2020.01.24a: Miscellaneous fixes
kfree_rcu.2020.01.24a: Batch kfree_rcu() work
list.2020.01.10a: RCU-protected-list updates
preempt.2020.01.24a: Preemptible RCU updates
torture.2019.12.09a: Torture-test updates
|
|
Long ago, RCU used the stop-machine mechanism to implement expedited
grace periods, but no longer does so. This commit therefore removes
the no-longer-needed #includes of linux/stop_machine.h.
Link: https://lwn.net/Articles/805317/
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The ->srcu_last_gp_end field is accessed from any CPU at any time
by synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change.
Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Currently, force_qs_rnp() uses a for_each_leaf_node_possible_cpu()
loop containing a check of the current CPU's bit in ->qsmask.
This works, but this commit saves three lines by instead using
for_each_leaf_node_cpu_mask(), which combines the functionality of
for_each_leaf_node_possible_cpu() and leaf_node_cpu_bit(). This commit
also replaces the use of the local variable "bit" with rdp->grpmask.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit moves the rcu_{expedited,normal} definitions from
kernel/rcu/update.c to include/linux/rcupdate.h to make sure they are
in sync, and also to avoid the following warning from sparse:
kernel/ksysfs.c:150:5: warning: symbol 'rcu_expedited' was not declared. Should it be static?
kernel/ksysfs.c:167:5: warning: symbol 'rcu_normal' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Only tree_stall.h needs to get name from GP state, so this commit
moves the gp_state_names[] array and the gp_state_getname()
from kernel/rcu/tree.h and kernel/rcu/tree.c, respectively, to
kernel/rcu/tree_stall.h. While moving gp_state_names[], this commit
uses the GCC syntax to ensure that the right string is associated with
the right CPP macro.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The call_rcu() function is an external RCU API that is declared in
include/linux/rcupdate.h. There is thus no point in redeclaring it
in kernel/rcu/tree.h, so this commit removes that redundant declaration.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
In the call to trace_rcu_utilization() at the start of the loop in
rcu_cpu_kthread(), "rcu_wait" is incorrect, plus this trace event needs
to be hoisted above the loop to balance with either the "rcu_wait" or
"rcu_yield", depending on how the loop exits. This commit therefore
makes these changes.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The C preprocessor macros SRCU and TINY_RCU should instead be CONFIG_SRCU
and CONFIG_TINY_RCU, respectively in the #f in kernel/rcu/rcu.h. But
there is no harm when "TINY_RCU" is wrongly used, which are always
non-defined, which makes "!defined(TINY_RCU)" always true, which means
the code block is always included, and the included code block doesn't
cause any compilation error so far in CONFIG_TINY_RCU builds. It is
also the reason this change should not be taken in -stable.
This commit adds the needed "CONFIG_" prefix to both macros.
Not for -stable.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
In the current code, rcu_nmi_enter_common() might decide to turn on
the tick using tick_dep_set_cpu(), but be delayed just before doing so.
Then the grace-period kthread might notice that the CPU in question had
in fact gone through a quiescent state, thus turning off the tick using
tick_dep_clear_cpu(). The later invocation of tick_dep_set_cpu() would
then incorrectly leave the tick on.
This commit therefore enlists the aid of the leaf rcu_node structure's
->lock to ensure that decisions to enable or disable the tick are
carried out before they can be reversed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit provides wrapper functions for uses of ->rcu_read_lock_nesting
to improve readability and to ease future changes to support inlining
of __rcu_read_lock() and __rcu_read_unlock().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
The rcu_node structure's ->expmask field is updated only when holding the
->lock, but is also accessed locklessly. This means that all ->expmask
updates must use WRITE_ONCE() and all reads carried out without holding
->lock must use READ_ONCE(). This commit therefore changes the lockless
->expmask read in rcu_read_unlock_special() to use READ_ONCE().
Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
|
|
In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is
cleared one piece at a time. Given that the "if" statements in this
function use the copy in "special", this commit removes the clearing
of the individual pieces in favor of clearing ->rcu_read_unlock_special
in one go just after it has been determined to be non-zero.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Currently, the .exp_hint flag is cleared in rcu_read_unlock_special(),
which works, but which can also prevent subsequent rcu_read_unlock() calls
from helping expedite the quiescent state needed by an ongoing expedited
RCU grace period. This commit therefore defers clearing of .exp_hint
from rcu_read_unlock_special() to rcu_preempt_deferred_qs_irqrestore(),
thus ensuring that intervening calls to rcu_read_unlock() have a chance
to help end the expedited grace period.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
CONFIG_PREEMPTION and CONFIG_PREEMPT_RCU are always identical,
but some code depends on CONFIG_PREEMPTION to access to
rcu_preempt functionality. This patch changes CONFIG_PREEMPTION
to CONFIG_PREEMPT_RCU in these cases.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Now that the kfree_rcu() special-casing has been removed from tree RCU,
this commit removes kfree_call_rcu_nobatch() since it is no longer needed.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit removes kfree_rcu() special-casing and the lazy-callback
handling from Tree RCU. It moves some of this special casing to Tiny RCU,
the removal of which will be the subject of later commits.
This results in a nice negative delta.
Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This commit applies RCU's debug_objects debugging to the new batched
kfree_rcu() implementations. The object is queued at the kfree_rcu()
call and dequeued during reclaim.
Tested that enabling CONFIG_DEBUG_OBJECTS_RCU_HEAD successfully detects
double kfree_rcu() calls.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix IRQ per kbuild test robot <lkp@intel.com> feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
During testing, it was observed that amount of memory consumed due
kfree_rcu() batching is 300-400MB. Previously we had only a single
head_free pointer pointing to the list of rcu_head(s) that are to be
freed after a grace period. Until this list is drained, we cannot queue
any more objects on it since such objects may not be ready to be
reclaimed when the worker thread eventually gets to drainin g the
head_free list.
We can do better by maintaining multiple lists as done by this patch.
Testing shows that memory consumption came down by around 100-150MB with
just adding another list. Adding more than 1 additional list did not
show any improvement.
Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Code style and initialization handling. ]
[ paulmck: Fix field name, reported by kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Because the ->monitor_todo field is always protected by krcp->lock,
this commit downgrades from xchg() to non-atomic unmarked assignment
statements.
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Update to include early-boot kick code. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
This test runs kfree_rcu() in a loop to measure performance of the new
kfree_rcu() batching functionality.
The following table shows results when booting with arguments:
rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=X
rcuperf.kfree_no_batch=X # Grace Periods Test Duration (s)
X=1 (old behavior) 9133 11.5
X=0 (new behavior) 1732 12.5
On a 16 CPU system with the above boot parameters, we see that the total
number of grace periods that elapse during the test drops from 9133 when
not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
flood itself slows down a bit when batching, though, as shown.
Note that the active memory consumption during the kfree_rcu() flood
does increase to around 200-250MB due to the batching (from around 50MB
without batching). However, this memory consumption is relatively
constant. In other words, the system is able to keep up with the
kfree_rcu() load. The memory consumption comes down considerably if
KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80. A later patch will
reduce memory consumption further by using multiple lists.
Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
CONFIG_PROVE_RCU for realistic comparisons with/without batching.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Recently a discussion about stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1]
which led to another discussion how to prepare for this situation.
This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches and there are
different ideas being discussed in those regards. This is an effort to
start simple, and build up from there. In the future, an extension to
use kfree_bulk and possibly per-slab batching could be done to further
improve performance due to cache-locality and slab-specific bulk free
optimizations. By using an array of pointers, the worker thread
processing the work would need to read lesser data since it does not
need to deal with large rcu_head(s) any longer.
Torture tests follow in the next patch and show improvements of around
5x reduction in number of grace periods on a 16 CPU system. More
details and test data are in that patch.
There is an implication with rcu_barrier() with this patch. Since the
kfree_rcu() calls can be batched, and may not be handed yet to the RCU
machinery in fact, the monitor may not have even run yet to do the
queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
to wait for those kfree_rcu()s that are already made. So this means a
kfree_rcu() followed by an rcu_barrier() does not imply that memory will
be freed once rcu_barrier() returns.
Another implication is higher active memory usage (although not
run-away..) until the kfree_rcu() flooding ends, in comparison to
without batching. More details about this are in the second patch which
adds an rcuperf test.
Finally, in the near future we will get rid of kfree_rcu() special casing
within RCU such as in rcu_do_batch and switch everything to just
batching. Currently we don't do that since timer subsystem is not yet up
and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
lock are not initialized. That would also mean getting rid of
kfree_call_rcu_nobatch() entirely.
[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824
Cc: kernel-team@android.com
Cc: kernel-team@lge.com
Co-developed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Applied 0day and Paul Walmsley feedback on ->monitor_todo. ]
[ paulmck: Make it work during early boot. ]
[ paulmck: Add a crude early boot self-test. ]
[ paulmck: Style adjustments and experimental docbook structure header. ]
Link: https://lore.kernel.org/lkml/alpine.DEB.2.21.9999.1908161931110.32497@viisi.sifive.com/T/#me9956f66cb611b95d26ae92700e1d901f46e8c59
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
head is traversed using hlist_for_each_entry_rcu outside an RCU
read-side critical section but under the protection of dtab->index_lock.
Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.
Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various tracing fixes:
- Fix a function comparison warning for a xen trace event macro
- Fix a double perf_event linking to a trace_uprobe_filter for
multiple events
- Fix suspicious RCU warnings in trace event code for using
list_for_each_entry_rcu() when the "_rcu" portion wasn't needed.
- Fix a bug in the histogram code when using the same variable
- Fix a NULL pointer dereference when tracefs lockdown enabled and
calling trace_set_default_clock()
- A fix to a bug found with the double perf_event linking patch"
* tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/uprobe: Fix to make trace_uprobe_filter alignment safe
tracing: Do not set trace clock if tracefs lockdown is in effect
tracing: Fix histogram code when expression has same var as value
tracing: trigger: Replace unneeded RCU-list traversals
tracing/uprobe: Fix double perf_event linking on multiprobe uprobe
tracing: xen: Ordered comparison of function pointers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Prevent the kernel from crashing during resume from hibernation if
free pages contain leftover data from the restore kernel and
init_on_free is set (Alexander Potapenko)"
* tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: fix crashes with init_on_free=1
|