Age | Commit message (Collapse) | Author |
|
After commit f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using
alloc_netdev"), default qdisc was changed to noqueue because
tuntap does not set tx_queue_len during .setup(). This patch restores
default qdisc by setting tx_queue_len in tun_setup().
Fixes: f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev")
Cc: Phil Sutter <phil@nwl.cc>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
* pm-core:
PM / wakeirq: fix wakeirq setting after wakup re-configuration from sysfs
PM / runtime: Document steps for device removal
* powercap:
powercap: intel_rapl: Add missing Haswell model
* pm-tools:
tools/power turbostat: work around RC6 counter wrap
tools/power turbostat: initial KBL support
tools/power turbostat: initial SKX support
tools/power turbostat: decode BXT TSC frequency via CPUID
tools/power turbostat: initial BXT support
tools/power turbostat: print IRTL MSRs
tools/power turbostat: SGX state should print only if --debug
|
|
* pm-cpufreq:
cpufreq: dt: Drop stale comment
cpufreq: intel_pstate: Documenation for structures
cpufreq: intel_pstate: fix inconsistency in setting policy limits
intel_pstate: Avoid extra invocation of intel_pstate_sample()
intel_pstate: Do not set utilization update hook too early
* pm-cpuidle:
intel_idle: Add KBL support
intel_idle: Add SKX support
intel_idle: Clean up all registered devices on exit.
intel_idle: Propagate hot plug errors.
intel_idle: Don't overreact to a cpuidle registration failure.
intel_idle: Setup the timer broadcast only on successful driver load.
intel_idle: Avoid a double free of the per-CPU data.
intel_idle: Fix dangling registration on error path.
intel_idle: Fix deallocation order on the driver exit path.
intel_idle: Remove redundant initialization calls.
intel_idle: Fix a helper function's return value.
intel_idle: remove useless return from void function.
* acpi-cppc:
mailbox: pcc: Don't access an unmapped memory address space
|
|
Ptr to devlink structure can be easily obtained from
devlink_port->devlink. So share user_ptr[0] pointer for both and leave
user_ptr[1] free for other users.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jiri Pirko says:
====================
mlxsw: small driver update + one tiny devlink dependency
Cosmetics, in preparation to sharedbuffer patchset.
First patch is here to allow patch number two.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix copy&paste error and state the name of SBPM register correctly.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Same field, same values, so share the same enum.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of that, pass mlxsw_core and use a helper to get driver priv
from driver code. Looks much cleaner that way.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of passing around driver priv, pass struct mlxsw_core *
directly.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Remove devlink port reg/unreg from spectrum and switchx2 code and rather
do the common work in core. That also ensures code separation where
devlink is only used in core.c.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As we rely on caller zeroing or correctly set the struct before the call,
this implicit type set is either no-op (DEVLINK_PORT_TYPE_NOTSET is 0)
or it rewrites wanted value. So remove this.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Jakub Kicinski says:
====================
MTU/buffer reconfig changes
I re-discussed MPLS/MTU internally, dropped it from the patch 1,
re-tested everything, found out I forgot about debugfs pointers,
fixed that as well.
v5:
- don't reserve space in RX buffers for MPLS label stack
(patch 1);
- fix debugfs pointers to ring structures (patch 5).
v4:
- cut down on unrelated patches;
- don't "close" the device on error path.
--- v4 cover letter
Previous series included some not entirely related patches,
this one is cut down. Main issue I'm trying to solve here
is that .ndo_change_mtu() in nfpvf driver is doing full
close/open to reallocate buffers - which if open fails
can result in device being basically closed even though
the interface is started. As suggested by you I try to move
towards a paradigm where the resources are allocated first
and the MTU change is only done once I'm certain (almost)
nothing can fail. Almost because I need to communicate
with FW and that can always time out.
Patch 1 fixes small issue. Next 10 patches reorganize things
so that I can easily allocate new rings and sets of buffers
while the device is running. Patches 13 and 15 reshape the
.ndo_change_mtu() and ethtool's ring-resize operation into
desired form.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since much of the required changes have already been made for
changing MTU at runtime let's use it for ring size changes as
well.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Soon ring resize will call this functions with values
different than the current configuration we need to
explicitly pass the ring count as parameter.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When changing MTU on running device first allocate new rings
and buffers and once it succeeds proceed with changing MTU.
Allocation of new rings is not really necessary for this
operation - it's done to keep the code simple and because
size of the extra ring memory is quite small compared to
the size of buffers.
Operation can still fail midway through if FW communication
times out. In that case we retry with old MTU (rings).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Free list buffer size needs to be propagated to few functions
as a parameter and added to struct nfp_net_rx_ring since soon
some of the functions will be reused to manage rings with
buffers of size different than nn->fl_bufsz.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
FW reconfiguration in .ndo_open()/.ndo_stop() should reset/
restore queue state. Since we need IRQs to be disabled when
filling rings on RX path we have to move disable_irq() from
.ndo_open() all the way up to IRQ allocation.
nfp_net_start_vec() becomes trivial now so it's inlined.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Divide .ndo_open() and .ndo_stop() into logical, callable
chunks. No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nfp_net_[rt]x_ring_{alloc,free} should only allocate or free
ring resources without touching the device. Move setting
parameters in the BAR to separate functions. This will make
it possible to reuse alloc/free functions to allocate new
rings while the device is running.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We want the .ndo_open() to have following structure:
- allocate resources;
- configure HW/FW;
- enable the device from stack perspective.
Therefore filling RX rings needs to be moved to the beginning
of .ndo_open().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Separate allocation of buffers from giving them to FW,
thanks to this it will be possible to move allocation
earlier on .ndo_open() path and reuse buffers during
runtime reconfiguration.
Similar to TX side clean up the spill of functionality
from flush to freeing the ring. Unlike on TX side,
RX ring reset does not free buffers from the ring.
Ring reset means only that FW pointers are zeroed and
buffers on the ring must be placed in [0, cnt - 1)
positions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since we never used flush without freeing the ring later
the functionality of the two operations is mixed.
Rename flush to ring reset and move there all the things
which have to be done after FW ring state is cleared.
While at it do some clean-ups.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To be able to switch rings more easily on config changes
allocate them dynamically, separately from nfp_net structure.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nfp_net_[rt]x_ring_init functions used to be called from probe
path only and some of their functionality was spilled to the
call site. In order to reuse them for ring reconfiguration
we need them to do all the init.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
nfp_net_{alloc|free}_rings contained strange mix of allocations
and vector initialization. Remove it, declare vector init as
a separate function and handle allocations explicitly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need to be able to disable the link state interrupt when
the device is brought down. We used to just free the IRQ
at the beginning of .ndo_stop(). As we now move towards
more ordered .ndo_open()/.ndo_stop() paths LSC allocation
should be placed in the "allocate resource" section.
Since the IRQ can't be freed early in .ndo_stop(), it is
disabled instead.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When calculating the RX buffer length we need to account
for up to 2 VLAN tags. Rounding up to 1k is an relic of
a distant past and can be removed. While at it also remove
trivial print statement.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Emit the logging messages at the appropriate levels.
Miscellanea:
o Change format to fmt
o Use the more common ##__VA_ARGS__
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
It would have been possible for a rogue client-core to send in a symlink
target which is not NUL terminated. This returns EIO if the client-core
gives us corrupt data.
Leave debugfs and superblock code as is for now.
Other dcache.c and namei.c strncpy instances are safe because
ORANGEFS_NAME_MAX = NAME_MAX + 1; there is always enough space for a
name plus a NUL byte.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
The ctime and mtime are always updated on a successful ftruncate and
only updated on a successful truncate where the size changed.
We handle the ``if the size changed'' bit.
This matches FUSE's behavior.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
fs/orangefs/orangefs-debugfs.c:130:2-26: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
NULL check before some freeing functions is not needed.
Based on checkpatch warning
"kfree(NULL) is safe this check is probably not required"
and kfreeaddr.cocci by Julia Lawall.
Generated by: scripts/coccinelle/free/ifnullfree.cocci
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
Suggested by David Binderman <dcb314@hotmail.com>
The former can potentially be a performance win over the latter.
memcpy(d, s, len);
memset(d+len, c, size-len);
memset(d, c, size);
memcpy(d, s, len);
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
1. It is nonsense to test for negative size_t, suggested by
David Binderman <dcb314@hotmail.com>
2. By the time Orangefs gets called, the vfs has ensured that
name != NULL, and that buffer and size are sane.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
|
|
find_outdev calls inet{,6}_fib_lookup_dev() or dev_get_by_index() to
find the output device. In case of an error, inet{,6}_fib_lookup_dev()
returns error pointer and dev_get_by_index() returns NULL. But the function
only checks for NULL and thus can end up calling dev_put on an ERR_PTR.
This patch adds an additional check for err ptr after the NULL check.
Before: Trying to add an mpls route with no oif from user, no available
path to 10.1.1.8 and no default route:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
[ 822.337195] BUG: unable to handle kernel NULL pointer dereference at
00000000000003a3
[ 822.340033] IP: [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] PGD 1db38067 PUD 1de9e067 PMD 0
[ 822.340033] Oops: 0000 [#1] SMP
[ 822.340033] Modules linked in:
[ 822.340033] CPU: 0 PID: 11148 Comm: ip Not tainted 4.5.0-rc7+ #54
[ 822.340033] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[ 822.340033] task: ffff88001db82580 ti: ffff88001dad4000 task.ti:
ffff88001dad4000
[ 822.340033] RIP: 0010:[<ffffffff8148781e>] [<ffffffff8148781e>]
mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP: 0018:ffff88001dad7a88 EFLAGS: 00010282
[ 822.340033] RAX: ffffffffffffff9b RBX: ffffffffffffff9b RCX:
0000000000000002
[ 822.340033] RDX: 00000000ffffff9b RSI: 0000000000000008 RDI:
0000000000000000
[ 822.340033] RBP: ffff88001ddc9ea0 R08: ffff88001e9f1768 R09:
0000000000000000
[ 822.340033] R10: ffff88001d9c1100 R11: ffff88001e3c89f0 R12:
ffffffff8187e0c0
[ 822.340033] R13: ffffffff8187e0c0 R14: ffff88001ddc9e80 R15:
0000000000000004
[ 822.340033] FS: 00007ff9ed798700(0000) GS:ffff88001fc00000(0000)
knlGS:0000000000000000
[ 822.340033] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 822.340033] CR2: 00000000000003a3 CR3: 000000001de89000 CR4:
00000000000006f0
[ 822.340033] Stack:
[ 822.340033] 0000000000000000 0000000100000000 0000000000000000
0000000000000000
[ 822.340033] 0000000000000000 0801010a00000000 0000000000000000
0000000000000000
[ 822.340033] 0000000000000004 ffffffff8148749b ffffffff8187e0c0
000000000000001c
[ 822.340033] Call Trace:
[ 822.340033] [<ffffffff8148749b>] ? mpls_rt_alloc+0x2b/0x3e
[ 822.340033] [<ffffffff81488e66>] ? mpls_rtm_newroute+0x358/0x3e2
[ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<ffffffff813b7d94>] ? rtnetlink_rcv_msg+0x17e/0x191
[ 822.340033] [<ffffffff8111794e>] ? __kmalloc_track_caller+0x8c/0x9e
[ 822.340033] [<ffffffff813c9393>] ?
rht_key_hashfn.isra.20.constprop.57+0x14/0x1f
[ 822.340033] [<ffffffff813b7c16>] ? __rtnl_unlock+0xc/0xc
[ 822.340033] [<ffffffff813cb794>] ? netlink_rcv_skb+0x36/0x82
[ 822.340033] [<ffffffff813b4507>] ? rtnetlink_rcv+0x1f/0x28
[ 822.340033] [<ffffffff813cb2b1>] ? netlink_unicast+0x106/0x189
[ 822.340033] [<ffffffff813cb5b3>] ? netlink_sendmsg+0x27f/0x2c8
[ 822.340033] [<ffffffff81392ede>] ? sock_sendmsg_nosec+0x10/0x1b
[ 822.340033] [<ffffffff81393df1>] ? ___sys_sendmsg+0x182/0x1e3
[ 822.340033] [<ffffffff810e4f35>] ?
__alloc_pages_nodemask+0x11c/0x1e4
[ 822.340033] [<ffffffff8110619c>] ? PageAnon+0x5/0xd
[ 822.340033] [<ffffffff811062fe>] ? __page_set_anon_rmap+0x45/0x52
[ 822.340033] [<ffffffff810e7bbc>] ? get_page+0x5/0xa
[ 822.340033] [<ffffffff810e85ab>] ? __lru_cache_add+0x1a/0x3a
[ 822.340033] [<ffffffff81087ea9>] ? current_kernel_time64+0x9/0x30
[ 822.340033] [<ffffffff813940c4>] ? __sys_sendmsg+0x3c/0x5a
[ 822.340033] [<ffffffff8148f597>] ?
entry_SYSCALL_64_fastpath+0x12/0x6a
[ 822.340033] Code: 83 08 04 00 00 65 ff 00 48 8b 3c 24 e8 40 7c f2 ff
eb 13 48 c7 c3 9f ff ff ff eb 0f 89 ce e8 f1 ae f1 ff 48 89 c3 48 85 db
74 15 <48> 8b 83 08 04 00 00 65 ff 08 48 81 fb 00 f0 ff ff 76 0d eb 07
[ 822.340033] RIP [<ffffffff8148781e>] mpls_nh_assign_dev+0x10b/0x182
[ 822.340033] RSP <ffff88001dad7a88>
[ 822.340033] CR2: 00000000000003a3
[ 822.435363] ---[ end trace 98cc65e6f6b8bf11 ]---
After patch:
$ip -f mpls route add 100 as 200 via inet 10.1.1.8
RTNETLINK answers: Network is unreachable
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2016-04-07
This series contains updates to ixgbe and ixgbevf.
This entire series (except for one patch from Alex) comes from Mark and
is mainly to add support for our new MAC (x550em_a).
So let's get Alex's patch out of the way first before we cover Mark's
many changes. Alex does his enable bulk free in transmit cleanup for
ixgbe and ixgbevf, like his has done for all of our other drivers.
First Mark cleans up registers that were not being used, so do some
house cleaning. Then to avoid casting lan_id and func fields, just
make them u8 since they only hold small values anyways. Found and
fixed an issue where on read operations it could be possible to
modify locations beyond the length passed in, so change the check
to round up in the same way. Cleaned up the interface for issuing
firmware commands to use a void * instead of a u32 * which eliminates
a number of casts. Added support for the new MAC and provided method
pointers and use them to access IOSF-attached devices, since the
new MAC will also need a new access method. Added support for SFPs
with an external retimer and for an SGMII backplane interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When sending a UDPv6 message longer than MTU, account for the length
of fragmentable IPv6 extension headers in skb->network_header offset.
Same as we do in alloc_new_skb path in __ip6_append_data().
This ensures that later on __ip6_make_skb() will make space in
headroom for fragmentable extension headers:
/* move skb->data to ip header from ext header */
if (skb->data < skb_network_header(skb))
__skb_pull(skb, skb_network_offset(skb));
Prevents a splat due to skb_under_panic:
skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] KASAN
CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
[...]
Call Trace:
[<ffffffff813eb7b9>] skb_push+0x79/0x80
[<ffffffff8143397b>] eth_header+0x2b/0x100
[<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
[<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
[<ffffffff814efe3a>] ip6_output+0x16a/0x280
[<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
[<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
[<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
[<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
[...]
Reported-by: Ji Jianwen <jiji@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 0fd10721fe3664f7549e74af9d28a509c9a68719.
That patch causes the ib_srpt driver to crash as soon as the first SCSI
command is received:
kernel BUG at drivers/infiniband/ulp/srpt/ib_srpt.c:1439!
invalid opcode: 0000 [#1] SMP
Workqueue: target_completion target_complete_ok_work [target_core_mod]
RIP: srpt_queue_response+0x437/0x4a0 [ib_srpt]
Call Trace:
srpt_queue_data_in+0x9/0x10 [ib_srpt]
target_complete_ok_work+0x152/0x2b0 [target_core_mod]
process_one_work+0x197/0x480
worker_thread+0x49/0x490
kthread+0xea/0x100
ret_from_fork+0x22/0x40
Aside from the crash, the shortcomings of that patch are as follows:
- It makes the ib_srpt driver use I/O contexts allocated by
transport_alloc_session_tags() but it does not initialize these I/O
contexts properly. All the initializations performed by
srpt_alloc_ioctx() are skipped.
- It swaps the order of the send ioctx allocation and the transition to
RTR mode which is wrong.
- The amount of memory that is needed for I/O contexts is doubled.
- srpt_rdma_ch.free_list is no longer used but is not removed.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Alexei Starovoitov says:
====================
allow bpf attach to tracepoints
Hi Steven, Peter,
v1->v2: addressed Peter's comments:
- fixed wording in patch 1, added ack
- refactored 2nd patch into 3:
2/10 remove unused __perf_addr macro which frees up
an argument in perf_trace_buf_submit
3/10 split perf_trace_buf_prepare into alloc and update parts, so that bpf
programs don't have to pay performance penalty for update of struct trace_entry
which is not going to be accessed by bpf
4/10 actual addition of bpf filter to perf tracepoint handler is now trivial
and bpf prog can be used as proper filter of tracepoints
v1 cover:
last time we discussed bpf+tracepoints it was a year ago [1] and the reason
we didn't proceed with that approach was that bpf would make arguments
arg1, arg2 to trace_xx(arg1, arg2) call to be exposed to bpf program
and that was considered unnecessary extension of abi. Back then I wanted
to avoid the cost of buffer alloc and field assign part in all
of the tracepoints, but looks like when optimized the cost is acceptable.
So this new apporach doesn't expose any new abi to bpf program.
The program is looking at tracepoint fields after they were copied
by perf_trace_xx() and described in /sys/kernel/debug/tracing/events/xxx/format
We made a tool [2] that takes arguments from /sys/.../format and works as:
$ tplist.py -v random:urandom_read
int got_bits;
int pool_left;
int input_left;
Then these fields can be copy-pasted into bpf program like:
struct urandom_read {
__u64 hidden_pad;
int got_bits;
int pool_left;
int input_left;
};
and the program can use it:
SEC("tracepoint/random/urandom_read")
int bpf_prog(struct urandom_read *ctx)
{
return ctx->pool_left > 0 ? 1 : 0;
}
This way the program can access tracepoint fields faster than
equivalent bpf+kprobe program, which is the main goal of these patches.
Patch 1-4 are simple changes in perf core side, please review.
I'd like to take the whole set via net-next tree, since the rest of
the patches might conflict with other bpf work going on in net-next
and we want to avoid cross-tree merge conflicts.
Alternatively we can put patches 1-4 into both tip and net-next.
Patch 9 is an example of access to tracepoint fields from bpf prog.
Patch 10 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.
Note that for actual tracing tools the user doesn't need to
run tplist.py and copy-paste fields manually. The tools do it
automatically. Like argdist tool [3] can be used as:
$ argdist -H 't:block:block_rq_complete():u32:nr_sector'
where 'nr_sector' is name of tracepoint field taken from
/sys/kernel/debug/tracing/events/block/block_rq_complete/format
and appropriate bpf program is generated on the fly.
[1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
[2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
[3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the first microbenchmark does
fd=open("/proc/self/comm");
for() {
write(fd, "test");
}
and on 4 cpus in parallel:
writes per sec
base (no tracepoints, no kprobes) 930k
with kprobe at __set_task_comm() 420k
with tracepoint at task:task_rename 730k
For kprobe + full bpf program manully fetches oldcomm, newcomm via bpf_probe_read.
For tracepint bpf program does nothing, since arguments are copied by tracepoint.
2nd microbenchmark does:
fd=open("/dev/urandom");
for() {
read(fd, buf);
}
and on 4 cpus in parallel:
reads per sec
base (no tracepoints, no kprobes) 300k
with kprobe at urandom_read() 279k
with tracepoint at random:urandom_read 290k
bpf progs attached to kprobe and tracepoint are noop.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
modify offwaketime to work with sched/sched_switch tracepoint
instead of kprobe into finish_task_switch
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Recognize "tracepoint/" section name prefix and attach the program
to that tracepoint.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields
This also disallows access to __dynamic_array fields, but can be
relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
programs
needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
register tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be attached
to the perf tracepoint handler, which will copy the arguments into
the per-cpu buffer and pass it to the bpf program as its first argument.
The layout of the fields can be discovered by doing
'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
prior to the compilation of the program with exception that first 8 bytes
are reserved and not accessible to the program. This area is used to store
the pointer to 'struct pt_regs' which some of the bpf helpers will use:
+---------+
| 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
+---------+
| N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
+---------+
| dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
+---------+
Not that all of the fields are already dumped to user space via perf ring buffer
and broken application access it directly without consulting tracepoint/format.
Same rule applies here: static tracepoint fields should only be accessed
in a format defined in tracepoint/format. The order of fields and
field sizes are not an ABI.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
split allows to move expensive update of 'struct trace_entry' to later phase.
Repurpose unused 1st argument of perf_tp_event() to indicate event type.
While splitting use temp variable 'rctx' instead of '*rctx' to avoid
unnecessary loads done by the compiler due to -fno-strict-aliasing
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
now all calls to perf_trace_buf_submit() pass 0 as 4th
argument which will be repurposed in the next patch which will
change the meaning of 1st arg of perf_tp_event() to event_type
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu init logic and
subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
so we can safely drop memset from all of the above cases and move it into
perf_ftrace_function_call that calls it with stack allocated pt_regs.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Needs to be protected with CONFIG_LOCKDEP.
Based upon a patch by Hannes Frederic Sowa.
Signed-off-by: David S. Miller <davem@davemloft.net>
|