Age | Commit message (Collapse) | Author |
|
Bug fixes overlapping feature additions and refactoring, mostly.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_lookup_leaf()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:773: warning: Function parameter or member 'hprio' not described in 'htb_lookup_leaf'
net/sched/sch_htb.c:773: warning: Function parameter or member 'prio' not described in 'htb_lookup_leaf'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_do_events()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:708: warning: Function parameter or member 'q' not described in 'htb_do_events'
net/sched/sch_htb.c:708: warning: Function parameter or member 'level' not described in 'htb_do_events'
net/sched/sch_htb.c:708: warning: Function parameter or member 'start' not described in 'htb_do_events'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_charge_class()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:663: warning: Function parameter or member 'q' not described in 'htb_charge_class'
net/sched/sch_htb.c:663: warning: Function parameter or member 'cl' not described in 'htb_charge_class'
net/sched/sch_htb.c:663: warning: Function parameter or member 'level' not described in 'htb_charge_class'
net/sched/sch_htb.c:663: warning: Function parameter or member 'skb' not described in 'htb_charge_class'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_deactivate()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:578: warning: Function parameter or member 'q' not described in 'htb_deactivate'
net/sched/sch_htb.c:578: warning: Function parameter or member 'cl' not described in 'htb_deactivate'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_activate()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:562: warning: Function parameter or member 'q' not described in 'htb_activate'
net/sched/sch_htb.c:562: warning: Function parameter or member 'cl' not described in 'htb_activate'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_change_class_mode()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:533: warning: Function parameter or member 'q' not described in 'htb_change_class_mode'
net/sched/sch_htb.c:533: warning: Function parameter or member 'cl' not described in 'htb_change_class_mode'
net/sched/sch_htb.c:533: warning: Function parameter or member 'diff' not described in 'htb_change_class_mode'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_class_mode()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:507: warning: Function parameter or member 'cl' not described in 'htb_class_mode'
net/sched/sch_htb.c:507: warning: Function parameter or member 'diff' not described in 'htb_class_mode'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_deactivate_prios()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:442: warning: Function parameter or member 'q' not described in 'htb_deactivate_prios'
net/sched/sch_htb.c:442: warning: Function parameter or member 'cl' not described in 'htb_deactivate_prios'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_activate_prios()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:407: warning: Function parameter or member 'q' not described in 'htb_activate_prios'
net/sched/sch_htb.c:407: warning: Function parameter or member 'cl' not described in 'htb_activate_prios'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_remove_class_from_row()
to fix gcc W=1 warnings:
net/sched/sch_htb.c:380: warning: Function parameter or member 'q' not described in 'htb_remove_class_from_row'
net/sched/sch_htb.c:380: warning: Function parameter or member 'cl' not described in 'htb_remove_class_from_row'
net/sched/sch_htb.c:380: warning: Function parameter or member 'mask' not described in 'htb_remove_class_from_row'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_add_class_to_row() to fix
gcc W=1 warnings:
net/sched/sch_htb.c:351: warning: Function parameter or member 'q' not described in 'htb_add_class_to_row'
net/sched/sch_htb.c:351: warning: Function parameter or member 'cl' not described in 'htb_add_class_to_row'
net/sched/sch_htb.c:351: warning: Function parameter or member 'mask' not described in 'htb_add_class_to_row'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_next_rb_node() to fix
gcc W=1 warnings:
net/sched/sch_htb.c:339: warning: Function parameter or member 'n' not described in 'htb_next_rb_node'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_add_to_wait_tree() to fix
gcc W=1 warnings:
net/sched/sch_htb.c:308: warning: Function parameter or member 'q' not described in 'htb_add_to_wait_tree'
net/sched/sch_htb.c:308: warning: Function parameter or member 'cl' not described in 'htb_add_to_wait_tree'
net/sched/sch_htb.c:308: warning: Function parameter or member 'delay' not described in 'htb_add_to_wait_tree'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The commit ae81feb7338c ("sch_htb: fix null pointer dereference
on a null new_q") fixes a NULL pointer dereference bug, but it
is not correct.
Because htb_graft_helper properly handles the case when new_q
is NULL, and after the previous patch by skipping this call
which creates an inconsistency : dev_queue->qdisc will still
point to the old qdisc, but cl->parent->leaf.q will point to
the new one (which will be noop_qdisc, because new_q was NULL).
The code is based on an assumption that these two pointers are
the same, so it can lead to refcount leaks.
The correct fix is to add a NULL pointer check to protect
qdisc_refcount_inc inside htb_parent_to_leaf_offload.
Fixes: ae81feb7338c ("sch_htb: fix null pointer dereference on a null new_q")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Suggested-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add description for parameters of htb_add_to_id_tree() to fix
gcc W=1 warnings:
net/sched/sch_htb.c:282: warning: Function parameter or member 'root' not described in 'htb_add_to_id_tree'
net/sched/sch_htb.c:282: warning: Function parameter or member 'cl' not described in 'htb_add_to_id_tree'
net/sched/sch_htb.c:282: warning: Function parameter or member 'prio' not described in 'htb_add_to_id_tree'
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
mlx5 devices were observed generating MLX5_PORT_CHANGE_SUBTYPE_ACTIVE
events without an intervening MLX5_PORT_CHANGE_SUBTYPE_DOWN. This
breaks link flap detection based on Linux carrier state transition
count as netif_carrier_on() does nothing if carrier is already on.
Make sure we count such events.
netif_carrier_event() increments the counters and fires the linkwatch
events. The latter is not necessary for the use case but seems like
the right thing to do.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Dump vlan priority only if it has been previously set.
Fix the tests accordingly.
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently vlan modification action checks existence of vlan priority by
comparing it to 0. Therefore it is impossible to modify existing vlan
tag to have priority 0.
For example, the following tc command will change the vlan id but will
not affect vlan priority:
tc filter add dev eth1 ingress matchall action vlan modify id 300 \
priority 0 pipe mirred egress redirect dev eth2
The incoming packet on eth1:
ethertype 802.1Q (0x8100), vlan 200, p 4, ethertype IPv4
will be changed to:
ethertype 802.1Q (0x8100), vlan 300, p 4, ethertype IPv4
although the user has intended to have p == 0.
The fix is to add tcfv_push_prio_exists flag to struct tcf_vlan_params
and rely on it when deciding to set the priority.
Fixes: 45a497f2d149a4a8061c (net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action)
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix some spelling mistakes in comments:
sevaral ==> several
sugestion ==> suggestion
unregster ==> unregister
suplied ==> supplied
cirsumstances ==> circumstances
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Link: https://lore.kernel.org/r/20210531020048.2920054-1-zhengyongjun3@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix current behavior of skipping template allocation in case the
ct action is in zone 0.
Skipping the allocation may cause the datapath ct code to ignore the
entire ct action with all its attributes (commit, nat) in case the ct
action in zone 0 was preceded by a ct clear action.
The ct clear action sets the ct_state to untracked and resets the
skb->_nfct pointer. Under these conditions and without an allocated
ct template, the skb->_nfct pointer will remain NULL which will
cause the tc ct action handler to exit without handling commit and nat
actions, if such exist.
For example, the following rule in OVS dp:
recirc_id(0x2),ct_state(+new-est-rel-rpl+trk),ct_label(0/0x1), \
in_port(eth0),actions:ct_clear,ct(commit,nat(src=10.11.0.12)), \
recirc(0x37a)
Will result in act_ct skipping the commit and nat actions in zone 0.
The change removes the skipping of template allocation for zone 0 and
treats it the same as any other zone.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/20210526170110.54864-1-lariel@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently established connections are not offloaded if the filter has a
"ct commit" action. This behavior will not offload connections of the
following scenario:
$ tc_filter add dev $DEV ingress protocol ip prio 1 flower \
ct_state -trk \
action ct commit action goto chain 1
$ tc_filter add dev $DEV ingress protocol ip chain 1 prio 1 flower \
action mirred egress redirect dev $DEV2
$ tc_filter add dev $DEV2 ingress protocol ip prio 1 flower \
action ct commit action goto chain 1
$ tc_filter add dev $DEV2 ingress protocol ip prio 1 chain 1 flower \
ct_state +trk+est \
action mirred egress redirect dev $DEV
Offload established connections, regardless of the commit flag.
Fixes: 46475bb20f4b ("net/sched: act_ct: Software offload of established flows")
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/1622029449-27060-1-git-send-email-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Function skb_ext_add() doesn't initialize created skb extension with any
value and leaves it up to the user. However, since extension of type
TC_SKB_EXT originally contained only single value tc_skb_ext->chain its
users used to just assign the chain value without setting whole extension
memory to zero first. This assumption changed when TC_SKB_EXT extension was
extended with additional fields but not all users were updated to
initialize the new fields which leads to use of uninitialized memory
afterwards. UBSAN log:
[ 778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28
[ 778.301495] load of value 107 is not a valid value for type '_Bool'
[ 778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2
[ 778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 778.307901] Call Trace:
[ 778.308680] <IRQ>
[ 778.309358] dump_stack+0xbb/0x107
[ 778.310307] ubsan_epilogue+0x5/0x40
[ 778.311167] __ubsan_handle_load_invalid_value.cold+0x43/0x48
[ 778.312454] ? memset+0x20/0x40
[ 778.313230] ovs_flow_key_extract.cold+0xf/0x14 [openvswitch]
[ 778.314532] ovs_vport_receive+0x19e/0x2e0 [openvswitch]
[ 778.315749] ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch]
[ 778.317188] ? create_prof_cpu_mask+0x20/0x20
[ 778.318220] ? arch_stack_walk+0x82/0xf0
[ 778.319153] ? secondary_startup_64_no_verify+0xb0/0xbb
[ 778.320399] ? stack_trace_save+0x91/0xc0
[ 778.321362] ? stack_trace_consume_entry+0x160/0x160
[ 778.322517] ? lock_release+0x52e/0x760
[ 778.323444] netdev_frame_hook+0x323/0x610 [openvswitch]
[ 778.324668] ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch]
[ 778.325950] __netif_receive_skb_core+0x771/0x2db0
[ 778.327067] ? lock_downgrade+0x6e0/0x6f0
[ 778.328021] ? lock_acquire+0x565/0x720
[ 778.328940] ? generic_xdp_tx+0x4f0/0x4f0
[ 778.329902] ? inet_gro_receive+0x2a7/0x10a0
[ 778.330914] ? lock_downgrade+0x6f0/0x6f0
[ 778.331867] ? udp4_gro_receive+0x4c4/0x13e0
[ 778.332876] ? lock_release+0x52e/0x760
[ 778.333808] ? dev_gro_receive+0xcc8/0x2380
[ 778.334810] ? lock_downgrade+0x6f0/0x6f0
[ 778.335769] __netif_receive_skb_list_core+0x295/0x820
[ 778.336955] ? process_backlog+0x780/0x780
[ 778.337941] ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core]
[ 778.339613] ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0
[ 778.341033] ? kvm_clock_get_cycles+0x14/0x20
[ 778.342072] netif_receive_skb_list_internal+0x5f5/0xcb0
[ 778.343288] ? __kasan_kmalloc+0x7a/0x90
[ 778.344234] ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core]
[ 778.345676] ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core]
[ 778.347140] ? __netif_receive_skb_list_core+0x820/0x820
[ 778.348351] ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core]
[ 778.349688] ? napi_gro_flush+0x26c/0x3c0
[ 778.350641] napi_complete_done+0x188/0x6b0
[ 778.351627] mlx5e_napi_poll+0x373/0x1b80 [mlx5_core]
[ 778.352853] __napi_poll+0x9f/0x510
[ 778.353704] ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core]
[ 778.355158] net_rx_action+0x34c/0xa40
[ 778.356060] ? napi_threaded_poll+0x3d0/0x3d0
[ 778.357083] ? sched_clock_cpu+0x18/0x190
[ 778.358041] ? __common_interrupt+0x8e/0x1a0
[ 778.359045] __do_softirq+0x1ce/0x984
[ 778.359938] __irq_exit_rcu+0x137/0x1d0
[ 778.360865] irq_exit_rcu+0xa/0x20
[ 778.361708] common_interrupt+0x80/0xa0
[ 778.362640] </IRQ>
[ 778.363212] asm_common_interrupt+0x1e/0x40
[ 778.364204] RIP: 0010:native_safe_halt+0xe/0x10
[ 778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00
[ 778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246
[ 778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468
[ 778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e
[ 778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb
[ 778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000
[ 778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000
[ 778.378491] ? rcu_eqs_enter.constprop.0+0xb8/0xe0
[ 778.379606] ? default_idle_call+0x5e/0xe0
[ 778.380578] default_idle+0xa/0x10
[ 778.381406] default_idle_call+0x96/0xe0
[ 778.382350] do_idle+0x3d4/0x550
[ 778.383153] ? arch_cpu_idle_exit+0x40/0x40
[ 778.384143] cpu_startup_entry+0x19/0x20
[ 778.385078] start_kernel+0x3c7/0x3e5
[ 778.385978] secondary_startup_64_no_verify+0xb0/0xbb
Fix the issue by providing new function tc_skb_ext_alloc() that allocates
tc skb extension and initializes its memory to 0 before returning it to the
caller. Change all existing users to use new API instead of calling
skb_ext_add() directly.
Fixes: 038ebb1a713d ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Fixes: d29334c15d33 ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If Qdisc_ops->init() is failed, Qdisc_ops->reset() would be called.
When dsmark_init(Qdisc_ops->init()) is failed, it possibly doesn't
initialize dsmark_qdisc_data->q. But dsmark_reset(Qdisc_ops->reset())
uses dsmark_qdisc_data->q pointer wihtout any null checking.
So, panic would occur.
Test commands:
sysctl net.core.default_qdisc=dsmark -w
ip link add dummy0 type dummy
ip link add vw0 link dummy0 type virt_wifi
ip link set vw0 up
Splat looks like:
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 3 PID: 684 Comm: ip Not tainted 5.12.0+ #910
RIP: 0010:qdisc_reset+0x2b/0x680
Code: 1f 44 00 00 48 b8 00 00 00 00 00 fc ff df 41 57 41 56 41 55 41 54
55 48 89 fd 48 83 c7 18 53 48 89 fa 48 c1 ea 03 48 83 ec 20 <80> 3c 02
00 0f 85 09 06 00 00 4c 8b 65 18 0f 1f 44 00 00 65 8b 1d
RSP: 0018:ffff88800fda6bf8 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffff8880050ed800 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff99e34100 RDI: 0000000000000018
RBP: 0000000000000000 R08: fffffbfff346b553 R09: fffffbfff346b553
R10: 0000000000000001 R11: fffffbfff346b552 R12: ffffffffc0824940
R13: ffff888109e83800 R14: 00000000ffffffff R15: ffffffffc08249e0
FS: 00007f5042287680(0000) GS:ffff888119800000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ae1f4dbd90 CR3: 0000000006760002 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? rcu_read_lock_bh_held+0xa0/0xa0
dsmark_reset+0x3d/0xf0 [sch_dsmark]
qdisc_reset+0xa9/0x680
qdisc_destroy+0x84/0x370
qdisc_create_dflt+0x1fe/0x380
attach_one_default_qdisc.constprop.41+0xa4/0x180
dev_activate+0x4d5/0x8c0
? __dev_open+0x268/0x390
__dev_open+0x270/0x390
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the following script:
# tc qdisc add dev eth0 handle 0x1 root fq_pie flows 2
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress matchall action skbedit priority 0x10002
# ping 192.0.2.2 -I eth0 -c2 -w1 -q
produces the following splat:
BUG: KASAN: slab-out-of-bounds in fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
Read of size 4 at addr ffff888171306924 by task ping/942
CPU: 3 PID: 942 Comm: ping Not tainted 5.12.0+ #441
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Call Trace:
dump_stack+0x92/0xc1
print_address_description.constprop.7+0x1a/0x150
kasan_report.cold.13+0x7f/0x111
fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
__dev_queue_xmit+0x1034/0x2b10
ip_finish_output2+0xc62/0x2120
__ip_finish_output+0x553/0xea0
ip_output+0x1ca/0x4d0
ip_send_skb+0x37/0xa0
raw_sendmsg+0x1c4b/0x2d00
sock_sendmsg+0xdb/0x110
__sys_sendto+0x1d7/0x2b0
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fe69735c3eb
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
RSP: 002b:00007fff06d7fb38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055e961413700 RCX: 00007fe69735c3eb
RDX: 0000000000000040 RSI: 000055e961413700 RDI: 0000000000000003
RBP: 0000000000000040 R08: 000055e961410500 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff06d81260
R13: 00007fff06d7fb40 R14: 00007fff06d7fc30 R15: 000055e96140f0a0
Allocated by task 917:
kasan_save_stack+0x19/0x40
__kasan_kmalloc+0x7f/0xa0
__kmalloc_node+0x139/0x280
fq_pie_init+0x555/0x8e8 [sch_fq_pie]
qdisc_create+0x407/0x11b0
tc_modify_qdisc+0x3c2/0x17e0
rtnetlink_rcv_msg+0x346/0x8e0
netlink_rcv_skb+0x120/0x380
netlink_unicast+0x439/0x630
netlink_sendmsg+0x719/0xbf0
sock_sendmsg+0xe2/0x110
____sys_sendmsg+0x5ba/0x890
___sys_sendmsg+0xe9/0x160
__sys_sendmsg+0xd3/0x170
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff888171306800
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 36 bytes to the right of
256-byte region [ffff888171306800, ffff888171306900)
The buggy address belongs to the page:
page:00000000bcfb624e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x171306
head:00000000bcfb624e order:1 compound_mapcount:0
flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100042b40
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888171306800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888171306880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888171306900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff888171306980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888171306a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fix fq_pie traffic path to avoid selecting 'q->flows + q->flows_cnt' as a
valid flow: it's an address beyond the allocated memory.
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
CC: stable@vger.kernel.org
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
the patch that fixed an endless loop in_fq_pie_init() was not considering
that 65535 is a valid class id. The correct bugfix for this infinite loop
is to change 'idx' to become an u32, like Colin proposed in the past [1].
Fix this as follows:
- restore 65536 as maximum possible values of 'flows_cnt'
- use u32 'idx' when iterating on 'q->flows'
- fix the TDC selftest
This reverts commit bb2f930d6dd708469a587dc9ed1efe1ef969c0bf.
[1] https://lore.kernel.org/netdev/20210407163808.499027-1-colin.king@canonical.com/
CC: Colin Ian King <colin.king@canonical.com>
CC: stable@vger.kernel.org
Fixes: bb2f930d6dd7 ("net/sched: fix infinite loop in sch_fq_pie")
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
modern userspace applications, like OVN, can configure the TC datapath to
"recirculate" packets several times. If more than 4 "recirculation" rules
are configured, packets can be dropped by __tcf_classify().
Changing the maximum number of reclassifications (from 4 to 16) should be
sufficient to prevent drops in most use cases, and guard against loops at
the same time.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The netdev qeueue might be stopped when byte queue limit has
reached or tx hw ring is full, net_tx_action() may still be
rescheduled if STATE_MISSED is set, which consumes unnecessary
cpu without dequeuing and transmiting any skb because the
netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if netdev queue is
stopped during qdisc_run(), the net_tx_action() is rescheduled
again when netdev qeueue is restarted, see netif_tx_wake_queue().
As there is time window between netif_xmit_frozen_or_stopped()
checking and STATE_MISSED clearing, between which STATE_MISSED
may set by net_tx_action() scheduled by netif_tx_wake_queue(),
so set the STATE_MISSED again if netdev queue is restarted.
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
qdisc before calling __qdisc_run(), which ultimately clear the
STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED
is set before clearing STATE_MISSED, there may be rescheduling
of net_tx_action() at the end of qdisc_run_end(), see below:
CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate)
. . .
. set STATE_MISSED .
. __netif_schedule() .
. . set STATE_DEACTIVATED
. . qdisc_reset()
. . .
.<--------------- . synchronize_net()
clear __QDISC_STATE_SCHED | . .
. | . .
. | . some_qdisc_is_busy()
. | . return *false*
. | . .
test STATE_DEACTIVATED | . .
__qdisc_run() *not* called | . .
. | . .
test STATE_MISS | . .
__netif_schedule()--------| . .
. . .
. . .
__qdisc_run() is not called by net_tx_atcion() in CPU0 because
CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and
STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule
is called at the end of qdisc_run_end(), causing tx action
rescheduling problem.
qdisc_run() called by net_tx_action() runs in the softirq context,
which should has the same semantic as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a
synchronize_net() between STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely
bail out for the deactived lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skb not dequeued yet.
So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().
The STATE_DEACTIVATED checking in qdisc_run() is to avoid race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed8640 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for
deactived lockless qdisc in net_tx_action() provides better
protection for the race without calling qdisc_run() at all, so
remove the STATE_DEACTIVATED checking in qdisc_run().
After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() has
avoid the endless rescheduling problem, but there may still
be a unnecessary rescheduling, so adjust the commit log.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Lockless qdisc has below concurrent problem:
cpu0 cpu1
. .
q->enqueue .
. .
qdisc_run_begin() .
. .
dequeue_skb() .
. .
sch_direct_xmit() .
. .
. q->enqueue
. qdisc_run_begin()
. return and do nothing
. .
qdisc_run_end() .
cpu1 enqueue a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() return false
for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calling dequeue_skb() and before
cpu0 calling qdisc_run_end().
Lockless qdisc has below another concurrent problem when
tx_action is involved:
cpu0(serving tx_action) cpu1 cpu2
. . .
. q->enqueue .
. qdisc_run_begin() .
. dequeue_skb() .
. . q->enqueue
. . .
. sch_direct_xmit() .
. . qdisc_run_begin()
. . return and do nothing
. . .
clear __QDISC_STATE_SCHED . .
qdisc_run_begin() . .
return and do nothing . .
. . .
. qdisc_run_end() .
This patch fixes the above data race by:
1. If the first spin_trylock() return false and STATE_MISSED is
not set, set STATE_MISSED and retry another spin_trylock() in
case other CPU may not see STATE_MISSED after it releases the
lock.
2. reschedule if STATE_MISSED is set after the lock is released
at the end of qdisc_run_end().
For tx_action case, STATE_MISSED is also set when cpu1 is at the
end if qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.
Clear STATE_MISSED before retrying a dequeuing when dequeuing
returns NULL in order to reduce the overhead of the second
spin_trylock() and __netif_schedule() calling.
Also clear the STATE_MISSED before calling __netif_schedule()
at the end of qdisc_run_end() to avoid doing another round of
dequeuing in the pfifo_fast_dequeue().
The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:
threads without+this_patch with+this_patch delta
1 2.61Mpps 2.60Mpps -0.3%
2 3.97Mpps 3.82Mpps -3.7%
4 5.62Mpps 5.59Mpps -0.5%
8 2.78Mpps 2.77Mpps -0.3%
16 2.22Mpps 2.22Mpps -0.0%
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Even though the taprio qdisc is designed for multiqueue devices, all the
queues still point to the same top-level taprio qdisc. This works and is
probably required for software taprio, but at least with offload taprio,
it has an undesirable side effect: because the whole qdisc is run when a
packet has to be sent, it allows packets in a best-effort class to be
processed in the context of a task sending higher priority traffic. If
there are packets left in the qdisc after that first run, the NET_TX
softirq is raised and gets executed immediately in the same process
context. As with any other softirq, it runs up to 10 times and for up to
2ms, during which the calling process is waiting for the sendmsg call (or
similar) to return. In my use case, that calling process is a real-time
task scheduled to send a packet every 2ms, so the long sendmsg calls are
leading to missed timeslots.
By attaching each netdev queue to its own qdisc, as it is done with
the "classic" mq qdisc, each traffic class can be processed independently
without touching the other classes. A high-priority process can then send
packets without getting stuck in the sendmsg call anymore.
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The rcu_head pointer passed to taprio_free_sched_cb is never NULL.
That means that the result of container_of() operations on it is also
never NULL, even though rcu_head is the first element of the structure
embedding it. On top of that, it is misleading to perform a NULL check
on the result of container_of() because the position of the contained
element could change, which would make the check invalid. Remove the
unnecessary NULL check.
This change was made automatically with the following Coccinelle script.
@@
type t;
identifier v;
statement s;
@@
<+...
(
t v = container_of(...);
|
v = container_of(...);
)
...
when != v
- if (\( !v \| v == NULL \) ) s
...+>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
when 'act_mirred' tries to fragment IPv4 packets that had been previously
re-assembled using 'act_ct', splats like the following can be observed on
kernels built with KASAN:
BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60
Read of size 1 at addr ffff888147009574 by task ping/947
CPU: 0 PID: 947 Comm: ping Not tainted 5.12.0-rc6+ #418
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x92/0xc1
print_address_description.constprop.7+0x1a/0x150
kasan_report.cold.13+0x7f/0x111
ip_do_fragment+0x1b03/0x1f60
sch_fragment+0x4bf/0xe40
tcf_mirred_act+0xc3d/0x11a0 [act_mirred]
tcf_action_exec+0x104/0x3e0
fl_classify+0x49a/0x5e0 [cls_flower]
tcf_classify_ingress+0x18a/0x820
__netif_receive_skb_core+0xae7/0x3340
__netif_receive_skb_one_core+0xb6/0x1b0
process_backlog+0x1ef/0x6c0
__napi_poll+0xaa/0x500
net_rx_action+0x702/0xac0
__do_softirq+0x1e4/0x97f
do_softirq+0x71/0x90
</IRQ>
__local_bh_enable_ip+0xdb/0xf0
ip_finish_output2+0x760/0x2120
ip_do_fragment+0x15a5/0x1f60
__ip_finish_output+0x4c2/0xea0
ip_output+0x1ca/0x4d0
ip_send_skb+0x37/0xa0
raw_sendmsg+0x1c4b/0x2d00
sock_sendmsg+0xdb/0x110
__sys_sendto+0x1d7/0x2b0
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f82e13853eb
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
RSP: 002b:00007ffe01fad888 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00005571aac13700 RCX: 00007f82e13853eb
RDX: 0000000000002330 RSI: 00005571aac13700 RDI: 0000000000000003
RBP: 0000000000002330 R08: 00005571aac10500 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe01faefb0
R13: 00007ffe01fad890 R14: 00007ffe01fad980 R15: 00005571aac0f0a0
The buggy address belongs to the page:
page:000000001dff2e03 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x147009
flags: 0x17ffffc0001000(reserved)
raw: 0017ffffc0001000 ffffea00051c0248 ffffea00051c0248 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888147009400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888147009480: f1 f1 f1 f1 04 f2 f2 f2 f2 f2 f2 f2 00 00 00 00
>ffff888147009500: 00 00 00 00 00 00 00 00 00 00 f2 f2 f2 f2 f2 f2
^
ffff888147009580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888147009600: 00 00 00 00 00 00 00 00 00 00 00 00 00 f2 f2 f2
for IPv4 packets, sch_fragment() uses a temporary struct dst_entry. Then,
in the following call graph:
ip_do_fragment()
ip_skb_dst_mtu()
ip_dst_mtu_maybe_forward()
ip_mtu_locked()
the pointer to struct dst_entry is used as pointer to struct rtable: this
turns the access to struct members like rt_mtu_locked into an OOB read in
the stack. Fix this changing the temporary variable used for IPv4 packets
in sch_fragment(), similarly to what is done for IPv6 few lines below.
Fixes: c129412f74e9 ("net/sched: sch_frag: add generic packet fragment support.")
Cc: <stable@vger.kernel.org> # 5.11
Reported-by: Shuang Li <shuali@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The assignment is not being used and redundant.
The check for null is redundant as nf_conntrack_put() also
checks this.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/20210428060532.3330974-1-roid@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
while testing re-assembly/re-fragmentation using act_ct, it's possible to
observe a crash like the following one:
KASAN: maybe wild-memory-access in range [0x0001000000000448-0x000100000000044f]
CPU: 50 PID: 0 Comm: swapper/50 Tainted: G S 5.12.0-rc7+ #424
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
RIP: 0010:inet_frag_rbtree_purge+0x50/0xc0
Code: 00 fc ff df 48 89 c3 31 ed 48 89 df e8 a9 7a 38 ff 4c 89 fe 48 89 df 49 89 c6 e8 5b 3a 38 ff 48 8d 7b 40 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 75 59 48 8d bb d0 00 00 00 4c 8b 6b 40 48 89 f8 48
RSP: 0018:ffff888c31449db8 EFLAGS: 00010203
RAX: 0000200000000089 RBX: 000100000000040e RCX: ffffffff989eb960
RDX: 0000000000000140 RSI: ffffffff97cfb977 RDI: 000100000000044e
RBP: 0000000000000900 R08: 0000000000000000 R09: ffffed1186289350
R10: 0000000000000003 R11: ffffed1186289350 R12: dffffc0000000000
R13: 000100000000040e R14: 0000000000000000 R15: ffff888155e02160
FS: 0000000000000000(0000) GS:ffff888c31440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005600cb70a5b8 CR3: 0000000a2c014005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
inet_frag_destroy+0xa9/0x150
call_timer_fn+0x2d/0x180
run_timer_softirq+0x4fe/0xe70
__do_softirq+0x197/0x5a0
irq_exit_rcu+0x1de/0x200
sysvec_apic_timer_interrupt+0x6b/0x80
</IRQ>
when act_ct temporarily stores an IP fragment, restoring the skb qdisc cb
results in putting random data in FRAG_CB(), and this causes those "wild"
memory accesses later, when the rbtree is purged. Never overwrite the skb
cb in case tcf_ct_handle_fragments() returns -EINPROGRESS.
Fixes: ae372cb1750f ("net/sched: act_ct: fix restore the qdisc_skb_cb after defrag")
Fixes: 7baf2429a1a9 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
There is a reproducible sequence from the userland that will trigger a WARN_ON()
condition in taprio_get_start_time, which causes kernel to panic if configured
as "panic_on_warn". Catch this condition in parse_taprio_schedule to
prevent this condition.
Reported as bug on syzkaller:
https://syzkaller.appspot.com/bug?extid=d50710fd0873a9c6b40c
Reported-by: syzbot+d50710fd0873a9c6b40c@syzkaller.appspotmail.com
Signed-off-by: Du Cheng <ducheng2@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Reproduce:
modprobe sch_teql
tc qdisc add dev teql0 root teql0
This leads to (for instance in Centos 7 VM) OOPS:
[ 532.366633] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[ 532.366733] IP: [<ffffffffc06124a8>] teql_destroy+0x18/0x100 [sch_teql]
[ 532.366825] PGD 80000001376d5067 PUD 137e37067 PMD 0
[ 532.366906] Oops: 0000 [#1] SMP
[ 532.366987] Modules linked in: sch_teql ...
[ 532.367945] CPU: 1 PID: 3026 Comm: tc Kdump: loaded Tainted: G ------------ T 3.10.0-1062.7.1.el7.x86_64 #1
[ 532.368041] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014
[ 532.368125] task: ffff8b7d37d31070 ti: ffff8b7c9fdbc000 task.ti: ffff8b7c9fdbc000
[ 532.368224] RIP: 0010:[<ffffffffc06124a8>] [<ffffffffc06124a8>] teql_destroy+0x18/0x100 [sch_teql]
[ 532.368320] RSP: 0018:ffff8b7c9fdbf8e0 EFLAGS: 00010286
[ 532.368394] RAX: ffffffffc0612490 RBX: ffff8b7cb1565e00 RCX: ffff8b7d35ba2000
[ 532.368476] RDX: ffff8b7d35ba2000 RSI: 0000000000000000 RDI: ffff8b7cb1565e00
[ 532.368557] RBP: ffff8b7c9fdbf8f8 R08: ffff8b7d3fd1f140 R09: ffff8b7d3b001600
[ 532.368638] R10: ffff8b7d3b001600 R11: ffffffff84c7d65b R12: 00000000ffffffd8
[ 532.368719] R13: 0000000000008000 R14: ffff8b7d35ba2000 R15: ffff8b7c9fdbf9a8
[ 532.368800] FS: 00007f6a4e872740(0000) GS:ffff8b7d3fd00000(0000) knlGS:0000000000000000
[ 532.368885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 532.368961] CR2: 00000000000000a8 CR3: 00000001396ee000 CR4: 00000000000206e0
[ 532.369046] Call Trace:
[ 532.369159] [<ffffffff84c8192e>] qdisc_create+0x36e/0x450
[ 532.369268] [<ffffffff846a9b49>] ? ns_capable+0x29/0x50
[ 532.369366] [<ffffffff849afde2>] ? nla_parse+0x32/0x120
[ 532.369442] [<ffffffff84c81b4c>] tc_modify_qdisc+0x13c/0x610
[ 532.371508] [<ffffffff84c693e7>] rtnetlink_rcv_msg+0xa7/0x260
[ 532.372668] [<ffffffff84907b65>] ? sock_has_perm+0x75/0x90
[ 532.373790] [<ffffffff84c69340>] ? rtnl_newlink+0x890/0x890
[ 532.374914] [<ffffffff84c8da7b>] netlink_rcv_skb+0xab/0xc0
[ 532.376055] [<ffffffff84c63708>] rtnetlink_rcv+0x28/0x30
[ 532.377204] [<ffffffff84c8d400>] netlink_unicast+0x170/0x210
[ 532.378333] [<ffffffff84c8d7a8>] netlink_sendmsg+0x308/0x420
[ 532.379465] [<ffffffff84c2f3a6>] sock_sendmsg+0xb6/0xf0
[ 532.380710] [<ffffffffc034a56e>] ? __xfs_filemap_fault+0x8e/0x1d0 [xfs]
[ 532.381868] [<ffffffffc034a75c>] ? xfs_filemap_fault+0x2c/0x30 [xfs]
[ 532.383037] [<ffffffff847ec23a>] ? __do_fault.isra.61+0x8a/0x100
[ 532.384144] [<ffffffff84c30269>] ___sys_sendmsg+0x3e9/0x400
[ 532.385268] [<ffffffff847f3fad>] ? handle_mm_fault+0x39d/0x9b0
[ 532.386387] [<ffffffff84d88678>] ? __do_page_fault+0x238/0x500
[ 532.387472] [<ffffffff84c31921>] __sys_sendmsg+0x51/0x90
[ 532.388560] [<ffffffff84c31972>] SyS_sendmsg+0x12/0x20
[ 532.389636] [<ffffffff84d8dede>] system_call_fastpath+0x25/0x2a
[ 532.390704] [<ffffffff84d8de21>] ? system_call_after_swapgs+0xae/0x146
[ 532.391753] Code: 00 00 00 00 00 00 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b b7 48 01 00 00 48 89 fb <48> 8b 8e a8 00 00 00 48 85 c9 74 43 48 89 ca eb 0f 0f 1f 80 00
[ 532.394036] RIP [<ffffffffc06124a8>] teql_destroy+0x18/0x100 [sch_teql]
[ 532.395127] RSP <ffff8b7c9fdbf8e0>
[ 532.396179] CR2: 00000000000000a8
Null pointer dereference happens on master->slaves dereference in
teql_destroy() as master is null-pointer.
When qdisc_create() calls teql_qdisc_init() it imediately fails after
check "if (m->dev == dev)" because both devices are teql0, and it does
not set qdisc_priv(sch)->m leaving it zero on error path, then
qdisc_create() imediately calls teql_destroy() which does not expect
zero master pointer and we get OOPS.
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With recent changes that separated action module load from action
initialization tcf_action_init() function error handling code was modified
to manually release the loaded modules if loading/initialization of any
further action in same batch failed. For the case when all modules
successfully loaded and some of the actions were initialized before one of
them failed in init handler. In this case for all previous actions the
module will be released twice by the error handler: First time by the loop
that manually calls module_put() for all ops, and second time by the action
destroy code that puts the module after destroying the action.
Reproduction:
$ sudo tc actions add action simple sdata \"2\" index 2
$ sudo tc actions add action simple sdata \"1\" index 1 \
action simple sdata \"2\" index 2
RTNETLINK answers: File exists
We have an error talking to the kernel
$ sudo tc actions ls action simple
total acts 1
action order 0: Simple <"2">
index 2 ref 1 bind 0
$ sudo tc actions flush action simple
$ sudo tc actions ls action simple
$ sudo tc actions add action simple sdata \"2\" index 2
Error: Failed to load TC action module.
We have an error talking to the kernel
$ lsmod | grep simple
act_simple 20480 -1
Fix the issue by modifying module reference counting handling in action
initialization code:
- Get module reference in tcf_idr_create() and put it in tcf_idr_release()
instead of taking over the reference held by the caller.
- Modify users of tcf_action_init_1() to always release the module
reference which they obtain before calling init function instead of
assuming that created action takes over the reference.
- Finally, modify tcf_action_init_1() to not release the module reference
when overwriting existing action as this is no longer necessary since both
upper and lower layers obtain and manage their own module references
independently.
Fixes: d349f9976868 ("net_sched: fix RTNL deadlock again caused by request_module()")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Action init code increments reference counter when it changes an action.
This is the desired behavior for cls API which needs to obtain action
reference for every classifier that points to action. However, act API just
needs to change the action and releases the reference before returning.
This sequence breaks when the requested action doesn't exist, which causes
act API init code to create new action with specified index, but action is
still released before returning and is deleted (unless it was referenced
concurrently by cls API).
Reproduction:
$ sudo tc actions ls action gact
$ sudo tc actions change action gact drop index 1
$ sudo tc actions ls action gact
Extend tcf_action_init() to accept 'init_res' array and initialize it with
action->ops->init() result. In tcf_action_add() remove pointers to created
actions from actions array before passing it to tcf_action_put_many().
Fixes: cae422f379f3 ("net: sched: use reference counting action init")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This reverts commit 6855e8213e06efcaf7c02a15e12b1ae64b9a7149.
Following commit in series fixes the issue without introducing regression
in error rollback of tcf_action_destroy().
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The 'unlocked_driver_cb' struct field in 'bo' is not being initialized
in tcf_block_offload_init(). The uninitialized 'unlocked_driver_cb'
will be used when calling unlocked_driver_cb(). So initialize 'bo' to
zero to avoid the issue.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sch_htb: fix null pointer dereference on a null new_q
Currently if new_q is null, the null new_q pointer will be
dereference when 'q->offload' is true. Fix this by adding
a braces around htb_parent_to_leaf_offload() to avoid it.
Addresses-Coverity: ("Dereference after null check")
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, action creation using ACT API in replace mode is buggy.
When invoking for non-existent action index 42,
tc action replace action bpf obj foo.o sec <xyz> index 42
kernel creates the action, fills up the netlink response, and then just
deletes the action after notifying userspace.
tc action show action bpf
doesn't list the action.
This happens due to the following sequence when ovr = 1 (replace mode)
is enabled:
tcf_idr_check_alloc is used to atomically check and either obtain
reference for existing action at index, or reserve the index slot using
a dummy entry (ERR_PTR(-EBUSY)).
This is necessary as pointers to these actions will be held after
dropping the idrinfo lock, so bumping the reference count is necessary
as we need to insert the actions, and notify userspace by dumping their
attributes. Finally, we drop the reference we took using the
tcf_action_put_many call in tcf_action_add. However, for the case where
a new action is created due to free index, its refcount remains one.
This when paired with the put_many call leads to the kernel setting up
the action, notifying userspace of its creation, and then tearing it
down. For existing actions, the refcount is still held so they remain
unaffected.
Fortunately due to rtnl_lock serialization requirement, such an action
with refcount == 1 will not be concurrently deleted by anything else, at
best CLS API can move its refcount up and down by binding to it after it
has been published from tcf_idr_insert_many. Since refcount is atleast
one until put_many call, CLS API cannot delete it. Also __tcf_action_put
release path already ensures deterministic outcome (either new action
will be created or existing action will be reused in case CLS API tries
to bind to action concurrently) due to idr lock serialization.
We fix this by making refcount of newly created actions as 2 in ACT API
replace mode. A relaxed store will suffice as visibility is ensured only
after the tcf_idr_insert_many call.
Note that in case of creation or overwriting using CLS API only (i.e.
bind = 1), overwriting existing action object is not allowed, and any
such request is silently ignored (without error).
The refcount bump that occurs in tcf_idr_check_alloc call there for
existing action will pair with tcf_exts_destroy call made from the
owner module for the same action. In case of action creation, there
is no existing action, so no tcf_exts_destroy callback happens.
This means no code changes for CLS API.
Fixes: cae422f379f3 ("net: sched: use reference counting action init")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
s/procdure/procedure/
s/maintanance/maintenance/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Invalid detection works with two distinct moments: act_ct tries to find
a conntrack entry and set post_ct true, indicating that that was
attempted. Then, when flow dissector tries to dissect CT info and no
entry is there, it knows that it was tried and no entry was found, and
synthesizes/sets
key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
TCA_FLOWER_KEY_CT_FLAGS_INVALID;
mimicing what OVS does.
OVS has this a bit more streamlined, as it recomputes the key after
trying to find a conntrack entry for it.
Issue here is, when we have 'tc action ct clear', it didn't clear
post_ct, causing a subsequent match on 'ct_state -trk' to fail, due to
the above. The fix, thus, is to clear it.
Reproducer rules:
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 0 \
protocol ip flower ip_proto tcp ct_state -trk \
action ct zone 1 pipe \
action goto chain 2
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 2 \
protocol ip flower \
action ct clear pipe \
action goto chain 4
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 4 \
protocol ip flower ct_state -trk \
action mirred egress redirect dev enp130s0f1np1_0
With the fix, the 3rd rule matches, like it does with OVS kernel
datapath.
Fixes: 7baf2429a1a9 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The existing code is functionally correct: iproute2 parses the ip_flags
argument for tc-flower and really packs it as big endian into the
TCA_FLOWER_KEY_FLAGS netlink attribute. But there is a problem in the
fact that W=1 builds complain:
net/sched/cls_flower.c:1047:15: warning: cast to restricted __be32
This is because we should use the dedicated helper for obtaining a
__be32 pointer to the netlink attribute, not a u32 one. This ensures
type correctness for be32_to_cpu.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|