Prior to this patch, the amount of counters guaranteed per VF in the
resource tracker was MLX4_VF_COUNTERS_PER_PORT * MLX4_MAX_PORTS. It was
set regardless if the VF was single or dual port.
This caused several VFs to have no guaranteed counters although the
system could satisfy their request.
The fix is to dynamically guarantee counters, based on each VF
specification.
Fixes: 9de92c60be ("net/mlx4_core: Adjust counter grant policy in the resource tracker")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean says:
====================
VLAN fixes for Ocelot switch
This series addresses 2 issues with vlan_filtering=1:
- Untagged traffic gets dropped unless commands are run in a very
specific order.
- Untagged traffic starts being transmitted as tagged after adding
another untagged VID on the port.
Tested on NXP LS1028A-RDB board.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch driver keeps a "vid" variable per port, which signifies _the_
VLAN ID that is stripped on that port's egress (aka the native VLAN on a
trunk port).
That is the way the hardware is designed (mostly). The port->vid is
programmed into REW:PORT:PORT_VLAN_CFG:PORT_VID and the rewriter is told
to send all traffic as tagged except the one having port->vid.
There exists a possibility of finer-grained egress untagging decisions:
using the VCAP IS1 engine, one rule can be added to match every
VLAN-tagged frame whose VLAN should be untagged, and set POP_CNT=1 as
action. However, the IS1 can hold at most 512 entries, and the VLANs are
in the order of 6 * 4096.
So the code is fine for now. But this sequence of commands:
$ bridge vlan add dev swp0 vid 1 pvid untagged
$ bridge vlan add dev swp0 vid 2 untagged
makes untagged and pvid-tagged traffic be sent out of swp0 as tagged
with VID 1, despite user's request.
Prevent that from happening. The user should temporarily remove the
existing untagged VLAN (1 in this case), add it back as tagged, and then
add the new untagged VLAN (2 in this case).
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Fixes: 7142529f16 ("net: mscc: ocelot: add VLAN filtering")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Background information: the driver operates the hardware in a mode where
a single VLAN can be transmitted as untagged on a particular egress
port. That is the "native VLAN on trunk port" use case. Its value is
held in port->vid.
Consider the following command sequence (no network manager, all
interfaces are down, debugging prints added by me):
$ ip link add dev br0 type bridge vlan_filtering 1
$ ip link set dev swp0 master br0
Kernel code path during last command:
br_add_slave -> ocelot_netdevice_port_event (NETDEV_CHANGEUPPER):
[ 21.401901] ocelot_vlan_port_apply: port 0 vlan aware 0 pvid 0 vid 0
br_add_slave -> nbp_vlan_init -> switchdev_port_attr_set -> ocelot_port_attr_set (SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING):
[ 21.413335] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 0 vid 0
br_add_slave -> nbp_vlan_init -> nbp_vlan_add -> br_switchdev_port_vlan_add -> switchdev_port_obj_add -> ocelot_port_obj_add -> ocelot_vlan_vid_add
[ 21.667421] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 1 vid 1
So far so good. The bridge has replaced the driver's default pvid used
in standalone mode (0) with its own default_pvid (1). The port's vid
(native VLAN) has also changed from 0 to 1.
$ ip link set dev swp0 up
[ 31.722956] 8021q: adding VLAN 0 to HW filter on device swp0
do_setlink -> dev_change_flags -> vlan_vid_add -> ocelot_vlan_rx_add_vid -> ocelot_vlan_vid_add:
[ 31.728700] ocelot_vlan_port_apply: port 0 vlan aware 1 pvid 1 vid 0
The 8021q module uses the .ndo_vlan_rx_add_vid API on .ndo_open to make
ports be able to transmit and receive 802.1p-tagged traffic by default.
This API is supposed to offload a VLAN sub-interface, which for a switch
port means to add a VLAN that is not a pvid, and tagged on egress.
But the driver implementation of .ndo_vlan_rx_add_vid is wrong: it adds
back vid 0 as "egress untagged". Now back to the initial paragraph:
there is a single untagged VID that the driver keeps track of, and that
has just changed from 1 (the pvid) to 0. So this breaks the bridge
core's expectation, because it has changed vid 1 from untagged to
tagged, when what the user sees is.
$ bridge vlan
port vlan ids
swp0 1 PVID Egress Untagged
br0 1 PVID Egress Untagged
But curiously, instead of manifesting itself as "untagged and
pvid-tagged traffic gets sent as tagged on egress", the bug:
- is hidden when vlan_filtering=0
- manifests as dropped traffic when vlan_filtering=1, due to this setting:
if (port->vlan_aware && !port->vid)
/* If port is vlan-aware and tagged, drop untagged and priority
* tagged frames.
*/
val |= ANA_PORT_DROP_CFG_DROP_UNTAGGED_ENA |
ANA_PORT_DROP_CFG_DROP_PRIO_S_TAGGED_ENA |
ANA_PORT_DROP_CFG_DROP_PRIO_C_TAGGED_ENA;
which would have made sense if it weren't for this bug. The setting's
intention was "this is a trunk port with no native VLAN, so don't accept
untagged traffic". So the driver was never expecting to set VLAN 0 as
the value of the native VLAN, 0 was just encoding for "invalid".
So the fix is to not send 802.1p traffic as untagged, because that would
change the port's native vlan to 0, unbeknownst to the bridge, and
trigger unexpected code paths in the driver.
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Fixes: 7142529f16 ("net: mscc: ocelot: add VLAN filtering")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the implementation of i2400m_op_rfkill_sw_toggle() the allocated
buffer for cmd should be released before returning. The
documentation for i2400m_msg_to_dev() says when it returns the buffer
can be reused. Meaning cmd should be released in either case. Move
kfree(cmd) before return to be reached by all execution paths.
Fixes: 2507e6ab7a ("wimax: i2400: fix memory leak")
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rmmod hip04_eth.ko, we can get the following warning:
Task track: rmmod(1623)>bash(1591)>login(1581)>init(1)
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1623 at kernel/irq/manage.c:1557 __free_irq+0xa4/0x2ac()
Trying to free already-free IRQ 200
Modules linked in: ping(O) pramdisk(O) cpuinfo(O) rtos_snapshot(O) interrupt_ctrl(O) mtdblock mtd_blkdevrtfs nfs_acl nfs lockd grace sunrpc xt_tcpudp ipt_REJECT iptable_filter ip_tables x_tables nf_reject_ipv
CPU: 0 PID: 1623 Comm: rmmod Tainted: G O 4.4.193 #1
Hardware name: Hisilicon A15
[<c020b408>] (rtos_unwind_backtrace) from [<c0206624>] (show_stack+0x10/0x14)
[<c0206624>] (show_stack) from [<c03f2be4>] (dump_stack+0xa0/0xd8)
[<c03f2be4>] (dump_stack) from [<c021a780>] (warn_slowpath_common+0x84/0xb0)
[<c021a780>] (warn_slowpath_common) from [<c021a7e8>] (warn_slowpath_fmt+0x3c/0x68)
[<c021a7e8>] (warn_slowpath_fmt) from [<c026876c>] (__free_irq+0xa4/0x2ac)
[<c026876c>] (__free_irq) from [<c0268a14>] (free_irq+0x60/0x7c)
[<c0268a14>] (free_irq) from [<c0469e80>] (release_nodes+0x1c4/0x1ec)
[<c0469e80>] (release_nodes) from [<c0466924>] (__device_release_driver+0xa8/0x104)
[<c0466924>] (__device_release_driver) from [<c0466a80>] (driver_detach+0xd0/0xf8)
[<c0466a80>] (driver_detach) from [<c0465e18>] (bus_remove_driver+0x64/0x8c)
[<c0465e18>] (bus_remove_driver) from [<c02935b0>] (SyS_delete_module+0x198/0x1e0)
[<c02935b0>] (SyS_delete_module) from [<c0202ed0>] (__sys_trace_return+0x0/0x10)
---[ end trace bb25d6123d849b44 ]---
Currently "rmmod hip04_eth.ko" call free_irq more than once
as devres_release_all and hip04_remove both call free_irq.
This results in a 'Trying to free already-free IRQ' warning.
To solve the problem free_irq has been moved out of hip04_remove.
Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the highly unlikely event that we fail to allocate either of the
"/txrx" or "/control" workqueues, we should bail cleanly rather than
blindly march on with NULL queue pointer(s) installed in the
'fjes_adapter' instance.
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Nicolas Waisman <nico@semmle.com>
Link: https://lore.kernel.org/lkml/CADJ_3a8WFrs5NouXNqS5WYe7rebFP+_A5CheeqAyD_p7DFJJcg@mail.gmail.com/
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Fix free/alloc race for OGM and OGMv2, by Sven Eckelmann (2 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl2yvoAWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoVosEAC7Ei09QJKa2Wh6by4c8OQKvloK
ecei39e0FjMUx4/XETSR/kdHrRFwzMPvIhqe+DUCski7p5x9sHWrkJwIWWirGrDf
mAsF8mhA1E0pfsQmQRqu4h/UbDyLLhsbfEf95B75Uf6Nq2cIG0Ko7rb2ao3DR7c2
YwvSjzT+zwlS8Q6Hj1YsJKSS3BEk5NQ9r9soD9WhwWmYynn9TxTWy8o9z3lBCXtV
j+mOm0qNebxsDKhr7jbfujjIaepmTOQJHwqPzm1Vo/v6CMEdJzd6j8vbsxCUyXvI
/Cg3V+8INEGmRmCDT96paO5kYkHptr8i467Raj3DAdwlh3BuoMo382i3D3GOXgEK
VtieWhuOplAJrbc+GFifvFSAA75+B8rZ/IV8CDFIbWHhEjZ/fdtyl1panqQkKSqw
n5m/nAyKmfTZ9CH0kZ/zKU8EKPaMNh0BUTbjhU/8Zn0lDI+loBs024K5a00TWtD8
GnjCGKi87HhtCn/3hxasbaVEyJIrMvPlzyYRylzFfSml55GCBuA8i8l/7ql0QKCe
xq78J08x/n3RjAj+aguLQ1v4Vz/EgOvzj4RdfHS2CGSI9m37z7wz5oFTKcqi4tX+
4Phpl7onRnMDx6AiZHMGxwrti/AdFO3YZmOiIh/IEBjUBhmABuj6Eq40thl/8Hkv
DfoOTKf98HhmaW6d/g==
=usWf
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20191025' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
* Fix free/alloc race for OGM and OGMv2, by Sven Eckelmann (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
An earlier bugfix introduced a dependency on CONFIG_NET_SCH_TAPRIO,
but this missed the case of NET_SCH_TAPRIO=m and NET_DSA_SJA1105=y,
which still causes a link error:
drivers/net/dsa/sja1105/sja1105_tas.o: In function `sja1105_setup_tc_taprio':
sja1105_tas.c:(.text+0x5c): undefined reference to `taprio_offload_free'
sja1105_tas.c:(.text+0x3b4): undefined reference to `taprio_offload_get'
drivers/net/dsa/sja1105/sja1105_tas.o: In function `sja1105_tas_teardown':
sja1105_tas.c:(.text+0x6ec): undefined reference to `taprio_offload_free'
Change the dependency to only allow selecting the TAS code when it
can link against the taprio code.
Fixes: a8d570de0c ("net: dsa: sja1105: Add dependency for NET_DSA_SJA1105_TAS")
Fixes: 317ab5b86c ("net: dsa: sja1105: Configure the Time-Aware Scheduler via tc-taprio offload")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are calling the checksum helper after the dma_map_single()
call to map the packet. This is incorrect as the checksumming
code will touch the packet from the CPU. This means the cache
won't be properly flushes (or the bounce buffering will leave
us with the unmodified packet to DMA).
This moves the calculation of the checksum & vlan tags to
before the DMA mapping.
This also has the side effect of fixing another bug: If the
checksum helper fails, we goto "drop" to drop the packet, which
will not unmap the DMA mapping.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fixes: 05690d633f ("ftgmac100: Upgrade to NETIF_F_HW_CSUM")
Reviewed-by: Vijay Khemka <vijaykhemka@fb.com>
Tested-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user. The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.
Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.
[2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
...
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
sock_xmit.isra.24+0xa1/0x170 [nbd]
nbd_send_cmd+0x1d2/0x690 [nbd]
nbd_queue_rq+0x1b5/0x3b0 [nbd]
__blk_mq_try_issue_directly+0x108/0x1b0
blk_mq_request_issue_directly+0xbd/0xe0
blk_mq_try_issue_list_directly+0x41/0xb0
blk_mq_sched_insert_requests+0xa2/0xe0
blk_mq_flush_plug_list+0x205/0x2a0
blk_flush_plug_list+0xc3/0xf0
[1] blk_finish_plug+0x21/0x2e
_xfs_buf_ioapply+0x313/0x460
__xfs_buf_submit+0x67/0x220
xfs_buf_read_map+0x113/0x1a0
xfs_trans_read_buf_map+0xbf/0x330
xfs_btree_read_buf_block.constprop.42+0x95/0xd0
xfs_btree_lookup_get_block+0x95/0x170
xfs_btree_lookup+0xcc/0x470
xfs_bmap_del_extent_real+0x254/0x9a0
__xfs_bunmapi+0x45c/0xab0
xfs_bunmapi+0x15/0x30
xfs_itruncate_extents_flags+0xca/0x250
xfs_free_eofblocks+0x181/0x1e0
xfs_fs_destroy_inode+0xa8/0x1b0
destroy_inode+0x38/0x70
dispose_list+0x35/0x50
prune_icache_sb+0x52/0x70
super_cache_scan+0x120/0x1a0
do_shrink_slab+0x120/0x290
shrink_slab+0x216/0x2b0
shrink_node+0x1b6/0x4a0
do_try_to_free_pages+0xc6/0x370
try_to_free_mem_cgroup_pages+0xe3/0x1e0
try_charge+0x29e/0x790
mem_cgroup_charge_skmem+0x6a/0x100
__sk_mem_raise_allocated+0x18e/0x390
__sk_mem_schedule+0x2a/0x40
[0] tcp_sendmsg_locked+0x8eb/0xe10
tcp_sendmsg+0x27/0x40
sock_sendmsg+0x30/0x40
___sys_sendmsg+0x26d/0x2b0
__sys_sendmsg+0x57/0xa0
do_syscall_64+0x42/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
In [0], tcp_send_msg_locked() was using current->page_frag when it
called sk_wmem_schedule(). It already calculated how many bytes can
be fit into current->page_frag. Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path. Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().
nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress. However, this
confused sk_page_frag() called from [2]. Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].
After [2] used current->page_frag, the offset would be increased by
the used amount. When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.
Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.
v2: Eric didn't like gfp flags being tested twice. Introduce a new
helper gfpflags_normal_context() and combine the two tests.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
KCSAN reported a data-race in udp_set_dev_scratch() [1]
The issue here is that we must not write over skb fields
if skb is shared. A similar issue has been fixed in commit
89c22d8c3b ("net: Fix skb csum races when peeking")
While we are at it, use a helper only dealing with
udp_skb_scratch(skb)->csum_unnecessary, as this allows
udp_set_dev_scratch() to be called once and thus inlined.
[1]
BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg
write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1:
udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308
__first_packet_length+0x147/0x420 net/ipv4/udp.c:1556
first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579
udp_poll+0xea/0x110 net/ipv4/udp.c:2720
sock_poll+0xed/0x250 net/socket.c:1256
vfs_poll include/linux/poll.h:90 [inline]
do_select+0x7d0/0x1020 fs/select.c:534
core_sys_select+0x381/0x550 fs/select.c:677
do_pselect.constprop.0+0x11d/0x160 fs/select.c:759
__do_sys_pselect6 fs/select.c:784 [inline]
__se_sys_pselect6 fs/select.c:769 [inline]
__x64_sys_pselect6+0x12e/0x170 fs/select.c:769
do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0:
udp_skb_csum_unnecessary include/net/udp.h:358 [inline]
udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310
inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592
sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
__sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
__do_sys_recvmmsg net/socket.c:2703 [inline]
__se_sys_recvmmsg net/socket.c:2696 [inline]
__x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 2276f58ac5 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects the SPDX License Identifier style in
header files related to DPAA2 Ethernet driver supporting
Freescale SoCs with DPAA2. For C header files
Documentation/process/license-rules.rst mandates C-like comments
(opposed to C source files where C++ style should be used)
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
net: avoid KCSAN splats
Often times we use skb_queue_empty() without holding a lock,
meaning that other cpus (or interrupt) can change the queue
under us. This is fine, but we need to properly annotate
the lockless intent to make sure the compiler wont over
optimize things.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Busy polling usually runs without locks.
Let's use skb_queue_empty_lockless() instead of skb_queue_empty()
Also uses READ_ONCE() in __skb_try_recv_datagram() to address
a similar potential problem.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many poll() handlers are lockless. Using skb_queue_empty_lockless()
instead of skb_queue_empty() is more appropriate.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some paths call skb_queue_empty() without holding
the queue lock. We must use a barrier in order
to not let the compiler do strange things, and avoid
KCSAN splats.
Adding a barrier in skb_queue_empty() might be overkill,
I prefer adding a new helper to clearly identify
points where the callers might be lockless. This might
help us finding real bugs.
The corresponding WRITE_ONCE() should add zero cost
for current compilers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for net:
1) Fix crash on flowtable due to race between garbage collection
and insertion.
2) Restore callback unbinding in netfilter offloads.
3) Fix races on IPVS module removal, from Davide Caratti.
4) Make old_secure_tcp per-netns to fix sysbot report,
from Eric Dumazet.
5) Validate matching length in netfilter offloads, from wenxu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-10-27
The following pull-request contains BPF updates for your *net* tree.
We've added 7 non-merge commits during the last 11 day(s) which contain
a total of 7 files changed, 66 insertions(+), 16 deletions(-).
The main changes are:
1) Fix two use-after-free bugs in relation to RCU in jited symbol exposure to
kallsyms, from Daniel Borkmann.
2) Fix NULL pointer dereference in AF_XDP rx-only sockets, from Magnus Karlsson.
3) Fix hang in netdev unregister for hash based devmap as well as another overflow
bug on 32 bit archs in memlock cost calculation, from Toke Høiland-Jørgensen.
4) Fix wrong memory access in LWT BPF programs on reroute due to invalid dst.
Also fix BPF selftests to use more compatible nc options, from Jiri Benc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
ipv4: fix route update on metric change.
This fixes connected route update on some edge cases for ip addr metric
change.
It additionally includes self tests for the covered scenarios. The new tests
fail on unpatched kernels and pass on the patched one.
v1 -> v2:
- add selftests
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two more tests to ipv4_addr_metric_test() to
explicitly cover the scenarios fixed by the previous patch.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit af4d768ad2 ("net/ipv4: Add support for specifying metric
of connected routes"), when updating an IP address with a different metric,
the associated connected route is updated, too.
Still, the mentioned commit doesn't handle properly some corner cases:
$ ip addr add dev eth0 192.168.1.0/24
$ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2
$ ip addr add dev eth0 192.168.3.1/24
$ ip addr change dev eth0 192.168.1.0/24 metric 10
$ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10
$ ip addr change dev eth0 192.168.3.1/24 metric 10
$ ip -4 route
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0
192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1
192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10
Only the last route is correctly updated.
The problem is the current test in fib_modify_prefix_metric():
if (!(dev->flags & IFF_UP) ||
ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) ||
ipv4_is_zeronet(prefix) ||
prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32)
Which should be the logical 'not' of the pre-existing test in
fib_add_ifaddr():
if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) &&
(prefix != addr || ifa->ifa_prefixlen < 32))
To properly negate the original expression, we need to change the last
logical 'or' to a logical 'and'.
Fixes: af4d768ad2 ("net/ipv4: Add support for specifying metric of connected routes")
Reported-and-suggested-by: Beniamino Galvani <bgalvani@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
memset() the structure ethtool_wolinfo that has padded bytes
but the padded bytes have not been zeroed out.
Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman says:
====================
IPVS fixes for v5.4
* Eric Dumazet resolves a race condition in switching the defense level
* Davide Caratti resolves a race condition in module removal
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For adapters which support the SGE Doorbell Queue Timer facility,
we configured the Ethernet TX Queues to send CIDX Updates to the
Associated Ethernet RX Response Queue with CPL_SGE_EGR_UPDATE
messages to allow us to respond more quickly to the CIDX Updates.
But, this was adding load to PCIe Link RX bandwidth and,
potentially, resulting in higher CPU Interrupt load.
This patch requests the HW to deliver the CIDX updates to the TX
queue status page rather than generating an ingress queue message
(as an interrupt). With this patch, the load on RX bandwidth is
reduced and a substantial improvement in BW is noticed at lower
IO sizes.
Fixes: d429005fdf ("cxgb4/cxgb4vf: Add support for SGE doorbell queue timer")
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
but there are a few paths calling rtnl_net_notifyid() from atomic
context or from RCU critical sections. The later also precludes the use
of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
call is wrong too, as it uses GFP_KERNEL unconditionally.
Therefore, we need to pass the GFP flags as parameter and propagate it
through function calls until the proper flags can be determined.
In most cases, GFP_KERNEL is fine. The exceptions are:
* openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
indirectly call rtnl_net_notifyid() from RCU critical section,
* rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
parameter.
Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
by nlmsg_new(). The function is allowed to sleep, so better make the
flags consistent with the ones used in the following
ovs_vport_cmd_fill_info() call.
Found by code inspection.
Fixes: 9a9634545c ("netns: notify netns id events")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch corrects the SPDX License Identifier style in
header file related to ethernet driver for Cortina Gemini
devices. For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used)
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul says:
====================
net/smc: fixes for -net
Fixes for the net tree, covering a memleak when closing
SMC fallback sockets and fix SMC-R connection establishment
when vlan-ids are used.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Creating of an SMC-R connection with vlan-id fails, because
smc_listen_work() determines the vlan_id of the connection,
saves it in struct smc_init_info ini, but clears the ini area
again if SMC-D is not applicable.
This patch just resets the ISM device before investigating
SMC-R availability.
Fixes: bc36d2fc93 ("net/smc: consolidate function parameters")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For SMC sockets forced to fallback to TCP, the file is propagated
from the outer SMC to the internal TCP socket. When closing the SMC
socket, the internal TCP socket file pointer must be restored to the
original NULL value, otherwise memory leaks may show up (found with
CONFIG_DEBUG_KMEMLEAK).
The internal TCP socket is released in smc_clcsock_release(), which
calls __sock_release() function in net/socket.c. This calls the
needed iput(SOCK_INODE(sock)) only, if the file pointer has been reset
to the original NULL-value.
Fixes: 07603b2308 ("net/smc: propagate file from SMC to TCP socket")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If CONFIG_NET_HWBM is not set, then these stub functions in
<net/hwbm.h> should be declared static to avoid trying to
export them from any driver that includes this.
Fixes the following sparse warnings:
./include/net/hwbm.h:24:6: warning: symbol 'hwbm_buf_free' was not declared. Should it be static?
./include/net/hwbm.h:25:5: warning: symbol 'hwbm_pool_refill' was not declared. Should it be static?
./include/net/hwbm.h:26:5: warning: symbol 'hwbm_pool_add' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the CONFIG_MVNET_BA is not set, then make the stub functions
static inline to avoid trying to export them, and remove hte
following sparse warnings:
drivers/net/ethernet/marvell/mvneta_bm.h:163:6: warning: symbol 'mvneta_bm_pool_destroy' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:165:6: warning: symbol 'mvneta_bm_bufs_free' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:167:5: warning: symbol 'mvneta_bm_construct' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:168:5: warning: symbol 'mvneta_bm_pool_refill' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:170:23: warning: symbol 'mvneta_bm_pool_use' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:181:18: warning: symbol 'mvneta_bm_get' was not declared. Should it be static?
drivers/net/ethernet/marvell/mvneta_bm.h:182:6: warning: symbol 'mvneta_bm_put' was not declared. Should it be static?
Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is networking hardware that isn't based on Ethernet for layers 1 and 2.
For example CAN.
CAN is a multi-master serial bus standard for connecting Electronic Control
Units [ECUs] also known as nodes. A frame on the CAN bus carries up to 8 bytes
of payload. Frame corruption is detected by a CRC. However frame loss due to
corruption is possible, but a quite unusual phenomenon.
While fq_codel works great for TCP/IP, it doesn't for CAN. There are a lot of
legacy protocols on top of CAN, which are not build with flow control or high
CAN frame drop rates in mind.
When using fq_codel, as soon as the queue reaches a certain delay based length,
skbs from the head of the queue are silently dropped. Silently meaning that the
user space using a send() or similar syscall doesn't get an error. However
TCP's flow control algorithm will detect dropped packages and adjust the
bandwidth accordingly.
When using fq_codel and sending raw frames over CAN, which is the common use
case, the user space thinks the package has been sent without problems, because
send() returned without an error. pfifo_fast will drop skbs, if the queue
length exceeds the maximum. But with this scheduler the skbs at the tail are
dropped, an error (-ENOBUFS) is propagated to user space. So that the user
space can slow down the package generation.
On distributions, where fq_codel is made default via CONFIG_DEFAULT_NET_SCH
during compile time, or set default during runtime with sysctl
net.core.default_qdisc (see [1]), we get a bad user experience. In my test case
with pfifo_fast, I can transfer thousands of million CAN frames without a frame
drop. On the other hand with fq_codel there is more then one lost CAN frame per
thousand frames.
As pointed out fq_codel is not suited for CAN hardware, so this patch changes
attach_one_default_qdisc() to use pfifo_fast for "ARPHRD_CAN" network devices.
During transition of a netdev from down to up state the default queuing
discipline is attached by attach_default_qdiscs() with the help of
attach_one_default_qdisc(). This patch modifies attach_one_default_qdisc() to
attach the pfifo_fast (pfifo_fast_ops) if the network device type is
"ARPHRD_CAN".
[1] https://github.com/systemd/systemd/issues/9194
Signed-off-by: Vincent Prince <vincent.prince.fr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo says:
====================
net: fix nested device bugs
This patchset fixes several bugs that are related to nesting
device infrastructure.
Current nesting infrastructure code doesn't limit the depth level of
devices. nested devices could be handled recursively. at that moment,
it needs huge memory and stack overflow could occur.
Below devices type have same bug.
VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, and VXLAN.
But I couldn't test all interface types so there could be more device
types, which have similar problems.
Maybe qmi_wwan.c code could have same problem.
So, I would appreciate if someone test qmi_wwan.c and other modules.
Test commands:
ip link add dummy0 type dummy
ip link add vlan1 link dummy0 type vlan id 1
for i in {2..100}
do
let A=$i-1
ip link add name vlan$i link vlan$A type vlan id $i
done
ip link del dummy0
1st patch actually fixes the root cause.
It adds new common variables {upper/lower}_level that represent
depth level. upper_level variable is depth of upper devices.
lower_level variable is depth of lower devices.
[U][L] [U][L]
vlan1 1 5 vlan4 1 4
vlan2 2 4 vlan5 2 3
vlan3 3 3 |
| |
+------------+
|
vlan6 4 2
dummy0 5 1
After this patch, the nesting infrastructure code uses this variable to
check the depth level.
2nd patch fixes Qdisc lockdep related problem.
Before this patch, devices use static lockdep map.
So, if devices that are same types are nested, lockdep will warn about
recursive situation.
These patches make these devices use dynamic lockdep key instead of
static lock or subclass.
3rd patch fixes unexpected IFF_BONDING bit unset.
When nested bonding interface scenario, bonding interface could lost it's
IFF_BONDING flag. This should not happen.
This patch adds a condition before unsetting IFF_BONDING.
4th patch fixes nested locking problem in bonding interface
Bonding interface has own lock and this uses static lock.
Bonding interface could be nested and it uses same lockdep key.
So that unexisting lockdep warning occurs.
5th patch fixes nested locking problem in team interface
Team interface has own lock and this uses static lock.
Team interface could be nested and it uses same lockdep key.
So that unexisting lockdep warning occurs.
6th patch fixes a refcnt leak in the macsec module.
When the macsec module is unloaded, refcnt leaks occur.
But actually, that holding refcnt is unnecessary.
So this patch just removes these code.
7th patch adds ignore flag to an adjacent structure.
In order to exchange an adjacent node safely, ignore flag is needed.
8th patch makes vxlan add an adjacent link to limit depth level.
Vxlan interface could set it's lower interface and these lower interfaces
are handled recursively.
So, if the depth of lower interfaces is too deep, stack overflow could
happen.
9th patch removes unnecessary variables and callback.
After 1st patch, subclass callback and variables are unnecessary.
This patch just removes these variables and callback.
10th patch fix refcnt leaks in the virt_wifi module
Like every nested interface, the upper interface should be deleted
before the lower interface is deleted.
In order to fix this, the notifier routine is added in this patch.
v4 -> v5 :
- Update log messages
- Move variables position, 1st patch
- Fix iterator routine, 1st patch
- Add generic lockdep key code, which replaces 2, 4, 5, 6, 7 patches.
- Log message update, 10th patch
- Fix wrong error value in error path of __init routine, 10th patch
- hold module refcnt when interface is created, 10th patch
v3 -> v4 :
- Add new 12th patch to fix refcnt leaks in the virt_wifi module
- Fix wrong usage netdev_upper_dev_link() in the vxlan.c
- Preserve reverse christmas tree variable ordering in the vxlan.c
- Add missing static keyword in the dev.c
- Expose netdev_adjacent_change_{prepare/commit/abort} instead of
netdev_adjacent_dev_{enable/disable}
v2 -> v3 :
- Modify nesting infrastructure code to use iterator instead of recursive.
v1 -> v2 :
- Make the 3rd patch do not add a new priv_flag.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes variables and callback these are related to the nested
device structure.
devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables are added and
they replace old private nest_level variable.
So, this patch removes all 'nest_level' variables.
In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
to get lockdep subclass value, which is actually lower nested depth value.
But now, they use the dynamic lockdep key to avoid lockdep warning instead
of the subclass.
So, this patch removes ->ndo_get_lock_subclass() callback.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to link an adjacent node, netdev_upper_dev_link() is used
and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used.
unlink operation does not fail, but link operation can fail.
In order to exchange adjacent nodes, we should unlink an old adjacent
node first. then, link a new adjacent node.
If link operation is failed, we should link an old adjacent node again.
But this link operation can fail too.
It eventually breaks the adjacent link relationship.
This patch adds an ignore flag into the netdev_adjacent structure.
If this flag is set, netdev_upper_dev_link() ignores an old adjacent
node for a moment.
This patch also adds new functions for other modules.
netdev_adjacent_change_prepare()
netdev_adjacent_change_commit()
netdev_adjacent_change_abort()
netdev_adjacent_change_prepare() inserts new device into adjacent list
but new device is not allowed to use immediately.
If netdev_adjacent_change_prepare() fails, it internally rollbacks
adjacent list so that we don't need any other action.
netdev_adjacent_change_commit() deletes old device in the adjacent list
and allows new device to use.
netdev_adjacent_change_abort() rollbacks adjacent list.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a macsec interface is created, it increases a refcnt to a lower
device(real device). when macsec interface is deleted, the refcnt is
decreased in macsec_free_netdev(), which is ->priv_destructor() of
macsec interface.
The problem scenario is this.
When nested macsec interfaces are exiting, the exit routine of the
macsec module makes refcnt leaks.
Test commands:
ip link add dummy0 type dummy
ip link add macsec0 link dummy0 type macsec
ip link add macsec1 link macsec0 type macsec
modprobe -rv macsec
[ 208.629433] unregister_netdevice: waiting for macsec0 to become free. Usage count = 1
Steps of exit routine of macsec module are below.
1. Calls ->dellink() in __rtnl_link_unregister().
2. Checks refcnt and wait refcnt to be 0 if refcnt is not 0 in
netdev_run_todo().
3. Calls ->priv_destruvtor() in netdev_run_todo().
Step2 checks refcnt, but step3 decreases refcnt.
So, step2 waits forever.
This patch makes the macsec module do not hold a refcnt of the lower
device because it already holds a refcnt of the lower device with
netdev_upper_dev_link().
Fixes: c09440f7dc ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
team interface could be nested and it's lock variable could be nested too.
But this lock uses static lockdep key and there is no nested locking
handling code such as mutex_lock_nested() and so on.
so the Lockdep would warn about the circular locking scenario that
couldn't happen.
In order to fix, this patch makes the team module to use dynamic lock key
instead of static key.
Test commands:
ip link add team0 type team
ip link add team1 type team
ip link set team0 master team1
ip link set team0 nomaster
ip link set team1 master team0
ip link set team1 nomaster
Splat that looks like:
[ 40.364352] WARNING: possible recursive locking detected
[ 40.364964] 5.4.0-rc3+ #96 Not tainted
[ 40.365405] --------------------------------------------
[ 40.365973] ip/750 is trying to acquire lock:
[ 40.366542] ffff888060b34c40 (&team->lock){+.+.}, at: team_set_mac_address+0x151/0x290 [team]
[ 40.367689]
but task is already holding lock:
[ 40.368729] ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
[ 40.370280]
other info that might help us debug this:
[ 40.371159] Possible unsafe locking scenario:
[ 40.371942] CPU0
[ 40.372338] ----
[ 40.372673] lock(&team->lock);
[ 40.373115] lock(&team->lock);
[ 40.373549]
*** DEADLOCK ***
[ 40.374432] May be due to missing lock nesting notation
[ 40.375338] 2 locks held by ip/750:
[ 40.375851] #0: ffffffffabcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 40.376927] #1: ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
[ 40.377989]
stack backtrace:
[ 40.378650] CPU: 0 PID: 750 Comm: ip Not tainted 5.4.0-rc3+ #96
[ 40.379368] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 40.380574] Call Trace:
[ 40.381208] dump_stack+0x7c/0xbb
[ 40.381959] __lock_acquire+0x269d/0x3de0
[ 40.382817] ? register_lock_class+0x14d0/0x14d0
[ 40.383784] ? check_chain_key+0x236/0x5d0
[ 40.384518] lock_acquire+0x164/0x3b0
[ 40.385074] ? team_set_mac_address+0x151/0x290 [team]
[ 40.385805] __mutex_lock+0x14d/0x14c0
[ 40.386371] ? team_set_mac_address+0x151/0x290 [team]
[ 40.387038] ? team_set_mac_address+0x151/0x290 [team]
[ 40.387632] ? mutex_lock_io_nested+0x1380/0x1380
[ 40.388245] ? team_del_slave+0x60/0x60 [team]
[ 40.388752] ? rcu_read_lock_sched_held+0x90/0xc0
[ 40.389304] ? rcu_read_lock_bh_held+0xa0/0xa0
[ 40.389819] ? lock_acquire+0x164/0x3b0
[ 40.390285] ? lockdep_rtnl_is_held+0x16/0x20
[ 40.390797] ? team_port_get_rtnl+0x90/0xe0 [team]
[ 40.391353] ? __module_text_address+0x13/0x140
[ 40.391886] ? team_set_mac_address+0x151/0x290 [team]
[ 40.392547] team_set_mac_address+0x151/0x290 [team]
[ 40.393111] dev_set_mac_address+0x1f0/0x3f0
[ ... ]
Fixes: 3d249d4ca7 ("net: introduce ethernet teaming device")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.
Test commands:
ip link add bond0 type bond
for i in {1..5}
do
let A=$i-1
ip link add bond$i type bond
ip link set bond$i master bond$A
done
ip link set bond5 master bond0
Splat looks like:
[ 307.992912] WARNING: possible recursive locking detected
[ 307.993656] 5.4.0-rc3+ #96 Tainted: G W
[ 307.994367] --------------------------------------------
[ 307.995092] ip/761 is trying to acquire lock:
[ 307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 307.997045]
but task is already holding lock:
[ 307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 307.999215]
other info that might help us debug this:
[ 308.000251] Possible unsafe locking scenario:
[ 308.001137] CPU0
[ 308.001533] ----
[ 308.001915] lock(&(&bond->stats_lock)->rlock#2/2);
[ 308.002609] lock(&(&bond->stats_lock)->rlock#2/2);
[ 308.003302]
*** DEADLOCK ***
[ 308.004310] May be due to missing lock nesting notation
[ 308.005319] 3 locks held by ip/761:
[ 308.005830] #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[ 308.006894] #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[ 308.008243] #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[ 308.009422]
stack backtrace:
[ 308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G W 5.4.0-rc3+ #96
[ 308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 308.012179] Call Trace:
[ 308.012601] dump_stack+0x7c/0xbb
[ 308.013089] __lock_acquire+0x269d/0x3de0
[ 308.013669] ? register_lock_class+0x14d0/0x14d0
[ 308.014318] lock_acquire+0x164/0x3b0
[ 308.014858] ? bond_get_stats+0xb8/0x500 [bonding]
[ 308.015520] _raw_spin_lock_nested+0x2e/0x60
[ 308.016129] ? bond_get_stats+0xb8/0x500 [bonding]
[ 308.017215] bond_get_stats+0xb8/0x500 [bonding]
[ 308.018454] ? bond_arp_rcv+0xf10/0xf10 [bonding]
[ 308.019710] ? rcu_read_lock_held+0x90/0xa0
[ 308.020605] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 308.021286] ? bond_get_stats+0x9f/0x500 [bonding]
[ 308.021953] dev_get_stats+0x1ec/0x270
[ 308.022508] bond_get_stats+0x1d1/0x500 [bonding]
Fixes: d3fff6c443 ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some interface types could be nested.
(VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
These interface types should set lockdep class because, without lockdep
class key, lockdep always warn about unexisting circular locking.
In the current code, these interfaces have their own lockdep class keys and
these manage itself. So that there are so many duplicate code around the
/driver/net and /net/.
This patch adds new generic lockdep keys and some helper functions for it.
This patch does below changes.
a) Add lockdep class keys in struct net_device
- qdisc_running, xmit, addr_list, qdisc_busylock
- these keys are used as dynamic lockdep key.
b) When net_device is being allocated, lockdep keys are registered.
- alloc_netdev_mqs()
c) When net_device is being free'd llockdep keys are unregistered.
- free_netdev()
d) Add generic lockdep key helper function
- netdev_register_lockdep_key()
- netdev_unregister_lockdep_key()
- netdev_update_lockdep_key()
e) Remove unnecessary generic lockdep macro and functions
f) Remove unnecessary lockdep code of each interfaces.
After this patch, each interface modules don't need to maintain
their lockdep keys.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code doesn't limit the number of nested devices.
Nested devices would be handled recursively and this needs huge stack
memory. So, unlimited nested devices could make stack overflow.
This patch adds upper_level and lower_level, they are common variables
and represent maximum lower/upper depth.
When upper/lower device is attached or dettached,
{lower/upper}_level are updated. and if maximum depth is bigger than 8,
attach routine fails and returns -EMLINK.
In addition, this patch converts recursive routine of
netdev_walk_all_{lower/upper} to iterator routine.
Test commands:
ip link add dummy0 type dummy
ip link add link dummy0 name vlan1 type vlan id 1
ip link set vlan1 up
for i in {2..55}
do
let A=$i-1
ip link add vlan$i link vlan$A type vlan id $i
done
ip link del dummy0
Splat looks like:
[ 155.513226][ T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
[ 155.514162][ T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
[ 155.515048][ T908]
[ 155.515333][ T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
[ 155.516147][ T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 155.517233][ T908] Call Trace:
[ 155.517627][ T908]
[ 155.517918][ T908] Allocated by task 0:
[ 155.518412][ T908] (stack is not available)
[ 155.518955][ T908]
[ 155.519228][ T908] Freed by task 0:
[ 155.519885][ T908] (stack is not available)
[ 155.520452][ T908]
[ 155.520729][ T908] The buggy address belongs to the object at ffff8880608a6ac0
[ 155.520729][ T908] which belongs to the cache names_cache of size 4096
[ 155.522387][ T908] The buggy address is located 512 bytes inside of
[ 155.522387][ T908] 4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
[ 155.523920][ T908] The buggy address belongs to the page:
[ 155.524552][ T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
[ 155.525836][ T908] flags: 0x100000000010200(slab|head)
[ 155.526445][ T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
[ 155.527424][ T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[ 155.528429][ T908] page dumped because: kasan: bad access detected
[ 155.529158][ T908]
[ 155.529410][ T908] Memory state around the buggy address:
[ 155.530060][ T908] ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 155.530971][ T908] ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
[ 155.531889][ T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 155.532806][ T908] ^
[ 155.533509][ T908] ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
[ 155.534436][ T908] ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
[ ... ]
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Payload offload rule should also check the length of the match.
Moreover, check for unsupported link-layer fields:
nft --debug=netlink add rule firewall zones vlan id 100
...
[ payload load 2b @ link header + 0 => reg 1 ]
this loads 2byte base on ll header and offset 0.
This also fixes unsupported raw payload match.
Fixes: 92ad6325cb ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
syzbot reported the following issue :
BUG: KCSAN: data-race in update_defense_level / update_defense_level
read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
worker_thread+0xa0/0x800 kernel/workqueue.c:2415
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
worker_thread+0xa0/0x800 kernel/workqueue.c:2415
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events defense_work_handler
Indeed, old_secure_tcp is currently a static variable, while it
needs to be a per netns variable.
Fixes: a0840e2e16 ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Simon Horman <horms@verge.net.au>