We currently enable interrupts before we enable NAPI. If an RX interrupt
hits before we enabled NAPI then the NAPI callback is never called and
we leave the hardware with RX interrupts disabled, which of course leads
us to never handling received packets. Fix this by moving the interrupt
enable to after we've enable NAPI and the reclaim tasklet.
Fixes: cd5e412347 ("dwc_eth_qos: do phy_start before resetting hardware")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Lars Persson <larper@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
clk_prepare_enable() may fail, so we should better check its return
value and propagate it in the case of failure
While at it, replace __lpc_eth_clock_enable() with a plain
clk_prepare_enable/clk_disable_unprepare() call in order to
simplify the code.
Signed-off-by: Fabio Estevam <fabio.estevam@nxp.com>
Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PORT_RATE_CONTROL register works differently on 88e6095/6095f/6131
in comparison to 6123/61/65, and 0x0 disables. The distinction was lost
Linux 4.1 --> 4.2
Signed-off-by: Jamie Lentin <jm@lentin.co.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like the ksz8081, the ksz9031 has the behavior where it will clear the
interrupt enable bits when leaving power down. This takes advantage of the
solution provided by f5aba91.
Signed-off-by: Xander Huff <xander.huff@ni.com>
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current scatter-gather logic in gianfar is flawed, since
it does not consider the eTSEC's RxBD 'Data Length' field is
context depening: for the last fragment it contains the full
frame size, while fragments contain the fragment size, which
equals the value written to register MRBLR.
This causes data corruption as soon as the hardware starts
to fragment receiving frames. As a result, the size of
fragmented frames is increased by
(nr_frags - 1) * MRBLR
We first noticed this issue working with DSA, where an ICMP
request sized 1472 bytes causes the scatter-gather logic to
kick in. The full Ethernet frame (1518) gets increased by
DSA (4), GMAC_FCB_LEN (8), and FSL_GIANFAR_DEV_HAS_TIMER
(priv->padding=8) to a total of 1538 octets, which is
fragmented by the hardware and reconstructed by the driver
to a 3074 octet frame.
This patch fixes the problem by adjusting the size of
the last fragment.
It was tested by setting MRBLR to different multiples of
64, proving correct scatter-gather operation on frames
with up to 9000 octets in size.
Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The eTSEC register MRBLR defines the maximum space in
the RX buffers and is set to 1536 by gianfar. This
reasonably covers the common use case where the MTU
is kept at default 1500. In that case, the largest
Ethernet frame size of 1518 plus an optional
GMAC_FCB_LEN of 8, and an additional padding of 8
to handle FSL_GIANFAR_DEV_HAS_TIMER totals to 1534
and nicely fit within the chosen MRBLR.
Alas, if the eTSEC is attached to a DSA enabled switch,
the (E)DSA header extension (4 or 8 bytes) causes every
maximum sized frame to be fragmented by the hardware.
This patch increases the maximum RX buffer size by 8
and rounds up to the next multiple of 64, which the
hardware's defines as RX buffer granularity.
Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Laura tracked poll() [and friends] regression caused by commit
e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
udp_poll() needs to know if there is a valid packet in receive queue,
even if its payload length is 0.
Change first_packet_length() to return an signed int, and use -1
as the indication of an empty queue.
Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Encoding of the metadata was using the padded length as opposed to
the real length of the data which is a bug per specification.
This has not been an issue todate because all metadatum specified
so far has been 32 bit where aligned and data length are the same width.
This also includes a bug fix for validating the length of a u16 field.
But since there is no metadata of size u16 yes we are fine to include it
here.
While at it get rid of magic numbers.
Fixes: ef6980b6be ("net sched: introduce IFE action")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Driver never bothered marking the VF's vport with the VF's sw_fid.
As a result, FLR flows are not going to clean those vports.
If the vport was active when FLRed, re-activating it would lead
to a FW assertion.
Fixes: dacd88d6f6 ("qed: IOV l2 functionality")
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In b8247f095e,
"net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"
gso skbs arriving from an ingress interface that go through UDP
tunneling, are allowed to be fragmented if the resulting encapulated
segments exceed the dst mtu of the egress interface.
This aligned the behavior of gso skbs to non-gso skbs going through udp
encapsulation path.
However the non-gso vs gso anomaly is present also in the following
cases of a GRE tunnel:
- ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
(e.g. OvS vport-gre with df_default=false)
- ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set
In both of the above cases, the non-gso skbs get fragmented, whereas the
gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
as they don't go through the segment+fragment code path.
Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.
Tunnels that do set IP_DF, will not go to fragmentation of segments.
This preserves behavior of ip_gre in (the default) pmtudisc mode.
Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
Reported-by: wenxu <wenxu@ucloud.cn>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Tested-by: wenxu <wenxu@ucloud.cn>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If DAD fails with accept_dad set to 2, global addresses and host routes
are incorrectly left in place. Even though disable_ipv6 is set,
contrary to documentation, the addresses are not dynamically deleted
from the interface. It is only on a subsequent link down/up that these
are removed. The fix is not only to set the disable_ipv6 flag, but
also to call addrconf_ifdown(), which is the action to carry out when
disabling IPv6. This results in the addresses and routes being deleted
immediately. The DAD failure for the LL addr is determined as before
via netlink, or by the absence of the LL addr (which also previously
would have had to be checked for in case of an intervening link down
and up). As the call to addrconf_ifdown() requires an rtnl lock, the
logic to disable IPv6 when DAD fails is moved to addrconf_dad_work().
Previous behavior:
root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
inet6 2000::10/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed
valid_lft forever preferred_lft forever
root@vm1:/# ip -6 route show dev eth3
2000::/64 proto kernel metric 256
fe80::/64 proto kernel metric 256
root@vm1:/# ip link set down eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#
New behavior:
root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes these compiler warnings via libc-compat.h when glibc netipx/ipx.h is
included before linux/ipx.h:
./linux/ipx.h:9:8: error: redefinition of ‘struct sockaddr_ipx’
./linux/ipx.h:26:8: error: redefinition of ‘struct ipx_route_definition’
./linux/ipx.h:32:8: error: redefinition of ‘struct ipx_interface_definition’
./linux/ipx.h:49:8: error: redefinition of ‘struct ipx_config_data’
./linux/ipx.h:58:8: error: redefinition of ‘struct ipx_route_def’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel uapi header are supposed to use them. Fixes userspace compile error:
linux/openvswitch.h:583:2: error: unknown type name ‘uint32_t’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compile error:
error: field ‘real’ has incomplete type
struct timeval real; /* real (wall-clock) time */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compiler error:
error: unknown type name ‘uint32_t’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compilation errors:
error: field ‘addr’ has incomplete type
struct sockaddr_in addr; /* IP address and port to send to */
error: field ‘addr’ has incomplete type
struct sockaddr_in6 addr; /* IP address and port to send to */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compilation errors like:
error: field ‘addr’ has incomplete type
struct sockaddr_in addr; /* IP address and port to send to */
^
error: field ‘addr’ has incomplete type
struct sockaddr_in6 addr; /* IP address and port to send to */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compilation errors like:
error: field ‘iph’ has incomplete type
error: field ‘prefix’ has incomplete type
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes userspace compilation error:
error: ‘IFNAMSIZ’ undeclared here (not in a function)
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the address configured in the device tree is invalid, the
driver will fallback to using a random address from the locally
administered range.
Signed-off-by: Daniel Romell <daro@hms.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
'Commit 3c8b3efc06 ("vmxnet3: allow variable length transmit data ring
buffer")' changed the size of the buffers in the tx data ring from a
fixed size of 128 bytes to a variable size.
However, while copying data to the data ring, vmxnet3_copy_hdr continues
to carry the old code that assumes fixed buffer size of 128. This patch
fixes it by adding correct offset based on the actual data ring buffer
size.
Signed-off-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RAR entry for the SAN MAC address was being cleared when we were
clearing the VMDq pool bits. In order to prevent this we need to add
an extra check to protect the SAN MAC from being cleared.
Fixes: 6e982aeae ("ixgbe: Clear stale pool mappings")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pool index has to be converted by get_pool helper to work correctly for
egress pool. In mlxsw the egress pool index starts from 0.
Fixes: 0f433fa0ec ("mlxsw: spectrum_buffers: Implement shared buffer configuration")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk->sk_state is bits flag, so need use bit operation check
instead of value check.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Saeed Mahameed says:
====================
Mellanox 100G mlx5 fixes 2016-08-16
This series includes some bug fixes for mlx5e driver.
From Saeed and Tariq, Optimize MTU change to not reset when it is not required.
From Paul, Command interface message length check to speedup firmware
command preparation.
From Mohamad, Save pci state when pci error is detected.
From Amir, Flow counters "lastuse" update fix.
From Hadar, Use correct flow dissector key on flower offloading.
Plus a small optimization for switchdev hardware id query.
From Or, three patches to address some E-Switch offloads issues.
For -stable of 4.6.y and 4.7.y:
net/mlx5e: Use correct flow dissector key on flower offloading
net/mlx5: Fix pci error recovery flow
net/mlx5: Added missing check of msg length in verifying its signature
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are in the switchdev/offloads mode, HW matching is done as
dictated by the offloaded rules and hence we don't need to enable
the ACLs mechanism used by the legacy mode.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While adding actual offloading support to the new switchdev mode, we didn't
change the setup of the send-to-vport rules to put them in the slow path
table, fix that.
Fixes: 1033665e63 ('net/mlx5: E-Switch, Use two priorities for SRIOV offloads mode')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since mlx5 has also the NONE e-switch mode, we must translate from mlx5
mode to devlink mode on the devlink eswitch mode get call, do that.
While here, remove the mlx5_ prefix from the static function helpers
that deal with the mode to comply with the rest of the code.
Fixes: c930a3ad74 ('net/mlx5e: Add devlink based SRIOV mode change')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid firmware command execution each time the switchdev HW ID attr get
call is made. We do that by reading the ID (PF NIC MAC) only once at
load time and store it on the representor structure.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The wrong key is used when extracting the address type field set by
the flower offload code. We have to use the control key and not the
basic key, fix that.
Fixes: e3a2b7ed01 ('net/mlx5e: Support offload cls_flower with drop action')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set lastuse statistic, when number of packets is changed compared to
last query. This was wrongly dropped when bulk counter reading was added.
Fixes: a351a1b03b ('net/mlx5: Introduce bulk reading of flow counters')
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set and verify signature calculates the signature for each of the
mailbox nodes, even for those that are unused (from cache). Added
a missing length check to set and verify only those which are used.
While here, also moved the setting of msg's nodes token to where we
already go over them. This saves a pass because checksum is disabled,
and the only useful thing remaining that set signature does is setting
the token.
Fixes: e126ba97db ('mlx5: Add driver for Mellanox Connect-IB
adapters')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When PCI error is detected we should save the state of the pci prior to
disabling it.
Also when receiving pci slot reset call we need to verify that the
device is responsive.
Fixes: 89d44f0a6c ('net/mlx5_core: Add pci error handlers to mlx5_core
driver')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid unnecessary interface down/up operations upon an MTU change
when it does not affect the rings configuration.
Fixes: 461017cb00 ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port mtu shouldn't be written to hardware on every single interface
open.
Here we set it only when needed, on change_mtu and netdevice creation.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Fix one typo: s/tn/tp/
2) Fix the description about the "u" bits.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Neukum says:
====================
fixes to kaweth in response to Umap2 testing
These patches fix an oops in firmware downloading and an oops due
to a memory allocation failure
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the oops discovered by the Umap2 project and Alan Stern.
The intf member needs to be set before the firmware is downloaded.
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was failing on successful registration returning meaningless errors.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Fixes: 55954f3bfd ("net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file")
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present the code to check in kdump kernel was not disabling
allocation of resources when CONFIG_CHELSIO_T4_DCB is defined, move the
code outside #defines so that it gets disabled irrespective of #define,
when in kdump kernel.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ethtool_ops .get_regs function attempts to read the nonexistent
register NIC_QSET_SQ_0_7_CNM_CHG, which produces a "bus error" type
OOPs.
Fix by not attempting to read, and removing the definition of,
NIC_QSET_SQ_0_7_CNM_CHG. A zero is written into the register dump to
keep the layout unchanged.
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: <stable@vger.kernel.org> # 4.4.x-
Signed-off-by: David S. Miller <davem@davemloft.net>
Driver uses netif_tx_queue_stopped() to make sure the xmit_more
indication will be honored, but that only checks for DRV_XOFF.
At the same time, it's possible that during transmission the DQL will
close the transmission queue with STACK_XOFF indication.
In re-configuration flows, when the threshold is relatively low, it's
possible that the device has no pending tranmissions, and during
tranmission the driver would miss doorbelling the HW.
Since there are no pending transmission, there will never be a Tx
completion [and thus the DQL would not remove the STACK_XOFF indication],
eventually causing the Tx queue to timeout.
While we're at it - also doorbell in case driver has to close the
transmission queue on its own [although this one is less important -
if the ring is full, we're bound to receive completion eventually,
which means the doorbell would only be postponed and not indefinetly
blocked].
Fixes: 312e06761c ("qede: Utilize xmit_more")
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter updates for your net tree,
they are:
1) Dump only conntrack that belong to this namespace via /proc file.
This is some fallout from the conversion to single conntrack table
for all netns, patch from Liping Zhang.
2) Missing MODULE_ALIAS_NF_LOGGER() for the ARP family that prevents
module autoloading, also from Liping Zhang.
3) Report overquota event to the right netnamespace, again from Liping.
4) Fix tproxy listener sk refcount that leads to crash, from
Eric Dumazet.
5) Fix racy refcounting on object deletion from nfnetlink and rule
removal both for nfacct and cttimeout, from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In general, when we want to delete a netns, cttimeout_net_exit will
be called before ipt_unregister_table, i.e. before ctnl_timeout_put.
But after call kfree_rcu in cttimeout_net_exit, we will still decrease
the timeout object's refcnt in ctnl_timeout_put, this is incorrect,
and will cause a use after free error.
It is easy to reproduce this problem:
# while : ; do
ip netns add xxx
ip netns exec xxx nfct add timeout testx inet icmp timeout 200
ip netns exec xxx iptables -t raw -p icmp -I OUTPUT -j CT --timeout testx
ip netns del xxx
done
=======================================================================
BUG kmalloc-96 (Tainted: G B E ): Poison overwritten
-----------------------------------------------------------------------
INFO: 0xffff88002b5161e8-0xffff88002b5161e8. First byte 0x6a instead of
0x6b
INFO: Allocated in cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
age=104 cpu=0 pid=3330
___slab_alloc+0x4da/0x540
__slab_alloc+0x20/0x40
__kmalloc+0x1c8/0x240
cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
nfnetlink_rcv_msg+0x21a/0x230 [nfnetlink]
[ ... ]
So only when the refcnt decreased to 0, we call kfree_rcu to free the
timeout object. And like nfnetlink_acct do, use atomic_cmpxchg to
avoid race between ctnl_timeout_try_del and ctnl_timeout_put.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Suppose that we input the following commands at first:
# nfacct add test
# iptables -A INPUT -m nfacct --nfacct-name test
And now "test" acct's refcnt is 2, but later when we try to delete the
"test" nfacct and the related iptables rule at the same time, race maybe
happen:
CPU0 CPU1
nfnl_acct_try_del nfnl_acct_put
atomic_dec_and_test //ref=1,testfail -
- atomic_dec_and_test //ref=0,testok
- kfree_rcu
atomic_inc //ref=1 -
So after the rcu grace period, nf_acct will be freed but it is still linked
in the nfnl_acct_list, and we can access it later, then oops will happen.
Convert atomic_dec_and_test and atomic_inc combinaiton to one atomic
operation atomic_cmpxchg here to fix this problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>