When freeing the TX skb for an off-channel TX, use the correct
API to also free the ACK skb that might have been allocated.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a new station is added to AP/GO interfaces the default behaviour
is for it to be added authenticated and associated, due to backwards
compatibility. To prevent that, the driver must be able to do that
(setting the NL80211_FEATURE_FULL_AP_CLIENT_STATE feature flag) and
userspace must set the flag mask to auth|assoc and clear the set.
Handle this quirk in the API entirely in nl80211, and always push the
full flags to the drivers. NL80211_FEATURE_FULL_AP_CLIENT_STATE is
still required for userspace to be allowed to set the mask including
those bits, but after checking that add both flags to the mask and
set in case userspace didn't set them otherwise.
This obsoletes the mac80211 code handling this difference, no other
driver is currently using these flags.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix nl80211_set_station() to use the value of NL80211_ATTR_STA_AID
attribute instead of NL80211_ATTR_PEER_AID attribute.
Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This commit adds implementation for abort scan in mac80211.
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
[adjust to wdev change in previous patch and clean up code a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Implement new functionality for aborting an ongoing scan.
Add NL80211_CMD_ABORT_SCAN to the nl80211 interface. After
aborting the scan, driver shall provide the scan status by
calling cfg80211_scan_done().
Reviewed-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Vidyullatha Kanchanapally <vkanchan@qti.qualcomm.com>
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
[change command to take wdev instead of netdev so that it
can be used on p2p-device scans]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add new VIF flag, that will allow get NOA update
notification when driver will request this, even
this is not pure P2P vif (eg. STA vif).
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Last caller of this function was removed in 3.17 in commit
97dc94f1d9.
Signed-off-by: Michal Sojka <sojkam1@fel.cvut.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
add ieee80211_iter_keys_rcu() to iterate over uploaded
keys in atomic context (when rcu is locked)
The station removal code removes the keys only after
calling synchronize_net(), so it's not safe to iterate
the keys at this point (and postponing the actual key
deletion with call_rcu() might result in some
badly-ordered ops calls).
Add a flag to indicate a station is being removed,
and skip the configured keys if it's set.
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This can happen when the driver needs to send less frames
than expected and then needs to close the SP.
Mac80211 still needs to set the more_data properly based
on its buffer state (ps_tx_buffer and buffered frames on
other TIDs).
To that end, refactor the code that delivers frames upon
uAPSD trigger frames to be able to get only the more_data
bit without actually delivering those frames in case the
driver is just asking to set a NDP with EOSP and MORE_DATA
bit properly set.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This really should never happen except very early in the process
of bringing up a new driver, at which point you'll have to add
more debugging in the driver and this string isn't useful. Remove
it and save some size (when it's even compiled in.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This indicates a driver key selection issue, but even then there's
no point in printing it all the time, so ratelimit it. Also remove
the priv pointer from it -- people debugging will only have a single
device anyway and it's useless as anything but a cookie.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's no point in printing the mpath pointer since it can't
be used for anything - print the MAC address instead (like in
the forwarding case.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function is a very simple wrapper around another one,
just adds a few default parameters, so replace it with a
static inline instead of using EXPORT_SYMBOL, reducing
the module size slightly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Complete the tracepoint with the missing data - it's not printed
by default (a lot of it is dynamic arrays) but will be recorded
and be available during post-processing.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some devices or drivers cannot deal with having the same station
address for different virtual interfaces, say as a client to two
virtual AP interfaces. Rather than requiring each driver with a
limitation like that to enforce it, add a hardware flag for it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
I want to get the full off-channel bugfix since later code depends on
it, as well as the AP client state change so I can revert it correctly.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Conflicts:
drivers/net/ethernet/renesas/ravb_main.c
kernel/bpf/syscall.c
net/ipv4/ipmr.c
All three conflicts were cases of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"A lot of Thanksgiving turkey leftovers accumulated, here goes:
1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.
2) IDs for some new iwlwifi chips, from Oren Givon.
3) Fix rtlwifi lockups on boot, from Larry Finger.
4) Fix memory leak in fm10k, from Stephen Hemminger.
5) We have a route leak in the ipv6 tunnel infrastructure, fix from
Paolo Abeni.
6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.
7) Wrong lockdep annotations in tcp md5 support, fix from Eric
Dumazet.
8) Work around some middle boxes which prevent proper handling of TCP
Fast Open, from Yuchung Cheng.
9) TCP repair can do huge kmalloc() requests, build paged SKBs
instead. From Eric Dumazet.
10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
Borkmann.
11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
Nikolay Aleksandrov.
12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
Weikusat.
13) Fix double free in VRF code, from Nikolay Aleksandrov.
14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.
15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.
16) Fix clearing of persistent array maps in bpf, from Daniel
Borkmann.
17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
early enough. From Eric Dumazet.
18) Fix out of bounds accesses in bpf array implementation when
updating elements, from Daniel Borkmann.
19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
Dumazet.
20) When dumping proxy neigh entries, we have to accomodate NULL
device pointers properly, from Konstantin Khlebnikov.
21) SCTP doesn't release all ipv6 socket resources properly, fix from
Eric Dumazet.
22) Prevent underflows of sch->q.qlen for multiqueue packet
schedulers, also from Eric Dumazet.
23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
Huang and Michael Chan.
24) Don't actively scan radar channels, from Antonio Quartulli"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
net: phy: reset only targeted phy
bnxt_en: Setup uc_list mac filters after resetting the chip.
bnxt_en: enforce proper storing of MAC address
bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
net: lpc_eth: remove irq > NR_IRQS check from probe()
net_sched: fix qdisc_tree_decrease_qlen() races
openvswitch: fix hangup on vxlan/gre/geneve device deletion
ipv4: igmp: Allow removing groups from a removed interface
ipv6: sctp: implement sctp_v6_destroy_sock()
arm64: bpf: add 'store immediate' instruction
ipv6: kill sk_dst_lock
ipv6: sctp: add rcu protection around np->opt
net/neighbour: fix crash at dumping device-agnostic proxy entries
sctp: use GFP_USER for user-controlled kmalloc
sctp: convert sack_needed and sack_generation to bits
ipv6: add complete rcu protection around np->opt
bpf: fix allocation warnings in bpf maps and integer overflow
mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
net: mvneta: enable setting custom TX IP checksum limit
net: mvneta: fix error path for building skb
...
* fix scanning in mac80211 to not actively scan radar
channels (from Antonio)
* fix uninitialized variable in remain-on-channel that
could lead to treating frame TX as remain-on-channel
and not sending the frame at all
* remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
was broken and needs more work, we'll enable it later
* fix call_rcu() induced use-after-reset/free in mesh
(that was suddenly causing issues in certain tests)
* always request block-ack window size 64 as we found
some APs will otherwise crash (really ...)
* fix P2P-Device teardown sequence to avoid restarting
with uninitialized data
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJWX2JpAAoJEGt7eEactAAd+9cQAJmn3zt0orj/sASv7BeF0h5d
sRfAhkBVOTZur8MgVj1c7fNzT3h1HYNei6c4SA2+rphy6Vbifoli1nLNloC+1Ld4
2WXllEqVe473GqVofCxZHsYZPr2Inmhj7uMiDqvoKUiSRz7phmkY9m+Vju6WZG/W
F6FrTLqFS7UDHIYNYH1DNVSScd/89Gu6pHZvEpoHkrsvt5rEEZiPAQ7sDB4MAMSm
amETtuBqgX83gHR2G4UT2Z9r8TVdzhO+s7vvdVjj0qbP6C6BaS9IUXDjmm3gOvHy
G7j9MJuUC8w/2fZ5A5/l94OuN5rF/ZFMNkn2e6OIzg0HjEZh74CeLl21CnuxdNpB
ECmDVbKoI3OVoFbhEl7P5fBokzZsqhAXpZOmbYEeFRyO6lF2Mv9uzttsF6EOmCX0
BjIoEXOWA2o6IUD/8M6NjW+/B58SDDVi9Mg6D+7Dn7rUFlQ4pddjb0m94bI8GQQU
wl7gROMvYR3tIhiMs1bLF9jJgA831WGWu9eiq8mT2kHPaEV2bFO7OK+SUxyZu1M7
UhN4eoLpU84v9QNJ34N8RCiYxEZ1e6HQxBwQn/fDIWOjOHryZoArhicFY9aOEja4
9xBI9OJhBWOL4N4AFdmTExBdYudSgCTpX+/gQ4tSfedz3lqF79y8+PILwv6E1Q6D
8pScH/4pVo4v5omGaMpA
=vui7
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2015-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A small set of fixes for 4.4:
* fix scanning in mac80211 to not actively scan radar
channels (from Antonio)
* fix uninitialized variable in remain-on-channel that
could lead to treating frame TX as remain-on-channel
and not sending the frame at all
* remove NL80211_FEATURE_FULL_AP_CLIENT_STATE again, it
was broken and needs more work, we'll enable it later
* fix call_rcu() induced use-after-reset/free in mesh
(that was suddenly causing issues in certain tests)
* always request block-ack window size 64 as we found
some APs will otherwise crash (really ...)
* fix P2P-Device teardown sequence to avoid restarting
with uninitialized data
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 5405ff6e15 ("tipc: convert node lock to rwlock")
introduced a bug to the node reference counter handling. When a
message is successfully sent in the function tipc_node_xmit(),
we return directly after releasing the node lock, instead of
continuing and decrementing the node reference counter as we
should do.
This commit fixes this bug.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Register PF_PPPOX with pppox module rather than with pppoe,
so that pppoe doesn't get loaded for any PF_PPPOX socket.
* Register PX_PROTO_* with standard MODULE_ALIAS_NET_PF_PROTO()
instead of using pppox's own naming scheme.
* While there, add auto-loading feature for pptp.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable virtio-vsock and vhost-vsock.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VM sockets virtio transport implementation. This module runs in guest
kernel.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This module contains the common code and header files for the following
virtio-vsock and virtio-vhost kernel modules.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls
routes due to link events. Also adds code to ignore dead
routes during route selection.
Unlike ip routes, mpls routes are not deleted when the route goes
dead. This is current mpls behaviour and this patch does not change
that. With this patch however, routes will be marked dead.
dead routes are not notified to userspace (this is consistent with ipv4
routes).
dead routes:
-----------
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2
$ip link set dev swp1 down
$ip link show dev swp1
4: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN mode
DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1 dead linkdown
nexthop as to 700 via inet 10.1.1.6 dev swp2
linkdown routes:
----------------
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1
nexthop as to 700 via inet 10.1.1.6 dev swp2
$ip link show dev swp1
4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
/* carrier goes down */
$ip link show dev swp1
4: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast
state DOWN mode DEFAULT group default qlen 1000
link/ether 00:02:00:00:00:01 brd ff:ff:ff:ff:ff:ff
$ip -f mpls route show
100
nexthop as to 200 via inet 10.1.1.2 dev swp1 linkdown
nexthop as to 700 via inet 10.1.1.6 dev swp2
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.
One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[ 681.774821] PAX: sch->q.qlen: 0 n: 1
[ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
[ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[ 681.774962] Call Trace:
[ 681.774967] [<ffffffffa95d2810>] dump_stack+0x4c/0x7f
[ 681.774970] [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50
[ 681.774972] [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160
[ 681.774976] [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[ 681.774978] [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[ 681.774980] [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0
[ 681.774983] [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160
[ 681.774985] [<ffffffffa90664c1>] __do_softirq+0xf1/0x200
[ 681.774987] [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30
[ 681.774989] [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260
[ 681.774991] [<ffffffffa9089560>] ? sort_range+0x40/0x40
[ 681.774992] [<ffffffffa9085fe4>] kthread+0xe4/0x100
[ 681.774994] [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170
[ 681.774995] [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70
mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.
A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.
Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.
Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.
A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.
Reported-by: Daniele Fucini <dfucini@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting a value bigger than 255 resulted in using only the lower eight
bits of that value as it is assigned to the u8 header field. To avoid
this unexpected result, reject such values.
Setting a value of zero is technically possible, but hosts receiving
such a packet have to treat it like hop_limit was set to one, according
to RFC2460. Therefore I don't see a use-case for that.
Setting a route's hop_limit to zero in iproute2 means to use the sysctl
default, which is not the case here: Setting e.g.
net.conf.eth0.hop_limit=0 will not make the kernel use
net.conf.all.hop_limit for outgoing packets on eth0. To avoid these
kinds of confusion, reject zero.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each openvswitch tunnel vport (vxlan,gre,geneve) holds a reference
to the underlying tunnel device, but never released it when such
device is deleted.
Deleting the underlying device via the ip tool cause the kernel to
hangup in the netdev_wait_allrefs() loop.
This commit ensure that on device unregistration dp_detach_port_notify()
is called for all vports that hold the device reference, properly
releasing it.
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Fixes: b2acd1dc39 ("openvswitch: Use regular GRE net_device instead of vport")
Fixes: 6b001e682e ("openvswitch: Use Geneve device.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a multicast group is joined on a socket, a struct ip_mc_socklist
is appended to the sockets mc_list containing information about the
joined group.
If the interface is hot unplugged, this entry becomes stale. Prior to
commit 52ad353a53 ("igmp: fix the problem when mc leave group") it
was possible to remove the stale entry by performing a
IP_DROP_MEMBERSHIP, passing either the old ifindex or ip address on
the interface. However, this fix enforces that the interface must
still exist. Thus with time, the number of stale entries grows, until
sysctl_igmp_max_memberships is reached and then it is not possible to
join and more groups.
The previous patch fixes an issue where a IP_DROP_MEMBERSHIP is
performed without specifying the interface, either by ifindex or ip
address. However here we do supply one of these. So loosen the
restriction on device existence to only apply when the interface has
not been specified. This then restores the ability to clean up the
stale entries.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 52ad353a53 "(igmp: fix the problem when mc leave group")
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.
We need to call inet6_destroy_sock() to properly release
inet6 specific fields.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2015-12-01
Here's a Bluetooth fix for the 4.4-rc series that fixes a memory leak of
the Security Manager L2CAP channel that'll happen for every LE
connection.
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When lower device like bonding slave, team/bridge port, etc changes its
state, it is useful for others to notice this change. Currently this is
implemented specificly for bonding as NETDEV_BONDING_INFO notifier. This
patch aims to replace this specific usage and make this more generic to
be used for all upper-lower devices.
Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and
netdev_lower_state_changed() helper.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliminate netdev_master_upper_dev_link_private and pass priv directly as
a parameter of netdev_master_upper_dev_link.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev drivers reflect the newly requested topology to hardware when
CHANGEUPPER is received, after software links were already formed.
However, the operation can fail and user will not be notified, as the
return value of the notifier is not checked.
Add this check and rollback software links if necessary.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing the np->opt RCU conversion, I found that UDP/IPv6 was
using a mixture of xchg() and sk_dst_lock to protect concurrent changes
to sk->sk_dst_cache, leading to possible corruptions and crashes.
ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest
way to fix the mess is to remove sk_dst_lock completely, as we did for
IPv4.
__ip6_dst_store() and ip6_dst_store() share same implementation.
sk_setup_caps() being called with socket lock being held or not,
we have to use sk_dst_set() instead of __sk_dst_set()
Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);"
in ip6_dst_store() before the sk_setup_caps(sk, dst) call.
This is because ip6_dst_store() can be called from process context,
without any lock held.
As soon as the dst is installed in sk->sk_dst_cache, dst can be freed
from another cpu doing a concurrent ip6_dst_store()
Doing the dst dereference before doing the install is needed to make
sure no use after free would trigger.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch completes the work I did in commit 45f6fad84c
("ipv6: add complete rcu protection around np->opt"), as I missed
sctp part.
This simply makes sure np->opt is used with proper RCU locking
and accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Proxy entries could have null pointer to net-device.
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Fixes: 84920c1420 ("net: Allow ipv6 proxies and arp proxies be shown with iproute2")
Signed-off-by: David S. Miller <davem@davemloft.net>
If tcp_send_ack() can not allocate skb, we properly handle this
and setup a timer to try later.
Use __GFP_NOWARN to avoid polluting syslog in the case host is
under memory pressure, so that pertinent messages are not lost under
a flood of useless information.
sk_gfp_atomic() can use its gfp_mask argument (all callers currently
were using GFP_ATOMIC before this patch)
We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this
function now takes into account its second argument (gfp_mask)
Note that when tcp_transmit_skb() is called with clone_it set to false,
we do not attempt memory allocations, so can pass a 0 gfp_mask, which
most compilers can emit faster than a non zero or constant value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov reported that the user could trigger a kernel warning by
using a large len value for getsockopt SCTP_GET_LOCAL_ADDRS, as that
value directly affects the value used as a kmalloc() parameter.
This patch thus switches the allocation flags from all user-controllable
kmalloc size to GFP_USER to put some more restrictions on it and also
disables the warn, as they are not necessary.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses multiple problems :
UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.
Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())
This patch adds full RCU protection to np->opt
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the last change here, I neglected to update the cookie in one code
path: when a mgmt-tx has no real cookie sent to userspace as it doesn't
wait for a response, but is off-channel. The original code used the SKB
pointer as the cookie and always assigned the cookie to the TX SKB in
ieee80211_start_roc_work(), but my change turned this around and made
the code rely on a valid cookie being passed in.
Unfortunately, the off-channel no-wait TX path wasn't assigning one at
all, resulting in an uninitialized stack value being used. This wasn't
handed back to userspace as a cookie (since in the no-wait case there
isn't a cookie), but it was tested for non-zero to distinguish between
mgmt-tx and off-channel.
Fix this by assigning a dummy non-zero cookie unconditionally, and get
rid of a misleading comment and some dead code while at it. I'll clean
up the ACK SKB handling separately later.
Fixes: 3b79af973c ("mac80211: stop using pointers as userspace cookies")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
DFS channels should not be actively scanned as we can't be sure
if we are allowed or not.
If the current channel is in the DFS band, active scan might be
performed after CSA, but we have no guarantee about other channels,
therefore it is safer to prevent active scanning at all.
Cc: stable@vger.kernel.org
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Interfaces are being initialized (setup) on addition,
and torn down on removal.
However, p2p device is being torn down when stopped,
resulting in the next p2p start operation being done
on uninitialized interface.
Solve it by calling ieee80211_teardown_sdata() only
on interface removal (for the non-netdev case).
Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
[squashed in fix to call teardown after unregister]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
After 614732eaa1, no refcount is maintained for the vport-vxlan module.
This allows the userspace to remove such module while vport-vxlan
devices still exist, which leads to later oops.
v1 -> v2:
- move vport 'owner' initialization in ovs_vport_ops_register()
and make such function a macro
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
triggering a fault in sock_wake_async() when async IO is requested.
Said program stressed af_unix sockets, but the issue is generic
and should be addressed in core networking stack.
The problem is that by the time sock_wake_async() is called,
we should not access the @flags field of 'struct socket',
as the inode containing this socket might be freed without
further notice, and without RCU grace period.
We already maintain an RCU protected structure, "struct socket_wq"
so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
is the safe route.
It also reduces number of cache lines needing dirtying, so might
provide a performance improvement anyway.
In followup patches, we might move remaining flags (SOCK_NOSPACE,
SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
being mostly read and let it being shared between cpus.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a cleanup to make following patch easier to
review.
Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
from (struct socket)->flags to a (struct socket_wq)->flags
to benefit from RCU protection in sock_wake_async()
To ease backports, we rename both constants.
Two new helpers, sk_set_bit(int nr, struct sock *sk)
and sk_clear_bit(int net, struct sock *sk) are added so that
following patch can change their implementation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit ab450605b3.
In IPv6, we cannot inherit the dst of the original dst. ndisc packets
are IPv6 packets and may take another route than the original packet.
This patch breaks the following scenario: a packet comes from eth0 and
is forwarded through vxlan1. The encapsulated packet triggers an NS
which cannot be sent because of the wrong route.
CC: Jiri Benc <jbenc@redhat.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current unix_dgram_recvmsg does a wake up for every received
datagram. This seems wasteful as only SOCK_DGRAM client sockets in an
n:1 association with a server socket will ever wait because of the
associated condition. The patch below changes the function such that the
wake up only happens if wq_has_sleeper indicates that someone actually
wants to be notified. Testing with SOCK_SEQPACKET and SOCK_DGRAM socket
seems to confirm that this is an improvment.
Signed-Off-By: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry provided a syzkaller (http://github.com/google/syzkaller)
generated program that triggers the WARNING at
net/ipv4/tcp.c:1729 in tcp_recvmsg() :
WARN_ON(tp->copied_seq != tp->rcv_nxt &&
!(flags & (MSG_PEEK | MSG_TRUNC)));
His program is specifically attempting a Cross SYN TCP exchange,
that we support (for the pleasure of hackers ?), but it looks we
lack proper tcp->copied_seq initialization.
Thanks again Dmitry for your report and testings.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support to add and remove MFC entries. It uses the
same attributes like the already present dump support in order to be
consistent. There's one new entry - RTA_PREFSRC, it's used to denote an
MFC_PROXY entry (see MRT_ADD_MFC vs MRT_ADD_MFC_PROXY).
The already existing infrastructure is used to create and delete the
entries, the netlink message gets converted internally to a struct mfcctl
which is used with ipmr_mfc_add/delete.
The other used attributes are:
RTA_IIF - used for mfcc_parent (when adding it's required to be valid)
RTA_SRC - used for mfcc_origin
RTA_DST - used for mfcc_mcastgrp
RTA_TABLE - the MRT table id
RTA_MULTIPATH - the "oifs" ttl array
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can have both errors and we'll return the second one, fix it to
return an error at a time as it's normal. I've overlooked this in my
previous set.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the inline pimsm_enabled() to pim.h and rename it to
ipmr_pimsm_enabled to show it's for the ipv4 ipmr code since pim.h is
used by IPv6 too.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the definitions of VIF_EXISTS() and struct mr_table to mroute.h
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MFC_NOTIFY was introduced in kernel 2.1.68 but afaik it hasn't been used
and I couldn't find any users currently so just remove it. Only
MFC_STATIC is left, so move it into an enum, add a description and use
BIT().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like many files are including mroute.h unnecessarily, so remove
the include. Most importantly remove it from ipv6.
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sendpage did not care about credentials at all. This could lead to
situations in which because of fd passing between processes we could
append data to skbs with different scm data. It is illegal to splice those
skbs together. Instead we have to allocate a new skb and if requested
fill out the scm details.
Fixes: 869e7c6248 ("net: af_unix: implement stream sendpage support")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active. This patch generalises it
by making it take a wait_queue_head_t directly. The existing
helper is renamed to skwq_has_sleeper.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9c7077622d ("packet: make packet_snd fail on len smaller
than l2 header") added validation for the packet size in packet_snd.
This change enforces that every packet needs a header (with at least
hard_header_len bytes) plus a payload with at least one byte. Before
this change the payload was optional.
This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24 <= 24)"
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha's found a NULL pointer dereference in the RDS connection code when
sending a message to an apparently unbound socket. The problem is caused
by the code checking if the socket is bound in rds_sendmsg(), which checks
the rs_bound_addr field without taking a lock on the socket. This opens a
race where rs_bound_addr is temporarily set but where the transport is not
in rds_bind(), leading to a NULL pointer dereference when trying to
dereference 'trans' in __rds_conn_create().
Vegard wrote a reproducer for this issue, so kindly ask him to share if
you're interested.
I cannot reproduce the NULL pointer dereference using Vegard's reproducer
with this patch, whereas I could without.
Complete earlier incomplete fix to CVE-2015-6937:
74e98eb085 ("RDS: verify the underlying transport exists before creating a connection")
Cc: David S. Miller <davem@davemloft.net>
Cc: stable@vger.kernel.org
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During pre-upstream development, the openvswitch datapath used a custom
hashtable to store vports that could fail on delete due to lack of
memory. However, prior to upstream submission, this code was reworked to
use an hlist based hastable with flexible-array based buckets. As such
the failure condition was eliminated from the vport_del path, rendering
this comment invalid.
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since (at least) commit b17a7c179d ("[NET]: Do sysfs registration as
part of register_netdevice."), netdev_run_todo() deals only with
unregistration, so we don't need to do the rtnl_unlock/lock cycle to
finish registration when failing pimreg or dvmrp device creation. In
fact that opens a race condition where someone can delete the device
while rtnl is unlocked because it's fully registered. The problem gets
worse when netlink support is introduced as there are more points of entry
that can cause it and it also makes reusing that code correctly impossible.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, the transmit phase of a client call is implicitly ack'd by the
reception of the first data packet of the response being received.
However, if a security negotiation happens, the transmit phase, if it is
entirely contained in a single packet, may get an ack packet in response
and then may get aborted due to security negotiation failure.
Because the client has shifted state to RXRPC_CALL_CLIENT_AWAIT_REPLY due
to having transmitted all the data, the code that handles processing of the
received ack packet doesn't note the hard ack the data packet.
The following abort packet in the case of security negotiation failure then
incurs an assertion failure when it tries to drain the Tx queue because the
hard ack state is out of sync (hard ack means the packets have been
processed and can be discarded by the sender; a soft ack means that the
packets are received but could still be discarded and rerequested by the
receiver).
To fix this, we should record the hard ack we received for the ack packet.
The assertion failure looks like:
RxRPC: Assertion failed
1 <= 0 is false
0x1 <= 0x0 is false
------------[ cut here ]------------
kernel BUG at ../net/rxrpc/ar-ack.c:431!
...
RIP: 0010:[<ffffffffa006857b>] [<ffffffffa006857b>] rxrpc_rotate_tx_window+0xbc/0x131 [af_rxrpc]
...
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.
To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.
Note: similar patch has been already submitted by Yoshifuji Hideaki in
http://patchwork.ozlabs.org/patch/220979/
but got lost and forgotten for some reason.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-11-23
Here's the first bluetooth-next pull request for the 4.5 kernel.
- Add new Get Advertising Size Information management command
- Add support for new system note message type on monitor channel
- Refactor LE scan changes behind separate workqueue to avoid races
- Fix issue with privacy feature when powering on adapter
- Various minor fixes & cleanups here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 09605cc12c ("net ipv4: use preferred log methods") replaced
a few calls of pr_cont() after a console print without a trailing
newline by pr_info(), causing lines to be split during IP
autoconfiguration, like:
.
,
OK
IP-Config: Got DHCP answer from 192.168.97.254,
my address is 192.168.97.44
Convert these back to using pr_cont(), so it prints again:
., OK
IP-Config: Got DHCP answer from 192.168.97.254, my address is 192.168.97.44
Absorb the printing of "my address ..." into the previous call to
pr_info(), as there's no reason to use a continuation there.
Convert one more pr_info() to print nameservers while we're at it.
Fixes: 09605cc12c ("net ipv4: use preferred log methods")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the introduction of the switch gpio reset API, I'm getting
build errors in configurations that disable CONFIG_GPIOLIB:
net/dsa/dsa.c:783:16: error: implicit declaration of function 'gpio_to_desc' [-Werror=implicit-function-declaration]
The reason is that linux/gpio/consumer.h is not automatically
included without gpiolib support. This adds an explicit #include
statement to make it compile in all configurations. The reset
functionality will not work without gpiolib, which is what you
get when disabling the feature.
As far as I can tell, gpiolib is supported on all architectures
on which you can have DSA at the moment.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: cc30c16344 ("net: dsa: Add support for a switch reset gpio")
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Coverity says:
*** CID 1338065: Error handling issues (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156 struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
157 struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
158 struct sk_buff *clone;
159 struct rtable *rt;
160
161 if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>> CID 1338065: Error handling issues (CHECKED_RETURN)
>>> Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56 times).
162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164 clone = skb_clone(skb, GFP_ATOMIC);
165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166 ub = rcu_dereference_rtnl(b->media_ptr);
167 if (!ub) {
When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequence as we unconditionally consider that
it's always successful.
Fixes: e53567948f ("tipc: conditionally expand buffer headroom over udp tunnel")
Reported-by: <scan-admin@coverity.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if we drain receive queue thoroughly in tipc_release() after tipc
socket is removed from rhashtable, it is possible that some packets
are in flight because some CPU runs receiver and did rhashtable lookup
before we removed socket. They will achieve receive queue, but nobody
delete them at all. To avoid this leak, we register a private socket
destructor to purge receive queue, meaning releasing packets pending
on receive queue will be delayed until the last reference of tipc
socket will be released.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A truncated cb_compound request will cause the client to decode null or
data from a previous callback for nfs4.1 backchannel case, or uninitialized
data for the nfs4.0 case. This is because the path through
svc_process_common() advances the request's iov_base and decrements iov_len
without adjusting the overall xdr_buf's len field. That causes
xdr_init_decode() to set up the xdr_stream with an incorrect length in
nfs4_callback_compound().
Fixing this for the nfs4.1 backchannel case first requires setting the
correct iov_len and page_len based on the length of received data in the
same manner as the nfs4.0 case.
Then the request's xdr_buf length can be adjusted for both cases based upon
the remaining iov_len and page_len.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The vmci_transport_notify_ops structures are never modified, so declare
them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in_cache_ops and eg_cache_ops structures are never modified, so declare
them as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out common vif init code used in both tunnel and pimreg
initialization and create ipmr_init_vif_indev() function.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take rtnl in the beginning unconditionally as most options already need
it (one exception - MRT_DONE, see the comment inside), make the
lock/unlock places central and move out the switch() local variables.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not necessary to check for null as SLAB_PANIC is used and we'll
panic if the alloc fails, so just drop it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial replace of ifdef with IS_BUILTIN().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a switch to determine if optname is correct and set val accordingly.
This produces a much more straight-forward and readable code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial code and comment style fixes, also removed some extra newlines,
spaces and tabs.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM
define and is used to check if any version of PIM-SM has been enabled.
Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2)
for the pim-sm shared code. This is okay w.r.t IGMPMSG_WHOLEPKT because
only a VIFF_REGISTER device can send such packet, and it can't be
created if pimsm_enabled() is false.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before mroute_reg_vif_num was defined only if any of the CONFIG_PIMSM_
options were set, but that's not really necessary as the size of the
struct is the same in both cases (checked with pahole, both cases size
is 3256 bytes) and we can remove some unnecessary ifdefs to simplify the
code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the table id check in ipmr_new_table and make it return error
pointer. We need this change for the upcoming netlink table manipulation
support in order to avoid code duplication and a race condition.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ON_ONCE() takes a condition, it doesn't take an error message. I
have converted this to WARN() instead.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.
Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Fixes: ec0d215f94 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The classid of a process is changed either when a process is moved to
or from a cgroup or when the net_cls.classid file is updated.
Previously net_cls only supported propogating these changes to the
cgroup's related sockets when a process was added or removed from the
cgroup. This means it was neccessary to remove and re-add all processes
to a cgroup in order to update its classid. This change introduces
support for doing this dynamically - i.e. when the value is changed in
the net_cls_classid file, this will also trigger an update to the
classid associated with all sockets controlled by the cgroup.
This mimics the behaviour of other cgroup subsystems.
net_prio circumvents this issue by storing an index into a table with
each socket (and so any updates to the table, don't require updating
the value associated with the socket). net_cls, however, passes the
socket the classid directly, and so this additional step is needed.
Signed-off-by: Nina Schiff <ninasc@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some boards have a gpio line tied to the switch reset pin. Allow this
gpio to be retrieved from the device tree, and take the switch out of
reset before performing the probe.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch increments the management interface revision due to
introduction of a new Get Advertising Size Information command and
various other fixes & improvements.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In order to enable advertising with privacy enabled, SMP has to be
registered in order to generate new RPA. During power on, it will be
registered at the very end which is the reason why advertising is not
enabled and it's not possible to enable it anymore due to mismatch
between hci_dev settings and actual controller state.
This fixes this problem by moving SMP registration earlier, just after
controller is powered (which is ok, because LE SMP will be already able
to decide on identity address to be used), but before advertising is
enabled.
Signed-off-by: Andrzej Kaczmarek <andrzej.kaczmarek@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There were a couple of code paths missed by the previous patch that
added a HCI status return parameter to __hci_req_sync. This patch adds
the missing assignments for them.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Several ARM default configurations give us warnings on recent
compilers about potentially uninitialized variables in the
nfnetlink code in two functions:
net/netfilter/nfnetlink_queue.c: In function 'nfqnl_build_packet_message':
net/netfilter/nfnetlink_queue.c:519:19: warning: 'nfnl_ct' may be used uninitialized in this function [-Wmaybe-uninitialized]
if (ct && nfnl_ct->build(skb, ct, ctinfo, NFQA_CT, NFQA_CT_INFO) < 0)
Moving the rcu_dereference(nfnl_ct_hook) call outside of the
conditional code avoids the warning without forcing us to
preinitialize the variable.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: a4b4766c3c ("netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Similar to ipv4, when destroying an mrt table the static mfc entries and
the static devices are kept, which leads to devices that can never be
destroyed (because of refcnt taken) and leaked memory. Make sure that
everything is cleaned up on netns destruction.
Fixes: 8229efdaef ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
CC: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When destroying an mrt table the static mfc entries and the static
devices are kept, which leads to devices that can never be destroyed
(because of refcnt taken) and leaked memory, for example:
unreferenced object 0xffff880034c144c0 (size 192):
comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
hex dump (first 32 bytes):
98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff .S.4.....S.4....
ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00 ................
backtrace:
[<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
[<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
[<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
[<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
[<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
[<ffffffff81564e13>] raw_setsockopt+0x33/0x90
[<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
[<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
[<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
Make sure that everything is cleaned on netns destruction.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David and HacKurx reported a following/similar size overflow triggered
in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:
(Already fixed in later grsecurity versions by Brad and PaX Team.)
[ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
[ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
[ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
[ 1002.296153] ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
[ 1002.296162] ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
[ 1002.296169] ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
[ 1002.296176] Call Trace:
[ 1002.296190] [<ffffffff818129ba>] dump_stack+0x45/0x57
[ 1002.296200] [<ffffffff8121f838>] report_size_overflow+0x38/0x60
[ 1002.296209] [<ffffffff816a979e>] scm_detach_fds+0x2ce/0x300
[ 1002.296220] [<ffffffff81791899>] unix_stream_read_generic+0x609/0x930
[ 1002.296228] [<ffffffff81791c9f>] unix_stream_recvmsg+0x4f/0x60
[ 1002.296236] [<ffffffff8178dc00>] ? unix_set_peek_off+0x50/0x50
[ 1002.296243] [<ffffffff8168fac7>] sock_recvmsg+0x47/0x60
[ 1002.296248] [<ffffffff81691522>] ___sys_recvmsg+0xe2/0x1e0
[ 1002.296257] [<ffffffff81693496>] __sys_recvmsg+0x46/0x80
[ 1002.296263] [<ffffffff816934fc>] SyS_recvmsg+0x2c/0x40
[ 1002.296271] [<ffffffff8181a3ab>] entry_SYSCALL_64_fastpath+0x12/0x85
Further investigation showed that this can happen when an *odd* number of
fds are being passed over AF_UNIX sockets.
In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
where i is the number of successfully passed fds, differ by 4 bytes due
to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
on 64 bit. The padding is used to align subsequent cmsg headers in the
control buffer.
When the control buffer passed in from the receiver side *lacks* these 4
bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
overflow in scm_detach_fds():
int cmlen = CMSG_LEN(i * sizeof(int)); <--- cmlen w/o tail-padding
err = put_user(SOL_SOCKET, &cm->cmsg_level);
if (!err)
err = put_user(SCM_RIGHTS, &cm->cmsg_type);
if (!err)
err = put_user(cmlen, &cm->cmsg_len);
if (!err) {
cmlen = CMSG_SPACE(i * sizeof(int)); <--- cmlen w/ 4 byte extra tail-padding
msg->msg_control += cmlen;
msg->msg_controllen -= cmlen; <--- iff no tail-padding space here ...
} ... wrap-around
F.e. it will wrap to a length of 18446744073709551612 bytes in case the
receiver passed in msg->msg_controllen of 20 bytes, and the sender
properly transferred 1 fd to the receiver, so that its CMSG_LEN results
in 20 bytes and CMSG_SPACE in 24 bytes.
In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
issue in my tests as alignment seems always on 4 byte boundary. Same
should be in case of native 32 bit, where we end up with 4 byte boundaries
as well.
In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
a single fd would mean that on successful return, msg->msg_controllen is
being set by the kernel to 24 bytes instead, thus more than the input
buffer advertised. It could f.e. become an issue if such application later
on zeroes or copies the control buffer based on the returned msg->msg_controllen
elsewhere.
Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).
Going over the code, it seems like msg->msg_controllen is not being read
after scm_detach_fds() in scm_recv() anymore by the kernel, good!
Relevant recvmsg() handler are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
and ___sys_recvmsg() places the updated length, that is, new msg_control -
old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
in the example).
Long time ago, Wei Yongjun fixed something related in commit 1ac70e7ad2
("[NET]: Fix function put_cmsg() which may cause usr application memory
overflow").
RFC3542, section 20.2. says:
The fields shown as "XX" are possible padding, between the cmsghdr
structure and the data, and between the data and the next cmsghdr
structure, if required by the implementation. While sending an
application may or may not include padding at the end of last
ancillary data in msg_controllen and implementations must accept both
as valid. On receiving a portable application must provide space for
padding at the end of the last ancillary data as implementations may
copy out the padding at the end of the control message buffer and
include it in the received msg_controllen. When recvmsg() is called
if msg_controllen is too small for all the ancillary data items
including any trailing padding after the last item an implementation
may set MSG_CTRUNC.
Since we didn't place MSG_CTRUNC for already quite a long time, just do
the same as in 1ac70e7ad2 to avoid an overflow.
Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
error in SCM_RIGHTS code sample"). Some people must have copied this (?),
thus it got triggered in the wild (reported several times during boot by
David and HacKurx).
No Fixes tag this time as pre 2002 (that is, pre history tree).
Reported-by: David Sterba <dave@jikos.cz>
Reported-by: HacKurx <hackurx@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add tracepoint to show fib6 table lookups and result.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Get Advertising Size Information command allows to retrieve size
information for advertising data and scan response data fields depending
on the selected flags. This is useful if applications want to know the
available size ahead of time.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The if statements for checking the flags parameter could be written a
bit easier to read. This changes this. No functional behavior has been
changed.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The instance range check for Add Advertising command is missing. If the
provided instance is out of range an Invalid Parameters error should be
returned. At the moment, the generic Failed error is returned. This
extra check ensures that clear error messages are returned.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
napi_alloc_skb() can return NULL.
We should not crash should this happen.
Fixes: 93f93a4404 ("net: move skb_mark_napi_id() into core networking stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 5266698661 ("tipc: let broadcast packet
reception use new link receive function") the broadcast send
link state was meant to always be set to LINK_ESTABLISHED, since
we don't need this link to follow the regular link FSM rules. It
was also the intention that this state anyway shouldn't impact
the run-time working state of the link, since the latter in
reality is controlled by the number of registered peers.
We have now discovered that this assumption is not quite correct.
If the broadcast link is reset because of too many retransmissions,
its state will inadvertently go to LINK_RESETTING, and never go
back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
anticipated. This will work well once, but if it happens a second
time, the reset on a link in LINK_RESETTING has has no effect, and
neither the broadcast link nor the unicast links will go down as
they should.
Furthermore, it is confusing that the management tool shows that
this link is in UP state when that obviously isn't the case.
We now ensure that this state strictly follows the true working
state of the link. The state is set to LINK_ESTABLISHED when
the number of peers is non-zero, and to LINK_RESET otherwise.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of variables with Hungarian notation (l_ptr, n_ptr etc.)
has been significantly reduced over the last couple of years.
We now root out the last traces of this practice.
There are no functional changes in this commit.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We move the definition of struct tipc_link from link.h to link.c in
order to minimize its exposure to the rest of the code.
When needed, we define new functions to make it possible for external
entities to access and set data in the link.
Apart from the above, there are no functional changes.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In our effort to have less code and include dependencies between
entities such as node, link and bearer, we try to narrow down
the exposed interface towards the node as much as possible.
In this commit, we move the definition of struct tipc_node, along
with many of its associated function declarations, from node.h to
node.c. We also move some function definitions from link.c and
name_distr.c to node.c, since they access fields in struct tipc_node
that should not be externally visible. The moved functions are renamed
according to new location, and made static whenever possible.
There are no functional changes in this commit.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the node FSM a node in state SELF_UP_PEER_UP cannot
change state inside a lock context, except when a TUNNEL_PROTOCOL
(SYNCH or FAILOVER) packet arrives. However, the node's individual
links may still change state.
Since each link now is protected by its own spinlock, we finally have
the conditions in place to convert the node spinlock to an rwlock_t.
If the node state and arriving packet type are rigth, we can let the
link directly receive the packet under protection of its own spinlock
and the node lock in read mode. In all other cases we use the node
lock in write mode. This enables full concurrent execution between
parallel links during steady-state traffic situations, i.e., 99+ %
of the time.
This commit implements this change.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation to allow parallel links to work more independently
from each other we introduce a per-link spinlock, to be stored in the
struct nodes's link entry area. Since the node lock still is a regular
spinlock there is no increase in parallellism at this stage.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The file name_distr.c currently contains three functions,
named_cluster_distribute(), tipc_publ_subcscribe() and
tipc_publ_unsubscribe() that all directly access fields in
struct tipc_node. We want to eliminate such dependencies, so
we move those functions to the file node.c and rename them to
tipc_node_broadcast(), tipc_node_subscribe() and tipc_node_unsubscribe()
respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_node_check_state() contains the core logics
for handling link synchronization and failover. For this reason,
it is important to keep it as comprehensible as possible.
In this commit, we make three small cleanups.
1) If the node is in state SELF_DOWN_PEER_LEAVING and the received
packet confirms that the peer has lost contact, there will be no
further action in this function. To make this clearer, we return
from the function directly after the state change.
2) Since commit 0f8b8e28fb ("tipc: eliminate risk of stalled
link synchronization") only the logically first TUNNEL_PROTO/SYNCH
packet can alter the link state and set the synch point,
independently of arrival order. Hence, there is not any longer any
need to adjust the synch value in case such packets arrive in
disorder. We remove this adjustment.
3) It is the intention that any message arriving on any of the links
may trig a check for and possible termination of a node SYNCH state.
A redundant and unnoticed check for tipc_link_is_synching() obviously
beats this purpose, with the effect that only packets arriving on the
synching link may currently end the synch state. We remove this check.
This change will further shorten the synchronization period between
parallel links.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 5cbb28a4bf ("tipc: linearize arriving NAME_DISTR
and LINK_PROTO buffers") we added linearization of NAME_DISTRIBUTOR,
LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE to the function
tipc_udp_recv(). The location of the change was selected in order
to make the commit easily appliable to 'net' and 'stable'.
We now move this linearization to where it should be done, in the
functions tipc_named_rcv() and tipc_link_proto_rcv() respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_send_rcvq() is used for re-injecting data into tcp receive queue.
Problems :
- No check against size is performed, allowed user to fool kernel in
attempting very large memory allocations, eventually triggering
OOM when memory is fragmented.
- In case of fault during the copy we do not return correct errno.
Lets use alloc_skb_with_frags() to cook optimal skbs.
Fixes: 292e8d8c85 ("tcp: Move rcvq sending to tcp_input.c")
Fixes: c0e88ff0f2 ("tcp: Repair socket queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Advertising reordering window in ADDBA less than 64 can crash some APs,
an example is LinkSys WRT120N (with FW v1.0.07 build 002 Jun 18 2012).
On the other hand, a driver may need to limit Tx A-MPDU size for its own
reasons, like specific HW limitations.
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Remove unneeded variable used to store return value.
Error reported by coccicheck.
Signed-off-by: Prasanna Karthik <mkarthi3@visteon.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Fix errors reported by checkpatch.
- ERROR: spaces required around that ':' (ctx:VxW)
- ERROR: open brace '{' following function declarations go on the next line
Signed-off-by: Prasanna Karthik <mkarthi3@visteon.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The kfree_skb() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The kfree_skb() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The hci_req_sync_cancel() is just as much related to the request
cleanup as hci_request_cancel_all() is. Just move the former into the
latter and do the cleanup from a single place in hci_dev_do_close().
The important thing is to avoid deadlocks by holding the req_sync
lock: previously hci_request_cancel_all was done right after releasing
the lock and with this patch it's right before taking it.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The conn_unfinished variable makes the entire logic of
hci_connect_le() rather confusing. By restructuring and clarifying the
logic we can actually remove the conn_unfinished variable and still
keep the same behavior.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This helps simplify the logic in further patches (less cleanups to do
in this failure branch).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The hci_connect_le_scan() is (as the name implies) a master/central
role API, so it makes no sense in passing a role parameter to it. At
the same time this patch also fixes the direct advertising support for
LE L2CAP sockets where we now call the more appropriate hci_le_connect()
API if slave/peripheral role is desired.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The only user of this, le_scan_restart_work(), is so short and simple
that it makes sense to just merge the code there.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Merge le_scan_disable_work_complete into the main le_scan_disable_work
function and take advantage of the updated bredr_inquiry() to run the
Inquiry through hci_req_sync().
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Passing the needed inquiry length to bredr_inquiry() makes it possible
to also use this helper for interleaved discovery where the controller
doesn't support simultaneous Inquiry & LE scan.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The recent changes to remove dependency on HCI in Add Device missed
out relevant changes for BR/EDR. This patch removes the left-overs and
ensures the right HCI command gets queued for BR/EDR.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since discovery also deals with LE scanning it makes sense to move it
behind the same req_workqueue as other LE scanning changes. This also
simplifies the logic since we do many of the actions in a synchronous
manner.
Part of this refactoring is moving hci_req_stop_discovery() to
hci_request.c. At the same time the function receives support for
properly handling the STOPPING state since that's the state we'll be
in when stopping through the req_workqueue.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since discovery also deals with LE scanning it makes sense to move it
behind the same req_workqueue as other LE scanning changes. This also
simplifies the logic since we do many of the actions in a synchronous
manner.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In some circumstances it may be useful to abort the request through
checks done in the request callback. To make the feature possible this
patch changes the return value of the request callback from void to
int and aborts the request if a non-zero value is returned.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
As preparation for moving the discovery HCI commands behind
req_workqueue, add a helper and do the validity checks of the given
discovery type before proceeding further. This way we don't need to do
them again in hci_request.c.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To avoid any risks of races, place also these LE scan modification
work callbacks behind the same work queue as the other LE scan
changes.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There are no more external users so this API can be made private.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We can easily use the new req_workqueue based background scan update
for the power on case. This also removes the last external user of
__hci_update_background_scan().
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since explicit connect requests are also a sub-category of passive
scan updates, run them through the same workqueue as the other passive
scan changes.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In some cases it may be important to get the exact HCI status rather
than the converted HCI-to-errno value. Add an optional return
parameter to the hci_req_sync() API to allow for this. Since there are
no good HCI translation candidates for cancelation and timeout, use
the "unknown" status code for those cases.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
There's no point in waiting for HCI activity in Add/Remove Device
since the effects of these calls are long-lasting and we can anyway
not report up to the application all HCI failures.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Instead of firing off a simple async request queue all background scan
updates through req_workqueue and use hci_req_sync() there to ensure
that no two updates overlap with each other.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Prepare hci_request.c to have code for doing synchronous HCI requests,
such as LE scanning or advertising changes. The necessary work
callbacks will be set up in hci_request_setup() and cleaned up in
hci_request_cancel_all(). The former is used when an HCI device get
registered, and the latter each time it is powered off (or
unregistered).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To make it clear which HCI request APIs target specifically
synchronous requests, add 'sync' to the API names.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
hci_request.c is a more natural place for the synchronous request
handling. Furthermore, we will soon need access to some of the
previously private-to-hci_core.c functions from hci_request.c.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The hci_conn_params_clear_all() function is only called from
hci_unregister_dev() at which point it's completely futile to try to
do any LE scanning updates. Simply remove this unnecessary function
call. At the same time we can make the function static since it's only
accessed from within the same c-file.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
To enable controller specific logging, the userspace daemon has to have
the ability to log per controller. To facilitate this support, provide
a dedicated logging channel. Messages in this channel will be included
in the monitor queue and with that also forwarded to monitoring tools
along with the actual hardware traces.
All messages from the logging channel are timestamped and with that
allow an easy correlation between userspace messages and hardware
events. This will increase the ability to debug problems faster.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The monitor channel can be used to send generic system notes as text
strings for debugging purposes. This adds the system note monitor code
and uses it for including kernel and subsystem version into traces.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The HCI sockets code has still some old casting coding style. Fix this
to match with the rest of the code.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
It's not obvious why schedule_work is used instead of queue_work. Add
a comment explaining why.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When HCI commands are injected via the raw socket, the core was not
including the decoded opcode value. So ensure that it is actually set.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
We can reduce the size of the hci_ctrl struct by converting
'bool req_start' to 'u8 req_flags' and making the two function
pointers a union (since only one is ever set at a time).
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The new hci_skb_pkt_* wrappers only help if they are used consistently
in the Bluetooth subsystem. So first convert the core packet handling.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
For the LE only controllers, there are events that should not be enabled
if the corresponding command is not supported.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When setting the event mask, the HCI_QUIRK_FIXUP_INQUIRY_MODE quirk is
required to be checked so that the Inquiry Result with RSSI event gets
actually enabled.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The LE event mask should be created based on the commands that are
actually supported by the controller.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
There are some BR/EDR default events for Bluetooth 1.2 or later
controllers that are not conditional on their features being present.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When a passive TCP is created, we eventually call tcp_md5_do_add()
with sk pointing to the child. It is not owner by the user yet (we
will add this socket into listener accept queue a bit later anyway)
But we do own the spinlock, so amend the lockdep annotation to avoid
following splat :
[ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
[ 8451.090932]
[ 8451.090932] other info that might help us debug this:
[ 8451.090932]
[ 8451.090934]
[ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
[ 8451.090936] 3 locks held by socket_sockopt_/214795:
[ 8451.090936] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
[ 8451.090947] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
[ 8451.090952] #2: (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
[ 8451.090958]
[ 8451.090958] stack backtrace:
[ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_
[ 8451.091215] Call Trace:
[ 8451.091216] <IRQ> [<ffffffff856fb29c>] dump_stack+0x55/0x76
[ 8451.091229] [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
[ 8451.091235] [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
[ 8451.091239] [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
[ 8451.091242] [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
[ 8451.091246] [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
[ 8451.091249] [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
[ 8451.091253] [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
[ 8451.091256] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
[ 8451.091260] [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
[ 8451.091263] [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
[ 8451.091267] [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
[ 8451.091270] [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
[ 8451.091273] [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
[ 8451.091277] [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90
Fixes: a8afca0329 ("tcp: md5: protects md5sig_info with RCU")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes the definition of the pointer _expiry from time_t to
time64_t. This is to handle the Y2038 problem where time_t
will overflow in the year 2038. The change is safe because
the kernel subsystems that call dns_query pass NULL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Aya Mahfouz <mahfouz.saif.elyazal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the commit cdf3464e6c ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel")
introduced percpu storage for ip6_tunnel dst cache, but while clearing
such cache it used raw_cpu_ptr to walk the per cpu entries, so cached
dst on non current cpu are not actually reset.
This patch replaces raw_cpu_ptr with per_cpu_ptr, properly cleaning
such storage.
Fixes: cdf3464e6c ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()
This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.
Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.
Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi_hash_del() will soon be used from both drivers (if they want)
or core networking stack.
Callers are responsibles to ensure an RCU grace period is respected
before freeing napi structure : napi_hash_del() can signal if
this RCU grace period is needed or not.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do not often add/delete a napi context.
Moving napi_hash[] into read mostly section avoids potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_tx_napi_add() is a variant of netif_napi_add()
It should be used by drivers that use a napi structure
to exclusively poll TX.
We do not want to add this kind of napi in napi_hash[] in following
patches, adding generic busy polling to all NAPI drivers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We would like to automatically provide busy polling support
to all NAPI drivers, without them having to implement anything.
skb_mark_napi_id() can be called from napi_gro_receive() and
napi_get_frags().
Few drivers are still calling skb_mark_napi_id() because
they use netif_receive_skb(). They should eventually call
napi_gro_receive() instead. I will leave this to drivers
maintainers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having to implement complex ndo_busy_poll() method,
drivers can simply rely on NAPI poll logic.
Busy polling gains are mainly coming from polling itself,
not on exact details on how we poll the device.
ndo_busy_poll() if implemented can avoid touching
napi state, but it adds extra synchronization between
normal napi->poll() and busy poll handler, slowing down
the common path (non busy polling) with extra atomic operations.
In practice few drivers ever got busy poll because of the complexity.
We could go one step further, and make busy polling
available for all NAPI drivers, but this would require
that all netif_napi_del() calls are done in process context
so that we can call synchronize_rcu().
Full audit would be required.
Before this is done, a driver still needs to call :
- skb_mark_napi_id() for each skb provided to the stack.
- napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
- Make sure RCU grace period is respected after napi_hash_del() before
memory containing napi structure is freed.
Followup patch implements busy poll for mlx5 driver as an example.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of blocking BH in whole sk_busy_loop(), block them
only around ->ndo_busy_poll() calls.
This has many benefits.
1) allow tunneled traffic to use busy poll as well as native traffic.
Tunnels handlers usually call netif_rx() and depend on net_rx_action()
being run (from sofirq handler)
2) allow RFS/RPS being used (sending IPI to other cpus if needed)
3) use the 'lets burn cpu cycles' budget to do useful work
(like TX completions, timers, RCU callbacks...)
4) reduce BH latencies, making busy poll a better citizen.
Tested:
Tested with SIT tunnel
lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 1 1 10.00 37373.93
16384 87380
Now enable busy poll on both hosts
lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
lpaa6:~# echo 70 >/proc/sys/net/core/busy_read
lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 1 1 10.00 58314.77
16384 87380
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is really little gain from inlining this big function.
We'll soon make it even bigger in following patches.
This means we no longer need to export napi_by_id()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb->sender_cpu and skb->napi_id share a common storage,
and we had various bugs about this.
We had to call skb_sender_cpu_clear() in some places to
not leave a prior skb->napi_id and fool netdev_pick_tx()
As suggested by Alexei, we could split the space so that
these errors can not happen.
0 value being reserved as the common (not initialized) value,
let's reserve [1 .. NR_CPUS] range for valid sender_cpu,
and [NR_CPUS+1 .. ~0U] for valid napi_id.
This will allow proper busy polling support over tunnels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace printk calls with preferred unconditional log method calls to keep
kernel messages clean.
Added newline to "too small MTU" message.
Signed-off-by: Bastian Stender <bst@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix list tests in netfilter ingress support, from Florian Westphal.
2) Fix reversal of input and output interfaces in ingress hook
invocation, from Pablo Neira Ayuso.
3) We have a use after free in r8169, caught by Dave Jones, fixed by
Francois Romieu.
4) Splice use-after-free fix in AF_UNIX frmo Hannes Frederic Sowa.
5) Three ipv6 route handling bug fixes from Martin KaFai Lau:
a) Don't create clone routes not managed by the fib6 tree
b) Don't forget to check expiration of DST_NOCACHE routes.
c) Handle rt->dst.from == NULL properly.
6) Several AF_PACKET fixes wrt transport header setting and SKB
protocol setting, from Daniel Borkmann.
7) Fix thunder driver crash on shutdown, from Pavel Fedin.
8) Several Mellanox driver fixes (max MTU calculations, use of correct
DMA unmap in TX path, etc.) from Saeed Mahameed, Tariq Toukan, Doron
Tsur, Achiad Shochat, Eran Ben Elisha, and Noa Osherovich.
9) Several mv88e6060 DSA driver fixes (wrong bit definitions for
certain registers, etc.) from Neil Armstrong.
10) Make sure to disable preemption while updating per-cpu stats of ip
tunnels, from Jason A. Donenfeld.
11) Various ARM64 bpf JIT fixes, from Yang Shi.
12) Flush icache properly in ARM JITs, from Daniel Borkmann.
13) Fix masking of RX and TX interrupts in ravb driver, from Masaru
Nagai.
14) Fix netdev feature propagation for devices not implementing
->ndo_set_features(). From Nikolay Aleksandrov.
15) Big endian fix in vmxnet3 driver, from Shrikrishna Khare.
16) RAW socket code increments incorrect SNMP counters, fix from Ben
Cartwright-Cox.
17) IPv6 multicast SNMP counters are bumped twice, fix from Neil Horman.
18) Fix handling of VLAN headers on stacked devices when REORDER is
disabled. From Vlad Yasevich.
19) Fix SKB leaks and use-after-free in ipvlan and macvlan drivers, from
Sabrina Dubroca.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
MAINTAINERS: Update Mellanox's Eth NIC driver entries
net/core: revert "net: fix __netdev_update_features return.." and add comment
af_unix: take receive queue lock while appending new skb
rtnetlink: fix frame size warning in rtnl_fill_ifinfo
net: use skb_clone to avoid alloc_pages failure.
packet: Use PAGE_ALIGNED macro
packet: Don't check frames_per_block against negative values
net: phy: Use interrupts when available in NOLINK state
phy: marvell: Add support for 88E1540 PHY
arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS
macvlan: fix leak in macvlan_handle_frame
ipvlan: fix use after free of skb
ipvlan: fix leak in ipvlan_rcv_frame
vlan: Do not put vlan headers back on bridge and macvlan ports
vlan: Fix untag operations of stacked vlans with REORDER_HEADER off
via-velocity: unconditionally drop frames with bad l2 length
ipg: Remove ipg driver
dl2k: Add support for IP1000A-based cards
snmp: Remove duplicate OUTMCAST stat increment
net: thunder: Check for driver data in nicvf_remove()
...
This reverts commit 00ee592717 ("net: fix __netdev_update_features return
on ndo_set_features failure")
and adds a comment explaining why it's okay to return a value other than
0 upon error. Some drivers might actually change flags and return an
error so it's better to fire a spurious notification rather than miss
these.
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While possibly in future we don't necessarily need to use
sk_buff_head.lock this is a rather larger change, as it affects the
af_unix fd garbage collector, diag and socket cleanups. This is too much
for a stable patch.
For the time being grab sk_buff_head.lock without disabling bh and irqs,
so don't use locked skb_queue_tail.
Fixes: 869e7c6248 ("net: af_unix: implement stream sendpage support")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following warning:
CC net/core/rtnetlink.o
net/core/rtnetlink.c: In function ‘rtnl_fill_ifinfo’:
net/core/rtnetlink.c:1308:1: warning: the frame size of 2864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
}
^
by splitting up the huge rtnl_fill_ifinfo into some smaller ones, so we
don't have the huge frame allocations at the same time.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. new skb only need dst and ip address(v4 or v6).
2. skb_copy may need high order pages, which is very rare on long running server.
Signed-off-by: Junwei Zhang <linggao.zjw@alibaba-inc.com>
Signed-off-by: Martin Zhang <martinbj2008@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use PAGE_ALIGNED(...) instead of open-coding it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
rb->frames_per_block is an unsigned int, thus can never be negative.
Also fix spacing in the calculation of frames_per_block.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a vlan is configured with REORDER_HEADER set to 0, the vlan
header is put back into the packet and makes it appear that
the vlan header is still there even after it's been processed.
This posses a problem for bridge and macvlan ports. The packets
passed to those device may be forwarded and at the time of the
forward, vlan headers end up being unexpectedly present.
With the patch, we make sure that we do not put the vlan header
back (when REORDER_HEADER is 0) if a bridge or macvlan has
been configured on top of the vlan device.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have multiple stacked vlan devices all of which have
turned off REORDER_HEADER flag, the untag operation does not
locate the ethernet addresses correctly for nested vlans.
The reason is that in case of REORDER_HEADER flag being off,
the outer vlan headers are put back and the mac_len is adjusted
to account for the presense of the header. Then, the subsequent
untag operation, for the next level vlan, always use VLAN_ETH_HLEN
to locate the begining of the ethernet header and that ends up
being a multiple of 4 bytes short of the actuall beginning
of the mac header (the multiple depending on the how many vlan
encapsulations ethere are).
As a reslult, if there are multiple levles of vlan devices
with REODER_HEADER being off, the recevied packets end up
being dropped.
To solve this, we use skb->mac_len as the offset. The value
is always set on receive path and starts out as a ETH_HLEN.
The value is also updated when the vlan header manupations occur
so we know it will be correct.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using call_rcu(), the called function may be delayed quite
significantly, and without a matching rcu_barrier() there's no
way to be sure it has finished.
Therefore, global state that could be gone/freed/reused should
never be touched in the callback.
Fix this in mesh by moving the atomic_dec() into the caller;
that's not really a problem since we already unlinked the path
and it will be destroyed anyway.
This fixes a crash Jouni observed when running certain tests in
a certain order, in which the mesh interface was torn down, the
memory reused for a function pointer (work struct) and running
that then crashed since the pointer had been decremented by 1,
resulting in an invalid instruction byte stream.
Cc: stable@vger.kernel.org
Fixes: eb2b9311fd ("mac80211: mesh path table implementation")
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For now, this feature doesn't actually work. To avoid shipping a
kernel that has it enabled but where it can't be used disable it
for now - we can re-enable it when it's fixed.
This partially reverts 44674d9c22 ("mac80211: advertise support
for full station state in AP mode").
Cc: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
the OUTMCAST stat is double incremented, getting bumped once in the mcast code
itself, and again in the common ip output path. Remove the mcast bump, as its
not needed
Validated by the reporter, with good results
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Claus Jensen <claus.jensen@microsemi.com>
CC: Claus Jensen <claus.jensen@microsemi.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent flaw in the netdev feature setting resulted in warnings
like this one from VLAN interfaces:
WARNING: CPU: 1 PID: 4975 at net/core/dev.c:2419 skb_warn_bad_offload+0xbc/0xcb()
: caps=(0x00000000001b5820, 0x00000000001b5829) len=2782 data_len=0 gso_size=1348 gso_type=16 ip_summed=3
The ":" is supposed to be preceded by a driver name, but in this
case it is an empty string since the device has no parent.
There are many types of network devices without a parent. The
anonymous warnings for these devices can be hard to debug. Log
the network device name instead in these cases to assist further
debugging.
This is mostly similar to how __netdev_printk() handles orphan
devices.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case multiple writes to a unix stream socket race we could end up in a
situation where we pre-allocate a new skb for use in unix_stream_sendpage
but have to free it again in the locked section because another skb
has been appended meanwhile, which we must use. Accidentally we didn't
clear the pointer after consuming it and so we touched freed memory
while appending it to the sk_receive_queue. So, clear the pointer after
consuming the skb.
This bug has been found with syzkaller
(http://github.com/google/syzkaller) by Dmitry Vyukov.
Fixes: 869e7c6248 ("net: af_unix: implement stream sendpage support")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sending ICMP packets with raw sockets ends up in the SNMP counters
logging the type as the first byte of the IPv4 header rather than
the ICMP header. This is fixed by adding the IP Header Length to
the casting into a icmphdr struct.
Signed-off-by: Ben Cartwright-Cox <ben@benjojo.co.uk>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If ndo_set_features fails __netdev_update_features() will return -1 but
this is wrong because it is expected to return 0 if no features were
changed (see netdev_update_features()), which will cause a netdev
notifier to be called without any actual changes. Fix this by returning
0 if ndo_set_features fails.
Fixes: 6cb6a27c45 ("net: Call netdev_features_change() from netdev_update_features()")
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When __netdev_update_features() was updated to ensure some features are
disabled on new lower devices, an error was introduced for devices which
don't have the ndo_set_features() method set. Before we'll just set the
new features, but now we return an error and don't set them. Fix this by
returning the old behaviour and setting err to 0 when ndo_set_features
is not present.
Fixes: e7868a85e1 ("net/core: ensure features get disabled on new lower devs")
CC: Jarod Wilson <jarod@redhat.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Ido Schimmel <idosch@mellanox.com>
CC: Sander Eikelenboom <linux@eikelenboom.it>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Dave Young <dyoung@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When NET_SWITCHDEV=n, switchdev_port_attr_set simply returns EOPNOTSUPP.
In this case we should not emit errors and warnings to the kernel log.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: 0bc05d585d ("switchdev: allow caller to explicitly request
attr_set as deferred")
Fixes: 6ac311ae8b ("Adding switchdev ageing notification on port
bridged")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SYNACK packets might be attached to request sockets.
Use skb_to_full_sk() helper to avoid illegal accesses to
inet_sk(skb->sk)
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some functions access TCP sockets without holding a lock and
might output non consistent data, depending on compiler and or
architecture.
tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
now sctp auth cannot work well when setting a hmacid manually, which
is caused by that we didn't use the network order for hmacid, so fix
it by adding the transformation in sctp_auth_ep_set_hmacs.
even we set hmacid with the network order in userspace, it still
can't work, because of this condition in sctp_auth_ep_set_hmacs():
if (id > SCTP_AUTH_HMAC_ID_MAX)
return -EOPNOTSUPP;
so this wasn't working before and thus it won't break compatibility.
Fixes: 65b07e5d0d ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since it's introduction in commit 69e3c75f4d ("net: TX_RING and
packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
reserve check should have reserve as 0, but currently, this is
unconditionally set (in it's original form as dev->hard_header_len).
I think this is not correct since tpacket_fill_skb() would then
take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
the extra VLAN_HLEN could be possible in both cases. Presumably, the
reserve code was copied from packet_snd(), but later on missed the
check. Make it similar as we have it in packet_snd().
Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case no struct sockaddr_ll has been passed to packet
socket's sendmsg() when doing a TX_RING flush run, then
skb->protocol is set to po->num instead, which is the protocol
passed via socket(2)/bind(2).
Applications only xmitting can go the path of allocating the
socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
TX_RING with sll_protocol of 0. That way, register_prot_hook()
is neither called on creation nor on bind time, which saves
cycles when there's no interest in capturing anyway.
That leaves us however with po->num 0 instead and therefore
the TX_RING flush run sets skb->protocol to 0 as well. Eric
reported that this leads to problems when using tools like
trafgen over bonding device. I.e. the bonding's hash function
could invoke the kernel's flow dissector, which depends on
skb->protocol being properly set. In the current situation, all
the traffic is then directed to a single slave.
Fix it up by inferring skb->protocol from the Ethernet header
when not set and we have ARPHRD_ETHER device type. This is only
done in case of SOCK_RAW and where we have a dev->hard_header_len
length. In case of ARPHRD_ETHER devices, this is guaranteed to
cover ETH_HLEN, and therefore being accessed on the skb after
the skb_store_bits().
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packet sockets can be used by various net devices and are not
really restricted to ARPHRD_ETHER device types. However, when
currently checking for the extra 4 bytes that can be transmitted
in VLAN case, our assumption is that we generally probe on
ARPHRD_ETHER devices. Therefore, before looking into Ethernet
header, check the device type first.
This also fixes the issue where non-ARPHRD_ETHER devices could
have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
the check would test unfilled linear part of the skb (instead
of non-linear).
Fixes: 57f89bfa21 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We concluded that the skb_probe_transport_header() should better be
called unconditionally. Avoiding the call into the flow dissector has
also not really much to do with the direct xmit mode.
While it seems that only virtio_net code makes use of GSO from non
RX/TX ring packet socket paths, we should probe for a transport header
nevertheless before they hit devices.
Reference: http://thread.gmane.org/gmane.linux.network/386173/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In tpacket_fill_skb() commit c1aad275b0 ("packet: set transport
header before doing xmit") and later on 40893fd0fd ("net: switch
to use skb_probe_transport_header()") was probing for a transport
header on the skb from a ring buffer slot, but at a time, where
the skb has _not even_ been filled with data yet. So that call into
the flow dissector is pretty useless. Lets do it after we've set
up the skb frags.
Fixes: c1aad275b0 ("packet: set transport header before doing xmit")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All DST_NOCACHE rt6_info used to have rt->dst.from set to
its parent.
After commit 8e3d5be736 ("ipv6: Avoid double dst_free"),
DST_NOCACHE is also set to rt6_info which does not have
a parent (i.e. rt->dst.from is NULL).
This patch catches the rt->dst.from == NULL case.
Fixes: 8e3d5be736 ("ipv6: Avoid double dst_free")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the expires of the DST_NOCACHE rt can be set during
the ip6_rt_update_pmtu(), we also need to consider the expires
value when doing ip6_dst_check().
This patches creates __rt6_check_expired() to only
check the expire value (if one exists) of the current rt.
In rt6_dst_from_check(), it adds __rt6_check_expired() as
one of the condition check.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1272571
The setup has a IPv4 GRE tunnel running in a IPSec. The bug
happens when ndisc starts sending router solicitation at the gre
interface. The simplified oops stack is like:
__lock_acquire+0x1b2/0x1c30
lock_acquire+0xb9/0x140
_raw_write_lock_bh+0x3f/0x50
__ip6_ins_rt+0x2e/0x60
ip6_ins_rt+0x49/0x50
~~~~~~~~
__ip6_rt_update_pmtu.part.54+0x145/0x250
ip6_rt_update_pmtu+0x2e/0x40
~~~~~~~~
ip_tunnel_xmit+0x1f1/0xf40
__gre_xmit+0x7a/0x90
ipgre_xmit+0x15a/0x220
dev_hard_start_xmit+0x2bd/0x480
__dev_queue_xmit+0x696/0x730
dev_queue_xmit+0x10/0x20
neigh_direct_output+0x11/0x20
ip6_finish_output2+0x21f/0x770
ip6_finish_output+0xa7/0x1d0
ip6_output+0x56/0x190
~~~~~~~~
ndisc_send_skb+0x1d9/0x400
ndisc_send_rs+0x88/0xc0
~~~~~~~~
The rt passed to ip6_rt_update_pmtu() is created by
icmp6_dst_alloc() and it is not managed by the fib6 tree,
so its rt6i_table == NULL. When __ip6_rt_update_pmtu() creates
a RTF_CACHE clone, the newly created clone also has rt6i_table == NULL
and it causes the ip6_ins_rt() oops.
During pmtu update, we only want to create a RTF_CACHE clone
from a rt which is currently managed (or owned) by the
fib6 tree. It means either rt->rt6i_node != NULL or
rt is a RTF_PCPU clone.
It is worth to note that rt6i_table may not be NULL even it is
not (yet) managed by the fib6 tree (e.g. addrconf_dst_alloc()).
Hence, rt6i_node is a better check instead of rt6i_table.
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu>
Cc: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>