Update fib_nl_dumprule for strict data checking. If the flag is set,
the dump request is expected to have fib_rule_hdr struct as the header.
All elements of the struct are expected to be 0 and no attributes can
be appended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rtnl_net_dumpid for strict data checking. If the flag is set,
the dump request is expected to have an rtgenmsg struct as the header
which has the family as the only element. No data may be appended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update neightbl_dump_info for strict data checking. If the flag is set,
the dump request is expected to have an ndtmsg struct as the header.
All elements of the struct are expected to be 0 and no attributes can
be appended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update neigh_dump_info for strict data checking. If the flag is set,
the dump request is expected to have an ndmsg struct as the header
potentially followed by one or more attributes. Any data passed in the
header or as an attribute is taken as a request to influence the data
returned. Only values supported by the dump handler are allowed to be
non-0 or set in the request. At the moment only the NDA_IFINDEX and
NDA_MASTER attributes are supported.
Existing code does not fail the dump if nlmsg_parse fails. That behavior
is kept for non-strict checking.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rtnl_stats_dump for strict data checking. If the flag is set,
the dump request is expected to have an if_stats_msg struct as the header.
All elements of the struct are expected to be 0 except filter_mask which
must be non-0 (legacy behavior). No attributes are supported.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rtnl_bridge_getlink for strict data checking. If the flag is set,
the dump request is expected to have an ifinfomsg struct as the header
potentially followed by one or more attributes. Any data passed in the
header or as an attribute is taken as a request to influence the data
returned. Only values supported by the dump handler are allowed to be
non-0 or set in the request. At the moment only the IFLA_EXT_MASK
attribute is supported.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update rtnl_dump_ifinfo for strict data checking. If the flag is set,
the dump request is expected to have an ifinfomsg struct as the header
potentially followed by one or more attributes. Any data passed in the
header or as an attribute is taken as a request to influence the data
returned. Only values supported by the dump handler are allowed to be
non-0 or set in the request. At the moment only the IFA_TARGET_NETNSID,
IFLA_EXT_MASK, IFLA_MASTER, and IFLA_LINKINFO attributes are supported.
Existing code does not fail the dump if nlmsg_parse fails. That behavior
is kept for non-strict checking.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure extack is passed to nlmsg_parse where easy to do so.
Most of these are dump handlers and leveraging the extack in
the netlink_callback.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').
The problem is, the function bails out early if nlmsg_parse() fails, which
does occur for iproute2 usage of 'struct ndmsg' because the payload length
is shorter than the family header alone (as 'struct ifinfomsg' is assumed).
This breaks backward compatibility with userspace -- nothing is sent back.
Some examples with iproute2 and netlink library for go [1]:
1) $ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent
This one works, as it uses 'struct ifinfomsg'.
fdb_show() @ iproute2/bridge/fdb.c
"""
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
...
if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
"""
2) $ ip --family bridge neigh
RTNETLINK answers: Invalid argument
Dump terminated
This one fails, as it uses 'struct ndmsg'.
do_show_or_flush() @ iproute2/ip/ipneigh.c
"""
.n.nlmsg_type = RTM_GETNEIGH,
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
"""
3) $ ./neighlist
< no output >
This one fails, as it uses 'struct ndmsg'-based.
neighList() @ netlink/neigh_linux.go
"""
req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
msg := Ndmsg{
"""
The actual breakage was introduced by commit 0ff50e83b5 ("net: rtnetlink:
bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
if the payload length (with the _actual_ family header) is less than the
family header length alone (which is assumed, in parameter 'hdrlen').
This is true in the examples above with struct ndmsg, with size and payload
length shorter than struct ifinfomsg.
However, that commit just intends to fix something under the assumption the
family header is indeed an 'struct ifinfomsg' - by preventing access to the
payload as such (via 'ifm' pointer) if the payload length is not sufficient
to actually contain it.
The assumption was introduced by commit 5e6d243587 ("bridge: netlink dump
interface at par with brctl"), to support iproute2's 'bridge fdb' command
(not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.
So, in order to unbreak the 'struct ndmsg' family headers and still allow
'struct ifinfomsg' to continue to work, check for the known message sizes
used with 'struct ndmsg' in iproute2 (with zero or one attribute which is
not used in this function anyway) then do not parse the data as ifinfomsg.
Same examples with this patch applied (or revert/before the original fix):
$ bridge fdb show
33:33:00:00:00:01 dev ens3 self permanent
01:00:5e:00:00:01 dev ens3 self permanent
33:33:ff:15:98:30 dev ens3 self permanent
$ ip --family bridge neigh
dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT
$ ./neighlist
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
Tested on mainline (v4.19-rc6) and net-next (3bd09b05b0).
References:
[1] netlink library for go (test-case)
https://github.com/vishvananda/netlink
$ cat ~/go/src/neighlist/main.go
package main
import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
func main() {
neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
}
$ export GOPATH=~/go
$ go get github.com/vishvananda/netlink
$ go build neighlist
$ ~/go/src/neighlist/neighlist
Thanks to David Ahern for suggestions to improve this patch.
Fixes: 0ff50e83b5 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
Fixes: 5e6d243587 ("bridge: netlink dump interface at par with brctl")
Reported-by: Aidan Obley <aobley@pivotal.io>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already check the RSS indirection table does not use queues which
would be disabled by channel reconfiguration. Make sure user does not
try to disable queues which have a UMEM and zero-copy AF_XDP socket
installed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
ethtool_set_channels() validates the config against driver's max
settings. It retrieves the current config and stores it in a
variable called max. This was okay when only max settings were
accessed but we will soon want to access current settings as
well, so calling the entire structure max makes the code less
readable.
While at it drop unnecessary parenthesis.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Move the attribute parsing from neigh_dump_table to neigh_dump_info, and
pass the filter arguments down to neigh_dump_table in a new struct. Add
the filter option to proxy neigh dumps as well to make them consistent.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msix_vec_per_pf_min - This param sets the number of minimal MSIX
vectors required for the device initialization. This value is set
in the device which limits MSIX vectors per PF.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msix_vec_per_pf_max - This param sets the number of MSIX vectors
that the device requests from the host on driver initialization.
This value is set in the device which is applicable per PF.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ignore_ari - Device ignores ARI(Alternate Routing ID) capability,
even when platforms has the support and creates same number of
partitions when platform does not support ARI capability.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbtU45AAoJEEg/ir3gV/o+/C4H/RHA4KImrb476EdB3VNYMqAN
dgXb+bmh6sZP+jHWqQ4c3aVeh6/T8qm4gwiSn2nVTtHEnxtCdIYljzDC1Nswczeg
pSjD1eOP7M1LpAOmBb8xdnJcX7yM7r1bTklnp2sN853WShbsDRYgZBHsBwTzx25U
ZdzL4QTLuohlG/aLrbGXMntIy45ya2fVQrnK54s18nFlgsdFjEs0mi0xaUKNBC6+
P8CTohHAxuuxmL5b+6MIYLZCdgd8cLNQFdtqbckEVw7SvcRTxfraRlyqJ0YOgTGB
TdSWnqZz2JYH29wSFbpFG8qX6GCv8FoiZ+fKzldbolHk442rrktHv3+Y7qQuZVs=
=NVks
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-10-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2018-10-03
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Rothwell reports the following link failure with IPv6 as module:
x86_64-linux-gnu-ld: net/core/filter.o: in function `sk_lookup':
(.text+0x19219): undefined reference to `__udp6_lib_lookup'
Fix the build by only enabling the IPv6 socket lookup if IPv6 support is
compiled into the kernel.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to the eswitch related operations.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
in tcp_close()
While a socket is being closed, it is very possible other
threads find it in rtnetlink dump.
tcp_get_info() will acquire the socket lock for a short amount
of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
enough to trigger the warning.
Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.
By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:
struct bpf_sock_tuple tuple;
struct bpf_sock_ops *sk;
populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
}
bpf_sk_release(sk, 0);
return TC_ACT_OK;
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We have an impressive number of syzkaller bugs that are linked
to the fact that syzbot was able to create a networking device
with millions of TX (or RX) queues.
Let's limit the number of RX/TX queues to 4096, this really should
cover all known cases.
A separate patch will add various cond_resched() in the loops
handling sysfs entries at device creation and dismantle.
Tested:
lpaa6:~# ip link add gre-4097 numtxqueues 4097 numrxqueues 4097 type ip6gretap
RTNETLINK answers: Invalid argument
lpaa6:~# time ip link add gre-4096 numtxqueues 4096 numrxqueues 4096 type ip6gretap
real 0m0.180s
user 0m0.000s
sys 0m0.107s
Fixes: 76ff5cc919 ("rtnl: allow to specify number of rx and tx queues on device creation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper is unused since commit 988cf74deb ("inet:
Stop generating UFO packets.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Link dumps can return results from a target namespace. If the namespace id
is invalid, then the dump request should fail if get_target_net fails
rather than continuing with a dump of the current namespace.
Fixes: 79e1ad148c ("rtnetlink: use netnsid to query interface")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we do no longer require NAPI drivers to provide
an ndo_poll_controller(), napi_schedule() has not been done
before poll_one_napi() invocation.
So testing NAPI_STATE_SCHED is likely to cause early returns.
While we are at it, remove outdated comment.
Note to future bisections : This change might surface prior
bugs in drivers. See commit 73f21c653f ("bnxt_en: Fix TX
timeout during netpoll.") for one occurrence.
Fixes: ac3d9dd034 ("netpoll: make ndo_poll_controller() optional")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following sparse warning:
net/core/gen_stats.c:166:1: warning:
symbol '___gnet_stats_copy_basic' was not declared. Should it be static?
Fixes: 5e111210a4 ("net/core: Add new basic hardware counter")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add flag wol_enabled to struct net_device indicating whether
Wake-on-LAN is enabled. As first user phy_suspend() will use it to
decide whether PHY can be suspended or not.
Fixes: f1e911d5d0 ("r8169: add basic phylib support")
Fixes: e8cfd9d6c7 ("net: phy: call state machine synchronously in phy_stop")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rtnl lock is encapsulated in netlink and cannot be accessed by other
modules directly. This means that reference counted objects that rely on
rtnl lock cannot use it with refcounter helper function that atomically
releases decrements reference and obtains mutex.
This patch implements simple wrapper function around refcount_dec_and_lock
that obtains rtnl lock if reference counter value reached 0.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Version bump conflict in batman-adv, take what's in net-next.
iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF flow dissectors are configured per network namespace.
__skb_flow_dissect looks up the netns through dev_net(skb->dev).
In some dissector paths skb->dev is NULL, such as for Unix sockets.
In these cases fall back to looking up the netns by socket.
Analyzing the codepaths leading to __skb_flow_dissect I did not find
a case where both skb->dev and skb->sk are NULL. Warn and fall back to
standard flow dissector if one is found.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
send netlink notification if neigh_update results in NTF_ROUTER
change and if NEIGH_UPDATE_F_ISROUTER is on. Also move the
NTF_ROUTER change function into a helper.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows admin setting of NTF_ROUTER flag
on a neighbour entry. This enables external control
plane (like bgp evpn) to manage neigh entries with
NTF_ROUTER flag.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller().
NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.
This patch allows netpoll_poll_dev() to process NAPI
contexts even for drivers not providing ndo_poll_controller(),
allowing for following patches in NAPI drivers.
Also we export netpoll_poll_dev() so that it can be called
by bonding/team drivers in following patches.
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So it should not fail with EPERM even though it is no longer implemented...
This is a fix for:
(userns)$ egrep ^Cap /proc/self/status
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff
(userns)$ tcpdump -i usb_rndis0
tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted
Warning: Kernel filter failed: Bad file descriptor
tcpdump: can't remove kernel filter: Bad file descriptor
With this change it returns EOPNOTSUPP instead of EPERM.
See also https://github.com/the-tcpdump-group/libpcap/issues/689
Fixes: 08a00fea6d "net: Remove references to NETIF_F_UFO from ethtool."
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Smatch reports that devlink_dpipe_send_and_alloc_skb() frees the skb
on error so this is a double free. We fixed a bunch of these bugs in
commit 7fe4d6dcbc ("devlink: Remove redundant free on error path") but
we accidentally overlooked this one.
Fixes: d9f9b9a4d0 ("devlink: Add support for resource abstraction")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When bringing down the netdevice (incl. detaching it) and calling
netif_carrier_off directly or indirectly the latter triggers an
asynchronous linkwatch event.
This linkwatch event eventually may fail to access chip registers in
the ndo_get_stats/ndo_get_stats64 callback because the device isn't
accessible any longer, see call trace in [0].
To prevent this scenario don't check for IFF_UP only, but also make
sure that the netdevice is present.
[0] https://lists.openwall.net/netdev/2018/03/15/62
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore,
We prefer to use the macro rather than calculating its value.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two new tls tests added in parallel in both net and net-next.
Used Stephen Rothwell's linux-next resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-09-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix end boundary calculation in BTF for the type section, from Martin.
2) Fix and revert subtraction of pointers that was accidentally allowed
for unprivileged programs, from Alexei.
3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid
a warning in linearizing sg pages into a single one for large allocs,
from Tushar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Replace "fallthru" with a proper "fall through" annotation.
This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update 'confirmed' timestamp when ARP packet is received. It shouldn't
affect locktime logic and anyway entry can be confirmed by any higher-layer
protocol. Thus it makes sense to confirm it when ARP packet is received.
Fixes: 77d7123342 ("neighbour: update neigh timestamps iff update is effective")
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071
Commit 5025f7f7d5 wrongly relied on __dev_change_flags to notify users of
dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED.
Fix it by indicating flag changes explicitly to __dev_notify_flags.
Fixes: 5025f7f7d5 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link")
Reported-By: Liam mcbirnie <liam.mcbirnie@boeing.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Helper bpg_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes below warning.
e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8
To avoid above warning, use __GFP_COMP while allocating multiple
contiguous pages.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Building with CONFIG_INET=n will show the warning below:
net/core/filter.c: In function ‘____bpf_getsockopt’:
net/core/filter.c:4048:19: warning: unused variable ‘tp’ [-Wunused-variable]
struct tcp_sock *tp;
^~
net/core/filter.c:4046:31: warning: unused variable ‘icsk’ [-Wunused-variable]
struct inet_connection_sock *icsk;
^~~~
Move the variable declarations inside the {} block where they are used.
Fixes: 1e215300f1 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It documents what is happening, and eliminates the spurious list
pointer poisoning.
In the long term, in order to get proper list head debugging, we
might want to use the list poison value as the indicator that
an SKB is a singleton and not on a list.
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
flag was introduced because send(2) ignores unknown message flags and
any legacy application which was accidentally passing the equivalent of
MSG_ZEROCOPY earlier should not see any new behaviour.
Before commit f214f915e7 ("tcp: enable MSG_ZEROCOPY"), a send(2) call
which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
would succeed. However, after that commit, it fails with -ENOBUFS. So
it appears that the SO_ZEROCOPY flag fails to fulfill its intended
purpose. Fix it.
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The compiler does an efficient job of inlining static C functions.
Perf top clearly shows that almost everything gets inlined into the
function call xdp_do_redirect.
The function xdp_do_redirect end-up containing and interleaving the
map and non-map redirect code. This is sub-optimal, as it would be
strange for an XDP program to use both types of redirect in the same
program. The two use-cases are separate, and interleaving the code
just cause more instruction-cache pressure.
I would like to stress (again) that the non-map variant bpf_redirect
is very slow compared to the bpf_redirect_map variant, approx half the
speed. Measured with driver i40e the difference is:
- map redirect: 13,250,350 pps
- non-map redirect: 7,491,425 pps
For this reason, the function name of the non-map variant of redirect
have been called xdp_do_redirect_slow. This hopefully gives a hint
when using perf, that this is not the optimal XDP redirect operating mode.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The compiler chooses to not-inline the function __xdp_map_lookup_elem,
because it can see that it is used by both Generic-XDP and native-XDP
do redirect calls (xdp_do_generic_redirect_map and xdp_do_redirect_map).
The compiler cannot know that this is a bad choice, as it cannot know
that a net device cannot run both XDP modes (Generic or Native) at the
same time. Thus, mark this function inline, even-though we normally
leave this up-to the compiler.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Notice the compiler generated ASM code layout was suboptimal. It
assumed map enqueue errors as the likely case, which is shouldn't.
It assumed that xdp_do_flush_map() was a likely case, due to maps
changing between packets, which should be very unlikely.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
IFLA_TARGET_NETNSID is the new alias for IFLA_IF_NETNSID. This commit
replaces all occurrences of IFLA_IF_NETNSID with the new alias to
indicate that this identifier is the preferred one.
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I don't see how the type - which is one of
RTM_{GETADDR,GETROUTE,GETNETCONF} - can change. So do the message type
calculation once before entering the for loop.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_target_net() will be used in follow-up patches in ipv{4,6} codepaths to
retrieve network namespaces based on network namespace identifiers. So
remove the static declaration and export in the rtnetlink header. Also,
rename it to rtnl_get_net_ns_capable() to make it obvious what this
function is doing.
Export rtnl_get_net_ns_capable() so it can be used when ipv6 is built as
a module.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter. With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them. So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)
Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this. Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
The first intended user is lldpd.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Must perform TXQ teardown before unregistering interfaces in
mac80211, from Toke Høiland-Jørgensen.
2) Don't allow creating mac80211_hwsim with less than one channel, from
Johannes Berg.
3) Division by zero in cfg80211, fix from Johannes Berg.
4) Fix endian issue in tipc, from Haiqing Bai.
5) BPF sockmap use-after-free fixes from Daniel Borkmann.
6) Spectre-v1 in mac80211_hwsim, from Jinbum Park.
7) Missing rhashtable_walk_exit() in tipc, from Cong Wang.
8) Revert kvzalloc() conversion of AF_PACKET, it breaks mmap() when
kvzalloc() tries to use kmalloc() pages. From Eric Dumazet.
9) Fix deadlock in hv_netvsc, from Dexuan Cui.
10) Do not restart timewait timer on RST, from Florian Westphal.
11) Fix double lwstate refcount grab in ipv6, from Alexey Kodanev.
12) Unsolicit report count handling is off-by-one, fix from Hangbin Liu.
13) Sleep-in-atomic in cadence driver, from Jia-Ju Bai.
14) Respect ttl-inherit in ip6 tunnel driver, from Hangbin Liu.
15) Use-after-free in act_ife, fix from Cong Wang.
16) Missing hold to meta module in act_ife, from Vlad Buslov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits)
net: phy: sfp: Handle unimplemented hwmon limits and alarms
net: sched: action_ife: take reference to meta module
act_ife: fix a potential use-after-free
net/mlx5: Fix SQ offset in QPs with small RQ
tipc: correct spelling errors for tipc_topsrv_queue_evt() comments
tipc: correct spelling errors for struct tipc_bc_base's comment
bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA.
bnxt_en: Clean up unused functions.
bnxt_en: Fix firmware signaled resource change logic in open.
sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel
sctp: fix invalid reference to the index variable of the iterator
net/ibm/emac: wrong emac_calc_base call was used by typo
net: sched: null actions array pointer before releasing action
vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition
r8169: add support for NCube 8168 network card
ip6_tunnel: respect ttl inherit for ip6tnl
mac80211: shorten the IBSS debug messages
mac80211: don't Tx a deauth frame if the AP forbade Tx
mac80211: Fix station bandwidth setting after channel switch
mac80211: fix a race between restart and CSA flows
...
Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
linearizing multiple scatterlist elements. Variable 'offset' is used
to find first starting scatterlist element
i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset"
Use different variable name while linearizing multiple scatterlist
elements so that value contained in variable 'offset' won't get
overwritten.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-01
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add AF_XDP zero-copy support for i40e driver (!), from Björn and Magnus.
2) BPF verifier improvements by giving each register its own liveness
chain which allows to simplify and getting rid of skip_callee() logic,
from Edward.
3) Add bpf fs pretty print support for percpu arraymap, percpu hashmap
and percpu lru hashmap. Also add generic percpu formatted print on
bpftool so the same can be dumped there, from Yonghong.
4) Add bpf_{set,get}sockopt() helper support for TCP_SAVE_SYN and
TCP_SAVED_SYN options to allow reflection of tos/tclass from received
SYN packet, from Nikita.
5) Misc improvements to the BPF sockmap test cases in terms of cgroup v2
interaction and removal of incorrect shutdown() calls, from John.
6) Few cleanups in xdp_umem_assign_dev() and xdpsock samples, from Prashant.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for two new bpf get/set sockopts: TCP_SAVE_SYN (set)
and TCP_SAVED_SYN (get). This would allow for bpf program to build
logic based on data from ingress SYN packet (e.g. doing tcp's tos/
tclass reflection (see sample prog)) and do it transparently from
userspace program point of view.
Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Variable 'headroom' is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
variable ‘headroom’ set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Since [gs]et_settings ethtool_ops callbacks have been deprecated in
February 2016, all in tree NIC drivers have been converted to provide
[gs]et_link_ksettings() and out of tree drivers have had enough time to do
the same.
Drop get_settings() and set_settings() and implement both ETHTOOL_[GS]SET
and ETHTOOL_[GS]LINKSETTINGS only using [gs]et_link_ksettings().
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_unregister_all(PF_INET6) gets called from inet6_init in cases when
no handler has been registered for PF_INET6 yet, for example if
ip6_mr_init() fails. Abort and avoid a NULL pointer deref in that case.
Example of panic (triggered by faking a failure of
register_pernet_subsys):
general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
[...]
RIP: 0010:rtnl_unregister_all+0x17e/0x2a0
[...]
Call Trace:
? rtnetlink_net_init+0x250/0x250
? sock_unregister+0x103/0x160
? kernel_getsockopt+0x200/0x200
inet6_init+0x197/0x20d
Fixes: e2fddf5e96 ("[IPV6]: Make af_inet6 to check ip6_route_init return value.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
so it can be used from netdev drivers. Also, add additional checks for
the memory type.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This commit adds proper MEM_TYPE_ZERO_COPY support for
convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
make sense to implement a more sophisticated thread-safe alloc/free
scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
required in the fast-path.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When we perform the sg shift repair for the scatterlist ring, we
currently start out at i = first_sg + 1. However, this is not
correct since the first_sg could point to the sge sitting at slot
MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
helper for iterating through the ring, and apply the same rule
for advancing to the next ring element as we do elsewhere. Later
work will use this helper also in other places.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If first_sg and last_sg wraps around in the scatterlist ring, then we
need to account for that in the shift as well. E.g. crafting such msgs
where this is the case leads to a hang as shift becomes negative. E.g.
consider the following scenario:
first_sg := 14 |=> shift := -12 msg->sg_start := 10
last_sg := 3 | msg->sg_end := 5
round 1: i := 15, move_from := 3, sg[15] := sg[ 3]
round 2: i := 0, move_from := -12, sg[ 0] := sg[-12]
round 3: i := 1, move_from := -11, sg[ 1] := sg[-11]
round 4: i := 2, move_from := -10, sg[ 2] := sg[-10]
[...]
round 13: i := 11, move_from := -1, sg[ 2] := sg[ -1]
round 14: i := 12, move_from := 0, sg[ 2] := sg[ 0]
round 15: i := 13, move_from := 1, sg[ 2] := sg[ 1]
round 16: i := 14, move_from := 2, sg[ 2] := sg[ 2]
round 17: i := 15, move_from := 3, sg[ 2] := sg[ 3]
[...]
This means we will loop forever and never hit the msg->sg_end condition
to break out of the loop. When we see that the ring wraps around, then
the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning,
the remainder slots from the tail of the ring and the head until last_sg
combined.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In the current code, msg->data is set as sg_virt(&sg[i]) + start - offset
and msg->data_end relative to it as msg->data + bytes. Using iterator i
to point to the updated starting scatterlist element holds true for some
cases, however not for all where we'd end up pointing out of bounds. It
is /correct/ for these ones:
1) When first finding the starting scatterlist element (sge) where we
find that the page is already privately owned by the msg and where
the requested bytes and headroom fit into the sge's length.
However, it's /incorrect/ for the following ones:
2) After we made the requested area private and updated the newly allocated
page into first_sg slot of the scatterlist ring; when we find that no
shift repair of the ring is needed where we bail out updating msg->data
and msg->data_end. At that point i will point to last_sg, which in this
case is the next elem of first_sg in the ring. The sge at that point
might as well be invalid (e.g. i == msg->sg_end), which we use for
setting the range of sg_virt(&sg[i]). The correct one would have been
first_sg.
3) Similar as in 2) but when we find that a shift repair of the ring is
needed. In this case we fix up all sges and stop once we've reached the
end. In this case i will point to will point to the new msg->sg_end,
and the sge at that point will be invalid. Again here the requested
range sits in first_sg.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
While recently going over bpf_msg_pull_data(), I noticed three
issues which are fixed in here:
1) When we attempt to find the first scatterlist element (sge)
for the start offset, we add len to the offset before we check
for start < offset + len, whereas it should come after when
we iterate to the next sge to accumulate the offsets. For
example, given a start offset of 12 with a sge length of 8
for the first sge in the list would lead us to determine this
sge as the first sge thinking it covers first 16 bytes where
start is located, whereas start sits in subsequent sges so
we would end up pulling in the wrong data.
2) After figuring out the starting sge, we have a short-cut test
in !msg->sg_copy[i] && bytes <= len. This checks whether it's
not needed to make the page at the sge private where we can
just exit by updating msg->data and msg->data_end. However,
the length test is not fully correct. bytes <= len checks
whether the requested bytes (end - start offsets) fit into the
sge's length. The part that is missing is that start must not
be sge length aligned. Meaning, the start offset into the sge
needs to be accounted as well on top of the requested bytes
as otherwise we can access the sge out of bounds. For example
the sge could have length of 8, our requested bytes could have
length of 8, but at a start offset of 4, so we also would need
to pull in 4 bytes of the next sge, when we jump to the out
label we do set msg->data to sg_virt(&sg[i]) + start - offset
and msg->data_end to msg->data + bytes which would be oob.
3) The subsequent bytes < copy test for finding the last sge has
the same issue as in point 2) but also it tests for less than
rather than less or equal to. Meaning if the sge length is of
8 and requested bytes of 8 while having the start aligned with
the sge, we would unnecessarily go and pull in the next sge as
well to make it private.
Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Building the newly introduced BPF_PROG_TYPE_SK_REUSEPORT leads to
a compile time error when building with clang:
net/core/filter.o: In function `sk_reuseport_convert_ctx_access':
../net/core/filter.c:7284: undefined reference to `__compiletime_assert_7284'
It seems that clang has issues resolving hweight_long at compile
time. Since SK_FL_PROTO_MASK is a constant, we can use the interface
for known constant arguments which works fine with clang.
Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
Pull networking fixes from David Miller:
1) Fix races in IPVS, from Tan Hu.
2) Missing unbind in matchall classifier, from Hangbin Liu.
3) Missing act_ife action release, from Vlad Buslov.
4) Cure lockdep splats in ila, from Cong Wang.
5) veth queue leak on link delete, from Toshiaki Makita.
6) Disable isdn's IIOCDBGVAR ioctl, it exposes kernel addresses. From
Kees Cook.
7) RCU usage fixup in XDP, from Tariq Toukan.
8) Two TCP ULP fixes from Daniel Borkmann.
9) r8169 needs REALTEK_PHY as a Kconfig dependency, from Heiner
Kallweit.
10) Always take tcf_lock with BH disabled, otherwise we can deadlock
with rate estimator code paths. From Vlad Buslov.
11) Don't use MSI-X on RTL8106e r8169 chips, they don't resume properly.
From Jian-Hong Pan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
ip6_vti: fix creating fallback tunnel device for vti6
ip_vti: fix a null pointer deferrence when create vti fallback tunnel
r8169: don't use MSI-X on RTL8106e
net: lan743x_ptp: convert to ktime_get_clocktai_ts64
net: sched: always disable bh when taking tcf_lock
ip6_vti: simplify stats handling in vti6_xmit
bpf: fix redirect to map under tail calls
r8169: add missing Kconfig dependency
tools/bpf: fix bpf selftest test_cgroup_storage failure
bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
bpf, sockmap: fix map elem deletion race with smap_stop_sock
bpf, sockmap: fix leakage of smap_psock_map_entry
tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
tcp, ulp: add alias for all ulp modules
bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
net/xdp: Fix suspicious RCU usage warning
net/mlx5e: Delete unneeded function argument
Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb
isdn: Disable IIOCDBGVAR
...
Commits 109980b894 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c30013133 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.
In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.
Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rdma.git merge resolution for the 4.19 merge window
Conflicts:
drivers/infiniband/core/rdma_core.c
- Use the rdma code and revise with the new spelling for
atomic_fetch_add_unless
drivers/nvme/host/rdma.c
- Replace max_sge with max_send_sge in new blk code
drivers/nvme/target/rdma.c
- Use the blk code and revise to use NULL for ib_post_recv when
appropriate
- Replace max_sge with max_recv_sge in new blk code
net/rds/ib_send.c
- Use the net code and revise to use NULL for ib_post_recv when
appropriate
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
L47IdXo=
=sJkt
-----END PGP SIGNATURE-----
Merge tag 'v4.18' into rdma.git for-next
Resolve merge conflicts from the -rc cycle against the rdma.git tree:
Conflicts:
drivers/infiniband/core/uverbs_cmd.c
- New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
- Merge removal of file->ucontext in for-next with new code in -rc
drivers/infiniband/core/uverbs_main.c
- for-next removed code from ib_uverbs_write() that was modified
in for-rc
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 1472592 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
req->sdiag_family is a user-controlled value that's used as an array
index. Sanitize it after the bounds check to avoid speculative
out-of-bounds array access.
This also protects the sock_is_registered() call, so this removes the
sanitize call there.
Fixes: e978de7a6d ("net: socket: Fix potential spectre v1 gadget in sock_is_registered")
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: konrad.wilk@oracle.com
Cc: jamie.iles@oracle.com
Cc: liran.alon@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-13
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add driver XDP support for veth. This can be used in conjunction with
redirect of another XDP program e.g. sitting on NIC so the xdp_frame
can be forwarded to the peer veth directly without modification,
from Toshiaki.
2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
in order to provide more control and visibility on where a SO_REUSEPORT
sk should be located, and the latter enables to directly select a sk
from the bpf map. This also enables map-in-map for application migration
use cases, from Martin.
3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
of cgroup v2 that is the ancestor of the cgroup associated with the
skb at the ancestor_level, from Andrey.
4) Implement BPF fs map pretty-print support based on BTF data for regular
hash table and LRU map, from Yonghong.
5) Decouple the ability to attach BTF for a map from the key and value
pretty-printer in BPF fs, and enable further support of BTF for maps for
percpu and LPM trie, from Daniel.
6) Implement a better BPF sample of using XDP's CPU redirect feature for
load balancing SKB processing to remote CPU. The sample implements the
same XDP load balancing as Suricata does which is symmetric hash based
on IP and L4 protocol, from Jesper.
7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
critical path as it is ensured that the allocator is present, from Björn.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
== Problem description ==
It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.
Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.
It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).
At the same time it may be useful to apply a policy to container as a
whole.
If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:
* bpf_skb_cgroup_id will return id of immediate cgroup associated with
skb, i.e. if it's a sub-cgroup inside container, it can't be used to
identify container's cgroup;
* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
i.e. if there are N containers on a host and a policy has to be
applied to M of them (0 <= M <= N), it'd require M calls to
bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
load new BPF program.
== Solution ==
The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.
That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.
E.g. if there is a cgroup hierarchy:
root/
root/container1/
root/container1/app11/
root/container1/app11/sub-app-a/
root/container1/app12/
root/container2/
root/container2/app21/
root/container2/app22/
root/container2/app22/sub-app-b/
, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.
Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.
Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.
== Implementation details ==
The helper gets ancestor by walking parents up to specified level.
Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).
Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Extend gen_new_estimator() to also take stats_lock when re-assigning rate
estimator statistics pointer. (to be used by unlocked actions)
Rename 'stats_lock' to 'lock' and change argument description to explain
that it is now also used for control path.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.
It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.
One level of "__reuseport_attach_prog()" call is removed. The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.
The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.
BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].
At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp). At this
point, it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp. It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.
For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb(). Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.
Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.
The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).
In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages. Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb. The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.
The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").
The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk. It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()". Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case). The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want. This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match). The bpf prog can decide if it wants to do SK_DROP as its
discretion.
When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk. If not, the kernel
will select one from "reuse->socks[]" (as before this patch).
The SK_DROP and SK_PASS handling logic will be in the next patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision. In this case, decide which
SO_REUSEPORT sk to serve the incoming request.
By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map. That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).
For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map. The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information. This connection info contact/API stays within the
userspace.
Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process. The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds. [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]
During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used. That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP. These conditions are
ensured in "reuseport_array_update_check()".
A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map). SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.
When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also. It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]". Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.
The reuseport_array will also support lookup from the syscall
side. It will return a sock_gen_cookie(). The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).
The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update. Supporting different
value_size between lookup and update seems unintuitive also.
We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4. The syscall's lookup
will return ENOSPC on value_size=4. It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map. When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map. Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[]. The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]). The map_lookup_elem()
will be lockless.
This patch also adds an ID to sock_reuseport. A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map. It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk, this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).
The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow(). The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.
If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used, the changes in this
patch is a no-op.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()". The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").
The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case. However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.
When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie. For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
was updated.
2) the reuse->socks[] became {sk1, sk2} later. e.g. sk0 was first closed
and then sk2 was added. Here, sk2 does not have ts_recent_stamp set.
There are other ordering that will trigger the similar situation
below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
"tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
all syncookies sent by sk1 will be handled (and rejected)
by sk2 while sk1 is still alive.
The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit. E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).
The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit. e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map. Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.
A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[] OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group. However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.
This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.
"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group. Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.
A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed. Before and
after the change is ~9Mpps in IPv4.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.
This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.
A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.
v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
avoid per-frame copy cost.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.
v6:
- Fix sparse warning around EXPORT_SYMBOL.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is needed for veth XDP which does skb_copy_expand()-like operation.
v2:
- Drop skb_copy_header part because it has already been exported now.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This reverts commit 36e0f12bbf.
The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.
For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.
This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().
Acked-by: Jason Wang <jasowang@redhat.com>
============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(cpu_hotplug_lock.rw_sem);
lock(cpu_hotplug_lock.rw_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by swapper/0/1:
#0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
#1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
#2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
v2: move cpus_read_lock() out of __netif_set_xps_queue()
Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the existing 'dissection' of tunnel metadata to 'dissect'
options already present in tunnel metadata. This dissection is
controlled by a new dissector key, FLOW_DISSECTOR_KEY_ENC_OPTS.
This dissection only occurs when skb_flow_dissect_tunnel_info()
is called, currently only the Flower classifier makes that call.
So there should be no impact on other users of the flow dissector.
This is in preparation for allowing the flower classifier to
match on Geneve options.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>