Instead of linear search, use rhlist interface to look up the objects.
This fixes rulesets with thousands of named objects (quota, counters and
the like).
We only use a single table for this and consider the address of the
table we're doing the lookup in as a part of the key.
This reduces restore time of a sample ruleset with ~20k named counters
from 37 seconds to 0.8 seconds.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a 'key' structure for object, so we can look them up by name + table
combination (the name can be the same in each table).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are four sources of VXLAN switchdev notifier calls:
- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().
Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.
To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes recvmsg() to be able to peek across multiple tls records.
Without this patch, the tls's selftests test case
'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
maintains a 'rx_list' to retain incoming skb carrying tls records. If a
tls record needs to be retained e.g. for peek case or for the case when
the buffer passed to recvmsg() has a length smaller than decrypted
record length, then it is added to 'rx_list'. Additionally, records are
added in 'rx_list' if the crypto operation runs in async mode. The
records are dequeued from 'rx_list' after the decrypted data is consumed
by copying into the buffer passed to recvmsg(). In case, the MSG_PEEK
flag is used in recvmsg(), then records are not consumed or removed
from the 'rx_list'.
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having net/dsa.h contain both the internal switch tree/driver
structures, split the relevant platform_data parts into
include/linux/platform_data/dsa.h and make that header be included by
net/dsa.h in order not to break any setup. A subsequent set of patches
will update code including net/dsa.h to include only the platform_data
header.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is not currently way to infer the port number through sysfs that
is being used as the CPU port number. Overlay a ndo_get_phys_port_name()
operation onto the DSA master network device in order to retrieve that
information.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The changes introduced to allow rxrpc calls to be retried creates an issue
when it comes to refcounting afs_call structs. The problem is that when
rxrpc_send_data() queues the last packet for an asynchronous call, the
following sequence can occur:
(1) The notify_end_tx callback is invoked which causes the state in the
afs_call to be changed from AFS_CALL_CL_REQUESTING or
AFS_CALL_SV_REPLYING.
(2) afs_deliver_to_call() can then process event notifications from rxrpc
on the async_work queue.
(3) Delivery of events, such as an abort from the server, can cause the
afs_call state to be changed to AFS_CALL_COMPLETE on async_work.
(4) For an asynchronous call, afs_process_async_call() notes that the call
is complete and tried to clean up all the refs on async_work.
(5) rxrpc_send_data() might return the amount of data transferred
(success) or an error - which could in turn reflect a local error or a
received error.
Synchronising the clean up after rxrpc_kernel_send_data() returns an error
with the asynchronous cleanup is then tricky to get right.
Mostly revert commit c038a58ccf. The two API
functions the original commit added aren't currently used. This makes
rxrpc_kernel_send_data() always return successfully if it queued the data
it was given.
Note that this doesn't affect synchronous calls since their Rx notification
function merely pokes a wait queue and does not refcounting. The
asynchronous call notification function *has* to do refcounting and pass a
ref over the work item to avoid the need to sync the workqueue in call
cleanup.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 routing tables are flushed in two cases:
1. In response to events in the netdev and inetaddr notification chains
2. When a network namespace is being dismantled
In both cases only routes associated with a dead nexthop group are
flushed. However, a nexthop group will only be marked as dead in case it
is populated with actual nexthops using a nexthop device. This is not
the case when the route in question is an error route (e.g.,
'blackhole', 'unreachable').
Therefore, when a network namespace is being dismantled such routes are
not flushed and leaked [1].
To reproduce:
# ip netns add blue
# ip -n blue route add unreachable 192.0.2.0/24
# ip netns del blue
Fix this by not skipping error routes that are not marked with
RTNH_F_DEAD when flushing the routing tables.
To prevent the flushing of such routes in case #1, add a parameter to
fib_table_flush() that indicates if the table is flushed as part of
namespace dismantle or not.
Note that this problem does not exist in IPv6 since error routes are
associated with the loopback device.
[1]
unreferenced object 0xffff888066650338 (size 56):
comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba....
e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............
backtrace:
[<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
[<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
[<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
[<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
[<0000000014f62875>] netlink_sendmsg+0x929/0xe10
[<00000000bac9d967>] sock_sendmsg+0xc8/0x110
[<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
[<000000002e94f880>] __sys_sendmsg+0xf7/0x250
[<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
[<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<000000003a8b605b>] 0xffffffffffffffff
unreferenced object 0xffff888061621c88 (size 48):
comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_....
backtrace:
[<00000000733609e3>] fib_table_insert+0x978/0x1500
[<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
[<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
[<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
[<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
[<0000000014f62875>] netlink_sendmsg+0x929/0xe10
[<00000000bac9d967>] sock_sendmsg+0xc8/0x110
[<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
[<000000002e94f880>] __sys_sendmsg+0xf7/0x250
[<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
[<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<000000003a8b605b>] 0xffffffffffffffff
Fixes: 8cced9eff1 ("[NETNS]: Enable routing configuration in non-initial namespace.")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the forward chain, the iif is changed from slave device to master vrf
device. Thus, flow offload does not find a match on the lower slave
device.
This patch uses the cached route, ie. dst->dev, to update the iif and
oif fields in the flow entry.
After this patch, the following example works fine:
# ip addr add dev eth0 1.1.1.1/24
# ip addr add dev eth1 10.0.0.1/24
# ip link add user1 type vrf table 1
# ip l set user1 up
# ip l set dev eth0 master user1
# ip l set dev eth1 master user1
# nft add table firewall
# nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
# nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
# nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
# nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KMSAN detected read beyond end of buffer in vti and sit devices when
passing truncated packets with PF_PACKET. The issue affects additional
ip tunnel devices.
Extend commit 76c0ddd8c3 ("ip6_tunnel: be careful when accessing the
inner header") and commit ccfec9e5cb ("ip_tunnel: be careful when
accessing the inner header").
Move the check to a separate helper and call at the start of each
ndo_start_xmit function in net/ipv4 and net/ipv6.
Minor changes:
- convert dev_kfree_skb to kfree_skb on error path,
as dev_kfree_skb calls consume_skb which is not for error paths.
- use pskb_network_may_pull even though that is pedantic here,
as the same as pskb_may_pull for devices without llheaders.
- do not cache ipv6 hdrs if used only once
(unsafe across pskb_may_pull, was more relevant to earlier patch)
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Al Viro mentioned (Message-ID
<20170626041334.GZ10672@ZenIV.linux.org.uk>)
that there is probably a race condition
lurking in accesses of sk_stamp on 32-bit machines.
sock->sk_stamp is of type ktime_t which is always an s64.
On a 32 bit architecture, we might run into situations of
unsafe access as the access to the field becomes non atomic.
Use seqlocks for synchronization.
This allows us to avoid using spinlocks for readers as
readers do not need mutual exclusion.
Another approach to solve this is to require sk_lock for all
modifications of the timestamps. The current approach allows
for timestamps to have their own lock: sk_stamp_lock.
This allows for the patch to not compete with already
existing critical sections, and side effects are limited
to the paths in the patch.
The addition of the new field maintains the data locality
optimizations from
commit 9115e8cd2a ("net: reorganize struct sock for better data
locality")
Note that all the instances of the sk_stamp accesses
are either through the ioctl or the syscall recvmsg.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of removing a empty list node that might be reintroduced soon
thereafter, tentatively place the empty list node on the list passed to
tree_nodes_free(), then re-check if the list is empty again before erasing
it from the tree.
[ Florian: rebase on top of pending nf_conncount fixes ]
Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
'lookup' is always followed by 'add'.
Merge both and make the list-walk part of nf_conncount_add().
This also avoids one unneeded unlock/re-lock pair.
Extra care needs to be taken in count_tree, as we only hold rcu
read lock, i.e. we can only insert to an existing tree node after
acquiring its lock and making sure it has a nonzero count.
As a zero count should be rare, just fall back to insert_tree()
(which acquires tree lock).
This issue and its solution were pointed out by Shawn Bohrer
during patch review.
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patch eedbbb0d98 "net: dccp: initialize (addr,port) ..."
added calling to inet_hashinfo2_init() from dccp_init().
However, inet_hashinfo2_init() is marked as __init(), and
thus the kernel panics when dccp is loaded as module. Removing
__init() tag from inet_hashinfo2_init() is not feasible because
it calls into __init functions in mm.
This patch adds inet_hashinfo2_init_mod() function that can
be called after the init phase is done; changes dccp_init() to
call the new function; un-marks inet_hashinfo2_init() as
exported.
Fixes: eedbbb0d98 ("net: dccp: initialize (addr,port) ...")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for destination MAC in ipset, from Stefano Brivio.
2) Disallow all-zeroes MAC address in ipset, also from Stefano.
3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
introduce protocol version number 7, from Jozsef Kadlecsik.
A follow up patch to fix ip_set_byindex() is also included
in this batch.
4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.
5) Statify nf_flow_table_iterate(), from Taehee Yoo.
6) Use nf_flow_table_iterate() to simplify garbage collection in
nf_flow_table logic, also from Taehee Yoo.
7) Don't use _bh variants of call_rcu(), rcu_barrier() and
synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.
8) Remove NFC_* cache definition from the old caching
infrastructure.
9) Remove layer 4 port rover in NAT helpers, use random port
instead, from Florian Westphal.
10) Use strscpy() in ipset, from Qian Cai.
11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
random port is allocated by default, from Xiaozhou Liu.
12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.
13) Limit port allocation selection routine in NAT to avoid
softlockup splats when most ports are in use, from Florian.
14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
from Yafang Shao.
15) Direct call to nf_nat_l4proto_unique_tuple() instead of
indirection, from Florian Westphal.
16) Several patches to remove all layer 4 NAT indirections,
remove nf_nat_l4proto struct, from Florian Westphal.
17) Fix RTP/RTCP source port translation when SNAT is in place,
from Alin Nastac.
18) Selective rule dump per chain, from Phil Sutter.
19) Revisit CLUSTERIP target, this includes a deadlock fix from
netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
and disallow mismatching IP address and MAC address.
Patchset from Taehee Yoo.
20) Update UDP timeout to stream after 2 seconds, from Florian.
21) Shrink UDP established timeout to 120 seconds like TCP timewait.
22) Sysctl knobs to set GRE timeouts, from Yafang Shao.
23) Move seq_print_acct() to conntrack core file, from Florian.
24) Add enum for conntrack sysctl knobs, also from Florian.
25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
As a side effect, shrink netns_ct structure by removing obsolete
sysctl anchors, also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-12-21
The following pull-request contains BPF updates for your *net-next* tree.
There is a merge conflict in test_verifier.c. Result looks as follows:
[...]
},
{
"calls: cross frame pruning",
.insns = {
[...]
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
{
"jset: functional",
.insns = {
[...]
{
"jset: unknown const compare not taken",
.insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32),
BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "!read_ok",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
[...]
{
"jset: range",
.insns = {
[...]
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.result_unpriv = ACCEPT,
.result = ACCEPT,
},
The main changes are:
1) Various BTF related improvements in order to get line info
working. Meaning, verifier will now annotate the corresponding
BPF C code to the error log, from Martin and Yonghong.
2) Implement support for raw BPF tracepoints in modules, from Matt.
3) Add several improvements to verifier state logic, namely speeding
up stacksafe check, optimizations for stack state equivalence
test and safety checks for liveness analysis, from Alexei.
4) Teach verifier to make use of BPF_JSET instruction, add several
test cases to kselftests and remove nfp specific JSET optimization
now that verifier has awareness, from Jakub.
5) Improve BPF verifier's slot_type marking logic in order to
allow more stack slot sharing, from Jiong.
6) Add sk_msg->size member for context access and add set of fixes
and improvements to make sock_map with kTLS usable with openssl
based applications, from John.
7) Several cleanups and documentation updates in bpftool as well as
auto-mount of tracefs for "bpftool prog tracelog" command,
from Quentin.
8) Include sub-program tags from now on in bpf_prog_info in order to
have a reliable way for user space to get all tags of the program
e.g. needed for kallsyms correlation, from Song.
9) Add BTF annotations for cgroup_local_storage BPF maps and
implement bpf fs pretty print support, from Roman.
10) Fix bpftool in order to allow for cross-compilation, from Ivan.
11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
to be compatible with libbfd and allow for Debian packaging,
from Jakub.
12) Remove an obsolete prog->aux sanitation in dump and get rid of
version check for prog load, from Daniel.
13) Fix a memory leak in libbpf's line info handling, from Prashant.
14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
does not get unaligned, from Jesper.
15) Fix test_progs kselftest to work with older compilers which are less
smart in optimizing (and thus throwing build error), from Stanislav.
16) Cleanup and simplify AF_XDP socket teardown, from Björn.
17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
to netns_id argument, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
remove the obsolete sysctl anchors and move auto_assign_helper_warned
to avoid/cover a hole. Reduces size by 40 bytes on 64 bit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
after moving sysctl handling into single place, the init functions
can't fail anymore and some of the fini functions are empty.
Remove them and change return type to void.
This also simplifies error unwinding in conntrack module init path.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.
Add a two second grace period where we continue to use the ordinary
30-second default timeout. Its enough for DNS request/response traffic,
even if two request/reply packets are involved.
ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zero's flags,
./fs/read_write.c:
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
...
fl = 0;
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
}
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, struct splice_desc *sd)
{
...
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
...
}
Confirming what we expect that internal flags are in fact internal
to socket side.
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.
Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
ip l add dev tun type gretap external
ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap
For gretap Key example when the command set the id but don't set the
TUNNEL_KEY flags. There is no key field in the send packet
In the lwtunnel situation, some TUNNEL_FLAGS should can be set by
userspace
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch registers neigh doit handler. The doit handler
returns a neigh entry given dst and dev. This is similar
to route and fdb doit (get) handlers. Also moves nda_policy
declaration from rtnetlink.c to neighbour.c
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove skb->sp and allocate secpath storage via extension
infrastructure. This also reduces sk_buff by 8 bytes on x86_64.
Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).
No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will reduce noise when skb->sp is removed later in this series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type
same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.
This avoids direct access to skb->sp:
err = secpath_set(skb);
if (!err) ..
skb->sp-> ...
Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..
This reduces noise in followup patch which is going to remove skb->sp.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.
The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
* virt_wifi - wireless control simulation on top of
another network interface
* hwsim configurability to test capabilities similar
to real hardware
* various mesh improvements
* various radiotap vendor data fixes in mac80211
* finally the nl_set_extack_cookie_u64() we talked
about previously, used for
* peer measurement APIs, right now only with FTM
(flight time measurement) for location
* made nl80211 radio/interface announcements more complete
* various new HE (802.11ax) things:
updates, TWT support, ...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlwaCwkACgkQB8qZga/f
l8S7mA/+I1CJmGC7Pvy+SBFkzoY5zEjjzgZYL6sGo16qMs89NPcURSe5j+uCsDP3
nKEjsvhQMYDfGNLTJJfWbDpGwm9LnKp69AFITlvfzmP6Sm36QMZr7oIC4abi8cW4
osaO3qfdaNoZ//x72jgjrFhUAnphvT2BsRVMNEjz7sXcDd7Jm9NnpRhV8zgXFvLF
dS2Ng51LM/BLMz5jQpyJUDZeeL/iBYybCecyckmVqzXPh1icIZETSqZXiN4ngv2A
6p9BSGNtP6wmjnbkvZz5RDq76VhTPZWsTgTpVb45Wf1k2fm1rB96UgpqvfQtjTgB
+7Zx2WRpMXM5OjGkwaEs8nawFmt7MHCGnhLPLWPCbXc685fhp3OFShysMJdYS/GZ
IIRJ7+IchAQX1yluftB+NkQM9sBDjyseMBwxHRYkj/rQVhoLY1sT+ke7lkuV10o6
DQqfpUTZAsIz7zkuscn7hkNdI/Rjub6BZjbrs1Jt9zSt9WQUBao23XudOI0j5JDa
ErnfC5PISXMQWik5B9M1Zhq3H9qCI2Swh19lMmtxtSDQ9yrLrJkEJ5SA+aHoxNHj
wSxBc3XXSW47qPXGX/D5DNnbOcOrE7kVZuD8YqRsy8VedyjIgEw7oQ21flAD4FC4
R4TgbNkqpfZQsU29gaMkDkYXnfQDB/G9FOk6ARGxjBPjT55Hz0E=
=EpyK
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2018-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
This time we have too many changes to list, highlights:
* virt_wifi - wireless control simulation on top of
another network interface
* hwsim configurability to test capabilities similar
to real hardware
* various mesh improvements
* various radiotap vendor data fixes in mac80211
* finally the nl_set_extack_cookie_u64() we talked
about previously, used for
* peer measurement APIs, right now only with FTM
(flight time measurement) for location
* made nl80211 radio/interface announcements more complete
* various new HE (802.11ax) things:
updates, TWT support, ...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2018-12-18
1) Fix error return code in xfrm_output_one()
when no dst_entry is attached to the skb.
From Wei Yongjun.
2) The xfrm state hash bucket count reported to
userspace is off by one. Fix from Benjamin Poirier.
3) Fix NULL pointer dereference in xfrm_input when
skb_dst_force clears the dst_entry.
4) Fix freeing of xfrm states on acquire. We use a
dedicated slab cache for the xfrm states now,
so free it properly with kmem_cache_free.
From Mathias Krause.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2018-12-18
1) Add xfrm policy selftest scripts.
From Florian Westphal.
2) Split inexact policies into four different search list
classes and use the rbtree infrastructure to store/lookup
the policies. This is to improve the policy lookup
performance after the flowcache removal.
Patches from Florian Westphal.
3) Various coding style fixes, from Colin Ian King.
4) Fix policy lookup logic after adding the inexact policy
search tree infrastructure. From Florian Westphal.
5) Remove a useless remove BUG_ON from xfrm6_dst_ifdown.
From Li RongQing.
6) Use the correct policy direction for lookups on hash
rebuilding. From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TWT is a feature that was added in 11ah and enhanced in
11ax. There are two bits that need to be set if we want
to use the feature in 11ax: one in the HE Capability IE
and one in the Extended Capability IE. This is because
of backward compatibility between 11ah and 11ax.
In order to simplify the flow for the low level driver
in managed mode, aggregate the two bits and add a boolean
that tells whether TWT is supported or not, but only if
11ax is supported.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the iwlwifi conversion, we sometimes call this from outside
of the wake_tx_queue() method, and in those cases must be in an
RCU critical section. Document this requirement.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The older code and current userspace assumed that this data
is the content of the Measurement Report element, starting
with the Measurement Token. Clarify this in the documentation.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets.
But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.
Add skb_setup_tx_timestamp that configures both tx_flags and tskey
for these paths that do not need corking or use bytestream keys.
Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the (now empty) nf_nat_l4proto struct, all its instances
and all the no longer needed runtime (un)register functionality.
nf_nat_need_gre() can be axed as well: the module that calls it (to
load the no-longer-existing nat_gre module) also calls other nat core
functions. GRE nat is now always available if kernel is built with it.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.
nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.
Next patch will remove the empty files and the nf_nat_l4proto
struct.
nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so
just call it directly.
The important difference is that we'll now also call it for
protocols that we don't support (i.e., nf_nat_proto_unknown did
not provide .nlattr_to_range).
However, there should be no harm, even icmp provided this callback.
If we don't implement a specific l4nat for this, nothing would make
use of this information, so adding a big switch/case construct listing
all supported l4protocols seems a bit pointless.
This change leaves a single function pointer in the l4proto struct.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With exception of icmp, all of the l4 nat protocols set this to
nf_nat_l4proto_in_range.
Get rid of this and just check the l4proto in the caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for indirections here, we only support ipv4 and ipv6
and the called functions are very small.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple.
The static-save of old incarnation of resolved key in gre and icmp is
removed as well, just use the prandom based offset like the others.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
almost all l4proto->unique_tuple implementations just call this helper,
so make ->unique_tuple() optional and call its helper directly if the
l4proto doesn't override it.
This is an intermediate step to get rid of ->unique_tuple completely.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Historically this was net_random() based, and was then converted to
a hash based algorithm (private boot seed + hash of endpoint addresses)
due to concerns of leaking net_random() bits.
RANDOM_FULLY mode was added later to avoid problems with hash
based mode (see commit 34ce324019,
"netfilter: nf_nat: add full port randomization support" for details).
Just make prandom_u32() the default search starting point and get rid of
->secure_port() altogether.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Rename the tag Kconfig option and related macros in preparation for
addition of new KSZ family switches with different tag formats.
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Cc: Woojung Huh <woojung.huh@microchip.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to routes and rules, add protocol attribute to neighbor entries
for easier tracking of how each was created.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence on UDP, so that we have a single
additional conditional in the common case.
When IPV6 is build as module, all gro symbols except UDPv6 are
builtin, while the latter belong to the ipv6 module, so we
need some special care.
v1 -> v2:
- adapted to INDIRECT_CALL_ changes
v2 -> v3:
- fix build issue with CONFIG_IPV6=m
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids an indirect calls for L3 GRO receive path, both
for ipv4 and ipv6, if the latter is not compiled as a module.
Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.
v1 -> v2:
- adapted to INDIRECT_CALL_ changes
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move arp_queue_len_bytes ahead of arp_queue to remove two 4-byte holes.
Ensure ha element is always 8-byte aligned.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_update_ext_learned has one caller in neighbour.c so does not need
to be defined in the header. Move it and in the process remove the
intialization of ndm_flags and just set it based on the flags check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 69bd48404f ("net/sched: Remove egdev mechanism"),
tc_setup_cb_call() is nearly identical to tcf_block_cb_call(),
so we can just fold tcf_block_cb_call() into tc_setup_cb_call()
and remove its unused parameter 'exts'.
Fixes: 69bd48404f ("net/sched: Remove egdev mechanism")
Cc: Oz Shlomo <ozsh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HW unhash within mutex for registered tls devices cause sleep
when called from tcp_set_state for TCP_CLOSE. Release lock and
re-acquire after function call with ref count incr/dec.
defined kref and fp release for tls_device to ensure device
is not released outside lock.
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:748
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/7
INFO: lockdep is turned off.
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G W O
Call Trace:
<IRQ>
dump_stack+0x5e/0x8b
___might_sleep+0x222/0x260
__mutex_lock+0x5c/0xa50
? vprintk_emit+0x1f3/0x440
? kmem_cache_free+0x22d/0x2a0
? tls_hw_unhash+0x2f/0x80
? printk+0x52/0x6e
? tls_hw_unhash+0x2f/0x80
tls_hw_unhash+0x2f/0x80
tcp_set_state+0x5f/0x180
tcp_done+0x2e/0xe0
tcp_rcv_state_process+0x92c/0xdd3
? lock_acquire+0xf5/0x1f0
? tcp_v4_rcv+0xa7c/0xbe0
? tcp_v4_do_rcv+0x70/0x1e0
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers use switchdev_handle_port_obj_add() to handle recursive descent
through lower devices. Change this function prototype to take add_cb
that itself takes an extack argument. Decode extack from
switchdev_notifier_port_obj_info and pass it to add_cb.
Update mlxsw and ocelot drivers which use this helper.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to pass extack to the drivers that need it, add an extack field
to struct switchdev_notifier_info, and an extack argument to the
function call_switchdev_blocking_notifiers(). Also add a helper function
switchdev_notifier_info_to_extack().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, bridge driver has extack argument available to
pass to switchdev. Therefore extend switchdev_port_obj_add() with this
argument, updating all callers, and passing the argument through to
switchdev_port_obj_notify().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The egdev mechanism was replaced by the TC indirect block notifications
platform.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Changed the is_gretap_dev and is_ip6gretap_dev logic from structure
comparison to string comparison of the rtnl_link_ops kind field.
This approach aligns with the current identification methods and function
names of vxlan and geneve network devices.
Convert mlxsw to use these helpers and use them in downstream mlx5 patch.
Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Trivial fix to clean up indentation issue, remove an extraneous
space.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Several conflicts, seemingly all over the place.
I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.
The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.
The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.
cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.
__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy. Or at least I think it was :-)
Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.
The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.
Signed-off-by: David S. Miller <davem@davemloft.net>
While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size, it will proceed normally in case we
copy more than that because of alignment, and we'll silently corrupt
adjacent slabs.
In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.
Always check we're not writing before skb->head and, if the headroom is
not enough, warn and drop the packet.
v2:
- instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
(Eric Dumazet)
- if we avoid the panic, though, we need to explicitly check the headroom
before the memcpy(), otherwise we'll have corrupted slabs on a running
kernel, after we warn
- use __skb_push() instead of skb_push(), as the headroom check is
already implemented here explicitly (Eric Dumazet)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing garbage collection algorithm has a number of problems:
1. The gc algorithm will not evict PERMANENT entries as those entries
are managed by userspace, yet the existing algorithm walks the entire
hash table which means it always considers PERMANENT entries when
looking for entries to evict. In some use cases (e.g., EVPN) there
can be tens of thousands of PERMANENT entries leading to wasted
CPU cycles when gc kicks in. As an example, with 32k permanent
entries, neigh_alloc has been observed taking more than 4 msec per
invocation.
2. Currently, when the number of neighbor entries hits gc_thresh2 and
the last flush for the table was more than 5 seconds ago gc kicks in
walks the entire hash table evicting *all* entries not in PERMANENT
or REACHABLE state and not marked as externally learned. There is no
discriminator on when the neigh entry was created or if it just moved
from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).
It is possible for entries to be created or for established neighbor
entries to be moved to STALE (e.g., an external node sends an ARP
request) right before the 5 second window lapses:
-----|---------x|----------|-----
t-5 t t+5
If that happens those entries are evicted during gc causing unnecessary
thrashing on neighbor entries and userspace caches trying to track them.
Further, this contradicts the description of gc_thresh2 which says
"Entries older than 5 seconds will be cleared".
One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
whole point of having separate thresholds.
3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
when gc_thresh2 is exceeded is over kill and contributes to trashing
especially during startup.
This patch addresses these problems as follows:
1. Use of a separate list_head to track entries that can be garbage
collected along with a separate counter. PERMANENT entries are not
added to this list.
The gc_thresh parameters are only compared to the new counter, not the
total entries in the table. The forced_gc function is updated to only
walk this new gc_list looking for entries to evict.
2. Entries are added to the list head at the tail and removed from the
front.
3. Entries are only evicted if they were last updated more than 5 seconds
ago, adhering to the original intent of gc_thresh2.
4. Forced gc is stopped once the number of gc_entries drops below
gc_thresh2.
5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
when allocating a new neighbor for a PERMANENT entry. By extension this
means there are no explicit limits on the number of PERMANENT entries
that can be created, but this is no different than FIB entries or FDB
entries.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a driver unoffloads all FDB entries en bloc, it's inefficient to
send the switchdev notification one by one. Add a helper that walks the
FDB table, unsetting the offload flag on RDST with a given VNI.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a VXLAN device becomes relevant to a driver (such as when it is
attached to an offloaded bridge), the driver will generally need to walk
the existing FDB entries and offload them.
Add a function vxlan_fdb_replay() to call a given notifier block for
each FDB entry with a given VNI.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each DSA tag protocol needs to add additional headers to the Ethernet
frame in order to direct it towards a specific switch egress port. It
must also remove the head from a frame received from a
switch. Indicate the maximum size of these headers in the tag protocol
ops structure, so the core can take these overheads into account.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
If for some reason an association's fragmentation point is zero,
sctp_datamsg_from_user will try to endlessly try to divide a message
into zero-sized chunks. This eventually causes kernel panic due to
running out of memory.
Although this situation is quite unlikely, it has occurred before as
reported. I propose to add this simple last-ditch sanity check due to
the severity of the potential consequences.
Signed-off-by: Jakub Audykowicz <jakub.audykowicz@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
as a step to enable bigger tcp sndbuf limits.
It works reasonably well, but the following happens :
Once the limit is reached, TCP stack generates
an [E]POLLOUT event for every incoming ACK packet.
This causes a high number of context switches.
This patch implements the strategy David Miller added
in sock_def_write_space() :
- If TCP socket has a notsent_lowat constraint of X bytes,
allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
only if number of notsent bytes is below X/2
This considerably reduces TCP_NOTSENT_LOWAT overhead,
while allowing to keep the pipe full.
Tested:
100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096 262144 64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 256220672 13532 694976 0 0 10 0 28 14 0 1 99 0 0
2 0 0 256320016 13532 698480 0 0 512 0 715901 5927 0 10 90 0 0
0 0 0 256197232 13532 700992 0 0 735 13 771161 5849 0 11 89 0 0
1 0 0 256233824 13532 703320 0 0 512 23 719650 6635 0 11 89 0 0
2 0 0 256226880 13532 705780 0 0 642 4 775650 6009 0 12 88 0 0
A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
A:/# vmstat 5 5 # check that context switches have not inflated too much.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 260386512 13592 662148 0 0 10 0 17 14 0 1 99 0 0
0 0 0 260519680 13592 604184 0 0 512 13 726843 12424 0 10 90 0 0
1 1 0 260435424 13592 598360 0 0 512 25 764645 12925 0 10 90 0 0
1 0 0 260855392 13592 578380 0 0 512 7 722943 13624 0 11 88 0 0
1 0 0 260445008 13592 601176 0 0 614 34 772288 14317 0 10 90 0 0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
a transport's asoc under rcu_read_lock while asoc is freed not after
a grace period, which leads to a use-after-free panic.
This patch fixes it by calling kfree_rcu to make asoc be freed after
a grace period.
Note that only the asoc's memory is delayed to free in the patch, it
won't cause sk to linger longer.
Thanks Neil and Marcelo to make this clear.
Fixes: 7fda702f93 ("sctp: use new rhlist interface on sctp transport rhashtable")
Fixes: cd2b708750 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Existing functions to retreive the l3mdev of a device did not walk the
master chain to find the upper master. This patch adds a function to
find the l3mdev, even indirect through e.g. a bridge:
+----------+
| |
| vrf-blue |
| |
+----+-----+
|
|
+----+-----+
| |
| br-blue |
| |
+----+-----+
|
|
+----+-----+
| |
| eth0 |
| |
+----------+
This will properly resolve the l3mdev of eth0 to vrf-blue.
Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP tunnel sockets are always opened unbound to a specific device. This
patch allow the socket to be bound on a custom device, which
incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.
Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many drivers load the device's firmware image during the initialization
flow either from the flash or from the disk. Currently this option is not
controlled by the user and the driver decides from where to load the
firmware image.
'fw_load_policy' gives the ability to control this option which allows the
user to choose between different loading policies supported by the driver.
This parameter can be useful while testing and/or debugging the device. For
example, testing a firmware bug fix.
Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a leftover from days where single-cpu systems were common:
Store last port used to resolve a clash to use it as a starting point when
the next conflict needs to be resolved.
When we have parallel attempt to connect to same address:port pair,
its likely that both cores end up computing the same "available" port,
as both use same starting port, and newly used ports won't become
visible to other cores until the conntrack gets confirmed later.
One of the cores then has to drop the packet at insertion time because
the chosen new tuple turns out to be in use after all.
Lets simplify this: remove port rover and use a pseudo-random starting
point.
Note that this doesn't make netfilter default to 'fully random' mode;
the 'rover' was only used if NAT could not reuse source port as-is.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Most linux hosts never setup TCP MD5 keys. We can avoid a
cache line miss (accessing tp->md5ig_info) on RX and TX
using a jump label.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jean-Louis Dupond reported poor iscsi TCP receive performance
that we tracked to backlog drops.
Apparently we fail to send window updates reflecting the
fact that we are under stress.
Note that we might lack a proper window increase when
backlog is fully processed, since __release_sock() clears
sk->sk_backlog.len _after_ all skbs have been processed.
This should not matter in practice. If we had a significant
load through socket backlog, we are in a dangerous
situation.
Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Jean-Louis Dupond<jean-louis@dupond.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tell the compiler that most TCP flows are using SACK these days.
There is no need to add the unlikely() clause in tcp_is_reno(),
the compiler is able to infer it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no such structs flow_dissector_key_flow_vlan or
flow_dissector_key_flow_tags, the actual structs used are struct
flow_dissector_key_vlan and struct flow_dissector_key_tags. So correct the
comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and
FLOW_DISSECTOR_KEY_CVLAN to refer to those.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Disable BH while holding list spinlock in nf_conncount, from
Taehee Yoo.
2) List corruption in nf_conncount, also from Taehee.
3) Fix race that results in leaving around an empty list node in
nf_conncount, from Taehee Yoo.
4) Proper chain handling for inactive chains from the commit path,
from Florian Westphal. This includes a selftest for this.
5) Do duplicate rule handles when replacing rules, also from Florian.
6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee.
7) Possible use-after-free in nft_compat when releasing extensions.
From Florian.
8) Memory leak in xt_hashlimit, from Taehee.
9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long.
10) Fix cttimeout with udplite and gre, from Florian.
11) Preserve oif for IPv6 link-local generated traffic from mangle
table, from Alin Nastac.
12) Missing error handling in masquerade notifiers, from Taehee Yoo.
13) Use mutex to protect registration/unregistration of masquerade
extensions in order to prevent a race, from Taehee.
14) Incorrect condition check in tree_nodes_free(), also from Taehee.
15) Fix chain counter leak in rule replacement path, from Taehee.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
this field from all clients, which were migrated to use switchdev
notification in the previous patches.
Add a new function switchdev_port_obj_notify() that sends the switchdev
notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.
Update switchdev_port_obj_del_now() to dispatch to this new function.
Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
likewise.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the transition from switchdev operations to notifier chain (which
will take place in following patches), the onus is on the driver to find
its own devices below possible layer of LAG or other uppers.
The logic to do so is fairly repetitive: each driver is looking for its
own devices among the lowers of the notified device. For those that it
finds, it calls a handler. To indicate that the event was handled,
struct switchdev_notifier_port_obj_info.handled is set. The differences
lie only in what constitutes an "own" device and what handler to call.
Therefore abstract this logic into two helpers,
switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
a driver only supports physical ports under a bridge device, it will
simply avoid this layer of indirection.
One area where this helper diverges from the current switchdev behavior
is the case of mixed lowers, some of which are switchdev ports and some
of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
The helper could do that for lowers for which the passed-in predicate
doesn't hold. That would however break the case that switchdev ports
from several different drivers are stashed under one master, a scenario
that switchdev currently happily supports. Therefore tolerate any and
all unknown netdevices, whether they are backed by a switchdev driver
or not.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An offloading driver may need to have access to switchdev events on
ports that aren't directly under its control. An example is a VXLAN port
attached to a bridge offloaded by a driver. The driver needs to know
about VLANs configured on the VXLAN device. However the VXLAN device
isn't stashed between the bridge and a front-panel-port device (such as
is the case e.g. for LAG devices), so the usual switchdev ops don't
reach the driver.
VXLAN is likely not the only device type like this: in theory any L2
tunnel device that needs offloading will prompt requirement of this
sort. This falsifies the assumption that only the lower devices of a
front panel port need to be notified to achieve flawless offloading.
A way to fix this is to give up the notion of port object addition /
deletion as a switchdev operation, which assumes somewhat tight coupling
between the message producer and consumer. And instead send the message
over a notifier chain.
To that end, introduce two new switchdev notifier types,
SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
communicate the same event as the corresponding switchdev op, except in
a form of a notification. struct switchdev_notifier_port_obj_info was
added to carry the fields that the switchdev op carries. An additional
field, handled, will be used to communicate back to switchdev that the
event has reached an interested party, which will be important for the
two-phase commit.
The two switchdev operations themselves are kept in place. Following
patches first convert individual clients to the notifier protocol, and
only then are the operations removed.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In general one can't assume that a switchdev notifier is called in a
non-atomic context, and correspondingly, the switchdev notifier chain is
an atomic one.
However, port object addition and deletion messages are delivered from a
process context. Even the MDB addition messages, whose delivery is
scheduled from atomic context, are queued and the delivery itself takes
place in blocking context. For VLAN messages in particular, keeping the
blocking nature is important for error reporting.
Therefore introduce a blocking notifier chain and related service
functions to distribute the notifications for which a blocking context
can be assumed.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB()
expand to a container_of() call, yielding an appropriate container of
their sole argument. However, due to a name collision, the first
argument, i.e. the contained object pointer, is not the only one to get
expanded. The third argument, which is a structure member name, and
should be kept literal, gets expanded as well. The only safe way to use
these two macros is therefore to name the local variable passed to them
"obj".
To fix this, rename the sole argument of the two macros from
"obj" (which collides with the member name) to "OBJ". Additionally,
instead of passing "OBJ" to container_of() verbatim, parenthesize it, so
that a comma in the passed-in expression doesn't pollute the
container_of() invocation.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 565f0fa902 ("xfrm: use a dedicated slab cache for struct
xfrm_state") moved xfrm state objects to use their own slab cache.
However, it missed to adapt xfrm_user to use this new cache when
freeing xfrm states.
Fix this by introducing and make use of a new helper for freeing
xfrm_state objects.
Fixes: 565f0fa902 ("xfrm: use a dedicated slab cache for struct xfrm_state")
Reported-by: Pan Bian <bianpan2016@163.com>
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In order to allow devices to signal learning events to VXLAN, introduce
two new switchdev messages: SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE and
SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE.
Listen to these notifications in the vxlan driver. The FDB entries
learned this way have an NTF_EXT_LEARNED flag, and only entries marked
as such can be unlearned by the _DEL_ event. They are also immediately
marked as offloaded. This is the same behavior that the bridge driver
observes.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VXLAN driver needs to differentiate between FDB entries learned by
the VXLAN driver, and those added by the user. The latter ones shouldn't
be taken over by external learning events. This is in accordance with
bridge behavior.
Therefore, extend the flags bitfield to 16 bits and add a new private
NTF flag to mark the user-added entries.
This seems preferable to adding a dedicated boolean, because passing the
flag, unlike passing e.g. a true, makes it clear what the meaning of the
bit is.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of egress offloads the class/flowid assigned by the filter
may be very important for offloaded Qdisc selection. Provide this
info to drivers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow drivers which offload GRED to report back statistics. Since
A lot of GRED stats is fairly ad hoc in nature pass to drivers the
standard struct gnet_stats_basic/gnet_stats_queue pairs, and
untangle the values in the core.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add basic offload for the GRED Qdisc. Inform the drivers any
time Qdisc or virtual queue configuration changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 22d7be267e.
The dst's mtu in transport can be updated by a non sctp place like
in xfrm where the MTU information didn't get synced between asoc,
transport and dst, so it is still needed to do the pmtu check
in sctp_packet_config.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp_event is a structure name defined in RFC for sockopt
SCTP_EVENT. To avoid the conflict, rename it.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The member subscribe should be per asoc, so that sockopt SCTP_EVENT
in the next patch can subscribe a event from one asoc only.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytpes), instead of struct
sctp_event_subscribe (13 bytes).
Note that sctp_event_subscribe is an UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This structure is small (12 or 16 bytes depending on 64bit
or 32bit kernels), but we do not want it spanning two cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we can use bpf or tcp tracepoint to conveniently trace the tcp
state transition at the run time.
So we don't need to do this stuff at the compile time anymore.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__tcp_checksum_complete() is 100% same with __skb_checksum_complete()
and there is no other caller except tcp_checksum_complete().
So, just use __skb_checksum_complete() there.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The life-checking function, which is used by kAFS to make sure that a call
is still live in the event of a pending signal, only samples the received
packet serial number counter; it doesn't actually provoke a change in the
counter, rather relying on the server to happen to give us a packet in the
time window.
Fix this by adding a function to force a ping to be transmitted.
kAFS then keeps track of whether there's been a stall, and if so, uses the
new function to ping the server, resetting the timeout to allow the reply
to come back.
If there's a stall, a ping and the call is *still* stalled in the same
place after another period, then the call will be aborted.
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Fixes: f4d15fb6f9 ("rxrpc: Provide functions for allowing cleaner handling of signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RED qdisc's limit parameter changes the behaviour of the qdisc,
for instance if it's set to 0 qdisc will drop all the packets.
When replace operation happens and parameter is set to non-0
a new fifo qdisc will be instantiated and replace the old child
qdisc which will be destroyed.
Drivers need to know the parameter, even if they don't impose
the actual limit to be able to reliably reconstruct the Qdisc
hierarchy.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from MQ.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path. This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused. Send a graft notification from RED. The drivers
are expected to simply stop offloading the Qdisc, if a non-standard
child is ever grafted onto it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers are currently not notified when a Qdisc is grafted as root.
This requires special casing Qdiscs added with parent = TC_H_ROOT in
the driver. Also there is no notification sent to the driver when
an existing Qdisc is grafted as root.
Add this very simple notifications, drivers should now be able to
track their Qdisc tree fully.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When socks' sk_reuseport is set, the same port and address are allowed
to be bound into these socks who have the same uid.
Note that the difference from sk_reuse is that it allows multiple socks
to listen on the same port and address.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a part of sk_reuseport support for sctp. It defines a helper
sctp_bind_addrs_check() to check if the bind_addrs in two socks are
matched. It will add sock_reuseport if they are completely matched,
and return err if they are partly matched, and alloc sock_reuseport
if all socks are not matched at all.
It will work until sk_reuseport support is added in
sctp_get_port_local() in the next patch.
v1->v2:
- use 'laddr->valid && laddr2->valid' check instead as Marcelo
pointed in sctp_bind_addrs_check().
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_flow_table_iterate() is local function, make it static.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently drivers can register to receive TC block bind/unbind callbacks
by implementing the setup_tc ndo in any of their given netdevs. However,
drivers may also be interested in binds to higher level devices (e.g.
tunnel drivers) to potentially offload filters applied to them.
Introduce indirect block devs which allows drivers to register callbacks
for block binds on other devices. The callback is triggered when the
device is bound to a block, allowing the driver to register for rules
applied to that block using already available functions.
Freeing an indirect block callback will trigger an unbind event (if
necessary) to direct the driver to remove any offloaded rules and unreg
any block rule callbacks. It is the responsibility of the implementing
driver to clean any registered indirect block callbacks before exiting,
if the block it still active at such a time.
Allow registering an indirect block dev callback for a device that is
already bound to a block. In this case (if it is an ingress block),
register and also trigger the callback meaning that any already installed
rules can be replayed to the calling driver.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds inexact lists per destination network, stored in a search tree.
Inexact lookups now return two 'candidate lists', the 'any' policies
('any' destionations), and a list of policies that share same
daddr/prefix.
Next patch will add a second search tree for 'saddr:any' policies
so we can avoid placing those on the 'any:any' list too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
At this time inexact policies are all searched in-order until the first
match is found. After removal of the flow cache, this resolution has
to be performed for every packetm resulting in major slowdown when
number of inexact policies is high.
This adds infrastructure to later sort inexact policies into a tree.
This only introduces a single class: any:any.
Next patch will add a search tree to pre-sort policies that
have a fixed daddr/prefixlen, so in this patch the any:any class
will still be used for all policies.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Switch packet-path lookups for inexact policies to rhashtable.
In this initial version, we now no longer need to search policies with
non-matching address family and type.
Next patch will add the if_id as well so lookups from the xfrm interface
driver only need to search inexact policies for that device.
Future patches will augment the hlist in each rhash bucket with a tree
and pre-sort policies according to daddr/prefix.
A single rhashtable is used. In order to avoid a full rhashtable walk on
netns exit, the bins get placed on a pernet list, i.e. we add almost no
cost for network namespaces that had no xfrm policies.
The inexact lists are kept in place, and policies are added to both the
per-rhash-inexact list and a pernet one.
The latter is needed for the control plane to handle migrate -- these
requests do not consider the if_id, so if we'd remove the inexact_list
now we would have to search all hash buckets and then figure
out which matching policy candidate is the most recent one -- this appears
a bit harder than just keeping the 'old' inexact list for this purpose.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When peering is in userspace, some implementations may want to control
which peers are accepted based on RSSI in addition to the information
elements being sent today. Add signal level so that info is available
to clients.
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When userspace is controlling mesh routing, it may have better
knowledge about whether a mesh STA is connected to a mesh
gate than the kernel mpath table. Add dot11MeshConnectedToMeshGate
to the mesh config so that such applications can explicitly
signal that a mesh STA is connected to a gate, which will then
be advertised in the beacon.
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Capture the current state of gate connectivity from the mesh
formation field in mesh config whenever we receive a beacon,
and report that via GET_STATION. This allows applications
doing mesh peering in userspace to make peering decisions
based on peers' current upstream connectivity.
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some cases, like in the rsi driver hardware scan offload, there
may be scenarios in which hardware scan might not be available or
desirable.
Allow drivers to cope with this by letting them fall back to software
scan by returning the special value 1 from the hardware scan method.
Requested-by: Sushant Kumar Mishra <sushant2k1513@gmail.com>
Requested-by: Siva Rebbagondla <siva.rebbagondla@redpinesignals.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's nothing much for mac80211 to do, so only pass through
the requests with minimal checks and tracing. The driver must
call cfg80211's results APIs.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a new "peer measurement" API, that can be used to measure
certain things related to a peer. Right now, only implement
FTM (flight time measurement) over it, but the idea is that
it'll be extensible to also support measuring the necessary
things to calculate e.g. angle-of-arrival for WiGig.
The API is structured to have a generic list of peers and
channels to measure with/on, and then for each of those a
set of measurements (again, only FTM right now) to perform.
Results are sent to the requesting socket, including a final
complete message.
Closing the controlling netlink socket will abort a running
measurement.
v3:
- add a bit to report "final" for partial results
- remove list keeping etc. and just unicast out the results
to the requester (big code reduction ...)
- also send complete message unicast, and as a result
remove the multicast group
- separate out struct cfg80211_pmsr_ftm_request_peer
from struct cfg80211_pmsr_request_peer
- document timeout == 0 if no timeout
- disallow setting timeout nl80211 attribute to 0,
must not include attribute for no timeout
- make MAC address randomization optional
- change num bursts exponent default to 0 (1 burst, rather
rather than the old default of 15==don't care)
v4:
- clarify NL80211_ATTR_TIMEOUT documentation
v5:
- remove unnecessary nl80211 multicast/family changes
- remove partial results bit/flag, final is sufficient
- add max_bursts_exponent, max_ftms_per_burst to capability
- rename "frames per burst" -> "FTMs per burst"
v6:
- rename cfg80211_pmsr_free_wdev() to cfg80211_pmsr_wdev_down()
and call it in leave, so the device can't go down with any
pending measurements
v7:
- wording fixes (Lior)
- fix ftm.max_bursts_exponent to allow having the limit of 0 (Lior)
v8:
- copyright statements
- minor coding style fixes
- fix error path leak
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
To mirror software behaviour on offload more precisely inform
the drivers about the state of the harddrop flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.
Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VXLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.
Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.
v2:
- Name all arguments in err_handler prototypes (David Miller)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.
For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.
This has no effect on existing error handling paths.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
DF is configured to be inherited, always set it.
For IPv4, inheriting DF from the inner header was probably intended from
the very beginning judging by the comment to vxlan_xmit(), but it wasn't
actually implemented -- also because it would have done more harm than
good, without handling for ICMP Fragmentation Needed messages.
According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
SHOULD set the DF-bit [...]", whatever that means.
Given this background, the only sane option is probably to let the user
decide, and keep the current behaviour as default.
This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.
v2:
- DF behaviour configuration only applies for non-lwt tunnels, move DF
setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VXLAN and GENEVE.
Actually, lwtunnels could break this assumption if they are configured by
an external control plane to have different destination ports on the
endpoints: in this case, we won't be able to trace ICMP messages back to
them.
For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.
Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.
v2:
- Added newline between network and transport header sets in
__udp{4,6}_lib_err_encap() (David Miller)
- Removed redundant skb_reset_network_header(skb); in
__udp4_lib_err_encap()
- Removed redundant reassignment of iph in __udp4_lib_err_encap()
(Sabrina Dubroca)
- Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
won't work with lwtunnels configured to use asymmetric ports. By the way,
it's VXLAN, not VxLAN (Jiri Benc)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc graft operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc dump operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function to determine if the type of a netdev is geneve based
on its rtnl_link_ops. This allows drivers that may wish to offload tunnels
to check the underlying type of the device.
A recent patch added a similar helper to vxlan.h
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some scenarios, the GRO engine can assemble an UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue explicitly checking for the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.
We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.
Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats is case of failure.
rfc v3 -> v1
- fix compile issues with rxrpc
- when gso_segment returns NULL, treat is as an error
- added 'ipv4' argument to udp_rcv_segment()
rfc v2 -> rfc v3
- moved udp_rcv_segment() into net/udp.h, account errors to socket
and ns, always return NULL or segs list
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can re-use it at the UDP level in the next patch
rfc v3 -> v1:
- add the helper declaration into the ipv6 header
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can re-use it at the UDP level in a later patch
rfc v3 -> v1
- add the helper declaration into the ip header
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulations type, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.
This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.
rfc v3 -> v1:
- add socket lock around udp_tunnel_encap_enable()
rfc v2 -> rfc v3:
- use udp_tunnel_encap_enable() in setsockopt()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there exist a pair of raw sockets one unbound and one bound
to a VRF but equal in all other respects, when a packet is received
in the VRF context, __raw_v4_lookup() matches on both sockets.
This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.
In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().
Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.
Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
for datagram sockets. Have this default to enabled for reasons of
backwards compatibility. This is so as to specify the output device
with cmsg and IP_PKTINFO, but using a socket not bound to the
corresponding VRF. This allows e.g. older ping implementations to be
run with specifying the device but without executing it in the VRF.
If the option is disabled, packets received in a VRF context are only
handled by a raw socket bound to the VRF, and correspondingly packets
in the default VRF are only handled by a socket not bound to any VRF.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure an unbound datagram skt is chosen when not in a VRF. The check
for a device match in compute_score() for UDP must be performed when
there is no device match. For this, a failure is returned when there is
no device match. This ensures that bound sockets are never selected,
even if there is no unbound socket.
Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
packets are currently blocked, as flowi6_oif was set to that of the
master vrf device, and the ipi6_ifindex is that of the slave device.
Allow these packets to be sent by checking the device with ipi6_ifindex
has the same L3 scope as that of the bound device of the skt, which is
the master vrf device. Note that this check always succeeds if the skt
is unbound.
Even though the right datagram skt is now selected by compute_score(),
a different skt is being returned that is bound to the wrong vrf. The
difference between these and stream sockets is the handling of the skt
option for SO_REUSEPORT. While the handling when adding a skt for reuse
correctly checks that the bound device of the skt is a match, the skts
in the hashslot are already incorrect. So for the same hash, a skt for
the wrong vrf may be selected for the required port. The root cause is
that the skt is immediately placed into a slot when it is created,
but when the skt is then bound using SO_BINDTODEVICE, it remains in the
same slot. The solution is to move the skt to the correct slot by
forcing a rehash.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit a04a480d43 ("net: Require exact match for TCP socket
lookups if dif is l3mdev") only ensures that the correct socket is
selected for packets in a VRF. However, there is no guarantee that
the unbound socket will be selected for packets when not in a VRF.
By checking for a device match in compute_score() also for the case
when there is no bound device and attaching a score to this, the
unbound socket is selected. And if a failure is returned when there
is no device match, this ensures that bound sockets are never selected,
even if there is no unbound socket.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the inet socket lookup to avoid packets arriving on a device
enslaved to an l3mdev from matching unbound sockets by removing the
wildcard for non sk_bound_dev_if and instead relying on check against
the secondary device index, which will be 0 when the input device is
not enslaved to an l3mdev and so match against an unbound socket and
not match when the input device is enslaved.
Change the socket binding to take the l3mdev into account to allow an
unbound socket to not conflict sockets bound to an l3mdev given the
datapath isolation now guaranteed.
Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to ip_fib_metrics_init and add messages for invalid
metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack arg to rtnl_create_link and add messages for invalid
number of Tx or Rx queues.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains the first batch of Netfilter fixes for
your net tree:
1) Fix splat with IPv6 defragmenting locally generated fragments,
from Florian Westphal.
2) Fix Incorrect check for missing attribute in nft_osf.
3) Missing INT_MIN & INT_MAX definition for netfilter bridge uapi
header, from Jiri Slaby.
4) Revert map lookup in nft_numgen, this is already possible with
the existing infrastructure without this extension.
5) Fix wrong listing of set reference counter, make counter
synchronous again, from Stefano Brivio.
6) Fix CIDR 0 in hash:net,port,net, from Eric Westbrook.
7) Fix allocation failure with large set, use kvcalloc().
From Andrey Ryabinin.
8) No need to disable BH when fetch ip set comment, patch from
Jozsef Kadlecsik.
9) Sanity check for valid sysfs entry in xt_IDLETIMER, from
Taehee Yoo.
10) Fix suspicious rcu usage via ip_set() macro at netlink dump,
from Jozsef Kadlecsik.
11) Fix setting default timeout via nfnetlink_cttimeout, this
comes with preparation patch to add nf_{tcp,udp,...}_pernet()
helper.
12) Allow ebtables table nat to be of filter type via nft_compat.
From Florian Westphal.
13) Incorrect calculation of next bucket in early_drop, do no bump
hash value, update bucket counter instead. From Vasily Khoruzhick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device. This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().
Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses. This is based on inet6_addr_lst in addrconf.c.
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a warning from checkpatch.pl:'please no space before tabs'
in include/net/af_unix.h
Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a warning from checkpatch:
function definition argument 'struct sock *' should also have an
identifier name in include/net/af_unix.h.
Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Finish removing the custom 9p request cache mechanism
* Embed part of the fcall in the request to have better slab
performance (msize usually is power of two aligned)
* syzkaller fixes:
- add a refcount to 9p requests to avoid use after free
- a few double free issues
* A few coverity fixes
* Some old patches that were in the bugzilla:
- do not trust pdu content for size header
- mount option for lock retry interval
----------------------------------------------------------------
Dan Carpenter (1):
9p: potential NULL dereference
Dinu-Razvan Chis-Serban (1):
9p locks: add mount option for lock retry interval
Dominique Martinet (12):
9p/xen: fix check for xenbus_read error in front_probe
v9fs_dir_readdir: fix double-free on p9stat_read error
9p: clear dangling pointers in p9stat_free
9p: embed fcall in req to round down buffer allocs
9p: add a per-client fcall kmem_cache
9p/rdma: do not disconnect on down_interruptible EAGAIN
9p: acl: fix uninitialized iattr access
9p/rdma: remove useless check in cm_event_handler
9p: p9dirent_read: check network-provided name length
9p locks: fix glock.client_id leak in do_lock
9p/trans_fd: abort p9_read_work if req status changed
9p/trans_fd: put worker reqs on destroy
Gertjan Halkes (1):
9p: do not trust pdu content for stat item size
Gustavo A. R. Silva (1):
9p: fix spelling mistake in fall-through annotation
Matthew Wilcox (2):
9p: Use a slab for allocating requests
9p: Remove p9_idpool
Tomas Bortoli (3):
9p: rename p9_free_req() function
9p: Add refcount to p9_req_t
9p: Rename req to rreq in trans_fd
fs/9p/acl.c | 2 +-
fs/9p/v9fs.c | 21 +++++
fs/9p/v9fs.h | 1 +
fs/9p/vfs_dir.c | 19 +---
fs/9p/vfs_file.c | 24 +++++-
include/net/9p/9p.h | 12 +--
include/net/9p/client.h | 71 ++++++---------
net/9p/Makefile | 1 -
net/9p/client.c | 551 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
net/9p/mod.c | 9 +-
net/9p/protocol.c | 20 ++++-
net/9p/trans_fd.c | 64 +++++++++-----
net/9p/trans_rdma.c | 37 ++++----
net/9p/trans_virtio.c | 44 +++++++---
net/9p/trans_xen.c | 17 ++--
net/9p/util.c | 140 ------------------------------
16 files changed, 482 insertions(+), 551 deletions(-)
delete mode 100644 net/9p/util.c
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAlvWYE0ACgkQq06b7GqY
5nBBDxAAkz4rGRsnRpx9D1Bt+81eARTqsWcoPM138zEbwiJehe5aSF8x/CGyirDW
+PoES/efho50sDfaUL9eBMgyy95UMN7LepMTcjNE1tywA+tRsH3dOL4757+lBwZc
FGsHvfse4b5ZzmMxSo2L5KbzpNiUEXCJGEvNlBSbeWcSSUtCfOShMVx9IC6r7JhX
pbsoczQ6W+384V4WgBxBZZf99EryxSeYJ9KMU+9b13ndzKICJqYeaEdItt28uYUC
d05iXDEDQZD3B8R/1Y08ktMqmHLXP9kZdJ/w8HhCMRAxWKUD1pfDwJ45gIwhCb5R
KxdxZ2tPvc8ihGngNJ4VaJoQQFTVzDgbyWXjwAFYkFUuPFgmuAgx3qApOhUFHCdM
8vgJd2u0attNwfOM3VsJApLQu09wmRw1LcmJ9re5OJ/YLiAUF3vw2oTpGhGlKRRI
bMBI9rwXHzeoVI7ank5mh1NMVRpMbzL9A8uqux85gcpXL36V3XmZtKRDe1AAa5jB
JR6dciAUABm9m+Ilt1Pl/faSjmysUOb1wZ0XU/cvVy5EBMUjbHv3x0TuDmrVwDas
puUXEM8soVf9e/NUYqTOozwFIle6KFcyPmae/OopATLgkGES4cjDy9cLh5SgnERM
vMcN/Z+DQO98zM/fPRy2B4qklflgTKI6VI/gSLRCWGpn49Iwhhg=
=g1OF
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.20' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Highlights this time around are the end of Matthew's work to remove
the custom 9p request cache and use a slab directly for requests, with
some extra patches on my end to not degrade performance, but it's a
very good cleanup.
Tomas and I fixed a few more syzkaller bugs (refcount is the big one),
and I had a go at the coverity bugs and at some of the bugzilla
reports we had open for a while.
I'm a bit disappointed that I couldn't get much reviews for a few of
my own patches, but the big ones got some and it's all been soaking in
linux-next for quite a while so I think it should be OK.
Summary:
- Finish removing the custom 9p request cache mechanism
- Embed part of the fcall in the request to have better slab
performance (msize usually is power of two aligned)
- syzkaller fixes:
* add a refcount to 9p requests to avoid use after free
* a few double free issues
- A few coverity fixes
- Some old patches that were in the bugzilla:
* do not trust pdu content for size header
* mount option for lock retry interval"
* tag '9p-for-4.20' of git://github.com/martinetd/linux: (21 commits)
9p/trans_fd: put worker reqs on destroy
9p/trans_fd: abort p9_read_work if req status changed
9p: potential NULL dereference
9p locks: fix glock.client_id leak in do_lock
9p: p9dirent_read: check network-provided name length
9p/rdma: remove useless check in cm_event_handler
9p: acl: fix uninitialized iattr access
9p locks: add mount option for lock retry interval
9p: do not trust pdu content for stat item size
9p: Rename req to rreq in trans_fd
9p: fix spelling mistake in fall-through annotation
9p/rdma: do not disconnect on down_interruptible EAGAIN
9p: Add refcount to p9_req_t
9p: rename p9_free_req() function
9p: add a per-client fcall kmem_cache
9p: embed fcall in req to round down buffer allocs
9p: Remove p9_idpool
9p: Use a slab for allocating requests
9p: clear dangling pointers in p9stat_free
v9fs_dir_readdir: fix double-free on p9stat_read error
...
When doing a route dump across all address families, do not error out
if the table does not exist. This allows a route dump for AF_UNSPEC
with a table id that may only exist for some of the families.
Do return the table does not exist error if dumping routes for a
specific family and the table does not exist.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With EDT model, SRTT no longer is inflated by pacing delays.
This means that RTO and some other xmit timers might be setup
incorrectly. This is particularly visible with either :
- Very small enforced pacing rates (SO_MAX_PACING_RATE)
- Reduced rto (from the default 200 ms)
This can lead to TCP flows aborts in the worst case,
or spurious retransmits in other cases.
For example, this session gets far more throughput
than the requested 80kbit :
$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
540000 262144 262144 104.00 2.66
With the fix :
$ netperf -H 127.0.0.2 -l 100 -- -q 10000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
540000 262144 262144 104.00 0.12
EDT allows for better control of rtx timers, since TCP has
a better idea of the earliest departure time of each skb
in the rtx queue. We only have to eventually add to the
timer the difference of the EDT time with current time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit dd979b4df8.
This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace spurious spaces with a tab and remove superfluous tab from
unix_sock struct.
Signed-off-by: Vito Caputo <vcaputo@pengaru.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-21
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Implement two new kind of BPF maps, that is, queue and stack
map along with new peek, push and pop operations, from Mauricio.
2) Add support for MSG_PEEK flag when redirecting into an ingress
psock sk_msg queue, and add a new helper bpf_msg_push_data() for
insert data into the message, from John.
3) Allow for BPF programs of type BPF_PROG_TYPE_CGROUP_SKB to use
direct packet access for __skb_buff, from Song.
4) Use more lightweight barriers for walking perf ring buffer for
libbpf and perf tool as well. Also, various fixes and improvements
from verifier side, from Daniel.
5) Add per-symbol visibility for DSO in libbpf and hide by default
global symbols such as netlink related functions, from Andrey.
6) Two improvements to nfp's BPF offload to check vNIC capabilities
in case prog is shared with multiple vNICs and to protect against
mis-initializing atomic counters, from Jakub.
7) Fix for bpftool to use 4 context mode for the nfp disassembler,
also from Jakub.
8) Fix a return value comparison in test_libbpf.sh and add several
bpftool improvements in bash completion, documentation of bpf fs
restrictions and batch mode summary print, from Quentin.
9) Fix a file resource leak in BPF selftest's load_kallsyms()
helper, from Peng.
10) Fix an unused variable warning in map_lookup_and_delete_elem(),
from Alexei.
11) Fix bpf_skb_adjust_room() signature in BPF UAPI helper doc,
from Nicolas.
12) Add missing executables to .gitignore in BPF selftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
They are not used anymore and therefore should be removed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Johan Hedberg says:
====================
pull request: bluetooth-next 2018-10-20
Here's one more bluetooth-next pull request for the 4.20 kernel.
- Added new USB ID for QCA_ROME controller
- Added debug trace support from QCA wcn3990 controllers
- Updated L2CAP to conform to latest Errata Service Release
- Fix binding to non-removable BCM43430 devices
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree:
1) Use lockdep_is_held() in ipset_dereference_protected(), from Lance Roy.
2) Remove unused variable in cttimeout, from YueHaibing.
3) Add ttl option for nft_osf, from Fernando Fernandez Mancera.
4) Use xfrm family to deal with IPv6-in-IPv4 packets from nft_xfrm,
from Florian Westphal.
5) Simplify xt_osf_match_packet().
6) Missing ct helper alias definition in snmp_trap helper, from Taehee Yoo.
7) Remove unnecessary parameter in nf_flow_table_cleanup(), from Taehee Yoo.
8) Remove unused variable definitions in nft_{dup,fwd}, from Weongyo Jeong.
9) Remove empty net/netfilter/nfnetlink_log.h file, from Taehee Yoo.
10) Revert xt_quota updates remain option due to problems in the listing
path for 32-bit arches, from Maze.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.
net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'. Thanks to David
Ahern for the heads up.
Signed-off-by: David S. Miller <davem@davemloft.net>
/include/net/netfilter/nfnetlink_log.h file is empty.
so that it can be removed.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
parameter net of nf_flow_table_cleanup() is not used.
So that it can be removed.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Now it's confusing that asoc sndbuf_used is doing memory accounting with
SCTP_DATA_SNDSIZE(chunk) + sizeof(sk_buff) + sizeof(sctp_chunk) while sk
sk_wmem_alloc is doing that with skb->truesize + sizeof(sctp_chunk).
It also causes sctp_prsctp_prune to count with a wrong freed memory when
sndbuf_policy is not set.
To make this right and also keep consistent between asoc sndbuf_used, sk
sk_wmem_alloc and sk_wmem_queued, use skb->truesize + sizeof(sctp_chunk)
for them.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp data size should be calculated by subtracting data chunk header's
length from chunk_hdr->length, not just data header.
Fixes: 668c9beb90 ("sctp: implement assign_number for sctp_stream_interleave")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We shouldn't abuse exceptions: if the destination MTU is already higher
than what we're transmitting, no exception should be created.
Fixes: 52a589d51f ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff44 ("vxlan: update skb dst pmtu on tx path")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, an FDB entry only ceases being offloaded when it is deleted.
This changes with VxLAN encapsulation.
Devices capable of performing VxLAN encapsulation usually have only one
FDB table, unlike the software data path which has two - one in the
bridge driver and another in the VxLAN driver.
Therefore, bridge FDB entries pointing to a VxLAN device are only
offloaded if there is a corresponding entry in the VxLAN FDB.
Allow clearing the offload indication in case the corresponding entry
was deleted from the VxLAN FDB.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloaded bridge FDB entries are marked with NTF_OFFLOADED. Implement a
similar mechanism for VXLAN, where a given remote destination can be
marked as offloaded.
To that end, introduce a new event, SWITCHDEV_VXLAN_FDB_OFFLOADED,
through which the marking is communicated to the vxlan driver. To
identify which RDST should be marked as offloaded, an
switchdev_notifier_vxlan_fdb_info is passed to the listeners. The
"offloaded" flag in that object determines whether the offloaded mark
should be set or cleared.
When sending offloaded FDB entries over netlink, mark them with
NTF_OFFLOADED.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A switchdev-capable driver that is aware of VXLAN may need to query
VXLAN FDB. In the particular case of mlxsw, this functionality is
limited to querying UC FDBs. Those being easier to deal with than the
general case of RDST chain traversal, introduce an interface to query
specifically UC FDBs: vxlan_fdb_find_uc().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When offloading VXLAN devices, drivers need to know about events in
VXLAN FDB database. Since VXLAN models a bridge, it is natural to
distribute the VXLAN FDB notifications using the pre-existing switchdev
notification mechanism.
To that end, introduce two new notification types:
SWITCHDEV_VXLAN_FDB_ADD_TO_DEVICE and SWITCHDEV_VXLAN_FDB_DEL_TO_DEVICE.
Introduce a new function, vxlan_fdb_switchdev_call_notifiers() to send
the new notifier types, and a struct switchdev_notifier_vxlan_fdb_info
to communicate the details of the FDB entry under consideration.
Invoke the new function from vxlan_fdb_notify().
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to determine whether a netdev is a VxLAN netdev by
calling the above mentioned function that checks the netdev's
rtnl_link_ops.
This will allow modules to identify netdev events involving a VxLAN
netdev and act accordingly. For example, drivers capable of VxLAN
offload will need to configure the underlying device when a VxLAN netdev
is being enslaved to an offloaded bridge.
Convert nfp to use the newly introduced helper.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers that support tunnel decapsulation (IPinIP or NVE) need to
configure the underlying device to conform to the behavior outlined in
RFC 6040 with respect to the ECN bits.
This behavior is implemented by INET_ECN_decapsulate() which requires an
skb to be passed where the ECN CE bit can be potentially set. Since
these drivers do not need to mark an skb, but only configure the device
to do so, factor out the business logic to __INET_ECN_decapsulate() and
potentially perform the marking in INET_ECN_decapsulate().
This allows drivers to invoke __INET_ECN_decapsulate() and configure the
device.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drivers that support VxLAN offload need to be able to sanitize the
configuration of the VxLAN device and accept / reject its offload.
For example, mlxsw requires that the local IP of the VxLAN device be set
and that packets be flooded to unicast IP(s) and not to a multicast
group.
Expose the functions that perform such checks.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the MSG_PEEK flag when doing redirect to ingress
and receiving on the sk_msg psock queue. Previously the flag was
being ignored which could confuse applications if they expected the
flag to work as normal.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.
ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.
Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.
Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
sk_msg BPF integration for the latter, from Daniel and John.
2) Enable BPF syscall side to indicate for maps that they do not support
a map lookup operation as opposed to just missing key, from Prashant.
3) Add bpftool map create command which after map creation pins the
map into bpf fs for further processing, from Jakub.
4) Add bpftool support for attaching programs to maps allowing sock_map
and sock_hash to be used from bpftool, from John.
5) Improve syscall BPF map update/delete path for map-in-map types to
wait a RCU grace period for pending references to complete, from Daniel.
6) Couple of follow-up fixes for the BPF socket lookup to get it
enabled also when IPv6 is compiled as a module, from Joe.
7) Fix a generic-XDP bug to handle the case when the Ethernet header
was mangled and thus update skb's protocol and data, from Jesper.
8) Add a missing BTF header length check between header copies from
user space, from Wenwen.
9) Minor fixups in libbpf to use __u32 instead u32 types and include
proper perf_event.h uapi header instead of perf internal one, from Yonghong.
10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
to bpftool's build, from Jiri.
11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Other than asoc pmtu sync from all transports, sctp_assoc_sync_pmtu
is also processing transport pmtu_pending by icmp packets. But it's
meaningless to use sctp_dst_mtu(t->dst) as new pmtu for a transport.
The right pmtu value should come from the icmp packet, and it would
be saved into transport->mtu_info in this patch and used later when
the pmtu sync happens in sctp_sendmsg_to_asoc or sctp_packet_config.
Besides, without this patch, as pmtu can only be updated correctly
when receiving a icmp packet and no place is holding sock lock, it
will take long time if the sock is busy with sending packets.
Note that it doesn't process transport->mtu_info in .release_cb(),
as there is no enough information for pmtu update, like for which
asoc or transport. It is not worth traversing all asocs to check
pmtu_pending. So unlike tcp, sctp does this in tx path, for which
mtu_info needs to be atomic_t.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When commit 270972554c ("[IPV6]: ROUTE: Add Router Reachability
Probing (RFC4191).") introduced router probing, the rt6_probe() function
required that a neighbour entry existed. This neighbour entry is used to
record the timestamp of the last probe via the ->updated field.
Later, commit 2152caea71 ("ipv6: Do not depend on rt->n in rt6_probe().")
removed the requirement for a neighbour entry. Neighbourless routes skip
the interval check and are not rate-limited.
This patch adds rate-limiting for neighbourless routes, by recording the
timestamp of the last probe in the fib6_info itself.
Fixes: 2152caea71 ("ipv6: Do not depend on rt->n in rt6_probe().")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a more complete fix than d71019b54b ("net: core: Fix build
with CONFIG_IPV6=m"), so that IPv6 sockets may be looked up if the IPv6
module is loaded (not just if it's compiled in).
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Instead of re-implementing poll routine use the poll callback to
trigger read from kTLS, we reuse the stream_memory_read callback
which is simpler and achieves the same. This helps to align sockmap
and kTLS so we can more easily embed BPF in kTLS.
Joint work with Daniel.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Convert kTLS over to make use of sk_msg interface for plaintext and
encrypted scattergather data, so it reuses all the sk_msg helpers
and data structure which later on in a second step enables to glue
this to BPF.
This also allows to remove quite a bit of open coded helpers which
are covered by the sk_msg API. Recent changes in kTLs 80ece6a03a
("tls: Remove redundant vars from tls record structure") and
4e6d47206c ("tls: Add support for inplace records encryption")
changed the data path handling a bit; while we've kept the latter
optimization intact, we had to undo the former change to better
fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
been brought back and are linked into the sk_msg sgs. Now the kTLS
record contains a msg_plaintext and msg_encrypted sk_msg each.
In the original code, the zerocopy_from_iter() has been used out
of TX but also RX path. For the strparser skb-based RX path,
we've left the zerocopy_from_iter() in decrypt_internal() mostly
untouched, meaning it has been moved into tls_setup_from_iter()
with charging logic removed (as not used from RX). Given RX path
is not based on sk_msg objects, we haven't pursued setting up a
dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
could be an option to prusue in a later step.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to prepare sockmap logic to be used in combination with kTLS
we need to detangle it from ULP, and further split it in later commits
into a generic API.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the result values specific to L2CAP LE credit based connections
and change the old result values wherever they were used.
Signed-off-by: Mallikarjun Phulari <mallikarjun.phulari@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.
Signed-off-by: David S. Miller <davem@davemloft.net>
We already have BUILD_BUG_ON_ZERO() which I just hadn't found
before, so we should use it here instead of open-coding another
implementation thereof.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* merge net-next, so I can finish the hwsim workqueue removal
* fix TXQ NULL pointer issue that was reported multiple times
* minstrel cleanups from Felix
* simplify lib80211 code by not using skcipher, note that this
will conflict with the crypto tree (and this new code here
should be used)
* use new netlink policy validation in nl80211
* fix up SAE (part of WPA3) in client-mode
* FTM responder support in the stack
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlvAgX8ACgkQB8qZga/f
l8Sg6A/+IhbIRerfje8kVIWNG3fUa5oF/vXYRfNRDeNwHu6OVYer5uj6bonkSSoU
lpWivhuF1XxdIQx6npr8j7R6bw8sbKMKrpnR7ObFaaCmxjmjgMNY4pkkmzSRv3zi
LP91woUQ2mFv16U9E+9HROBJXgeZfU96bBBR4mfqKrhnhFz6O8SYa9JvRpylcLTz
HY+a+ezIztbrCPImBUOkSfUPuoeWSJ8Y4LN2zhqbY/ZUk3n4iFOD44bAN346u3+q
CzfL+e+PKjuSiwPrjrKMIp9emiAE1zeWPUcmnSAYU/s9zISoZ8v5L/kSBanVnvJI
DIHJ6LB+aETm1gnj1dSnboXRKpUaW63nbkoxwTy15JrdIMysh9vDaXWok6p85mDq
FVbFx1MzBcYPCMrACEBhhRz71eZHFtyrGyqUm1EZcdcNeQtg+eJdXvhTiowg5JhH
Csl+ZloZfqrd0aPjaM3F819aVMKcJORdQhrfTN+GYW81mdZyZO3VWmzd3kI8Pjuc
WRZmykYTub/MbDFjE5g1LRJ4jZ/bThIl71i8hYEopqfim9URE1RPm9mSMUv9ggVL
/xNIIMhw3H/+P3g0z2a0Z06hT/dScHsZ9If5Srb699/NUOiES5zl43eXidBzuAbI
36DHaHhf6k8/e3aaLOS1oD3AjpiW28uxq7baeLfd1B8l9WmmgZM=
=RqMD
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2018-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Highlights:
* merge net-next, so I can finish the hwsim workqueue removal
* fix TXQ NULL pointer issue that was reported multiple times
* minstrel cleanups from Felix
* simplify lib80211 code by not using skcipher, note that this
will conflict with the crypto tree (and this new code here
should be used)
* use new netlink policy validation in nl80211
* fix up SAE (part of WPA3) in client-mode
* FTM responder support in the stack
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link's carrier goes down it could be a sign of the port changing
networks. If the new network has overlapping addresses with the old one,
then the kernel will continue trying to use neighbor entries established
based on the old network until the entries finally age out - meaning a
potentially long delay with communications not working.
This patch evicts neighbor entries on carrier down with the exception of
those marked permanent. Permanent entries are managed by userspace (either
an admin or a routing daemon such as FRR).
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
notifications when a device is taken down (admin down) or deleted. IPv4
does not generate a message for routes evicted by the down or delete;
IPv6 does. A NOS at scale really needs to avoid these messages and have
IPv4 and IPv6 behave similarly, relying on userspace to handle link
notifications and evict the routes.
At this point existing user behavior needs to be preserved. Since
notifications are a global action (not per app) the only way to preserve
existing behavior and allow the messages to be skipped is to add a new
sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
disable the notificatioons.
IPv6 route code already supports the option to skip the message (it is
used for multipath routes for example). Besides the new sysctl we need
to pass the skip_notify setting through the generic fib6_clean and
fib6_walk functions to fib6_clean_node and to set skip_notify on calls
to __ip_del_rt for the addrconf_ifdown path.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current mac80211 has provision to update tx status through
ieee80211_tx_status() and ieee80211_tx_status_ext(). But
drivers like ath10k updates the tx status from the skb except
txrate, txrate will be updated from a different path, peer stats.
Using ieee80211_tx_status_ext() in two different paths
(one for the stats, one for the tx rate) would duplicate
the stats instead.
To avoid this stats duplication, ieee80211_tx_rate_update()
is implemented.
Signed-off-by: Anilkumar Kolli <akolli@codeaurora.org>
[minor commit message editing, use initializers in code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add support for drivers to report the total number of MPDUs received
and the number of MPDUs received with an FCS error from a specific
peer. These counters will be incremented only when the TA of the
frame matches the MAC address of the peer irrespective of FCS
error.
It should be noted that the TA field in the frame might be corrupted
when there is an FCS error and TA matching logic would fail in such
cases. Hence, FCS error counter might not be fully accurate, but it can
provide help in detecting bad RX links in significant number of cases.
This FCS error counter without full accuracy can be used, e.g., to
trigger a kick-out of a connected client with a bad link in AP mode to
force such a client to roam to another AP.
Signed-off-by: Ankita Bajaj <bankita@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
New bss param ftm_responder is used to notify the driver to
enable fine timing request (FTM) responder role in AP mode.
Plumb the new cfg80211 API for FTM responder statistics through to
the driver API in mac80211.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since commit 5aad1de5ea ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.
As Stefano described in commit e9fa1495d7 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
- if the local MTU is now lower than the PMTU, that PMTU is now
incorrect
- if the local MTU was the lowest value in the path, and is increased,
we might discover a higher PMTU
Similarly to what commit e9fa1495d7 did for IPv6, update PMTU in those
cases.
If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.
To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.
Fixes: 5aad1de5ea ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_route_table_template is exported but there are no users outside
of route.c. Make it static.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink string param buffer is allocated at the size of
DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
this size is not exceeded.
Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
__DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
devlink only. The driver should use the helper function instead to
verify it doesn't exceed the allowed length.
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case devlink param type is string, it needs to copy the string value
it got from the input to devlink_param_value.
Fixes: e3b7ca18ad ("devlink: Add param set command")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree:
1) Support for matching on ipsec policy already set in the route, from
Florian Westphal.
2) Split set destruction into deactivate and destroy phase to make it
fit better into the transaction infrastructure, also from Florian.
This includes a patch to warn on imbalance when setting the new
activate and deactivate interfaces.
3) Release transaction list from the workqueue to remove expensive
synchronize_rcu() from configuration plane path. This speeds up
configuration plane quite a bit. From Florian Westphal.
4) Add new xfrm/ipsec extension, this new extension allows you to match
for ipsec tunnel keys such as source and destination address, spi and
reqid. From Máté Eckl and Florian Westphal.
5) Add secmark support, this includes connsecmark too, patches
from Christian Gottsche.
6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng.
One follow up patch to calm a clang warning for this one, from
Nathan Chancellor.
7) Flush conntrack entries based on layer 3 family, from Kristian Evensen.
8) New revision for cgroups2 to shrink the path field.
9) Get rid of obsolete need_conntrack(), as a result from recent
demodularization works.
10) Use WARN_ON instead of BUG_ON, from Florian Westphal.
11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian.
12) Remove superfluous check for timeout netlink parser and dump
functions in layer 4 conntrack helpers.
13) Unnecessary redundant rcu read side locks in NAT redirect,
from Taehee Yoo.
14) Pass nf_hook_state structure to error handlers, patch from
Florian Westphal.
15) Remove ->new() interface from layer 4 protocol trackers. Place
them in the ->packet() interface. From Florian.
16) Place conntrack ->error() handling in the ->packet() interface.
Patches from Florian Westphal.
17) Remove unused parameter in the pernet initialization path,
also from Florian.
18) Remove additional parameter to specify layer 3 protocol when
looking up for protocol tracker. From Florian.
19) Shrink array of layer 4 protocol trackers, from Florian.
20) Check for linear skb only once from the ALG NAT mangling
codebase, from Taehee Yoo.
21) Use rhashtable_walk_enter() instead of deprecated
rhashtable_walk_init(), also from Taehee.
22) No need to flush all conntracks when only one single address
is gone, from Tan Hu.
23) Remove redundant check for NAT flags in flowtable code, from
Taehee Yoo.
24) Use rhashtable_lookup() instead of rhashtable_lookup_fast()
from netfilter codebase, since rcu read lock side is already
assumed in this path.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add helper to check netlink message for route dumps. If the strict flag
is set the dump request is expected to have an rtmsg struct as the header.
All elements of the struct are expected to be 0 with the exception of
rtm_flags (which is used by both ipv4 and ipv6 dumps) and no attributes
can be appended. rtm_flags can only have RTM_F_CLONED and RTM_F_PREFIX
set.
Update inet_dump_fib, inet6_dump_fib, mpls_dump_routes, ipmr_rtm_dumproute,
and ip6mr_rtm_dumproute to call this helper if strict data checking is
enabled.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_parse is currently lenient on message parsing, allowing type to be 0
or greater than max expected and only logging a message
"netlink: %d bytes leftover after parsing attributes in process `%s'."
if the netlink message has unknown data at the end after parsing. What this
could mean is that the header at the front of the attributes is actually
wrong and the parsing is shifted from what is expected.
Add a new strict version that actually fails with EINVAL if there are any
bytes remaining after the parsing loop completes, if the atttrbitue type
is 0 or greater than max expected.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Give a user a reason why EINVAL is returned in nlmsg_parse.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net-next, which pulled in net, so I can merge a few more
patches that would otherwise conflict.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Avoid the socket lookup cost in udp_gro_receive if no socket has a
udp tunnel callback configured.
udp_sk(sk)->gro_receive requires a registration with
setup_udp_tunnel_sock, which enables the static key.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit ec3ed293e7 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
we move fl_hw_destroy_tmplt() to a workqueue to avoid blocking
with the spinlock held. Unfortunately, this causes a lot of
troubles here:
1. tcf_chain_destroy() could be called right after we queue the work
but before the work runs. This is a use-after-free.
2. The chain refcnt is already 0, we can't even just hold it again.
We can check refcnt==1 but it is ugly.
3. The chain with refcnt 0 is still visible in its block, which means
it could be still found and used!
4. The block has a refcnt too, we can't hold it without introducing a
proper API either.
We can make it working but the end result is ugly. Instead of wasting
time on reviewing it, let's just convert the troubling spinlock to
a mutex, which allows us to use non-atomic allocations too.
Fixes: ec3ed293e7 ("net_sched: change tcf_del_walker() to take idrinfo->lock")
Reported-by: Ido Schimmel <idosch@idosch.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already check the RSS indirection table does not use queues which
would be disabled by channel reconfiguration. Make sure user does not
try to disable queues which have a UMEM and zero-copy AF_XDP socket
installed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Move the refcounting and potential free of dst metrics associated
for ipv4 and ipv6 to a common helper.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4 and ipv6 both use refcounted metrics if FIB entries have metrics set.
Move the common initialization code to a helper and use for both protocols.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the refcounting and potential free of dst metrics associated
with a fib entry to a helper and use it in both ipv4 and ipv6.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Consolidate initialization of ipv4 and ipv6 metrics when fib entries
are created into a single helper, ip_fib_metrics_init, that handles
the call to ip_metrics_convert.
If no metrics are defined for the fib entry, then the metrics is set
to dst_default_metrics.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcf_block_dev() doesn't seem to be used anywhere in the tree.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msix_vec_per_pf_min - This param sets the number of minimal MSIX
vectors required for the device initialization. This value is set
in the device which limits MSIX vectors per PF.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msix_vec_per_pf_max - This param sets the number of MSIX vectors
that the device requests from the host on driver initialization.
This value is set in the device which is applicable per PF.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ignore_ari - Device ignores ARI(Alternate Routing ID) capability,
even when platforms has the support and creates same number of
partitions when platform does not support ARI capability.
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbtU45AAoJEEg/ir3gV/o+/C4H/RHA4KImrb476EdB3VNYMqAN
dgXb+bmh6sZP+jHWqQ4c3aVeh6/T8qm4gwiSn2nVTtHEnxtCdIYljzDC1Nswczeg
pSjD1eOP7M1LpAOmBb8xdnJcX7yM7r1bTklnp2sN853WShbsDRYgZBHsBwTzx25U
ZdzL4QTLuohlG/aLrbGXMntIy45ya2fVQrnK54s18nFlgsdFjEs0mi0xaUKNBC6+
P8CTohHAxuuxmL5b+6MIYLZCdgd8cLNQFdtqbckEVw7SvcRTxfraRlyqJ0YOgTGB
TdSWnqZz2JYH29wSFbpFG8qX6GCv8FoiZ+fKzldbolHk442rrktHv3+Y7qQuZVs=
=NVks
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-10-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2018-10-03
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the epoch value to be queried on a server connection. This is in the
rxrpc header of every packet for use in routing and is derived from the
client's state. It's also not supposed to change unless the client gets
restarted.
AFS can make use of this information to deduce whether a fileserver has
been restarted because the fileserver makes client calls to the filesystem
driver's cache manager to send notifications (ie. callback breaks) about
conflicting changes from other clients. These convey the fileserver's own
epoch value back to the filesystem.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow the timestamp on the sk_buff holding the first DATA packet of a reply
to be queried. This can then be used as a base for the expiry time
calculation on the callback promise duration indicated by an operation
result.
Signed-off-by: David Howells <dhowells@redhat.com>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to the eswitch related operations.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Presently, for non-zero copy case, separate pages are allocated for
storing plaintext and encrypted text of records. These pages are stored
in sg_plaintext_data and sg_encrypted_data scatterlists inside record
structure. Further, sg_plaintext_data & sg_encrypted_data are passed
to cryptoapis for record encryption. Allocating separate pages for
plaintext and encrypted text is inefficient from both required memory
and performance point of view.
This patch adds support of inplace encryption of records. For non-zero
copy case, we reuse the pages from sg_encrypted_data scatterlist to
copy the application's plaintext data. For the movement of pages from
sg_encrypted_data to sg_plaintext_data scatterlists, we introduce a new
function move_to_plaintext_sg(). This function add pages into
sg_plaintext_data from sg_encrypted_data scatterlists.
tls_do_encryption() is modified to pass the same scatterlist as both
source and destination into aead_request_set_crypt() if inplace crypto
has been enabled. A new ariable 'inplace_crypto' has been introduced in
record structure to signify whether the same scatterlist can be used.
By default, the inplace_crypto is enabled in get_rec(). If zero-copy is
used (i.e. plaintext data is not copied), inplace_crypto is set to '0'.
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Reviewed-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2018-09-30
Here's the first bluetooth-next pull request for the 4.20 kernel.
- Fixes & cleanups to hci_qca driver
- NULL dereference fix to debugfs
- Improved L2CAP Connection-oriented Channel MTU & MPS handling
- Added support for USB-based RTL8822C controller
- Added device ID for BCM4335C0 UART-based controller
- Various other smaller cleanups & fixes
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
in tcp_close()
While a socket is being closed, it is very possible other
threads find it in rtnetlink dump.
tcp_get_info() will acquire the socket lock for a short amount
of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
enough to trigger the warning.
Fixes: 67db3e4bfb ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(the parameter in question is mark)
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Timer handlers do not imply rcu_read_lock(), so my recent fix
triggered a LOCKDEP warning when SYNACK is retransmit.
Lets add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
usages instead of guessing what is done by callers, since it is
not worth the pain.
Get rid of ireq_opt_deref() helper since it hides the logic
without real benefit, since it is now a standard rcu_dereference().
Fixes: 1ad98e9d1b ("tcp/dccp: fix lockdep issue when SYN is backlogged")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no reason for drivers to be able to access the
cfg80211 internal cookie counter; move it out of the
wiphy into the rdev structure.
While at it, also make it never assign 0 as a cookie
(we consider that invalid in some places), and warn if
we manage to do that for some reason (wrapping is not
likely to happen with a u64.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Allow userspace to enable fine timing measurement responder
functionality with configurable lci/civic parameters in AP mode.
This can be done at AP start or changing beacon parameters.
A new EXT_FEATURE flag is introduced for drivers to advertise
the capability.
Also nl80211 API support for retrieving statistics is added.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
[remove unused cfg80211_ftm_responder_params, clarify docs,
move validation into policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the recent TCP/EDT patch series, I switched TCP and sch_fq
clocks from MONOTONIC to TAI, in order to meet the choice done
earlier for sch_etf packet scheduler.
But sure enough, this broke some setups were the TAI clock
jumps forward (by almost 50 year...), as reported
by Leonard Crestez.
If we want to converge later, we'll probably need to add
an skb field to differentiate the clock bases, or a socket option.
In the meantime, an UDP application will need to use CLOCK_MONOTONIC
base for its SCM_TXTIME timestamps if using fq packet scheduler.
Fixes: 72b0094f91 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
Fixes: 142537e419 ("net_sched: sch_fq: switch to CLOCK_TAI")
Fixes: fd2bca2aa7 ("tcp: switch internal pacing timer to CLOCK_TAI")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <leonard.crestez@nxp.com>
Tested-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to have an arbitrary validation function attached
to a netlink policy that doesn't already use the validation_data
pointer in another way.
This can be useful to validate for example the content of a binary
attribute, like in nl80211 the "(information) elements", which must
be valid streams of "u8 type, u8 length, u8 value[length]".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without further bloating the policy structs, we can overload
the `validation_data' pointer with a struct of s16 min, max
and use those to validate ranges in NLA_{U,S}{8,16,32,64}
attributes.
It may sound strange to validate NLA_U32 with a s16 max, but
in many cases NLA_U32 is used for enums etc. since there's no
size benefit in using a smaller attribute width anyway, due
to netlink attribute alignment; in cases like that it's still
useful, particularly when the attribute really transports an
enum value.
Doing so lets us remove quite a bit of validation code, if we
can be sure that these attributes aren't used by userspace in
places where they're ignored today.
To achieve all this, split the 'type' field and introduce a
new 'validation_type' field which indicates what further
validation (beyond the validation prescribed by the type of
the attribute) is done. This currently allows for no further
validation (the default), as well as min, max and range checks.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In normal SYN processing, packets are handled without listener
lock and in RCU protected ingress path.
But syzkaller is known to be able to trick us and SYN
packets might be processed in process context, after being
queued into socket backlog.
In commit 06f877d613 ("tcp/dccp: fix other lockdep splats
accessing ireq_opt") I made a very stupid fix, that happened
to work mostly because of the regular path being RCU protected.
Really the thing protecting ireq->ireq_opt is RCU read lock,
and the pseudo request refcnt is not relevant.
This patch extends what I did in commit 449809a66c ("tcp/dccp:
block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
pair in the paths that might be taken when processing SYN from
socket backlog (thus possibly in process context)
Fixes: 06f877d613 ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a simple typo: attribuets -> attributes
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Structure 'tls_rec' contains sg_aead_in and sg_aead_out which point
to a aad_space and then chain scatterlists sg_plaintext_data,
sg_encrypted_data respectively. Rather than using chained scatterlists
for plaintext and encrypted data in aead_req, it is efficient to store
aad_space in sg_encrypted_data and sg_plaintext_data itself in the
first index and get rid of sg_aead_in, sg_aead_in and further chaining.
This requires increasing size of sg_encrypted_data & sg_plaintext_data
arrarys by 1 to accommodate entry for aad_space. The code which uses
sg_encrypted_data and sg_plaintext_data has been modified to skip first
index as it points to aad_space.
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* two new spectre-v1 mitigations in nl80211
* TX status fix in general, and mesh in particular
* powersave vs. offchannel fix
* regulatory initialization fix
* fix for a queue hang due to a bad return value
* allocate TXQs for active monitor interfaces, fixing my
earlier patch to avoid unnecessary allocations where I
missed this case needed them
* fix TDLS data frames priority assignment
* fix scan results processing to take into account duplicate
channel numbers (over different operating classes, but we
don't necessarily know the operating class)
* various hwsim fixes for radio destruction and new radio
announcement messages
* remove an extraneous kernel-doc line
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlusvBUACgkQB8qZga/f
l8RGlg/+LeQVHh62PJcawXxxa7UyxBhapgkrXC/Np8hnd39xUL41SNMz2wIt7snu
W2iY+5DOJZ9UyrVL70hMZ26lOLonoafhRdH0utQX3U13iD+3g35Xrps/NETlZxcy
QftxstUeh9ojdKcOmTtOlHt3PkLCSVa6kS6TsfFGXnMeAgpbmBlPpJQhsCf1x2M4
ZBUn92sdQu4c8goFUIj8XhgccJbIuAi9pOgQDOE7EXTrxq+Y1SBl1qYCH1bjOkPd
O5lvV74Sr1SPWfWE6IgUPyRGTEjrImfKm5M/1aegcQ6KRk3LZYtNfjxcrRQg9uzA
5FMO6zNpu5EEyspS18F1F/9cLo5meqVW4A+Qr6FQ8NwAJ7O31CerEFHUwSHtLfgL
EjMuWfoKbFaj8YOeShdR3N305Eitc5/uF4m+FVee1tig3GnmUEFzNB0ngMqwFmed
esHbLlVPDYqoKzN7wZYFB/rTDcQxtcHX8m1SQfMqRgmLVkfKS1AUHk0kql++pivW
ZzNlePz3gzacurTeU8FqktxXQfdB8CRyeLqKapTE+3mY9HgYZxnKz1wV6lyoMGiM
yfn0FjsRaYgNXeB0hJ8J6w/+DJWRwahVyas1QbCZKjcY9A67LRpiRlUwEGVGtu/j
OaNWDFLPVkERYNg/7qhMym5h0RyvcBumYd9TVzwupnTiV3T1ScY=
=ySdv
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2018-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
More patches than I'd like perhaps, but each seems reasonable:
* two new spectre-v1 mitigations in nl80211
* TX status fix in general, and mesh in particular
* powersave vs. offchannel fix
* regulatory initialization fix
* fix for a queue hang due to a bad return value
* allocate TXQs for active monitor interfaces, fixing my
earlier patch to avoid unnecessary allocations where I
missed this case needed them
* fix TDLS data frames priority assignment
* fix scan results processing to take into account duplicate
channel numbers (over different operating classes, but we
don't necessarily know the operating class)
* various hwsim fixes for radio destruction and new radio
announcement messages
* remove an extraneous kernel-doc line
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes nested netlink attributes are just used as arrays, with
the nla_type() of each not being used; we have this in nl80211 and
e.g. NFTA_SET_ELEM_LIST_ELEMENTS.
Add the ability to validate this type of message directly in the
policy, by adding the type NLA_NESTED_ARRAY which does exactly
this: require a first level of nesting but ignore the attribute
type, and then inside each require a second level of nested and
validate those attributes against a given policy (if present).
Note that some nested array types actually require that all of
the entries have the same index, this is possible to express in
a nested policy already, apart from the validation that only the
one allowed type is used.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have a validation_data pointer, and the len field in
the policy is unused for NLA_NESTED, we can allow using them both
to have nested validation. This can be nice in code, although we
still have to use nla_parse_nested() or similar which would also
take a policy; however, it also serves as documentation in the
policy without requiring a look at the code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The validation data is only used within the policy that
should usually already be const, and isn't changed in any
code that uses it. Therefore, make the validation_data
pointer const.
While at it, remove the duplicate variable in the bitfield
validation that I'd otherwise have to change to const.
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This isn't used anywhere, so we might as well get rid of it.
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to set the security context of packets within the nf_tables framework.
Add a nft_object for holding security contexts in the kernel and manipulating packets on the wire.
Convert the security context strings at rule addition time to security identifiers.
This is the same behavior like in xt_SECMARK and offers better performance than computing it per packet.
Set the maximum security context length to 256.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Give enough rx credits for a full packet instead of using an arbitrary
number which may not be enough depending on the MTU and MPS which can
cause interruptions while waiting for more credits, also remove
debugfs entry for l2cap_le_max_credits.
With these changes the credits are restored after each SDU is received
instead of using fixed threshold, this way it is garanteed that there
will always be enough credits to send a packet without waiting more
credits to arrive.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
This ensures the MPS can fit in a single HCI fragment so each
segment don't have to be reassembled at HCI level, in addition to
that also remove the debugfs entry to configure the MPS.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add the definitions for adding entries to the LE resolve list and
removing entries from the LE resolve list. When the LE resolve list
gets changed via HCI commands make sure that the internal storage of
the resolve list entries gets updated.
Signed-off-by: Ankit Navik <ankit.p.navik@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
(the parameters in question are mark and flow_flags)
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(the parameters in question are mark and flow_flags)
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch attempts to untangle the TX and RX code in qeth from
af_iucv's respective HiperTransport path:
On the TX side, pointing skb_network_header() at the IUCV header
means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset
to access the header.
On the RX side, qeth pulls the (fake) L2 header off the skb like any
normal ethernet driver would. This makes working with the IUCV header
in af_iucv easier, since we no longer have to assume a fixed skb layout.
While at it, replace the open-coded length checks in af_iucv's RX path
with pskb_may_pull().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop @ptr from kernel-doc for function reg_query_regdb_wmm().
This function parameter was recently removed so update the
kernel-doc to match that and remove the kernel-doc warnings.
Removes 109 occurrences of this warning message:
../include/net/cfg80211.h:4869: warning: Excess function parameter 'ptr' description in 'reg_query_regdb_wmm'
Fixes: 38cb87ee47 ("cfg80211: make wmm_rule part of the reg_rule structure")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2018-09-25
This series contains updates to i40e and xsk.
Mariusz fixes an issue where the VF link state was not being updated
properly when the PF is down or up. Also cleaned up the promiscuous
configuration during a VF reset.
Patryk simplifies the code a bit to use the variables for PF and HW that
are declared, rather than using the VSI pointers. Cleaned up the
message length parameter to several virtchnl functions, since it was not
being used (or needed).
Harshitha fixes two potential race conditions when trying to change VF
settings by creating a helper function to validate that the VF is
enabled and that the VSI is set up.
Sergey corrects a double "link down" message by putting in a check for
whether or not the link is up or going down.
Björn addresses an AF_XDP zero-copy issue that buffers passed
from userspace to the kernel was leaked when the hardware descriptor
ring was torn down. A zero-copy capable driver picks buffers off the
fill ring and places them on the hardware receive ring to be completed at
a later point when DMA is complete. Similar on the transmit side; The
driver picks buffers off the transmit ring and places them on the
transmit hardware ring.
In the typical flow, the receive buffer will be placed onto an receive
ring (completed to the user), and the transmit buffer will be placed on
the completion ring to notify the user that the transfer is done.
However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers cannot be leaked. They have to be reused or completed back to
userspace.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement get/put function for blocks that only take/release the reference
and perform deallocation. These functions are intended to be used by
unlocked rules update path to always hold reference to block while working
with it. They use on new fine-grained locking mechanisms introduced in
previous patches in this set, instead of relying on global protection
provided by rtnl lock.
Extract code that is common with tcf_block_detach_ext() into common
function __tcf_block_put().
Extend tcf_block with rcu to allow safe deallocation when it is accessed
concurrently.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation for removing rtnl lock dependency from rules update path,
change tcf block reference counter type to refcount_t to allow modification
by concurrent users.
In block put function perform decrement and check reference counter once to
accommodate concurrent modification by unlocked users. After this change
tcf_chain_put at the end of block put function is called with
block->refcnt==0 and will deallocate block after the last chain is
released, so there is no need to manually deallocate block in this case.
However, if block reference counter reached 0 and there are no chains to
release, block must still be deallocated manually.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>