Fix the report:
net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.
Fix the reserved port range calculation by adding one to the difference.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Author: Frank Sorenson <sorenson@redhat.com>
Date: 2016-06-27 13:55:48 -0500
sunrpc: Fix bit count when setting hashtable size to power-of-two
The hashtable size is incorrectly calculated as the next higher
power-of-two when being set to a power-of-two. fls() returns the
bit number of the most significant set bit, with the least
significant bit being numbered '1'. For a power-of-two, fls()
will return a bit number which is one higher than the number of bits
required, leading to a hashtable which is twice the requested size.
In addition, the value of (1 << nbits) will always be at least num,
so the test will never be true.
Fix the hash table size calculation to correctly set hashtable
size, and eliminate the unnecessary check.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred to be in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.
This can be reproduced as follows:
1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server. Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys
2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m
3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set. Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
5. Now do some I/O to the sec=sys mount. This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1
6. Writes for that user will now be permanently 4K, FILE_SYNC for that
user, regardless of which mount is being written to, until you reboot
the client. Renewing the kerberos ticket (assuming it hasn't already
expired) will have no effect. Grabbing a new kerberos ticket at this
point will have no effect either.
Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags. Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too. Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there were less than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.
The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61 ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Before commit 778be232a2 ("NFS do not find client in NFSv4
pg_authenticate"), the Linux callback server replied with
RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB
request. Let's restore that behavior so the server has a chance to
do something useful about it, and provide a warning that helps
admins correct the problem.
Fixes: 778be232a2 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If an RPC program does not set vs_dispatch and pc_func() returns
rpc_drop_reply, the server sends a reply anyway containing a single
word containing the value RPC_DROP_REPLY (in network byte-order, of
course). This is a nonsense RPC message.
Fixes: 9e701c6109 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Direct data placement is not allowed when using flavors that
guarantee integrity or privacy. When such security flavors are in
effect, don't allow the use of Read and Write chunks for moving
individual data items. All messages larger than the inline threshold
are sent via Long Call or Long Reply.
On my systems (CX-3 Pro on FDR), for small I/O operations, the use
of Long messages adds only around 5 usecs of latency in each
direction.
Note that when integrity or encryption is used, the host CPU touches
every byte in these messages. Even if it could be used, data
movement offload doesn't buy much in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
fixup_copy_count should count only the number of bytes copied to the
page list. The head and tail are now always handled without a data
copy.
And the debugging at the end of rpcrdma_inline_fixup() is also no
longer necessary, since copy_len will be non-zero when there is reply
data in the tail (a normal and valid case).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that rpcrdma_inline_fixup() updates only two fields in
rq_rcv_buf, a full memcpy of that structure to rq_private_buf is
unwarranted. Updating rq_private_buf fields only where needed also
better documents what is going on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ
operations failed. After the client unwrapped the NFS READ reply
message, the NFS READ XDR decoder was not able to decode the reply.
The message was "Server cheating in reply", with the reported
number of received payload bytes being zero. Applications reported
a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero
when the incoming reply fits entirely in the head iovec. The zero
tail.iov_len confused xdr_buf_trim(), which then mangled the actual
reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the
head.iov_len, page_len, or tail.iov_len fields in the receive XDR
buffer when handling an incoming RPC reply message. These fields
contain the length of each component of the XDR buffer, and hence
the maximum number of bytes of reply data that can be stored in each
XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs
if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the
line that sets tail.iov_len to zero, I saw that the logic that
appends the implicit Write chunk pad inline depends on inline_fixup
setting tail.iov_len to zero.
To address this, re-organize the tail iovec handling logic to use
the same approach as with the head iovec: simply point tail.iov_base
to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting
comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the remaining length of an incoming reply is longer than the
XDR buf's page_len, switch over to the tail iovec instead of
copying more than page_len bytes into the page list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.
However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.
Thus the rl_segments array can be used now just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.
This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.
This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.
ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.
This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.
As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of leaving orphaned MRs to be released when the transport
is destroyed, release them immediately. The MR free list can now be
replenished if it becomes exhausted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.
Commit 94f58c58c0 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.
Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.
This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.
FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up, based on code audit: Remove the possibility that the
chunk list XDR encoders can return zero, which would be interpreted
as a NULL.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit c93c62231c ("xprtrdma: Disconnect on registration failure")
added a disconnect for some RPC marshaling failures. This is needed
only in a handful of cases, but it was triggering for simple stuff
like temporary resource shortages. Try to straighten this out.
Fix up the lower layers so they don't return -ENOMEM or other error
codes that the RPC client's FSM doesn't explicitly recognize.
Also fix up the places in the send_request path that do want a
disconnect. For example, when ib_post_send or ib_post_recv fail,
this is a sign that there is a send or receive queue resource
miscalculation. That should be rare, and is a sign of a software
bug. But xprtrdma can recover: disconnect to reset the transport and
start over.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Not having an rpcrdma_rep at call_allocate time can be a problem.
It means that send_request can't post a receive buffer to catch
the RPC's reply. Possible consequences are RPC timeouts or even
transport deadlock.
Instead of allowing an RPC to proceed if an rpcrdma_rep is
not available, return NULL to force call_allocate to wait and
try again.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.
This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.
ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.
ALLPHYSICAL exposes the client to server bugs, including:
o base/bounds errors causing data outside the i/o buffer to be
accessed
o RDMA access after reply causing data corruption and/or integrity
fail
ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.
ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Based on code audit.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I found that commit ead3f26e35 ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.
The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.
Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The use of a scatterlist for handling DMA mapping and unmapping
was recently introduced in frwr_ops.c in commit 4143f34e01
("xprtrdma: Port to new memory registration API"). That commit did
not make a similar update to xprtrdma's FMR support because the
core ib_map_phys_fmr() and ib_unmap_fmr() APIs have not been changed
to take a scatterlist argument.
However, FMR still needs to do DMA mapping and unmapping. It appears
that RDS, for example, uses a scatterlist for this, then builds the
DMA addr array for the ib_map_phys_fmr call separately. I see that
SRP also utilizes a scatterlist for DMA mapping. xprtrdma can do
something similar.
This modernization is used immediately to properly defer DMA
unmapping during fmr_unmap_safe (a FIXME). It separates the DMA
unmapping coordinates from the rl_segments array. This array, being
part of an rpcrdma_req, is always re-used immediately when an RPC
exits. A scatterlist is allocated in memory independent of the
rl_segments array, so it can be preserved indefinitely (ie, until
the MR invalidation and DMA unmapping can actually be done by a
worker thread).
The FRWR and FMR DMA mapping code are slightly different from each
other now, and will diverge further when the "Check for holes" logic
can be removed from FRWR (support for SG_GAP MRs). So I chose not to
create helpers for the common-looking code.
Fixes: ead3f26e35 ("xprtrdma: Add ro_unmap_safe memreg method")
Suggested-by: Sagi Grimberg <sagi@lightbits.io>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Moving these helpers in a separate patch makes later
patches more readable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.
One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
ib_unmap_fmr() takes a list of FMRs to unmap. However, it does not
remove the FMRs from this list as it processes them. Other
ib_unmap_fmr() call sites are careful to remove FMRs from the list
after ib_unmap_fmr() returns.
Since commit 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
fmr_op_unmap_sync passes more than one FMR to ib_unmap_fmr(), but
it didn't bother to remove the FMRs from that list once the call was
complete.
I've noticed some instability that could be related to list
tangling by the new fmr_op_unmap_sync() logic. In an abundance
of caution, add some defensive logic to clean up properly after
ib_unmap_fmr().
Fixes: 7c7a5390dc ("xprtrdma: Add ro_unmap_sync method for FMR")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It was first reported and reproduced by Petr (thanks!) in
https://bugzilla.kernel.org/show_bug.cgi?id=119581
free_percpu(rt->rt6i_pcpu) used to always happen in ip6_dst_destroy().
However, after fixing a deadlock bug in
commit 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt"),
free_percpu() is not called before setting non_pcpu_rt->rt6i_pcpu to NULL.
It is worth to note that rt6i_pcpu is protected by table->tb6_lock.
kmemleak somehow did not report it. We nailed it down by
observing the pcpu entries in /proc/vmallocinfo (first suggested
by Hannes, thanks!).
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Fixes: 9c7370a166 ("ipv6: Fix a potential deadlock when creating pcpu rt")
Reported-by: Petr Novopashenniy <pety@rusnet.ru>
Tested-by: Petr Novopashenniy <pety@rusnet.ru>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Petr Novopashenniy <pety@rusnet.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
dn_fib_count_nhs() could enter an infinite loop if nhp->rtnh_len == 0
(i.e. if userspace passes a malformed netlink message).
Let's use the helpers from net/nexthop.h which take care of all this
stuff. We can do exactly the same as e.g. fib_count_nexthops() and
fib_get_nhs() from net/ipv4/fib_semantics.c.
This fixes the softlockup for me.
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If register_pernet_subsys() fails, we shouldn't try to call
unregister_pernet_subsys().
Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Cc: stable@vger.kernel.org
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
of the link name where left out.
Making the output of tipc-config -ls look something like:
Link statistics:
dcast-link
1:data0-1.1.2:data0
1:data0-1.1.3:data0
Also, for the record, the patch that introduce this regression
claims "Sending the whole object out can cause a leak". Which isn't
very likely as this is a compat layer, where the data we are parsing
is generated by us and we know the string to be NULL terminated. But
you can of course never be to secure.
Fixes: 5d2be1422e (tipc: fix an infoleak in tipc_nl_compat_link_dump)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 9b368814b3 ("net: fix bridge multicast packet checksum validation")
we need to fixup the checksum for CHECKSUM_COMPLETE when
pushing skb on RX path. Otherwise we get similar splats.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.
The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.
But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.
Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports. This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.
Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).
However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).
OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
- In case of local sockets, sk is same as skb->sk;
- In case of a udp tunnel, sk is the tunneling socket.
Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
-----BEGIN PGP SIGNATURE-----
iQIcBAABCgAGBQJXc5wQAAoJEGt7eEactAAd2HQP/RS3lZaNn97spZp9Vb9Swqd6
UjhT4vHL2wHPMhPals2M2ztRhL8df/55tOhSuCwUki0gTsAfP4+Jz1PZ4okFUSBn
6EHvRtAgSAzjBElZVEsNO2sM/qm0UOnlqm4XpOXcKzpnRzDSlIa6vpnctp2MJMBI
PB46x8BPQoi192aTzleMqgBwjrMX1ohsM6rgOIEf7rvLhftx6LRRm7/uKaiwenVT
ZxkP6jIOs5ofkUHLHS4OPyId4yReO8oMMbRRA4fW1z2T8hME9PH3jTZL3Znaj7VF
s2LLSqwUlv06k0WFu3SzJh750jCJ6cnJiIqggDlPChrb0/DPicqgWR7/tJAvNVfj
WJqugeNtEVI+4ElGNxFzFAvjAKqO8fHY5U8Ko4LtZuwhjpo5GDvA+FErFESRXvCG
c647cDMhn+6OXwpUwsggQK4c+AS4QX/PzHjmhW5tKORPXlYVXAAz+JDXn195WnzV
w4UAO4ZWai9XY6oSc63uudaHMt8Xkmq8PRQsH1hHG20CAJ+7bGugaWVTci5fzeNh
JC4X+Xh6t+5qq1jEWgr+KguVKXtP3pVetqGy9kYQrJov0fTe2Z6deEeY08sH1Ffz
xdnQUsZphvqSkosTw76JtNPXz2ajxZCD3KNWjcpkL/NNs9lnIZPZc/mnzKP7ltro
ZEYw0HbcXVt/uFhHlwdG
=dAbs
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two small fixes
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.
When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.
This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.
Fixes: 7f8a436eaa ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_xmit() expects to be called inside an RCU-bh read side critical
section, and while one of its two current callers gets this right, the
other one doesn't.
More specifically, neigh_xmit() has two callers, mpls_forward() and
mpls_output(), and while both callers call neigh_xmit() under
rcu_read_lock(), this provides sufficient protection for neigh_xmit()
only in the case of mpls_forward(), as that is always called from
softirq context and therefore doesn't need explicit BH protection,
while mpls_output() can be called from process context with softirqs
enabled.
When mpls_output() is called from process context, with softirqs
enabled, we can be preempted by a softirq at any time, and RCU-bh
considers the completion of a softirq as signaling the end of any
pending read-side critical sections, so if we do get a softirq
while we are in the part of neigh_xmit() that expects to be run inside
an RCU-bh read side critical section, we can end up with an unexpected
RCU grace period running right in the middle of that critical section,
making things go boom.
This patch fixes this impedance mismatch in the callee, by making
neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
expects to be treated as an RCU-bh read side critical section, as this
seems a safer option than fixing it in the callers.
Fixes: 4fd3d7d9e8 ("neigh: Add helper function neigh_xmit")
Signed-off-by: David Barroso <dbarroso@fastly.com>
Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The PDU length of incoming LLC frames is set to the total skb payload size
in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly
includes the length of the IEEE 802.11 header.
The resulting LLC frame header has a too large PDU length, causing the
llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming
skb, effectively breaking STP.
Solve the problem by properly substracting the IEEE 802.11 frame header size
from the PDU length, allowing the LLC processor to pick up the incoming
control messages.
Special thanks to Gerry Rozema for tracking down the regression and proposing
a suitable patch.
Fixes: 2d1c304cb2 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Cc: stable@vger.kernel.org
Reported-by: Gerry Rozema <gerryr@rozeware.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361 ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun reported a bug in TCP stack and bisected it to a recent commit.
In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.
However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.
Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.
Thanks to Arjun for reporting the bug and running the tests !
Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The untagged vlan object is only destroyed when the interface is removed
via the legacy sysfs interface. But it also has to be destroyed when the
standard rtnl-link interface is used.
Fixes: 5d2c05b213 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb_linearize may reallocate the skb. This makes the calculated pointer
for ethhdr invalid. But it the pointer is used later to fill in the RR
field of the batadv_icmp_packet_rr packet.
Instead re-evaluate eth_hdr after the skb_linearize+skb_cow to fix the
pointer and avoid the invalid read.
Fixes: da6b8c20a5 ("batman-adv: generalize batman-adv icmp packet handling")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>