Commit Graph

431 Commits

Author SHA1 Message Date
Florian Westphal 860975c6f8 mptcp: skip to next candidate if subflow has unacked data
In case a subflow path is blocked, MPTCP-level retransmit may not take
place anymore because such subflow is likely to have unacked data left
in its write queue.

Ignore subflows that have experienced loss and test next candidate.

Fixes: 3b1d6210a9 ("mptcp: implement and use MPTCP-level retransmission")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Paolo Abeni 26aa231439 mptcp: fix state tracking for fallback socket
We need to cope with some more state transition for
fallback sockets, or could still end-up moving to TCP_CLOSE
too early and avoid spooling some pending data

Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Paolo Abeni b2771d2419 mptcp: drop WORKER_RUNNING status bit
Only mptcp_close() can actually cancel the workqueue,
no need to add and use this flag.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Paolo Abeni b680a214ec mptcp: update rtx timeout only if required.
We must start the retransmission timer only there are
pending data in the rtx queue.
Otherwise we can hit a WARN_ON in mptcp_reset_timer(),
as syzbot demonstrated.

Reported-and-tested-by: syzbot+42aa53dafb66a07e5a24@syzkaller.appspotmail.com
Fixes: d9ca1de8c0 ("mptcp: move page frag allocation in mptcp_sendmsg()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/r/1a72039f112cae048c44d398ffa14e0a1432db3d.1605737083.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 21:54:24 -08:00
Eric Biggers a24d22b225 crypto: sha - split sha.h into sha1.h and sha2.h
Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.

This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure.  So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.

Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.

This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1.  It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-11-20 14:45:33 +11:00
Paolo Abeni 7ed90803a2 mptcp: send explicit ack on delayed ack_seq incr
When the worker moves some bytes from the OoO queue into
the receive queue, the msk->ask_seq is updated, the MPTCP-level
ack carrying that value needs to wait the next ingress packet,
possibly slowing down or hanging the peer

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Florian Westphal 6f8a612a33 mptcp: keep track of advertised windows right edge
Before sending 'x' new bytes also check that the new snd_una would
be within the permitted receive window.

For every ACK that also contains a DSS ack, check whether its tcp-level
receive window would advance the current mptcp window right edge and
update it if so.

Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Florian Westphal 8edf08649e mptcp: rework poll+nospace handling
MPTCP maintains a status bit, MPTCP_SEND_SPACE, that is set when at
least one subflow and the mptcp socket itself are writeable.

mptcp_poll returns EPOLLOUT if the bit is set.

mptcp_sendmsg makes sure MPTCP_SEND_SPACE gets cleared when last write
has used up all subflows or the mptcp socket wmem.

This reworks nospace handling as follows:

MPTCP_SEND_SPACE is replaced with MPTCP_NOSPACE, i.e. inverted meaning.
This bit is set when the mptcp socket is not writeable.
The mptcp-level ack path schedule will then schedule the mptcp worker
to allow it to free already-acked data (and reduce wmem usage).

This will then wake userspace processes that wait for a POLLOUT event.

sendmsg will set MPTCP_NOSPACE only when it has to wait for more
wmem (blocking I/O case).

poll path will set MPTCP_NOSPACE in case the mptcp socket is
not writeable.

Normal tcp-level notification (SOCK_NOSPACE) is only enabled
in case the subflow socket has no available wmem.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Paolo Abeni 813e0a683d mptcp: try to push pending data on snd una updates
After the previous patch we may end-up with unsent data
in the write buffer. If such buffer is full, the writer
will block for unlimited time.

We need to trigger the MPTCP xmit path even for the
subflow rx path, on MPTCP snd_una updates.

Keep things simple and just schedule the work queue if
needed.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Paolo Abeni d9ca1de8c0 mptcp: move page frag allocation in mptcp_sendmsg()
mptcp_sendmsg() is refactored so that first it copies
the data provided from user space into the send queue,
and then tries to spool the send queue via sendmsg_frag.

There a subtle change in the mptcp level collapsing on
consecutive data fragment: we now allow that only on unsent
data.

The latter don't need to deal with msghdr data anymore
and can be simplified in a relevant way.

snd_nxt and write_seq are now tracked independently.

Overall this allows some relevant cleanup and will
allow sending pending mptcp data on msk una update in
later patch.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Paolo Abeni e16163b6e2 mptcp: refactor shutdown and close
We must not close the subflows before all the MPTCP level
data, comprising the DATA_FIN has been acked at the MPTCP
level, otherwise we could be unable to retransmit as needed.

__mptcp_wr_shutdown() shutdown is responsible to check for the
correct status and close all subflows. Is called by the output
path after spooling any data and at shutdown/close time.

In a similar way, __mptcp_destroy_sock() is responsible to clean-up
the MPTCP level status, and is called when the msk transition
to TCP_CLOSE.

The protocol level close() does not force anymore the TCP_CLOSE
status, but orphan the msk socket and all the subflows.
Orphaned msk sockets are forciby closed after a timeout or
when all MPTCP-level data is acked.

There is a caveat about keeping the orphaned subflows around:
the TCP stack can asynchronusly call tcp_cleanup_ulp() on them via
tcp_close(). To prevent accessing freed memory on later MPTCP
level operations, the msk acquires a reference to each subflow
socket and prevent subflow_ulp_release() from releasing the
subflow context before __mptcp_destroy_sock().

The additional subflow references are released by __mptcp_done()
and the async ULP release is detected checking ULP ops. If such
field has been already cleared by the ULP release path, the
dangling context is freed directly by __mptcp_done().

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Paolo Abeni eaa2ffabfc mptcp: introduce MPTCP snd_nxt
Track the next MPTCP sequence number used on xmit,
currently always equal to write_next.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:07 -08:00
Paolo Abeni f0e6a4cf11 mptcp: add accounting for pending data
Preparation patch to track the data pending in the msk
write queue. No functional change introduced here

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:06 -08:00
Paolo Abeni caf971df01 mptcp: reduce the arguments of mptcp_sendmsg_frag
The current argument list is pretty long and quite unreadable,
move many of them into a specific struct. Later patches
will add more stuff to such struct.

Additionally drop the 'timeo' argument, now unused.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:06 -08:00
Paolo Abeni ba8f48f7a4 mptcp: introduce mptcp_schedule_work
remove some of code duplications an allow preventing
rescheduling on close.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:06 -08:00
Paolo Abeni e2223995a2 mptcp: use tcp_build_frag()
mptcp_push_pending() is called even on orphaned
msk (and orphaned subflows), if there is outstanding
data at close() time.

To cope with the above MPTCP needs to handle explicitly
the allocation failure on xmit. The newly introduced
do_tcp_sendfrag() allows that, just plug it.

We can additionally drop a couple of sanity checks,
duplicate in the TCP code.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:46:06 -08:00
Jakub Kicinski e1d9d7b913 Merge https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 16:54:48 -08:00
Geliang Tang 724d06b437 mptcp: fix static checker warnings in mptcp_pm_add_timer
Fix the following Smatch complaint:

     net/mptcp/pm_netlink.c:213 mptcp_pm_add_timer()
     warn: variable dereferenced before check 'msk' (see line 208)

 net/mptcp/pm_netlink.c
    207          struct mptcp_sock *msk = entry->sock;
    208          struct sock *sk = (struct sock *)msk;
    209          struct net *net = sock_net(sk);
                                           ^^
 "msk" dereferenced here.

    210
    211          pr_debug("msk=%p", msk);
    212
    213          if (!msk)
                    ^^^^
 Too late.

    214                  return;
    215

Fixes: 93f323b9cc ("mptcp: add a new sysctl add_addr_timeout")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Link: https://lore.kernel.org/r/078a2ef5bdc4e3b2c25ef852461692001f426495.1604976945.git.geliangtang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 07:54:15 -08:00
Paolo Abeni 989ef49bdf mptcp: provide rmem[0] limit
The mptcp proto struct currently does not provide the
required limit for forward memory scheduling. Under
pressure sk_rmem_schedule() will unconditionally try
to use such field and will oops.

Address the issue inheriting the tcp limit, as we already
do for the wmem one.

Fixes: 9c3f94e168 ("mptcp: add missing memory scheduling in the rx path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/37af798bd46f402fb7c79f57ebbdd00614f5d7fa.1604861097.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 17:34:34 -08:00
Jakub Kicinski ae0d0bb29b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-06 17:33:38 -08:00
Geliang Tang 93f323b9cc mptcp: add a new sysctl add_addr_timeout
This patch added a new sysctl, named add_addr_timeout, to control the
timeout value (in seconds) of the ADD_ADDR retransmission.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 17:45:53 -08:00
Florian Westphal 95ed690ebc mptcp: split mptcp_clean_una function
mptcp_clean_una() will wake writers in case memory could be reclaimed.
When called from mptcp_sendmsg the wakeup code isn't needed.

Move the wakeup to a new helper and then use that from the mptcp worker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 17:45:53 -08:00
Paolo Abeni 5a369ca643 tcp: propagate MPTCP skb extensions on xmit splits
When the TCP stack splits a packet on the write queue, the tail
half currently lose the associated skb extensions, and will not
carry the DSM on the wire.

The above does not cause functional problems and is allowed by
the RFC, but interact badly with GRO and RX coalescing, as possible
candidates for aggregation will carry different TCP options.

This change tries to improve the MPTCP behavior, propagating the
skb extensions on split.

Additionally, we must prevent the MPTCP stack from updating the
mapping after the split occur: that will both violate the RFC and
fool the reader.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 17:45:53 -08:00
Florian Westphal 65f49fe72f mptcp: use _fast lock version in __mptcp_move_skbs
The function is short and won't sleep, so this can use the _fast version.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 17:45:52 -08:00
Florian Westphal 13c7ba0c84 mptcp: adjust mptcp receive buffer limit if subflow has larger one
In addition to tcp autotuning during read, it may also increase the
receive buffer in tcp_clamp_window().

In this case, mptcp should adjust its receive buffer size as well so
it can move all pending skbs from the subflow socket to the mptcp socket.

At this time, TCP can have more skbs ready for processing than what the
mptcp receive buffer size allows.

In the mptcp case, the receive window announced is based on the free
space of the mptcp parent socket instead of the individual subflows.

Following the subflow allows mptcp to grow its receive buffer.

This is especially noticeable for loopback traffic where two skbs are
enough to fill the initial receive window.

In mptcp_data_ready() we do not hold the mptcp socket lock, so modifying
mptcp_sk->sk_rcvbuf is racy.  Do it when moving skbs from subflow to
mptcp socket, both sockets are locked in this case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 17:45:52 -08:00
Davide Caratti e16b874ee8 mptcp: token: fix unititialized variable
gcc complains about use of uninitialized 'num'. Fix it by doing the first
assignment of 'num' when the variable is declared.

Fixes: 96d890daad ("mptcp: add msk interations helper")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/49e20da5d467a73414d4294a8bd35e2cb1befd49.1604308087.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-03 13:08:30 -08:00
Paolo Abeni 9c3f94e168 mptcp: add missing memory scheduling in the rx path
When moving the skbs from the subflow into the msk receive
queue, we must schedule there the required amount of memory.

Try to borrow the required memory from the subflow, if needed,
so that we leverage the existing TCP heuristic.

Fixes: 6771bfd9ee ("mptcp: update mptcp ack sequence from work queue")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/f6143a6193a083574f11b00dbf7b5ad151bc4ff4.1603810630.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29 11:27:14 -07:00
Matthieu Baerts 0ed37ac586 mptcp: depends on IPV6 but not as a module
Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP IPv6'
support also depends on CONFIG_IPV6. But not all functions from IPv6
code are exported.

To simplify the code and reduce modifications outside MPTCP, it was
decided from the beginning to support MPTCP with IPv6 only if
CONFIG_IPV6 was built inlined. That's also why CONFIG_MPTCP_IPV6 was
created. More modifications are needed to support CONFIG_IPV6=m.

Even if it was not explicit, until recently, we were forcing CONFIG_IPV6
to be built-in because we had "select IPV6" in Kconfig. Now that we have
"depends on IPV6", we have to explicitly set "IPV6=y" to force
CONFIG_IPV6 not to be built as a module.

In other words, we can now only have CONFIG_MPTCP_IPV6=y if
CONFIG_IPV6=y.

Note that the new dependency might hide the fact IPv6 is not supported
in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did
before was forcing it to be built-in while it was maybe not what the
user wants.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 010b430d5d ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-21 08:05:40 -07:00
Geert Uytterhoeven 010b430d5d mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
MPTCP_IPV6 selects IPV6, thus enabling an optional feature the user may
not want to enable.  Fix this by making MPTCP_IPV6 depend on IPV6, like
is done for all other IPv6 features.

Fixes: f870fa0b57 ("mptcp: Add MPTCP socket stubs")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201020073839.29226-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 20:54:08 -07:00
Geert Uytterhoeven b142083b58 mptcp: MPTCP_KUNIT_TESTS should depend on MPTCP instead of selecting it
MPTCP_KUNIT_TESTS selects MPTCP, thus enabling an optional feature the
user may not want to enable.  Fix this by making the test depend on
MPTCP instead.

Fixes: a00a582203 ("mptcp: move crypto test to KUNIT")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201019113240.11516-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:50:06 -07:00
Geliang Tang 65b8c8a620 mptcp: move mptcp_options_received's port initialization
Move mptcp_options_received's port initialization from
mptcp_parse_option to mptcp_get_options, put it together with
the other fields initializations of mptcp_options_received.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:32:36 -07:00
Geliang Tang fe2d9b1a0e mptcp: initialize mptcp_options_received's ahmac
Initialize mptcp_options_received's ahmac to zero, otherwise it
will be a random number when receiving ADD_ADDR suboption with echo-flag=1.

Fixes: 3df523ab58 ("mptcp: Add ADD_ADDR handling")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:32:14 -07:00
Jakub Kicinski 2295cddf99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflicts in net/mptcp/protocol.h and
tools/testing/selftests/net/Makefile.

In both cases code was added on both sides in the same place
so just keep both.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 12:43:21 -07:00
Paolo Abeni 0e4f35d788 mptcp: subflows garbage collection
The msk can close MP_JOIN subflows if the initial handshake
fails. Currently such subflows are kept alive in the
conn_list until the msk itself is closed.

Beyond the wasted memory, we could end-up sending the
DATA_FIN and the DATA_FIN ack on such socket, even after a
reset.

Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-10 11:04:53 -07:00
Paolo Abeni d582484726 mptcp: fix fallback for MP_JOIN subflows
Additional/MP_JOIN subflows that do not pass some initial handshake
tests currently causes fallback to TCP. That is an RFC violation:
we should instead reset the subflow and leave the the msk untouched.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/91
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-10 11:04:53 -07:00
Davide Caratti 37198e93ce net: mptcp: make DACK4/DACK8 usage consistent among all subflows
using packetdrill it's possible to observe the same MPTCP DSN being acked
by different subflows with DACK4 and DACK8. This is in contrast with what
specified in RFC8684 §3.3.2: if an MPTCP endpoint transmits a 64-bit wide
DSN, it MUST be acknowledged with a 64-bit wide DACK. Fix 'use_64bit_ack'
variable to make it a property of MPTCP sockets, not TCP subflows.

Fixes: a0c1d0eafd ("mptcp: Use 32-bit DATA_ACK when possible")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-09 08:25:48 -07:00
Paolo Abeni d9fb8c507d mptcp: fix infinite loop on recvmsg()/worker() race.
If recvmsg() and the workqueue race to dequeue the data
pending on some subflow, the current mapping for such
subflow covers several skbs and some of them have not
reached yet the received, either the worker or recvmsg()
can find a subflow with the data_avail flag set - since
the current mapping is valid and in sequence - but no
skbs in the receive queue - since the other entity just
processed them.

The above will lead to an unbounded loop in __mptcp_move_skbs()
and a subsequent hang of any task trying to acquiring the msk
socket lock.

This change addresses the issue stopping the __mptcp_move_skbs()
loop as soon as we detect the above race (empty receive queue
with data_avail set).

Reported-and-tested-by: syzbot+fcf8ca5817d6e92c6567@syzkaller.appspotmail.com
Fixes: ab174ad8ef ("mptcp: move ooo skbs into msk out of order queue.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08 17:24:04 -07:00
Jakub Kicinski 9d49aea13f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Small conflict around locking in rxrpc_process_event() -
channel_lock moved to bundle in next, while state lock
needs _bh() from net.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08 15:44:50 -07:00
Paolo Abeni 717f203416 mptcp: don't skip needed ack
Currently we skip calling tcp_cleanup_rbuf() when packets
are moved into the OoO queue or simply dropped. In both
cases we still increment tp->copied_seq, and we should
ask the TCP stack to check for ack.

Fixes: c76c695656 ("mptcp: call tcp_cleanup_rbuf on subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-06 06:08:06 -07:00
Paolo Abeni 017512a07e mptcp: more DATA FIN fixes
Currently data fin on data packet are not handled properly:
the 'rcv_data_fin_seq' field is interpreted as the last
sequence number carrying a valid data, but for data fin
packet with valid maps we currently store map_seq + map_len,
that is, the next value.

The 'write_seq' fields carries instead the value subseguent
to the last valid byte, so in mptcp_write_data_fin() we
never detect correctly the last DSS map.

Fixes: 7279da6145 ("mptcp: Use MPTCP-level flag for sending DATA_FIN")
Fixes: 1a49b2c2a5 ("mptcp: Handle incoming 32-bit DATA_FIN values")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-06 06:06:59 -07:00
David S. Miller 8b0308fe31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 18:40:01 -07:00
Rikard Falkeborn 674d3ab949 mptcp: Constify mptcp_pm_ops
The only usages of mptcp_pm_ops is to assign its address to the small_ops
field of the genl_family struct, which is a const pointer, and applying
ARRAY_SIZE() on it. Make it const to allow the compiler to put it in
read-only memory.

Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-04 21:13:36 -07:00
Matthieu Baerts 456afe01b1 mptcp: ADD_ADDRs with echo bit are smaller
The MPTCP ADD_ADDR suboption with echo-flag=1 has no HMAC, the size is
smaller than the one initially sent without echo-flag=1. We then need to
use the correct size everywhere when we need this echo bit.

Before this patch, the wrong size was reserved but the correct amount of
bytes were written (and read): the remaining bytes contained garbage.

Fixes: 6a6c05a8b0 ("mptcp: send out ADD_ADDR with echo flag")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/95
Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-03 17:36:37 -07:00
Jakub Kicinski 66a9b9287d genetlink: move to smaller ops wherever possible
Bulk of the genetlink users can use smaller ops, move them.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02 19:11:11 -07:00
Mat Martineau 1a49b2c2a5 mptcp: Handle incoming 32-bit DATA_FIN values
The peer may send a DATA_FIN mapping with either a 32-bit or 64-bit
sequence number. When a 32-bit sequence number is received for the
DATA_FIN, it must be expanded to 64 bits before comparing it to the
last acked sequence number. This expansion was missing.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/93
Fixes: 3721b9b646 ("mptcp: Track received DATA_FIN sequence number and add related helpers")
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29 18:15:46 -07:00
Mat Martineau 917944da3b mptcp: Consistently use READ_ONCE/WRITE_ONCE with msk->ack_seq
The msk->ack_seq value is sometimes read without the msk lock held, so
make proper use of READ_ONCE and WRITE_ONCE.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29 18:15:40 -07:00
Florian Westphal 77d0cab939 net: tcp: drop unused function argument from mptcp_incoming_options
Since commit cfde141ea3 ("mptcp: move option parsing into
mptcp_incoming_options()"), the 3rd function argument is no longer used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 20:17:01 -07:00
Geliang Tang 00cfd77b90 mptcp: retransmit ADD_ADDR when timeout
This patch implemented the retransmition of ADD_ADDR when no ADD_ADDR echo
is received. It added a timer with the announced address. When timeout
occurs, ADD_ADDR will be retransmitted.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:34 -07:00
Geliang Tang 0abd40f823 mptcp: add struct mptcp_pm_add_entry
Add a new struct mptcp_pm_add_entry to describe add_addr's entry.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:34 -07:00
Geliang Tang 5c8c164095 mptcp: add mptcp_destroy_common helper
This patch added a new helper named mptcp_destroy_common containing the
shared code between mptcp_destroy() and mptcp_sock_destruct().

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:34 -07:00
Geliang Tang 7a7e52e38a mptcp: add RM_ADDR related mibs
This patch added two new mibs for RM_ADDR, named MPTCP_MIB_RMADDR and
MPTCP_MIB_RMSUBFLOW, when the RM_ADDR suboption is received, increase
the first mib counter, when the local subflow is removed, increase the
second mib counter.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:34 -07:00
Geliang Tang 0ee4261a36 mptcp: implement mptcp_pm_remove_subflow
This patch implemented the local subflow removing function,
mptcp_pm_remove_subflow, it simply called mptcp_pm_nl_rm_subflow_received
under the PM spin lock.

We use mptcp_pm_remove_subflow to remove a local subflow, so change it's
argument from remote_id to local_id.

We check subflow->local_id in mptcp_pm_nl_rm_subflow_received to remove
a subflow.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:34 -07:00
Geliang Tang b6c0838086 mptcp: remove addr and subflow in PM netlink
This patch implements the remove announced addr and subflow logic in PM
netlink.

When the PM netlink removes an address, we traverse all the existing msk
sockets to find the relevant sockets.

We add a new list named anno_list in mptcp_pm_data, to record all the
announced addrs. In the traversing, we check if it has been recorded.
If it has been, we trigger the RM_ADDR signal.

We also check if this address is in conn_list. If it is, we remove the
subflow which using this local address.

Since we call mptcp_pm_free_anno_list in mptcp_destroy, we need to move
__mptcp_init_sock before the mptcp_is_enabled check in mptcp_init_sock.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang f58f065aa1 mptcp: add accept_subflow re-check
The re-check of pm->accept_subflow with pm->lock held was missing, this
patch fixed it.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang a877de0671 mptcp: add ADD_ADDR related mibs
This patch added two mibs for ADD_ADDR, MPTCP_MIB_ADDADDR for receiving
of the ADD_ADDR suboption with echo-flag=0, and MPTCP_MIB_ECHOADD for
receiving the ADD_ADDR suboption with echo-flag=1.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang 6a6c05a8b0 mptcp: send out ADD_ADDR with echo flag
When the ADD_ADDR suboption has been received, we need to send out the same
ADD_ADDR suboption with echo-flag=1, and no HMAC.

Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang d0876b2284 mptcp: add the incoming RM_ADDR support
This patch added the RM_ADDR option parsing logic:

We parsed the incoming options to find if the rm_addr option is received,
and called mptcp_pm_rm_addr_received to schedule PM work to a new status,
named MPTCP_PM_RM_ADDR_RECEIVED.

PM work got this status, and called mptcp_pm_nl_rm_addr_received to handle
it.

In mptcp_pm_nl_rm_addr_received, we closed the subflow matching the rm_id,
and updated PM counter.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang 5cb104ae55 mptcp: add the outgoing RM_ADDR support
This patch added a new signal named rm_addr_signal in PM. On outgoing path,
we called mptcp_pm_should_rm_signal to check if rm_addr_signal has been
set. If it has been, we sent out the RM_ADDR option.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Geliang Tang f643b8032e mptcp: rename addr_signal and the related functions
This patch renamed addr_signal and the related functions with the explicit
word "add".

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-24 19:58:33 -07:00
Mat Martineau ef59b1953c mptcp: Wake up MPTCP worker when DATA_FIN found on a TCP FIN packet
When receiving a DATA_FIN MPTCP option on a TCP FIN packet, the DATA_FIN
information would be stored but the MPTCP worker did not get
scheduled. In turn, the MPTCP socket state would remain in
TCP_ESTABLISHED and no blocked operations would be awakened.

TCP FIN packets are seen by the MPTCP socket when moving skbs out of the
subflow receive queues, so schedule the MPTCP worker when a skb with
DATA_FIN but no data payload is moved from a subflow queue. Other cases
(DATA_FIN on a bare TCP ACK or on a packet with data payload) are
already handled.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/84
Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-23 17:30:52 -07:00
David S. Miller 3ab0a7a0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing it's
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22 16:45:34 -07:00
Paolo Abeni 1d39cd8cf7 mptcp: fix integer overflow in mptcp_subflow_discard_data()
Christoph reported an infinite loop in the subflow receive path
under stress condition.

If there are multiple subflows, each of them using a large send
buffer, the delta between the sequence number used by
MPTCP-level retransmission can and the current msk->ack_seq
can be greater than MAX_INT.

In the above scenario, when calling mptcp_subflow_discard_data(),
such delta will be truncated to int, and could result in a negative
number: no bytes will be dropped, and subflow_check_data_avail()
will try again to process the same packet, looping forever.

This change addresses the issue by expanding the 'limit' size to 64
bits, so that overflows are not possible anymore.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/87
Fixes: 6719331c2f ("mptcp: trigger msk processing even for OoO data")
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 18:04:48 -07:00
Ye Bin c2ec6bc010 mptcp: Fix unsigned 'max_seq' compared with zero in mptcp_data_queue_ofo
Fixes coccicheck warnig:
net/mptcp/protocol.c:164:11-18: WARNING: Unsigned expression compared with zero: max_seq > 0

Fixes: ab174ad8ef ("mptcp: move ooo skbs into msk out of order queue")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 16:45:05 -07:00
Paolo Abeni c76c695656 mptcp: call tcp_cleanup_rbuf on subflows
That is needed to let the subflows announce promptly when new
space is available in the receive buffer.

tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni d5f49190de mptcp: allow picking different xmit subflows
Update the scheduler to less trivial heuristic: cache
the last used subflow, and try to send on it a reasonably
long burst of data.

When the burst or the subflow send space is exhausted, pick
the subflow with the lower ratio between write space and
send buffer - that is, the subflow with the greater relative
amount of free space.

v1 -> v2:
 - fix 32 bit build breakage due to 64bits div
 - fix checkpath issues (uint64_t -> u64)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 4596a2c1b7 mptcp: allow creating non-backup subflows
Currently the 'backup' attribute of local endpoint
is ignored. Let's use it for the MP_JOIN handshake

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni ef0da3b8a2 mptcp: move address attribute into mptcp_addr_info
So that can be accessed easily from the subflow creation
helper. No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 06242e44b9 mptcp: add OoO related mibs
Add a bunch of MPTCP mibs related to MPTCP OoO data
processing.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 04e4cd4f7c mptcp: cleanup mptcp_subflow_discard_data()
There is no need to use the tcp_read_sock(), we can
simply drop the skb. Additionally try to look at the
next buffer for in order data.

This both simplifies the code and avoid unneeded indirect
calls.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni ab174ad8ef mptcp: move ooo skbs into msk out of order queue.
Add an RB-tree to cope with OoO (at MPTCP level) data.
__mptcp_move_skb() insert into the RB tree "future"
data, eventually coalescing skb as allowed by the
MPTCP DSN.

To simplify sequence accounting, move the DSN inside
the cb.

After successfully enqueuing in sequence data, check
if we can use any data from the RB tree.

Additionally move the data_fin check after spooling
data from the OoO tree, otherwise we could miss shutdown
events.

The RB tree code is copied as verbatim as possible
from tcp_data_queue_ofo(), with a few simplifications
due to the fact that MPTCP doesn't need to cope with
sacks. All bugs here are added by me.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 8268ed4c9d mptcp: introduce and use mptcp_try_coalesce()
Factor-out existing code, will be re-used by the
next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni da51aef5fe mptcp: basic sndbuf autotuning
Let the msk sendbuf track the size of the larger subflow's
send window, so that we ensure mptcp_sendmsg() does not
exceed MPTCP-level send window.

The update is performed just before try to send any data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 6719331c2f mptcp: trigger msk processing even for OoO data
This is a prerequisite to allow receiving data from multiple
subflows without re-injection.

Instead of dropping the OoO - "future" data in
subflow_check_data_avail(), call into __mptcp_move_skbs()
and let the msk drop that.

To avoid code duplication factor out the mptcp_subflow_discard_data()
helper.

Note that __mptcp_move_skbs() can now find multiple subflows
with data avail (comprising to-be-discarded data), so must
update the byte counter incrementally.

v1 -> v2:
 - fix checkpatch issues (unsigned -> unsigned int)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 47bebdf365 mptcp: set data_ready status bit in subflow_check_data_avail()
This simplify mptcp_subflow_data_available() and will
made follow-up patches simpler.

Additionally remove the unneeded checks on subflow copied_seq:
we always whole skbs out of subflows.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni 63561a403c mptcp: rethink 'is writable' conditional
Currently, when checking for the 'msk is writable' condition, we
look at the individual subflows write space.
That works well while we send data via a single subflow, but will
not as soon as we will enable concurrent xmit on multiple subflows.

With this change msk becomes writable when the following conditions
hold:
- the socket has some free write space
- there is at least a subflow with write free space

Additionally we need to set the NOSPACE bit on all subflows
before blocking.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Geliang Tang f612eb76f3 mptcp: fix kmalloc flag in mptcp_pm_nl_get_local_id
mptcp_pm_nl_get_local_id may be called in interrupt context, so we need to
use GFP_ATOMIC flag to allocate memory to avoid sleeping in atomic context.

[  280.209809] BUG: sleeping function called from invalid context at mm/slab.h:498
[  280.209812] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1680, name: kworker/1:3
[  280.209814] INFO: lockdep is turned off.
[  280.209816] CPU: 1 PID: 1680 Comm: kworker/1:3 Tainted: G        W         5.9.0-rc3-mptcp+ #146
[  280.209818] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  280.209820] Workqueue: events mptcp_worker
[  280.209822] Call Trace:
[  280.209824]  <IRQ>
[  280.209826]  dump_stack+0x77/0xa0
[  280.209829]  ___might_sleep.cold+0xa6/0xb6
[  280.209832]  kmem_cache_alloc_trace+0x1d1/0x290
[  280.209835]  mptcp_pm_nl_get_local_id+0x23c/0x410
[  280.209840]  subflow_init_req+0x1e9/0x2ea
[  280.209843]  ? inet_reqsk_alloc+0x1c/0x120
[  280.209845]  ? kmem_cache_alloc+0x264/0x290
[  280.209849]  tcp_conn_request+0x303/0xae0
[  280.209854]  ? printk+0x53/0x6a
[  280.209857]  ? tcp_rcv_state_process+0x28f/0x1374
[  280.209859]  tcp_rcv_state_process+0x28f/0x1374
[  280.209864]  ? tcp_v4_do_rcv+0xb3/0x1f0
[  280.209866]  tcp_v4_do_rcv+0xb3/0x1f0
[  280.209869]  tcp_v4_rcv+0xed6/0xfa0
[  280.209873]  ip_protocol_deliver_rcu+0x28/0x270
[  280.209875]  ip_local_deliver_finish+0x89/0x120
[  280.209877]  ip_local_deliver+0x180/0x220
[  280.209881]  ip_rcv+0x166/0x210
[  280.209885]  __netif_receive_skb_one_core+0x82/0x90
[  280.209888]  process_backlog+0xd6/0x230
[  280.209891]  net_rx_action+0x13a/0x410
[  280.209895]  __do_softirq+0xcf/0x468
[  280.209899]  asm_call_on_stack+0x12/0x20
[  280.209901]  </IRQ>
[  280.209903]  ? ip_finish_output2+0x240/0x9a0
[  280.209906]  do_softirq_own_stack+0x4d/0x60
[  280.209908]  do_softirq.part.0+0x2b/0x60
[  280.209911]  __local_bh_enable_ip+0x9a/0xa0
[  280.209913]  ip_finish_output2+0x264/0x9a0
[  280.209916]  ? rcu_read_lock_held+0x4d/0x60
[  280.209920]  ? ip_output+0x7a/0x250
[  280.209922]  ip_output+0x7a/0x250
[  280.209925]  ? __ip_finish_output+0x330/0x330
[  280.209928]  __ip_queue_xmit+0x1dc/0x5a0
[  280.209931]  __tcp_transmit_skb+0xa0f/0xc70
[  280.209937]  tcp_connect+0xb03/0xff0
[  280.209939]  ? lockdep_hardirqs_on_prepare+0xe7/0x190
[  280.209942]  ? ktime_get_with_offset+0x125/0x150
[  280.209944]  ? trace_hardirqs_on+0x1c/0xe0
[  280.209948]  tcp_v4_connect+0x449/0x550
[  280.209953]  __inet_stream_connect+0xbb/0x320
[  280.209955]  ? mark_held_locks+0x49/0x70
[  280.209958]  ? lockdep_hardirqs_on_prepare+0xe7/0x190
[  280.209960]  ? __local_bh_enable_ip+0x6b/0xa0
[  280.209963]  inet_stream_connect+0x32/0x50
[  280.209966]  __mptcp_subflow_connect+0x1fd/0x242
[  280.209972]  mptcp_pm_create_subflow_or_signal_addr+0x2db/0x600
[  280.209975]  mptcp_worker+0x543/0x7a0
[  280.209980]  process_one_work+0x26d/0x5b0
[  280.209984]  ? process_one_work+0x5b0/0x5b0
[  280.209987]  worker_thread+0x48/0x3d0
[  280.209990]  ? process_one_work+0x5b0/0x5b0
[  280.209993]  kthread+0x117/0x150
[  280.209996]  ? kthread_park+0x80/0x80
[  280.209998]  ret_from_fork+0x22/0x30

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:30:03 -07:00
Geliang Tang 2ff0e566fa mptcp: fix subflow's remote_id issues
This patch set the init remote_id to zero, otherwise it will be a random
number.

Then it added the missing subflow's remote_id setting code both in
__mptcp_subflow_connect and in subflow_ulp_clone.

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Fixes: ec3edaa7ca ("mptcp: Add handling of outgoing MP_JOIN requests")
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:29:15 -07:00
Geliang Tang 57025817ea mptcp: fix subflow's local_id issues
In mptcp_pm_nl_get_local_id, skc_local is the same as msk_local, so it
always return 0. Thus every subflow's local_id is 0. It's incorrect.

This patch fixed this issue.

Also, we need to ignore the zero address here, like 0.0.0.0 in IPv4. When
we use the zero address as a local address, it means that we can use any
one of the local addresses. The zero address is not a new address, we don't
need to add it to PM, so this patch added a new function address_zero to
check whether an address is the zero address, if it is, we ignore this
address.

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:29:15 -07:00
Jakub Kicinski 44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Linus Torvalds 3e8d3bdc2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

 2) Fix loss of RTT samples in rxrpc, from David Howells.

 3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu.

 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

 5) We disable BH for too lokng in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

 8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

11) Memory leak in rxkad_verify_response, from Dinghao Liu.

12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.

15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
  net/smc: fix sock refcounting in case of termination
  net/smc: reset sndbuf_desc if freed
  net/smc: set rx_off for SMCR explicitly
  net/smc: fix toleration of fake add_link messages
  tg3: Fix soft lockup when tg3_reset_task() fails.
  doc: net: dsa: Fix typo in config code sample
  net: dp83867: Fix WoL SecureOn password
  nfp: flower: fix ABI mismatch between driver and firmware
  tipc: fix shutdown() of connectionless socket
  ipv6: Fix sysctl max for fib_multipath_hash_policy
  drivers/net/wan/hdlc: Change the default of hard_header_len to 0
  net: gemini: Fix another missing clk_disable_unprepare() in probe
  net: bcmgenet: fix mask check in bcmgenet_validate_flow()
  amd-xgbe: Add support for new port mode
  net: usb: dm9601: Add USB ID of Keenetic Plus DSL
  vhost: fix typo in error message
  net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
  pktgen: fix error message with wrong function name
  net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
  cxgb4: fix thermal zone device registration
  ...
2020-09-03 18:50:48 -07:00
YueHaibing b1fd4470cd mptcp: Remove unused macro MPTCP_SAME_STATE
There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:37:40 -07:00
Florian Westphal 1cec170d45 mptcp: free acked data before waiting for more memory
After subflow lock is dropped, more wmem might have been made available.

This fixes a deadlock in mptcp_connect.sh 'mmap' mode: wmem is exhausted.
But as the mptcp socket holds on to already-acked data (for retransmit)
no wakeup will occur.

Using 'goto restart' calls mptcp_clean_una(sk) which will free pages
that have been acked completely in the mean time.

Fixes: fb529e62d3 ("mptcp: break and restart in case mptcp sndbuf is full")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 15:48:44 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Johannes Berg 8140860c81 netlink: consistently use NLA_POLICY_EXACT_LEN()
Change places that open-code NLA_POLICY_EXACT_LEN() to
use the macro instead, giving us flexibility in how we
handle the details of the macro.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 12:28:45 -07:00
Florian Westphal b3b2854dcf mptcp: sendmsg: reset iter on error redux
This fix wasn't correct: When this function is invoked from the
retransmission worker, the iterator contains garbage and resetting
it causes a crash.

As the work queue should not be performance critical also zero the
msghdr struct.

Fixes: 3575938313 "(mptcp: sendmsg: reset iter on error)"
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:11:37 -07:00
Florian Westphal 3575938313 mptcp: sendmsg: reset iter on error
Once we've copied data from the iterator we need to revert in case we
end up not sending any data.

This bug doesn't trigger with normal 'poll' based tests, because
we only feed a small chunk of data to kernel after poll indicated
POLLOUT.  With blocking IO and large writes this triggers. Receiver
ends up with less data than it should get.

Fixes: 72511aab95 ("mptcp: avoid blocking in tcp_sendpages")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-14 14:11:37 -07:00
Paolo Abeni 7ee2492635 mptcp: fix warn at shutdown time for unaccepted msk sockets
With commit b93df08ccd ("mptcp: explicitly track the fully
established status"), the status of unaccepted mptcp closed in
mptcp_sock_destruct() changes from TCP_SYN_RECV to TCP_ESTABLISHED.

As a result mptcp_sock_destruct() does not perform the proper
cleanup and inet_sock_destruct() will later emit a warn.

Address the issue updating the condition tested in mptcp_sock_destruct().
Also update the related comment.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/66
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Fixes: b93df08ccd ("mptcp: explicitly track the fully established status")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-07 17:26:16 -07:00
Linus Torvalds 47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Paolo Abeni adf7341064 mptcp: be careful on subflow creation
Nicolas reported the following oops:

[ 1521.392541] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 1521.394189] #PF: supervisor read access in kernel mode
[ 1521.395376] #PF: error_code(0x0000) - not-present page
[ 1521.396607] PGD 0 P4D 0
[ 1521.397156] Oops: 0000 [#1] SMP PTI
[ 1521.398020] CPU: 0 PID: 22986 Comm: kworker/0:2 Not tainted 5.8.0-rc4+ #109
[ 1521.399618] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1521.401728] Workqueue: events mptcp_worker
[ 1521.402651] RIP: 0010:mptcp_subflow_create_socket+0xf1/0x1c0
[ 1521.403954] Code: 24 08 89 44 24 04 48 8b 7a 18 e8 2a 48 d4 ff 8b 44 24 04 85 c0 75 7a 48 8b 8b 78 02 00 00 48 8b 54 24 08 48 8d bb 80 00 00 00 <48> 8b 89 c0 00 00 00 48 89 8a c0 00 00 00 48 8b 8b 78 02 00 00 8b
[ 1521.408201] RSP: 0000:ffffabc4002d3c60 EFLAGS: 00010246
[ 1521.409433] RAX: 0000000000000000 RBX: ffffa0b9ad8c9a00 RCX: 0000000000000000
[ 1521.411096] RDX: ffffa0b9ae78a300 RSI: 00000000fffffe01 RDI: ffffa0b9ad8c9a80
[ 1521.412734] RBP: ffffa0b9adff2e80 R08: ffffa0b9af02d640 R09: ffffa0b9ad923a00
[ 1521.414333] R10: ffffabc4007139f8 R11: fefefefefefefeff R12: ffffabc4002d3cb0
[ 1521.415918] R13: ffffa0b9ad91fa58 R14: ffffa0b9ad8c9f9c R15: 0000000000000000
[ 1521.417592] FS:  0000000000000000(0000) GS:ffffa0b9af000000(0000) knlGS:0000000000000000
[ 1521.419490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1521.420839] CR2: 00000000000000c0 CR3: 000000002951e006 CR4: 0000000000160ef0
[ 1521.422511] Call Trace:
[ 1521.423103]  __mptcp_subflow_connect+0x94/0x1f0
[ 1521.425376]  mptcp_pm_create_subflow_or_signal_addr+0x200/0x2a0
[ 1521.426736]  mptcp_worker+0x31b/0x390
[ 1521.431324]  process_one_work+0x1fc/0x3f0
[ 1521.432268]  worker_thread+0x2d/0x3b0
[ 1521.434197]  kthread+0x117/0x130
[ 1521.435783]  ret_from_fork+0x22/0x30

on some unconventional configuration.

The MPTCP protocol is trying to create a subflow for an
unaccepted server socket. That is allowed by the RFC, even
if subflow creation will likely fail.
Unaccepted sockets have still a NULL sk_socket field,
avoid the issue by failing earlier.

Reported-and-tested-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Fixes: 7d14b0d2b9 ("mptcp: set correct vfs info for subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:24:20 -07:00
Paolo Abeni 8555c6bfd5 mptcp: fix bogus sendmsg() return code under pressure
In case of memory pressure, mptcp_sendmsg() may call
sk_stream_wait_memory() after succesfully xmitting some
bytes. If the latter fails we currently return to the
user-space the error code, ignoring the succeful xmit.

Address the issue always checking for the xmitted bytes
before mptcp_sendmsg() completes.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:14:20 -07:00
Geliang Tang 190f8b060e mptcp: use mptcp_for_each_subflow in mptcp_stream_accept
Use mptcp_for_each_subflow in mptcp_stream_accept instead of
open-coding.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:01:14 -07:00
Linus Torvalds ab5c60b79a Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add support for allocating transforms on a specific NUMA Node
   - Introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY for storage users

  Algorithms:
   - Drop PMULL based ghash on arm64
   - Fixes for building with clang on x86
   - Add sha256 helper that does the digest in one go
   - Add SP800-56A rev 3 validation checks to dh

  Drivers:
   - Permit users to specify NUMA node in hisilicon/zip
   - Add support for i.MX6 in imx-rngc
   - Add sa2ul crypto driver
   - Add BA431 hwrng driver
   - Add Ingenic JZ4780 and X1000 hwrng driver
   - Spread IRQ affinity in inside-secure and marvell/cesa"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (157 commits)
  crypto: sa2ul - Fix inconsistent IS_ERR and PTR_ERR
  hwrng: core - remove redundant initialization of variable ret
  crypto: x86/curve25519 - Remove unused carry variables
  crypto: ingenic - Add hardware RNG for Ingenic JZ4780 and X1000
  dt-bindings: RNG: Add Ingenic RNG bindings.
  crypto: caam/qi2 - add module alias
  crypto: caam - add more RNG hw error codes
  crypto: caam/jr - remove incorrect reference to caam_jr_register()
  crypto: caam - silence .setkey in case of bad key length
  crypto: caam/qi2 - create ahash shared descriptors only once
  crypto: caam/qi2 - fix error reporting for caam_hash_alloc
  crypto: caam - remove deadcode on 32-bit platforms
  crypto: ccp - use generic power management
  crypto: xts - Replace memcpy() invocation with simple assignment
  crypto: marvell/cesa - irq balance
  crypto: inside-secure - irq balance
  crypto: ecc - SP800-56A rev 3 local public key validation
  crypto: dh - SP800-56A rev 3 local public key validation
  crypto: dh - check validity of Z before export
  lib/mpi: Add mpi_sub_ui()
  ...
2020-08-03 10:40:14 -07:00
David S. Miller bd0b33b248 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Resolved kernel/bpf/btf.c using instructions from merge commit
69138b34a7

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-02 01:02:12 -07:00
Florian Westphal 7126bd5c8b mptcp: fix syncookie build error on UP
kernel test robot says:
net/mptcp/syncookies.c: In function 'mptcp_join_cookie_init':
include/linux/kernel.h:47:38: warning: division by zero [-Wdiv-by-zero]
 #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))

I forgot that spinock_t size is 0 on UP, so ARRAY_SIZE cannot be used.

Fixes: 9466a1cceb ("mptcp: enable JOIN requests even if cookies are in use")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-01 11:52:55 -07:00
Florian Westphal 9466a1cceb mptcp: enable JOIN requests even if cookies are in use
JOIN requests do not work in syncookie mode -- for HMAC validation, the
peers nonce and the mptcp token (to obtain the desired connection socket
the join is for) are required, but this information is only present in the
initial syn.

So either we need to drop all JOIN requests once a listening socket enters
syncookie mode, or we need to store enough state to reconstruct the request
socket later.

This adds a state table (1024 entries) to store the data present in the
MP_JOIN syn request and the random nonce used for the cookie syn/ack.

When a MP_JOIN ACK passed cookie validation, the table is consulted
to rebuild the request socket from it.

An alternate approach would be to "cancel" syn-cookie mode and force
MP_JOIN to always use a syn queue entry.

However, doing so brings the backlog over the configured queue limit.

v2: use req->syncookie, not (removed) want_cookie arg

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Florian Westphal c83a47e50d mptcp: subflow: add mptcp_subflow_init_cookie_req helper
Will be used to initialize the mptcp request socket when a MP_CAPABLE
request was handled in syncookie mode, i.e. when a TCP ACK containing a
MP_CAPABLE option is a valid syncookie value.

Normally (non-cookie case), MPTCP will generate a unique 32 bit connection
ID and stores it in the MPTCP token storage to be able to retrieve the
mptcp socket for subflow joining.

In syncookie case, we do not want to store any state, so just generate the
unique ID and use it in the reply.

This means there is a small window where another connection could generate
the same token.

When Cookie ACK comes back, we check that the token has not been registered
in the mean time.  If it was, the connection needs to fall back to TCP.

Changes in v2:
 - use req->syncookie instead of passing 'want_cookie' arg to ->init_req()
   (Eric Dumazet)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Florian Westphal 08b8d08098 mptcp: rename and export mptcp_subflow_request_sock_ops
syncookie code path needs to create an mptcp request sock.

Prepare for this and add mptcp prefix plus needed export of ops struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Florian Westphal 78d8b7bc4b mptcp: subflow: split subflow_init_req
When syncookie support is added, we will need to add a variant of
subflow_init_req() helper.  It will do almost same thing except
that it will not compute/add a token to the mptcp token tree.

To avoid excess copy&paste, this commit splits away part of the
code into a new helper, __subflow_init_req, that can then be re-used
from the 'no insert' function added in a followup change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Florian Westphal 535fb8152f mptcp: token: move retry to caller
Once syncookie support is added, no state will be stored anymore when the
syn/ack is generated in syncookie mode.

When the ACK comes back, the generated key will be taken from the TCP ACK,
the token is re-generated and inserted into the token tree.

This means we can't retry with a new key when the token is already taken
in the syncookie case.

Therefore, move the retry logic to the caller to prepare for syncookie
support in mptcp.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:55:32 -07:00
Mat Martineau 721e908990 mptcp: Safely store sequence number when sending data
The MPTCP socket's write_seq member can be read without the msk lock
held, so use WRITE_ONCE() to store it.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau c75293925f mptcp: Safely read sequence number when lock isn't held
The MPTCP socket's write_seq member should be read with READ_ONCE() when
the msk lock is not held.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 06827b348b mptcp: Skip unnecessary skb extension allocation for bare acks
Bare TCP ack skbs are freed right after MPTCP sees them, so the work to
allocate, zero, and populate the MPTCP skb extension is wasted. Detect
these skbs and do not add skb extensions to them.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 067a0b3dc5 mptcp: Only use subflow EOF signaling on fallback connections
The MPTCP state machine handles disconnections on non-fallback connections,
but the mptcp_sock still needs to get notified when fallback subflows
disconnect.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 43b54c6ee3 mptcp: Use full MPTCP-level disconnect state machine
RFC 8684 appendix D describes the connection state machine for
MPTCP. This patch implements the DATA_FIN / DATA_ACK exchanges and
MPTCP-level socket state changes described in that appendix, rather than
simply sending DATA_FIN along with TCP FIN when disconnecting subflows.

DATA_FIN is now sent and acknowledged before shutting down the
subflows. Received DATA_FIN information (if not part of a data packet)
is written to the MPTCP socket when the incoming DSS option is parsed by
the subflow, and the MPTCP worker is scheduled to process the
flag. DATA_FIN received as part of a full DSS mapping will be handled
when the mapping is processed.

The DATA_FIN is acknowledged by the worker if the reader is caught
up. If there is still data to be moved to the MPTCP-level queue, ack_seq
will be incremented to account for the DATA_FIN when it reaches the end
of the stream and a DATA_ACK will be sent to the peer.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 16a9a9da17 mptcp: Add helper to process acks of DATA_FIN
After DATA_FIN has been sent, the peer will acknowledge it. An ack of
the relevant MPTCP-level sequence number will update the MPTCP
connection state appropriately.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 6920b85158 mptcp: Add mptcp_close_state() helper
This will be used to transition to the appropriate state on close and
determine if a DATA_FIN needs to be sent for that state transition.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 3721b9b646 mptcp: Track received DATA_FIN sequence number and add related helpers
Incoming DATA_FIN headers need to propagate the presence of the DATA_FIN
bit and the associated sequence number to the MPTCP layer, even when
arriving on a bare ACK that does not get added to the receive queue. Add
structure members to store the DATA_FIN information and helpers to set
and check those values.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:42 -07:00
Mat Martineau 7279da6145 mptcp: Use MPTCP-level flag for sending DATA_FIN
Since DATA_FIN information is the same for every subflow, store it only
in the mptcp_sock.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:41 -07:00
Mat Martineau 242e63f651 mptcp: Remove outdated and incorrect comment
mptcp_close() acquires the msk lock, so it clearly should not be held
before the function is called.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:41 -07:00
Mat Martineau 57baaf2875 mptcp: Return EPIPE if sending is shut down during a sendmsg
A MPTCP socket where sending has been shut down should not attempt to
send additional data, since DATA_FIN has already been sent.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:41 -07:00
Mat Martineau 0bac966a1f mptcp: Allow DATA_FIN in headers without TCP FIN
RFC 8684-compliant DATA_FIN needs to be sent and ack'd before subflows
are closed with TCP FIN, so write DATA_FIN DSS headers whenever their
transmission has been enabled by the MPTCP connection-level socket.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-28 17:02:41 -07:00
Matthieu Baerts 367fe04eb6 mptcp: fix joined subflows with unblocking sk
Unblocking sockets used for outgoing connections were not containing
inet info about the initial connection due to a typo there: the value of
"err" variable is negative in the kernelspace.

This fixes the creation of additional subflows where the remote port has
to be reused if the other host didn't announce another one. This also
fixes inet_diag showing blank info about MPTCP sockets from unblocking
sockets doing a connect().

Fixes: 41be81a8d3 ("mptcp: fix unblocking connect()")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-27 11:50:57 -07:00
Christoph Hellwig a7b75c5a8c net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer.  This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154]
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:54 -07:00
Christoph Hellwig c8c1bbb6eb net: switch sock_set_timeout to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:53 -07:00
Paolo Abeni 4cf8b7e48a subflow: introduce and use mptcp_can_accept_new_subflow()
So that we can easily perform some basic PM-related
adimission checks before creating the child socket.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:25 -07:00
Paolo Abeni 97e617518c subflow: use rsk_ops->send_reset()
tcp_send_active_reset() is more prone to transient errors
(memory allocation or xmit queue full): in stress conditions
the kernel may drop the egress packet, and the client will be
stuck.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:25 -07:00
Paolo Abeni b7514694ed subflow: explicitly check for plain tcp rsk
When syncookie are in use, the TCP stack may feed into
subflow_syn_recv_sock() plain TCP request sockets. We can't
access mptcp_subflow_request_sock-specific fields on such
sockets. Explicitly check the rsk ops to do safe accesses.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:25 -07:00
Paolo Abeni fa25e815d9 mptcp: cleanup subflow_finish_connect()
The mentioned function has several unneeded branches,
handle each case - MP_CAPABLE, MP_JOIN, fallback -
under a single conditional and drop quite a bit of
duplicate code.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:24 -07:00
Paolo Abeni b93df08ccd mptcp: explicitly track the fully established status
Currently accepted msk sockets become established only after
accept() returns the new sk to user-space.

As MP_JOIN request are refused as per RFC spec on non fully
established socket, the above causes mp_join self-tests
instabilities.

This change lets the msk entering the established status
as soon as it receives the 3rd ack and propagates the first
subflow fully established status on the msk socket.

Finally we can change the subflow acceptance condition to
take in account both the sock state and the msk fully
established flag.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:24 -07:00
Paolo Abeni 0235d075a5 mptcp: mark as fallback even early ones
In the unlikely event of a failure at connect time,
we currently clear the request_mptcp flag - so that
the MPC handshake is not started at all, but the msk
is not explicitly marked as fallback.

This would lead to later insertion of wrong DSS options
in the xmitted packets, in violation of RFC specs and
possibly fooling the peer.

Fixes: e1ff9e82e2 ("net: mptcp: improve fallback to TCP")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:24 -07:00
Paolo Abeni 53eb4c383d mptcp: avoid data corruption on reinsert
When updating a partially acked data fragment, we
actually corrupt it. This is irrelevant till we send
data on a single subflow, as retransmitted data, if
any are discarded by the peer as duplicate, but it
will cause data corruption as soon as we will start
creating non backup subflows.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:24 -07:00
Paolo Abeni b0977bb268 subflow: always init 'rel_write_seq'
Currently we do not init the subflow write sequence for
MP_JOIN subflows. This will cause bad mapping being
generated as soon as we will use non backup subflow.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 11:47:24 -07:00
Paolo Abeni 6ab301c98f mptcp: zero token hash at creation time.
Otherwise the 'chain_len' filed will carry random values,
some token creation calls will fail due to excessive chain
length, causing unexpected fallback to TCP.

Fixes: 2c5ebd001d ("mptcp: refactor token container")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22 17:57:37 -07:00
Florian Westphal c1d069e3bf mptcp: move helper to where its used
Only used in token.c.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-21 16:22:18 -07:00
Christoph Hellwig 8c918ffbba net: remove compat_sock_common_{get,set}sockopt
Add the compat handling to sock_common_{get,set}sockopt instead,
keyed of in_compat_syscall().  This allow to remove the now unused
->compat_{get,set}sockopt methods from struct proto_ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:40 -07:00
Davide Caratti 8c72894048 mptcp: silence warning in subflow_data_ready()
since commit d47a721520 ("mptcp: fix race in subflow_data_ready()"), it
is possible to observe a regression in MP_JOIN kselftests. For sockets in
TCP_CLOSE state, it's not sufficient to just wake up the main socket: we
also need to ensure that received data are made available to the reader.
Silence the WARN_ON_ONCE() in these cases: it preserves the syzkaller fix
and restores kselftests	when they are ran as follows:

  # while true; do
  > make KBUILD_OUTPUT=/tmp/kselftest TARGETS=net/mptcp kselftest
  > done

Reported-by: Florian Westphal <fw@strlen.de>
Fixes: d47a721520 ("mptcp: fix race in subflow_data_ready()")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/47
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-17 12:47:00 -07:00
Eric Biggers 5a7a0d9400 mptcp: use sha256() instead of open coding
Now that there's a function that calculates the SHA-256 digest of a
buffer in one step, use it instead of sha256_init() + sha256_update() +
sha256_final().

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: mptcp@lists.01.org
Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
Cc: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-16 21:49:05 +10:00
David S. Miller 71930d6102 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
All conflicts seemed rather trivial, with some guidance from
Saeed Mameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11 00:46:00 -07:00
Paolo Abeni ac3b45f609 mptcp: add MPTCP socket diag interface
exposes basic inet socket attribute, plus some MPTCP socket
fields comprising PM status and MPTCP-level sequence numbers.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09 12:38:41 -07:00
Paolo Abeni 96d890daad mptcp: add msk interations helper
mptcp_token_iter_next() allow traversing all the MPTCP
sockets inside the token container belonging to the given
network namespace with a quite standard iterator semantic.

That will be used by the next patch, but keep the API generic,
as we plan to use this later for PM's sake.

Additionally export mptcp_token_get_sock(), as it also
will be used by the diag module.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09 12:38:41 -07:00
Paolo Abeni 9c29e36152 mptcp: fix DSS map generation on fin retransmission
The RFC 8684 mandates that no-data DATA FIN packets should carry
a DSS with 0 sequence number and data len equal to 1. Currently,
on FIN retransmission we re-use the existing mapping; if the previous
fin transmission was part of a partially acked data packet, we could
end-up writing in the egress packet a non-compliant DSS.

The above will be detected by a "Bad mapping" warning on the receiver
side.

This change addresses the issue explicitly checking for 0 len packet
when adding the DATA_FIN option.

Fixes: 6d0060f600 ("mptcp: Write MPTCP DSS headers to outgoing data packets")
Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 15:27:37 -07:00
Florian Westphal b416268b7a mptcp: use mptcp worker for path management
We can re-use the existing work queue to handle path management
instead of a dedicated work queue.  Just move pm_worker to protocol.c,
call it from the mptcp worker and get rid of the msk lock (already held).

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 13:02:13 -07:00
Davide Caratti d47a721520 mptcp: fix race in subflow_data_ready()
syzkaller was able to make the kernel reach subflow_data_ready() for a
server subflow that was closed before subflow_finish_connect() completed.
In these cases we can avoid using the path for regular/fallback MPTCP
data, and just wake the main socket, to avoid the following warning:

 WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885
 subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
 Kernel panic - not syncing: panic_on_warn set ...
 CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
 rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0xb7/0xfe lib/dump_stack.c:118
  panic+0x29e/0x692 kernel/panic.c:221
  __warn.cold+0x2f/0x3d kernel/panic.c:582
  report_bug+0x28b/0x2f0 lib/bug.c:195
  fixup_bug arch/x86/kernel/traps.c:105 [inline]
  fixup_bug arch/x86/kernel/traps.c:100 [inline]
  do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197
  do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216
  invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027
 RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885
 Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18
 89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9
 59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff
 RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206
 RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609
 RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001
 RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af
 R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336
 R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570
  tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841
  tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642
  tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999
  ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204
  ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
  NF_HOOK include/linux/netfilter.h:421 [inline]
  ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252
  dst_input include/net/dst.h:441 [inline]
  ip_rcv_finish net/ipv4/ip_input.c:428 [inline]
  ip_rcv_finish net/ipv4/ip_input.c:414 [inline]
  NF_HOOK include/linux/netfilter.h:421 [inline]
  ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539
  __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268
  __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382
  process_backlog+0x1e5/0x6d0 net/core/dev.c:6226
  napi_poll net/core/dev.c:6671 [inline]
  net_rx_action+0x3e3/0xd70 net/core/dev.c:6739
  __do_softirq+0x18c/0x634 kernel/softirq.c:292
  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
  </IRQ>
  do_softirq.part.0+0x26/0x30 kernel/softirq.c:337
  do_softirq arch/x86/include/asm/preempt.h:26 [inline]
  __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189
  local_bh_enable include/linux/bottom_half.h:32 [inline]
  rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline]
  ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229
  __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306
  dst_output include/net/dst.h:435 [inline]
  ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125
  __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530
  __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238
  __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785
  __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline]
  tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791
  tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline]
  tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209
  tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651
  sk_backlog_rcv include/net/sock.h:996 [inline]
  __release_sock+0x1ad/0x310 net/core/sock.c:2548
  release_sock+0x54/0x1a0 net/core/sock.c:3064
  inet_wait_for_connect net/ipv4/af_inet.c:594 [inline]
  __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686
  inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725
  mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920
  __sys_connect_file net/socket.c:1854 [inline]
  __sys_connect+0x267/0x2f0 net/socket.c:1871
  __do_sys_connect net/socket.c:1882 [inline]
  __se_sys_connect net/socket.c:1879 [inline]
  __x64_sys_connect+0x6f/0xb0 net/socket.c:1879
  do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fb577d06469
 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89
 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
 RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
 RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469
 RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003
 RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39
Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: e1ff9e82e2 ("net: mptcp: improve fallback to TCP")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06 13:31:12 -07:00
Florian Westphal c9b95a1359 mptcp: support IPV6_V6ONLY setsockopt
Without this, Opensshd fails to open an ipv6 socket listening
socket:
  error: setsockopt IPV6_V6ONLY: Operation not supported
  error: Bind to port 22 on :: failed: Address already in use.

Opensshd opens an ipv4 and and ipv6 listening socket, but because
IPV6_V6ONLY setsockopt fails, the port number is already in use.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
Florian Westphal fd1452d8ef mptcp: add REUSEADDR/REUSEPORT support
This will e.g. make 'sshd restart' work when MPTCP is used, as we will
now set this option on the listener socket instead of only the mptcp
socket (where it has no effect).

We still need to copy the setting to the master socket so that a
subsequent getsockopt() returns the expected value.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
Florian Westphal 83f0c10bc3 net: use mptcp setsockopt function for SOL_SOCKET on mptcp sockets
setsockopt(mptcp_fd, SOL_SOCKET, ...)...  appears to work (returns 0),
but it has no effect -- this is because the MPTCP layer never has a
chance to copy the settings to the subflow socket.

Skip the generic handling for the mptcp case and instead call the
mptcp specific handler instead for SOL_SOCKET too.

Next patch adds more specific handling for SOL_SOCKET to mptcp.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:56:22 -07:00
Florian Westphal a6b118febb mptcp: add receive buffer auto-tuning
When mptcp is used, userspace doesn't read from the tcp (subflow)
socket but from the parent (mptcp) socket receive queue.

skbs are moved from the subflow socket to the mptcp rx queue either from
'data_ready' callback (if mptcp socket can be locked), a work queue, or
the socket receive function.

This means tcp_rcv_space_adjust() is never called and thus no receive
buffer size auto-tuning is done.

An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the
function that moves skbs from subflow to mptcp socket.
While this enabled autotuning, it also meant tuning was done even if
userspace was reading the mptcp socket very slowly.

This adds mptcp_rcv_space_adjust() and calls it after userspace has
read data from the mptcp socket rx queue.

Its very similar to tcp_rcv_space_adjust, with two differences:

1. The rtt estimate is the largest one observed on a subflow
2. The rcvbuf size and window clamp of all subflows is adjusted
   to the mptcp-level rcvbuf.

Otherwise, we get spurious drops at tcp (subflow) socket level if
the skbs are not moved to the mptcp socket fast enough.

Before:
time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap
[..]
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration 40823ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration 23119ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5421ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration 41446ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration 23427ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5426ms) [ OK ]
Time: 1396 seconds

After:
ns4 MPTCP -> ns3 (10.0.3.2:10108      ) MPTCP   (duration  5417ms) [ OK ]
ns4 MPTCP -> ns3 (10.0.3.2:10109      ) TCP     (duration  5427ms) [ OK ]
ns4 TCP   -> ns3 (10.0.3.2:10110      ) MPTCP   (duration  5422ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP   (duration  5415ms) [ OK ]
ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP     (duration  5422ms) [ OK ]
ns4 TCP   -> ns3 (dead:beef:3::2:10113) MPTCP   (duration  5423ms) [ OK ]
Time: 296 seconds

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01 17:47:55 -07:00
Paolo Abeni 6bad912b7e mptcp: do nonce initialization at subflow creation time
This clean-up the code a bit, reduces the number of
used hooks and indirect call requested, and allow
better error reporting from __mptcp_subflow_connect()

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 13:38:00 -07:00
Paolo Abeni 8a05661b2b mptcp: close poll() races
mptcp_poll always return POLLOUT for unblocking
connect(), ensure that the socket is a suitable
state.
The MPTCP_DATA_READY bit is never cleared on accept:
ensure we don't leave mptcp_accept() with an empty
accept queue and such bit set.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni 76660afbb7 mptcp: __mptcp_tcp_fallback() returns a struct sock
Currently __mptcp_tcp_fallback() always return NULL
on incoming connections, because MPTCP does not create
the additional socket for the first subflow.
Since the previous commit no __mptcp_tcp_fallback()
caller needs a struct socket, so let __mptcp_tcp_fallback()
return the first subflow sock and cope correctly even with
incoming connections.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni fa68018dc4 mptcp: create first subflow at msk creation time
This cleans the code a bit and makes the behavior more consistent.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni d2f77c5334 mptcp: check for plain TCP sock at accept time
This cleanup the code a bit and avoid corrupted states
on weird syscall sequence (accept(), connect()).

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Davide Caratti 8fd738049a mptcp: fallback in case of simultaneous connect
when a MPTCP client tries to connect to itself, tcp_finish_connect() is
never reached. Because of this, depending on the socket current state,
multiple faulty behaviours can be observed:

1) a WARN_ON() in subflow_data_ready() is hit
 WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
 [...]
 CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
 [...]
 RIP: 0010:subflow_data_ready+0x18b/0x230
 [...]
 Call Trace:
  tcp_data_queue+0xd2f/0x4250
  tcp_rcv_state_process+0xb1c/0x49d3
  tcp_v4_do_rcv+0x2bc/0x790
  __release_sock+0x153/0x2d0
  release_sock+0x4f/0x170
  mptcp_shutdown+0x167/0x4e0
  __sys_shutdown+0xe6/0x180
  __x64_sys_shutdown+0x50/0x70
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

2) client is stuck forever in mptcp_sendmsg() because the socket is not
   TCP_ESTABLISHED

 crash> bt 4847
 PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
  #0 [ffff8881376ff680] __schedule at ffffffff97248da4
  #1 [ffff8881376ff778] schedule at ffffffff9724a34f
  #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
  #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
  #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
  #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
  #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
  #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
  #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
  #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
 #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
 #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
     RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
     RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
     RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
     RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
     R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
     R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
   didn't complete.

 $ tcpdump -tnnr bad.pcap
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0

force a fallback to TCP in these cases, and adjust the main socket
state to avoid hanging in mptcp_sendmsg().

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Davide Caratti e1ff9e82e2 net: mptcp: improve fallback to TCP
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback
to regular TCP. When fallback is triggered, skip addition of the MPTCP
option on send.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:29:38 -07:00
Paolo Abeni a8ee9c9b58 mptcp: introduce token KUNIT self-tests
Unit tests for the internal MPTCP token APIs, using KUNIT

v1 -> v2:
 - use the correct RCU annotation when initializing icsk ulp
 - fix a few checkpatch issues

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni a00a582203 mptcp: move crypto test to KUNIT
currently MPTCP uses a custom hook to executed unit tests at
boot time. Let's use the KUNIT framework instead.
Additionally move the relevant code to a separate file and
export the function needed by the test when self-tests
are build as a module.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni 2c5ebd001d mptcp: refactor token container
Replace the radix tree with a hash table allocated
at boot time. The radix tree has some shortcoming:
a single lock is contented by all the mptcp operation,
the lookup currently use such lock, and traversing
all the items would require a lock, too.

With hash table instead we trade a little memory to
address all the above - a per bucket lock is used.

To hash the MPTCP sockets, we re-use the msk' sk_node
entry: the MPTCP sockets are never hashed by the stack.
Replace the existing hash proto callbacks with a dummy
implementation, annotating the above constraint.

Additionally refactor the token creation to code to:

- limit the number of consecutive attempts to a fixed
maximum. Hitting a hash bucket with a long chain is
considered a failed attempt

- accept() no longer can fail to token management.

- if token creation fails at connect() time, we do
fallback to TCP (before the connection was closed)

v1 -> v2:
 - fix "no newline at end of file" - Jakub

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
Paolo Abeni d39dceca38 mptcp: add __init annotation on setup functions
Add the missing annotation in some setup-only
functions.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26 16:21:39 -07:00
David S. Miller 7bed145516 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in xfrm_device.c, between the double
ESP trailing bug fix setting the XFRM_INIT flag and the changes
in net-next preparing for bonding encryption support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25 19:29:51 -07:00
Eric Dumazet 9b9e2f250e tcp: move ipv4_specific to tcp include file
Declare ipv4_specific once, in tcp.h were it belongs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-23 20:10:15 -07:00
Eric Dumazet b03d2142be tcp: move ipv6_specific declaration to remove a warning
ipv6_specific should be declared in tcp include files,
not mptcp.

This removes the following warning :
  CHECK   net/ipv6/tcp_ipv6.c
net/ipv6/tcp_ipv6.c:78:42: warning: symbol 'ipv6_specific' was not declared. Should it be static?

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-23 20:10:15 -07:00
Geliang Tang b562f58bbc mptcp: drop sndr_key in mptcp_syn_options
In RFC 8684, we don't need to send sndr_key in SYN package anymore, so drop
it.

Fixes: cc7972ea19 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22 21:06:39 -07:00
Paolo Abeni 9e365ff576 mptcp: drop MP_JOIN request sock on syn cookies
Currently any MPTCP socket using syn cookies will fallback to
TCP at 3rd ack time. In case of MP_JOIN requests, the RFC mandate
closing the child and sockets, but the existing error paths
do not handle the syncookie scenario correctly.

Address the issue always forcing the child shutdown in case of
MP_JOIN fallback.

Fixes: ae2dd71649 ("mptcp: handle tcp fallback when using syn cookies")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18 20:25:51 -07:00
Paolo Abeni 8fd4de1275 mptcp: cache msk on MP_JOIN init_req
The msk ownership is transferred to the child socket at
3rd ack time, so that we avoid more lookups later. If the
request does not reach the 3rd ack, the MSK reference is
dropped at request sock release time.

As a side effect, fallback is now tracked by a NULL msk
reference instead of zeroed 'mp_join' field. This will
simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18 20:25:51 -07:00
Wei Yongjun b8ad540dd4 mptcp: fix memory leak in mptcp_subflow_create_socket()
socket malloced  by sock_create_kern() should be release before return
in the error handling, otherwise it cause memory leak.

unreferenced object 0xffff88810910c000 (size 1216):
  comm "00000003_test_m", pid 12238, jiffies 4295050289 (age 54.237s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff  ........./0.....
  backtrace:
    [<00000000e877f89f>] sock_alloc_inode+0x18/0x1c0
    [<0000000093d1dd51>] alloc_inode+0x63/0x1d0
    [<000000005673fec6>] new_inode_pseudo+0x14/0xe0
    [<00000000b5db6be8>] sock_alloc+0x3c/0x260
    [<00000000e7e3cbb2>] __sock_create+0x89/0x620
    [<0000000023e48593>] mptcp_subflow_create_socket+0xc0/0x5e0
    [<00000000419795e4>] __mptcp_socket_create+0x1ad/0x3f0
    [<00000000b2f942e8>] mptcp_stream_connect+0x281/0x4f0
    [<00000000c80cd5cc>] __sys_connect_file+0x14d/0x190
    [<00000000dc761f11>] __sys_connect+0x128/0x160
    [<000000008b14e764>] __x64_sys_connect+0x6f/0xb0
    [<000000007b4f93bd>] do_syscall_64+0xa1/0x530
    [<00000000d3e770b6>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15 18:08:50 -07:00
Geliang Tang a386bc5b21 mptcp: use list_first_entry_or_null
Use list_first_entry_or_null to simplify the code.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15 13:04:53 -07:00
Geliang Tang c06c1f87b6 mptcp: drop MPTCP_PM_MAX_ADDR
We have defined MPTCP_PM_ADDR_MAX in pm_netlink.c, so drop this duplicate macro.

Fixes: 1b1c7a0ef7 ("mptcp: Add path manager interface")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15 13:01:17 -07:00
Paolo Abeni 4b5af44129 mptcp: don't leak msk in token container
If a listening MPTCP socket has unaccepted sockets at close
time, the related msks are freed via mptcp_sock_destruct(),
which in turn does not invoke the proto->destroy() method
nor the mptcp_token_destroy() function.

Due to the above, the child msk socket is not removed from
the token container, leading to later UaF.

Address the issue explicitly removing the token even in the
above error path.

Fixes: 79c0949e9a ("mptcp: Add key generation and token tree")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-10 16:07:00 -07:00
Paolo Abeni 5969856ae8 mptcp: fix races between shutdown and recvmsg
The msk sk_shutdown flag is set by a workqueue, possibly
introducing some delay in user-space notification. If the last
subflow carries some data with the fin packet, the user space
can wake-up before RCV_SHUTDOWN is set. If it executes unblocking
recvmsg(), it may return with an error instead of eof.

Address the issue explicitly checking for eof in recvmsg(), when
no data is found.

Fixes: 59832e2465 ("mptcp: subflow: check parent mptcp socket on subflow state change")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-10 13:34:14 -07:00
Geliang Tang 8e60eed6b3 mptcp: bugfix for RM_ADDR option parsing
In MPTCPOPT_RM_ADDR option parsing, the pointer "ptr" pointed to the
"Subtype" octet, the pointer "ptr+1" pointed to the "Address ID" octet:

  +-------+-------+---------------+
  |Subtype|(resvd)|   Address ID  |
  +-------+-------+---------------+
  |               |
 ptr            ptr+1

We should set mp_opt->rm_id to the value of "ptr+1", not "ptr". This patch
will fix this bug.

Fixes: 3df523ab58 ("mptcp: Add ADD_ADDR handling")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-08 19:09:41 -07:00
Linus Torvalds cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds 81e8c10dac Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Introduce crypto_shash_tfm_digest() and use it wherever possible.
   - Fix use-after-free and race in crypto_spawn_alg.
   - Add support for parallel and batch requests to crypto_engine.

  Algorithms:
   - Update jitter RNG for SP800-90B compliance.
   - Always use jitter RNG as seed in drbg.

  Drivers:
   - Add Arm CryptoCell driver cctrng.
   - Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
  crypto: hisilicon - fix driver compatibility issue with different versions of devices
  crypto: engine - do not requeue in case of fatal error
  crypto: cavium/nitrox - Fix a typo in a comment
  crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
  crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
  crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
  crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
  crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
  crypto: hisilicon/qm - add debugfs to the QM state machine
  crypto: hisilicon/qm - add debugfs for QM
  crypto: stm32/crc32 - protect from concurrent accesses
  crypto: stm32/crc32 - don't sleep in runtime pm
  crypto: stm32/crc32 - fix multi-instance
  crypto: stm32/crc32 - fix run-time self test issue.
  crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
  crypto: hisilicon/zip - Use temporary sqe when doing work
  crypto: hisilicon - add device error report through abnormal irq
  crypto: hisilicon - remove codes of directly report device errors through MSI
  crypto: hisilicon - QM memory management optimization
  crypto: hisilicon - unify initial value assignment into QM
  ...
2020-06-01 12:00:10 -07:00
David S. Miller 1806c13dc2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 17:48:46 -07:00
Paolo Abeni 39884604b1 mptcp: fix NULL ptr dereference in MP_JOIN error path
When token lookup on MP_JOIN 3rd ack fails, the server
socket closes with a reset the incoming child. Such socket
has the 'is_mptcp' flag set, but no msk socket associated
- due to the failed lookup.

While crafting the reset packet mptcp_established_options_mp()
will try to dereference the child's master socket, causing
a NULL ptr dereference.

This change addresses the issue with explicit fallback to
TCP in such error path.

Fixes: 729cd6436f ("mptcp: cope better with MP_JOIN failure")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:54:03 -07:00
Paolo Abeni c5c79763fa mptcp: remove msk from the token container at destruction time.
Currently we remote the msk from the token container only
via mptcp_close(). The MPTCP master socket can be destroyed
also via other paths (e.g. if not yet accepted, when shutting
down the listener socket). When we hit the latter scenario,
dangling msk references are left into the token container,
leading to memory corruption and/or UaF.

This change addresses the issue by moving the token removal
into the msk destructor.

Fixes: 79c0949e9a ("mptcp: Add key generation and token tree")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Paolo Abeni 10f6d46c94 mptcp: fix race between MP_JOIN and close
If a MP_JOIN subflow completes the 3whs while another
CPU is closing the master msk, we can hit the
following race:

CPU1                                    CPU2

close()
 mptcp_close
                                        subflow_syn_recv_sock
                                         mptcp_token_get_sock
                                         mptcp_finish_join
                                          inet_sk_state_load
  mptcp_token_destroy
  inet_sk_state_store(TCP_CLOSE)
  __mptcp_flush_join_list()
                                          mptcp_sock_graft
                                          list_add_tail
  sk_common_release
   sock_orphan()
 <socket free>

The MP_JOIN socket will be leaked. Additionally we can hit
UaF for the msk 'struct socket' referenced via the 'conn'
field.

This change try to address the issue introducing some
synchronization between the MP_JOIN 3whs and mptcp_close
via the join_list spinlock. If we detect the msk is closing
the MP_JOIN socket is closed, too.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Paolo Abeni 41be81a8d3 mptcp: fix unblocking connect()
Currently unblocking connect() on MPTCP sockets fails frequently.
If mptcp_stream_connect() is invoked to complete a previously
attempted unblocking connection, it will still try to create
the first subflow via __mptcp_socket_create(). If the 3whs is
completed and the 'can_ack' flag is already set, the latter
will fail with -EINVAL.

This change addresses the issue checking for pending connect and
delegating the completion to the first subflow. Additionally
do msk addresses and sk_state changes only when needed.

Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Florian Westphal 4e637c70b5 mptcp: attempt coalescing when moving skbs to mptcp rx queue
We can try to coalesce skbs we take from the subflows rx queue with the
tail of the mptcp rx queue.

If successful, the skb head can be discarded early.

We can also free the skb extensions, we do not access them after this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:29:32 -07:00
Paolo Abeni 0a82e230c6 mptcp: avoid NULL-ptr derefence on fallback
In the MPTCP receive path we must cope with TCP fallback
on blocking recvmsg(). Currently in such code path we detect
the fallback condition, but we don't fetch the struct socket
required for fallback.

The above allowed syzkaller to trigger a NULL pointer
dereference:

general protection fault, probably for non-canonical address 0xdffffc0000000004: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
CPU: 1 PID: 7226 Comm: syz-executor523 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:sock_recvmsg_nosec net/socket.c:886 [inline]
RIP: 0010:sock_recvmsg+0x92/0x110 net/socket.c:904
Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 6c 24 04 e8 53 18 1d fb 4d 8d 6f 20 4c 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 89 ef e8 20 12 5b fb bd a0 00 00 00 49 03 6d
RSP: 0018:ffffc90001077b98 EFLAGS: 00010202
RAX: 0000000000000004 RBX: ffffc90001077dc0 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff86565e59 R09: ffffed10115afeaa
R10: ffffed10115afeaa R11: 0000000000000000 R12: 1ffff9200020efbc
R13: 0000000000000020 R14: ffffc90001077de0 R15: 0000000000000000
FS:  00007fc6a3abe700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004d0050 CR3: 00000000969f0000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 mptcp_recvmsg+0x18d5/0x19b0 net/mptcp/protocol.c:891
 inet_recvmsg+0xf6/0x1d0 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:886 [inline]
 sock_recvmsg net/socket.c:904 [inline]
 __sys_recvfrom+0x2f3/0x470 net/socket.c:2057
 __do_sys_recvfrom net/socket.c:2075 [inline]
 __se_sys_recvfrom net/socket.c:2071 [inline]
 __x64_sys_recvfrom+0xda/0xf0 net/socket.c:2071
 do_syscall_64+0xf3/0x1b0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Address the issue initializing the struct socket reference
before entering the fallback code.

Reported-and-tested-by: syzbot+c6bfc3db991edc918432@syzkaller.appspotmail.com
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: 8ab183deb2 ("mptcp: cope with later TCP fallback")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:18:24 -07:00
David S. Miller 13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Todd Malsbary bd6972226f mptcp: use untruncated hash in ADD_ADDR HMAC
There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is
the rightmost 64 bits of the entire hash or of the leftmost 160 bits
of the hash.  The intention, as clarified with the author of the RFC,
is the entire hash.

This change returns the entire hash from
mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves
any truncation/selection operation on the hash to the caller.

Fixes: 12555a2d97 ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC")
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:21:24 -07:00
Todd Malsbary 12555a2d97 mptcp: use rightmost 64 bits in ADD_ADDR HMAC
This changes the HMAC used in the ADD_ADDR option from the leftmost 64
bits to the rightmost 64 bits as described in RFC 8684, section 3.4.1.

This issue was discovered while adding support to packetdrill for the
ADD_ADDR v1 option.

Fixes: 3df523ab58 ("mptcp: Add ADD_ADDR handling")
Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:35:51 -07:00
Christoph Hellwig 3986912f6a ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl
To prepare removing the global routing_ioctl hack start lifting the code
into a newly added ipv6 ->compat_ioctl handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Florian Westphal 4930f4831b net: allow __skb_ext_alloc to sleep
mptcp calls this from the transmit side, from process context.
Allow a sleeping allocation instead of unconditional GFP_ATOMIC.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal 5c8264435d mptcp: remove inner wait loop from mptcp_sendmsg_frag
previous patches made sure we only call into this function
when these prerequisites are met, so no need to wait on the
subflow socket anymore.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/7
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal 17091708d1 mptcp: fill skb page frag cache outside of mptcp_sendmsg_frag
The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested.  mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is another preparation patch that makes sure we call
mptcp_sendmsg_frag only if the page frag cache has been refilled.

Followup patch will remove the wait loop from mptcp_sendmsg_frag().

The retransmit worker doesn't need to do this refill as it won't
transmit new mptcp-level data.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal 149f7c71e2 mptcp: fill skb extension cache outside of mptcp_sendmsg_frag
The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested.  mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is a preparation patch that makes sure we call
mptcp_sendmsg_frag only if a skb extension has been allocated.

Moreover, such allocation currently uses GFP_ATOMIC while it
could use sleeping allocation instead.

Followup patches will remove the wait loop from mptcp_sendmsg_frag()
and will allow to do a sleeping allocation for the extension.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal 72511aab95 mptcp: avoid blocking in tcp_sendpages
The transmit loop continues to xmit new data until an error is returned
or all data was transmitted.

For the blocking i/o case, this means that tcp_sendpages() may block on
the subflow until more space becomes available, i.e. we end up sleeping
with the mptcp socket lock held.

Instead we should check if a different subflow is ready to be used.

This restarts the subflow sk lookup when the tx operation succeeded
and the tcp subflow can't accept more data or if tcp_sendpages
indicates -EAGAIN on a blocking mptcp socket.

In that case we also need to set the NOSPACE bit to make sure we get
notified once memory becomes available.

In case all subflows are busy, the existing logic will wait until a
subflow is ready, releasing the mptcp socket lock while doing so.

The mptcp worker already sets DONTWAIT, so no need to make changes there.

v2:
 * set NOSPACE bit
 * add a comment to clarify that mptcp-sk sndbuf limits need to
   be checked as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal fb529e62d3 mptcp: break and restart in case mptcp sndbuf is full
Its not enough to check for available tcp send space.

We also hold on to transmitted data for mptcp-level retransmits.
Right now we will send more and more data if the peer can ack data
at the tcp level fast enough, since that frees up tcp send buffer space.

But we also need to check that data was acked and reclaimed at the mptcp
level.

Therefore add needed check in mptcp_sendmsg, flush tcp data and
wait until more mptcp snd space becomes available if we are over the
limit.  Before we wait for more data, also make sure we start the
retransmit timer if we ran out of sndbuf space.

Otherwise there is a very small chance that we wait forever:

 * receiver is waiting for data
 * sender is blocked because mptcp socket buffer is full
 * at tcp level, all data was acked
 * mptcp-level snd_una was not updated, because last ack
   that acknowledged the last data packet carried an older
   MPTCP-ack.

Restarting the retransmit timer avoids this problem: if TCP
subflow is idle, data is retransmitted from the RTX queue.

New data will make the peer send a new, updated MPTCP-Ack.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal a0e17064d4 mptcp: move common nospace-pattern to a helper
Paolo noticed that ssk_check_wmem() has same pattern, so add/use
common helper for both places.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Christoph Paasch a0c1d0eafd mptcp: Use 32-bit DATA_ACK when possible
RFC8684 allows to send 32-bit DATA_ACKs as long as the peer is not
sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
extreme scenarios when a very high throughput subflow is combined with a
long-RTT subflow such that the high-throughput subflow wraps around the
32-bit sequence number space within an RTT of the high-RTT subflow.

It is thus a rare scenario and we should try to use the 32-bit DATA_ACK
instead as long as possible. It allows to reduce the TCP-option overhead
by 4 bytes, thus makes space for an additional SACK-block. It also makes
tcpdumps much easier to read when the DSN and DATA_ACK are both either
32 or 64-bit.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:51:10 -07:00
David S. Miller da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Paolo Abeni 729cd6436f mptcp: cope better with MP_JOIN failure
Currently, on MP_JOIN failure we reset the child
socket, but leave the request socket untouched.

tcp_check_req will deal with it according to the
'tcp_abort_on_overflow' sysctl value - by default the
req socket will stay alive.

The above leads to inconsistent behavior on MP JOIN
failure, and bad listener overflow accounting.

This patch addresses the issue leveraging the infrastructure
just introduced to ask the TCP stack to drop the req on
failure.

The child socket is not freed anymore by subflow_syn_recv_sock(),
instead it's moved to a dead state and will be disposed by the
next sock_put done by the TCP stack, so that listener overflow
accounting is not affected by MP JOIN failure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Paolo Abeni 90bf45134d mptcp: add new sock flag to deal with join subflows
MP_JOIN subflows must not land into the accept queue.
Currently tcp_check_req() calls an mptcp specific helper
to detect such scenario.

Such helper leverages the subflow context to check for
MP_JOIN subflows. We need to deal also with MP JOIN
failures, even when the subflow context is not available
due allocation failure.

A possible solution would be changing the syn_recv_sock()
signature to allow returning a more descriptive action/
error code and deal with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead
uses a TCP request socket hole to add a MPTCP specific flag.
Such flag is used by the MPTCP syn_recv_sock() to tell
tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP build, and makes the
MPTCP code simpler. It allows also the next patch to deal
correctly with MP JOIN failure.

v1 -> v2:
 - be more conservative on drop_req initialization (Mat)

RFC -> v1:
 - move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Christoph Paasch 64d950ae0b mptcp: Initialize map_seq upon subflow establishment
When the other MPTCP-peer uses 32-bit data-sequence numbers, we rely on
map_seq to indicate how to expand to a 64-bit data-sequence number in
expand_seq() when receiving data.

For new subflows, this field is not initialized, thus results in an
"invalid" mapping being discarded.

Fix this by initializing map_seq upon subflow establishment time.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 12:08:22 -07:00
Eric Biggers ac0ad93df7 mptcp: use SHA256_BLOCK_SIZE, not SHA_MESSAGE_BYTES
In preparation for naming the SHA-1 stuff in <linux/cryptohash.h>
properly and moving it to a more appropriate header, fix the HMAC-SHA256
code in mptcp_crypto_hmac_sha() to use SHA256_BLOCK_SIZE instead of
"SHA_MESSAGE_BYTES" which is actually the SHA-1 block size.
(Fortunately these are both 64 bytes, so this wasn't a "real" bug...)

Cc: Paolo Abeni <pabeni@redhat.com>
Cc: mptcp@lists.01.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:16 +10:00
Paolo Abeni 7d14b0d2b9 mptcp: set correct vfs info for subflows
When a subflow is created via mptcp_subflow_create_socket(),
a new 'struct socket' is allocated, with a new i_ino value.

When inspecting TCP sockets via the procfs and or the diag
interface, the above ones are not related to the process owning
the MPTCP master socket, even if they are a logical part of it
('ss -p' shows an empty process field)

Additionally, subflows created by the path manager get
the uid/gid from the running workqueue.

Subflows are part of the owning MPTCP master socket, let's
adjust the vfs info to reflect this.

After this patch, 'ss' correctly displays subflows as belonging
to the msk socket creator.

Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:16:05 -07:00
David S. Miller 3793faad7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts were all overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 22:10:13 -07:00
Paolo Abeni ac2b47fb92 mptcp: fix uninitialized value access
tcp_v{4,6}_syn_recv_sock() set 'own_req' only when returning
a not NULL 'child', let's check 'own_req' only if child is
available to avoid an - unharmful - UBSAN splat.

v1 -> v2:
 - reference the correct hash

Fixes: 4c8941de78 ("mptcp: avoid flipping mp_capable field in syn_recv_sock()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:34:07 -07:00
Paolo Abeni a77895dbc0 mptcp: initialize the data_fin field for mpc packets
When parsing MPC+data packets we set the dss field, so
we must also initialize the data_fin, or we can find stray
value there.

Fixes: 9a19371bf0 ("mptcp: fix data_fin handing in RX path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:23 -07:00
Paolo Abeni 5a91e32b40 mptcp: fix 'use_ack' option access.
The mentioned RX option field is initialized only for DSS
packet, we must access it only if 'dss' is set too, or
the subflow will end-up in a bad status, leading to
RFC violations.

Fixes: d22f4988ff ("mptcp: process MP_CAPABLE data option")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni d6085fe19b mptcp: avoid a WARN on bad input.
Syzcaller has found a way to trigger the WARN_ON_ONCE condition
in check_fully_established().

The root cause is a legit fallback to TCP scenario, so replace
the WARN with a plain message on a more strict condition.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni cfde141ea3 mptcp: move option parsing into mptcp_incoming_options()
The mptcp_options_received structure carries several per
packet flags (mp_capable, mp_join, etc.). Such fields must
be cleared on each packet, even on dropped ones or packet
not carrying any MPTCP options, but the current mptcp
code clears them only on TCP option reset.

On several races/corner cases we end-up with stray bits in
incoming options, leading to WARN_ON splats. e.g.:

[  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
[  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
[  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
[  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
[  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
[  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
[  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
[  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
[  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
[  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
[  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
[  171.232586] Call Trace:
[  171.233109]  <IRQ>
[  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
[  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
[  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
[  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
[  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
[  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
[  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
[  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
[  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
[  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
[  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
[  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
[  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
[  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
[  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
[  171.282358]  </IRQ>

We could address the issue clearing explicitly the relevant fields
in several places - tcp_parse_option, tcp_fast_parse_options,
possibly others.

Instead we move the MPTCP option parsing into the already existing
mptcp ingress hook, so that we need to clear the fields in a single
place.

This allows us dropping an MPTCP hook from the TCP code and
removing the quite large mptcp_options_received from the tcp_sock
struct. On the flip side, the MPTCP sockets will traverse the
option space twice (in tcp_parse_option() and in
mptcp_incoming_options(). That looks acceptable: we already
do that for syn and 3rd ack packets, plain TCP socket will
benefit from it, and even MPTCP sockets will experience better
code locality, reducing the jumps between TCP and MPTCP code.

v1 -> v2:
 - rebased on current '-net' tree

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni 263e1201a2 mptcp: consolidate synack processing.
Currently the MPTCP code uses 2 hooks to process syn-ack
packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
callback.

We can drop the first, moving the relevant code into the
latter, reducing the hooking into the TCP code. This is
also needed by the next patch.

v1 -> v2:
 - use local tcp sock ptr instead of casting the sk variable
   several times - DaveM

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Florian Westphal 42c556fef9 mptcp: replace mptcp_disconnect with a stub
Paolo points out that mptcp_disconnect is bogus:
"lock_sock(sk);
looks suspicious (lock should be already held by the caller)
And call to: tcp_disconnect(sk, flags); too, sk is not a tcp
socket".

->disconnect() gets called from e.g. inet_stream_connect when
one tries to disassociate a connected socket again (to re-connect
without closing the socket first).
MPTCP however uses mptcp_stream_connect, not inet_stream_connect,
for the mptcp-socket connect call.

inet_stream_connect only gets called indirectly, for the tcp socket,
so any ->disconnect() calls end up calling tcp_disconnect for that
tcp subflow sk.

This also explains why syzkaller has not yet reported a problem
here.  So for now replace this with a stub that doesn't do anything.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/14
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:39:49 -07:00
Paolo Abeni 1200832c6e mptcp: fix race in msk status update
Currently subflow_finish_connect() changes unconditionally
any msk socket status other than TCP_ESTABLISHED.

If an unblocking connect() races with close(), we can end-up
triggering:

IPv4: Attempt to release TCP socket in state 1 00000000e32b8b7e

when the msk socket is disposed.

Be sure to enter the established status only from SYN_SENT.

Fixes: c3c123d16c ("net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:38:54 -07:00
Florian Westphal 071c8ed6e8 tcp: mptcp: use mptcp receive buffer space to select rcv window
In MPTCP, the receive window is shared across all subflows, because it
refers to the mptcp-level sequence space.

MPTCP receivers already place incoming packets on the mptcp socket
receive queue and will charge it to the mptcp socket rcvbuf until
userspace consumes the data.

Update __tcp_select_window to use the occupancy of the parent/mptcp
socket instead of the subflow socket in case the tcp socket is part
of a logical mptcp connection.

This commit doesn't change choice of initial window for passive or active
connections.
While it would be possible to change those as well, this adds complexity
(especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
of a TCP segment at the connection level if it does not also carry a DSS
option with a Data ACK field.'

SYN/SYNACK packets do not carry a DSS option with a Data ACK field.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:37:52 -07:00
Bo YU b4e0f9a926 mptcp/pm_netlink.c : add check for nla_put_in/6_addr
Normal there should be checked for nla_put_in6_addr like other
usage in net.

Detected by CoverityScan, CID# 1461639

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 15:38:10 -07:00
Paolo Abeni 9a19371bf0 mptcp: fix data_fin handing in RX path
The data fin flag is set only via a DSS option, but
mptcp_incoming_options() copies it unconditionally from the
provided RX options.

Since we do not clear all the mptcp sock RX options in a
socket free/alloc cycle, we can end-up with a stray data_fin
value while parsing e.g. MPC packets.

That would lead to mapping data corruption and will trigger
a few WARN_ON() in the RX path.

Instead of adding a costly memset(), fetch the data_fin flag
only for DSS packets - when we always explicitly initialize
such bit at option parsing time.

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:46:01 -07:00
Paolo Abeni fca5c82c08 mptcp: drop req socket remote_key* fields
We don't need them, as we can use the current ingress opt
data instead. Setting them in syn_recv_sock() may causes
inconsistent mptcp socket status, as per previous commit.

Fixes: cc7972ea19 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:59:33 -07:00