OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Jakub Kicinski	d4e5db6452	tls: rx: device: keep the zero copy status with offload The non-zero-copy path assumes a full skb with decrypted contents. This means the device offload would have to CoW the data. Try to keep the zero-copy status instead, copy the data to user space. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-26 14:38:51 -07:00
Jakub Kicinski	b93f570016	tls: rx: don't free the output in case of zero-copy In the future we'll want to reuse the input skb in case of zero-copy so we shouldn't always free darg.skb. Move the freeing of darg.skb into the non-zc cases. All cases will now free ctx->recv_pkt (inside let tls_rx_rec_done()). Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-26 14:38:51 -07:00
Jakub Kicinski	dd47ed3620	tls: rx: factor SW handling out of tls_rx_one_record() After recent changes the SW side of tls_rx_one_record() can be nicely encapsulated in its own function. Move the pad handling as well. This will be useful for ->zc handling in tls_decrypt_device(). Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-26 14:38:50 -07:00
Jakub Kicinski	b92a13d488	tls: rx: wrap recv_pkt accesses in helpers To allow for the logic to change later wrap accesses which interrogate the input skb in helper functions. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-26 14:38:50 -07:00
Jakub Kicinski	dde06aaa89	tls: rx: release the sock lock on locking timeout Eric reports we should release the socket lock if the entire "grab reader lock" operation has failed. The callers assume they don't have to release it or otherwise unwind. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+16e72110feb2b653ef27@syzkaller.appspotmail.com Fixes: `4cbc325ed6` ("tls: rx: allow only one reader at a time") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220720203701.2179034-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-21 18:58:11 -07:00
Jakub Kicinski	6e0e846ee2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-21 13:03:39 -07:00
Tariq Toukan	f08d8c1bb9	net/tls: Fix race in TLS device down flow Socket destruction flow and tls_device_down function sync against each other using tls_device_lock and the context refcount, to guarantee the device resources are freed via tls_dev_del() by the end of tls_device_down. In the following unfortunate flow, this won't happen: - refcount is decreased to zero in tls_device_sk_destruct. - tls_device_down starts, skips the context as refcount is zero, going all the way until it flushes the gc work, and returns without freeing the device resources. - only then, tls_device_queue_ctx_destruction is called, queues the gc work and frees the context's device resources. Solve it by decreasing the refcount in the socket's destruction flow under the tls_device_lock, for perfect synchronization. This does not slow down the common likely destructor flow, in which both the refcount is decreased and the spinlock is acquired, anyway. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:33:07 +01:00
Jakub Kicinski	fd31f3996a	tls: rx: decrypt into a fresh skb We currently CoW Rx skbs whenever we can't decrypt to a user space buffer. The skbs can be enormous (64kB) and CoW does a linear alloc which has a strong chance of failing under memory pressure. Or even without, skb_cow_data() assumes GFP_ATOMIC. Allocate a new frag'd skb and decrypt into it. We finally take advantage of the decrypted skb getting returned via darg. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	cbbdee9918	tls: rx: async: don't put async zc on the list The "zero-copy" path in SW TLS will engage either for no skbs or for all but last. If the recvmsg parameters are right and the socket can do ZC we'll ZC until the iterator can't fit a full record at which point we'll decrypt one more record and copy over the necessary bits to fill up the request. The only reason we hold onto the ZC skbs which went thru the async path until the end of recvmsg() is to count bytes. We need an accurate count of zc'ed bytes so that we can calculate how much of the non-zc'd data to copy. To allow freeing input skbs on the ZC path count only how much of the list we'll need to consume. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	c618db2afe	tls: rx: async: hold onto the input skb Async crypto currently benefits from the fact that we decrypt in place. When we allow input and output to be different skbs we will have to hang onto the input while we move to the next record. Clone the inputs and keep them on a list. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	6ececdc513	tls: rx: async: adjust record geometry immediately Async crypto TLS Rx currently waits for crypto to be done in order to strip the TLS header and tailer. Simplify the code by moving the pointers immediately, since only TLS 1.2 is supported here there is no message padding. This simplifies the decryption into a new skb in the next patch as we don't have to worry about input vs output skb in the decrypt_done() handler any more. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	6bd116c8c6	tls: rx: return the decrypted skb via darg Instead of using ctx->recv_pkt after decryption read the skb from darg.skb. This moves the decision of what the "output skb" is to the decrypt handlers. For now after decrypt handler returns successfully ctx->recv_pkt is simply moved to darg.skb, but it will change soon. Note that tls_decrypt_sg() cannot clear the ctx->recv_pkt because it gets called to re-encrypt (i.e. by the device offload). So we need an awkward temporary if() in tls_rx_one_record(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	541cc48be3	tls: rx: read the input skb from ctx->recv_pkt Callers always pass ctx->recv_pkt into decrypt_skb_update(), and it propagates it to its callees. This may give someone the false impression that those functions can accept any valid skb containing a TLS record. That's not the case, the record sequence number is read from the context, and they can only take the next record coming out of the strp. Let the functions get the skb from the context instead of passing it in. This will also make it cleaner to return a different skb than ctx->recv_pkt as the decrypted one later on. Since we're touching the definition of decrypt_skb_update() use this as an opportunity to rename it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	8a95873281	tls: rx: factor out device darg update I already forgot to transform darg from input to output semantics once on the NIC inline crypto fastpath. To avoid this happening again create a device equivalent of decrypt_internal(). A function responsible for decryption and transforming darg. While at it rename decrypt_internal() to a hopefully slightly more meaningful name. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:11 +01:00
Jakub Kicinski	53d57999fe	tls: rx: remove the message decrypted tracking We no longer allow a decrypted skb to remain linked to ctx->recv_pkt. Anything on the list is decrypted, anything on ctx->recv_pkt needs to be decrypted. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:10 +01:00
Jakub Kicinski	abb47dc95d	tls: rx: don't keep decrypted skbs on ctx->recv_pkt Detach the skb from ctx->recv_pkt after decryption is done, even if we can't consume it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:10 +01:00
Jakub Kicinski	008141de85	tls: rx: don't try to keep the skbs always on the list I thought that having the skb either always on the ctx->rx_list or ctx->recv_pkt will simplify the handling, as we would not have to remember to flip it from one to the other on exit paths. This became a little harder to justify with the fix for BPF sockmaps. Subsequent changes will make the situation even worse. Queue the skbs only when really needed. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:10 +01:00
Jakub Kicinski	4cbc325ed6	tls: rx: allow only one reader at a time recvmsg() in TLS gets data from the skb list (rx_list) or fresh skbs we read from TCP via strparser. The former holds skbs which were already decrypted for peek or decrypted and partially consumed. tls_wait_data() only notices appearance of fresh skbs coming out of TCP (or psock). It is possible, if there is a concurrent call to peek() and recv() that the peek() will move the data from input to rx_list without recv() noticing. recv() will then read data out of order or never wake up. This is not a practical use case/concern, but it makes the self tests less reliable. This patch solves the problem by allowing only one reader in. Because having multiple processes calling read()/peek() is not normal avoid adding a lock and try to fast-path the single reader case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-18 11:24:10 +01:00
Jakub Kicinski	816cd16883	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net include/net/sock.h `310731e2f1` ("net: Fix data-races around sysctl_mem.") `e70f3c7012` ("Revert "net: set SK_MEM_QUANTUM to 4096"") https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/ net/ipv4/fib_semantics.c `747c143072` ("ip: fix dflt addr selection for connected nexthop") `d62607c3fe` ("net: rename reference+tracking helpers") net/tls/tls.h include/net/tls.h `3d8c51b25a` ("net/tls: Check for errors in tls_device_init") `5879031423` ("tls: create an internal header") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-14 15:27:35 -07:00
Tariq Toukan	3d8c51b25a	net/tls: Check for errors in tls_device_init Add missing error checks in tls_device_init. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Reported-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220714070754.1428-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-14 10:12:39 -07:00
Jakub Kicinski	57128e98c3	tls: rx: fix the NoPad getsockopt Maxim reports do_tls_getsockopt_no_pad() will always return an error. Indeed looks like refactoring gone wrong - remove err and use value. Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com> Fixes: `88527790c0` ("tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-11 19:48:33 -07:00
Jakub Kicinski	bb56cea9ab	tls: rx: add counter for NoPad violations As discussed with Maxim add a counter for true NoPad violations. This should help deployments catch unexpected padded records vs just control records which always need re-encryption. https: //lore.kernel.org/all/b111828e6ac34baad9f4e783127eba8344ac252d.camel@nvidia.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-11 19:48:33 -07:00
Jakub Kicinski	1090c1ea22	tls: fix spelling of MIB MIN -> MIB Fixes: `88527790c0` ("tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-11 19:48:32 -07:00
Jakub Kicinski	35560b7f06	tls: rx: make tls_wait_data() return an recvmsg retcode tls_wait_data() sets the return code as an output parameter and always returns ctx->recv_pkt on success. Return the error code directly and let the caller read the skb from the context. Use positive return code to indicate ctx->recv_pkt is ready. While touching the definition of the function rename it. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 18:38:45 -07:00
Jakub Kicinski	5879031423	tls: create an internal header include/net/tls.h is getting a little long, and is probably hard for driver authors to navigate. Split out the internals into a header which will live under net/tls/. While at it move some static inlines with a single user into the source files, add a few tls_ prefixes and fix spelling of 'proccess'. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 18:38:45 -07:00
Jakub Kicinski	03957d8405	tls: rx: coalesce exit paths in tls_decrypt_sg() Jump to the free() call, instead of having to remember to free the memory in multiple places. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 18:38:45 -07:00
Jakub Kicinski	b89fec54fd	tls: rx: wrap decrypt params in a struct The max size of iv + aad + tail is 22B. That's smaller than a single sg entry (32B). Don't bother with the memory packing, just create a struct which holds the max size of those members. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 18:38:45 -07:00
Jakub Kicinski	50a07aa531	tls: rx: always allocate max possible aad size for decrypt AAD size is either 5 or 13. Really no point complicating the code for the 8B of difference. This will also let us turn the chunked up buffer into a sane struct. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 18:38:45 -07:00
Jakub Kicinski	83ec88d81a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-07 12:07:37 -07:00
Gal Pressman	a069a90554	Revert "tls: rx: move counting TlsDecryptErrors for sync" This reverts commit `284b4d93da`. When using TLS device offload and coming from tls_device_reencrypt() flow, -EBADMSG error in tls_do_decryption() should not be counted towards the TLSTlsDecryptError counter. Move the counter increase back to the decrypt_internal() call site in decrypt_skb_update(). This also fixes an issue where: if (n_sgin < 1) return -EBADMSG; Errors in decrypt_internal() were not counted after the cited patch. Fixes: `284b4d93da` ("tls: rx: move counting TlsDecryptErrors for sync") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-06 13:10:59 +01:00
Jakub Kicinski	c46b01839f	tls: rx: periodically flush socket backlog We continuously hold the socket lock during large reads and writes. This may inflate RTT and negatively impact TCP performance. Flush the backlog periodically. I tried to pick a flush period (128kB) which gives significant benefit but the max Bps rate is not yet visibly impacted. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-06 12:56:35 +01:00
Jakub Kicinski	88527790c0	tls: rx: add sockopt for enabling optimistic decrypt with TLS 1.3 Since optimisitic decrypt may add extra load in case of retries require socket owner to explicitly opt-in. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-06 12:56:35 +01:00
Jakub Kicinski	ce61327ce9	tls: rx: support optimistic decrypt to user buffer with TLS 1.3 We currently don't support decrypt to user buffer with TLS 1.3 because we don't know the record type and how much padding record contains before decryption. In practice data records are by far most common and padding gets used rarely so we can assume data record, no padding, and if we find out that wasn't the case - retry the crypto in place (decrypt to skb). To safeguard from user overwriting content type and padding before we can check it attach a 1B sg entry where last byte of the record will land. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-06 12:56:35 +01:00
Jakub Kicinski	603380f54f	tls: rx: don't include tail size in data_len To make future patches easier to review make data_len contain the length of the data, without the tail. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-06 12:56:35 +01:00
Eric Dumazet	504148fedb	net: add skb_[inner_]tcp_all_headers helpers Most drivers use "skb_transport_offset(skb) + tcp_hdrlen(skb)" to compute headers length for a TCP packet, but others use more convoluted (but equivalent) ways. Add skb_tcp_all_headers() and skb_inner_tcp_all_headers() helpers to harmonize this a bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-07-02 16:22:25 +01:00
Jakub Kicinski	e34a07c0ae	sock: redo the psock vs ULP protection check Commit `8a59f9d1e3` ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to the new tcp_bpf_update_proto() function. I'm guessing that this was done to allow creating psocks for non-inet sockets. Unfortunately the destruction path for psock includes the ULP unwind, so we need to fail the sk_psock_init() itself. Otherwise if ULP is already present we'll notice that later, and call tcp_update_ulp() with the sk_proto of the ULP itself, which will most likely result in the ULP looping its callbacks. Fixes: `8a59f9d1e3` ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-06-23 10:08:30 +02:00
Jakub Kicinski	1b205d948f	Revert "net/tls: fix tls_sk_proto_close executed repeatedly" This reverts commit `69135c572d`. This commit was just papering over the issue, ULP should not get ->update() called with its own sk_prot. Each ULP would need to add this check. Fixes: `69135c572d` ("net/tls: fix tls_sk_proto_close executed repeatedly") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20220620191353.1184629-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-06-23 10:08:30 +02:00
Ziyang Xuan	69135c572d	net/tls: fix tls_sk_proto_close executed repeatedly After setting the sock ktls, update ctx->sk_proto to sock->sk_prot by tls_update(), so now ctx->sk_proto->close is tls_sk_proto_close(). When close the sock, tls_sk_proto_close() is called for sock->sk_prot->close is tls_sk_proto_close(). But ctx->sk_proto->close() will be executed later in tls_sk_proto_close(). Thus tls_sk_proto_close() executed repeatedly occurred. That will trigger the following bug. ================================================================= KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] RIP: 0010:tls_sk_proto_close+0xd8/0xaf0 net/tls/tls_main.c:306 Call Trace: <TASK> tls_sk_proto_close+0x356/0xaf0 net/tls/tls_main.c:329 inet_release+0x12e/0x280 net/ipv4/af_inet.c:428 __sock_release+0xcd/0x280 net/socket.c:650 sock_close+0x18/0x20 net/socket.c:1365 Updating a proto which is same with sock->sk_prot is incorrect. Add proto and sock->sk_prot equality check at the head of tls_update() to fix it. Fixes: `95fa145479` ("bpf: sockmap/tls, close can race with map free") Reported-by: syzbot+29c3c12f3214b85ad081@syzkaller.appspotmail.com Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:01:52 +01:00
Maxim Mikityanskiy	b489a6e587	tls: Rename TLS_INFO_ZC_SENDFILE to TLS_INFO_ZC_TX To embrace possible future optimizations of TLS, rename zerocopy sendfile definitions to more generic ones: * setsockopt: TLS_TX_ZEROCOPY_SENDFILE- > TLS_TX_ZEROCOPY_RO * sock_diag: TLS_INFO_ZC_SENDFILE -> TLS_INFO_ZC_RO_TX RO stands for readonly and emphasizes that the application shouldn't modify the data being transmitted with zerocopy to avoid potential disconnection. Fixes: `c1318b39c7` ("tls: Add opt-in zerocopy mode of sendfile()") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/r/20220608153425.3151146-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-06-09 21:51:57 -07:00
Jakub Kicinski	1c2133114d	net: tls: fix messing up lists when bpf enabled Artem points out that skb may try to take over the skb and queue it to its own list. Unlink the skb before calling out. Fixes: `b1a2c17863` ("tls: rx: clear ctx->recv_pkt earlier") Reported-by: Artem Savkov <asavkov@redhat.com> Tested-by: Artem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-19 17:55:06 -07:00
Boris Pismenny	c1318b39c7	tls: Add opt-in zerocopy mode of sendfile() TLS device offload copies sendfile data to a bounce buffer before transmitting. It allows to maintain the valid MAC on TLS records when the file contents change and a part of TLS record has to be retransmitted on TCP level. In many common use cases (like serving static files over HTTPS) the file contents are not changed on the fly. In many use cases breaking the connection is totally acceptable if the file is changed during transmission, because it would be received corrupted in any case. This commit allows to optimize performance for such use cases to providing a new optional mode of TLS sendfile(), in which the extra copy is skipped. Removing this copy improves performance significantly, as TLS and TCP sendfile perform the same operations, and the only overhead is TLS header/trailer insertion. The new mode can only be enabled with the new socket option named TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards compatibility with existing applications that rely on the copying behavior. The new mode is safe, meaning that unsolicited modifications of the file being sent can't break integrity of the kernel. The worst thing that can happen is sending a corrupted TLS record, which is in any case not forbidden when using regular TCP sockets. Sockets other than TLS device offload are not affected by the new socket option. The actual status of zerocopy sendfile can be queried with sock_diag. Performance numbers in a single-core test with 24 HTTPS streams on nginx, under 100% CPU load: * non-zerocopy: 33.6 Gbit/s * zerocopy: 79.92 Gbit/s CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz Signed-off-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-05-19 12:14:11 +02:00
Jakub Kicinski	9b19e57a3c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net No conflicts. Build issue in drivers/net/ethernet/sfc/ptp.c `54fccfdd7c` ("sfc: efx_default_channel_type APIs can be static") `49e6123c65` ("net: sfc: fix memory leak due to ptp channel") https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-12 16:15:30 -07:00
Maxim Mikityanskiy	3740651bf7	tls: Fix context leak on tls_device_down The commit cited below claims to fix a use-after-free condition after tls_device_down. Apparently, the description wasn't fully accurate. The context stayed alive, but ctx->netdev became NULL, and the offload was torn down without a proper fallback, so a bug was present, but a different kind of bug. Due to misunderstanding of the issue, the original patch dropped the refcount_dec_and_test line for the context to avoid the alleged premature deallocation. That line has to be restored, because it matches the refcount_inc_not_zero from the same function, otherwise the contexts that survived tls_device_down are leaked. This patch fixes the described issue by restoring refcount_dec_and_test. After this change, there is no leak anymore, and the fallback to software kTLS still works. Fixes: `c55dcdd435` ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-12 10:01:36 -07:00
Jakub Kicinski	0e55546b18	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net include/linux/netdevice.h net/core/dev.c `6510ea973d` ("net: Use this_cpu_inc() to increment net->core_stats") `794c24e992` ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c `d48fea8401` ("net: cosa: fix error check return value of register_chrdev()") `89fbca3307` ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-04-28 13:02:01 -07:00
Maxim Mikityanskiy	a0df71948e	tls: Skip tls_append_frag on zero copy size Calling tls_append_frag when max_open_record_len == record->len might add an empty fragment to the TLS record if the call happens to be on the page boundary. Normally tls_append_frag coalesces the zero-sized fragment to the previous one, but not if it's on page boundary. If a resync happens then, the mlx5 driver posts dump WQEs in tx_post_resync_dump, and the empty fragment may become a data segment with byte_count == 0, which will confuse the NIC and lead to a CQE error. This commit fixes the described issue by skipping tls_append_frag on zero size to avoid adding empty fragments. The fix is not in the driver, because an empty fragment is hardly the desired behavior. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220426154949.159055-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-04-27 15:25:20 -07:00
Jakub Kicinski	c706b2b5ed	net: tls: fix async vs NIC crypto offload When NIC takes care of crypto (or the record has already been decrypted) we forget to update darg->async. ->async is supposed to mean whether record is async capable on input and whether record has been queued for async crypto on output. Reported-by: Gal Pressman <gal@nvidia.com> Fixes: `3547a1f9d9` ("tls: rx: use async as an in-out argument") Tested-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-04-26 17:08:49 -07:00
Eric Dumazet	68822bdf76	net: generalize skb freeing deferral to per-cpu lists Logic added in commit `f35f821935` ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions () 57.81% [kernel] [k] copy_user_enhanced_fast_string () 12.87% [kernel] [k] skb_release_data () 4.25% [kernel] [k] __free_one_page () 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter () 1.59% [kernel] [k] free_unref_page_commit () 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter () 1.01% [kernel] [k] kfree () 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue () 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog () 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked () 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq () 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv () 0.33% [kernel] [k] __put_page () 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-04-26 17:05:59 -07:00
Jakub Kicinski	a4ae58cdb6	tls: rx: only copy IV from the packet for TLS 1.2 TLS 1.3 and ChaChaPoly don't carry IV in the packet. The code before this change would copy out iv_size worth of whatever followed the TLS header in the packet and then for TLS 1.3 \| ChaCha overwrite that with the sequence number. Waste of cycles especially with TLS 1.2 being close to dead and TLS 1.3 being the common case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-04-13 11:45:39 +01:00
Jakub Kicinski	f7d45f4b52	tls: rx: use MAX_IV_SIZE for allocations IVs are 8 or 16 bytes, no point reading out the exact value for quantities this small. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-04-13 11:45:39 +01:00
Jakub Kicinski	3547a1f9d9	tls: rx: use async as an in-out argument Propagating EINPROGRESS thru multiple layers of functions is error prone. Use darg->async as an in/out argument, like we use darg->zc today. On input it tells the code if async is allowed, on output if it took place. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-04-13 11:45:39 +01:00

1 2 3 4 5 ...

409 Commits