As struct mlxsw_config_profile is mapped to the payload of the FW
command of the same name, resources_query_enable flag does not belong
there. Move it to struct mlxsw_driver.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check should be done directly in mlxsw_pci_config_profile, as for
other profile items. Also, be consistent in naming with the rest and
rename to "used_kvd_sizes".
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First arg of these helpers should be "mlxsw_core".
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This should not be part of the struct, as the struct fields
are tightly coupled with the FW command payload of the same name.
Just use the "granularity" define directly, as in other places.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The parts info is array. The parts copy this info array, yet they are a
list. So make the indexing according to the id and change the list of
parts into array of parts. This helps to eliminate lookups and
constructs like mlxsw_sp_kvdl_part_update() (took me some non-trivial
time to figure out what is going on there).
Alongside with that, introduce a helper macro to define the parts infos.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink_resource_ops should be const as the arg of register function is
also const.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code uses global variables, adjusts them and passes pointer down
to devlink. With every other mlxsw_core instance, the previously passed
pointer values are rewritten. Fix this by de-globalize the variables.
Fixes: 7f47b19bd7 ("mlxsw: spectrum_kvdl: Add support for per part occupancy")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With patch 279c936153 the btuart_cs driver has been deprecated in
favor of serial_cs + hci_uart combination.
static struct pcmcia_device_id btuart_ids[] = {
/* don't use this driver. Use serial_cs + hci_uart instead */
PCMCIA_DEVICE_NULL
};
Intead of keeping it around, just remove it since it is not even
assigned to any PCMCIA identifiers anymore.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When adding the alignment and padding support for H:4 packet processing
for the Nokia driver, it broke the h4_recv_buf usage within bpa10x
driver. To fix this use a separate helper function and placing it into a
dedicated h4_recv.h header file.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The HCILL or eHCILL protocol from TI is actually an H:4 protocol with a
few extra events and thus can also use the h4_recv_buf helper. Instead
of open coding the same funtionality add the extra events to the packet
description table and use h4_recv_buf.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Now that we need just an ACPI HID in the table, and the driver auto-
configures itself otherwise, we can easily add a bunch of known ACPI HIDs.
This avoids having to add these 1 by 1 as devices with one are encountered
by users.
This commit may seem as if it simply adds all IDs between BCM2E00-BCM2EAC,
but that is not true, all these IDs were found in actual .inf files and
the range is not entirely continuous, the following IDs are not added:
BCM2E6A, BCM2E6C, BCM2E8F and BCM2E91 because I did not see these in any
.inf files. As for the large amount of IDs this seems to be caused by
Broadcom using a separate ID for every bluetooth module using their
chips. E.g. BCM2EA6 seems to be specifically for the Raspberry Pi 3.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Since I've been doing a lot of work on Linux Bay Trail / Cherry Trail
support, I've gathered a collection of ACPI DSDTs from about 50 such
machines.
Looking at these DSTDs many have an ACPI device entry describing a bcm
bluetooth device (often disabled in the DSDT), quite a few of these ACPI
device entries have a resource-table where the order does not match with
the order currently associated with the HID of that entry in the
bcm_acpi_match table.
Looking at the Windows .inf files, there is nothing indicating a specific
order there, so I believe that there is no 1:1 mapping between the ACPI
HID and the order in which the resources are listed.
Therefor this commit replaces the hardcoded mapping based on ACPI HID,
with code which actually checks in which order the resources are listed
and bases the gpio-mapping on that.
This should ensure that we always pick the right mapping and this will
make adding new ACPI HIDs to the driver easier.
This has been tested on the following devices:
-Asus T100CHI BCM2E39 / brcmfmac43241b4-sdio / BCM4324B3-37.4M.hcd
-Asus T100TA BCM2E39 / brcmfmac43241b4-sdio / BCM4324B3-37.4M.hcd
-Asus T200TA BCM2E65 / brcmfmac43340-sdio / BCM43341B0-37.4M.hcd
-Jumper ezPad mini 3 BCM2E74 / brcmfmac43430a0-sdio / BCM4343A0-26M.hcd
-Acer Iconia Tab8 w1-8 BCM2E83 / brcmfmac4330-sdio / BCM4330B1-26M.hcd
-Chuwi Vi8 plus(CWI519) BCM2EAA / brcmfmac43430-sdio / BCM43430A1-26M.hcd
Which together cover all 3 combinations of using an Interrupt resource /
GpioInt resource as first resource / GpioInt resource as last resource.
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We declare the same set of const acpi_gpio_params twice with different
names, besides the needless duplication this naming leads to a sortof
double indirection which also makes it harder to see how the mapping is
actually setup.
This commit renames the first set to have generic names, which better
describe the contents of the mapping and drops the second set.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Add 6 new ACPI HIDs to enable bluetooth on devices using these HIDs,
I've tested the following HIDs / devices:
BCM2E74: Jumper ezPad mini 3
BCM2E83: Acer Iconia Tab8 w1-810
BCM2E90: Meegopad T08
BCM2EAA: Chuwi Vi8 plus (CWI519)
The reporter of Red Hat bugzilla 1554835 has tested:
BCM2E84: Lenovo Yoga2
The reporter of kernel bugzilla 274481 has tested:
BCM2E38: Toshiba Encore
Note the Lenovo Yoga2 and Toshiba Encore also needs the earlier patch to
treat all Interrupt ACPI resources as active low.
Cc: stable@vger.kernel.org
Buglink: https://bugzilla.kernel.org/attachment.cgi?id=274481
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1554835
Reported-and-tested-by: Robert R. Howell <rhowell@uwyo.edu>
Reported-and-tested-by: Christian Herzog <daduke@daduke.org>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Older devices with a serdev attached bcm bt hci, use an Interrupt ACPI
resource to describe the IRQ (rather then a GpioInt resource).
These device seem to all claim the IRQ is active-high and seem to all need
a DMI quirk to treat it as active-low. Instead simply always assume that
Interrupt resource specified IRQs are always active-low.
This fixes the bt device not being able to wake the host from runtime-
suspend on the: Asus T100TAM, Asus T200TA, Lenovo Yoga2 and the Toshiba
Encore, without the need to add 4 new DMI quirks for these models.
This also allows us to remove 2 DMI quirks for the Asus T100TA and Asus
T100CHI series. Likely the 2 remaining quirks can also be removed but I
could not find a DSDT of these devices to verify this.
Cc: stable@vger.kernel.org
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198953
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1554835
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The struct hcill_cmd to create an skb with a single u8 is pointless. So
just use skb_put_u8 instead.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
The variable "payload" will eventually be set to an appropriate pointer
a bit later. Thus omit the explicit initialisation at the beginning.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The local variable "ret" will be set to an appropriate value a bit later.
Thus omit the explicit initialisation at the beginning.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
In case the shutdown GPIO is not wired up, it is impossible to reset the
Bluetooth controller to its original state. This include the initial
default baud rate which leads to issues when reloading the module or
when something unexpected happens. To avoid any kind of runtime
deadlocks, stick with the initial default baud rate.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Some GPIO controller drivers request sleepable context and so can't
be accessed from IRQ context. Using gpiod_set/get_value accessors
with such controller leads to a kernel warning since they are
reserved for atomic context (according to the documentation).
Use the postfixed _cansleep version instead, indicating that context
is safe for sleeping if necessary. Note that this is the case here
since we never toggle the gpio neither from IRQ nor from a spinlocked
section.
Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
It should be __le16 instead of __u16 since its part of
mgmt API.
Signed-off-by: Jaganath Kanakkassery <jaganathx.kanakkassery@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Atul Gupta says:
====================
Chelsio Inline TLS
Series for Chelsio Inline TLS driver (chtls)
Use tls ULP infrastructure to register chtls as Inline TLS driver.
Chtls use TCP Sockets to Tx/Rx TLS records.
TCP sk_proto APIs are enhanced to offload TLS record.
T6 adapter provides the following features:
-TLS record offload, TLS header, encrypt, digest and transmit
-TLS record receive and decrypt
-TLS keys store
-TCP/IP engine
-TLS engine
-GCM crypto engine [support CBC also]
TLS provides security at the transport layer. It uses TCP to provide
reliable end-to-end transport of application data.
It relies on TCP for any retransmission.
TLS session comprises of three parts:
a. TCP/IP connection
b. TLS handshake
c. Record layer processing
TLS handshake state machine is executed in host (refer standard
implementation eg. OpenSSL). Setsockopt [SOL_TCP, TCP_ULP]
initialize TCP proto-ops for Chelsio inline tls support.
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
Tx and Rx Keys are decided during handshake and programmed on
the chip after CCS is exchanged.
struct tls12_crypto_info_aes_gcm_128 crypto_info
setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info))
Finish is the first encrypted/decrypted message tx/rx inline.
On the Tx path TLS engine receive plain text from openssl, insert IV,
fetches the tx key, create cipher text records and generate MAC.
TLS header is added to cipher text and forward to TCP/IP engine for
transport layer processing and transmission on wire.
TX PATH:
Apps--openssl--chtls---TLS engine---encrypt/auth---TCP/IP engine---wire
On the Rx side, data received is PDU aligned at record boundaries.
TLS processes only the complete record. If rx key is programmed on
CCS receive, data is decrypted and plain text is posted to host.
RX PATH:
Wire--cipher-text--TCP/IP engine [PDU align]---TLS engine---
decrypt/auth---plain-text--chtls--openssl--application
v15: indent fix in mark_urg
-removed unwanted checks in sendmsg, sendpage, recvmsg,
close, disconnect,shutdown, destroy sock [Sabrina]
- removed unused chtls_free_kmap [chtls.h]
- rebase to top of net-next
v14: -Reverse christmas tree style for variable declarations for
various functions in chtls_hw.c, chtls_io.c [Stefano Brivio]
- replaced break with return in tcp_state_to_flowc_state
[Stefano Brivio]
- renamed tlstx_seq_number to tlstx_incr_seqnum [Stefano Brivio]
- use bool for corked, should_push and send_should_push
[Stefano Brivio]
- removed "Reviewed-by" tag for Stefano, Sabrina, Dave Watson
v13: handle clean ctx free for HW_RECORD in tls_sk_proto_close
-removed SOCK_INLINE [chtls.h], using csk_conn_inline instead
in send_abort_rpl,chtls_send_abort_rpl,chtls_sendmsg,chtls_sendpage
-removed sk_no_receive [chtls_io.c] replaced with sk_shutdown &
RCV_SHUTDOWN in chtls_pt_recvmsg, peekmsg and chtls_recvmsg
-cleaned chtls_expansion_size [Stefano Brivio]
- u8 conf:3 in tls_sw_context to add TLS_HW_RECORD
-removed is_tls_skb, using tls_skb_inline [Stefano Brivio]
-reverse christmas tree formatting in chtls_io.c, chtls_cm.c
[Stefano Brivio]
-fixed build warning reported by kbuild robot
-retained ctx conf enum in chtls_main vs earlier versions, tls_prots
not used in chtls.
-cleanup [removed syn_sent, base_prot, added synq] [Michael Werner]
- passing struct fw_wr_hdr * to ofldtxq_stop [Casey]
- rebased on top of the current net-next
v12: patch against net-next
-fixed build error [reported by Julia]
-replace set_queue with skb_set_queue_mapping [Sabrina]
-copyright year correction [chtls]
v11: formatting and cleanup, few function rename and error
handling [Stefano Brivio]
- ctx freed later for TLS_HW_RECORD
- split tx and rx in different patch
v10: fixed following based on the review comments of Sabrina Dubroca
-docs header added for struct tls_device [tls.h]
-changed TLS_FULL_HW to TLS_HW_RECORD
-similary using tls-hw-record instead of tls-inline for
ethtool feature config
-added more description to patch sets
-replaced kmalloc/vmalloc/kfree with kvzalloc/kvfree
-reordered the patch sequence
-formatted entire patch for func return values
v9: corrected __u8 and similar usage
-create_ctx to alloc tls_context
-tls_hw_prot before sk !establish check
v8: tls_main.c cleanup comment [Dave Watson]
v7: func name change, use sk->sk_prot where required
v6: modify prot only for FULL_HW
-corrected commit message for patch 11
v5: set TLS_FULL_HW for registered inline tls drivers
-set TLS_FULL_HW prot for offload connection else move
to TLS_SW_TX
-Case handled for interface with same IP [Dave Miller]
-Removed Specific IP and INADDR_ANY handling [v4]
v4: removed chtls ULP type, retained tls ULP
-registered chtls with net tls
-defined struct tls_device to register the Inline drivers
-ethtool interface tls-inline to enable Inline TLS for interface
-prot update to support inline TLS
v3: fixed the kbuild test issues
-made few funtions static
-initialized few variables
v2: fixed the following based on the review comments of Stephan Mueller,
Stefano Brivio and Hannes Frederic
-Added more details in cover letter
-Fixed indentation and formating issues
-Using aes instead of aes-generic
-memset key info after programing the key on chip
-reordered the patch sequence
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Entry for Inline TLS as another driver dependent on cxgb4 and chcr
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the space reserved for storing the TLS keys,
get and free the location where key is stored for the TLS
connection.
Program the Tx and Rx key as received from user in
struct tls12_crypto_info_aes_gcm_128 and understood by hardware.
added socket option TLS_RX
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
handler for record receive. plain text copied to user
buffer
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TLS handler for record transmit.
Create Inline TLS work request and post to FW.
Create Inline TLS record CPLs for hardware
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Exchange messages with hardware to program the TLS session
CPL handlers for messages received from chip.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register chtls as Inline TLS driver, chtls is ULD to cxgb4.
Setsockopt to program (tx/rx) keys on chip.
Support AES GCM of key size 128.
Support both Inline Rx and Tx.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Casey Leedom <leedom@chelsio.com>
Reviewed-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define Inline TLS state, connection management info.
Supporting macros definition.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Michael Werner <werner@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define macro for programming the TLS Key context
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Read the Inline TLS capability from firmware.
Determine the area reserved for storing the keys
Dump the Inline TLS tx and rx records count.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Michael Werner <werner@chelsio.com>
Reviewed-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Key area size in hw-config file. CPL struct for TLS request
and response. Work request for Inline TLS.
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ethtool option enables TLS record offload on HW, user
configures the feature for netdev capable of Inline TLS.
This allows user to define custom sk_prot for Inline TLS sock
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Facility to register Inline TLS drivers to net/tls. Setup
TLS_HW_RECORD prot to listen on offload device.
Cases handled
- Inline TLS device exists, setup prot for TLS_HW_RECORD
- Atleast one Inline TLS exists, sets TLS_HW_RECORD.
- If non-inline device establish connection, move to TLS_SW_TX
Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-03-31
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add raw BPF tracepoint API in order to have a BPF program type that
can access kernel internal arguments of the tracepoints in their
raw form similar to kprobes based BPF programs. This infrastructure
also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
returns an anon-inode backed fd for the tracepoint object that allows
for automatic detach of the BPF program resp. unregistering of the
tracepoint probe on fd release, from Alexei.
2) Add new BPF cgroup hooks at bind() and connect() entry in order to
allow BPF programs to reject, inspect or modify user space passed
struct sockaddr, and as well a hook at post bind time once the port
has been allocated. They are used in FB's container management engine
for implementing policy, replacing fragile LD_PRELOAD wrapper
intercepting bind() and connect() calls that only works in limited
scenarios like glibc based apps but not for other runtimes in
containerized applications, from Andrey.
3) BPF_F_INGRESS flag support has been added to sockmap programs for
their redirect helper call bringing it in line with cls_bpf based
programs. Support is added for both variants of sockmap programs,
meaning for tx ULP hooks as well as recv skb hooks, from John.
4) Various improvements on BPF side for the nfp driver, besides others
this work adds BPF map update and delete helper call support from
the datapath, JITing of 32 and 64 bit XADD instructions as well as
offload support of bpf_get_prandom_u32() call. Initial implementation
of nfp packet cache has been tackled that optimizes memory access
(see merge commit for further details), from Jakub and Jiong.
5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
API has been done in order to prepare to use print_bpf_insn() soon
out of perf tool directly. This makes the print_bpf_insn() API more
generic and pushes the env into private data. bpftool is adjusted
as well with the print_bpf_insn() argument removal, from Jiri.
6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
Format). The latter will reuse the current BPF verifier log as
well, thus bpf_verifier_log() is further generalized, from Martin.
7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
and write support has been added in similar fashion to existing
IPv6 IPV6_TCLASS socket option we already have, from Nikita.
8) Fixes in recent sockmap scatterlist API usage, which did not use
sg_init_table() for initialization thus triggering a BUG_ON() in
scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
uses a small helper sg_init_marker() to properly handle the affected
cases, from Prashant.
9) Let the BPF core follow IDR code convention and therefore use the
idr_preload() and idr_preload_end() helpers, which would also help
idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
pressure, from Shaohua.
10) Last but not least, a spelling fix in an error message for the
BPF cookie UID helper under BPF sample code, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet says:
====================
inet: frags: bring rhashtables to IP defrag
IP defrag processing is one of the remaining problematic layer in linux.
It uses static hash tables of 1024 buckets, and up to 128 items per bucket.
A work queue is supposed to garbage collect items when host is under memory
pressure, and doing a hash rebuild, changing seed used in hash computations.
This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
occurring every 5 seconds if host is under fire.
Then there is the problem of sharing this hash table for all netns.
It is time to switch to rhashtables, and allocate one of them per netns
to speedup netns dismantle, since this is a critical metric these days.
Lookup is now using RCU, and 64bit hosts can now provision whatever amount
of memory needed to handle the expected workloads.
v2: Addressed Herbert and Kirill feedbacks
(Use rhashtable_free_and_destroy(), and split the big patch into small units)
v3: Removed the extra add_frag_mem_limit(...) from inet_frag_create()
Removed the refcount_inc_not_zero() call from inet_frags_free_cb(),
as we can exploit del_timer() return value.
v4: kbuild robot feedback about one missing static (squashed)
Additional patches :
inet: frags: do not clone skb in ip_expire()
ipv6: frags: rewrite ip6_expire_frag_queue()
rhashtable: reorganize struct rhashtable layout
inet: frags: reorganize struct netns_frags
inet: frags: get rid of ipfrag_skb_cb/FRAG_CB
ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB
inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CB
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset,
meaning that we could use two cache lines per skb when finding
the insertion point, if for some reason inet6_skb_parm size
is increased in the future.
By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields
in a single cache line, matching what we did for IPv4.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
we could use two cache lines per skb when finding the insertion point,
if for some reason inet6_skb_parm size is increased in the future.
By using skb->ip_defrag_offset instead of skb->cb[], we pack all
the fields in a single cache line, matching what we did for IPv4.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
this integer is currently in a different cache line than skb->next,
meaning that we use two cache lines per skb when finding the insertion point.
By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
in a single cache line and save precious memory bandwidth.
Note that after the fast path added by Changli Gao in commit
d6bebca92c ("fragment: add fast path for in-order fragments")
this change wont help the fast path, since we still need
to access prev->len (2nd cache line), but will show great
benefits when slow path is entered, since we perform
a linear scan of a potentially long list.
Also, note that this potential long list is an attack vector,
we might consider also using an rb-tree there eventually.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Put the read-mostly fields in a separate cache line
at the beginning of struct netns_frags, to reduce
false sharing noticed in inet_frag_kill()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While under frags DDOS I noticed unfortunate false sharing between
@nelems and @params.automatic_shrinking
Move @nelems at the end of struct rhashtable so that first cache line
is shared between all cpus, because almost never dirtied.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make it similar to IPv4 ip_expire(), and release the lock
before calling icmp functions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An skb_clone() was added in commit ec4fbd6475 ("inet: frag: release
spinlock before calling icmp_send()")
While fixing the bug at that time, it also added a very high cost
for DDOS frags, as the ICMP rate limit is applied after this
expensive operation (skb_clone() + consume_skb(), implying memory
allocations, copy, and freeing)
We can use skb_get(head) here, all we want is to make sure skb wont
be freed by another cpu.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some users are willing to provision huge amounts of memory to be able
to perform reassembly reasonnably well under pressure.
Current memory tracking is using one atomic_t and integers.
Switch to atomic_long_t so that 64bit arches can use more than 2GB,
without any cost for 32bit arches.
Note that this patch avoids an overflow error, if high_thresh was set
to ~2GB, since this test in inet_frag_alloc() was never true :
if (... || frag_mem_limit(nf) > nf->high_thresh)
Tested:
$ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
<frag DDOS>
$ grep FRAG /proc/net/sockstat
FRAG: inuse 14705885 memory 16000002880
$ nstat -n ; sleep 1 ; nstat | grep Reas
IpReasmReqds 3317150 0.0
IpReasmFails 3317112 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is obsolete, after rhashtable addition to inet defrag.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This refactors ip_expire() since one indentation level is removed.
Note: in the future, we should try hard to avoid the skb_clone()
since this is a serious performance cost.
Under DDOS, the ICMP message wont be sent because of rate limits.
Fact that ip6_expire_frag_queue() does not use skb_clone() is
disturbing too. Presumably IPv6 should have the same
issue than the one we fixed in commit ec4fbd6475
("inet: frag: release spinlock before calling icmp_send()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
Also since we use rhashtable we can bring back the number of fragments
in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
removed in commit 434d305405 ("inet: frag: don't account number
of fragment queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>