Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:
1) Move netns to basechain and use recently added possible_net_t, from
Patrick McHardy.
2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.
3) Restore nf_log_trace that was accidentally removed during conflict
resolution.
4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
all patches from Patrick McHardy.
5) Use raw_smp_processor_id() in nft_meta.
Then, several patches to prepare ground for the new set extension
infrastructure:
6) Pass object length to the hash callback in rhashtable as needed by
the new set extension infrastructure.
7) Cleanup patch to restore struct nft_hash as wrapper for struct
rhashtable
8) Another small source code readability cleanup for nft_hash.
9) Convert nft_hash to rhashtable callbacks.
And finally...
10) Add the new set extension infrastructure.
11) Convert the nft_hash and nft_rbtree sets to use it.
12) Batch set element release to avoid several RCU grace period in a row
and add new function nft_set_elem_destroy() to consolidate set element
release.
13) Return the set extension data area from nft_lookup.
14) Refactor existing transaction code to add some helper functions
and document it.
15) Complete the set transaction support, using similar approach to what we
already use, to activate/deactivate elements in an atomic fashion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue says:
====================
tipc: fix two corner issues
The patch set aims at resolving the following two critical issues:
Patch #1: Resolve a deadlock which happens while all links are reset
Patch #2: Correct a mistake usage of RCU lock which is used to protect
node list
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC node hash node table is protected with rcu lock on read side.
tipc_node_find() is used to look for a node object with node address
through iterating the hash node table. As the entire process of what
tipc_node_find() traverses the table is guarded with rcu read lock,
it's safe for us. However, when callers use the node object returned
by tipc_node_find(), there is no rcu read lock applied. Therefore,
this is absolutely unsafe for callers of tipc_node_find().
Now we introduce a reference counter for node structure. Before
tipc_node_find() returns node object to its caller, it first increases
the reference counter. Accordingly, after its caller used it up,
it decreases the counter again. This can prevent a node being used by
one thread from being freed by another thread.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
received is 0, no need to minus it and use "+=" to reassign it
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla says:
====================
be2net: patch set
Hi David, this patch set includes 2 feature additions to the be2net driver:
Patch 1 sets up cpu affinity hints for be2net irqs using the
cpumask_set_cpu_local_first() API that first picks the near numa cores
and when they are exhausted, selects the far numa cores.
Patch 2 setups up xps queue mapping for be2net's TXQs to avoid,
by default, TX lock contention.
Patch 3 just bumps up the driver version.
Pls consider applying this patch set to the net-next queue. Thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch sets up xps queue mapping on load, so that TX traffic is
steered to the queue whose irqs are being processed by the current cpu.
This helps in avoiding TX lock contention.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides hints to irqbalance to map be2net IRQs to
specific CPU cores. cpumask_set_cpu_local_first() is used, which first
maps IRQs to near NUMA cores; when those cores are exhausted, IRQs are
mapped to far NUMA cores.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 1fb6f159fd ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.
We can make it static, by moving it above tcp_conn_request()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/chelsio/cxgb4/cxgb4_fcoe.c:49:9-10: WARNING: return of 0/1 in function 'cxgb_fcoe_sof_eof_supported' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
CC: Varun Prakash <varun@chelsio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not commit the new ops until we finish
all the setup, otherwise we have to NULL it on failure.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther says:
====================
net: bcmgenet: multiple Rx queues support
Final patch set to add support for multiple Rx queues:
1. remove priv->int0_mask and priv->int1_mask
2. modify Tx ring int_enable and int_disable vectors
3. simplify bcmgenet_init_dma()
4. tweak init_umac()
5. rework Tx NAPI code
6. rework Rx NAPI code
7. add support for multiple Rx queues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for multiple Rx queues:
1. Add NAPI context per Rx queue
2. Modify Rx interrupt and Rx NAPI code to handle multiple Rx queues
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new bcmgenet functions to handle the NAPI calls to:
netif_napi_add()
napi_enable()
napi_disable()
netif_napi_del()
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new bcmgenet functions to handle the NAPI calls to:
netif_napi_add()
napi_enable()
napi_disable()
netif_napi_del()
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use more meaningful variable names int0_enable and int1_enable when
enabling bcmgenet interrupts.
For Rx default queue interrupts, use:
UMAC_IRQ_RXDMA_BDONE | UMAC_IRQ_RXDMA_PDONE
For Tx default queue interrupts, use:
UMAC_IRQ_TXDMA_BDONE | UMAC_IRQ_TXDMA_PDONE
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do the two kcalloc() calls first, before proceeding into Rx/Tx DMA init.
Makes the error case handling much simpler.
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jaedon Shin <jaedon.shin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unnecessary function parameter priv. Use ring->priv instead.
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unused priv->int0_mask and priv->int1_mask.
Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian says:
====================
drivers: net: xgene: Add separate tx completion ring
SGMII based 1GbE and 10GbE interfaces support multiple interrupts.
Adding separate tx completion descriptor ring and associating a dedicated irq for the TX completion.
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
- Added wrapper functions around napi_add, napi_del, napi_enable and napi_disable
- Moved platform_get_irq function call after reading phy_mode
- Associating the new irq to tx completion for the supported ethernet interfaces
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Keyur Chudgar <kchudgar@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set elements are the last object type not supporting transaction support.
Implement similar to the existing rule transactions:
The global transaction counter keeps track of two generations, current
and next. Each element contains a bitmask specifying in which generations
it is inactive.
New elements start out as inactive in the current generation and active
in the next. On commit, the previous next generation becomes the current
generation and the element becomes active. The bitmask is then cleared
to indicate that the element is active in all future generations. If the
transaction is aborted, the element is removed from the set before it
becomes active.
When removing an element, it gets marked as inactive in the next generation.
On commit the next generation becomes active and the therefor the element
inactive. It is then taken out of then set and released. On abort, the
element is marked as active for the next generation again.
Lookups ignore elements not active in the current generation.
The current set types (hash/rbtree) both use a field in the extension area
to store the generation mask. This (currently) does not require any
additional memory since we have some free space in there.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add some helper functions for building the genmask as preparation for
set transactions.
Also add a little documentation how this stuff actually works.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Return the extension area from the ->lookup() function to allow to
consolidate common actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With the conversion to set extensions, it is now possible to consolidate
the different set element destruction functions.
The set implementations' ->remove() functions are changed to only take
the element out of their internal data structures. Elements will be freed
in a batched fashion after the global transaction's completion RCU grace
period.
This reduces the amount of grace periods required for nft_hash from N
to zero additional ones, additionally this guarantees that the set
elements' extensions of all implementations can be used under RCU
protection.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy says:
====================
tipc: some improvements and fixes
We introduce a better algorithm for selecting when and which
users should be subject to link congestion control, plus clean
up some code for that mechanism.
Commit #3 fixes another rare race condition during packet reception.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Despite recent improvements, the establishment of dual parallel
links still has a small glitch where messages can bypass each
other. When the second link in a dual-link configuration is
established, part of the first link's traffic will be steered over
to the new link. Although we do have a mechanism to ensure that
packets sent before and after the establishment of the new link
arrive in sequence to the destination node, this is not enough.
The arriving messages will still be delivered upwards in different
threads, something entailing a risk of message disordering during
the transition phase.
To fix this, we introduce a synchronization mechanism between the
two parallel links, so that traffic arriving on the new link cannot
be added to its input queue until we are guaranteed that all
pre-establishment messages have been delivered on the old, parallel
link.
This problem seems to always have been around, but its occurrence is
so rare that it has not been noticed until recent intensive testing.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the recent changes in message importance handling it becomes
possible to simplify handling of messages and sockets when we
encounter link congestion.
We merge the function tipc_link_cong() into link_schedule_user(),
and simplify the code of the latter. The code should now be
easier to follow, especially regarding return codes and handling
of the message that caused the situation.
In case the scheduling function is unable to pre-allocate a wakeup
message buffer, it now returns -ENOBUFS, which is a more correct
code than the previously used -EHOSTUNREACH.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Master change notifications may occur other than when joining or
leaving a bridge, for example when being added to or removed from
a bond or Open vSwitch. In that case, do nothing instead of asking
the switch driver to remove a port from a bridge that it didn't join.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The set implementations' private struct will only contain the elements
needed to maintain the search structure, all other elements are moved
to the set extensions.
Element allocation and initialization is performed centrally by
nf_tables_api instead of by the different set implementations'
->insert() functions. A new "elemsize" member in the set ops specifies
the amount of memory to reserve for internal usage. Destruction
will also be moved out of the set implementations by a following patch.
Except for element allocation, the patch is a simple conversion to
using data from the extension area.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add simple set extension infrastructure for maintaining variable sized
and optional per element data.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A following patch will convert sets to use so called set extensions,
where the key is not located in a fixed position anymore. This will
require rhashtable hashing and comparison callbacks to be used.
As preparation, convert nft_hash to use these callbacks without any
functional changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Improve readability by indenting the parameter initialization.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Following patches will add new private members, restore struct nft_hash
as preparation.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nftables sets will be converted to use so called setextensions, moving
the key to a non-fixed position. To hash it, the obj_hashfn must be used,
however it so far doesn't receive the length parameter.
Pass the key length to obj_hashfn() and convert existing users.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change type from unsigned long to int to fix an issue reported by kbuild robot:
crypto/algif_skcipher.c:596 skcipher_recvmsg_async() warn: unsigned 'used' is
never less than zero.
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a node joins a cluster while we are transmitting a fragment
stream over the broadcast link, it's missing the preceding fragments
needed to build a meaningful message. As a result, the node has to
drop it. However, as the fragment message is not acknowledged to
its sender before it's dropped, it accidentally causes link reset
of retransmission failure on the node.
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The irqclass_sub_desc array and enum interruption_class are out of sync
thus /proc/interrupts is broken. Remove IRQIO_CLW.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the declaration for external variables to sctp.h file avoiding
to repeatedly declare them with extern keyword.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using smp_processor_id() triggers warnings with PREEMPT_RCU. There is no
point in disabling preemption since we only collect the numeric value,
so use raw_smp_processor_id() instead.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As described by 4017a7e ("netfilter: restore rule tracing via
nfnetlink_log"), this accidentally slipped through during conflict
resolution in d5c1d8c.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the #defines where appropriate.
Miscellanea:
Add explicit #include <linux/kernel.h> where it was not
previously used so that these #defines are a bit more
explicitly defined instead of indirectly included via:
module.h->moduleparam.h->kernel.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The network namespace is only needed for base chains to get at the
gencursor. Also convert to possible_net_t.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>