Pablo Neira Ayuso says:
====================
netfilter updates: nf_tables pull request
The following patchset contains the current original nf_tables tree
condensed in 17 patches. I have organized them by chronogical order
since the original nf_tables code was released in 2009 and by
dependencies between the different patches.
The patches are:
1) Adapt all existing hooks in the tree to pass hook ops to the
hook callback function, required by nf_tables, from Patrick McHardy.
2) Move alloc_null_binding to nf_nat_core, as it is now also needed by
nf_tables and ip_tables, original patch from Patrick McHardy but
required major changes to adapt it to the current tree that I made.
3) Add nf_tables core, including the netlink API, the packet filtering
engine, expressions and built-in tables, from Patrick McHardy. This
patch includes accumulated fixes since 2009 and minor enhancements.
The patch description contains a list of references to the original
patches for the record. For those that are not familiar to the
original work, see [1], [2] and [3].
4) Add netlink set API, this replaces the original set infrastructure
to introduce a netlink API to add/delete sets and to add/delete
set elements. This includes two set types: the hash and the rb-tree
sets (used for interval based matching). The main difference with
ipset is that this infrastructure is data type agnostic. Patch from
Patrick McHardy.
5) Allow expression operation overload, this API change allows us to
provide define expression subtypes depending on the configuration
that is received from user-space via Netlink. It is used by follow
up patches to provide optimized versions of the payload and cmp
expressions and the x_tables compatibility layer, from Patrick
McHardy.
6) Add optimized data comparison operation, it requires the previous
patch, from Patrick McHardy.
7) Add optimized payload implementation, it requires patch 5, from
Patrick McHardy.
8) Convert built-in tables to chain types. Each chain type have special
semantics (filter, route and nat) that are used by userspace to
configure the chain behaviour. The main chain regarding iptables
is that tables become containers of chain, with no specific semantics.
However, you may still configure your tables and chains to retain
iptables like semantics, patch from me.
9) Add compatibility layer for x_tables. This patch adds support to
use all existing x_tables extensions from nf_tables, this is used
to provide a userspace utility that accepts iptables syntax but
used internally the nf_tables kernel core. This patch includes
missing features in the nf_tables core such as the per-chain
stats, default chain policy and number of chain references, which
are required by the iptables compatibility userspace tool. Patch
from me.
10) Fix transport protocol matching, this fix is a side effect of the
x_tables compatibility layer, which now provides a pointer to the
transport header, from me.
11) Add support for dormant tables, this feature allows you to disable
all chains and rules that are contained in one table, from me.
12) Add IPv6 NAT support. At the time nf_tables was made, there was no
NAT IPv6 support yet, from Tomasz Bursztyka.
13) Complete net namespace support. This patch register the protocol
family per net namespace, so tables (thus, other objects contained
in tables such as sets, chains and rules) are only visible from the
corresponding net namespace, from me.
14) Add the insert operation to the nf_tables netlink API, this requires
adding a new position attribute that allow us to locate where in the
ruleset a rule needs to be inserted, from Eric Leblond.
15) Add rule batching support, including atomic rule-set updates by
using rule-set generations. This patch includes a change to nfnetlink
to include two new control messages to indicate the beginning and
the end of a batch. The end message is interpreted as the commit
message, if it's missing, then the rule-set updates contained in the
batch are aborted, from me.
16) Add trace support to the nf_tables packet filtering core, from me.
17) Add ARP filtering support, original patch from Patrick McHardy, but
adapted to fit into the chain type infrastructure. This was recovered
to be used by nft userspace tool and our compatibility arptables
userspace tool.
There is still work to do to fully replace x_tables [4] [5] but that can
be done incrementally by extending our netlink API. Moreover, looking at
netfilter-devel and the amount of contributions to nf_tables we've been
getting, I think it would be good to have it mainstream to avoid accumulating
large patchsets skip continuous rebases.
I tried to provide a reasonable patchset, we have more than 100 accumulated
patches in the original nf_tables tree, so I collapsed many of the small
fixes to the main patch we had since 2009 and provide a small batch for
review to netdev, while trying to retain part of the history.
For those who didn't give a try to nf_tables yet, there's a quick howto
available from Eric Leblond that describes how to get things working [6].
Comments/reviews welcome.
Thanks!
[1] http://lwn.net/Articles/324251/
[2] http://workshop.netfilter.org/2013/wiki/images/e/ee/Nftables-osd-2013-developer.pdf
[3] http://lwn.net/Articles/564095/
[4] http://people.netfilter.org/pablo/map-pending-work.txt
[4] http://people.netfilter.org/pablo/nftables-todo.txt
[5] https://home.regit.org/netfilter-en/nftables-quick-howto/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a comment mentioning IRQF_DISABLED,
which is deprecated.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small code cleanup:
1. change MLX4_DEV_CAP_FLAGS2_REASSIGN_MAC_EN to MLX4_DEV_CAP_FLAG2_REASSIGN_MAC_EN
2. put MLX4_SET_PORT_PRIO2TC and MLX4_SET_PORT_SCHEDULER in the same union with the
other MLX4_SET_PORT_yyy
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for tracing the packet travel through
the ruleset, in a similar fashion to x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a batch support to nfnetlink. Basically, it adds
two new control messages:
* NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
the nfgenmsg->res_id indicates the nfnetlink subsystem ID.
* NFNL_MSG_BATCH_END, that results in the invocation of the
ss->commit callback function. If not specified or an error
ocurred in the batch, the ss->abort function is invoked
instead.
The end message represents the commit operation in nftables, the
lack of end message results in an abort. This patch also adds the
.call_batch function that is only called from the batch receival
path.
This patch adds atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.
The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:
00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
^
gencursor
Once you invoke the transition to the next generation, the global
gencursor is updated:
00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor
If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.
This new operation can be used from the user-space utility
that controls the firewall, eg.
nft -f restore
The rule updates contained in `file' will be applied atomically.
cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop #2
-EOF-
Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.
There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the commit list that
contain rules that require updates is finished.
Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.
This patch squashed the following patches from Pablo:
* nf_tables: atomic rule updates and dumps
* nf_tables: get rid of per rule list_head for commits
* nf_tables: use per netns commit list
* nfnetlink: add batch support and use it from nf_tables
* nf_tables: all rule updates are transactional
* nf_tables: attach replacement rule after stale one
* nf_tables: do not allow deletion/replacement of stale rules
* nf_tables: remove unused NFTA_RULE_FLAGS
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new rule attribute NFTA_RULE_POSITION which is
used to store the position of a rule relatively to the others.
By providing the create command and specifying the position, the
rule is inserted after the rule with the handle equal to the
provided position.
Regarding notification, the position attribute specifies the
handle of the previous rule to make sure we don't point to any
stale rule in notifications coming from the commit path.
This patch includes the following fix from Pablo:
* nf_tables: fix rule deletion event reporting
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch generalizes the NAT expression to support both IPv4 and IPv6
using the existing IPv4/IPv6 NAT infrastructure. This also adds the
NAT chain type for IPv6.
This patch collapses the following patches that were posted to the
netfilter-devel mailing list, from Tomasz:
* nf_tables: Change NFTA_NAT_ attributes to better semantic significance
* nf_tables: Split IPv4 NAT into NAT expression and IPv4 NAT chain
* nf_tables: Add support for IPv6 NAT expression
* nf_tables: Add support for IPv6 NAT chain
* nf_tables: Fix up build issue on IPv6 NAT support
And, from Pablo Neira Ayuso:
* fix missing dependencies in nft_chain_nat
Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to temporarily disable an entire table.
You can change the state of a dormant table via NFT_MSG_NEWTABLE
messages. Using this operation you can wake up a table, so their
chains are registered.
This provides atomicity at chain level. Thus, the rule-set of one
chain is applied at once, avoiding any possible intermediate state
in every chain. Still, the chains that belongs to a table are
registered consecutively. This also allows you to have inactive
tables in the kernel.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.
This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.
In order to get this compatibility layer working, I've done the
following things:
* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.
* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.
* add support for default policy to base chains, required to emulate
x_tables.
* add NFTA_CHAIN_USE attribute to obtain the number of references to
chains, required by x_tables emulation.
* add chain packet/byte counters using per-cpu.
* support 32-64 bits compat.
For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.
From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled
From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes
From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT
From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.
After this patch, you have to specify the chain type when
creating a new chain:
add chain ip filter output { type filter hook input priority 0; }
^^^^ ------
The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add an optimized payload expression implementation for small (up to 4 bytes)
aligned data loads from the linear packet area.
This patch also includes original Patrick McHardy's entitled (nf_tables:
inline nft_payload_fast_eval() into main evaluation loop).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add an optimized version of nft_data_cmp() that only handles values of to
4 bytes length.
This patch includes original Patrick McHardy's patch entitled (nf_tables:
inline nft_cmp_fast_eval() into main evaluation loop).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.
This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:
- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets
- addition of set elements
- removal of set elements
- dumping of all set elements
Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.
Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.
Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.
Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.
The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.
Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.
This patch includes the new 'lookup' expression to look up for element
matching in the set.
This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.
From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection
From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.
This is an adapted version of the original patch from Patrick McHardy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In commit 634fb979e8 ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have same byte order :
skc_dport is __be16 (network order), but skc_num is __u16 (host order)
So sparse complains because ir_loc_port (mapped into skc_num) is
considered as __u16 while it should be __be16
Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
and perform appropriate htons/ntohs conversions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 5 :
We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.
This patch includes the needed struct sock_common in front
of struct request_sock
This means there is no more inet6_request_sock IPv6 specific
structure.
Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.
loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CONFIG_IPV6=n is still a valid choice ;)
It appears we can remove dead code.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common
Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.
inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.
We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received. This has introduced a performance regression for some
UDP workloads.
This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.
Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After 90587.62 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After 12.48us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.
For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.
For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups. Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.
Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After 89789.68 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After 12.63us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/brcm80211/brcmfmac/dhd_bus.h
drivers/net/wireless/rtlwifi/rtl8188ee/phy.h
drivers/net/wireless/rtlwifi/rtl8192ce/phy.h
drivers/net/wireless/rtlwifi/rtl8192de/phy.h
drivers/net/wireless/rtlwifi/rtl8723ae/phy.h
Just some minor conflicts between the wireless-next changes
and Joe Perches's "extern" removal from function prototypes
in header files.
John W. Linville says:
====================
Regarding the Bluetooth bits, Gustavo says:
"The big work here is from Marcel and Johan. They did a lot of work
in the L2CAP, HCI and MGMT layers. The most important ones are the
addition of a new MGMT command to enable/disable LE advertisement
and the introduction of the HCI user channel to allow applications
to get directly and exclusive access to Bluetooth devices."
As to the ath10k bits, Kalle says:
"Bartosz dropped support for qca98xx hw1.0 hardware from ath10k, it's
just too much to support it. Michal added support for the new firmware
interface. Marek fixed WEP in AP and IBSS mode. Rest of the changes are
minor fixes or cleanups."
And also:
"Major changes are:
* throughput improvements including aligning the RX frames correctly and
optimising HTT layer (Michal)
* remove qca98xx hw1.0 support (Bartosz)
* add support for firmware version 999.999.0.636 (Michal)
* firmware htt statistics support (Kalle)
* fix WEP in AP and IBSS mode (Marek)
* fix a mutex unlock balance in debugfs file (Shafi)
And of course there's a lot of smaller fixes and cleanup."
For the wl12xx bits, Luca says:
"Here are some patches intended for 3.13. Eliad is upstreaming a bunch
of patches that have been pending in the internal tree. Mostly bugfixes
and other small improvements."
Along with that...
Arend and friends bring us a batch of brcmfmac updates, Larry Finger
offers some rtlwifi refactoring, and Sujith sends the usual batch of
ath9k updates. As usual, there are a number of other small updates
from a variety of players as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown. Resulting in subtle memory
corruption.
This is the second bug from sharing the storage between the close_list
and the unreg_list. The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.
v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio wants to pass in cpumask_of(cpu), make parameter
const to avoid build warnings.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter updates for your net-next tree,
mostly ipset improvements and enhancements features, they are:
* Don't call ip_nest_end needlessly in the error path from me, suggested
by Pablo Neira Ayuso, from Jozsef Kadlecsik.
* Fixed sparse warnings about shadowed variable and missing rcu annotation
and fix of "may be used uninitialized" warnings, also from Jozsef.
* Renamed simple macro names to avoid namespace issues, reported by David
Laight, again from Jozsef.
* Use fix sized type for timeout in the extension part, and cosmetic
ordering of matches and targets separatedly in xt_set.c, from Jozsef.
* Support package fragments for IPv4 protos without ports from Anders K.
Pedersen. For example this allows a hash:ip,port ipset containing the
entry 192.168.0.1,gre:0 to match all package fragments for PPTP VPN
tunnels to/from the host. Without this patch only the first package
fragment (with fragment offset 0) was matched.
* Introduced a new operation to get both setname and family, from Jozsef.
ip[6]tables set match and SET target need to know the family of the set
in order to reject adding rules which refer to a set with a non-mathcing
family. Currently such rules are silently accepted and then ignored
instead of generating an error message to the user.
* Reworked extensions support in ipset types from Jozsef. The approach of
defining structures with all variations is not manageable as the
number of extensions grows. Therefore a blob for the extensions is
introduced, somewhat similar to conntrack. The support of extensions
which need a per data destroy function is added as well.
* When an element timed out in a list:set type of set, the garbage
collector skipped the checking of the next element. So the purging
was delayed to the next run of the gc, fixed by Jozsef.
* A small Kconfig fix: NETFILTER_NETLINK cannot be selected and
ipset requires it.
* hash:net,net type from Oliver Smith. The type provides the ability to
store pairs of subnets in a set.
* Comment for ipset entries from Oliver Smith. This makes possible to
annotate entries in a set with comments, for example:
ipset n foo hash:net,net comment
ipset a foo 10.0.0.0/21,192.168.1.0/24 comment "office nets A and B"
* Fix of hash types resizing with comment extension from Jozsef.
* Fix of new extensions for list:set type when an element is added
into a slot from where another element was pushed away from Jozsef.
* Introduction of a common function for the listing of the element
extensions from Jozsef.
* Net namespace support for ipset from Vitaly Lavrov.
* hash:net,port,net type from Oliver Smith, which makes possible
to store the triples of two subnets and a protocol, port pair in
a set.
* Get xt_TCPMSS working with net namespace, by Gao feng.
* Use the proper net netnamespace to allocate skbs, also by Gao feng.
* A couple of cleanups for the conntrack SIP helper, by Holger
Eitzenberger.
* Extend cttimeout to allow setting default conntrack timeouts via
nfnetlink, so we can get rid of all our sysctl/proc interfaces in
the future for timeout tuning, from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on tcp listener refactoring, I found that it
would really make things easier if sock_common could include
the IPv6 addresses needed in the lookups, instead of doing
very complex games to get their values (depending on sock
being SYN_RECV, ESTABLISHED, TIME_WAIT)
For this to happen, I need to be sure that tcp6_timewait_sock
and tcp_timewait_sock consume same number of cache lines.
This is possible if we only use 32bits for tw_ttd, as we remove
one 32bit hole in inet_timewait_sock
inet_tw_time_stamp() is defined and used, even if its current
implementation looks like tcp_time_stamp : We might need finer
resolution for tcp_time_stamp in the future.
Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8
After patch : sizeof(struct tcp6_timewait_sock) = 0xc0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two new hash policy modes which use skb_flow_dissect:
3 - Encapsulated layer 2+3
4 - Encapsulated layer 3+4
There should be a good improvement for tunnel users in those modes.
It also changes the old hash functions to:
hash ^= (__force u32)flow.dst ^ (__force u32)flow.src;
hash ^= (hash >> 16);
hash ^= (hash >> 8);
Where hash will be initialized either to L2 hash, that is
SRCMAC[5] XOR DSTMAC[5], or to flow->ports which should be extracted
from the upper layer. Flow's dst and src are also extracted based on the
xmit policy either directly from the buffer or by using skb_flow_dissect,
but in both cases if the protocol is IPv6 then dst and src are obtained by
ipv6_addr_hash() on the real addresses. In case of a non-dissectable
packet, the algorithms fall back to L2 hashing.
The bond_set_mode_ops() function is now obsolete and thus deleted
because it was used only to set the proper hash policy. Also we trim a
pointer from struct bonding because we no longer need to keep the hash
function, now there's only a single hash function - bond_xmit_hash that
works based on bond->params.xmit_policy.
The hash function and skb_flow_dissect were suggested by Eric Dumazet.
The layer names were suggested by Andy Gospodarek, because I suck at
semantics.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out the code that extracts the ports from skb_flow_dissect and
add a new function skb_flow_get_ports which can be re-used.
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 2 :
We can use a generic lookup, sockets being in whatever state, if
we are sure all relevant fields are at the same place in all socket
types (ESTABLISH, TIME_WAIT, SYN_RECV)
This patch removes these macros :
inet_addrpair, inet_addrpair, tw_addrpair, tw_portpair
And adds :
sk_portpair, sk_addrpair, sk_daddr, sk_rcv_saddr
Then, INET_TW_MATCH() is really the same than INET_MATCH()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And thus we have only one function definition
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal sent patch to add tc user simple actions to iproute2
but required header was not being exported.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a function to provide the phy address which should be used to the
Gigabit Ethernet driver connected to ssb.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Reviewed-by: Nithin Nayak Sujir <nsujir@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Slave Page Response Timeout event indicates to the Host that a
slave page response timeout has occurred in the BR/EDR Controller.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 110
"7.7.72 Slave Page Response Timeout Event [New Section]
...
Note: this event will be generated if the slave BR/EDR Controller
responds to a page but does not receive the master FHS packet
(see Baseband, Section 8.3.3) within pagerespTO.
Event Parameters: NONE"
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Synchronization Train Complete event indicates that the Start
Synchronization Train command has completed.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 103
"7.7.67 Synchronization Train Complete Event [New Section]
...
Event Parameters:
Status 0x00 Start Synchronization Train command completed
successfully.
0x01-0xFF Start Synchronization Train command failed.
See Part D, Error Codes, for error codes and
descriptions."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Start_Synchronization_Train command controls the Synchronization
Train functionality in the BR/EDR Controller.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 86
"7.1.51 Start Synchronization Train Command [New Section]
...
If connectionless slave broadcast mode is not enabled, the Command
Disallowed (0x0C) error code shall be returned. After receiving this
command and returning a Command Status event, the Baseband starts
attempting to send synchronization train packets containing information
related to the enabled Connectionless Slave Broadcast packet timing.
Note: The AFH_Channel_Map used in the synchronization train packets is
configured by the Set_AFH_Channel_Classification command and the local
channel classification in the BR/EDR Controller.
The synchronization train packets will be sent using the parameters
specified by the latest Write_Synchronization_Train_Parameters command.
The Synchronization Train will continue until synchronization_trainTO
slots (as specified in the last Write_Synchronization_Train command)
have passed or until the Host disables the Connectionless Slave Broadcast
logical transport."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
he Set_Connectionless_Slave_Broadcast command controls the
Connectionless Slave Broadcast functionality in the BR/EDR
Controller.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 78
"7.1.49 Set Connectionless Slave Broadcast Command [New Section]
...
The LT_ADDR indicated in the Set_Connectionless_Slave_Broadcast shall be
pre-allocated using the HCI_Set_Reserved_LT_ADDR command. If the
LT_ADDR has not been reserved, the Unknown Connection Identifier (0x02)
error code shall be returned. If the controller is unable to reserve
sufficient bandwidth for the requested activity, the Connection Rejected
Due to Limited Resources (0x0D) error code shall be returned.
The LPO_Allowed parameter informs the BR/EDR Controller whether it is
allowed to sleep.
The Packet_Type parameter specifies which packet types are allowed. The
Host shall either enable BR packet types only, or shall enable EDR and DM1
packet types only.
The Interval_Min and Interval_Max parameters specify the range from which
the BR/EDR Controller must select the Connectionless Slave Broadcast
Interval. The selected Interval is returned."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Write_Synchronization_Train_Parameters command configures
the Synchronization Train functionality in the BR/EDR Controller.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 97
"7.3.90 Write Synchronization Train Parameters Command [New Section]
...
Note: The AFH_Channel_Map used in the Synchronization Train packets is
configured by the Set_AFH_Channel_Classification command and the local
channel classification in the BR/EDR Controller.
Interval_Min and Interval_Max specify the allowed range of
Sync_Train_Interval. Refer to [Vol. 2], Part B, section 2.7.2 for
a detailed description of Sync_Train_Interval. The BR/EDR Controller shall
select an interval from this range and return it in Sync_Train_Interval.
If the Controller is unable to select a value from this range, it shall
return the Invalid HCI Command Parameters (0x12) error code.
Once started (via the Start_Synchronization_Train Command) the
Synchronization Train will continue until synchronization_trainTO slots have
passed or Connectionless Slave Broadcast has been disabled."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Set_Connectionless_Slave_Broadcast_Data command provides the
ability for the Host to set Connectionless Slave Broadcast data in
the BR/EDR Controller.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 93
"7.3.88 Set Connectionless Slave Broadcast Data Command [New Section]
...
If connectionless slave broadcast mode is disabled, this data shall be
kept by the BR/EDR Controller and used once connectionless slave broadcast
mode is enabled. If connectionless slave broadcast mode is enabled,
and this command is successful, this data will be sent starting with
the next Connectionless Slave Broadcast instant.
The Data_Length field may be zero, in which case no data needs to be
provided.
The Host may fragment the data using the Fragment field in the command. If
the combined length of the fragments exceeds the capacity of the largest
allowed packet size specified in the Set Connectionless Slave Broadcast
command, all fragments associated with the data being assembled shall be
discarded and the Invalid HCI Command Parameters error (0x12) shall be
returned."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Delete_Reserved_LT_ADDR command requests that the BR/EDR
Controller cancel the reservation for a specific LT_ADDR reserved for the
purposes of Connectionless Slave Broadcast.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 92
"7.3.87 Delete Reserved LT_ADDR Command [New Section]
...
If the LT_ADDR indicated in the LT_ADDR parameter is not reserved by the
BR/EDR Controller, it shall return the Unknown Connection Identifier (0x02)
error code.
If connectionless slave broadcast mode is still active, then the Controller
shall return the Command Disallowed (0x0C) error code."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Set_Reserved_LT_ADDR command allows the host to request that the
BR/EDR Controller reserve a specific LT_ADDR for Connectionless Slave
Broadcast.
The Core Spec Addendum 4 adds this command in part B Connectionless
Slave Broadcast.
Bluetooth Core Specification Addendum 4 - Page 90
"7.3.86 Set Reserved LT_ADDR Command [New Section]
...
If the LT_ADDR indicated in the LT_ADDR parameter is already in use by the
BR/EDR Controller, it shall return the ACL Connection Already Exists (0x0B)
error code. If the LT_ADDR indicated in the LT_ADDR parameter is out of
range, the controller shall return the Invalid HCI Command Parameters (0x12)
error code. If the command succeeds, then the reserved LT_ADDR shall be
used when issuing subsequent Set Connectionless Slave Broadcast Data and
Set Connectionless Slave Broadcast commands.
To ensure that the reserved LT_ADDR is not already allocated, it is
recommended that this command be issued at some point after HCI_Reset is
issued but before page scanning is enabled or paging is initiated."
Signed-off-by: Dohyun Pyun <dh79.pyun@samsung.com>
Signed-off-by: C S Bhargava <cs.bhargava@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
On dual-mode BR/EDR/LE and LE only controllers it is possible
to configure a random address. There are two types or random
addresses, one is static and the other private. Since the
random private addresses require special privacy feature to
be supported, the configuration of these two are kept separate.
This command allows for setting the static random address. It is
only supported on controllers with LE support. The static random
address is suppose to be valid for the lifetime of the controller
or at least until the next power cycle. To ensure such behavior,
setting of the address is limited to when the controller is
powered off.
The special BDADDR_ANY address (00:00:00:00:00:00) can be used to
disable the static address. This is also the default value.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch introduces a new mgmt command for enabling/disabling BR/EDR
functionality. This can be convenient when one wants to make a dual-mode
controller behave like a single-mode one. The command is only available
for dual-mode controllers and requires that LE is enabled before using
it. The BR/EDR setting can be enabled at any point, however disabling it
requires the controller to be powered off (otherwise a "rejected"
response will be sent).
Disabling the BR/EDR setting will automatically disable all other BR/EDR
related settings.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>