In gre6 xmit path, we are sending a GRE packet, so set fl6 proto
to IPPROTO_GRE properly.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creat an ip6gretap interface with an unreachable route,
the MTU is about 14 bytes larger than what was needed.
If the remote address is reachable:
ping6 2001:0:130::1 -c 2
PING 2001:0:130::1(2001:0:130::1) 56 data bytes
64 bytes from 2001:0:130::1: icmp_seq=1 ttl=64 time=1.46 ms
64 bytes from 2001:0:130::1: icmp_seq=2 ttl=64 time=81.1 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we find a socket with encapsulation enabled we should call
the encap_recv function even if just a udp header without payload is
available. The callbacks are responsible for correctly verifying and
dropping the packets.
Also, in case the header validation fails for geneve and vxlan we
shouldn't put the skb back into the socket queue, no one will pick
them up there. Instead we can simply discard them in the respective
encap_recv functions.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the same issue we had for IPv4 where enabling GRE with
an inner checksum cannot be supported with FOU/GUE due to the fact that
they will jump past the GRE header at it is treated like a tunnel header.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since iptunnel_handle_offloads() is called in all paths we can
probably drop the block in ip6_tnl_xmit that was checking for
skb->encapsulation and resetting the inner headers.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to set dev features, use same values that are used in GREv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a new fou6 module that provides encapsulation
operations for IPv6.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions
for getting encap hlen, setting up encap on a tunnel, performing
encapsulation operation.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing foo-over-UDP, UDP packets are processed by the
encapsulation handler which returns another protocol to process.
This may result in processing two (or more) protocols in the
loop that are marked as INET6_PROTO_FINAL. The actions taken
for hitting a final protocol, in particular the skb_postpull_rcsum
can only be performed once.
This patch set adds a check of a final protocol has been seen. The
rules are:
- If the final protocol has not been seen any protocol is processed
(final and non-final). In the case of a final protocol, the final
actions are taken (like the skb_postpull_rcsum)
- If a final protocol has been seen (e.g. an encapsulating UDP
header) then no further non-final protocols are allowed
(e.g. extension headers). For more final protocols the
final actions are not taken (e.g. skb_postpull_rcsum).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_input_finish the nexthdr protocol is retrieved from the
next header offset that is returned in the cb of the skb.
This method does not work for UDP encapsulation that may not
even have a concept of a nexthdr field (e.g. FOU).
This patch checks for a final protocol (INET6_PROTO_FINAL) when a
protocol handler returns > 0. If the protocol is not final then
resubmission is performed on nhoff value. If the protocol is final
then the nexthdr is taken to be the return value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
NETIF_F_GSO_IPXIP6. These are used to described IP in IP
tunnel and what the outer protocol is. The inner protocol
can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
are removed (these are both instances of SKB_GSO_IPXIP4).
SKB_GSO_IPXIP6 will be used when support for GSO with IP
encapsulation over IPv6 is added.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In several gso_segment functions there are checks of gso_type against
a seemingly arbitrary list of SKB_GSO_* flags. This seems like an
attempt to identify unsupported GSO types, but since the stack is
the one that set these GSO types in the first place this seems
unnecessary to do. If a combination isn't valid in the first
place that stack should not allow setting it.
This is a code simplication especially for add new GSO types.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_hdr() is slightly more expensive than using skb->data in contexts
where we know they point to the same byte.
In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
as tcp header has not been pulled yet.
In output path, the same can be said when we just pushed the tcp header
in the skb, in tcp_transmit_skb() and tcp_make_synack()
Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
so that compiler can further optimize and avoid a reload.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__sock_cmsg_send() might return different error codes, not only -EINVAL.
Fixes: 24025c465f ("ipv4: process socket-level control messages in IPv4")
Fixes: ad1e46a837 ("ipv6: process socket-level control messages in IPv6")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.
The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.
Conflicts:
drivers/net/wireless/intel/iwlwifi/mvm/tx.c
net/ipv4/ip_gre.c
net/netfilter/nf_conntrack_core.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when creating or updating a route, no check is performed
in both ipv4 and ipv6 code to the hoplimit value.
The caller can i.e. set hoplimit to 256, and when such route will
be used, packets will be sent with hoplimit/ttl equal to 0.
This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
ipv6 route code, substituting any value greater than 255 with 255.
This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing an OpenStack configuration using VXLANs I saw the following
call trace:
RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
RSP: 0018:ffff88103867bc50 EFLAGS: 00010286
RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
Stack:
ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
Call Trace:
[<ffffffff81576515>] ? skb_checksum+0x35/0x50
[<ffffffff815733c0>] ? skb_push+0x40/0x40
[<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
[<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
[<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
[<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
[<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
[<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
[<ffffffff815899d0>] net_rx_action+0x160/0x380
[<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
[<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
[<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
[<ffffffff8109a400>] ? sort_range+0x30/0x30
[<ffffffff81096da8>] kthread+0xd8/0xf0
[<ffffffff81693c82>] ret_from_fork+0x22/0x40
[<ffffffff81096cd0>] ? kthread_park+0x60/0x60
The following trace is seen when receiving a DHCP request over a flow-based
VXLAN tunnel. I believe this is caused by the metadata dst having a NULL
dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
To resolve this I am replacing the check for skb_dst(skb)->dev with just
skb->dev. This makes sense as the callers of this function are usually in
the receive path and as such skb->dev should always be populated. In
addition other functions in the area where these are called are already
using dev_net(skb->dev) to determine the namespace the UDP packet belongs
in.
Fixes: 63058308cd ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with WCCP in gre6 tunnel, it sets the wrong tpi->protocol,
that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not include attribute IFLA_GRE_TOS.
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.
This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.
dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocol for 4in6 tunnel is IPPROTO_IPIP. This was wrongly changed by
the last cleanup.
CC: Tom Herbert <tom@herbertland.com>
Fixes: 0d3c703a9d ("ipv6: Cleanup IPv6 tunnel receive path")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The handler 'ila_fill_encap_info' adds two attributes: ILA_ATTR_LOCATOR
and ILA_ATTR_CSUM_MODE.
nla_total_size_64bit() must be use for ILA_ATTR_LOCATOR.
Also, do nla_put_u8 instead of nla_put_u64 for ILA_ATTR_CSUM_MODE.
Fixes: f13a82d87b ("ipv6: use nla_put_u64_64bit()")
Fixes: 90bfe662db ("ila: add checksum neutral ILA translations")
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix two spots where o_flags in a tunnel are being compared to GRE_SEQ
instead of TUNNEL_SEQ.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Need to use adjusted protocol value for setting inner protocol.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE for IPv6 does not properly translate for GRE flags to tunnel
flags and vice versa. This patch fixes that.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6gre_tnl_link_config set t->tun_len and t->hlen correctly for the
configuration. For hard_header_len and mtu calculation include
IPv6 header and encapsulation overhead.
In ip6gre_tunnel_init_common set t->tun_len and t->hlen correctly for
the configuration. Revert to setting hard_header_len instead of
needed_headroom.
Tested:
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225
Gives MTU of 1434. That is equal to 1500 - 40 - 14 - 4 - 8.
./ip link add name tun8 type ip6gretap remote \
2401:db00:20:911a:face:0:27:0 local \
2401:db00:20:911a:face:0:25:0 ttl 225 okey 123
Gives MTU of 1430. That is equal to 1500 - 40 - 14 - 4 - 8 - 4.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move l3mdev_rt6_dst_by_oif and l3mdev_get_saddr to l3mdev.c. Collapse
l3mdev_get_rt6_dst into l3mdev_rt6_dst_by_oif since it is the only
user and keep the l3mdev_get_rt6_dst name for consistency with other
hooks.
A follow-on patch adds more code to these functions making them long
for inlined functions.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.
Several updates for IPVS from Marco Angaroni:
1) Allow SIP connections originating from real-servers to be load
balanced by the SIP persistence engine as is already implemented
in the other direction.
2) Release connections immediately for One-packet-scheduling (OPS)
in IPVS, instead of making it via timer and rcu callback.
3) Skip deleting conntracks for each one packet in OPS, and don't call
nf_conntrack_alter_reply() since no reply is expected.
4) Enable drop on exhaustion for OPS + SIP persistence.
Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:
5) Move conntrack generation counter out of conntrack pernet structure
since this is only used by the init_ns to allow hash resizing.
6) Use get_random_once() from packet path to collect hash random seed
instead of our compound.
7) Don't disable BH from ____nf_conntrack_find() for statistics,
use NF_CT_STAT_INC_ATOMIC() instead.
8) Fix lookup race during conntrack hash resizing.
9) Introduce clash resolution on conntrack insertion for connectionless
protocol.
Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:
11) Use a single conntrack hashtable for all namespaces. Include netns
in object comparisons and make it part of the hash calculation.
Adapt early_drop() to consider netns.
12) Use single expectation and NAT hashtable for all namespaces.
13) Use a single slab cache for all namespaces for conntrack objects.
14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
conntrack counter tells us the table is empty (ie. equals zero).
Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.
15) Parse element flags from element deletion path and pass it up to the
backend set implementation.
16) Allow adjacent intervals in the rbtree set type for dynamic interval
updates.
17) Add support to set connlabel from nf_tables, from Florian Westphal.
18) Allow set names up to 32 bytes in nf_tables.
Several x_tables fixes and updates:
19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
from Andrzej Hajda.
And finally, miscelaneous netfilter updates such as:
20) Disable automatic helper assignment by default. Note this proc knob
was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
disable automatic helper assignment") 4 years ago to start moving
towards explicit conntrack helper configuration via iptables CT
target.
21) Get rid of obsolete and inconsistent debugging instrumentation
in x_tables.
22) Remove unnecessary check for null after ip6_route_output().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tables have to exist for VRFs to function. Ensure they exist
when VRF device is created.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Responses for packets to unused ports are getting lost with L3 domains.
IPv4 has ip_send_unicast_reply for sending TCP responses which accounts
for L3 domains; update the IPv6 counterpart tcp_v6_send_response.
For icmp the L3 master check needs to be moved up in icmp6_send
to properly respond to UDP packets to a port with no listener.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dprintf() and duprintf() functions are enabled at compile time,
these days we have better runtime debugging through pr_debug() and
static keys.
On top of this, this debugging is so old that I don't expect anyone
using this anymore, so let's get rid of this.
IP_NF_ASSERT() is still left in place, although this needs that
NETFILTER_DEBUG is enabled, I think these assertions provide useful
context information when reading the code.
Note that ARP_NF_ASSERT() has been removed as there is no user of
this.
Kill also DEBUG_ALLOW_ALL and a couple of pr_error() and pr_debug()
spots that are inconsistently placed in the code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:
net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!
This adds a Kconfig dependency to prevent that now invalid
configuration.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.
In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e
("gre: do not pull header in ICMP error processing") for details.
This patch reduces the conflict between the mentioned commit and commit
95f5c64c3c ("gre: Move utility functions to common headers").
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
Fixes: 65d7ab8de5 ("net: Identifier Locator Addressing module")
CC: Tom Herbert <tom@herbertland.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes in GREv6 transmit path:
- Call gre_checksum, remove gre6_checksum
- Rename ip6gre_xmit2 to __gre6_xmit
- Call gre_build_header utility function
- Call ip6_tnl_xmit common function
- Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few generic changes to generalize tunnels in IPv6:
- Export ip6_tnl_change_mtu so that it can be called by ip6_gre
- Add tun_hlen to ip6_tnl structure.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
users like GRE will be able to call this. The original ip6_tnl_xmit
function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
function).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
- Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
most of the code in ip6gre_rcv.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some basic changes to make IPv6 tunnel receive path look more like
IPv4 path:
- Make ip6_tnl_rcv non-static so that GREv6 and others can call it
- Make ip6_tnl_rcv look like ip_tunnel_rcv
- Switch to gro_cells_receive
- Make ip6_tnl_rcv non-static and export it
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.
We need to use appropriate SNMP accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.
Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()
Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a forward-port of the original patch from Andrzej Hajda,
he said:
"IS_ERR_VALUE should be used only with unsigned long type.
Otherwise it can work incorrectly. To achieve this function
xt_percpu_counter_alloc is modified to return unsigned long,
and its result is assigned to temporary variable to perform
error checking, before assigning to .pcnt field.
The patch follows conclusion from discussion on LKML [1][2].
[1]: http://permalink.gmane.org/gmane.linux.kernel/2120927
[2]: http://permalink.gmane.org/gmane.linux.kernel/2150581"
Original patch from Andrzej is here:
http://patchwork.ozlabs.org/patch/582970/
This patch has clashed with input validation fixes for x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Rename IP6_UPD_PO_STATS_BH() to __IP6_UPD_PO_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename IP6_INC_STATS_BH() to __IP6_INC_STATS()
and IP6_ADD_STATS_BH() to __IP6_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ICMP6_INC_STATS_BH() to __ICMP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.
After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.
We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()
Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor overlapping changes in the conflicts.
In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 3bfd847203 ("net: Use passed in table for nexthop lookups")
for IPv4, if the route spec contains a table id use that to lookup the
next hop first and fall back to a full lookup if it fails (per the fix
4c9bcd1179 ("net: Fix nexthop lookups")).
Example:
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
RTNETLINK answers: No route to host
Route add fails even though 2100:1::64 is a reachable next hop:
root@kenny:~# ping6 -I red 2100:1::64
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::64(2100:1::64) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::64: icmp_seq=1 ttl=64 time=1.33 ms
With this patch:
root@kenny:~# ip -6 ro add table red 2100:3::/64 via 2100:1::64
root@kenny:~# ip -6 ro ls table red
local 2100:1::1 dev lo proto none metric 0 pref medium
2100:1::/120 dev eth1 proto kernel metric 256 pref medium
local 2100:2::1 dev lo proto none metric 0 pref medium
2100:2::/120 dev eth2 proto kernel metric 256 pref medium
2100:3::/64 via 2100:1::64 dev eth1 metric 1024 pref medium
local fe80::e0:f9ff:fe09:3cac dev lo proto none metric 0 pref medium
local fe80::e0:f9ff:fe1c:b974 dev lo proto none metric 0 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
fe80::/64 dev eth2 proto kernel metric 256 pref medium
ff00::/8 dev red metric 256 pref medium
ff00::/8 dev eth1 metric 256 pref medium
ff00::/8 dev eth2 metric 256 pref medium
unreachable default dev lo metric 240 error -113 pref medium
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was a simple idea -- save IPv6 configured addresses on a link down
so that IPv6 behaves similar to IPv4. As always the devil is in the
details and the IPv6 stack as too many behavioral differences from IPv4
making the simple idea more complicated than it needs to be.
The current implementation for keeping IPv6 addresses can panic or spit
out a warning in one of many paths:
1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
rt6_fill_node while handling a route dump request.
2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
fib6_del
3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
The root cause of all these is references related to the host route for
an address that is retained.
So, this patch deletes the host route every time the ifdown loop runs.
Since the host route is deleted and will be re-generated an up there is
no longer a need for the l3mdev fix up. On the 'admin up' side move
addrconf_permanent_addr into the NETDEV_UP event handling so that it
runs only once versus on UP and CHANGE events.
All of the current panics and warnings appear to be related to
addresses on the loopback device, but given the catastrophic nature when
a bug is triggered this patch takes the conservative approach and evicts
all host routes rather than trying to determine when it can be re-used
and when it can not. That can be a later optimizaton if desired.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 841645b5f2.
Ok, this puts the feature back. I've decided to apply David A.'s
bug fix and run with that rather than make everyone wait another
whole release for this feature.
Signed-off-by: David S. Miller <davem@davemloft.net>
Support checksum neutral ILA as described in the ILA draft. The low
order 16 bits of the identifier are used to contain the checksum
adjustment value.
The csum-mode parameter is added to described checksum processing. There
are three values:
- adjust transport checksum (previous behavior)
- do checksum neutral mapping
- do nothing
On output the csum-mode in the ila_params is checked and acted on. If
mode is checksum neutral mapping then to mapping and set C-bit.
On input, C-bit is checked. If it is set checksum-netural mapping is
done (regardless of csum-mode in ila params) and C-bit will be cleared.
If it is not set then action in csum-mode is taken.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change model of xlat to be used only for input where lookup is done on
the locator part of an address (comparing to locator_match as key
in rhashtable). This is needed for checksum neutral translation
which obfuscates the low order 16 bits of the identifier. It also
permits hosts to be in muliple ILA domains (each locator can map
to a different SIR address). A check is also added to disallow
translating non-ILA addresses (check of type in identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add structures for identifiers, locators, and an ila address which
is composed of a locator and identifier and in6_addr can be cast to
it. This includes a three bit type field and enums for the types defined
in ILA I-D.
In ILA lwt don't allow user to set a translation for a non-ILA
address (type of identifier is zero meaning it is an IID). This also
requires that the destination prefix is at least 65 bytes (64
bit locator and first byte of identifier).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the following three commits:
70af921db6799977d9aaf1705ec197
The feature was ill conceived, has terrible semantics, and has added
nothing but regressions to the already fragile ipv6 stack.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_output() will never return a NULL pointer, so there's no need
to check it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:
1) Ensure we don't go over the ruleset blob boundaries in
mark_source_chains().
2) Validate that target jumps land on an existing xt_entry. This extra
sanitization comes with a performance penalty when loading the ruleset.
3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.
4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.
5) Make sure the minimal possible target size in x_tables.
6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.
7) Check that standard target size is valid.
8) More sanitization to ensure that the target_offset field is correct.
9) Add xt_check_entry_match() to validate that matches are well-formed.
10-12) Three patch to reduce the number of parameters in
translate_compat_table() for {arp,ip,ip6}tables by using a container
structure.
13) No need to return value from xt_compat_match_from_user(), so make
it void.
14) Consolidate translate_table() so it can be used by compat code too.
15) Remove obsolete check for compat code, so we keep consistent with
what was already removed in the native layout code (back in 2007).
16) Get rid of target jump validation from mark_source_chains(),
obsoleted by #2.
17) Introduce xt_copy_counters_from_user() to consolidate counter
copying, and use it from {arp,ip,ip6}tables.
18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
functions.
19) Move nf_connlabel_match() to xt_connlabel.
20) Skip event notification if connlabel did not change.
21) Update of nf_connlabels_get() to make the upcoming nft connlabel
support easier.
23) Remove spinlock to read protocol state field in conntrack.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
Separated from previous patch for readability.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Struct ctl_table_header holds pointer to sysctl table which could be used
for freeing it after unregistration. IPv4 sysctls already use that.
Remove redundant NULL assignment: ndev allocated using kzalloc.
This also saves some bytes: sysctl table could be shorter than
DEVCONF_MAX+1 if some options are disable in config.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code borrowed from bits and pieces of other protocols to
the IPv6 GRE path so that we can support GSO over IPv6 based GRE tunnels.
By adding this support we are able to significantly improve the throughput
for GRE tunnels as we are able to make use of GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the basic offloads we support on most devices.
Specifically with this patch set we can support checksum offload, basic
scatter-gather, and highdma.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we were creating an ip6gretap interface the MTU was about 6 bytes
short of what was needed. It turns out we were not taking the Ethernet
header into account and as a result we were eating into the 8 bytes
reserved for the encap limit.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function. This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.
In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.
We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
In this patch, I chose to not attach a socket to syncookies skb.
Performance is now 5 Mpps instead of 3.2 Mpps.
Following patch will remove last known false sharing in
tcp_rcv_state_process()
Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a release_cb for UDPv6. It does a route lookup
and updates sk->sk_dst_cache if it is needed. It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.
It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.
Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph). It is because a UDP socket can become
connected after sending out some datagrams in un-connected state. or
It can be connected multiple times to different destinations. Hence,
iph may not be related to where sk is currently connected to.
It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect() (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.
For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.
Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium
A simple python code to create a connected UDP socket:
import socket
import errno
HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
try:
data = s.recv(1024)
except socket.error as se:
if se.errno == errno.EMSGSIZE:
pmtu = s.getsockopt(41, 24)
print("PMTU:%d" % pmtu)
break
s.close()
Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300
Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0 metric 0
cache expires 463sec mtu 1300 pref medium
Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message. Here is the scapy script I have
used:
>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()
It will be reused during the pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().
Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
(addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()
This new function will be reused during pmtu update in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header. The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.
With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM
In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch does two things.
First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4. In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.
The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames. Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field. This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.
In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling. Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was. This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1705ec197 added the option to retain user configured addresses on an
admin down. A comment to one of the later revisions suggested using the
IFA_F_PERMANENT flag rather than adding a user_managed boolean to the
ifaddr struct. A side effect of this change is that link local and
loopback addresses are also retained which is not part of the objective
of f1705ec197. Add check to drop those addresses.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original tokenized iid support implemented via f53adae4ea ("net: ipv6:
add tokenized interface identifier support") didn't allow for clearing a
device token as it was intended that this addressing mode was the only one
active for globally scoped IPv6 addresses. Later we relaxed that restriction
via 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses"),
and we should also allow for clearing tokens as there's no good reason why
it shouldn't be allowed.
Fixes: 617fe29d45 ("net: ipv6: only invalidate previously tokenized addresses")
Reported-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The three variants use same copy&pasted code, condense this into a
helper and use that.
Make sure info.name is 0-terminated.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since 'netfilter: x_tables: validate targets of jumps' change we
validate that the target aligns exactly with beginning of a rule,
so offset test is now redundant.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit 9e67d5a739
("[NETFILTER]: x_tables: remove obsolete overflow check") left the
compat parts alone, but we can kill it there as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This looks like refactoring, but its also a bug fix.
Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.
For example, we do not check for underflows and the base chain policies.
While its possible to also add such checks to the compat path, its more
copy&pastry, for instance we cannot reuse check_underflow() helper as
e->target_offset differs in the compat case.
Other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.
At a high level 32 bit compat works like this:
1- initial pass over blob:
validate match/entry offsets, bounds checking
lookup all matches and targets
do bookkeeping wrt. size delta of 32/64bit structures
assign match/target.u.kernel pointer (points at kernel
implementation, needed to access ->compatsize etc.)
2- allocate memory according to the total bookkeeping size to
contain the translated ruleset
3- second pass over original blob:
for each entry, copy the 32bit representation to the newly allocated
memory. This also does any special match translations (e.g.
adjust 32bit to 64bit longs, etc).
4- check if ruleset is free of loops (chase all jumps)
5-first pass over translated blob:
call the checkentry function of all matches and targets.
The alternative implemented by this patch is to drop steps 3&4 from the
compat process, the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.
In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name .
This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.
This has two drawbacks:
1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.
THe upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.
iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003
shows no noticeable differences in restore times:
old: 0m30.796s
new: 0m31.521s
64bit: 0m25.674s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We're currently asserting that targetoff + targetsize <= nextoff.
Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.
We also need the e->elems pointer in a followup change to validate matches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Once we add more sanity testing to xt_check_entry_offsets it
becomes relvant if we're expecting a 32bit 'config_compat' blob
or a normal one.
Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short just fold its contents into the callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.
Unfortunately these checks are not sufficient.
To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When we see a jump also check that the offset gets us to beginning of
a rule (an ipt_entry).
The extra overhead is negible, even with absurd cases.
300k custom rules, 300k jumps to 'next' user chain:
[ plus one jump from INPUT to first userchain ]:
Before:
real 0m24.874s
user 0m7.532s
sys 0m16.076s
After:
real 0m27.464s
user 0m7.436s
sys 0m18.840s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ben Hawkes says:
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Base chains enforce absolute verdict.
User defined chains are supposed to end with an unconditional return,
xtables userspace adds them automatically.
But if such return is missing we will move to non-existent next rule.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>