If the tokenized ip address is re-set on an interface we depend on the
arrival of a new router advertisment to call addrconf_verify to clean
up the old address (which valid_lft is now set to 0). Old addresses can
linger around for a longer time if e.g. the source of router advertisments
vanishes.
So, call addrconf_verify immediately after setting the new tokenized
address to get rid of the old tokenized addresses.
Cc: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should check the return value of ipv6_get_lladdr in inet6_set_iftoken.
A possible situation, which could leave ll_addr unassigned is, when
the user removed her link-local address but a global scoped address was
already set. In this case the interface would still be IF_READY and not
dead. In that case the RS source address is some value from the stack.
v2: Daniel Borkmann noted a small indent inconstancy; no semantic
changes.
Cc: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reason behind this change is that as soon as we delete
the last ipv6 address of an interface we also lose the
/proc/sys/net/ipv6/conf/<interface> directory. This seems to be a
usability problem for me.
I don't see any reason why we should shutdown ipv6 on that interface in
such cases.
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the timers for duplicate address detection and router
solicitations apart. The router solicitations timer goes into inet6_dev
and the dad timer stays in inet6_ifaddr.
The reason behind this patch is to reduce the number of unneeded router
solicitations send out by the host if additional link-local addresses
are created. Currently we send out RS for every link-local address on
an interface.
If the RS timer fires we pick a source address with ipv6_get_lladdr. This
change could hurt people adding additional link-local addresses and
specifying these addresses in the radvd clients section because we
no longer guarantee that we use every ll address as source address in
router solicitations.
Cc: Flavio Leitner <fleitner@redhat.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Router Alert option is marked in skb.
Previously, IP6CB(skb)->ra was set to positive value for such packets.
Since commit dd3332bf ("ipv6: Store Router Alert option in IP6CB
directly."), IP6SKB_ROUTERALERT is set in IP6CB(skb)->flags, and
the value of Router Alert option (in network byte order) is set
to IP6CB(skb)->ra for such packets.
Multicast forwarding path uses that flag and value, but unicast
forwarding path does not use the flag and misuses IP6CB(skb)->ra
value.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f88c91ddba ("ipv6: statically link
register_inet6addr_notifier()" added following sparse warnings :
net/ipv6/addrconf_core.c:83:5: warning: symbol
'register_inet6addr_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c:89:5: warning: symbol
'unregister_inet6addr_notifier' was not declared. Should it be static?
net/ipv6/addrconf_core.c:95:5: warning: symbol
'inet6addr_notifier_call_chain' was not declared. Should it be static?
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains five fixes for Netfilter/IPVS, they are:
* A skb leak fix in fragmentation handling in case that helpers are in place,
it occurs since the IPV6 NAT infrastructure, from Phil Oester.
* Fix SCTP port mangling in ICMP packets for IPVS, from Julian Anastasov.
* Fix event delivery in ctnetlink regarding the new connlabel infrastructure,
from Florian Westphal.
* Fix mangling in the SIP NAT helper, from Balazs Peter Odor.
* Fix crash in ipt_ULOG introduced while adding netnamespace support,
from Gao Feng.
I'll take care of passing several of these patches to -stable once they hit
Linus' tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is debug info, should at least be pr_debug(), but given
that this code is in upstream for two years, there is no
need to keep this debugging printk any more, so just remove it.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 4cdd3408 ("netfilter: nf_conntrack_ipv6: improve fragmentation
handling"), an sk_buff leak was introduced when dealing with reassembled
packets by grabbing a reference to the original skb instead of the
reassembled skb. At this point, the leak only impacted conntracks with an
associated helper.
In commit 58a317f1 ("netfilter: ipv6: add IPv6 NAT support"), the bug was
expanded to include all reassembled packets with unconfirmed conntracks.
Fix this by grabbing a reference to the proper reassembled skb. This
closes netfilter bugzilla #823.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If we disable all of the net interfaces, and enable
un-lo interface before lo interface, we already allocated
the addrconf dst in ipv6_add_addr. So we shouldn't allocate
it again when we enable lo interface.
Otherwise the message below will be triggered.
unregister_netdevice: waiting for sit1 to become free. Usage count = 1
This problem is introduced by commit 25fb6ca4ed
"net IPv6 : Fix broken IPv6 routing table after loopback down-up"
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of this attribute has been added in 32b8a8e59c (sit: add IPv4 over
IPv4 support). It is optional, by default proto is IPPROTO_IPV6.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Process skb tunnel header before sending packet to protocol handler.
this allows code sharing between gre and ovs gre modules.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor various ip tunnels xmit functions and extend iptunnel_xmit()
so that there is more code sharing.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/Kconfig
drivers/net/xen-netback/netback.c
net/batman-adv/bat_iv_ogm.c
net/wireless/nl80211.c
The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.
The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().
Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.
The nl80211 conflict is a little more involved. In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported. Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.
However, the dump handlers to not use this logic. Instead they have
to explicitly do the locking. There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so. So I fixed that whilst handling the overlapping changes.
To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since some refactoring in 5f5a011, ndisc_send_redirect called
ndisc_fill_redirect_hdr_option on the wrong skb, leading to data corruption or
in the worst case a panic when the skb_put failed.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce the uses of this unnecessary typedef.
Done via perl script:
$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'
Reflow the modified lines that now exceed 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?
net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds low latency socket poll support for TCP.
In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
from the skb to the sk.
In tcp_recvmsg(), when there is no data in the socket we busy-poll.
This is a good example of how to add busy-poll support to more protocols.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add upport for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram When there is no data and the user
tries to read we busy poll.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' bug fixes into 'net-next' as we have patches
that will build on top of them.
This merge commit includes a change from Emil Goode
(emilgoode@gmail.com) that fixes a warning that would
have been introduced by this merge. Specifically it
fixes the pingv6_ops method ipv6_chk_addr() to add a
"const" to the "struct net_device *dev" argument and
likewise update the dummy_ipv6_chk_addr() declaration.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 25fb6ca4ed
"net IPv6 : Fix broken IPv6 routing table after loopback down-up"
forgot to assign rt6_info to the inet6_ifaddr.
When disable the net device, the rt6_info which allocated
in init_loopback will not be destroied in __ipv6_ifa_notify.
This will trigger the waring message below
[23527.916091] unregister_netdevice: waiting for tap0 to become free. Usage count = 1
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The format is based on /proc/net/icmp and /proc/net/{udp,raw}6.
Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}
Couldn't figure out how to test without CONFIG_PROC_FS enabled.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp6_sock_seq_show and raw6_sock_seq_show are identical, except
the UDP version displays ports and the raw version displays the
protocol. Refactor most of the code in these two functions into
a new common ip6_dgram_sock_seq_show function, in preparation
for using it to display ICMPv6 sockets as well.
Also reduce the indentation in parts of include/net/transp_v6.h
to improve readability.
Compiles and displays reasonable results with CONFIG_IPV6={n,m,y}
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of IPv4 over Ipv4 for the module sit. The gain of
this feature is to be able to have 4in4 and 6in4 over the same interface
instead of having one interface for 6in4 and another for 4in4 even if
encapsulation addresses are the same.
To avoid conflicting with ipip module, sit IPv4 over IPv4 protocol is
registered with a smaller priority.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp6 over GRE tunnel does not work after to GRE tso changes. GRE
tso handler passes inner packet but keeps track of outer header
start in SKB_GSO_CB(skb)->mac_offset. udp6 fragment need to
take care of outer header, which start at the mac_offset, while
adding fragment header.
This bug is introduced by commit 68c3316311 (GRE: Add TCP
segmentation offload for GRE).
Reported-by: Dmitry Kravkov <dkravkov@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Tested-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This stat is not relevant in IPv6, there is no checksum in IPv6 header.
Just leave a comment to explain the hole.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 351638e7de (net: pass info struct via netdevice notifier)
breaks booting of my KVM guest, this is due to we still forget to pass
struct netdev_notifier_info in several places. This patch completes it.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.
The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.
SKB Usage:
When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:
* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.
* Set skb->protocol to the new MPLS ethertype of the packet.
* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.
I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.
Features:
I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.
This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.
A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.
Alternate Implementation:
One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().
I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.
MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.
Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the ability to send ICMPv6 echo requests without a
raw socket. The equivalent ability for ICMPv4 was added in
2011.
Instead of having separate code paths for IPv4 and IPv6, make
most of the code in net/ipv4/ping.c dual-stack and only add a
few IPv6-specific bits (like the protocol definition) to a new
net/ipv6/ping.c. Hopefully this will reduce divergence and/or
duplication of bugs in the future.
Caveats:
- Setting options via ancillary data (e.g., using IPV6_PKTINFO
to specify the outgoing interface) is not yet supported.
- There are no separate security settings for IPv4 and IPv6;
everything is controlled by /proc/net/ipv4/ping_group_range.
- The proc interface does not yet display IPv6 ping sockets
properly.
Tested with a patched copy of ping6 and using raw socket calls.
Compiles and works with all of CONFIG_IPV6={n,m,y}.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting https://bugzilla.netfilter.org/show_bug.cgi?id=812:
[ ip6tables -m addrtype ]
When I tried to use in the nat/PREROUTING it messes up the
routing cache even if the rule didn't matched at all.
[..]
If I remove the --limit-iface-in from the non-working scenario, so just
use the -m addrtype --dst-type LOCAL it works!
This happens when LOCAL type matching is requested with --limit-iface-in,
and the default ipv6 route is via the interface the packet we test
arrived on.
Because xt_addrtype uses ip6_route_output, the ipv6 routing implementation
creates an unwanted cached entry, and the packet won't make it to the
real/expected destination.
Silently ignoring --limit-iface-in makes the routing work but it breaks
rule matching (--dst-type LOCAL with limit-iface-in is supposed to only
match if the dst address is configured on the incoming interface;
without --limit-iface-in it will match if the address is reachable
via lo).
The test should call ipv6_chk_addr() instead. However, this would add
a link-time dependency on ipv6.
There are two possible solutions:
1) Revert the commit that moved ipt_addrtype to xt_addrtype,
and put ipv6 specific code into ip6t_addrtype.
2) add new "nf_ipv6_ops" struct to register pointers to ipv6 functions.
While the former might seem preferable, Pablo pointed out that there
are more xt modules with link-time dependeny issues regarding ipv6,
so lets go for 2).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK could be replaced
by ipv6_addr_scope(), which is slightly faster.
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_addr_any() is a faster way to determine if an addr
is ipv6 any addr, no need to compute the addr type.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of peer address for IPv6. For example, it is
possible to specify the remote end of a 6inY tunnel.
This was already possible in IPv4:
ip addr add ip1 peer ip2 dev dev1
The peer address is specified with IFA_ADDRESS and the local address with
IFA_LOCAL (like explained in include/uapi/linux/if_addr.h).
Note that the API is not changed, because before this patch, it was not
possible to specify two different addresses in IFA_LOCAL and IFA_REMOTE.
There is a small change for the dump: if the peer is different from ::,
IFA_ADDRESS will contain the peer address instead of the local address.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 0178b695fd ("ipv6: Copy cork options in ip6_append_data")
added some code duplication and bad error recovery, leading to potential
crash in ip6_cork_release() as kfree() could be called with garbage.
use kzalloc() to make sure this wont happen.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Neal Cardwell <ncardwell@google.com>
We forget to call dev_put() on error path in xfrm6_fill_dst(),
its caller doesn't handle this.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a hole in struct ip6_tnl_parm2, so we have to
zero the struct on stack before copying it to user-space.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have seen multiple NULL dereferences in __inet6_lookup_established()
After analysis, I found that inet6_sk() could be NULL while the
check for sk_family == AF_INET6 was true.
Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
and TCP stacks.
Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
table, we no longer can clear pinet6 field.
This patch extends logic used in commit fcbdf09d96
("net: fix nulls list corruptions in sk_prot_alloc")
TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
to make sure we do not clear pinet6 field.
At socket clone phase, we do not really care, as cloning the parent (non
NULL) pinet6 is not adding a fatal race.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.
$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds icmp-registration module for ipv6. It allows
ipv6 protocol to register icmp_sender which is used for sending
ipv6 icmp msgs. This extra layer allows us to kill ipv6 dependency
for sending icmp packets.
This patch also fixes ip_tunnel compilation problem when ip_tunnel
is statically compiled in kernel but ipv6 is module
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be_main.c
drivers/net/ethernet/intel/igb/igb_main.c
drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c
include/net/scm.h
net/batman-adv/routing.c
net/ipv4/tcp_input.c
The e{uid,gid} --> {uid,gid} credentials fix conflicted with the
cleanup in net-next to now pass cred structs around.
The be2net driver had a bug fix in 'net' that overlapped with the VLAN
interface changes by Patrick McHardy in net-next.
An IGB conflict existed because in 'net' the build_skb() support was
reverted, and in 'net-next' there was a comment style fix within that
code.
Several batman-adv conflicts were resolved by making sure that all
calls to batadv_is_my_mac() are changed to have a new bat_priv first
argument.
Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO
rewrite in 'net-next', mostly overlapping changes.
Thanks to Stephen Rothwell and Antonio Quartulli for help with several
of these merge resolutions.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains a small batch of Netfilter
updates for your net-next tree, they are:
* Three patches that provide more accurate error reporting to
user-space, instead of -EPERM, in IPv4/IPv6 netfilter re-routing
code and NAT, from Patrick McHardy.
* Update copyright statements in Netfilter filters of
Patrick McHardy, from himself.
* Add Kconfig dependency on the raw/mangle tables to the
rpfilter, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
If time allows, please consider pulling the following patchset contains two
late Netfilter fixes, they are:
* Skip broadcast/multicast locally generated traffic in the rpfilter,
(closes netfilter bugzilla #814), from Florian Westphal.
* Fix missing elements in the listing of ipset bitmap ip,mac set
type with timeout support enabled, from Jozsef Kadlecsik.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rpfilter is only valid in raw/mangle PREROUTING, i.e.
RPFILTER=y|m is useless without raw or mangle table support.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alex Efros reported rpfilter module doesn't match following packets:
IN=br.qemu SRC=192.168.2.1 DST=192.168.2.255 [ .. ]
(netfilter bugzilla #814).
Problem is that network stack arranges for the locally generated broadcasts
to appear on the interface they were sent out, so the IFF_LOOPBACK check
doesn't trigger.
As -m rpfilter is restricted to PREROUTING, we can check for existing
rtable instead, it catches locally-generated broad/multicast case, too.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add copyright statements to all netfilter files which have had significant
changes done by myself in the past.
Some notes:
- nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
Core Team when it got split out of nf_conntrack_core.c. The copyrights
even state a date which lies six years before it was written. It was
written in 2005 by Harald and myself.
- net/ipv{4,6}/netfilter.c, net/netfitler/nf_queue.c were missing copyright
statements. I've added the copyright statement from net/netfilter/core.c,
where this code originated
- for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
it to give the wrong impression
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 4a94445c9a (net: Use ip_route_input_noref() in input path)
added a bug in IP defragmentation handling, as non refcounted
dst could escape an RCU protected section.
Commit 64f3b9e203 (net: ip_expire() must revalidate route) fixed
the case of timeouts, but not the general problem.
Tom Parkin noticed crashes in UDP stack and provided a patch,
but further analysis permitted us to pinpoint the root cause.
Before queueing a packet into a frag list, we must drop its dst,
as this dst has limited lifetime (RCU protected)
When/if a packet is finally reassembled, we use the dst of the very
last skb, still protected by RCU and valid, as the dst of the
reassembled packet.
Use same logic in IPv6, as there is no need to hold dst references.
Reported-by: Tom Parkin <tparkin@katalix.com>
Tested-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tomas reported the following build error:
net/built-in.o: In function `ieee80211_unregister_hw':
(.text+0x10f0e1): undefined reference to `unregister_inet6addr_notifier'
net/built-in.o: In function `ieee80211_register_hw':
(.text+0x10f610): undefined reference to `register_inet6addr_notifier'
make: *** [vmlinux] Error 1
when built IPv6 as a module.
So we have to statically link these symbols.
Reported-by: Tomas Melin <tomas.melin@iki.fi>
Cc: Tomas Melin <tomas.melin@iki.fi>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: YOSHIFUJI Hidaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Allow to avoid copying DSCP during encapsulation
by setting a SA flag. From Nicolas Dichtel.
2) Constify the netlink dispatch table, no need to modify it
at runtime. From Mathias Krause.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of invalidating all IPv6 addresses with global scope
when one decides to use IPv6 tokens, we should only invalidate
previous tokens and leave the rest intact until they expire
eventually (or are intact forever). For doing this less greedy
approach, we're adding a bool at the end of inet6_ifaddr structure
instead, for two reasons: i) per-inet6_ifaddr flag space is
already used up, making it wider might not be a good idea,
since ii) also we do not necessarily need to export this
information into user space.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we set the iftoken in inet6_set_iftoken(), we return -EINVAL
when the device does not have flag IF_READY. This is however not
necessary and rather an artificial usability barrier, since we
simply can set the token despite that, and in case the device is
ready, we just send out our rs, otherwise ifup et al. will do
this for us anyway.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we check for !ipv6_addr_any(&in6_dev->token) in
addrconf_prefix_rcv(), make the token initialization on
device setup more intuitive by using in6addr_any as an
initializer.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for IPv6 tokenized IIDs, that allow
for administrators to assign well-known host-part addresses
to nodes whilst still obtaining global network prefix from
Router Advertisements. It is currently in draft status.
The primary target for such support is server platforms
where addresses are usually manually configured, rather
than using DHCPv6 or SLAAC. By using tokenised identifiers,
hosts can still determine their network prefix by use of
SLAAC, but more readily be automatically renumbered should
their network prefix change. [...]
The disadvantage with static addresses is that they are
likely to require manual editing should the network prefix
in use change. If instead there were a method to only
manually configure the static identifier part of the IPv6
address, then the address could be automatically updated
when a new prefix was introduced, as described in [RFC4192]
for example. In such cases a DNS server might be
configured with such a tokenised interface identifier of
::53, and SLAAC would use the token in constructing the
interface address, using the advertised prefix. [...]
http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02
The implementation is partially based on top of Mark K.
Thompson's proof of concept. However, it uses the Netlink
interface for configuration resp. data retrival, so that
it can be easily extended in future. Successfully tested
by myself.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Propagate errors from ip_xfrm_me_harder() instead of returning EPERM in
all cases.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Propagate routing errors from ip_route_me_harder() when dropping a packet
using NF_DROP_ERR(). This makes userspace get the proper error instead of
EPERM for everything.
# ip -6 r a unreachable default table 100
# ip -6 ru add fwmark 0x1 lookup 100
# ip6tables -t mangle -A OUTPUT -d 2001:4860:4860::8888 -j MARK --set-mark 0x1
Old behaviour:
PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
New behaviour:
PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/nfc/microread/mei.c
net/netfilter/nfnetlink_queue_core.c
Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
some cleanups are going to go on-top.
Signed-off-by: David S. Miller <davem@davemloft.net>
Tetja Rediske found that if the host receives an ICMPv6 redirect message
after sending a SYN+ACK, the connection will be reset.
He bisected it down to 093d04d (ipv6: Change skb->data before using
icmpv6_notify() to propagate redirect), but the origin of the bug comes
from ec18d9a26 (ipv6: Add redirect support to all protocol icmp error
handlers.). The bug simply did not trigger prior to 093d04d, because
skb->data did not point to the inner IP header and thus icmpv6_notify
did not call the correct err_handler.
This patch adds the missing "goto out;" in tcp_v6_err. After receiving
an ICMPv6 Redirect, we should not continue processing the ICMP in
tcp_v6_err, as this may trigger the removal of request-socks or setting
sk_err(_soft).
Reported-by: Tetja Rediske <tetja@tetja.de>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter and IPVS updates for
your net-next tree, most relevantly they are:
* Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE.
The LOG and ebt_log target has been also adapted, but they still
depend on the syslog netnamespace that seems to be missing, from
Gao Feng.
* Don't lose indications of congestion in IPv6 fragmentation handling,
from Hannes Frederic Sowa.i
* IPVS conversion to use RCU, including some code consolidation patches
and optimizations, also some from Julian Anastasov.
* cpu fanout support for NFQUEUE, from Holger Eitzenberger.
* Better error reporting to userspace when dropping packets from
all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This change brings netfilter reassembly logic on par with
reassembly.c. The corresponding change in net-next is
(eec2e61 ipv6: implement RFC3168 5.3 (ecn protection) for
ipv6 fragmentation handling)
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds netns support to nf_log and it prepares netns
support for existing loggers. It is composed of four major
changes.
1) nf_log_register has been split to two functions: nf_log_register
and nf_log_set. The new nf_log_register is used to globally
register the nf_logger and nf_log_set is used for enabling
pernet support from nf_loggers.
Per netns is not yet complete after this patch, it comes in
separate follow up patches.
2) Add net as a parameter of nf_log_bind_pf. Per netns is not
yet complete after this patch, it only allows to bind the
nf_logger to the protocol family from init_net and it skips
other cases.
3) Adapt all nf_log_packet callers to pass netns as parameter.
After this patch, this function only works for init_net.
4) Make the sysctl net/netfilter/nf_log pernet.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
The following patchset contains netfilter updates for your net tree,
they are:
* Fix missing the skb->trace reset in nf_reset, noticed by Gao Feng
while using the TRACE target with several net namespaces.
* Fix prefix translation in IPv6 NPT if non-multiple of 32 prefixes
are used, from Matthias Schiffer.
* Fix invalid nfacct objects with empty name, they are now rejected
with -EINVAL, spotted by Michael Zintakis, patch from myself.
* A couple of fixes for wrong return values in the error path of
nfnetlink_queue and nf_conntrack, from Wei Yongjun.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The bitmask used for the prefix mangling was being calculated
incorrectly, leading to the wrong part of the address being replaced
when the prefix length wasn't a multiple of 32.
Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv6 Routing table becomes broken once we do ifdown, ifup of the loopback(lo)
interface. After down-up, routes of other interface's IPv6 addresses through
'lo' are lost.
IPv6 addresses assigned to all interfaces are routed through 'lo' for internal
communication. Once 'lo' is down, those routing entries are removed from routing
table. But those removed entries are not being re-created properly when 'lo' is
brought up. So IPv6 addresses of other interfaces becomes unreachable from the
same machine. Also this breaks communication with other machines because of
NDISC packet processing failure.
This patch fixes this issue by reading all interface's IPv6 addresses and adding
them to IPv6 routing table while bringing up 'lo'.
==Testing==
Before applying the patch:
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$
After applying the patch:
$ route -A inet6
Kernel IPv6 routing
table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2000::20/128 :: U 256 0 0 eth0
fe80::/64 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
::1/128 :: Un 0 1 0 lo
2000::20/128 :: Un 0 1 0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128 :: Un 0 1 0 lo
ff00::/8 :: U 256 0 0 eth0
::/0 :: !n -1 1 1 lo
$
Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/mac80211/sta_info.c
net/wireless/core.h
Two minor conflicts in wireless. Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().
Signed-off-by: David S. Miller <davem@davemloft.net>
Erik Hugne's errata proposal (Errata ID: 3480) to RFC4291 has been
verified: http://www.rfc-editor.org/errata_search.php?eid=3480
We have to check for pkt_type and loopback flag because either the
packets are allowed to travel over the loopback interface (in which case
pkt_type is PACKET_HOST and IFF_LOOPBACK flag is set) or they travel
over a non-loopback interface back to us (in which case PACKET_TYPE is
PACKET_LOOPBACK and IFF_LOOPBACK flag is not set).
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/ipip.h
The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use common function get calculate rtnl_link_stats64 stats.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.
ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.
ipip.h header is renamed to ip_tunnel.h
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:
* Better performance in nfnetlink_queue by avoiding copy from the
packet to netlink message, from Eric Dumazet.
* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.
* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.
* A couple of sparse fixes for IPVS, from Julian Anastasov.
* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
myself.
* Allow to dump expectations per master conntrack via ctnetlink, from myself.
* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
Popescu.
* Remove nf_conntrack module a bit faster if netns are in use, from
Vladimir Davydov.
* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.
* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hello!
After patch 1 got accepted to net-next I will also send a patch to
netfilter-devel to make the corresponding changes to the netfilter
reassembly logic.
Thanks,
Hannes
-- >8 --
[PATCH 2/2] ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling
This patch also ensures that INET_ECN_CE is propagated if one fragment
had the codepoint set.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with
dev_base_seq to check if a change occurs during a netlink dump.
If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message
after the dump was interrupted.
Note that only dump of unicast addresses is checked (multicast and anycast are
not checked).
Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.
In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.
This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:
* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
complaining that this target does not work in the nat table, which is the
wrong table for it, from Florian Westphal.
* Fix possible use before initialization in the netns init path of several
conntrack protocol trackers (introduced recently while improving conntrack
netns support), from Gao Feng.
* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
by Eric Dumazet during the NFWS2013, patch from myself.
* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.
* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
not required anymore after change introduced in 3.7, again from Julian.
* Fix SYN looping in IPVS state sync if the backup is used a real server
in DR/TUN modes, this required a new /proc entry to disable the director
function when acting as backup, also from Julian.
* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
Paul Bolle.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is choosen somewhat
arbitrary and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.
If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.
I dropped the out of memory warning in the ipv4 fragment lookup path,
because we already get a warning by the slab allocator.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
LISTEN socket, and this socket is currently owned by the user, we
set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.
This is bad because if we clone the parent before it had a chance to
clear the flag, the child inherits the tsq_flags value, and next
tcp_release_cb() on the child will decrement sk_refcnt.
Result is that we might free a live TCP socket, as reported by
Dormando.
IPv4: Attempt to release TCP socket in state 1
Fix this issue by testing sk_state against TCP_LISTEN early, so that we
set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)
This bug was introduced in commit 563d34d057
(tcp: dont drop MTU reduction indications)
Reported-by: dormando <dormando@rydia.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.
As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:
Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.
before this patch:
average (among 7 runs) of 20845.5 Requests/Second
after:
average (among 7 runs) of 21403.6 Requests/Second
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the translation is stateless, using it in nat table
doesn't work (only initial packet is translated).
filter table OUTPUT works but won't re-route the packet after translation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As in (842df07 ipv6: use newly introduced __ipv6_addr_needs_scope_id and
ipv6_iface_scope_id).
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[ Some fixes went into mainstream before this patch, so I needed
to rebase it upon the current tree, that's why it's different from
the original one posted on the list --pablo ]
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
drivers/net/ethernet/intel/e1000e/netdev.c
Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to iptunnel_xmit(), group these operations into a
helper function.
This by the way fixes the missing u64_stats_update_begin()
and u64_stats_update_end() for 32 bit arch.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_data() cannot return NULL, so these NULL pointer checks are
superfluous.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With recent patches from Pravin, most tunnels can't use iptunnel_xmit()
any more, due to ip_select_ident() and skb->ip_summed. But we can just
move these operations out of iptunnel_xmit(), so that tunnels can
use it again.
This by the way fixes a bug in vxlan (missing nf_reset()) for net-next.
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.
This can be used by tunneling protocols like VXLAN.
CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier SG was unset if CSUM was not available for given device to
force skb copy to avoid sending inconsistent csum.
Commit c9af6db4c1 (net: Fix possible wrong checksum generation)
added explicit flag to force copy to fix this issue. Therefore
there is no need to link SG and CSUM, following patch kills this
link between there two features.
This patch is also required following patch in series.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
v4:
a) unchanged
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch requires multicast interface-scoped addresses to supply a
sin6_scope_id. Because the sin6_scope_id is now also correctly used
in case of interface-scoped multicast traffic this enables one to use
interface scoped addresses over interfaces which are not targeted by the
default multicast route (the route has to be put there manually, though).
getsockname() and getpeername() now return the correct sin6_scope_id in
case of interface-local mc addresses.
v2:
a) rebased ontop of patch 1/4 (now uses ipv6_addr_props)
v3:
a) reverted changes for ipv6_addr_props
v4:
a) unchanged
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>dave
Signed-off-by: David S. Miller <davem@davemloft.net>
send_sllao is already initialized with the value of dev->addr_len
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2:
a) used struct ipv6_addr_props
v3:
a) reverted changes for ipv6_addr_props
v4:
a) do not use __ipv6_addr_needs_scope_id
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 18367681a1 (ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.)
omitted proper __rcu annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's useful to be able to get the initial state of all entries. The patch adds
the support for IPv4 and IPv6.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, DSCP is copying during encapsulation.
Copying the DSCP in IPsec tunneling may be a bit dangerous because packets with
different DSCP may get reordered relative to each other in the network and then
dropped by the remote IPsec GW if the reordering becomes too big compared to the
replay window.
It is possible to avoid this copy with netfilter rules, but it's very convenient
to be able to configure it for each SA directly.
This patch adds a toogle for this purpose. By default, it's not set to maintain
backward compatibility.
Field flags in struct xfrm_usersa_info is full, hence I add a new attribute.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.
This is a continuation of commit aacd9289af
for IPv6.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"A moderately sized pile of fixes, some specifically for merge window
introduced regressions although others are for longer standing items
and have been queued up for -stable.
I'm kind of tired of all the RDS protocol bugs over the years, to be
honest, it's way out of proportion to the number of people who
actually use it.
1) Fix missing range initialization in netfilter IPSET, from Jozsef
Kadlecsik.
2) ieee80211_local->tim_lock needs to use BH disabling, from Johannes
Berg.
3) Fix DMA syncing in SFC driver, from Ben Hutchings.
4) Fix regression in BOND device MAC address setting, from Jiri
Pirko.
5) Missing usb_free_urb in ISDN Hisax driver, from Marina Makienko.
6) Fix UDP checksumming in bnx2x driver for 57710 and 57711 chips,
fix from Dmitry Kravkov.
7) Missing cfgspace_lock initialization in BCMA driver.
8) Validate parameter size for SCTP assoc stats getsockopt(), from
Guenter Roeck.
9) Fix SCTP association hangs, from Lee A Roberts.
10) Fix jumbo frame handling in r8169, from Francois Romieu.
11) Fix phy_device memory leak, from Petr Malat.
12) Omit trailing FCS from frames received in BGMAC driver, from Hauke
Mehrtens.
13) Missing socket refcount release in L2TP, from Guillaume Nault.
14) sctp_endpoint_init should respect passed in gfp_t, rather than use
GFP_KERNEL unconditionally. From Dan Carpenter.
15) Add AISX AX88179 USB driver, from Freddy Xin.
16) Remove MAINTAINERS entries for drivers deleted during the merge
window, from Cesar Eduardo Barros.
17) RDS protocol can try to allocate huge amounts of memory, check
that the user's request length makes sense, from Cong Wang.
18) SCTP should use the provided KMALLOC_MAX_SIZE instead of it's own,
bogus, definition. From Cong Wang.
19) Fix deadlocks in FEC driver by moving TX reclaim into NAPI poll,
from Frank Li. Also, fix a build error introduced in the merge
window.
20) Fix bogus purging of default routes in ipv6, from Lorenzo Colitti.
21) Don't double count RTT measurements when we leave the TCP receive
fast path, from Neal Cardwell."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tcp: fix double-counted receiver RTT when leaving receiver fast path
CAIF: fix sparse warning for caif_usb
rds: simplify a warning message
net: fec: fix build error in no MXC platform
net: ipv6: Don't purge default router if accept_ra=2
net: fec: put tx to napi poll function to fix dead lock
sctp: use KMALLOC_MAX_SIZE instead of its own MAX_KMALLOC_SIZE
rds: limit the size allocated by rds_message_alloc()
MAINTAINERS: remove eexpress
MAINTAINERS: remove drivers/net/wan/cycx*
MAINTAINERS: remove 3c505
caif_dev: fix sparse warnings for caif_flow_cb
ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver
sctp: use the passed in gfp flags instead GFP_KERNEL
ipv[4|6]: correct dropwatch false positive in local_deliver_finish
l2tp: Restore socket refcount when sendmsg succeeds
net/phy: micrel: Disable asymmetric pause for KSZ9021
bgmac: omit the fcs
phy: Fix phy_device_free memory leak
bnx2x: Fix KR2 work-around condition
...
Setting net.ipv6.conf.<interface>.accept_ra=2 causes the kernel
to accept RAs even when forwarding is enabled. However, enabling
forwarding purges all default routes on the system, breaking
connectivity until the next RA is received. Fix this by not
purging default routes on interfaces that have accept_ra=2.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I had a report recently of a user trying to use dropwatch to localise some frame
loss, and they were getting false positives. Turned out they were using a user
space SCTP stack that used raw sockets to grab frames. When we don't have a
registered protocol for a given packet, we record it as a drop, even if a raw
socket receieves the frame. We should only record the drop in the event a raw
socket doesnt exist to receive the frames
Tested by the reported successfully
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: William Reich <reich@ulticom.com>
Tested-by: William Reich <reich@ulticom.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: William Reich <reich@ulticom.com>
CC: eric.dumazet@gmail.com
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc patches from Andrew Morton:
- Florian has vanished so I appear to have become fbdev maintainer
again :(
- Joel and Mark are distracted to welcome to the new OCFS2 maintainer
- The backlight queue
- Small core kernel changes
- lib/ updates
- The rtc queue
- Various random bits
* akpm: (164 commits)
rtc: rtc-davinci: use devm_*() functions
rtc: rtc-max8997: use devm_request_threaded_irq()
rtc: rtc-max8907: use devm_request_threaded_irq()
rtc: rtc-da9052: use devm_request_threaded_irq()
rtc: rtc-wm831x: use devm_request_threaded_irq()
rtc: rtc-tps80031: use devm_request_threaded_irq()
rtc: rtc-lp8788: use devm_request_threaded_irq()
rtc: rtc-coh901331: use devm_clk_get()
rtc: rtc-vt8500: use devm_*() functions
rtc: rtc-tps6586x: use devm_request_threaded_irq()
rtc: rtc-imxdi: use devm_clk_get()
rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
rtc: rtc-pcf8583: use dev_warn() instead of printk()
rtc: rtc-sun4v: use pr_warn() instead of printk()
rtc: rtc-vr41xx: use dev_info() instead of printk()
rtc: rtc-rs5c313: use pr_err() instead of printk()
rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
rtc: rtc-ds2404: use dev_err() instead of printk()
rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
...
After I came across a help text for SUNGEM mentioning a broken sun.com
URL, I felt like fixing those up, as they are now pointing to oracle.com
URLs.
Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers all
over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
If you need me to provide a merged tree to handle these resolutions,
please let me know.
Other than those patches, there's not much here, some minor fixes and
updates.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
=yWAQ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg Kroah-Hartman:
"Here is the big driver core merge for 3.9-rc1
There are two major series here, both of which touch lots of drivers
all over the kernel, and will cause you some merge conflicts:
- add a new function called devm_ioremap_resource() to properly be
able to check return values.
- remove CONFIG_EXPERIMENTAL
Other than those patches, there's not much here, some minor fixes and
updates"
Fix up trivial conflicts
* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
base: memory: fix soft/hard_offline_page permissions
drivercore: Fix ordering between deferred_probe and exiting initcalls
backlight: fix class_find_device() arguments
TTY: mark tty_get_device call with the proper const values
driver-core: constify data for class_find_device()
firmware: Ignore abort check when no user-helper is used
firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
firmware: Make user-mode helper optional
firmware: Refactoring for splitting user-mode helper code
Driver core: treat unregistered bus_types as having no devices
watchdog: Convert to devm_ioremap_resource()
thermal: Convert to devm_ioremap_resource()
spi: Convert to devm_ioremap_resource()
power: Convert to devm_ioremap_resource()
mtd: Convert to devm_ioremap_resource()
mmc: Convert to devm_ioremap_resource()
mfd: Convert to devm_ioremap_resource()
media: Convert to devm_ioremap_resource()
iommu: Convert to devm_ioremap_resource()
drm: Convert to devm_ioremap_resource()
...
Eric Dumazet wrote:
| Some strange crashes happen in rt6_check_expired(), with access
| to random addresses.
|
| At first glance, it looks like the RTF_EXPIRES and
| stuff added in commit 1716a96101
| (ipv6: fix problem with expired dst cache)
| are racy : same dst could be manipulated at the same time
| on different cpus.
|
| At some point, our stack believes rt->dst.from contains a dst pointer,
| while its really a jiffie value (as rt->dst.expires shares the same area
| of memory)
|
| rt6_update_expires() should be fixed, or am I missing something ?
|
| CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060
Because we do not have any locks for dst_entry, we cannot change
essential structure in the entry; e.g., we cannot change reference
to other entity.
To fix this issue, split 'from' and 'expires' field in dst_entry
out of union. Once it is 'from' is assigned in the constructor,
keep the reference until the very last stage of the life time of
the object.
Of course, it is unsafe to change 'from', so make rt6_set_from simple
just for fresh entries.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Neil Horman <nhorman@tuxdriver.com>
CC: Gao Feng <gaofeng@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reported-by: Steinar H. Gunderson <sesse@google.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contain updates for your net-next tree, they are:
* Fix (for just added) connlabel dependencies, from Florian Westphal.
* Add aliasing support for conntrack, thus users can either use -m state
or -m conntrack from iptables while using the same kernel module, from
Jozsef Kadlecsik.
* Some code refactoring for the CT target to merge common code in
revision 0 and 1, from myself.
* Add aliasing support for CT, based on patch from Jozsef Kadlecsik.
* Add one mutex per nfnetlink subsystem, from myself.
* Improved logging for packets that are dropped by helpers, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull in 'net' to take in the bug fixes that didn't make it into
3.8-final.
Also, deal with the semantic conflict of the change made to
net/ipv6/xfrm6_policy.c A missing rt6->n neighbour release
was added to 'net', but in 'net-next' we no longer cache the
neighbour entries in the ipv6 routes so that change is not
appropriate there.
Signed-off-by: David S. Miller <davem@davemloft.net>
Connection tracking helpers have to drop packets under exceptional
situations. Currently, the user gets the following logging message
in case that happens:
nf_ct_%s: dropping packet ...
However, depending on the helper, there are different reasons why a
packet can be dropped.
This patch modifies the existing code to provide more specific
error message in the scope of each helper to help users to debug
the reason why the packet has been dropped, ie:
nf_ct_%s: dropping packet: reason ...
Thanks to Joe Perches for many formatting suggestions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net/ipv6/reassembly.c:82:72: warning: incorrect type in argument 3 (different base types)
net/ipv6/reassembly.c:82:72: expected unsigned int [unsigned] [usertype] c
net/ipv6/reassembly.c:82:72: got restricted __be32 [usertype] id
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neighbor is cloned in xfrm6_fill_dst but seems to never be released.
Neighbor entry should be released when XFRM6 dst entry is destroyed
in xfrm6_dst_destroy, otherwise references may be kept forever on
the device pointed by the neighbor entry.
I may not have understood all the subtleties of XFRM & dst so I would
be happy to receive comments on this patch.
Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.
this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.
It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Should not use assignment in conditional:
warning: suggest parentheses around assignment used as truth value [-Wparentheses]
Problem introduced by:
commit 14bbd6a565
Author: Pravin B Shelar <pshelar@nicira.com>
Date: Thu Feb 14 09:44:49 2013 +0000
net: Add skb_unclone() helper function.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ipv6_addr_hash() and a single jhash invocation.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.
[ Compute pkt_len before ip_local_out() invocation. -DaveM ]
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function will be used in next GRE_GSO patch. This patch does
not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Steffen Klassert says:
====================
1) Remove a duplicated call to skb_orphan() in pf_key, from Cong Wang.
2) Prepare xfrm and pf_key for algorithms without pf_key support,
from Jussi Kivilinna.
3) Fix an unbalanced lock in xfrm_output_one(), from Li RongQing.
4) Add an IPsec state resolution packet queue to handle
packets that are send before the states are resolved.
5) xfrm4_policy_fini() is unused since 2.6.11, time to remove it.
From Michal Kubecek.
6) The xfrm gc threshold was configurable just in the initial
namespace, make it configurable in all namespaces. From
Michal Kubecek.
7) We currently can not insert policies with mark and mask
such that some flows would be matched from both policies.
Allow this if the priorities of these policies are different,
the one with the higher priority is used in this case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adjusting of data pointers in net/netfilter/nf_conntrack_frag6_*
sysctl table for other namespaces points to wrong netns_frags
structure and has reversed order of entries.
Problem introduced by commit c038a767cd in 3.7-rc1
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patch cef401de7b (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.
Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.
tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket timestamp is a sum of the global tcp_time_stamp and
a per-socket offset.
A socket offset is added in places where externally visible
tcp timestamp option is parsed/initialized.
Connections in the SYN_RECV state are not supported, global
tcp_time_stamp is used for them, because repair mode doesn't support
this state. In a future it can be implemented by the similar way
as for TIME_WAIT sockets.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question. Thus, use the 'net-next' version.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2:
a) moved before multicast source address check
b) changed comment to netdev style
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC4291 (IPv6 addressing architecture) says that interface-Local scope
spans only a single interface on a node. We should not join L2 device
multicast list for addresses in interface-local (or smaller) scope.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS fixes for 3.8-rc7, they are:
* Fix oops in IPVS state-sync due to releasing a random memory area due
to unitialized pointer, from Dan Carpenter.
* Fix SCTP flow establishment due to bad checksumming mangling in IPVS,
from Daniel Borkmann.
* Three fixes for the recently added IPv6 NPT, all from YOSHIFUJI Hideaki,
with an amendment collapsed into those patches from Ulrich Weber. They
fiix adjustment calculation, fix prefix mangling and ensure LSB of
prefixes are zeroes (as required by RFC).
Specifically, it took me a while to validate the 1's complement arithmetics/
checksumming approach in the IPv6 NPT code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Synchronize with 'net' in order to sort out some l2tp, wireless, and
ipv6 GRE fixes that will be built on top of in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following RCU warning:
[ 51.680236] ===============================
[ 51.681914] [ INFO: suspicious RCU usage. ]
[ 51.683610] 3.8.0-rc6-next-20130206-sasha-00028-g83214f7-dirty #276 Tainted: G W
[ 51.686703] -------------------------------
[ 51.688281] net/ipv6/ip6_flowlabel.c:671 suspicious rcu_dereference_check() usage!
we should use rcu_dereference_bh() when we hold rcu_read_lock_bh().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 6296 points that address bits that are not part of the prefix
has to be zeroed.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make sure only the bits that are part of the prefix are mangled.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Cast __wsum from/to __sum16 is wrong. Instead, apply appropriate
conversion function: csum_unfold() or csum_fold().
[ The original patch has been modified to undo the final ~ that
csum_fold returns. We only need to fold the 32-bit word that
results from the checksum calculation into a 16-bit to ensure
that the original subnet is restored appropriately. Spotted by
Ulrich Weber. ]
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ip6gre_tunnel_xmit() is leaking the skb when we hit this error branch,
and the -1 return value from this function is bogus. Use the error
handling we already have in place in ip6gre_tunnel_xmit() for this error
case to fix this.
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling icmpv6_send() on a local message size error leads to an
incorrect update of the path mtu in the case when IPsec is used.
So use ipv6_local_error() instead to notify the socket about the
error.
Reported-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xfrm gc threshold can be configured via xfrm{4,6}_gc_thresh
sysctl but currently only in init_net, other namespaces always
use the default value. This can substantially limit the number
of IPsec tunnels that can be effectively used.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c
The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.
The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.
The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.
The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the loop, don't check 'rv' twice in a row. Without the loop, 'rv'
doesn't even need to be checked.
Make the comment more grammar-friendly.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS in
tcp_v6_conn_request() and tcp_v6_err(). tcp_v6_conn_request() in particular can
drop SYNs for various reasons which are not currently tracked.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_datagram_recv_ctl and ip6_datagram_send_ctl are used for handling IPv6
ancillary data. Since ip6_datagram_send_ctl is already publicly exported for
use in modules, ip6_datagram_recv_ctl should also be available to support
ancillary data in the receive path.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The datagram_*_ctl functions in net/ipv6/datagram.c are IPv6-specific. Since
datagram_send_ctl is publicly exported it should be appropriately named to
reflect the fact that it's for IPv6 only.
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all users are write-lock, it does not make sense to use
rwlock here. Use simple spinlock.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
They will be created at output, if ever needed. This avoids creating
empty neighbor entries when TPROXYing/Forwarding packets for addresses
that are not even directly reachable.
Note that IPv4 already handles it this way. No neighbor entries are
created for local input.
Tested by myself and customer.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "Universal/Local" (U/L) bit must be complmented according to RFC4944
and RFC2464.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds anti-spoofing checks in sit.c as specified in RFC3964
section 5.2 for 6to4 and RFC5969 section 12 for 6rd. I left out the
checks which could easily be implemented with netfilter.
Specifically this patch adds following logic (based loosely on the
pseudocode in RFC3964 section 5.2):
if prefix (inner_src_v6) == rd6_prefix (2002::/16 is the default)
and outer_src_v4 != embedded_ipv4 (inner_src_v6)
drop
if prefix (inner_dst_v6) == rd6_prefix (or 2002::/16 is the default)
and outer_dst_v4 != embedded_ipv4 (inner_dst_v6)
drop
accept
To accomplish the specified security checks proposed by above RFCs,
it is still necessary to employ uRPF filters with netfilter. These new
checks only kick in if the employed addresses are within the 2002::/16 or
another range specified by the 6rd-prefix (which defaults to 2002::/16).
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When attempting to build linux-next with user namespaces enabled I ran
into this fun build error.
CC net/ipv6/inet6_connection_sock.o
.../net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_bind_conflict’:
.../net/ipv6/inet6_connection_sock.c:37:12: error: incompatible types when initializing type ‘int’ using
type ‘kuid_t’
.../net/ipv6/inet6_connection_sock.c:54:30: error: incompatible type for argument 1 of ‘uid_eq’
.../include/linux/uidgid.h:48:20: note: expected ‘kuid_t’ but argument is of type ‘int’
make[3]: *** [net/ipv6/inet6_connection_sock.o] Error 1
make[2]: *** [net/ipv6] Error 2
make[2]: *** Waiting for unfinished jobs....
Using kuid_t instead of int to hold the uid fixes this.
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some usecase when lifetime of ipv4 addresses might be helpful.
For example:
1) initramfs networkmanager uses a DHCP daemon to learn network
configuration parameters
2) initramfs networkmanager addresses, routes and DNS configuration
3) initramfs networkmanager is requested to stop
4) initramfs networkmanager stops all daemons including dhclient
5) there are addresses and routes configured but no daemon running. If
the system doesn't start networkmanager for some reason, addresses and
routes will be used forever, which violates RFC 2131.
This patch is essentially a backport of ivp6 address lifetime mechanism
for ipv4 addresses.
Current "ip" tool supports this without any patch (since it does not
distinguish between ipv4 and ipv6 addresses in this perspective.
Also, this should be back-compatible with all current netlink users.
Reported-by: Pavel Šimerda <psimerda@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updating the fragmentation queues LRU (Least-Recently-Used) list,
required taking the hash writer lock. However, the LRU list isn't
tied to the hash at all, so we can use a separate lock for it.
Original-idea-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is primarily a preparation to ease the extension of memory
limit tracking.
The change does reduce the number atomic operation, during freeing of
a frag queue. This does introduce a some performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.
He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().
This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.
Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.
Tested:
$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 3959.52
$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 3216.80
Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.
Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We did this for IPv4 in b49d3c1e1c "net: ipmr: limit MRT_TABLE
identifiers" but we need to do it for IPv6 as well. On IPv6 the name
is "pim6reg" instead of "pimreg" so there is one less digit allowed.
The strcpy() is in ip6mr_reg_vif().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
This batch contains netfilter updates for you net-next tree, they are:
* The new connlabel extension for x_tables, that allows us to attach
labels to each conntrack flow. The kernel implementation uses a
bitmask and there's a file in user-space that maps the bits with the
corresponding string for each existing label. By now, you can attach
up to 128 overlapping labels. From Florian Westphal.
* A new round of improvements for the netns support for conntrack.
Gao feng has moved many of the initialization code of each module
of the netns init path. He also made several code refactoring, that
code looks cleaner to me now.
* Added documentation for all possible tweaks for nf_conntrack via
sysctl, from Jiri Pirko.
* Cisco 7941/7945 IP phone support for our SIP conntrack helper,
from Kevin Cernekee.
* Missing header file in the snmp helper, from Stephen Hemminger.
* Finally, a couple of fixes to resolve minor issues with these
changes, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Motivation for soreuseport would be something like a DNS server. An
alternative would be to recv on the same socket from multiple threads.
As in the case of TCP, the load across these threads tends to be
disproportionate and we also see a lot of contection on the socket lock.
Note that SO_REUSEADDR already allows multiple UDP sockets to bind to
the same port, however there is no provision to prevent hijacking and
nothing to distribute packets across all the sockets sharing the same
bound port. This patch does not change the semantics of SO_REUSEADDR,
but provides usable functionality of it for unicast.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Motivation for soreuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket. This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads. In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets. We have seen the disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest. With so_reusport the distribution is
uniform.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the code that register/unregister l4proto to the
module_init/exit context.
Given that we have to modify some interfaces to accomodate
these changes, it is a good time to use shorter function names
for this using the nf_ct_* prefix instead of nf_conntrack_*,
that is:
nf_ct_l4proto_register
nf_ct_l4proto_pernet_register
nf_ct_l4proto_unregister
nf_ct_l4proto_pernet_unregister
We same many line breaks with it.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Move the code that register/unregister l3proto to the
module_init/exit context.
Given that we have to modify some interfaces to accomodate
these changes, it is a good time to use shorter function names
for this using the nf_ct_* prefix instead of nf_conntrack_*,
that is:
nf_ct_l3proto_register
nf_ct_l3proto_pernet_register
nf_ct_l3proto_unregister
nf_ct_l3proto_pernet_unregister
We same many line breaks with it.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It is declared in:
include/net/ip6_route.h:187:int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));
and net/ip6_route.h is already included.
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) The transport header did not point to the right place after
esp/ah processing on tunnel mode in the receive path. As a
result, the ECN field of the inner header was not set correctly,
fixes from Li RongQing.
2) We did a null check too late in one of the xfrm_replay advance
functions. This can lead to a division by zero, fix from
Nickolai Zeldovich.
3) The size calculation of the hash table missed the muiltplication
with the actual struct size when the hash table is freed.
We might call the wrong free function, fix from Michal Kubecek.
4) On IPsec pmtu events we can't access the transport headers of
the original packet, so force a relookup for all routes
to notify about the pmtu event.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
introduce a bug to try to update "updated" time in neighbour
structure.
Update the "updated" time only if neighbour is available.
Bug was found by Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the support of proxy multicast, ie being able to build a static
multicast tree. It adds the support of (*,*) and (*,G) entries.
The user should define an (*,*) entry which is not used for real forwarding.
This entry defines the upstream in iif and contains all interfaces from the
static tree in its oifs. It will be used to forward packet upstream when they
come from an interface belonging to the static tree.
Hence, the user should define (*,G) entries to build its static tree. Note that
upstream interface must be part of oifs: packets are sent to all oifs
interfaces except the input interface. This ensures to always join the whole
static tree, even if the packet is not coming from the upstream interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Construct NS/NA/RS message directly using C99 compound literals.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_transport_header() (thus icmp6_hdr()) is available here,
use it.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Build ICMPv6 message first and make buffer management easier;
we can use skb->len when filling checksum in ICMPv6 header,
and then build IP header with length field.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- move ip6_nd_hdr() to its users' source files.
In net/ipv6/mcast.c, it will be called ip6_mc_hdr().
- make return type to void since this function never fails.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested by Eric Dumazet <edumazet@google.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also makes ndisc_opt_addr_data() and ndisc_fill_addr_option()
use ndisc_opt_addr_space().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pointer to struct net_device (dev) and remove
data_len (= dev->addr_len) and addr_type (= dev->type).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is already checked by the caller (tunnel64_rcv) and brings ipip6_rcv
in line with ipip_rcv.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check message length before accessing "target" field,
as we do for other types.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of rt->n removal, we do not need neigh argument any more.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pmtu and redirect events are now handled in the protocols error handler,
so add an error handler for icmp6 to do this. It is needed in the case
when we have no socket context. Based on a patch by Duan Jiong.
Reported-by: Duan Jiong <djduanjiong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is to fix up a build problem with a wireless driver due to the
dynamic-debug patches in this branch.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh->nud_state and neigh->updated are under protection of
neigh->lock.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 299b0767 (ipv6: Fix IPsec slowpath fragmentation problem)
has introduced a error in the header length calculation that
provokes corrupted packets when non-fragmentable extensions
headers (Destination Option or Routing Header Type 2) are used.
rt->rt6i_nfheader_len is the length of the non-fragmentable
extension header, and it should be substracted to
rt->dst.header_len, and not to exthdrlen, as it was done before
commit 299b0767.
This patch reverts to the original and correct behavior. It has
been successfully tested with and without IPsec on packets
that include non-fragmentable extensions headers.
Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/networking/ip-sysctl.txt
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
Both conflicts were simply overlapping context.
A build fix for qlcnic is in here too, simply removing the added
devinit annotations which no longer exist.
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user is cxgb3 driver.
old_neigh is used to check device change, but it must not happen
on redirect. In this sense, we can remove old_neigh argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Router Alert option is very small and we can store the value
itself in the skb.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move generalized version of ipv6_is_mld() to header,
and use it from ip6_mc_input().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7a3198a8 ("ipv6: helper function to get tclass") introduced
ipv6_tclass(), but similar function is already available as
ipv6_get_dsfield().
We might be able to call ipv6_tclass() from ipv6_get_dsfield(),
but it is confusing to have two versions.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not only for readability but also for optimization.
What we do here is to build the 32bit word at the beginning of the ipv6
header (the "ip6_flow" virtual member of struct ip6_hdr in RFC3542) and
we do not need to read the tclass portion of the target buffer.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
Replace ip6_route_lookup() with addrconf_get_prefix_route() when
looking up for a prefix route. This ensures that the connected prefix
is looked up in the main table, and avoids the selection of other
matching routes located in different tables as well as blackhole
or prohibited entries.
In addition, this fixes an Opps introduced by commit 64c6d08e (ipv6:
del unreachable route when an addr is deleted on lo), that would occur
when a blackhole or prohibited entry is selected by ip6_route_lookup().
Such entries have a NULL rt6i_table argument, which is accessed by
__ip6_del_rt() when trying to lock rt6i_table->tb6_lock.
The function addrconf_is_prefix_route() is not used anymore and is
removed.
[v2] Minor indentation cleanup and log updates.
Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tests on the flags in addrconf_get_prefix_route() does no make
much sense: the 'noflags' parameter contains the set of flags that
must not match with the route flags, so the test must be done
against 'noflags', and not against 'flags'.
Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ipv6_recv_error(), addr_offset points to daddr field of the ip header.
To get ipv6 header, use container_of() macro instead of substracting magic
number (24).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by David, udp6_csum_init() is too big to be inlined,
move it to ipv6 static library, net/ipv6/ip6_checksum.c.
And the generic csum_ipv6_magic() too.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPsec tunnel does not set ECN field to CE in inner header when
the ECN field in the outer header is CE, and the ECN field in
the inner header is ECT(0) or ECT(1).
The cause is ipip6_hdr() does not return the correct address of
inner header since skb->transport-header is not the inner header
after esp6_input_done2(), or ah6_input().
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl
namespace aware. The reason behind this patch is to ease the testing
of ecn problems on the internet and allows applications to tune their
own use of ecn.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the size of skb allocated for NDISC is MAX_HEADER +
LL_RESERVED_SPACE(dev) + packet length + dev->needed_tailroom,
but only LL_RESERVED_SPACE(dev) bytes is "reserved" for headers.
As a result, the skb looks like this (after construction of the
message):
head data tail end
+--------------------------------------------------------------+
+ | | | |
+--------------------------------------------------------------+
|<-hlen---->|<---ipv6 packet------>|<--tlen-->|<--MAX_HEADER-->|
=LL_ = dev
RESERVED_ ->needed_
SPACE(dev) tailroom
As the name implies, "MAX_HEADER" is used for headers, and should
be "reserved" in prior to packet construction. Or, if some space
is really required at the tail of ther skb, it should be
explicitly documented.
We have several option after construction of NDISC message:
Option 1:
head data tail end
+---------------------------------------------+
+ | | |
+---------------------------------------------+
|<-hlen---->|<---ipv6 packet------>|<--tlen-->|
=LL_ = dev
RESERVED_ ->needed_
SPACE(dev) tailroom
Option 2:
head data tail end
+--------------------------------------------------+
+ | | |
+--------------------------------------------------+
|<--MAX_HEADER-->|<---ipv6 packet------>|<--tlen-->|
= dev
->needed_
tailroom
Option 3:
head data tail end
+--------------------------------------------------------------+
+ | | | |
+--------------------------------------------------------------+
|<--MAX_HEADER-->|<-hlen---->|<---ipv6 packet------>|<--tlen-->|
=LL_ = dev
RESERVED_ ->needed_
SPACE(dev) tailroom
Our tunnel drivers try expanding headroom and the space for tunnel
encapsulation was not a mandatory space -- so we are not seeing
bugs here --, but just for optimization for performance critial
situations.
Since NDISC messages are not performance critical unlike TCP,
and as we know outgoing device, LL_RESERVED_SPACE(dev) should be
just enough for the device in most (if not all) cases:
LL_RESERVED_SPACE(dev) <= LL_MAX_HEADER <= MAX_HEADER
Note that LL_RESERVED_SPACE(dev) is also enough for NDISC over
SIT (e.g., ISATAP).
So, I think Option 1 is just fine here.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
csum16_add() has a broken carry detection, should be:
sum += sum < (__force u16)b;
Instead of fixing csum16_add, remove the custom checksum
functions and use the generic csum_add/csum_sub ones.
Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
The following batch contains Netfilter fixes for 3.8-rc1. They are
a mixture of old bugs that have passed unnoticed (I'll pass these to
stable) and more fresh ones from the previous merge window, they are:
* Fix for MAC address in 6in4 tunnels via NFLOG that results in ulogd
showing up wrong address, from Bob Hockney.
* Fix a comment in nf_conntrack_ipv6, from Florent Fourcot.
* Fix a leak an error path in ctnetlink while creating an expectation,
from Jesper Juhl.
* Fix missing ICMP time exceeded in the IPv6 defragmentation code, from
Haibo Xi.
* Fix inconsistent handling of routing changes in MASQUERADE for the
new connections case, from Andrew Collins.
* Fix a missing skb_reset_transport in ip[6]t_REJECT that leads to
crashes in the ixgbe driver (since it seems to access the transport
header with TSO enabled), from Mukund Jampala.
* Recover obsoleted NOTRACK target by including it into the CT and spot
a warning via printk about being obsoleted. Many people don't check the
scheduled to be removal file under Documentation, so we follow some
less agressive approach to kill this in a year or so. Spotted by Florian
Westphal, patch from myself.
* Fix race condition in xt_hashlimit that allows to create two or more
entries, from myself.
* Fix crash if the CT is used due to the recently added facilities to
consult the dying and unconfirmed conntrack lists, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6gre_xmit2() incorrectly sets transport header to inner payload
instead of GRE header. It seems copy-and-pasted from ipip.c.
Set transport header to gre header.
(In ipip case the transport header is the inner ip header, so that's
correct.)
Found by inspection. In practice the incorrect transport header
doesn't matter because the skb usually is sent to another net_device
or socket, so the transport header isn't referenced.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
the value of err is always negative if it goes to errout, so we don't need to
check the value of err.
Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b836c99fd6 (ipv6: unify conntrack reassembly expire
code with standard one) use the standard IPv6 reassembly
code(ip6_expire_frag_queue) to handle conntrack reassembly expire.
In ip6_expire_frag_queue, it invoke dev_get_by_index_rcu to get
which device received this expired packet.so we must save ifindex
when NF_conntrack get this packet.
With this patch applied, I can see ICMP Time Exceeded sent
from the receiver when the sender sent out 1/2 fragmented
IPv6 packet.
Signed-off-by: Haibo Xi <haibbo@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove ambiguity of double negation.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since (a0ecb85 netfilter: nf_nat: Handle routing changes in MASQUERADE
target), the MASQUERADE target handles routing changes which affect
the output interface of a connection, but only for ESTABLISHED
connections. It is also possible for NEW connections which
already have a conntrack entry to be affected by routing changes.
This adds a check to drop entries in the NEW+conntrack state
when the oif has changed.
Signed-off-by: Andrew Collins <bsderandrew@gmail.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The following commit breaks IPv6 TCP transmission for me:
Commit 75fe83c322
Author: Vlad Yasevich <vyasevic@redhat.com>
Date: Fri Nov 16 09:41:21 2012 +0000
ipv6: Preserve ipv6 functionality needed by NET
This patch fixes the typo "ipv6_offload" which should be
"ipv6-offload".
I don't know why not including the offload modules should
break TCP. Disabling all offload options on the NIC didn't
help. Outgoing pulseaudio traffic kept stalling.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:
unreferenced object 0xffff88022e8a92c0 (size 1592):
comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
hex dump (first 32 bytes):
0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
[<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
[<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
[<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
[<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
[<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
[<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
[<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
[<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
[<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
[<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
[<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
[<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
[<ffffffff814cee68>] ip_rcv+0x217/0x267
[<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
[<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82
This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.
This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
xfrm,...
Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.
Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().
A similar approach is taken for dccp by calling dccp_done().
This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function ndisc_redirect_rcv(), the skb->data points to the transport
header, but function icmpv6_notify() need the skb->data points to the
inner IP packet. So before using icmpv6_notify() to propagate redirect,
change skb->data to point the inner IP packet that triggered the sending
of the Redirect, and introduce struct rd_msg to make it easy.
Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a natural number n exists where 2 + data_len <= 8n < 2 + data_len + pad,
post padding is not initialized correctly.
(Un)fortunately, the only type that requires pad is Infiniband,
whose pad is 2 and data_len is 20, and this logical error has not
become obvious, but it is better to fix.
Note that ndisc_opt_addr_space() handles the situation described
above correctly.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These symbols were exported for bonding device by commit 305d552a
("bonding: send IPv6 neighbor advertisement on failover").
It bacame obsolete by commit 7c899432 ("bonding, ipv4, ipv6, vlan: Handle
NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS") and removed by
commit 4f5762ec ("bonding: Remove obsolete source file 'bond_ipv6.c'").
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_sock_mc_close() is called for ipv6 sockets at close time, and most
of them don't use multicast.
Add a test to avoid contention on a shared spinlock.
Same heuristic applies for ipv6_sock_ac_close(), to avoid contention
on a shared rwlock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We talk about IPv6, hence the family is RTNL_FAMILY_IP6MR!
rtnl_register() is already called with RTNL_FAMILY_IP6MR.
The bug is here since the beginning of this function (commit 5b285cac35).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows to monitor mf6c activities via rtnetlink.
To avoid parsing two times the mf6c oifs, we use maxvif to allocate the rtnl
msg, thus we may allocate some superfluous space.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ip[6]_mr_cache allows to get all mfc entries, even if they are put in
the unresolved list (mfc[6]_unres_queue). But only the table RT_TABLE_DEFAULT is
displayed.
This patch adds the parsing of the unresolved list when the dump is made via
rtnetlink, hence each table can be checked.
In IPv6, we set rtm_type in ip6mr_fill_mroute(), because in case of unresolved
mfc __ip6mr_fill_mroute() will not set it. In IPv4, it is already done.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A mfc entry can be static or not (added via the mroute_sk socket). The patch
reports MFC_STATIC flag into rtm_protocol by setting rtm_protocol to
RTPROT_STATIC or RTPROT_MROUTED.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These statistics can be checked only via /proc/net/ip_mr_cache or
SIOCGETSGCNT[_IN6] and thus only for the table RT_TABLE_DEFAULT.
Advertising them via rtnetlink allows to get statistics for all cache entries,
whatever the table is.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>