OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Roy.Li	01b7806cdc	ipv6: remove a rcu_read_lock in ndisc_constructor in6_dev_get(dev) takes a reference on struct inet6_dev, we dont need rcu locking in ndisc_constructor() Signed-off-by: Roy.Li <rongqing.li@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-17 19:27:56 -04:00
Eric Dumazet	87fb4b7b53	net: more accurate skb truesize skb truesize currently accounts for sk_buff struct and part of skb head. kmalloc() roundings are also ignored. Considering that skb_shared_info is larger than sk_buff, its time to take it into account for better memory accounting. This patch introduces SKB_TRUESIZE(X) macro to centralize various assumptions into a single place. At skb alloc phase, we put skb_shared_info struct at the exact end of skb head, to allow a better use of memory (lowering number of reallocations), since kmalloc() gives us power-of-two memory blocks. Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are aligned to cache lines, as before. Note: This patch might trigger performance regressions because of misconfigured protocol stacks, hitting per socket or global memory limits that were previously not reached. But its a necessary step for a more accurate memory accounting. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <ak@linux.intel.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-13 16:05:07 -04:00
Yan, Zheng	cdaf557034	gro: refetch inet6_protos[] after pulling ext headers ipv6_gro_receive() doesn't update the protocol ops after pulling the ext headers. It looks like a typo. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-10 14:26:16 -04:00
David S. Miller	88c5100c28	Merge branch 'master' of github.com:davem330/net Conflicts: net/batman-adv/soft-interface.c	2011-10-07 13:38:43 -04:00
Yan, Zheng	260fcbeb1a	tcp: properly handle md5sig_pool references tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers only hold one reference to md5sig_pool. but tcp_v4_md5_do_add() increases use count of md5sig_pool for each peer. This patch makes tcp_v4_md5_do_add() only increases use count for the first tcp md5sig peer. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-04 23:31:24 -04:00
Yan, Zheng	676a1184e8	ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket ipv6_ac_list and ipv6_fl_list from listening socket are inadvertently shared with new socket created for connection. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-29 00:32:10 -04:00
Ben Greear	67928c4041	ipv6-multicast: Fix memory leak in IPv6 multicast. If reg_vif_xmit cannot find a routing entry, be sure to free the skb before returning the error. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:34:00 -04:00
Madalin Bucur	fbe5818690	ipv6: check return value for dst_alloc return value of dst_alloc must be checked before use Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:32:06 -04:00
Ben Greear	2015de5fe2	ipv6-multicast: Fix memory leak in input path. Have to free the skb before returning if we fail the fib lookup. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:16:08 -04:00
Eric Dumazet	b82d1bb4fd	tcp: unalias tcp_skb_cb flags and ip_dsfield struct tcp_skb_cb contains a "flags" field containing either tcp flags or IP dsfield depending on context (input or output path) Introduce ip_dsfield to make the difference clear and ease maintenance. If later we want to save space, we can union flags/ip_dsfield Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 02:20:08 -04:00
David S. Miller	8decf86879	Merge branch 'master' of github.com:davem330/net Conflicts: MAINTAINERS drivers/net/Kconfig drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c drivers/net/ethernet/broadcom/tg3.c drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c drivers/net/wireless/rt2x00/rt2800usb.c drivers/net/wireless/wl12xx/main.c	2011-09-22 03:23:13 -04:00
Roy Li	8603e33d01	ipv6: fix a possible double free When calling snmp6_alloc_dev fails, the snmp6 relevant memory are freed by snmp6_alloc_dev. Calling in6_dev_finish_destroy will free these memory twice. Double free will lead that undefined behavior occurs. Signed-off-by: Roy Li <rongqing.li@windriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-20 15:10:16 -04:00
Eric Dumazet	d24f22f3df	ip6_tunnel: add optional fwmark inherit Add IP6_TNL_F_USE_ORIG_FWMARK to ip6_tunnel, so that ip6_tnl_xmit2() makes a route lookup taking into account skb->fwmark and doesnt cache lookup result. This permits more flexibility in policies and firewall setups. To setup such a tunnel, "fwmark inherit" option should be added to "ip -f inet6 tunnel" command. Reported-by: Anders Franzen <Anders.Franzen@ericsson.com> CC: Hans Schillström <hans.schillstrom@ericsson.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-20 14:50:00 -04:00
Yan, Zheng	8e2ec63917	ipv6: don't use inetpeer to store metrics for routes. Current IPv6 implementation uses inetpeer to store metrics for routes. The problem of inetpeer is that it doesn't take subnet prefix length in to consideration. If two routes have the same address but different prefix length, they share same inetpeer. So changing metrics of one route also affects the other. The fix is to allocate separate metrics storage for each route. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-17 00:57:26 -04:00
Tore Anderson	026359bc6e	ipv6: Send ICMPv6 RSes only when RAs are accepted This patch improves the logic determining when to send ICMPv6 Router Solicitations, so that they are 1) always sent when the kernel is accepting Router Advertisements, and 2) never sent when the kernel is not accepting RAs. In other words, the operational setting of the "accept_ra" sysctl is used. The change also makes the special "Hybrid Router" forwarding mode ("forwarding" sysctl set to 2) operate exactly the same as the standard Router mode (forwarding=1). The only difference between the two was that RSes was being sent in the Hybrid Router mode only. The sysctl documentation describing the special Hybrid Router mode has therefore been removed. Rationale for the change: Currently, the value of forwarding sysctl is the only thing determining whether or not to send RSes. If it has the value 0 or 2, they are sent, otherwise they are not. This leads to inconsistent behaviour in the following cases: * accept_ra=0, forwarding=0 * accept_ra=0, forwarding=2 * accept_ra=1, forwarding=2 * accept_ra=2, forwarding=1 In the first three cases, the kernel will send RSes, even though it will not accept any RAs received in reply. In the last case, it will not send any RSes, even though it will accept and process any RAs received. (Most routers will send unsolicited RAs periodically, so suppressing RSes in the last case will merely delay auto-configuration, not prevent it.) Also, it is my opinion that having the forwarding sysctl control RS sending behaviour (completely independent of whether RAs are being accepted or not) is simply not what most users would intuitively expect to be the case. Signed-off-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-16 19:14:41 -04:00
David S. Miller	52b9aca7ae	Merge branch 'master' of ../netdev/	2011-09-16 01:09:02 -04:00
Eric Dumazet	946cedccbd	tcp: Change possible SYN flooding messages "Possible SYN flooding on port xxxx " messages can fill logs on servers. Change logic to log the message only once per listener, and add two new SNMP counters to track : TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client TCPReqQFullDrop : number of times a SYN request was dropped because syncookies were not enabled. Based on a prior patch from Tom Herbert, and suggestions from David. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 14:49:43 -04:00
David S. Miller	7858241655	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6	2011-08-30 17:43:56 -04:00
Maciej Żenczykowski	ec0506dbe4	net: relax PKTINFO non local ipv6 udp xmit check Allow transparent sockets to be less restrictive about the source ip of ipv6 udp packets being sent. Google-Bug-Id: 5018138 Signed-off-by: Maciej Żenczykowski <maze@google.com> CC: "Erik Kline" <ek@google.com> CC: "Lorenzo Colitti" <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-30 17:39:01 -04:00
Florian Westphal	c6675233f9	netfilter: nf_queue: reject NF_STOLEN verdicts from userspace A userspace listener may send (bogus) NF_STOLEN verdict, which causes skb leak. This problem was previously fixed via `64507fdbc2` (netfilter: nf_queue: fix NF_STOLEN skb leak) but this had to be reverted because NF_STOLEN can also be returned by a netfilter hook when iterating the rules in nf_reinject. Reject userspace NF_STOLEN verdict, as suggested by Michal Miroslaw. This is complementary to commit `fad5444043` (netfilter: avoid double free in nf_reinject). Cc: Julian Anastasov <ja@ssi.bg> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-08-30 15:01:20 +02:00
Ian Campbell	408dadf03f	net: ipv6: convert to SKB frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:52:12 -07:00
Yan, Zheng	e05c4ad3ed	mcast: Fix source address selection for multicast listener report Should check use count of include mode filter instead of total number of include mode filters. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:46:15 -07:00
David S. Miller	823dcd2506	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-08-20 10:39:12 -07:00
Daniel Baluta	98e77438ae	ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS IPV6_2292PKTOPTIONS is broken for 32-bit applications running in COMPAT mode on 64-bit kernels. The same problem was fixed for IPv4 with the patch: ipv4: Fix ip_getsockopt for IP_PKTOPTIONS, commit `dd23198e58` Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-19 03:19:07 -07:00
Tom Herbert	bdeab99191	rps: Add flag to skb to indicate rxhash is based on L4 tuple The l4_rxhash flag was added to the skb structure to indicate that the rxhash value was computed over the 4 tuple for the packet which includes the port information in the encapsulated transport packet. This is used by the stack to preserve the rxhash value in __skb_rx_tunnel. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-17 20:06:03 -07:00
Lionel Elie Mamane	c2bceb3d7f	sit tunnels: propagate IPv6 transport class to IPv4 Type of Service sit tunnels (IPv6 tunnel over IPv4) do not implement the "tos inherit" case to copy the IPv6 transport class byte from the inner packet to the IPv4 type of service byte in the outer packet. By contrast, ipip tunnels and GRE tunnels do. This patch, adapted from the similar code in net/ipv4/ipip.c and net/ipv4/ip_gre.c, implements that. This patch applies to 3.0.1, and has been tested on that version. Signed-off-by: Lionel Elie Mamane <lionel@mamane.lu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-16 16:28:55 -07:00
Eric Dumazet	33d480ce6d	net: cleanup some rcu_dereference_raw RCU api had been completed and rcu_access_pointer() or rcu_dereference_protected() are better than generic rcu_dereference_raw() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-12 02:55:28 -07:00
Mike Waychison	f0e3d0689d	tcp: initialize variable ecn_ok in syncookies path Using a gcc 4.4.3, warnings are emitted for a possibly uninitialized use of ecn_ok. This can happen if cookie_check_timestamp() returns due to not having seen a timestamp. Defaulting to ecn off seems like a reasonable thing to do in this case, so initialized ecn_ok to false. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-10 21:59:57 -07:00
David S. Miller	19fd61785a	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-08-07 23:20:26 -07:00
David S. Miller	6e5714eaf7	net: Compute protocol sequence numbers and fragment IDs using MD5. Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-06 18:33:19 -07:00
Max Matveev	c15fea2d8c	ipv6: check for IPv4 mapped addresses when connecting IPv6 sockets When support for binding to 'mapped INADDR_ANY (::ffff.0.0.0.0)' was added in `0f8d3c7ac3` the rest of the code wasn't told so now it's possible to bind IPv6 datagram socket to ::ffff.0.0.0.0, connect it to another IPv4 address and it will all work except for getsockhame() which does not return the local address as expected. To give getsockname() something to work with check for 'mapped INADDR_ANY' when connecting and update the in-core source addresses appropriately. Signed-off-by: Max Matveev <makc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-05 03:56:30 -07:00
Eric Dumazet	f2c31e32b3	net: fix NULL dereferences in check_peer_redir() Gergely Kalman reported crashes in check_peer_redir(). It appears commit `f39925dbde` (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-03 03:34:12 -07:00
Stephen Hemminger	a9b3cd7f32	rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER When assigning a NULL value to an RCU protected pointer, no barrier is needed. The rcu_assign_pointer, used to handle that but will soon change to not handle the special case. Convert all rcu_assign_pointer of NULL value. //smpl @@ expression P; @@ - rcu_assign_pointer(P, NULL) + RCU_INIT_POINTER(P, NULL) // </smpl> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-02 04:29:23 -07:00
Lorenzo Colitti	76f793e3a4	ipv6: updates to privacy addresses per RFC 4941. Update the code to handle some of the differences between RFC 3041 and RFC 4941, which obsoletes it. Also a couple of janitorial fixes. - Allow router advertisements to increase the lifetime of temporary addresses. This was not allowed by RFC 3041, but is specified by RFC 4941. It is useful when RA lifetimes are lower than TEMP_{VALID,PREFERRED}_LIFETIME: in this case, the previous code would delete or deprecate addresses prematurely. - Change the default of MAX_RETRY to 3 per RFC 4941. - Add a comment to clarify that the preferred and valid lifetimes in inet6_ifaddr are relative to the timestamp. - Shorten lines to 80 characters in a couple of places. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 18:05:00 -07:00
Eric Dumazet	89b0212697	ip6tnl: avoid touching dst refcount in ip6_tnl_xmit2() Even using percpu stats, we still hit tunnel dst_entry refcount in ip6_tnl_xmit2() Since we are in RCU locked section, we can use skb_dst_set_noref() and avoid these atomic operations, leaving dst shared on cpus. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	897dc80b95	ipv6: avoid a dst_entry refcount change in ipv6_destopt_rcv() ipv6_destopt_rcv() runs with rcu_read_lock(), so there is no need to take a temporay reference on dst_entry, even if skb is freed by ip6_parse_tlv() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	d14730b8e9	ipv6: use RCU in inet6_csk_xmit() Use RCU to avoid changing dst_entry refcount in fast path. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	cfdf76474e	ipv6: some RCU conversions ICMP and ND are not fast path, but still we can avoid changing idev refcount, using RCU. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Jesper Juhl	91c66c6893	netfilter: ip_queue: Fix small leak in ipq_build_packet_message() ipq_build_packet_message() in net/ipv4/netfilter/ip_queue.c and net/ipv6/netfilter/ip6_queue.c contain a small potential mem leak as far as I can tell. We allocate memory for 'skb' with alloc_skb() annd then call nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh)); NLMSG_PUT is a macro NLMSG_PUT(skb, pid, seq, type, len) \ NLMSG_NEW(skb, pid, seq, type, len, 0) that expands to NLMSG_NEW, which is also a macro which expands to: NLMSG_NEW(skb, pid, seq, type, len, flags) \ ({ if (unlikely(skb_tailroom(skb) < (int)NLMSG_SPACE(len))) \ goto nlmsg_failure; \ __nlmsg_put(skb, pid, seq, type, len, flags); }) If we take the true branch of the 'if' statement and 'goto nlmsg_failure', then we'll, at that point, return from ipq_build_packet_message() without having assigned 'skb' to anything and we'll leak the memory we allocated for it when it goes out of scope. Fix this by placing a 'kfree(skb)' at 'nlmsg_failure'. I admit that I do not know how likely this to actually happen or even if there's something that guarantees that it will never happen - I'm not that familiar with this code, but if that is so, I've not been able to spot it. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-07-29 16:38:49 +02:00
Linus Torvalds	d5eab9152a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits) tg3: Remove 5719 jumbo frames and TSO blocks tg3: Break larger frags into 4k chunks for 5719 tg3: Add tx BD budgeting code tg3: Consolidate code that calls tg3_tx_set_bd() tg3: Add partial fragment unmapping code tg3: Generalize tg3_skb_error_unmap() tg3: Remove short DMA check for 1st fragment tg3: Simplify tx bd assignments tg3: Reintroduce tg3_tx_ring_info ASIX: Use only 11 bits of header for data size ASIX: Simplify condition in rx_fixup() Fix cdc-phonet build bonding: reduce noise during init bonding: fix string comparison errors net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared net: add IFF_SKB_TX_SHARED flag to priv_flags net: sock_sendmsg_nosec() is static forcedeth: fix vlans gianfar: fix bug caused by `87c288c6e9` gro: Only reset frag0 when skb can be pulled ...	2011-07-28 05:58:19 -07:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
YOSHIFUJI Hideaki	32019e651c	ipv6: Do not leave router anycast address for /127 prefixes. Original commit 2bda8a0c8af... "Disable router anycast address for /127 prefixes" says: \| No need for matching code in addrconf_leave_anycast() as it \| will silently ignore any attempt to leave an unknown anycast \| address. After analysis, because 1) we may add two or more prefixes on the same interface, or 2)user may have manually joined that anycast, we may hit chances to have anycast address which as if we had generated one by /127 prefix and we should not leave from subnet- router anycast address unconditionally. CC: Bjørn Mork <bjorn@mork.no> CC: Brian Haley <brian.haley@hp.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-25 16:16:00 -07:00
Eric Dumazet	87c48fa3b4	ipv6: make fragment identifications less predictable IPv6 fragment identification generation is way beyond what we use for IPv4 : It uses a single generator. Its not scalable and allows DOS attacks. Now inetpeer is IPv6 aware, we can use it to provide a more secure and scalable frag ident generator (per destination, instead of system wide) This patch : 1) defines a new secure_ipv6_id() helper 2) extends inet_getid() to provide 32bit results 3) extends ipv6_select_ident() with a new dest parameter Reported-by: Fernando Gont <fernando@gont.com.ar> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-21 21:25:58 -07:00
Eric Dumazet	21efcfa0ff	ipv6: unshare inetpeers We currently cow metrics a bit too soon in IPv6 case : All routes are tied to a single inetpeer entry. Change ip6_rt_copy() to get destination address as second argument, so that we fill rt6i_dst before the dst_copy_metrics() call. icmp6_dst_alloc() must set rt6i_dst before calling dst_metric_set(), or else the cow is done while rt6i_dst is still NULL. If orig route points to readonly metrics, we can share the pointer instead of performing the memory allocation and copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-21 21:24:25 -07:00
David S. Miller	d3aaeb38c4	net: Add ->neigh_lookup() operation to dst_ops In the future dst entries will be neigh-less. In that environment we need to have an easy transition point for current users of dst->neighbour outside of the packet output fast path. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-18 00:40:17 -07:00
David S. Miller	69cce1d140	net: Abstract dst->neighbour accesses behind helpers. dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:35 -07:00
David S. Miller	9cbb7ecbcf	ipv6: Get rid of rt6i_nexthop macro. It just makes it harder to see 1) what the code is doing and 2) grep for all users of dst{->,.}neighbour Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:35 -07:00
David S. Miller	8f40b161de	neigh: Pass neighbour entry to output ops. This will get us closer to being able to do "neigh stuff" completely independent of the underlying dst_entry for protocols (ipv4/ipv6) that wish to do so. We will also be able to make dst entries neigh-less. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:17 -07:00
David S. Miller	542d4d685f	neigh: Kill ndisc_ops->queue_xmit It is always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 18:30:59 -07:00
David S. Miller	47ec132a40	neigh: Kill neigh_ops->hh_output It's always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 17:39:57 -07:00
David S. Miller	05e3aa0949	net: Create and use new helper, neigh_output(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 17:26:00 -07:00
David S. Miller	a29282972c	ipv6: Use calculated 'neigh' instead of re-evaluating dst->neighbour Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 14:30:47 -07:00
David S. Miller	f6b72b6217	net: Embed hh_cache inside of struct neighbour. Now that there is a one-to-one correspondance between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-14 07:53:20 -07:00
Bjørn Mork	2bda8a0c8a	Disable router anycast address for /127 prefixes RFC 6164 requires that routers MUST disable Subnet-Router anycast for the prefix when /127 prefixes are used. No need for matching code in addrconf_leave_anycast() as it will silently ignore any attempt to leave an unknown anycast address. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-07 04:15:10 -07:00
David S. Miller	e12fe68ce3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-07-05 23:23:37 -07:00
Marcus Meissner	c349a528cd	net: bind() fix error return on wrong address family Hi, Reinhard Max also pointed out that the error should EAFNOSUPPORT according to POSIX. The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN. Other protocols error values in their af bind() methods in current mainline git as far as a brief look shows: EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip Ciao, Marcus Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-04 21:37:41 -07:00
David S. Miller	957c665f37	ipv6: Don't put artificial limit on routing table size. IPV6, unlike IPV4, doesn't have a routing cache. Routing table entries, as well as clones made in response to route lookup requests, all live in the same table. And all of these things are together collected in the destination cache table for ipv6. This means that routing table entries count against the garbage collection limits, even though such entries cannot ever be reclaimed and are added explicitly by the administrator (rather than being created in response to lookups). Therefore it makes no sense to count ipv6 routing table entries against the GC limits. Add a DST_NOCOUNT destination cache entry flag, and skip the counting if it is set. Use this flag bit in ipv6 when adding routing table entries. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 17:30:43 -07:00
David S. Miller	11d53b4990	ipv6: Don't change dst->flags using assignments. This blows away any flags already set in the entry. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 17:30:43 -07:00
Joe Perches	207ec0abbe	ipv6: Reduce switch/case indent Make the case labels the same indent as the switch. git diff -w shows 80 column reflowing, removal of a useless break after return, and moving open brace after case instead of separate line. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 16:11:16 -07:00
Xufeng Zhang	9cfaa8def1	udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packet Consider this scenario: When the size of the first received udp packet is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags. However, if checksum error happens and this is a blocking socket, it will goto try_again loop to receive the next packet. But if the size of the next udp packet is smaller than receive buffer, MSG_TRUNC flag should not be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before receive the next packet, MSG_TRUNC is still set, which is wrong. Fix this problem by clearing MSG_TRUNC flag when starting over for a new packet. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-21 22:34:27 -07:00
Xufeng Zhang	32c90254ed	ipv6/udp: Use the correct variable to determine non-blocking condition udpv6_recvmsg() function is not using the correct variable to determine whether or not the socket is in non-blocking operation, this will lead to unexpected behavior when a UDP checksum error occurs. Consider a non-blocking udp receive scenario: when udpv6_recvmsg() is called by sock_common_recvmsg(), MSG_DONTWAIT bit of flags variable in udpv6_recvmsg() is cleared by "flags & ~MSG_DONTWAIT" in this call: err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); i.e. with udpv6_recvmsg() getting these values: int noblock = flags & MSG_DONTWAIT int flags = flags & ~MSG_DONTWAIT So, when udp checksum error occurs, the execution will go to csum_copy_err, and then the problem happens: csum_copy_err: ............... if (flags & MSG_DONTWAIT) return -EAGAIN; goto try_again; ............... But it will always go to try_again as MSG_DONTWAIT has been cleared from flags at call time -- only noblock contains the original value of MSG_DONTWAIT, so the test should be: if (noblock) return -EAGAIN; This is also consistent with what the ipv4/udp code does. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-21 22:34:27 -07:00
David S. Miller	9f6ec8d697	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c	2011-06-20 22:29:08 -07:00
Eric Dumazet	1eddceadb0	net: rfs: enable RFS before first data packet is received Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit : > From: Ben Hutchings <bhutchings@solarflare.com> > Date: Fri, 17 Jun 2011 00:50:46 +0100 > > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote: > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock sk, struct sk_buff skb) > >> goto discard; > >> > >> if (nsk != sk) { > >> + sock_rps_save_rxhash(nsk, skb->rxhash); > >> if (tcp_child_process(sk, nsk, skb)) { > >> rsk = nsk; > >> goto reset; > >> > > > > I haven't tried this, but it looks reasonable to me. > > > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar. > > Indeed ipv6 side needs the same fix. > > Eric please add that part and resubmit. And in fact I might stick > this into net-2.6 instead of net-next-2.6 > OK, here is the net-2.6 based one then, thanks ! [PATCH v2] net: rfs: enable RFS before first data packet is received First packet received on a passive tcp flow is not correctly RFS steered. One sock_rps_record_flow() call is missing in inet_accept() But before that, we also must record rxhash when child socket is setup. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@conan.davemloft.net>	2011-06-17 15:27:31 -04:00
Nicolas Cavallari	2c38de4c1f	netfilter: fix looped (broad\|multi)cast's MAC handling By default, when broadcast or multicast packet are sent from a local application, they are sent to the interface then looped by the kernel to other local applications, going throught netfilter hooks in the process. These looped packet have their MAC header removed from the skb by the kernel looping code. This confuse various netfilter's netlink queue, netlink log and the legacy ip_queue, because they try to extract a hardware address from these packets, but extracts a part of the IP header instead. This patch prevent NFQUEUE, NFLOG and ip_QUEUE to include a MAC header if there is none in the packet. Signed-off-by: Nicolas Cavallari <cavallar@lri.fr> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-06-16 17:27:04 +02:00
Greg Rose	c7ac8679be	rtnetlink: Compute and store minimum ifinfo dump size The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2011-06-09 20:38:07 -07:00
Jerry Chu	9ad7c049f0	tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-08 17:05:30 -07:00
stephen hemminger	aee80b54b2	ipv6: generate link local address for GRE tunnel Use same logic as SIT tunnel to handle link local address for GRE tunnel. OSPFv3 requires link-local address to function. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-08 17:05:30 -07:00
Marcus Meissner	5a079c305a	net/ipv6: check for mistakenly passed in non-AF_INET6 sockaddrs Same check as for IPv4, also do for IPv6. (If you passed in a IPv4 sockaddr_in here, the sizeof check in the line before would have triggered already though.) Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-06 14:48:16 -07:00
Dave Jones	d232b8dded	netfilter: use unsigned variables for packet lengths in ip[6]_queue. Netlink message lengths can't be negative, so use unsigned variables. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:37:16 +02:00
Pablo Neira Ayuso	88ed01d17b	netfilter: nf_conntrack: fix ct refcount leak in l4proto->error() This patch fixes a refcount leak of ct objects that may occur if l4proto->error() assigns one conntrack object to one skbuff. In that case, we have to skip further processing in nf_conntrack_in(). With this patch, we can also fix wrong return values (-NF_ACCEPT) for special cases in ICMP[v6] that should not bump the invalid/error statistic counters. Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:37:02 +02:00
Eric Dumazet	fb04883371	netfilter: add more values to enum ip_conntrack_info Following error is raised (and other similar ones) : net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’: net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’ not in enumerated type ‘enum ip_conntrack_info’ gcc barfs on adding two enum values and getting a not enumerated result : case IP_CT_RELATED+IP_CT_IS_REPLY: Add missing enum values Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: David Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:35:10 +02:00
Dan Rosenberg	71338aa7d0	net: convert %p usage to %pK The %pK format specifier is designed to hide exposed kernel pointers, specifically via /proc interfaces. Exposing these pointers provides an easy target for kernel write vulnerabilities, since they reveal the locations of writable structures containing easily triggerable function pointers. The behavior of %pK depends on the kptr_restrict sysctl. If kptr_restrict is set to 0, no deviation from the standard %p behavior occurs. If kptr_restrict is set to 1, the default, if the current user (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG (currently in the LSM tree), kernel pointers using %pK are printed as 0's. If kptr_restrict is set to 2, kernel pointers using %pK are printed as 0's regardless of privileges. Replacing with 0's was chosen over the default "(null)", which cannot be parsed by userland %p, which expects "(nil)". The supporting code for kptr_restrict and %pK are currently in the -mm tree. This patch converts users of %p in net/ to %pK. Cases of printing pointers to the syslog are not covered, since this would eliminate useful information for postmortem debugging and the reading of the syslog is already optionally protected by the dmesg_restrict sysctl. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: James Morris <jmorris@namei.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Graf <tgraf@infradead.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Kees Cook <kees.cook@canonical.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-24 01:13:12 -04:00
David S. Miller	6ac3f66492	ipv6: Fix return of xfrm6_tunnel_rcv() Like ipv4, just return xfrm6_rcv_spi()'s return value directly. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-24 01:11:51 -04:00
Florian Westphal	0f6c6392dc	ipv6: copy prefsrc setting when copying route entry commit `c3968a857a` ('ipv6: RTA_PREFSRC support for ipv6 route source address selection') added support for ipv6 prefsrc as an alternative to ipv6 addrlabels, but it did not work because the prefsrc entry was not copied. Cc: Daniel Walter <sahne@0x90.at> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-21 02:05:22 -04:00
Linus Torvalds	06f4e926d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.	2011-05-20 13:43:21 -07:00
Linus Torvalds	eb04f2f04e	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits) Revert "rcu: Decrease memory-barrier usage based on semi-formal proof" net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree() batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu() net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu() net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu() perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu() perf,rcu: convert call_rcu(free_ctx) to kfree_rcu() net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(net_generic_release) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu() security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu() net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu() net,rcu: convert call_rcu(xps_map_release) to kfree_rcu() net,rcu: convert call_rcu(rps_map_release) to kfree_rcu() ...	2011-05-19 18:14:34 -07:00
Eric Dumazet	be281e554e	ipv6: reduce per device ICMP mib sizes ipv6 has per device ICMP SNMP counters, taking too much space because they use percpu storage. needed size per device is : (512+4)sizeof(long)number_of_possible_cpus*2 On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of memory per ipv6 enabled network device, taken in vmalloc pool. Since ICMP messages are rare, just use shared counters (atomic_long_t) Per network space ICMP counters are still using percpu memory, we might also convert them to shared counters in a future patch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-19 16:21:22 -04:00
David S. Miller	3c709f8fb4	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6 Conflicts: drivers/net/benet/be_main.c	2011-05-11 14:26:58 -04:00
David S. Miller	9bbc052d5e	Merge branch 'pablo/nf-2.6-updates' of git://1984.lsi.us.es/net-2.6	2011-05-10 15:04:35 -07:00
Steffen Klassert	43a4dea4c9	xfrm: Assign the inner mode output function to the dst entry As it is, we assign the outer modes output function to the dst entry when we create the xfrm bundle. This leads to two problems on interfamily scenarios. We might insert ipv4 packets into ip6_fragment when called from xfrm6_output. The system crashes if we try to fragment an ipv4 packet with ip6_fragment. This issue was introduced with git commit `ad0081e4` (ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed). The second issue is, that we might insert ipv4 packets in netfilter6 and vice versa on interfamily scenarios. With this patch we assign the inner mode output function to the dst entry when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner mode is used and the right fragmentation and netfilter functions are called. We switch then to outer mode with the output_finish functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-10 15:03:34 -07:00
Fernando Luis Vazquez Cao	4319cc0cf5	netfilter: IPv6: initialize TOS field in REJECT target module The IPv6 header is not zeroed out in alloc_skb so we must initialize it properly unless we want to see IPv6 packets with random TOS fields floating around. The current implementation resets the flow label but this could be changed if deemed necessary. We stumbled upon this issue when trying to apply a mangle rule to the RST packet generated by the REJECT target module. Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-05-10 09:55:44 +02:00
David S. Miller	d9d8da805d	inet: Pass flowi to ->queue_xmit(). This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 15:28:28 -07:00
Paul E. McKenney	11c476f31a	net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree The RCU callback prl_entry_destroy_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:15 -07:00
Lai Jiangshan	e3cbf28fa6	net,rcu: convert call_rcu(ipv6_mc_socklist_reclaim) to kfree_rcu() The rcu callback ipv6_mc_socklist_reclaim() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(ipv6_mc_socklist_reclaim). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:02 -07:00
Lai Jiangshan	e57859854a	net,rcu: convert call_rcu(inet6_ifa_finish_destroy_rcu) to kfree_rcu() The rcu callback inet6_ifa_finish_destroy_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(inet6_ifa_finish_destroy_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:50 -07:00
Lai Jiangshan	38f57d1a4b	net,rcu: convert call_rcu(in6_dev_finish_destroy_rcu) to kfree_rcu() The rcu callback in6_dev_finish_destroy_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(in6_dev_finish_destroy_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:49 -07:00
David S. Miller	bdc712b4c2	inet: Decrease overhead of on-stack inet_cork. When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary. This is because inet_cork contains a "struct flowi" which isn't used in these code paths. Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2011-05-06 15:37:57 -07:00
David S. Miller	7143b7d412	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/tg3.c	2011-05-05 14:59:02 -07:00
Jiri Pirko	1c5cae815d	net: call dev_alloc_name from register_netdevice Force dev_alloc_name() to be called from register_netdevice() by dev_get_valid_name(). That allows to remove multiple explicit dev_alloc_name() calls. The possibility to call dev_alloc_name in advance remains. This also fixes veth creation regresion caused by `84c49d8c3e` Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-05 10:57:45 -07:00
David S. Miller	301102cc83	ipv6: Use flowi4->{daddr,saddr} in ipip6_tunnel_xmit(). Instead of rt->rt_{dst,src} Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-04 12:55:07 -07:00
David S. Miller	31e4543db2	ipv4: Make caller provide on-stack flow key to ip_route_output_ports(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-03 20:25:42 -07:00
Lucian Adrian Grijincu	ff538818f4	sysctl: net: call unregister_net_sysctl_table where needed ctl_table_headers registered with register_net_sysctl_table should have been unregistered with the equivalent unregister_net_sysctl_table Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-02 16:12:14 -07:00
Eric Dumazet	e67f88dd12	net: dont hold rtnl mutex during netlink dump callbacks Four years ago, Patrick made a change to hold rtnl mutex during netlink dump callbacks. I believe it was a wrong move. This slows down concurrent dumps, making good old /proc/net/ files faster than rtnetlink in some situations. This occurred to me because one "ip link show dev ..." was _very_ slow on a workload adding/removing network devices in background. All dump callbacks are able to use RCU locking now, so this patch does roughly a revert of commits : `1c2d670f36` : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks `6313c1e099` : [RTNETLINK]: Remove unnecessary locking in dump callbacks This let writers fight for rtnl mutex and readers going full speed. It also takes care of phonet : phonet_route_get() is now called from rcu read section. I renamed it to phonet_route_get_rcu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-02 15:26:28 -07:00
Ben Hutchings	ad246c992b	ipv4, ipv6, bonding: Restore control over number of peer notifications For backward compatibility, we should retain the module parameters and sysfs attributes to control the number of peer notifications (gratuitous ARPs and unsolicited NAs) sent after bonding failover. Also, it is possible for failover to take place even though the new active slave does not have link up, and in that case the peer notification should be deferred until it does. Change ipv4 and ipv6 so they do not automatically send peer notifications on bonding failover. Change the bonding driver to send separate NETDEV_NOTIFY_PEERS notifications when the link is up, as many times as requested. Since it does not directly control which protocols send notifications, make num_grat_arp and num_unsol_na aliases for a single parameter. Bump the bonding version number and update its documentation. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Acked-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-29 12:44:11 -07:00
David S. Miller	cf91166223	net: Use non-zero allocations in dst_alloc(). Make dst_alloc() and it's users explicitly initialize the entire entry. The zero'ing done by kmem_cache_zalloc() was almost entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 22:26:00 -07:00
David S. Miller	5c1e6aa300	net: Make dst_alloc() take more explicit initializations. Now the dst->dev, dev->obsolete, and dst->flags values can be specified as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 22:25:59 -07:00
Shan Wei	96339d6c49	net:use help function of skb_checksum_start_offset to calculate offset Although these are equivalent, but the skb_checksum_start_offset() is more readable. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 13:28:57 -07:00
Eric Dumazet	f6d8bd051c	inet: add RCU protection to inet->opt We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We cant insert an rcu_head in struct ip_options since its included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 13:16:35 -07:00
Steffen Klassert	c0a56e64ae	esp6: Fix scatterlist initialization When we use IPsec extended sequence numbers, we may overwrite the last scatterlist of the associated data by the scatterlist for the skb. This patch fixes this by placing the scatterlist for the skb right behind the last scatterlist of the associated data. esp4 does it already like that. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-26 12:46:04 -07:00
David S. Miller	2bd93d7af1	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Resolved logic conflicts causing a build failure due to drivers/net/r8169.c changes using a patch from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-26 12:16:46 -07:00
Held Bernhard	0972ddb237	net: provide cow_metrics() methods to blackhole dst_ops Since commit `62fa8a846d` (net: Implement read-only protection and COW'ing of metrics.) the kernel throws an oops. [ 101.620985] BUG: unable to handle kernel NULL pointer dereference at (null) [ 101.621050] IP: [< (null)>] (null) [ 101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0 [ 101.621122] Oops: 0010 [#1] SMP [ 101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent [ 101.621192] CPU 2 [ 101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64 twofish_common des_generic cbc ecb sha1_generic hmac af_key iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc evdev uhci_hcd ehci_hcd thermal [ 101.621552] [ 101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3 Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R [ 101.621637] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 101.621684] RSP: 0018:ffff88003ddeba60 EFLAGS: 00010202 [ 101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX: 0000000000000020 [ 101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI: ffff88003ddb5600 [ 101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09: ffff88003ddb5600 [ 101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12: ffff88003ddb5600 [ 101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15: ffff88003ddebad8 [ 101.621924] FS: 00007f06e4182700(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 [ 101.621971] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4: 00000000000006e0 [ 101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 101.622129] Process openl2tpd (pid: 5129, threadinfo ffff88003ddea000, task ffff88003de9a280) [ 101.622177] Stack: [ 101.622191] ffffffff81447efa ffff88007d3ded80 ffff88003de9a280 ffff88007d3ded80 [ 101.622245] 0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7 0000000000000212 [ 101.622299] ffff88003dcea000 ffff88003dcea188 ffffffff00000001 ffffffff81b7e480 [ 101.622353] Call Trace: [ 101.622374] [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210 [ 101.622415] [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510 [ 101.622450] [<ffffffff8127672a>] ? extract_buf+0x9a/0x140 [ 101.622485] [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70 [ 101.622526] [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810 [ 101.622562] [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130 [ 101.622599] [<ffffffff8109df58>] ? find_get_page+0x18/0x90 [ 101.622633] [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0 [ 101.622668] [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90 [ 101.622706] [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0 [ 101.622739] [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420 [ 101.622774] [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0 [ 101.622810] [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130 [ 101.622844] [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380 [ 101.622880] [<ffffffff81024af9>] ? do_page_fault+0x189/0x410 [ 101.622915] [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110 [ 101.622952] [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0 [ 101.622986] [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90 [ 101.623024] [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b [ 101.623060] Code: Bad RIP value. [ 101.623090] RIP [< (null)>] (null) [ 101.623125] RSP <ffff88003ddeba60> [ 101.623146] CR2: 0000000000000000 [ 101.650871] ---[ end trace ca3856a7d8e8dad4 ]--- [ 101.651011] __sk_free: optmem leakage (160 bytes) detected. The oops happens in dst_metrics_write_ptr() include/net/dst.h:124: return dst->ops->cow_metrics(dst, p); dst->ops->cow_metrics is NULL and causes the oops. Provide cow_metrics() methods, like we did in commit `214f45c91b` (net: provide default_advmss() methods to blackhole dst_ops) Signed-off-by: Held Bernhard <berny156@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-25 11:53:08 -07:00
Eric Dumazet	b71d1d426d	inet: constify ip headers and in6_addr Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-22 11:04:14 -07:00
Thomas Egerer	e965c05dab	ipv6: Remove hoplimit initialization to -1 The changes introduced with git-commit `a02e4b7d` ("ipv6: Demark default hoplimit as zero.") missed to remove the hoplimit initialization. As a result, ipv6_get_mtu interprets the return value of dst_metric_raw (-1) as 255 and answers ping6 with this hoplimit. This patche removes the line such that ping6 is answered with the hoplimit value configured via sysctl. Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-21 17:24:08 -07:00
Shan Wei	a9cf73ea7f	ipv6: udp: fix the wrong headroom check At this point, skb->data points to skb_transport_header. So, headroom check is wrong. For some case:bridge(UFO is on) + eth device(UFO is off), there is no enough headroom for IPv6 frag head. But headroom check is always false. This will bring about data be moved to there prior to skb->head, when adding IPv6 frag header to skb. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-21 10:39:10 -07:00
David S. Miller	4805347c1e	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6	2011-04-19 11:24:06 -07:00
David S. Miller	e1943424e4	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x_ethtool.c	2011-04-19 00:21:33 -07:00
Ben Hutchings	7c89943236	bonding, ipv4, ipv6, vlan: Handle NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS It is undesirable for the bonding driver to be poking into higher level protocols, and notifiers provide a way to avoid that. This does mean removing the ability to configure reptitition of gratuitous ARPs and unsolicited NAs. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-17 23:36:03 -07:00
Ben Hutchings	f47b94646f	ipv6: Send unsolicited neighbour advertismements when notified The NETDEV_NOTIFY_PEERS notifier is a request to send such advertisements following migration to a different physical link, e.g. virtual machine migration. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-17 23:35:16 -07:00
David S. Miller	b169f6db40	netfilter: ip6table_mangle: Fix set-but-unused variables. The variable 'flowlabel' is set but unused in ip6t_mangle_out(). The intention here was to compare this key to the header value after mangling, and trigger a route lookup on mismatch. Make it so. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-17 17:06:15 -07:00
David S. Miller	f3c85dd560	netfilter: ip6_tables: Fix set-but-unused variables. The variable 'target' is set but unused in compat_copy_entry_from_user(). Just kill it off. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-17 17:04:48 -07:00
Daniel Walter	c3968a857a	ipv6: RTA_PREFSRC support for ipv6 route source address selection [ipv6] Add support for RTA_PREFSRC This patch allows a user to select the preferred source address for a specific IPv6-Route. It can be set via a netlink message setting RTA_PREFSRC to a valid IPv6 address which must be up on the device the route will be bound to. Signed-off-by: Daniel Walter <dwalter@barracuda.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-15 15:44:37 -07:00
Daniel Walter	bd015928bb	ipv6: ignore looped-back NA while dad is running [ipv6] Ignore looped-back NAs while in Duplicate Address Detection If we send an unsolicited NA shortly after bringing up an IPv6 address, the duplicate address detection algorithm fails and the ip stays in tentative mode forever. This is due a missing check if the NA is looped-back to us. Signed-off-by: Daniel Walter <dwalter@barracuda.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-15 15:43:55 -07:00
David S. Miller	3e8c806a08	Revert "tcp: disallow bind() to reuse addr/port" This reverts commit `c191a836a9`. It causes known regressions for programs that expect to be able to use SO_REUSEADDR to shutdown a socket, then successfully rebind another socket to the same ID. Programs such as haproxy and amavisd expect this to work. This should fix kernel bugzilla 32832. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-13 12:01:14 -07:00
Linus Torvalds	c44eaf41a5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits) net: Add support for SMSC LAN9530, LAN9730 and LAN89530 mlx4_en: Restoring RX buffer pointer in case of failure mlx4: Sensing link type at device initialization ipv4: Fix "Set rt->rt_iif more sanely on output routes." MAINTAINERS: add entry for Xen network backend be2net: Fix suspend/resume operation be2net: Rename some struct members for clarity pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev dsa/mv88e6131: add support for mv88e6085 switch ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2) be2net: Fix a potential crash during shutdown. bna: Fix for handling firmware heartbeat failure can: mcp251x: Allow pass IRQ flags through platform data. smsc911x: fix mac_lock acquision before calling smsc911x_mac_read iwlwifi: accept EEPROM version 0x423 for iwl6000 rt2x00: fix cancelling uninitialized work rtlwifi: Fix some warnings/bugs p54usb: IDs for two new devices wl12xx: fix potential buffer overflow in testmode nvs push zd1211rw: reset rx idle timer from tasklet ...	2011-04-11 07:27:24 -07:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Neil Horman	47482f132a	ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2) properly record sk_rxhash in ipv6 sockets (v2) Noticed while working on another project that flows to sockets which I had open on a test systems weren't getting steered properly when I had RFS enabled. Looking more closely I found that: 1) The affected sockets were all ipv6 2) They weren't getting steered because sk->sk_rxhash was never set from the incomming skbs on that socket. This was occuring because there are several points in the IPv4 tcp and udp code which save the rxhash value when a new connection is established. Those calls to sock_rps_save_rxhash were never added to the corresponding ipv6 code paths. This patch adds those calls. Tested by myself to properly enable RFS functionalty on ipv6. Change notes: v2: Filtered UDP to only arm RFS on bound sockets (Eric Dumazet) Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-06 13:07:09 -07:00
David S. Miller	9d93059497	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6	2011-04-05 14:21:11 -07:00
Boris Ostrovsky	738faca343	ipv6: Don't pass invalid dst_entry pointer to dst_release(). Make sure dst_release() is not called with error pointer. This is similar to commit `4910ac6c52` ("ipv4: Don't ip_rt_put() an error pointer in RAW sockets."). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-04 13:07:26 -07:00
Eric Dumazet	7f5c6d4f66	netfilter: get rid of atomic ops in fast path We currently use a percpu spinlock to 'protect' rule bytes/packets counters, after various attempts to use RCU instead. Lately we added a seqlock so that get_counters() can run without blocking BH or 'writers'. But we really only need the seqcount in it. Spinlock itself is only locked by the current/owner cpu, so we can remove it completely. This cleanups api, using correct 'writer' vs 'reader' semantic. At replace time, the get_counters() call makes sure all cpus are done using the old table. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-04-04 17:04:03 +02:00
Florian Westphal	0fae2e7740	netfilter: af_info: add 'strict' parameter to limit lookup to .oif ipv6 fib lookup can set RT6_LOOKUP_F_IFACE flag to restrict search to an interface, but this flag cannot be set via struct flowi. Also, it cannot be set via ip6_route_output: this function uses the passed sock struct to determine if this flag is required (by testing for nonzero sk_bound_dev_if). Work around this by passing in an artificial struct sk in case 'strict' argument is true. This is required to replace the rt6_lookup call in xt_addrtype.c with nf_afinfo->route(). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-04-04 17:00:54 +02:00
Florian Westphal	31ad3dd64e	netfilter: af_info: add network namespace parameter to route hook This is required to eventually replace the rt6_lookup call in xt_addrtype.c with nf_afinfo->route(). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-04-04 16:56:29 +02:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Timo Teräs	93ca3bb5df	net: gre: provide multicast mappings for ipv4 and ipv6 My commit `6d55cb91a0` (gre: fix hard header destination address checking) broke multicast. The reason is that ip_gre used to get ipgre_header() calls with zero destination if we have NOARP or multicast destination. Instead the actual target was decided at ipgre_tunnel_xmit() time based on per-protocol dissection. Instead of allowing the "abuse" of ->header() calls with invalid destination, this creates multicast mappings for ip_gre. This also fixes "ip neigh show nud noarp" to display the proper multicast mappings used by the gre device. Reported-by: Doug Kehn <rdkehn@yahoo.com> Signed-off-by: Timo Teräs <timo.teras@iki.fi> Acked-by: Doug Kehn <rdkehn@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-30 00:10:47 -07:00
Cesar Eduardo Barros	3e49e6d520	net: use CHECKSUM_NONE instead of magic number Two places in the kernel were doing skb->ip_summed = 0. Change both to skb->ip_summed = CHECKSUM_NONE, which is more readable. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-27 23:35:05 -07:00
Florian Westphal	9c7a4f9ce6	ipv6: ip6_route_output does not modify sk parameter, so make it const This avoids explicit cast to avoid 'discards qualifiers' compiler warning in a netfilter patch that i've been working on. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-22 19:17:36 -07:00
Eric W. Biederman	9d2a8fa96a	net ipv6: Fix duplicate /proc/sys/net/ipv6/neigh directory entries. When I was fixing issues with unregisgtering tables under /proc/sys/net/ipv6/neigh by adding a mount point it appears I missed a critical ordering issue, in the ipv6 initialization. I had not realized that ipv6_sysctl_register is called at the very end of the ipv6 initialization and in particular after we call neigh_sysctl_register from ndisc_init. "neigh" needs to be initialized in ipv6_static_sysctl_register which is the first ipv6 table to initialized, and definitely before ndisc_init. This removes the weirdness of duplicate tables while still providing a "neigh" mount point which prevents races in sysctl unregistering. This was initially reported at https://bugzilla.kernel.org/show_bug.cgi?id=31232 Reported-by: sunkan@zappa.cx Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-21 18:23:34 -07:00
Eric Dumazet	db856674ac	netfilter: xtables: fix reentrancy commit `f3c5c1bfd4` (make ip_tables reentrant) introduced a race in handling the stackptr restore, at the end of ipt_do_table() We should do it before the call to xt_info_rdunlock_bh(), or we allow cpu preemption and another cpu overwrites stackptr of original one. A second fix is to change the underflow test to check the origptr value instead of 0 to detect underflow, or else we allow a jump from different hooks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-03-20 15:40:06 +01:00
Linus Torvalds	e16b396ce3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)	2011-03-18 10:37:40 -07:00
David S. Miller	31111c26d9	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 Conflicts: Documentation/feature-removal-schedule.txt	2011-03-15 13:03:27 -07:00
Vasiliy Kulikov	6a8ab06077	ipv6: netfilter: ip6_tables: fix infoleak to userspace Structures ip6t_replace, compat_ip6t_replace, and xt_get_revision are copied from userspace. Fields of these structs that are zero-terminated strings are not checked. When they are used as argument to a format string containing "%s" in request_module(), some sensitive information is leaked to userspace via argument of spawned modprobe process. The first bug was introduced before the git epoch; the second was introduced in `3bc3fe5e` (v2.6.25-rc1); the third is introduced by `6b7d31fc` (v2.6.15-rc1). To trigger the bug one should have CAP_NET_ADMIN. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-03-15 13:37:13 +01:00
Steffen Klassert	d212a4c290	esp6: Add support for IPsec extended sequence numbers This patch adds IPsec extended sequence numbers support to esp6. We use the authencesn crypto algorithm to handle esp with separate encryption/authentication algorithms. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-13 20:22:29 -07:00
Steffen Klassert	1ce3644ade	xfrm: Use separate low and high order bits of the sequence numbers in xfrm_skb_cb To support IPsec extended sequence numbers, we split the output sequence numbers of xfrm_skb_cb in low and high order 32 bits and we add the high order 32 bits to the input sequence numbers. All users are updated accordingly. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-13 20:22:28 -07:00
David S. Miller	1958b856c1	net: Put fl6_* macros to struct flowi6 and use them again. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:55 -08:00
David S. Miller	4c9483b2fb	ipv6: Convert to use flowi6 where applicable. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:54 -08:00
David S. Miller	7e1dc7b6f7	net: Use flowi4 and flowi6 in xfrm layer. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:52 -08:00
David S. Miller	56bb8059e1	net: Break struct flowi out into AF specific instances. Now we have struct flowi4, flowi6, and flowidn for each address family. And struct flowi is just a union of them all. It might have been troublesome to convert flow_cache_uli_match() but as it turns out this function is completely unused and therefore can be simply removed. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:46 -08:00
David S. Miller	6281dcc94a	net: Make flowi ports AF dependent. Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:46 -08:00
David S. Miller	1d28f42c1b	net: Put flowi_* prefix on AF independent members of struct flowi I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:44 -08:00
David S. Miller	78fbfd8a65	ipv4: Create and use route lookup helpers. The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:42 -08:00
David S. Miller	33175d84ee	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x_cmn.c	2011-03-10 14:26:00 -08:00
stephen hemminger	6dfbd87a20	ip6ip6: autoload ip6 tunnel Add necessary alias to autoload ip6ip6 tunnel module. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-10 14:18:48 -08:00
David S. Miller	bef6e7e768	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/	2011-03-10 14:00:44 -08:00
David S. Miller	7343ff31eb	ipv6: Don't create clones of host routes. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=29252 Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30462 In commit `d80bc0fd26` ("ipv6: Always clone offlink routes.") we forced the kernel to always clone offlink routes. The reason we do that is to make sure we never bind an inetpeer to a prefixed route. The logic turned on here has existed in the tree for many years, but was always off due to a protecting CPP define. So perhaps it's no surprise that there is a logic bug here. The problem is that we canot clone a route that is already a host route (ie. has DST_HOST set). Because if we do, an identical entry already exists in the routing tree and therefore the ip6_rt_ins() call is going to fail. This sets off a series of failures and high cpu usage, because when ip6_rt_ins() fails we loop retrying this operation a few times in order to handle a race between two threads trying to clone and insert the same host route at the same time. Fix this by simply using the route as-is when DST_HOST is set. Reported-by: slash@ac.auone-net.jp Reported-by: Ernst Sjöstrand <ernstp@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-09 19:55:25 -08:00
Vasiliy Kulikov	8909c9ad8f	net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules Since `a8f80e8ff9` any process with CAP_NET_ADMIN may load any module from /lib/modules/. This doesn't mean that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are limited to /lib/modules/**. However, CAP_NET_ADMIN capability shouldn't allow anybody load any module not related to networking. This patch restricts an ability of autoloading modules to netdev modules with explicit aliases. This fixes CVE-2011-1019. Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior of loading netdev modules by name (without any prefix) for processes with CAP_SYS_MODULE to maintain the compatibility with network scripts that use autoloading netdev modules by aliases like "eth0", "wlan0". Currently there are only three users of the feature in the upstream kernel: ipip, ip_gre and sit. root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) -- root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: fffffff800001000 CapEff: fffffff800001000 CapBnd: fffffff800001000 root@albatros:~# modprobe xfs FATAL: Error inserting xfs (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted root@albatros:~# lsmod \| grep xfs root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod \| grep xfs root@albatros:~# lsmod \| grep sit root@albatros:~# ifconfig sit sit: error fetching interface information: Device not found root@albatros:~# lsmod \| grep sit root@albatros:~# ifconfig sit0 sit0 Link encap:IPv6-in-IPv4 NOARP MTU:1480 Metric:1 root@albatros:~# lsmod \| grep sit sit 10457 0 tunnel4 2957 1 sit For CAP_SYS_MODULE module loading is still relaxed: root@albatros:~# grep Cap /proc/$$/status CapInh: 0000000000000000 CapPrm: ffffffffffffffff CapEff: ffffffffffffffff CapBnd: ffffffffffffffff root@albatros:~# ifconfig xfs xfs: error fetching interface information: Device not found root@albatros:~# lsmod \| grep xfs xfs 745319 0 Reference: https://lkml.org/lkml/2011/2/24/203 Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-03-10 10:25:19 +11:00
Hagen Paul Pfeifer	4b66fef9b5	mcast: net_device dev not used ip6_mc_source(), ip6_mc_msfilter() as well as ip6_mc_msfget() declare and assign dev but do not use the variable afterwards. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-07 15:51:13 -08:00
David S. Miller	0a0e9ae1bd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h	2011-03-03 21:27:42 -08:00
David S. Miller	29546a6404	ipv6: Use ERR_CAST in addrconf_dst_alloc. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-03 12:10:37 -08:00
David S. Miller	b23dd4fe42	ipv4: Make output route lookup return rtable directly. Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-02 14:31:35 -08:00
David S. Miller	452edd598f	xfrm: Return dst directly from xfrm_lookup() Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-02 13:27:41 -08:00
David S. Miller	b42835dbe8	ipv6: Make icmp route lookup code a bit clearer. The route lookup code in icmpv6_send() is slightly tricky as a result of having to handle all of the requirements of RFC 4301 host relookups. Pull the route resolution into a seperate function, so that the error handling and route reference counting is hopefully easier to see and contained wholly within this new routine. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 22:07:37 -08:00
David S. Miller	2774c131b1	xfrm: Handle blackhole route creation via afinfo. That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:59:04 -08:00
David S. Miller	69ead7afdf	ipv6: Normalize arguments to ip6_dst_blackhole(). Return a dst pointer which is potentitally error encoded. Don't pass original dst pointer by reference, pass a struct net instead of a socket, and elide the flow argument since it is unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:45:33 -08:00
David S. Miller	80c0bc9e37	xfrm: Kill XFRM_LOOKUP_WAIT flag. This can be determined from the flow flags instead. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:36:37 -08:00
David S. Miller	a1414715f0	ipv6: Change final dst lookup arg name to "can_sleep" Since it indicates whether we are invoked from a sleepable context or not. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:32:04 -08:00
David S. Miller	5df65e5567	net: Add FLOWI_FLAG_CAN_SLEEP. And set is in contexts where the route resolution can sleep. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:22:19 -08:00
David S. Miller	68d0c6d34d	ipv6: Consolidate route lookup sequences. Route lookups follow a general pattern in the ipv6 code wherein we first find the non-IPSEC route, potentially override the flow destination address due to ipv6 options settings, and then finally make an IPSEC search using either xfrm_lookup() or __xfrm_lookup(). __xfrm_lookup() is used when we want to generate a blackhole route if the key manager needs to resolve the IPSEC rules (in this case -EREMOTE is returned and the original 'dst' is left unchanged). Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC resolution is necessary, we simply fail the lookup completely. All of these cases are encapsulated into two routines, ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow. The latter of which handles unconnected UDP datagram sockets. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 13:19:07 -08:00
Herbert Xu	5a2ef92023	inet: Remove unused sk_sndmsg_* from UFO UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 12:35:02 -08:00
Anders Berggren	a693e69897	net: TX timestamps for IPv6 UDP packets Enabling TX timestamps (SO_TIMESTAMPING) for IPv6 UDP packets, in the same fashion as for IPv4. Necessary in order for NICs such as Intel 82580 to timestamp IPv6 packets. Signed-off-by: Anders Berggren <anders@halon.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-28 12:32:11 -08:00
Hagen Paul Pfeifer	ddc3731fcb	ipv6: ignore rtnl_unicast() return code rtnl_unicast() return value is not of interest, we can silently ignore it, save some instructions and four byte on the stack. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-25 14:00:23 -08:00
Hagen Paul Pfeifer	e9476e95d8	ipv6: variable next is never used in this function Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-25 14:00:22 -08:00
Hagen Paul Pfeifer	96d796a38e	ipv6: hash is calculated but not used afterwards hash is declared and assigned but not used anymore. ipv6_addr_hash() exhibit no side-effects. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-25 14:00:22 -08:00
Hagen Paul Pfeifer	a5f5e3689c	ipv6: totlen is declared and assigned but not used Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-25 14:00:21 -08:00
Lucian Adrian Grijincu	c486da3439	sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush Before this patch issuing these commands: fd = open("/proc/sys/net/ipv6/route/flush") unshare(CLONE_NEWNET) write(fd, "stuff") would flush the newly created net, not the original one. The equivalent ipv4 code is correct (stores the net inside ->extra1). Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-25 11:01:56 -08:00
David S. Miller	5e6b930f21	xfrm: Const'ify address arguments to ->dst_lookup() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-23 23:07:38 -08:00
David S. Miller	19bd62441c	xfrm: Const'ify tmpl and address arguments to ->init_temprop() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-23 23:07:37 -08:00
David S. Miller	8f029de281	xfrm: Mark flowi arg to xfrm_type->reject() const. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-22 17:59:59 -08:00
David S. Miller	73e5ebb20f	xfrm: Mark flowi arg to ->init_tempsel() const. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-22 17:51:44 -08:00
David S. Miller	0c7b3eefb4	xfrm: Mark flowi arg to ->fill_dst() const. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-22 17:48:57 -08:00
David S. Miller	05d8402576	xfrm: Mark flowi arg to ->get_tos() const. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-22 17:47:10 -08:00
Shan Wei	089c34827e	tcp: Remove debug macro of TCP_CHECK_TIMER Now, TCP_CHECK_TIMER is not used for debuging, it does nothing. And, it has been there for several years, maybe 6 years. Remove it to keep code clearer. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-20 11:10:14 -08:00
David S. Miller	da935c66ba	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/e1000e/netdev.c net/xfrm/xfrm_policy.c	2011-02-19 19:17:35 -08:00
David S. Miller	ece639caa3	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6	2011-02-19 16:42:37 -08:00
Eric Dumazet	214f45c91b	net: provide default_advmss() methods to blackhole dst_ops Commit `0dbaee3b37` (net: Abstract default ADVMSS behind an accessor.) introduced a possible crash in tcp_connect_init(), when dst->default_advmss() is called from dst_metric_advmss() Reported-by: George Spelvin <linux@horizon.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-18 11:39:01 -08:00
David S. Miller	3c7bd1a140	net: Add initial_ref arg to dst_alloc(). This allows avoiding multiple writes to the initial __refcnt. The most simplest cases of wanting an initial reference of "1" in ipv4 and ipv6 have been converted, the rest have been left along and kept at the existing "0". Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-17 15:44:00 -08:00
Joerg Marx	0af320fb46	netfilter: ip6t_LOG: fix a flaw in printing the MAC The flaw was in skipping the second byte in MAC header due to increasing the pointer AND indexed access starting at '1'. Signed-off-by: Joerg Marx <joerg.marx@secunet.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-02-17 16:23:40 +01:00
Jiri Kosina	0a9d59a246	Merge branch 'master' into for-next	2011-02-15 10:24:31 +01:00
David S. Miller	6431cbc25f	inet: Create a mechanism for upward inetpeer propagation into routes. If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects. The reason is that if we didn't have a routing cache, then there would be no way to lookup all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc. Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation a new inetpeer entry. If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry. The facilities implemented here handle this problem. First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID. The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too. This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information. At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-10 13:33:41 -08:00
David S. Miller	7a71ed899e	inetpeer: Abstract address representation further. Future changes will add caching information, and some of these new elements will be addresses. Since the family is implicit via the ->daddr.family member, replicating the family in ever address we store is entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-10 13:22:28 -08:00
David S. Miller	8d13a2a9fb	net: Kill NETEVENT_PMTU_UPDATE. Nobody actually does anything in response to the event, so just kill it off. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-08 16:17:55 -08:00
David S. Miller	92d8682926	inetpeer: Move ICMP rate limiting state into inet_peer entries. Like metrics, the ICMP rate limiting bits are cached state about a destination. So move it into the inet_peer entries. If an inet_peer cannot be bound (the reason is memory allocation failure or similar), the policy is to allow. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-04 15:59:53 -08:00
David S. Miller	bd4a6974cc	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-02-04 14:28:58 -08:00
David S. Miller	e2d57766e6	net: Provide compat support for SIOCGETMIFCNT_IN6 and SIOCGETSGCNT_IN6. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-03 18:05:29 -08:00
Eric W. Biederman	bf36076a67	net: Fix ipv6 neighbour unregister_sysctl_table warning In my testing of 2.6.37 I was occassionally getting a warning about sysctl table entries being unregistered in the wrong order. Digging in it turns out this dates back to the last great sysctl reorg done where Al Viro introduced the requirement that sysctl directories needed to be created before and destroyed after the files in them. It turns out that in that great reorg /proc/sys/net/ipv6/neigh was overlooked. So this patch fixes that oversight and makes an annoying warning message go away. >------------[ cut here ]------------ >WARNING: at kernel/sysctl.c:1992 unregister_sysctl_table+0x134/0x164() >Pid: 23951, comm: kworker/u:3 Not tainted 2.6.37-350888.2010AroraKernelBeta.fc14.x86_64 #1 >Call Trace: > [<ffffffff8103e034>] warn_slowpath_common+0x80/0x98 > [<ffffffff8103e061>] warn_slowpath_null+0x15/0x17 > [<ffffffff810452f8>] unregister_sysctl_table+0x134/0x164 > [<ffffffff810e7834>] ? kfree+0xc4/0xd1 > [<ffffffff813439b2>] neigh_sysctl_unregister+0x22/0x3a > [<ffffffffa02cd14e>] addrconf_ifdown+0x33f/0x37b [ipv6] > [<ffffffff81331ec2>] ? skb_dequeue+0x5f/0x6b > [<ffffffffa02ce4a5>] addrconf_notify+0x69b/0x75c [ipv6] > [<ffffffffa02eb953>] ? ip6mr_device_event+0x98/0xa9 [ipv6] > [<ffffffff813d2413>] notifier_call_chain+0x32/0x5e > [<ffffffff8105bdea>] raw_notifier_call_chain+0xf/0x11 > [<ffffffff8133cdac>] call_netdevice_notifiers+0x45/0x4a > [<ffffffff8133d2b0>] rollback_registered_many+0x118/0x201 > [<ffffffff8133d3af>] unregister_netdevice_many+0x16/0x6d > [<ffffffff8133d571>] default_device_exit_batch+0xa4/0xb8 > [<ffffffff81337c42>] ? cleanup_net+0x0/0x194 > [<ffffffff81337a2a>] ops_exit_list+0x4e/0x56 > [<ffffffff81337d36>] cleanup_net+0xf4/0x194 > [<ffffffff81053318>] process_one_work+0x187/0x280 > [<ffffffff8105441b>] worker_thread+0xff/0x19f > [<ffffffff8105431c>] ? worker_thread+0x0/0x19f > [<ffffffff8105776d>] kthread+0x7d/0x85 > [<ffffffff81003824>] kernel_thread_helper+0x4/0x10 > [<ffffffff810576f0>] ? kthread+0x0/0x85 > [<ffffffff81003820>] ? kernel_thread_helper+0x0/0x10 >---[ end trace 8a7e9310b35e9486 ]--- Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-31 20:54:17 -08:00
Roland Dreier	ec831ea72e	net: Add default_mtu() methods to blackhole dst_ops When an IPSEC SA is still being set up, __xfrm_lookup() will return -EREMOTE and so ip_route_output_flow() will return a blackhole route. This can happen in a sndmsg call, and after `d33e455337` ("net: Abstract default MTU metric calculation behind an accessor.") this leads to a crash in ip_append_data() because the blackhole dst_ops have no default_mtu() method and so dst_mtu() calls a NULL pointer. Fix this by adding default_mtu() methods (that simply return 0, matching the old behavior) to the blackhole dst_ops. The IPv4 part of this patch fixes a crash that I saw when using an IPSEC VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it looks to be needed as well. Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-31 13:16:00 -08:00
David S. Miller	065825402c	net: Store ipv4/ipv6 COW'd metrics in inetpeer cache. Please note that the IPSEC dst entry metrics keep using the generic metrics COW'ing mechanism using kmalloc/kfree. This gives the IPSEC routes an opportunity to use metrics which are unique to their encapsulated paths. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:59:31 -08:00
David S. Miller	1397e171f1	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-27 14:59:08 -08:00
David S. Miller	8f2771f2b8	ipv6: Remove route peer binding assertions. They are bogus. The basic idea is that I wanted to make sure that prefixed routes never bind to peers. The test I used was whether RTF_CACHE was set. But first of all, the RTF_CACHE flag is set at different spots depending upon which ip6_rt_copy() caller you're talking about. I've validated all of the code paths, and even in the future where we bind peers more aggressively (for route metric COW'ing) we never bind to prefix'd routes, only fully specified ones. This even applies when addrconf or icmp6 routes are allocated. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 14:55:22 -08:00
David S. Miller	62fa8a846d	net: Implement read-only protection and COW'ing of metrics. Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 20:51:05 -08:00
David S. Miller	b4e69ac670	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-26 13:49:30 -08:00
David S. Miller	7cc2edb834	xfrm6: Don't forget to propagate peer into ipsec route. Like ipv4, we have to propagate the ipv6 route peer into the ipsec top-level route during instantiation. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 13:41:03 -08:00
David S. Miller	73a8bd74e2	ipv6: Revert 'administrative down' address handling changes. This reverts the following set of commits: `d1ed113f16` ("ipv6: remove duplicate neigh_ifdown") `29ba5fed1b` ("ipv6: don't flush routes when setting loopback down") `9d82ca98f7` ("ipv6: fix missing in6_ifa_put in addrconf") `2de7957072` ("ipv6: addrconf: don't remove address state on ifdown if the address is being kept") `8595805aaf` ("IPv6: only notify protocols if address is compeletely gone") `27bdb2abcc` ("IPv6: keep tentative addresses in hash table") `93fa159abe` ("IPv6: keep route for tentative address") `8f37ada5b5` ("IPv6: fix race between cleanup and add/delete address") `84e8b803f1` ("IPv6: addrconf notify when address is unavailable") `dc2b99f71e` ("IPv6: keep permanent addresses on admin down") because the core semantic change to ipv6 address handling on ifdown has broken some things, in particular "disable_ipv6" sysctl handling. Stephen has made several attempts to get things back in working order, but nothing has restored disable_ipv6 fully yet. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Tested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-25 12:49:08 -08:00
David S. Miller	d80bc0fd26	ipv6: Always clone offlink routes. Do not handle PMTU vs. route lookup creation any differently wrt. offlink routes, always clone them. Reported-by: PK <runningdoglackey@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 16:01:58 -08:00
Michał Mirosław	04ed3e741d	net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:32:47 -08:00
Eric Dumazet	f2eda47df4	ipv6: raw: rcu annotations Remove sparse warnings, using a function typedef to be able to use __rcu annotation on mh_filter pointer. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 16:59:34 -08:00
Eric Dumazet	753ea8e962	net: ipv6: sit: fix rcu annotations Fix minor __rcu annotations and remove sparse warnings Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 16:59:33 -08:00
Eric Dumazet	bced94ed5e	netfilter: add a missing include in nf_conntrack_reasm.c After commit `ae90bdeaea` (netfilter: fix compilation when conntrack is disabled but tproxy is enabled) we have following warnings : net/ipv6/netfilter/nf_conntrack_reasm.c:520:16: warning: symbol 'nf_ct_frag6_gather' was not declared. Should it be static? net/ipv6/netfilter/nf_conntrack_reasm.c:591:6: warning: symbol 'nf_ct_frag6_output' was not declared. Should it be static? net/ipv6/netfilter/nf_conntrack_reasm.c:612:5: warning: symbol 'nf_ct_frag6_init' was not declared. Should it be static? net/ipv6/netfilter/nf_conntrack_reasm.c:640:6: warning: symbol 'nf_ct_frag6_cleanup' was not declared. Should it be static? Fix this including net/netfilter/ipv6/nf_defrag_ipv6.h Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-01-20 21:00:38 +01:00
Linus Torvalds	1268afe676	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits) sctp: user perfect name for Delayed SACK Timer option net: fix can_checksum_protocol() arguments swap Revert "netlink: test for all flags of the NLM_F_DUMP composite" gianfar: Fix misleading indentation in startup_gfar() net/irda/sh_irda: return to RX mode when TX error net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan. USB CDC NCM: tx_fixup() race condition fix ns83820: Avoid bad pointer deref in ns83820_init_one(). ipv6: Silence privacy extensions initialization bnx2x: Update bnx2x version to 1.62.00-4 bnx2x: Fix AER setting for BCM57712 bnx2x: Fix BCM84823 LED behavior bnx2x: Mark full duplex on some external PHYs bnx2x: Fix BCM8073/BCM8727 microcode loading bnx2x: LED fix for BCM8727 over BCM57712 bnx2x: Common init will be executed only once after POR bnx2x: Swap BCM8073 PHY polarity if required iwlwifi: fix valid chain reading from EEPROM ath5k: fix locking in tx_complete_poll_work ath9k_hw: do PA offset calibration only on longcal interval ...	2011-01-19 20:25:45 -08:00
Patrick McHardy	14f0290ba4	Merge branch 'master' of /repos/git/net-next-2.6	2011-01-19 23:51:37 +01:00
Jesper Juhl	42b16b3fbb	Kill off warning: ‘inline’ is not at beginning of declaration Fix a bunch of warning: ‘inline’ is not at beginning of declaration messages when building a 'make allyesconfig' kernel with -Wextra. These warnings are trivial to kill, yet rather annoying when building with -Wextra. The more we can cut down on pointless crap like this the better (IMHO). A previous patch to do this for a 'allnoconfig' build has already been merged. This just takes the cleanup a little further. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-01-19 15:43:08 +01:00
David S. Miller	a5db219f4c	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-01-18 16:28:31 -08:00
Romain Francoise	2fdc1c8093	ipv6: Silence privacy extensions initialization When a network namespace is created (via CLONE_NEWNET), the loopback interface is automatically added to the new namespace, triggering a printk in ipv6_add_dev() if CONFIG_IPV6_PRIVACY is set. This is problematic for applications which use CLONE_NEWNET as part of a sandbox, like Chromium's suid sandbox or recent versions of vsftpd. On a busy machine, it can lead to thousands of useless "lo: Disabled Privacy Extensions" messages appearing in dmesg. It's easy enough to check the status of privacy extensions via the use_tempaddr sysctl, so just removing the printk seems like the most sensible solution. Signed-off-by: Romain Francoise <romain@orebokech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-18 16:13:49 -08:00
Linus Torvalds	d018b6f4f1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits) GRETH: resolve SMP issues and other problems GRETH: handle frame error interrupts GRETH: avoid writing bad speed/duplex when setting transfer mode GRETH: fixed skb buffer memory leak on frame errors GRETH: GBit transmit descriptor handling optimization GRETH: fix opening/closing GRETH: added raw AMBA vendor/device number to match against. cassini: Fix build bustage on x86. e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs e1000e: update Copyright for 2011 e1000: Avoid unhandled IRQ r8169: keep firmware in memory. netdev: tilepro: Use is_unicast_ether_addr helper etherdevice.h: Add is_unicast_ether_addr function ks8695net: Use default implementation of ethtool_ops::get_link ks8695net: Disable non-working ethtool operations USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable. vxge: Remember to release firmware after upgrading firmware netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr ipsec: update MAX_AH_AUTH_LEN to support sha512 ...	2011-01-14 13:25:30 -08:00
Linus Torvalds	008d23e485	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.	2011-01-13 10:05:56 -08:00
Eric Dumazet	255d0dc340	netfilter: x_table: speedup compat operations One iptables invocation with 135000 rules takes 35 seconds of cpu time on a recent server, using a 32bit distro and a 64bit kernel. We eventually trigger NMI/RCU watchdog. INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies) COMPAT mode has quadratic behavior and consume 16 bytes of memory per rule. Switch the xt_compat algos to use an array instead of list, and use a binary search to locate an offset in the sorted array. This halves memory need (8 bytes per rule), and removes quadratic behavior [ O(NN) -> O(Nlog2(N)) ] Time of iptables goes from 35 s to 150 ms. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-01-13 12:05:12 +01:00
David S. Miller	464143c911	Merge branch 'master' of git://1984.lsi.us.es/net-2.6	2011-01-12 18:58:40 -08:00
Alexey Kuznetsov	72b43d0898	inet6: prevent network storms caused by linux IPv6 routers Linux IPv6 forwards unicast packets, which are link layer multicasts... The hole was present since day one. I was 100% this check is there, but it is not. The problem shows itself, f.e. when Microsoft Network Load Balancer runs on a network. This software resolves IPv6 unicast addresses to multicast MAC addresses. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-12 18:51:55 -08:00
Simon Horman	fee1cc0895	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 into HEAD	2011-01-13 10:29:21 +09:00
KOVACS Krisztian	2fc72c7b84	netfilter: fix compilation when conntrack is disabled but tproxy is enabled The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but failed to update the #ifdef stanzas guarding the defragmentation related fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c. This patch adds the required #ifdefs so that IPv6 tproxy can truly be used without connection tracking. Original report: http://marc.info/?l=linux-netdev&m=129010118516341&w=2 Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-01-12 20:25:08 +01:00
David S. Miller	60dbb011df	Merge branch 'master' of git://1984.lsi.us.es/net-2.6	2011-01-11 15:43:03 -08:00
Dang Hongwu	4b0ef1f223	ah: reload pointers to skb data after calling skb_cow_data() skb_cow_data() may allocate a new data buffer, so pointers on skb should be set after this function. Bug was introduced by commit `dff3bb06` ("ah4: convert to ahash") and `8631e9bd` ("ah6: convert to ahash"). Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com> Acked-by: Krzysztof Witek <krzysztof.witek@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-11 14:03:10 -08:00
Eric Dumazet	c191a836a9	tcp: disallow bind() to reuse addr/port inet_csk_bind_conflict() logic currently disallows a bind() if it finds a friend socket (a socket bound on same address/port) satisfying a set of conditions : 1) Current (to be bound) socket doesnt have sk_reuse set OR 2) other socket doesnt have sk_reuse set OR 3) other socket is in LISTEN state We should add the CLOSE state in the 3) condition, in order to avoid two REUSEADDR sockets in CLOSE state with same local address/port, since this can deny further operations. Note : a prior patch tried to address the problem in a different (and buggy) way. (commit `fda48a0d7a` tcp: bind() fix when many ports are bound). Reported-by: Gaspar Chilingarov <gasparch@gmail.com> Reported-by: Daniel Baluta <daniel.baluta@gmail.com> Tested-by: Daniel Baluta <daniel.baluta@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-11 14:03:07 -08:00
Eric Dumazet	83723d6071	netfilter: x_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqlock scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This adds two increments on seqlock sequence per ipt_do_table() call, its a reasonable cost for allowing "iptables -L" not block BH processing. Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-01-10 20:11:38 +01:00
Jiri Kosina	4b7bd36470	Merge branch 'master' into for-next Conflicts: MAINTAINERS arch/arm/mach-omap2/pm24xx.c drivers/scsi/bfa/bfa_fcpim.c Needed to update to apply fixes for which the old branch was too outdated.	2010-12-22 18:57:02 +01:00
David S. Miller	d9993be65a	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-12-20 13:24:14 -08:00
David Stevens	ad0081e43a	ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed. This patch modifies IPsec6 to fragment IPv6 packets that are locally generated as needed. This version of the patch only fragments in tunnel mode, so that fragment headers will not be obscured by ESP in transport mode. Signed-off-by: David L Stevens <dlstevens@us.ibm.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-19 20:22:23 -08:00
stephen hemminger	d1ed113f16	ipv6: remove duplicate neigh_ifdown When device is being set to down, neigh_ifdown was being called twice. Once from addrconf notifier and once from ndisc notifier. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-18 22:01:17 -08:00
stephen hemminger	bc3ef6605e	ipv6: fib6_ifdown cleanup Remove (unnecessary) casts to make code cleaner. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-18 22:01:16 -08:00
David S. Miller	b4aa9e05a6	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bnx2x/bnx2x.h drivers/net/wireless/iwlwifi/iwl-1000.c drivers/net/wireless/iwlwifi/iwl-6000.c drivers/net/wireless/iwlwifi/iwl-core.h drivers/vhost/vhost.c	2010-12-17 12:27:22 -08:00
stephen hemminger	29ba5fed1b	ipv6: don't flush routes when setting loopback down When loopback device is being brought down, then keep the route table entries because they are special. The entries in the local table for linklocal routes and ::1 address should not be purged. This is a sub optimal solution to the problem and should be replaced by a better fix in future. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 18:26:26 -08:00
Octavian Purdila	fcbdf09d96	net: fix nulls list corruptions in sk_prot_alloc Special care is taken inside sk_port_alloc to avoid overwriting skc_node/skc_nulls_node. We should also avoid overwriting skc_bind_node/skc_portaddr_node. The patch fixes the following crash: BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370 [<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360 [<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700 [<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190 [<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0 [<ffffffff812eda35>] udp_rcv+0x15/0x20 [<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190 [<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0 [<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0 [<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0 [<ffffffff812bb94b>] ip_rcv+0x2bb/0x350 [<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0 Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:26:56 -08:00
Andrey Vagin	d3052b557a	ipv6: delete expired route in ip6_pmtu_deliver The first big packets sent to a "low-MTU" client correctly triggers the creation of a temporary route containing the reduced MTU. But after the temporary route has expired, new ICMP6 "packet too big" will be sent, rt6_pmtu_discovery will find the previous EXPIRED route check that its mtu isn't bigger then in icmp packet and do nothing before the temporary route will not deleted by gc. I make the simple experiment: while :; do time ( dd if=/dev/zero bs=10K count=1 \| ssh hostname dd of=/dev/null ) \|\| break; done The "time" reports real 0m0.197s if a temporary route isn't expired, but it reports real 0m52.837s (!!!!) immediately after a temporare route has expired. Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 12:28:13 -08:00
KOVACS Krisztian	ae90bdeaea	netfilter: fix compilation when conntrack is disabled but tproxy is enabled The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but failed to update the #ifdef stanzas guarding the defragmentation related fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c. This patch adds the required #ifdefs so that IPv6 tproxy can truly be used without connection tracking. Original report: http://marc.info/?l=linux-netdev&m=129010118516341&w=2 Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-12-15 23:53:41 +01:00
David S. Miller	d33e455337	net: Abstract default MTU metric calculation behind an accessor. Like RTAX_ADVMSS, make the default calculation go through a dst_ops method rather than caching the computation in the routing cache entries. Now dst metrics are pretty much left as-is when new entries are created, thus optimizing metric sharing becomes a real possibility. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-14 13:01:14 -08:00
David S. Miller	0dbaee3b37	net: Abstract default ADVMSS behind an accessor. Make all RTAX_ADVMSS metric accesses go through a new helper function, dst_metric_advmss(). Leave the actual default metric as "zero" in the real metric slot, and compute the actual default value dynamically via a new dst_ops AF specific callback. For stacked IPSEC routes, we use the advmss of the path which preserves existing behavior. Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates advmss on pmtu updates. This inconsistency in advmss handling results in more raw metric accesses than I wish we ended up with. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-13 12:52:14 -08:00
David S. Miller	a02e4b7dae	ipv6: Demark default hoplimit as zero. This is for consistency with ipv4. Using "-1" makes no sense. It was made this way a long time ago merely to be consistent with how the ipv6 socket hoplimit "default" is stored. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 21:39:02 -08:00
David S. Miller	5170ae824d	net: Abstract RTAX_HOPLIMIT metric accesses behind helper. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 21:35:57 -08:00
David S. Miller	abbf46ae0e	ipv6: Use ip6_dst_hoplimit() instead of direct dst_metric() calls. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 21:14:46 -08:00
Tobias Klauser	376d940ee9	inet6: Remove redundant unlikely() IS_ERR() already implies unlikely(), so it can be omitted here. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 14:57:34 -08:00
Martin Willi	040253c931	xfrm: Traffic Flow Confidentiality for IPv6 ESP Add TFC padding to all packets smaller than the boundary configured on the xfrm state. If the boundary is larger than the PMTU, limit padding to the PMTU. Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 14:43:59 -08:00
Jiri Pirko	c07224005d	net/ipv6/udp.c: fix typo in flush_stack() skb1 should be passed as parameter to sk_rcvqueues_full() here. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 14:05:09 -08:00
David S. Miller	457de4383e	ipv6: Fix 'release_it' logic in tcp_v6_get_peer() We accidently set it to "true" for the case where we are using a route bound peer. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 13:16:09 -08:00
Nicolas Dichtel	5f75a1042f	ipv6: fix nl group when advertising a new link New idev are advertised with NL group RTNLGRP_IPV6_IFADDR, but should use RTNLGRP_IPV6_IFINFO. Bug was introduced by commit `8d7a76c9`. Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-10 12:49:36 -08:00
Uwe Kleine-König	a34f0b3139	fix comment typos concerning "consistent" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-12-10 16:04:28 +01:00
Eric Dumazet	68835aba4d	net: optimize INET input path further Followup of commit `b178bb3dfc` (net: reorder struct sock fields) Optimize INET input path a bit further, by : 1) moving sk_refcnt close to sk_lock. This reduces number of dirtied cache lines by one on 64bit arches (and 64 bytes cache line size). 2) moving inet_daddr & inet_rcv_saddr at the beginning of sk (same cache line than hash / family / bound_dev_if / nulls_node) This reduces number of accessed cache lines in lookups by one, and dont increase size of inet and timewait socks. inet and tw sockets now share same place-holder for these fields. Before patch : offsetof(struct sock, sk_refcnt) = 0x10 offsetof(struct sock, sk_lock) = 0x40 offsetof(struct sock, sk_receive_queue) = 0x60 offsetof(struct inet_sock, inet_daddr) = 0x270 offsetof(struct inet_sock, inet_rcv_saddr) = 0x274 After patch : offsetof(struct sock, sk_refcnt) = 0x44 offsetof(struct sock, sk_lock) = 0x48 offsetof(struct sock, sk_receive_queue) = 0x68 offsetof(struct inet_sock, inet_daddr) = 0x0 offsetof(struct inet_sock, inet_rcv_saddr) = 0x4 compute_score() (udp or tcp) now use a single cache line per ignored item, instead of two. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-09 20:05:58 -08:00
David S. Miller	defb3519a6	net: Abstract away all dst_entry metrics accesses. Use helper functions to hide all direct accesses, especially writes, to dst_entry metrics values. This will allow us to: 1) More easily change how the metrics are stored. 2) Implement COW for metrics. In particular this will help us put metrics into the inetpeer cache if that is what we end up doing. We can make the _metrics member a pointer instead of an array, initially have it point at the read-only metrics in the FIB, and then on the first set grab an inetpeer entry and point the _metrics member there. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2010-12-09 10:46:36 -08:00
David S. Miller	fe6c791570	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/ath/ath9k/ar9003_eeprom.c net/llc/af_llc.c	2010-12-08 13:47:38 -08:00
Shan Wei	b672083ed3	ipv6: use ND_REACHABLE_TIME and ND_RETRANS_TIMER instead of magic number ND_REACHABLE_TIME and ND_RETRANS_TIMER have defined since v2.6.12-rc2, but never been used. So use them instead of magic number. This patch also changes original code style to read comfortably . Thank YOSHIFUJI Hideaki for pointing it out. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 13:27:32 -08:00
David S. Miller	db3949c450	tcp: Implement ipv6 ->get_peer() and ->tw_get_peer(). Now ipv6 timewait recycling is fully implemented and enabled. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 12:14:37 -08:00
David S. Miller	493f377d6d	tcp: Add timewait recycling bits to ipv6 connect code. This will also improve handling of ipv6 tcp socket request backlog when syncookies are not enabled. When backlog becomes very deep, last quarter of backlog is limited to validated destinations. Previously only ipv4 implemented this logic, but now ipv6 does too. Now we are only one step away from enabling timewait recycling for ipv6, and that step is simply filling in the implementation of tcp_v6_get_peer() and tcp_v6_tw_get_peer(). Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 12:14:29 -08:00
David S. Miller	ae4694b2d3	ipv6: Create inet6_csk_route_req(). Brother of ipv4's inet_csk_route_req(). Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-02 10:59:22 -08:00
David S. Miller	ccb7c410dd	timewait_sock: Create and use getpeer op. The only thing AF-specific about remembering the timestamp for a time-wait TCP socket is getting the peer. Abstract that behind a new timewait_sock_ops vector. Support for real IPV6 sockets is not filled in yet, but curiously this makes timewait recycling start to work for v4-mapped ipv6 sockets. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-01 18:09:13 -08:00
David McCullough	6dcdd1b369	net/ipv6/sit.c: return unhandled skb to tunnel4_rcv I found a problem using an IPv6 over IPv4 tunnel. When CONFIG_IPV6_SIT was enabled, the packets would be rejected as net/ipv6/sit.c was catching all IPPROTO_IPV6 packets and returning an ICMP port unreachable error. I think this patch fixes the problem cleanly. I believe the code in net/ipv4/tunnel4.c:tunnel4_rcv takes care of it properly if none of the handlers claim the skb. Signed-off-by: David McCullough <david_mccullough@mcafee.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-01 13:19:34 -08:00
Anders Franzen	381601e5bb	Make the ip6_tunnel reflect the true mtu. The ip6_tunnel always assumes it consumes 40 bytes (ip6 hdr) of the mtu of the underlaying device. So for a normal ethernet bearer, the mtu of the ip6_tunnel is 1460. However, when creating a tunnel the encap limit option is enabled by default, and it consumes 8 bytes more, so the true mtu shall be 1452. I dont really know if this breaks some statement in some RFC, so this is a request for comments. Signed-off-by: Anders Franzen <anders.franzen@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-01 10:55:47 -08:00
David S. Miller	3f419d2d48	inet: Turn ->remember_stamp into ->get_peer in connection AF ops. Then we can make a completely generic tcp_remember_stamp() that uses ->get_peer() as a helper, minimizing the AF specific code and minimizing the eventual code duplication when we implement the ipv6 side of TW recycling. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-30 12:28:06 -08:00
David S. Miller	b341936380	ipv6: Add infrastructure to bind inet_peer objects to routes. They are only allowed on cached ipv6 routes. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-30 12:27:11 -08:00
Jozsef Kadlecsik	82a39eb6b3	ipv6: Prepare the tree for un-inlined jhash. jhash is widely used in the kernel and because the functions are inlined, the cost in size is significant. Also, the new jhash functions are slightly larger than the previous ones so better un-inline. As a preparation step, the calls to the internal macros are replaced with the plain jhash function calls. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-28 11:26:21 -08:00
Shan Wei	d3c15cab21	ipv6: kill two unused macro definition 1. IPV6_TLV_TEL_DST_SIZE This has not been using for several years since created. 2. RT6_INFO_LEN commit `33120b30` kill all RT6_INFO_LEN's references, but only this definition remained. commit `33120b30cc` Author: Alexey Dobriyan <adobriyan@sw.ru> Date: Tue Nov 6 05:27:11 2007 -0800 [IPV6]: Convert /proc/net/ipv6_route to seq_file interface Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-28 10:47:17 -08:00
Thomas Graf	cf7afbfeb8	rtnl: make link af-specific updates atomic As David pointed out correctly, updates to af-specific attributes are currently not atomic. If multiple changes are requested and one of them fails, previous updates may have been applied already leaving the link behind in a undefined state. This patch splits the function parse_link_af() into two functions validate_link_af() and set_link_at(). validate_link_af() is placed to validate_linkmsg() check for errors as early as possible before any changes to the link have been made. set_link_af() is called to commit the changes later. This method is not fail proof, while it is currently sufficient to make set_link_af() inerrable and thus 100% atomic, the validation function method will not be able to detect all error scenarios in the future, there will likely always be errors depending on states which are f.e. not protected by rtnl_mutex and thus may change between validation and setting. Also, instead of silently ignoring unknown address families and config blocks for address families which did not register a set function the errors EAFNOSUPPORT respectively EOPNOSUPPORT are returned to avoid comitting 4 out of 5 update requests without notifying the user. Signed-off-by: Thomas Graf <tgraf@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-27 22:56:08 -08:00
Eric Dumazet	456b61bca8	ipv6: mcast: RCU conversion ipv6_sk_mc_lock rwlock becomes a spinlock. readers (inet6_mc_check()) now takes rcu_read_lock() instead of read lock. Writers dont need to disable BH anymore. struct ipv6_mc_socklist objects are reclaimed after one RCU grace period. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-24 11:16:42 -08:00
Tracey Dent	4de58dfebe	Net: ipv6: netfiliter: Makefile: Remove deprecated kbuild goal definitions Changed Makefile to use <modules>-y instead of <modules>-objs because -objs is deprecated and not mentioned in Documentation/kbuild/makefiles.txt. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-22 08:16:12 -08:00
John Fastabend	88b2a9a3d9	ipv6: fix missing in6_ifa_put in addrconf Fix ref count bug introduced by commit `2de7957072` Author: Lorenzo Colitti <lorenzo@google.com> Date: Wed Oct 27 18:16:49 2010 +0000 ipv6: addrconf: don't remove address state on ifdown if the address is being kept Fix logic so that addrconf_ifdown() decrements the inet6_ifaddr refcnt correctly with in6_ifa_put(). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-22 07:37:36 -08:00
David S. Miller	24912420e9	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bonding/bond_main.c net/core/net-sysfs.c net/ipv6/addrconf.c	2010-11-19 13:13:47 -08:00
Thomas Graf	18a31e1e28	ipv6: Expose reachable and retrans timer values as msecs Expose reachable and retrans timer values in msecs instead of jiffies. Both timer values are already exposed as msecs in the neighbour table netlink interface. The creation timestamp format with increased precision is kept but cleaned up. Signed-off-by: Thomas Graf <tgraf@infradead.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 12:08:36 -08:00
Thomas Graf	93908d1926	ipv6: Expose IFLA_PROTINFO timer values in msecs instead of jiffies IFLA_PROTINFO exposes timer related per device settings in jiffies. Change it to expose these values in msecs like the sysctl interface does. I did not find any users of IFLA_PROTINFO which rely on any of these values and even if there are, they are likely already broken because there is no way for them to reliably convert such a value to another time format. Signed-off-by: Thomas Graf <tgraf@infradead.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-18 11:05:01 -08:00
Changli Gao	5811662b15	net: use the macros defined for the members of flowi Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 12:27:45 -08:00
Thomas Graf	b382b191ea	ipv6: AF_INET6 link address family IPv6 already exposes some address family data via netlink in the IFLA_PROTINFO attribute if RTM_GETLINK request is sent with the address family set to AF_INET6. We take over this format and reuse all the code. Signed-off-by: Thomas Graf <tgraf@infradead.org> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 11:28:26 -08:00
Eric Dumazet	c31504dc0d	udp: use atomic_inc_not_zero_hint UDP sockets refcount is usually 2, unless an incoming frame is going to be queued in receive or backlog queue. Using atomic_inc_not_zero_hint() permits to reduce latency, because processor issues less memory transactions. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-16 11:17:43 -08:00
John Fastabend	9d82ca98f7	ipv6: fix missing in6_ifa_put in addrconf Fix ref count bug introduced by commit `2de7957072` Author: Lorenzo Colitti <lorenzo@google.com> Date: Wed Oct 27 18:16:49 2010 +0000 ipv6: addrconf: don't remove address state on ifdown if the address is being kept Fix logic so that addrconf_ifdown() decrements the inet6_ifaddr refcnt correctly with in6_ifa_put(). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-16 09:24:58 -08:00
Joe Perches	8a22c99a80	net/ipv6/mcast.c: Remove unnecessary semicolons Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-15 11:07:16 -08:00
Eric Dumazet	be9e9163af	netfilter: nf_ct_frag6_sysctl_table is static Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-11-15 18:18:29 +01:00
Jan Engelhardt	b468645d72	netfilter: xt_LOG: do print MAC header on FORWARD I am observing consistent behavior even with bridges, so let's unlock this. xt_mac is already usable in FORWARD, too. Section 9 of http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html#section9 says the MAC source address is changed, but my observation does not match that claim -- the MAC header is retained. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> [Patrick; code inspection seems to confirm this] Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-11-15 11:23:06 +01:00
Ben Greear	4038565327	ipv6: Warn users if maximum number of routes is reached. This gives users at least some clue as to what the problem might be and how to go about fixing it. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-12 14:03:24 -08:00
Lorenzo Colitti	2de7957072	ipv6: addrconf: don't remove address state on ifdown if the address is being kept Currently, addrconf_ifdown does not delete statically configured IPv6 addresses when the interface is brought down. The intent is that when the interface comes back up the address will be usable again. However, this doesn't actually work, because the system stops listening on the corresponding solicited-node multicast address, so the address cannot respond to neighbor solicitations and thus receive traffic. Also, the code notifies the rest of the system that the address is being deleted (e.g, RTM_DELADDR), even though it is not. Fix it so that none of this state is updated if the address is being kept on the interface. Tested: Added a statically configured IPv6 address to an interface, started ping, brought link down, brought link up again. When link came up ping kept on going and "ip -6 maddr" showed that the host was still subscribed to there Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-12 13:44:24 -08:00
David S. Miller	7c13a0d9a1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6	2010-11-12 11:04:26 -08:00
Shan Wei	22e091e525	netfilter: ipv6: fix overlap check for fragments The type of FRAG6_CB(prev)->offset is int, skb->len is unsigned int, and offset is int. Without this patch, type conversion occurred to this expression, when (FRAG6_CB(prev)->offset + prev->len) is less than offset. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-11-12 08:51:55 +01:00
Shan Wei	f46421416f	ipv6: fix overlap check for fragments The type of FRAG6_CB(prev)->offset is int, skb->len is unsigned int, and offset is int. Without this patch, type conversion occurred to this expression, when (FRAG6_CB(prev)->offset + prev->len) is less than offset. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-08 12:17:06 -08:00
Jan Engelhardt	cccbe5ef85	netfilter: ip6_tables: fix information leak to userspace Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-03 18:55:39 -07:00
Xiaotian Feng	41bb78b4b9	net dst: fix percpu_counter list corruption and poison overwritten There're some percpu_counter list corruption and poison overwritten warnings in recent kernel, which is resulted by `fc66f95c`. commit `fc66f95c` switches to use percpu_counter, in ip6_route_net_init, kernel init the percpu_counter for dst entries, but, the percpu_counter is never destroyed in ip6_route_net_exit. So if the related data is freed by kernel, the freed percpu_counter is still on the list, then if we insert/remove other percpu_counter, list corruption resulted. Also, if the insert/remove option modifies the ->prev,->next pointer of the freed value, the poison overwritten is resulted then. With the following patch, the percpu_counter list corruption and poison overwritten warnings disappeared. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-03 18:50:07 -07:00
Eric Dumazet	870be39258	ipv6/udp: report SndbufErrors and RcvbufErrors commit `a18135eb93` (Add UDP_MIB_{SND,RCV}BUFERRORS handling.) forgot to make the necessary changes in net/ipv6/proc.c to report additional counters in /proc/net/snmp6 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-30 16:17:23 -07:00
Pavel Emelyanov	74b0b85b88	tunnels: Fix tunnels change rcu protection After making rcu protection for tunnels (ipip, gre, sit and ip6) a bug was introduced into the SIOCCHGTUNNEL code. The tunnel is first unlinked, then addresses change, then it is linked back probably into another bucket. But while changing the parms, the hash table is unlocked to readers and they can lookup the improper tunnel. Respective commits are `b7285b79` (ipip: get rid of ipip_lock), `1507850b` (gre: get rid of ipgre_lock), `3a43be3c` (sit: get rid of ipip6_lock) and `94767632` (ip6tnl: get rid of ip6_tnl_lock). The quick fix is to wait for quiescent state to pass after unlinking, but if it is inappropriate I can invent something better, just let me know. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 14:20:08 -07:00
Eric Dumazet	e0ad61ec86	net: add __rcu annotations to protocol Add __rcu annotations to : struct net_protocol inet_protos struct net_protocol inet6_protos And use appropriate casts to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 11:37:31 -07:00
Ursula Braun	853dc2e03d	ipv6: fix refcnt problem related to POSTDAD state After running this bonding setup script modprobe bonding miimon=100 mode=0 max_bonds=1 ifconfig bond0 10.1.1.1/16 ifenslave bond0 eth1 ifenslave bond0 eth3 on s390 with qeth-driven slaves, modprobe -r fails with this message unregister_netdevice: waiting for bond0 to become free. Usage count = 1 due to twice detection of duplicate address. Problem is caused by a missing decrease of ifp->refcnt in addrconf_dad_failure. An extra call of in6_ifa_put(ifp) solves it. Problem has been introduced with commit `f2344a131b`. Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-27 11:37:30 -07:00
Glenn Wurster	7a876b0efc	IPv6: Temp addresses are immediately deleted. There is a bug in the interaction between ipv6_create_tempaddr and addrconf_verify. Because ipv6_create_tempaddr uses the cstamp and tstamp from the public address in creating a private address, if we have not received a router advertisement in a while, tstamp + temp_valid_lft might be < now. If this happens, the new address is created inside ipv6_create_tempaddr, then the loop within addrconf_verify starts again and the address is immediately deleted. We are left with no temporary addresses on the interface, and no more will be created until the public IP address is updated. To avoid this, set the expiry time to be the minimum of the time left on the public address or the config option PLUS the current age of the public interface. Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-26 12:35:13 -07:00
Glenn Wurster	aed65501e8	IPv6: Create temporary address if none exists. If privacy extentions are enabled, but no current temporary address exists, then create one when we get a router advertisement. Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-26 12:35:12 -07:00
David S. Miller	7932c2e55c	netfilter: Add missing CONFIG_SYSCTL checks in ipv6's nf_conntrack_reasm.c Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-26 09:08:53 -07:00
Eric Dumazet	0d7da9ddd9	net: add __rcu annotation to sk_filter Add __rcu annotation to : (struct sock)->sk_filter And use appropriate rcu primitives to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 14:18:28 -07:00
KOVACS Krisztian	f6318e5588	netfilter: fix module dependency issues with IPv6 defragmentation, ip6tables and xt_TPROXY One of the previous tproxy related patches split IPv6 defragmentation and connection tracking, but did not correctly add Kconfig stanzas to handle the new dependencies correctly. This patch fixes that by making the config options mirror the setup we have for IPv4: a distinct config option for defragmentation that is automatically selected by both connection tracking and xt_TPROXY/xt_socket. The patch also changes the #ifdefs enclosing IPv6 specific code in xt_socket and xt_TPROXY: we only compile these in case we have ip6tables support enabled. Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 13:58:36 -07:00
Eric Dumazet	6f0bcf1525	tunnels: add _rcu annotations (struct ip6_tnl)->next is rcu protected : (struct ip_tunnel)->next is rcu protected : (struct xfrm6_tunnel)->next is rcu protected : add __rcu annotation and proper rcu primitives. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-25 13:09:45 -07:00
Balazs Scheidler	b889416b54	tproxy: Add missing CAP_NET_ADMIN check to ipv6 side IP_TRANSPARENT requires root (more precisely CAP_NET_ADMIN privielges) for IPV6. However as I see right now this check was missed from the IPv6 implementation. Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-24 16:07:50 -07:00
Anders Franzen	7e223de84b	ip6_tunnel dont update the mtu on the route. The ip6_tunnel device did not unset the flag, IFF_XMIT_DST_RELEASE. This will make the dev layer to release the dst before calling the tunnel. The tunnel will not update any mtu/pmtu info, since it does not have a dst on the skb. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-24 15:23:36 -07:00
David S. Miller	9941fb6276	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6	2010-10-21 08:21:34 -07:00
Balazs Scheidler	0a513f6af9	tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 16:10:03 +02:00
Balazs Scheidler	6c46862280	tproxy: added tproxy sockopt interface in the IPV6 layer Support for IPV6_RECVORIGDSTADDR sockopt for UDP sockets were contributed by Harry Mason. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 16:08:28 +02:00
Balazs Scheidler	aa976fc011	tproxy: added udp6_lib_lookup function Just like with IPv4, we need access to the UDP hash table to look up local sockets, but instead of exporting the global udp_table, export a lookup function. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 16:05:41 +02:00
Balazs Scheidler	88440ae70e	tproxy: added const specifiers to udp lookup functions The parameters for various UDP lookup functions were non-const, even though they could be const. TProxy has some const references and instead of downcasting it, I added const specifiers along the path. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 16:04:33 +02:00
Balazs Scheidler	e97c3e278e	tproxy: split off ipv6 defragmentation to a separate module Like with IPv4, TProxy needs IPv6 defragmentation but does not require connection tracking. Since defragmentation was coupled with conntrack, I split off the two, creating an nf_defrag_ipv6 module, similar to the already existing nf_defrag_ipv4. Signed-off-by: Balazs Scheidler <bazsi@balabit.hu> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 16:03:43 +02:00
Balazs Scheidler	093d282321	tproxy: fix hash locking issue when using port redirection in __inet_inherit_port() When __inet_inherit_port() is called on a tproxy connection the wrong locks are held for the inet_bind_bucket it is added to. __inet_inherit_port() made an implicit assumption that the listener's port number (and thus its bind bucket). Unfortunately, if you're using the TPROXY target to redirect skbs to a transparent proxy that assumption is not true anymore and things break. This patch adds code to __inet_inherit_port() so that it can handle this case by looking up or creating a new bind bucket for the child socket and updates callers of __inet_inherit_port() to gracefully handle __inet_inherit_port() failing. Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>. See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion. Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-21 13:06:43 +02:00
stephen hemminger	6f747aca5e	xfrm6: make xfrm6_tunnel_free_spi local Function only defined and used in one file. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-21 03:09:45 -07:00
Randy Dunlap	76b6717bc6	netfilter: fix kconfig unmet dependency warning Fix netfilter kconfig unmet dependencies warning & spell out "compatible" while there. warning: (IP_NF_TARGET_TTL && NET && INET && NETFILTER && IP_NF_IPTABLES && NETFILTER_ADVANCED \|\| IP6_NF_TARGET_HL && NET && INET && IPV6 && NETFILTER && IP6_NF_IPTABLES && NETFILTER_ADVANCED) selects NETFILTER_XT_TARGET_HL which has unmet direct dependencies ((IP_NF_MANGLE \|\| IP6_NF_MANGLE) && NETFILTER_ADVANCED) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-18 11:13:30 +02:00
Eric Dumazet	10da66f755	fib: avoid false sharing on fib_table_hash While doing profile analysis, I found fib_hash_table was sometime in a cache line shared by a possibly often written kernel structure. (CONFIG_IP_ROUTE_MULTIPATH \|\| !CONFIG_IPV6_MULTIPLE_TABLES) It's hard to detect because not easily reproductible. Make sure we allocate a full cache line to keep this shared in all cpus caches. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-16 11:13:23 -07:00
Eric Dumazet	2c1c00040a	fib6: use FIB_LOOKUP_NOREF in fib6_rule_lookup() Avoid two atomic ops on found rule in fib6_rule_lookup() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-16 11:13:21 -07:00
Jan Engelhardt	243bf6e29e	netfilter: xtables: resolve indirect macros 3/3	2010-10-13 18:00:46 +02:00
Jan Engelhardt	87a2e70db6	netfilter: xtables: resolve indirect macros 2/3 Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2010-10-13 18:00:41 +02:00
Jan Engelhardt	12b00c2c02	netfilter: xtables: resolve indirect macros 1/3 Many of the used macros are just there for userspace compatibility. Substitute the in-kernel code to directly use the terminal macro and stuff the defines into #ifndef __KERNEL__ sections. Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2010-10-13 18:00:36 +02:00
Eric Dumazet	fc66f95c68	net dst: use a percpu_counter to track entries struct dst_ops tracks number of allocated dst in an atomic_t field, subject to high cache line contention in stress workload. Switch to a percpu_counter, to reduce number of time we need to dirty a central location. Place it on a separate cache line to avoid dirtying read only fields. Stress test : (Sending 160.000.000 UDP frames, IP route cache disabled, dual E5540 @2.53GHz, 32bit kernel, FIB_TRIE, SLUB/NUMA) Before: real 0m51.179s user 0m15.329s sys 10m15.942s After: real 0m45.570s user 0m15.525s sys 9m56.669s With a small reordering of struct neighbour fields, subject of a following patch, (to separate refcnt from other read mostly fields) real 0m41.841s user 0m15.261s sys 8m45.949s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-11 13:06:53 -07:00
Eric Dumazet	d6bf781712	net neigh: RCU conversion of neigh hash table David This is the first step for RCU conversion of neigh code. Next patches will convert hash_buckets[] and "struct neighbour" to RCU protected objects. Thanks [PATCH net-next] net neigh: RCU conversion of neigh hash table Instead of storing hash_buckets, hash_mask and hash_rnd in "struct neigh_table", a new structure is defined : struct neigh_hash_table { struct neighbour *hash_buckets; unsigned int hash_mask; __u32 hash_rnd; struct rcu_head rcu; }; And "struct neigh_table" has an RCU protected pointer to such a neigh_hash_table. This means the signature of (hash)() function changed: We need to add a third parameter with the actual hash_rnd value, since this is not anymore a neigh_table field. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 14:54:36 -07:00
Eric Dumazet	caf586e5f2	net: add a core netdev->rx_dropped counter In various situations, a device provides a packet to our stack and we drop it before it enters protocol stack : - softnet backlog full (accounted in /proc/net/softnet_stat) - bad vlan tag (not accounted) - unknown/unregistered protocol (not accounted) We can handle a per-device counter of such dropped frames at core level, and automatically adds it to the device provided stats (rx_dropped), so that standard tools can be used (ifconfig, ip link, cat /proc/net/dev) This is a generalization of commit `8990f468a` (net: rx_dropped accounting), thus reverting it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 14:47:55 -07:00
stephen hemminger	c61393ea83	ipv6: make __ipv6_isatap_ifid static Another exported symbol only used in one file Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-05 00:47:39 -07:00
David S. Miller	21a180cda0	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/Kconfig net/ipv4/tcp_timer.c	2010-10-04 11:56:38 -07:00
Eric Dumazet	a8defca048	netfilter: ipt_LOG: add bufferisation to call printk() once ipt_LOG & ip6t_LOG use lot of calls to printk() and use a lock in a hope several cpus wont mix their output in syslog. printk() being very expensive [1], its better to call it once, on a prebuilt and complete line. Also, with mixed IPv4 and IPv6 trafic, separate IPv4/IPv6 locks dont avoid garbage. I used an allocation of a 1024 bytes structure, sort of seq_printf() but with a fixed size limit. Use a static buffer if dynamic allocation failed. Emit a once time alert if buffer size happens to be too short. [1]: printk() has various features like printk_delay()... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-10-04 20:56:05 +02:00

... 4 5 6 7 8 ...

2882 Commits