OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
David S. Miller	5339ab8b1d	ipv6: fib: Convert fib6_age() to dst_neigh_lookup(). In this specific situation we know we are dealing with a gatewayed route and therefore rt6i_gateway is not going to be in6addr_any even in future interpretations. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-27 21:00:08 -05:00
David S. Miller	eb857186eb	ipv6: ndisc: Convert to dst_neigh_lookup() Now all code paths grab a local reference to the neigh, so if neigh is not NULL we unconditionally release it at the end. The old logic would only release if we didn't have a non-NULL 'rt'. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-27 21:00:08 -05:00
David S. Miller	a7563f342d	ipv6: Use ipv6_addr_any() Suggested by YOSHIFUJI Hideaki. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-26 16:29:16 -05:00
David S. Miller	1e2927b081	ipv6: sit: Convert to dst_neigh_lookup() The only semantic difference is that we now hold a reference to the neighbour and thus have to release it. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-26 15:23:21 -05:00
David S. Miller	39232973b7	ipv4/ipv6: Prepare for new route gateway semantics. In the future the ipv4/ipv6 route gateway will take on two types of values: 1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case the neighbour must be obtained using the destination address in ipv4/ipv6 header as the lookup key. 2) Everything else, the actual nexthop route address. So if the gateway is not inaddr-any we use it, otherwise we must use the packet's destination address. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-26 15:22:32 -05:00
shawnlu	8a622e71f5	tcp: md5: using remote adress for md5 lookup in rst packet md5 key is added in socket through remote address. remote address should be used in finding md5 key when sending out reset packet. Signed-off-by: shawnlu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-22 15:08:45 -05:00
Francesco Ruggeri	013d97e9da	net: race condition in ipv6 forwarding and disable_ipv6 parameters There is a race condition in addrconf_sysctl_forward() and addrconf_sysctl_disable(). These functions change idev->cnf.forwarding (resp. idev->cnf.disable_ipv6) and then try to grab the rtnl lock before performing any actions. If that fails they restore the original value and restart the syscall. This creates race conditions if ipv6 code tries to access these parameters, or if multiple instances try to do the same operation. As an example of the former, if __ipv6_ifa_notify() finds a 0 in idev->cnf.forwarding when invoked by addrconf_ifdown() it may not free anycast addresses, ultimately resulting in the net_device not being freed. This patch reads the user parameters into a temporary location and only writes the actual parameters when the rtnl lock is acquired. Tested in 2.6.38.8. Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-18 16:38:34 -05:00
Linus Torvalds	ccb19d263f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) tg3: Fix single-vector MSI-X code openvswitch: Fix multipart datapath dumps. ipv6: fix per device IP snmp counters inetpeer: initialize ->redirect_genid in inet_getpeer() net: fix NULL-deref in WARN() in skb_gso_segment() net: WARN if skb_checksum_help() is called on skb requiring segmentation caif: Remove bad WARN_ON in caif_dev caif: Fix typo in Vendor/Product-ID for CAIF modems bnx2x: Disable AN KR work-around for BCM57810 bnx2x: Remove AutoGrEEEn for BCM84833 bnx2x: Remove 100Mb force speed for BCM84833 bnx2x: Fix PFC setting on BCM57840 bnx2x: Fix Super-Isolate mode for BCM84833 net: fix some sparse errors net: kill duplicate included header net: sh-eth: Fix build error by the value which is not defined net: Use device model to get driver name in skb_gso_segment() bridge: BH already disabled in br_fdb_cleanup() net: move sock_update_memcg outside of CONFIG_INET mwl8k: Fixing Sparse ENDIAN CHECK warning ...	2012-01-17 22:26:41 -08:00
Eric Dumazet	766e9f1be1	ipv6: fix per device IP snmp counters In commit `4ce3c183fc` (snmp: 64bit ipstats_mib for all arches), I forgot to change the /proc/net/dev_snmp6/xxx output for IP counters. percpu array is 64bit per counter but the folding still used the 'long' variant, and output garbage on 32bit arches. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-17 23:56:18 -05:00
Eric Dumazet	747465ef7a	net: fix some sparse errors make C=2 CF="-D__CHECK_ENDIAN__" M=net And fix flowi4_init_output() prototype for sport Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-17 10:31:12 -05:00
Linus Torvalds	c49c41a413	Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-security * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security: capabilities: remove __cap_full_set definition security: remove the security_netlink_recv hook as it is equivalent to capable() ptrace: do not audit capability check when outputing /proc/pid/stat capabilities: remove task_ns_* functions capabitlies: ns_capable can use the cap helpers rather than lsm call capabilities: style only - move capable below ns_capable capabilites: introduce new has_ns_capabilities_noaudit capabilities: call has_ns_capability from has_capability capabilities: remove all _real_ interfaces capabilities: introduce security_capable_noaudit capabilities: reverse arguments to security_capable capabilities: remove the task from capable LSM hook entirely selinux: sparse fix: fix several warnings in the security server cod selinux: sparse fix: fix warnings in netlink code selinux: sparse fix: eliminate warnings for selinuxfs selinux: sparse fix: declare selinux_disable() in security.h selinux: sparse fix: move selinux_complete_init selinux: sparse fix: make selinux_secmark_refcount static SELinux: Fix RCU deref check warning in sel_netport_insert() Manually fix up a semantic mis-merge wrt security_netlink_recv(): - the interface was removed in commit `fd77846152` ("security: remove the security_netlink_recv hook as it is equivalent to capable()") - a new user of it appeared in commit `a38f7907b9` ("crypto: Add userspace configuration API") causing no automatic merge conflict, but Eric Paris pointed out the issue.	2012-01-14 18:36:33 -08:00
RongQing.Li	252c3d84ed	ipv6: release idev when ip6_neigh_lookup failed in icmp6_dst_alloc release idev when ip6_neigh_lookup failed in icmp6_dst_alloc Signed-off-by: RongQing.Li <roy.qing.li@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-13 10:10:46 -08:00
Eric Dumazet	cf778b00e9	net: reintroduce missing rcu_assign_pointer() calls commit `a9b3cd7f32` (rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER) did a lot of incorrect changes, since it did a complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x, y). We miss needed barriers, even on x86, when y is not NULL. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-12 12:26:56 -08:00
Eric Paris	fd77846152	security: remove the security_netlink_recv hook as it is equivalent to capable() Once upon a time netlink was not sync and we had to get the effective capabilities from the skb that was being received. Today we instead get the capabilities from the current task. This has rendered the entire purpose of the hook moot as it is now functionally equivalent to the capable() call. Signed-off-by: Eric Paris <eparis@redhat.com>	2012-01-05 18:53:01 -05:00
Mihai Maruseac	1d5783030a	ipv6/addrconf: speedup /proc/net/if_inet6 filling This ensures a linear behaviour when filling /proc/net/if_inet6 thus making ifconfig run really fast on IPv6 only addresses. In fact, with this patch and the IPv4 one sent a while ago, ifconfig will run in linear time regardless of address type. IPv4 related patch: `f04565ddf5` dev: use name hash for dev_seq_ops ... Some statistics (running ifconfig > /dev/null on a different setup): iface count / IPv6 no-patch time / IPv6 patched time / IPv4 time ---------------------------------------------------------------- 6250 | 0.23 s | 0.13 s | 0.11 s 12500 | 0.62 s | 0.28 s | 0.22 s 25000 | 2.91 s | 0.57 s | 0.46 s 50000 | 11.37 s | 1.21 s | 0.94 s 128000 | 86.78 s | 3.05 s | 2.54 s Signed-off-by: Mihai Maruseac <mmaruseac@ixiacom.com> Cc: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-04 16:00:57 -05:00
Neil Horman	e6bff995f8	ipv6: Check RA for sllao when configuring optimistic ipv6 address (v2) Recently Dave noticed that a test we did in ipv6_add_addr to see if we next hop route for the interface we're adding an addres to was wrong (see commit `7ffbcecbee`). for one, it never triggers, and two, it was completely wrong to begin with. This test was meant to cover this section of RFC 4429: 3.3 Modifications to RFC 2462 Stateless Address Autoconfiguration * (modifies section 5.5) A host MAY choose to configure a new address as an Optimistic Address. A host that does not know the SLLAO of its router SHOULD NOT configure a new address as Optimistic. A router SHOULD NOT configure an Optimistic Address. This patch should bring us into proper compliance with the above clause. Since we only add a SLAAC address after we've received a RA which may or may not contain a source link layer address option, we can pass a pointer to that option to addrconf_prefix_rcv (which may be null if the option is not present), and only set the optimistic flag if the option was found in the RA. Change notes: (v2) modified the new parameter to addrconf_prefix_rcv to be a bool rather than a pointer to make its use more clear as per request from davem. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-01-04 15:53:20 -05:00
Josh Hunt	32b293a53d	IPv6: Avoid taking write lock for /proc/net/ipv6_route During some debugging I needed to look into how /proc/net/ipv6_route operated and in my digging I found its calling fib6_clean_all() which uses "write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I found this on 2.6.32, but reading the code I believe the same basic idea exists currently. Looking at the rtnetlink code they are only calling "read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize reading from proc isn't the recommended way of fetching the ipv6 route table; taking a write lock seems unnecessary and would probably cause network performance issues. To verify this I loaded up the ipv6 route table and then ran iperf in 3 cases: * doing nothing * reading ipv6 route table via proc (while :; do cat /proc/net/ipv6_route > /dev/null; done) * reading ipv6 route table via rtnetlink (while :; do ip -6 route show table all > /dev/null; done) * Load the ipv6 route table up with: * for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done * iperf commands: * client: iperf -i 1 -V -c <ipv6 addr> * server: iperf -V -s * iperf results - 3 runs each (in Mbits/sec) * nothing: client: 927,927,927 server: 927,927,927 * proc: client: 179,97,96,113 server: 142,112,133 * iproute: client: 928,927,928 server: 927,927,927 lock_stat shows taking the write lock is causing the slowdown. Using this info I decided to write a version of fib6_clean_all() which replaces write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With this new function I see the same results as with my rtnetlink iperf test. Signed-off-by: Josh Hunt <joshhunt00@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-30 17:07:33 -05:00
David S. Miller	8ade06c616	ipv6: Fix neigh lookup using NULL device. In some of the rt6_bind_neighbour() call sites, it hasn't hooked up the rt->dst.dev pointer yet, so we'd deref a NULL pointer when obtaining dev->ifindex for the neighbour hash function computation. Just pass the netdevice explicitly in to fix this problem. Reported-by: Bjarke Istrup Pedersen <gurligebis@gentoo.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-29 18:51:57 -05:00
David S. Miller	346f870b8a	ipv6: Report TCP timetstamp info in cacheinfo just like ipv4 does. I missed this while adding ipv6 support to inet_peer. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-29 15:22:33 -05:00
David S. Miller	d191854282	ipv6: Kill rt6i_dev and rt6i_expires defines. It just obscures that the netdevice pointer and the expires value are implemented in the dst_entry sub-object of the ipv6 route. And it makes grepping for dst_entry member uses much harder too. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-28 20:19:20 -05:00
David S. Miller	f83c7790dc	ipv6: Create fast inline ipv6 neigh lookup just like ipv4. Also, create and use an rt6_bind_neighbour() in net/ipv6/route.c to consolidate some common logic. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-28 15:41:23 -05:00
David S. Miller	2c2aba6c56	ipv6: Use universal hash for NDISC. In order to perform a proper universal hash on a vector of integers, we have to use different universal hashes on each vector element. Which means we need 4 different hash randoms for ipv6. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-28 15:06:58 -05:00
David Miller	7ffbcecbee	ipv6: Remove optimistic DAD flag test in ipv6_add_addr() The route we have here is for the address being added to the interface, ie. for input packet processing. Therefore using that route to determine whether an output nexthop gateway is known and resolved doesn't make any sense. So, simply remove this test, it never triggered anyways. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-By: Neil Horman <nhorman@tuxdriver.com>	2011-12-28 13:38:49 -05:00
David S. Miller	c159d30c59	ipv6: Kill useless route tracing bits in net/ipv6/route.c RDBG() wasn't even used, and the messages printed by RT6_DEBUG() were far from useful. Just get rid of all this stuff, we can replace it with something more suitable if we want. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-26 15:24:36 -05:00
David S. Miller	c5e1fd8cca	Merge branch 'nf-next' of git://1984.lsi.us.es/net-next	2011-12-25 02:21:45 -05:00
David S. Miller	abb434cb05	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/bluetooth/l2cap_core.c Just two overlapping changes, one added an initialization of a local variable, and another change added a new local variable. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-23 17:13:56 -05:00
Eric Dumazet	e688a60480	net: introduce DST_NOPEER dst flag Chris Boot reported crashes occurring in ipv6_select_ident(). [ 461.457562] RIP: 0010:[<ffffffff812dde61>] [<ffffffff812dde61>] ipv6_select_ident+0x31/0xa7 [ 461.578229] Call Trace: [ 461.580742] <IRQ> [ 461.582870] [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2 [ 461.589054] [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155 [ 461.595140] [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b [ 461.601198] [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e [nf_conntrack_ipv6] [ 461.608786] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.614227] [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543 [ 461.620659] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.626440] [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a [bridge] [ 461.633581] [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459 [ 461.639577] [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76 [bridge] [ 461.646887] [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f [bridge] [ 461.653997] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.659473] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.665485] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.671234] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.677299] [<ffffffffa0379215>] ? nf_bridge_update_protocol+0x20/0x20 [bridge] [ 461.684891] [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack] [ 461.691520] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.697572] [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge] [ 461.704616] [<ffffffffa0379031>] ? nf_bridge_push_encap_header+0x1c/0x26 [bridge] [ 461.712329] [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95 [bridge] [ 461.719490] [<ffffffffa037900a>] ? nf_bridge_pull_encap_header+0x1c/0x27 [bridge] [ 461.727223] [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge] [ 461.734292] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.739758] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.746203] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.751950] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.758378] [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge] This is caused by bridge netfilter special dst_entry (fake_rtable), a special shared entry, where attaching an inetpeer makes no sense. Problem is present since commit `87c48fa3b4` (ipv6: make fragment identifications less predictable) Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and __ip_select_ident() fallback to the 'no peer attached' handling. Reported-by: Chris Boot <bootc@bootc.net> Tested-by: Chris Boot <bootc@bootc.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-22 22:34:56 -05:00
Rusty Russell	eb93992207	module_param: make bool parameters really bool (net & drivers/net) module_param(bool) used to counter-intuitively take an int. In `fddd5201` (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. (Thanks to Joe Perches for suggesting coccinelle for 0/1 -> true/false). Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-19 22:27:29 -05:00
David S. Miller	b26e478f8f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/freescale/fsl_pq_mdio.c net/batman-adv/translation-table.c net/ipv6/route.c	2011-12-16 02:11:14 -05:00
David S. Miller	bb3c36863e	ipv6: Check dest prefix length on original route not copied one in rt6_alloc_cow(). After commit `8e2ec63917` ("ipv6: don't use inetpeer to store metrics for routes.") the test in rt6_alloc_cow() for setting the ANYCAST flag is now wrong. 'rt' will always now have a plen of 128, because it is set explicitly to 128 by ip6_rt_copy. So to restore the semantics of the test, check the destination prefix length of 'ort'. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-13 17:35:06 -05:00
David S. Miller	b43faac690	ipv6: If neigh lookup fails during icmp6 dst allocation, propagate error. Don't just succeed with a route that has a NULL neighbour attached. This follows the behavior of addrconf_dst_alloc(). Allowing this kind of route to end up with a NULL neigh attached will result in packet drops on output until the route is somehow invalidated, since nothing will meanwhile try to lookup the neigh again. A statistic is bumped for the case where we see a neigh-less route on output, but the resulting packet drop is otherwise silent in nature, and frankly it's a hard error for this to happen and ipv6 should do what ipv4 does which is say something in the kernel logs. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-13 16:51:51 -05:00
Florian Westphal	e26f9a480f	netfilter: add ipv6 reverse path filter match This is not merged with the ipv4 match into xt_rpfilter.c to avoid ipv6 module dependency issues. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-12-13 11:34:43 +01:00
Glauber Costa	3dc43e3e4d	per-netns ipv4 sysctl_tcp_mem This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make this values per group of process somehow. This is achieved in the patches that follows in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:11 -05:00
Glauber Costa	d1a4c0b37c	tcp memory pressure controls This patch introduces memory pressure controls for the tcp protocol. It uses the generic socket memory pressure code introduced in earlier patches, and fills in the necessary data in cg_proto struct. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:10 -05:00
Glauber Costa	180d8cd942	foundations of per-cgroup memory pressure controlling. This patch replaces all uses of struct sock fields' memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem to acessor macros. Those macros can either receive a socket argument, or a mem_cgroup argument, depending on the context they live in. Since we're only doing a macro wrapping here, no performance impact at all is expected in the case where we don't have cgroups disabled. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 19:04:10 -05:00
Ted Feng	72b36015ba	ipip, sit: copy parms.name after register_netdevice Same fix as `731abb9cb2` for ipip and sit tunnel. Commit `1c5cae815d` removed an explicit call to dev_alloc_name in ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice will now create a valid name, however the tunnel keeps a copy of the name in the private parms structure. Fix this by copying the name back after register_netdevice has successfully returned. This shows up if you do a simple tunnel add, followed by a tunnel show: $ sudo ip tunnel add mode ipip remote 10.2.20.211 $ ip tunnel tunl0: ip/ip remote any local any ttl inherit nopmtudisc tunl%d: ip/ip remote 10.2.20.211 local any ttl inherit $ sudo ip tunnel add mode sit remote 10.2.20.212 $ ip tunnel sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16 sit%d: ioctl 89f8 failed: No such device sit%d: ipv6/ip remote 10.2.20.212 local any ttl inherit Cc: stable@vger.kernel.org Signed-off-by: Ted Feng <artisdom@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 18:50:51 -05:00
Li Wei	4af04aba93	ipv6: Fix for adding multicast route for loopback device automatically. There is no obvious reason to add a default multicast route for loopback devices, otherwise there would be a route entry whose dst.error set to -ENETUNREACH that would blocking all multicast packets. ==================== [ more detailed explanation ] The problem is that the resulting routing table depends on the sequence of interface's initialization and in some situation, that would block all muticast packets. Suppose there are two interfaces on my computer (lo and eth0), if we initailize 'lo' before 'eth0', the resuting routing table(for multicast) would be # ip -6 route show | grep ff00:: unreachable ff00::/8 dev lo metric 256 error -101 ff00::/8 dev eth0 metric 256 When sending multicasting packets, routing subsystem will return the first route entry which with a error set to -101(ENETUNREACH). I know the kernel will set the default ipv6 address for 'lo' when it is up and won't set the default multicast route for it, but there is no reason to stop 'init' program from setting address for 'lo', and that is exactly what systemd did. I am sure there is something wrong with kernel or systemd, currently I preferred kernel caused this problem. ==================== Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-12 18:48:18 -05:00
Pavel Emelyanov	fce823381e	udp: Export code sk lookup routines The UDP diag get_exact handler will require them to find a socket by provided net, [sd]addr-s, [sd]ports and device. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-09 14:14:08 -05:00
David S. Miller	87a115783e	ipv6: Move xfrm_lookup() call down into icmp6_dst_alloc(). And return error pointers. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 17:04:13 -05:00
David S. Miller	8f0315190d	ipv6: Make third arg to anycast_dst_alloc() bool. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-06 16:48:14 -05:00
David Miller	2721745501	net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}. To reflect the fact that a refrence is not obtained to the resulting neighbour entry. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Roland Dreier <roland@purestorage.com>	2011-12-05 15:20:19 -05:00
Florian Westphal	ea6e574e34	ipv6: add ip6_route_lookup like rt6_lookup, but allows caller to pass in flowi6 structure. Will be used by the upcoming ipv6 netfilter reverse path filter match. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-12-04 22:44:07 +01:00
David S. Miller	78a8a36fe0	Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch	2011-12-03 22:53:31 -05:00
David S. Miller	04a6f4417b	ipv6: Kill ndisc_get_neigh() inline helper. It's only used in net/ipv6/route.c and the NULL device check is superfluous for all of the existing call sites. Just expand the __ndisc_lookup_errno() call at each location. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-03 18:29:30 -05:00
David S. Miller	3830847396	ipv6: Various cleanups in route.c 1) x == NULL --> !x 2) x != NULL --> x 3) (x&BIT) --> (x & BIT) 4) (BIT1|BIT2) --> (BIT1 | BIT2) 5) proper argument and struct member alignment Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-03 18:02:47 -05:00
David S. Miller	507c9b1e07	ipv6: Various cleanups in ip6_route.c 1) x == NULL --> !x 2) x != NULL --> x 3) if() --> if () 4) while() --> while () 5) (x & BIT) == 0 --> !(x & BIT) 6) (x&BIT) --> (x & BIT) 7) x=y --> x = y 8) (BIT1|BIT2) --> (BIT1 | BIT2) 9) if ((x & BIT)) --> if (x & BIT) 10) proper argument and struct member alignment Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-03 17:50:45 -05:00
Jesse Gross	75f2811c64	ipv6: Add fragment reporting to ipv6_skip_exthdr(). While parsing through IPv6 extension headers, fragment headers are skipped making them invisible to the caller. This reports the fragment offset of the last header in order to make it possible to determine whether the packet is fragmented and, if so whether it is a first or last fragment. Signed-off-by: Jesse Gross <jesse@nicira.com>	2011-12-03 09:35:10 -08:00
David S. Miller	b3613118eb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2011-12-02 13:49:21 -05:00
David S. Miller	59c2cdae27	Revert "udp: remove redundant variable" This reverts commit `81d54ec847`. If we take the "try_again" goto, due to a checksum error, the 'len' has already been truncated. So we won't compute the same values as the original code did. Reported-by: paul bilke <fsmail@conspiracy.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-01 14:12:55 -05:00
Jun Zhao	99d2f47aa9	ipv6 : mcast : Delete useless parameter in ip6_mc_add1_src() Need not to used 'delta' flag when add single-source to interface filter source list. Signed-off-by: Jun Zhao <mypopydev@gmail.com> Signed-off-by: David S. Miller <davem@drr.davemloft.net>	2011-11-30 23:10:02 -05:00
David Miller	76cc714ed5	neigh: Do not set tbl->entry_size in ipv4/ipv6 neigh tables. Let the core self-size the neigh entry based upon the key length. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-30 18:46:43 -05:00
Eric Dumazet	b90e5794c5	net: dont call jump_label_dec from irq context Igor Maravic reported an error caused by jump_label_dec() being called from IRQ context : BUG: sleeping function called from invalid context at kernel/mutex.c:271 in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper 1 lock held by swapper/0: #0: (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340 Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1 Call Trace: <IRQ> [<ffffffff8104f417>] __might_sleep+0x137/0x1f0 [<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370 [<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10 [<ffffffff8109a37f>] ? local_clock+0x6f/0x80 [<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0 [<ffffffff81557929>] ? sock_def_write_space+0x59/0x160 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90 [<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80 [<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50 [<ffffffff81566525>] net_disable_timestamp+0x15/0x20 [<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50 [<ffffffff81557b00>] __sk_free+0x80/0x200 [<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70 [<ffffffff815e936e>] ? arp_error_report+0x3e/0x90 [<ffffffff81557cba>] sock_wfree+0x3a/0x70 [<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120 [<ffffffff8155c0b6>] __kfree_skb+0x16/0x30 [<ffffffff8155c119>] kfree_skb+0x49/0x170 [<ffffffff815e936e>] arp_error_report+0x3e/0x90 [<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0 [<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0 [<ffffffff81578d20>] ? neigh_update+0x640/0x640 [<ffffffff81073558>] __do_softirq+0xc8/0x3a0 Since jump_label_{inc|dec} must be called from process context only, we must defer jump_label_dec() if net_disable_timestamp() is called from interrupt context. Reported-by: Igor Maravic <igorm@etf.rs> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-29 00:26:25 -05:00
Li Wei	2a38e6d5ae	ipv6: Set mcast_hops to IPV6_DEFAULT_MCASTHOPS when -1 was given. We need to set np->mcast_hops to it's default value at this moment otherwise when we use it and found it's value is -1, the logic to get default hop limit doesn't take multicast into account and will return wrong hop limit(IPV6_DEFAULT_HOPLIMIT) which is for unicast. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-28 18:09:13 -05:00
David S. Miller	6dec4ac4ee	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/ipv4/inet_diag.c	2011-11-26 14:47:03 -05:00
Steffen Klassert	618f9bc74a	net: Move mtu handling down to the protocol depended handlers We move all mtu handling from dst_mtu() down to the protocol layer. So each protocol can implement the mtu handling in a different manner. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-26 14:29:51 -05:00
Steffen Klassert	ebb762f27f	net: Rename the dst_opt default_mtu method to mtu We plan to invoke the dst_opt->default_mtu() method unconditioally from dst_mtu(). So rename the method to dst_opt->mtu() to match the name with the new meaning. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-26 14:29:50 -05:00
Steffen Klassert	6b600b26c0	route: Use the device mtu as the default for blackhole routes As it is, we return null as the default mtu of blackhole routes. This may lead to a propagation of a bogus pmtu if the default_mtu method of a blackhole route is invoked. So return dst->dev->mtu as the default mtu instead. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-26 14:29:50 -05:00
Eric Dumazet	4d0fe50c75	ipv6: tcp: fix tcp_v6_conn_request() Since linux 2.6.26 (commit `c6aefafb7e` : Add IPv6 support to TCP SYN cookies), we can drop a SYN packet reusing a TIME_WAIT socket. (As a matter of fact we fail to send the SYNACK answer) As the client resends its SYN packet after a one second timeout, we accept it, because first packet removed the TIME_WAIT socket before being dropped. This probably explains why nobody ever noticed or complained. Reported-by: Jesse Young <jlyo@jlyo.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-23 17:29:23 -05:00
David S. Miller	46a246c4df	netfilter: Remove NOTRACK/RAW dependency on NETFILTER_ADVANCED. Distributions are using this in their default scripts, so don't hide them behind the advanced setting. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-23 16:07:00 -05:00
Eric Dumazet	c16a98ed91	ipv6: tcp: fix panic in SYN processing commit `72a3effaf6` ([NET]: Size listen hash tables using backlog hint) added a bug allowing inet6_synq_hash() to return an out of bound array index, because of u16 overflow. Bug can happen if system admins set net.core.somaxconn & net.ipv4.tcp_max_syn_backlog sysctls to values greater than 65536 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-23 15:49:31 -05:00
Li Wei	4d65a2465f	ipv6: fix a bug in ndisc_send_redirect Release the skb when the transmit rate limit does _not_ allow sending. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-23 03:51:54 -05:00
Alexey Dobriyan	4e3fd7a06d	net: remove ipv6_addr_copy() C assignment can handle struct in6_addr copying. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-22 16:43:32 -05:00
David S. Miller	efd0bf97de	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net The forcedeth changes had a conflict with the conversion over to atomic u64 statistics in net-next. The libertas cfg.c code had a conflict with the bss reference counting fix by John Linville in net-next. Conflicts: drivers/net/ethernet/nvidia/forcedeth.c drivers/net/wireless/libertas/cfg.c	2011-11-21 13:50:33 -05:00
Herbert Xu	a7ae199224	ipv6: Remove all uses of LL_ALLOCATED_SPACE The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the alignment to the sum of needed_headroom and needed_tailroom. As the amount that is then reserved for head room is needed_headroom with alignment, this means that the tail room left may be too small. This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv6 with the macro LL_RESERVED_SPACE and direct reference to needed_tailroom. This also fixes the problem with needed_headroom changing between allocating the skb and reserving the head room. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-18 14:37:09 -05:00
David S. Miller	8d26784cf0	ipv6: Use pr_warn() in ip6_fib.c Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-17 03:18:28 -05:00
Matti Vaittinen	14df015bb1	IPV6 Fix a crash when trying to replace non existing route This patch fixes a crash when non existing IPv6 route is tried to be changed. When new destination node was inserted in middle of FIB6 tree, no relevant sanity checks were performed. Later route insertion might have been prevented due to invalid request, causing node with no rt info being left in tree. When this node was accessed, a crash occurred. Patch adds missing checks in fib6_add_1() Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-17 03:16:25 -05:00
Michał Mirosław	c8f44affb7	net: introduce and use netdev_features_t for device features sets v2: add couple missing conversions in drivers split unexporting netdev_fix_features() implemented %pNF convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-16 17:43:10 -05:00
Matti Vaittinen	229a66e3be	IPv6: Removing unnecessary NULL checks. This patch removes unnecessary NULL checks noticed by Dan Carpenter. Checks were introduced in commit `4a287eba2d` to net-next. Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-15 16:54:20 -05:00
Matti Vaittinen	4a287eba2d	IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag The support for NLM_F_* flags at IPv6 routing requests. If NLM_F_CREATE flag is not defined for RTM_NEWROUTE request, warning is printed, but no error is returned. Instead new route is added. Later NLM_F_CREATE may be required for new route creation. Exception is when NLM_F_REPLACE flag is given without NLM_F_CREATE, and no matching route is found. In this case it should be safe to assume that the request issuer is familiar with NLM_F_* flags, and does really not want route to be created. Specifying NLM_F_REPLACE flag will now make the kernel to search for matching route, and replace it with new one. If no route is found and NLM_F_CREATE is specified as well, then new route is created. Also, specifying NLM_F_EXCL will yield returning of error if matching route is found. Patch created against linux-3.2-rc1 Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 14:35:33 -05:00
Matti Vaittinen	d71314b4ac	IPv6 routing, NLM_F_* flag support: warn if new route is created without NLM_F_CREATE The support for NLM_F_* flags at IPv6 routing requests. Warn if NLM_F_CREATE flag is not defined for RTM_NEWROUTE request, creating new table. Later NLM_F_CREATE may be required for new route creation. Patch created against linux-3.2-rc1 Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 14:35:33 -05:00
Eric Dumazet	8b5c171bb3	neigh: new unresolved queue limits On Wednesday 09 November 2011 at 16:21 -0500, David Miller wrote: > From: David Miller <davem@davemloft.net> > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST) > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Wed, 09 Nov 2011 12:14:09 +0100 > > > >> unres_qlen is the number of frames we are able to queue per unresolved > >> neighbour. Its default value (3) was never changed and is responsible > >> for strange drops, especially if IP fragments are used, or multiple > >> sessions start in parallel. Even a single tcp flow can hit this limit. > > ... > > > > Ok, I've applied this, let's see what happens :-) > > Early answer, build fails. > > Please test build this patch with DECNET enabled and resubmit. The > decnet neigh layer still refers to the removed ->queue_len member. > > Thanks. Ouch, this was fixed on one machine yesterday, but not the other one I used this morning, sorry. [PATCH V5 net-next] neigh: new unresolved queue limits unres_qlen is the number of frames we are able to queue per unresolved neighbour. Its default value (3) was never changed and is responsible for strange drops, especially if IP fragments are used, or multiple sessions start in parallel. Even a single tcp flow can hit this limit. $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data. 8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:47:54 -05:00
Josh Boyer	731abb9cb2	ip6_tunnel: copy parms.name after register_netdevice Commit `1c5cae815d` removed an explicit call to dev_alloc_name in ip6_tnl_create because register_netdevice will now create a valid name. This works for the net_device itself. However the tunnel keeps a copy of the name in the parms structure for the ip6_tnl associated with the tunnel. parms.name is set by copying the net_device name in ip6_tnl_dev_init_gen. That function is called from ip6_tnl_dev_init in ip6_tnl_create, but it is done before register_netdevice is called so the name is set to a bogus value in the parms.name structure. This shows up if you do a simple tunnel add, followed by a tunnel show: [root@localhost ~]# ip -6 tunnel add remote fec0::100 local fec0::200 [root@localhost ~]# ip -6 tunnel show ip6tnl0: ipv6/ipv6 remote :: local :: encaplimit 0 hoplimit 0 tclass 0x00 flowlabel 0x00000 (flowinfo 0x00000000) ip6tnl%d: ipv6/ipv6 remote fec0::100 local fec0::200 encaplimit 4 hoplimit 64 tclass 0x00 flowlabel 0x00000 (flowinfo 0x00000000) [root@localhost ~]# Fix this by moving the strcpy out of ip6_tnl_dev_init_gen, and calling it after register_netdevice has successfully returned. Cc: stable@vger.kernel.org Signed-off-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:24:06 -05:00
Eric Dumazet	2a24444f8f	ipv6: reduce percpu needs for icmpv6msg mibs Reading /proc/net/snmp6 on a machine with a lot of cpus is very expensive (can be ~88000 us). This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding values for all possible cpus can read 16 Mbytes of memory (32MBytes on non x86 arches) ICMP messages are not considered as fast path on a typical server, and eventually few cpus handle them anyway. We can afford an atomic operation instead of using percpu data. This saves 4096 bytes per cpu and per network namespace. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-14 00:12:26 -05:00
Nick Bowler	4b90a603a1	ah: Don't return NET_XMIT_DROP on input. When the ahash driver returns -EBUSY, AH4/6 input functions return NET_XMIT_DROP, presumably copied from the output code path. But returning transmit codes on input doesn't make a lot of sense. Since NET_XMIT_DROP is a positive int, this gets interpreted as the next header type (i.e., success). As that can only end badly, remove the check. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-12 18:13:32 -05:00
Eric Dumazet	d826eb14ec	ipv4: PKTINFO doesnt need dst reference On Monday 07 November 2011 at 15:33 +0100, Eric Dumazet wrote: > At least, in recent kernels we dont change dst->refcnt in forwarding > patch (using NOREF skb->dst) > > One particular point is the atomic_inc(dst->refcnt) we have to perform > when queuing an UDP packet if socket asked PKTINFO stuff (for example a > typical DNS server has to setup this option) > > I have one patch somewhere that stores the information in skb->cb[] and > avoid the atomic_{inc|dec}(dst->refcnt). > OK I found it, I did some extra tests and believe it's ready. [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference When a socket uses IP_PKTINFO notifications, we currently force a dst reference for each received skb. Reader has to access dst to get needed information (rt_iif & rt_spec_dst) and must release dst reference. We also forced a dst reference if skb was put in socket backlog, even without IP_PKTINFO handling. This happens under stress/load. We can instead store the needed information in skb->cb[], so that only softirq handler really access dst, improving cache hit ratios. This removes two atomic operations per packet, and false sharing as well. On a benchmark using a mono threaded receiver (doing only recvmsg() calls), I can reach 720.000 pps instead of 570.000 pps. IP_PKTINFO is typically used by DNS servers, and any multihomed aware UDP application. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-09 16:36:27 -05:00
Nick Bowler	b7ea81a58a	ah: Read nexthdr value before overwriting it in ahash input callback. The AH4/6 ahash input callbacks read out the nexthdr field from the AH header after they overwrite that header. This is obviously not going to end well. Fix it up. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-09 15:55:53 -05:00
Nick Bowler	069294e813	ah: Correctly pass error codes in ahash output callback. The AH4/6 ahash output callbacks pass nexthdr to xfrm_output_resume instead of the error code. This appears to be a copy+paste error from the input case, where nexthdr is expected. This causes the driver to continuously add AH headers to the datagram until either an allocation fails and the packet is dropped or the ahash driver hits a synchronous fallback and the resulting monstrosity is transmitted. Correct this issue by simply passing the error code unadulterated. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-09 15:55:53 -05:00
Maciej Żenczykowski	2563fa5954	net: make ipv6 PKTINFO honour freebind This just makes it possible to spoof source IPv6 address on a socket without having to create and bind a new socket for every source IP we wish to spoof. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-08 15:13:03 -05:00
Maciej Żenczykowski	f74024d9f0	net: make ipv6 bind honour freebind This makes native ipv6 bind follow the precedent set by: - native ipv4 bind behaviour - dual stack ipv4-mapped ipv6 bind behaviour. This does allow an unprivileged process to spoof its source IPv6 address, just like it currently can spoof its source IPv4 address (for example when using UDP). Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-08 15:13:03 -05:00
Eric Dumazet	8ce120f118	net: better pcpu data alignment Tunnels can force an alignment of their percpu data to reduce number of cache lines used in fast path, or read in .ndo_get_stats() percpu_alloc() is a very fine grained allocator, so any small hole will be used anyway. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-08 15:10:59 -05:00
Brian Haley	c457338d7a	ipv6: drop packets when source address is multicast RFC 4291 Section 2.7 says Multicast addresses must not be used as source addresses in IPv6 packets - drop them on input so we don't process the packet further. Signed-off-by: Brian Haley <brian.haley@hp.com> Reported-and-Tested-by: Kumar Sanghvi <divinekumar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-08 12:37:06 -05:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Arjan van de Ven	73cb88ecb9	net: make the tcp and udp file_operations for the /proc stuff const the tcp and udp code creates a set of struct file_operations at runtime while it can also be done at compile time, with the added benefit of then having these file operations be const. the trickiest part was to get the "THIS_MODULE" reference right; the naive method of declaring a struct in the place of registration would not work for this reason. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-01 17:56:14 -04:00
Florian Westphal	2dad81adf2	netfilter: ipv6: fix afinfo->route refcnt leak on error Several callers (h323 conntrack, xt_addrtype) assume that the returned **dst only needs to be released if the function returns 0. This is true for the ipv4 implementation, but not for the ipv6 one. Instead of changing the users, change the ipv6 implementation to behave like the ipv4 version by only providing the dst_entry result in the success case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-11-01 09:20:07 +01:00
Joe Perches	0a9ee81349	netfilter: Remove unnecessary OOM logging messages Site specific OOM messages are duplications of a generic MM out of memory message and aren't really useful, so just delete them. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-11-01 09:19:49 +01:00
Paul Gortmaker	bc3b2d7fb9	net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules These files are non modular, but need to export symbols using the macros now living in export.h -- call out the include so that things won't break when we remove the implicit presence of module.h from everywhere. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:30 -04:00
Andreas Hofmeister	14ef37b6d0	ipv6: fix route lookup in addrconf_prefix_rcv() The route lookup to find a previously auto-configured route for a prefixes used to use rt6_lookup(), with the prefix from the RA used as an address. However, that kind of lookup ignores routing tables, the prefix length and route flags, so when there were other matching routes, even in different tables and/or with a different prefix length, the wrong route would be manipulated. Now, a new function "addrconf_get_prefix_route()" is used for the route lookup, which searches in RT6_TABLE_PREFIX and takes the prefix-length and route flags into account. Signed-off-by: Andreas Hofmeister <andi@collax.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-30 04:12:36 -04:00
Gao feng	7011687f0f	ipv6: fix route error binding peer in func icmp6_dst_alloc In func icmp6_dst_alloc, dst_metric_set calls ipv6_cow_metrics to set the metric. ipv6_cow_metrics may call rt6_bind_peer to set rt6_info->rt6i_peer. So we should move ipv6_addr_copy before dst_metric_set to make sure rt6_bind_peer succeeds. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-28 16:36:07 -04:00
Zheng Yan	504744e4ed	ipv6: fix error propagation in ip6_ufo_append_data() We should return errcode from sock_alloc_send_skb() Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-28 00:26:00 -04:00
Eric Dumazet	b903d324be	ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT commit `66b13d99d9` (ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT) fixed IPv4 only. This part is for the IPv6 side, adding a tclass param to ip6_xmit() We alias tw_tclass and tw_tos, if socket family is INET6. [ if sockets is ipv4-mapped, only IP_TOS socket option is used to fill TOS field, TCLASS is not taken into account ] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-27 00:44:35 -04:00
Andreas Hofmeister	9f56220fad	ipv6: Do not use routes from locally generated RAs When hybrid mode is enabled (accept_ra == 2), the kernel also sees RAs generated locally. This is useful since it allows the kernel to auto-configure its own interface addresses. However, if 'accept_ra_defrtr' and/or 'accept_ra_rtr_pref' are set and the locally generated RAs announce the default route and/or other route information, the kernel happily inserts bogus routes with its own address as gateway. With this patch, adding routes from an RA will be skipped when the RA's source address matches any local address, just as if 'accept_ra_defrtr' and 'accept_ra_rtr_pref' were set to 0. Signed-off-by: Andreas Hofmeister <andi@collax.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 19:13:15 -04:00
David S. Miller	1805b2f048	Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-10-24 18:18:09 -04:00
Eric Dumazet	318cf7aaa0	tcp: md5: add more const attributes Now tcp_md5_hash_header() has a const tcphdr argument, we can add more const attributes to callers. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 02:46:04 -04:00
Eric Dumazet	cf533ea53e	tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporarily write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-21 05:22:42 -04:00
Maciej Żenczykowski	6cc7a765c2	net: allow CAP_NET_RAW to set socket options IP{,V6}_TRANSPARENT Up till now the IP{,V6}_TRANSPARENT socket options (which actually set the same bit in the socket struct) have required CAP_NET_ADMIN privileges to set or clear the option. - we make clearing the bit not require any privileges. - we allow CAP_NET_ADMIN to set the bit (as before this change) - we allow CAP_NET_RAW to set this bit, because raw sockets already pretty much effectively allow you to emulate socket transparency. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-20 18:21:36 -04:00
Kevin Wilson	25c8295b5b	cleanup: remove unnecessary include. This cleanup patch removes unnecessary include from net/ipv6/ip6_fib.c. Signed-off-by: Kevin Wilson <wkevils@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 19:26:16 -04:00
Eric Dumazet	9e903e0852	net: add skb frag size accessors To ease skb->truesize sanitization, it's better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 03:10:46 -04:00
Steffen Klassert	dd767856a3	xfrm6: Don't call icmpv6_send on local error Calling icmpv6_send() on a local message size error leads to an incorrect update of the path mtu. So use xfrm6_local_rxpmtu() to notify about the pmtu if the IPV6_DONTFRAG socket option is set on a udp or raw socket, according to RFC 3542, and use ipv6_local_error() otherwise. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-18 23:53:10 -04:00
Steffen Klassert	299b076764	ipv6: Fix IPsec slowpath fragmentation problem ip6_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path). On IPsec the effective mtu is lower because we need to add the protocol headers and trailers later when we do the IPsec transformations. So after the IPsec transformations the packet might be too big, which leads to a slowpath fragmentation then. This patch fixes this by building the packets based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr handling to this. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-18 23:53:10 -04:00
Steffen Klassert	c113464d43	ipv6: Remove superfluous NULL pointer check in ipv6_local_rxpmtu The pointer to mtu_info is taken from the common buffer of the skb, thus it can't be a NULL pointer. This patch removes this check on mtu_info. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-18 23:51:30 -04:00
Roy.Li	01b7806cdc	ipv6: remove a rcu_read_lock in ndisc_constructor in6_dev_get(dev) takes a reference on struct inet6_dev, we dont need rcu locking in ndisc_constructor() Signed-off-by: Roy.Li <rongqing.li@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-17 19:27:56 -04:00
Eric Dumazet	87fb4b7b53	net: more accurate skb truesize skb truesize currently accounts for sk_buff struct and part of skb head. kmalloc() roundings are also ignored. Considering that skb_shared_info is larger than sk_buff, its time to take it into account for better memory accounting. This patch introduces SKB_TRUESIZE(X) macro to centralize various assumptions into a single place. At skb alloc phase, we put skb_shared_info struct at the exact end of skb head, to allow a better use of memory (lowering number of reallocations), since kmalloc() gives us power-of-two memory blocks. Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are aligned to cache lines, as before. Note: This patch might trigger performance regressions because of misconfigured protocol stacks, hitting per socket or global memory limits that were previously not reached. But its a necessary step for a more accurate memory accounting. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <ak@linux.intel.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-13 16:05:07 -04:00
Yan, Zheng	cdaf557034	gro: refetch inet6_protos[] after pulling ext headers ipv6_gro_receive() doesn't update the protocol ops after pulling the ext headers. It looks like a typo. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-10 14:26:16 -04:00
David S. Miller	88c5100c28	Merge branch 'master' of github.com:davem330/net Conflicts: net/batman-adv/soft-interface.c	2011-10-07 13:38:43 -04:00
Yan, Zheng	260fcbeb1a	tcp: properly handle md5sig_pool references tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers only hold one reference to md5sig_pool, but tcp_v4_md5_do_add() increases the use count of md5sig_pool for each peer. This patch makes tcp_v4_md5_do_add() only increase the use count for the first tcp md5sig peer. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-04 23:31:24 -04:00
Yan, Zheng	676a1184e8	ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket ipv6_ac_list and ipv6_fl_list from listening socket are inadvertently shared with new socket created for connection. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-29 00:32:10 -04:00
Ben Greear	67928c4041	ipv6-multicast: Fix memory leak in IPv6 multicast. If reg_vif_xmit cannot find a routing entry, be sure to free the skb before returning the error. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:34:00 -04:00
Madalin Bucur	fbe5818690	ipv6: check return value for dst_alloc return value of dst_alloc must be checked before use Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:32:06 -04:00
Ben Greear	2015de5fe2	ipv6-multicast: Fix memory leak in input path. Have to free the skb before returning if we fail the fib lookup. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 15:16:08 -04:00
Eric Dumazet	b82d1bb4fd	tcp: unalias tcp_skb_cb flags and ip_dsfield struct tcp_skb_cb contains a "flags" field containing either tcp flags or IP dsfield depending on context (input or output path) Introduce ip_dsfield to make the difference clear and ease maintenance. If later we want to save space, we can union flags/ip_dsfield Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-27 02:20:08 -04:00
David S. Miller	8decf86879	Merge branch 'master' of github.com:davem330/net Conflicts: MAINTAINERS drivers/net/Kconfig drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c drivers/net/ethernet/broadcom/tg3.c drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c drivers/net/wireless/rt2x00/rt2800usb.c drivers/net/wireless/wl12xx/main.c	2011-09-22 03:23:13 -04:00
Roy Li	8603e33d01	ipv6: fix a possible double free When a call to snmp6_alloc_dev fails, the snmp6-related memory is freed by snmp6_alloc_dev. Calling in6_dev_finish_destroy will then free this memory a second time. A double free leads to undefined behavior. Signed-off-by: Roy Li <rongqing.li@windriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-20 15:10:16 -04:00
Eric Dumazet	d24f22f3df	ip6_tunnel: add optional fwmark inherit Add IP6_TNL_F_USE_ORIG_FWMARK to ip6_tunnel, so that ip6_tnl_xmit2() makes a route lookup taking into account skb->fwmark and doesnt cache lookup result. This permits more flexibility in policies and firewall setups. To setup such a tunnel, "fwmark inherit" option should be added to "ip -f inet6 tunnel" command. Reported-by: Anders Franzen <Anders.Franzen@ericsson.com> CC: Hans Schillström <hans.schillstrom@ericsson.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-20 14:50:00 -04:00
Yan, Zheng	8e2ec63917	ipv6: don't use inetpeer to store metrics for routes. Current IPv6 implementation uses inetpeer to store metrics for routes. The problem with inetpeer is that it doesn't take subnet prefix length into consideration. If two routes have the same address but different prefix lengths, they share the same inetpeer. So changing metrics of one route also affects the other. The fix is to allocate separate metrics storage for each route. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-17 00:57:26 -04:00
Tore Anderson	026359bc6e	ipv6: Send ICMPv6 RSes only when RAs are accepted This patch improves the logic determining when to send ICMPv6 Router Solicitations, so that they are 1) always sent when the kernel is accepting Router Advertisements, and 2) never sent when the kernel is not accepting RAs. In other words, the operational setting of the "accept_ra" sysctl is used. The change also makes the special "Hybrid Router" forwarding mode ("forwarding" sysctl set to 2) operate exactly the same as the standard Router mode (forwarding=1). The only difference between the two was that RSes were being sent in the Hybrid Router mode only. The sysctl documentation describing the special Hybrid Router mode has therefore been removed. Rationale for the change: Currently, the value of forwarding sysctl is the only thing determining whether or not to send RSes. If it has the value 0 or 2, they are sent, otherwise they are not. This leads to inconsistent behaviour in the following cases: * accept_ra=0, forwarding=0 * accept_ra=0, forwarding=2 * accept_ra=1, forwarding=2 * accept_ra=2, forwarding=1 In the first three cases, the kernel will send RSes, even though it will not accept any RAs received in reply. In the last case, it will not send any RSes, even though it will accept and process any RAs received. (Most routers will send unsolicited RAs periodically, so suppressing RSes in the last case will merely delay auto-configuration, not prevent it.) Also, it is my opinion that having the forwarding sysctl control RS sending behaviour (completely independent of whether RAs are being accepted or not) is simply not what most users would intuitively expect to be the case. Signed-off-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-16 19:14:41 -04:00
David S. Miller	52b9aca7ae	Merge branch 'master' of ../netdev/	2011-09-16 01:09:02 -04:00
Eric Dumazet	946cedccbd	tcp: Change possible SYN flooding messages "Possible SYN flooding on port xxxx " messages can fill logs on servers. Change logic to log the message only once per listener, and add two new SNMP counters to track : TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client TCPReqQFullDrop : number of times a SYN request was dropped because syncookies were not enabled. Based on a prior patch from Tom Herbert, and suggestions from David. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-09-15 14:49:43 -04:00
David S. Miller	7858241655	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6	2011-08-30 17:43:56 -04:00
Maciej Żenczykowski	ec0506dbe4	net: relax PKTINFO non local ipv6 udp xmit check Allow transparent sockets to be less restrictive about the source ip of ipv6 udp packets being sent. Google-Bug-Id: 5018138 Signed-off-by: Maciej Żenczykowski <maze@google.com> CC: "Erik Kline" <ek@google.com> CC: "Lorenzo Colitti" <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-30 17:39:01 -04:00
Florian Westphal	c6675233f9	netfilter: nf_queue: reject NF_STOLEN verdicts from userspace A userspace listener may send (bogus) NF_STOLEN verdict, which causes skb leak. This problem was previously fixed via `64507fdbc2` (netfilter: nf_queue: fix NF_STOLEN skb leak) but this had to be reverted because NF_STOLEN can also be returned by a netfilter hook when iterating the rules in nf_reinject. Reject userspace NF_STOLEN verdict, as suggested by Michal Miroslaw. This is complementary to commit `fad5444043` (netfilter: avoid double free in nf_reinject). Cc: Julian Anastasov <ja@ssi.bg> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-08-30 15:01:20 +02:00
Ian Campbell	408dadf03f	net: ipv6: convert to SKB frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:52:12 -07:00
Yan, Zheng	e05c4ad3ed	mcast: Fix source address selection for multicast listener report Should check use count of include mode filter instead of total number of include mode filters. Signed-off-by: Zheng Yan <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:46:15 -07:00
David S. Miller	823dcd2506	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-08-20 10:39:12 -07:00
Daniel Baluta	98e77438ae	ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS IPV6_2292PKTOPTIONS is broken for 32-bit applications running in COMPAT mode on 64-bit kernels. The same problem was fixed for IPv4 with the patch: ipv4: Fix ip_getsockopt for IP_PKTOPTIONS, commit `dd23198e58` Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com> Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-19 03:19:07 -07:00
Tom Herbert	bdeab99191	rps: Add flag to skb to indicate rxhash is based on L4 tuple The l4_rxhash flag was added to the skb structure to indicate that the rxhash value was computed over the 4 tuple for the packet which includes the port information in the encapsulated transport packet. This is used by the stack to preserve the rxhash value in __skb_rx_tunnel. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-17 20:06:03 -07:00
Lionel Elie Mamane	c2bceb3d7f	sit tunnels: propagate IPv6 transport class to IPv4 Type of Service sit tunnels (IPv6 tunnel over IPv4) do not implement the "tos inherit" case to copy the IPv6 transport class byte from the inner packet to the IPv4 type of service byte in the outer packet. By contrast, ipip tunnels and GRE tunnels do. This patch, adapted from the similar code in net/ipv4/ipip.c and net/ipv4/ip_gre.c, implements that. This patch applies to 3.0.1, and has been tested on that version. Signed-off-by: Lionel Elie Mamane <lionel@mamane.lu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-16 16:28:55 -07:00
Eric Dumazet	33d480ce6d	net: cleanup some rcu_dereference_raw RCU api had been completed and rcu_access_pointer() or rcu_dereference_protected() are better than generic rcu_dereference_raw() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-12 02:55:28 -07:00
Mike Waychison	f0e3d0689d	tcp: initialize variable ecn_ok in syncookies path Using gcc 4.4.3, warnings are emitted for a possibly uninitialized use of ecn_ok. This can happen if cookie_check_timestamp() returns due to not having seen a timestamp. Defaulting to ecn off seems like a reasonable thing to do in this case, so initialize ecn_ok to false. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-10 21:59:57 -07:00
David S. Miller	19fd61785a	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net	2011-08-07 23:20:26 -07:00
David S. Miller	6e5714eaf7	net: Compute protocol sequence numbers and fragment IDs using MD5. Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-06 18:33:19 -07:00
Max Matveev	c15fea2d8c	ipv6: check for IPv4 mapped addresses when connecting IPv6 sockets When support for binding to 'mapped INADDR_ANY (::ffff.0.0.0.0)' was added in `0f8d3c7ac3` the rest of the code wasn't told, so now it's possible to bind an IPv6 datagram socket to ::ffff.0.0.0.0, connect it to another IPv4 address and it will all work except for getsockname() which does not return the local address as expected. To give getsockname() something to work with check for 'mapped INADDR_ANY' when connecting and update the in-core source addresses appropriately. Signed-off-by: Max Matveev <makc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-05 03:56:30 -07:00
Eric Dumazet	f2c31e32b3	net: fix NULL dereferences in check_peer_redir() Gergely Kalman reported crashes in check_peer_redir(). It appears commit `f39925dbde` (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-03 03:34:12 -07:00
Stephen Hemminger	a9b3cd7f32	rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER When assigning a NULL value to an RCU protected pointer, no barrier is needed. rcu_assign_pointer() used to handle that, but will soon change to not handle the special case. Convert all rcu_assign_pointer of NULL value. // <smpl> @@ expression P; @@ - rcu_assign_pointer(P, NULL) + RCU_INIT_POINTER(P, NULL) // </smpl> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-02 04:29:23 -07:00
Lorenzo Colitti	76f793e3a4	ipv6: updates to privacy addresses per RFC 4941. Update the code to handle some of the differences between RFC 3041 and RFC 4941, which obsoletes it. Also a couple of janitorial fixes. - Allow router advertisements to increase the lifetime of temporary addresses. This was not allowed by RFC 3041, but is specified by RFC 4941. It is useful when RA lifetimes are lower than TEMP_{VALID,PREFERRED}_LIFETIME: in this case, the previous code would delete or deprecate addresses prematurely. - Change the default of MAX_RETRY to 3 per RFC 4941. - Add a comment to clarify that the preferred and valid lifetimes in inet6_ifaddr are relative to the timestamp. - Shorten lines to 80 characters in a couple of places. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 18:05:00 -07:00
Eric Dumazet	89b0212697	ip6tnl: avoid touching dst refcount in ip6_tnl_xmit2() Even using percpu stats, we still hit tunnel dst_entry refcount in ip6_tnl_xmit2() Since we are in RCU locked section, we can use skb_dst_set_noref() and avoid these atomic operations, leaving dst shared on cpus. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	897dc80b95	ipv6: avoid a dst_entry refcount change in ipv6_destopt_rcv() ipv6_destopt_rcv() runs with rcu_read_lock(), so there is no need to take a temporary reference on dst_entry, even if skb is freed by ip6_parse_tlv() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	d14730b8e9	ipv6: use RCU in inet6_csk_xmit() Use RCU to avoid changing dst_entry refcount in fast path. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Eric Dumazet	cfdf76474e	ipv6: some RCU conversions ICMP and ND are not fast path, but still we can avoid changing idev refcount, using RCU. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-01 00:12:00 -07:00
Jesper Juhl	91c66c6893	netfilter: ip_queue: Fix small leak in ipq_build_packet_message() ipq_build_packet_message() in net/ipv4/netfilter/ip_queue.c and net/ipv6/netfilter/ip6_queue.c contain a small potential mem leak as far as I can tell. We allocate memory for 'skb' with alloc_skb() and then call nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh)); NLMSG_PUT is a macro NLMSG_PUT(skb, pid, seq, type, len) \ NLMSG_NEW(skb, pid, seq, type, len, 0) that expands to NLMSG_NEW, which is also a macro which expands to: NLMSG_NEW(skb, pid, seq, type, len, flags) \ ({ if (unlikely(skb_tailroom(skb) < (int)NLMSG_SPACE(len))) \ goto nlmsg_failure; \ __nlmsg_put(skb, pid, seq, type, len, flags); }) If we take the true branch of the 'if' statement and 'goto nlmsg_failure', then we'll, at that point, return from ipq_build_packet_message() without having assigned 'skb' to anything and we'll leak the memory we allocated for it when it goes out of scope. Fix this by placing a 'kfree(skb)' at 'nlmsg_failure'. I admit that I do not know how likely this is to actually happen or even if there's something that guarantees that it will never happen - I'm not that familiar with this code, but if that is so, I've not been able to spot it. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-07-29 16:38:49 +02:00
Linus Torvalds	d5eab9152a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits) tg3: Remove 5719 jumbo frames and TSO blocks tg3: Break larger frags into 4k chunks for 5719 tg3: Add tx BD budgeting code tg3: Consolidate code that calls tg3_tx_set_bd() tg3: Add partial fragment unmapping code tg3: Generalize tg3_skb_error_unmap() tg3: Remove short DMA check for 1st fragment tg3: Simplify tx bd assignments tg3: Reintroduce tg3_tx_ring_info ASIX: Use only 11 bits of header for data size ASIX: Simplify condition in rx_fixup() Fix cdc-phonet build bonding: reduce noise during init bonding: fix string comparison errors net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared net: add IFF_SKB_TX_SHARED flag to priv_flags net: sock_sendmsg_nosec() is static forcedeth: fix vlans gianfar: fix bug caused by `87c288c6e9` gro: Only reset frag0 when skb can be pulled ...	2011-07-28 05:58:19 -07:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
YOSHIFUJI Hideaki	32019e651c	ipv6: Do not leave router anycast address for /127 prefixes. Original commit 2bda8a0c8af... "Disable router anycast address for /127 prefixes" says: | No need for matching code in addrconf_leave_anycast() as it | will silently ignore any attempt to leave an unknown anycast | address. After analysis, because 1) we may add two or more prefixes on the same interface, or 2) user may have manually joined that anycast, we may hit chances to have anycast address which as if we had generated one by /127 prefix and we should not leave from subnet-router anycast address unconditionally. CC: Bjørn Mork <bjorn@mork.no> CC: Brian Haley <brian.haley@hp.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-25 16:16:00 -07:00
Eric Dumazet	87c48fa3b4	ipv6: make fragment identifications less predictable IPv6 fragment identification generation is way beyond what we use for IPv4 : It uses a single generator. It's not scalable and allows DOS attacks. Now inetpeer is IPv6 aware, we can use it to provide a more secure and scalable frag ident generator (per destination, instead of system wide) This patch : 1) defines a new secure_ipv6_id() helper 2) extends inet_getid() to provide 32bit results 3) extends ipv6_select_ident() with a new dest parameter Reported-by: Fernando Gont <fernando@gont.com.ar> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-21 21:25:58 -07:00
Eric Dumazet	21efcfa0ff	ipv6: unshare inetpeers We currently cow metrics a bit too soon in IPv6 case : All routes are tied to a single inetpeer entry. Change ip6_rt_copy() to get destination address as second argument, so that we fill rt6i_dst before the dst_copy_metrics() call. icmp6_dst_alloc() must set rt6i_dst before calling dst_metric_set(), or else the cow is done while rt6i_dst is still NULL. If orig route points to readonly metrics, we can share the pointer instead of performing the memory allocation and copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-21 21:24:25 -07:00
David S. Miller	d3aaeb38c4	net: Add ->neigh_lookup() operation to dst_ops In the future dst entries will be neigh-less. In that environment we need to have an easy transition point for current users of dst->neighbour outside of the packet output fast path. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-18 00:40:17 -07:00
David S. Miller	69cce1d140	net: Abstract dst->neighbour accesses behind helpers. dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:35 -07:00
David S. Miller	9cbb7ecbcf	ipv6: Get rid of rt6i_nexthop macro. It just makes it harder to see 1) what the code is doing and 2) grep for all users of dst{->,.}neighbour Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:35 -07:00
David S. Miller	8f40b161de	neigh: Pass neighbour entry to output ops. This will get us closer to being able to do "neigh stuff" completely independent of the underlying dst_entry for protocols (ipv4/ipv6) that wish to do so. We will also be able to make dst entries neigh-less. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:17 -07:00
David S. Miller	542d4d685f	neigh: Kill ndisc_ops->queue_xmit It is always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 18:30:59 -07:00
David S. Miller	47ec132a40	neigh: Kill neigh_ops->hh_output It's always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 17:39:57 -07:00
David S. Miller	05e3aa0949	net: Create and use new helper, neigh_output(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 17:26:00 -07:00
David S. Miller	a29282972c	ipv6: Use calculated 'neigh' instead of re-evaluating dst->neighbour Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 14:30:47 -07:00
David S. Miller	f6b72b6217	net: Embed hh_cache inside of struct neighbour. Now that there is a one-to-one correspondence between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-14 07:53:20 -07:00
Bjørn Mork	2bda8a0c8a	Disable router anycast address for /127 prefixes RFC 6164 requires that routers MUST disable Subnet-Router anycast for the prefix when /127 prefixes are used. No need for matching code in addrconf_leave_anycast() as it will silently ignore any attempt to leave an unknown anycast address. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-07 04:15:10 -07:00
David S. Miller	e12fe68ce3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-07-05 23:23:37 -07:00
Marcus Meissner	c349a528cd	net: bind() fix error return on wrong address family Hi, Reinhard Max also pointed out that the error should be EAFNOSUPPORT according to POSIX. The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN. Other protocols error values in their af bind() methods in current mainline git as far as a brief look shows: EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25, No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip Ciao, Marcus Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-04 21:37:41 -07:00
David S. Miller	957c665f37	ipv6: Don't put artificial limit on routing table size. IPV6, unlike IPV4, doesn't have a routing cache. Routing table entries, as well as clones made in response to route lookup requests, all live in the same table. And all of these things are together collected in the destination cache table for ipv6. This means that routing table entries count against the garbage collection limits, even though such entries cannot ever be reclaimed and are added explicitly by the administrator (rather than being created in response to lookups). Therefore it makes no sense to count ipv6 routing table entries against the GC limits. Add a DST_NOCOUNT destination cache entry flag, and skip the counting if it is set. Use this flag bit in ipv6 when adding routing table entries. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 17:30:43 -07:00
David S. Miller	11d53b4990	ipv6: Don't change dst->flags using assignments. This blows away any flags already set in the entry. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 17:30:43 -07:00
Joe Perches	207ec0abbe	ipv6: Reduce switch/case indent Make the case labels the same indent as the switch. git diff -w shows 80 column reflowing, removal of a useless break after return, and moving open brace after case instead of separate line. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 16:11:16 -07:00
Xufeng Zhang	9cfaa8def1	udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packet Consider this scenario: When the size of the first received udp packet is bigger than the receive buffer, the MSG_TRUNC bit is set in msg->msg_flags. However, if a checksum error happens and this is a blocking socket, it will goto the try_again loop to receive the next packet. But if the size of the next udp packet is smaller than the receive buffer, the MSG_TRUNC flag should not be set, but because the MSG_TRUNC bit is not cleared in msg->msg_flags before receiving the next packet, MSG_TRUNC is still set, which is wrong. Fix this problem by clearing the MSG_TRUNC flag when starting over for a new packet. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-21 22:34:27 -07:00
Xufeng Zhang	32c90254ed	ipv6/udp: Use the correct variable to determine non-blocking condition udpv6_recvmsg() function is not using the correct variable to determine whether or not the socket is in non-blocking operation, this will lead to unexpected behavior when a UDP checksum error occurs. Consider a non-blocking udp receive scenario: when udpv6_recvmsg() is called by sock_common_recvmsg(), MSG_DONTWAIT bit of flags variable in udpv6_recvmsg() is cleared by "flags & ~MSG_DONTWAIT" in this call: err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); i.e. with udpv6_recvmsg() getting these values: int noblock = flags & MSG_DONTWAIT int flags = flags & ~MSG_DONTWAIT So, when udp checksum error occurs, the execution will go to csum_copy_err, and then the problem happens: csum_copy_err: ............... if (flags & MSG_DONTWAIT) return -EAGAIN; goto try_again; ............... But it will always go to try_again as MSG_DONTWAIT has been cleared from flags at call time -- only noblock contains the original value of MSG_DONTWAIT, so the test should be: if (noblock) return -EAGAIN; This is also consistent with what the ipv4/udp code does. Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-21 22:34:27 -07:00
David S. Miller	9f6ec8d697	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c	2011-06-20 22:29:08 -07:00
Eric Dumazet	1eddceadb0	net: rfs: enable RFS before first data packet is received On Thursday 16 June 2011 at 23:38 -0400, David Miller wrote: > From: Ben Hutchings <bhutchings@solarflare.com> > Date: Fri, 17 Jun 2011 00:50:46 +0100 > > > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote: > >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) > >> goto discard; > >> > >> if (nsk != sk) { > >> + sock_rps_save_rxhash(nsk, skb->rxhash); > >> if (tcp_child_process(sk, nsk, skb)) { > >> rsk = nsk; > >> goto reset; > >> > > > > I haven't tried this, but it looks reasonable to me. > > > > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar. > > Indeed ipv6 side needs the same fix. > > Eric please add that part and resubmit. And in fact I might stick > this into net-2.6 instead of net-next-2.6 > OK, here is the net-2.6 based one then, thanks ! [PATCH v2] net: rfs: enable RFS before first data packet is received First packet received on a passive tcp flow is not correctly RFS steered. One sock_rps_record_flow() call is missing in inet_accept() But before that, we also must record rxhash when child socket is setup. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Tom Herbert <therbert@google.com> CC: Ben Hutchings <bhutchings@solarflare.com> CC: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@conan.davemloft.net>	2011-06-17 15:27:31 -04:00
Nicolas Cavallari	2c38de4c1f	netfilter: fix looped (broad|multi)cast's MAC handling By default, when broadcast or multicast packets are sent from a local application, they are sent to the interface then looped by the kernel to other local applications, going through netfilter hooks in the process. These looped packets have their MAC header removed from the skb by the kernel looping code. This confuses netfilter's netlink queue, netlink log and the legacy ip_queue, because they try to extract a hardware address from these packets, but extract a part of the IP header instead. This patch prevents NFQUEUE, NFLOG and ip_QUEUE from including a MAC header if there is none in the packet. Signed-off-by: Nicolas Cavallari <cavallar@lri.fr> Signed-off-by: Patrick McHardy <kaber@trash.net>	2011-06-16 17:27:04 +02:00
Greg Rose	c7ac8679be	rtnetlink: Compute and store minimum ifinfo dump size The message size allocated for rtnl ifinfo dumps was limited to a single page. This is not enough for additional interface info available with devices that support SR-IOV and caused a bug in which VF info would not be displayed if more than approximately 40 VFs were created per interface. Implement a new function pointer for the rtnl_register service that will calculate the amount of data required for the ifinfo dump and allocate enough data to satisfy the request. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2011-06-09 20:38:07 -07:00
Jerry Chu	9ad7c049f0	tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-08 17:05:30 -07:00
stephen hemminger	aee80b54b2	ipv6: generate link local address for GRE tunnel Use same logic as SIT tunnel to handle link local address for GRE tunnel. OSPFv3 requires link-local address to function. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-08 17:05:30 -07:00
Marcus Meissner	5a079c305a	net/ipv6: check for mistakenly passed in non-AF_INET6 sockaddrs Same check as for IPv4, also do for IPv6. (If you passed in a IPv4 sockaddr_in here, the sizeof check in the line before would have triggered already though.) Signed-off-by: Marcus Meissner <meissner@suse.de> Cc: Reinhard Max <max@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-06 14:48:16 -07:00
Dave Jones	d232b8dded	netfilter: use unsigned variables for packet lengths in ip[6]_queue. Netlink message lengths can't be negative, so use unsigned variables. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:37:16 +02:00
Pablo Neira Ayuso	88ed01d17b	netfilter: nf_conntrack: fix ct refcount leak in l4proto->error() This patch fixes a refcount leak of ct objects that may occur if l4proto->error() assigns one conntrack object to one skbuff. In that case, we have to skip further processing in nf_conntrack_in(). With this patch, we can also fix wrong return values (-NF_ACCEPT) for special cases in ICMP[v6] that should not bump the invalid/error statistic counters. Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:37:02 +02:00
Eric Dumazet	fb04883371	netfilter: add more values to enum ip_conntrack_info Following error is raised (and other similar ones) : net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’: net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’ not in enumerated type ‘enum ip_conntrack_info’ gcc barfs on adding two enum values and getting a not enumerated result : case IP_CT_RELATED+IP_CT_IS_REPLY: Add missing enum values Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: David Miller <davem@davemloft.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-06-06 01:35:10 +02:00
Dan Rosenberg	71338aa7d0	net: convert %p usage to %pK The %pK format specifier is designed to hide exposed kernel pointers, specifically via /proc interfaces. Exposing these pointers provides an easy target for kernel write vulnerabilities, since they reveal the locations of writable structures containing easily triggerable function pointers. The behavior of %pK depends on the kptr_restrict sysctl. If kptr_restrict is set to 0, no deviation from the standard %p behavior occurs. If kptr_restrict is set to 1, the default, if the current user (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG (currently in the LSM tree), kernel pointers using %pK are printed as 0's. If kptr_restrict is set to 2, kernel pointers using %pK are printed as 0's regardless of privileges. Replacing with 0's was chosen over the default "(null)", which cannot be parsed by userland %p, which expects "(nil)". The supporting code for kptr_restrict and %pK are currently in the -mm tree. This patch converts users of %p in net/ to %pK. Cases of printing pointers to the syslog are not covered, since this would eliminate useful information for postmortem debugging and the reading of the syslog is already optionally protected by the dmesg_restrict sysctl. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: James Morris <jmorris@namei.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Graf <tgraf@infradead.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Kees Cook <kees.cook@canonical.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-24 01:13:12 -04:00
David S. Miller	6ac3f66492	ipv6: Fix return of xfrm6_tunnel_rcv() Like ipv4, just return xfrm6_rcv_spi()'s return value directly. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-24 01:11:51 -04:00
Florian Westphal	0f6c6392dc	ipv6: copy prefsrc setting when copying route entry commit `c3968a857a` ('ipv6: RTA_PREFSRC support for ipv6 route source address selection') added support for ipv6 prefsrc as an alternative to ipv6 addrlabels, but it did not work because the prefsrc entry was not copied. Cc: Daniel Walter <sahne@0x90.at> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-21 02:05:22 -04:00
Linus Torvalds	06f4e926d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.	2011-05-20 13:43:21 -07:00
Linus Torvalds	eb04f2f04e	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits) Revert "rcu: Decrease memory-barrier usage based on semi-formal proof" net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree() batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu() net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu() net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu() perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu() perf,rcu: convert call_rcu(free_ctx) to kfree_rcu() net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu() net,rcu: convert call_rcu(net_generic_release) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu() net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu() security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu() net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu() net,rcu: convert call_rcu(xps_map_release) to kfree_rcu() net,rcu: convert call_rcu(rps_map_release) to kfree_rcu() ...	2011-05-19 18:14:34 -07:00
Eric Dumazet	be281e554e	ipv6: reduce per device ICMP mib sizes ipv6 has per device ICMP SNMP counters, taking too much space because they use percpu storage. needed size per device is : (512+4)*sizeof(long)*number_of_possible_cpus*2 On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of memory per ipv6 enabled network device, taken in vmalloc pool. Since ICMP messages are rare, just use shared counters (atomic_long_t) Per network space ICMP counters are still using percpu memory, we might also convert them to shared counters in a future patch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-19 16:21:22 -04:00
David S. Miller	3c709f8fb4	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6 Conflicts: drivers/net/benet/be_main.c	2011-05-11 14:26:58 -04:00
David S. Miller	9bbc052d5e	Merge branch 'pablo/nf-2.6-updates' of git://1984.lsi.us.es/net-2.6	2011-05-10 15:04:35 -07:00
Steffen Klassert	43a4dea4c9	xfrm: Assign the inner mode output function to the dst entry As it is, we assign the outer modes output function to the dst entry when we create the xfrm bundle. This leads to two problems on interfamily scenarios. We might insert ipv4 packets into ip6_fragment when called from xfrm6_output. The system crashes if we try to fragment an ipv4 packet with ip6_fragment. This issue was introduced with git commit `ad0081e4` (ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed). The second issue is, that we might insert ipv4 packets in netfilter6 and vice versa on interfamily scenarios. With this patch we assign the inner mode output function to the dst entry when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner mode is used and the right fragmentation and netfilter functions are called. We switch then to outer mode with the output_finish functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-10 15:03:34 -07:00
Fernando Luis Vazquez Cao	4319cc0cf5	netfilter: IPv6: initialize TOS field in REJECT target module The IPv6 header is not zeroed out in alloc_skb so we must initialize it properly unless we want to see IPv6 packets with random TOS fields floating around. The current implementation resets the flow label but this could be changed if deemed necessary. We stumbled upon this issue when trying to apply a mangle rule to the RST packet generated by the REJECT target module. Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-05-10 09:55:44 +02:00
David S. Miller	d9d8da805d	inet: Pass flowi to ->queue_xmit(). This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 15:28:28 -07:00
Paul E. McKenney	11c476f31a	net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree The RCU callback prl_entry_destroy_rcu() just calls kfree(), so we can use kfree_rcu() instead of call_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:15 -07:00
Lai Jiangshan	e3cbf28fa6	net,rcu: convert call_rcu(ipv6_mc_socklist_reclaim) to kfree_rcu() The rcu callback ipv6_mc_socklist_reclaim() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(ipv6_mc_socklist_reclaim). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:51:02 -07:00
Lai Jiangshan	e57859854a	net,rcu: convert call_rcu(inet6_ifa_finish_destroy_rcu) to kfree_rcu() The rcu callback inet6_ifa_finish_destroy_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(inet6_ifa_finish_destroy_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:50 -07:00
Lai Jiangshan	38f57d1a4b	net,rcu: convert call_rcu(in6_dev_finish_destroy_rcu) to kfree_rcu() The rcu callback in6_dev_finish_destroy_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(in6_dev_finish_destroy_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>	2011-05-07 22:50:49 -07:00
David S. Miller	bdc712b4c2	inet: Decrease overhead of on-stack inet_cork. When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary. This is because inet_cork contains a "struct flowi" which isn't used in these code paths. Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2011-05-06 15:37:57 -07:00
David S. Miller	7143b7d412	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/tg3.c	2011-05-05 14:59:02 -07:00
Jiri Pirko	1c5cae815d	net: call dev_alloc_name from register_netdevice Force dev_alloc_name() to be called from register_netdevice() by dev_get_valid_name(). That allows removing multiple explicit dev_alloc_name() calls. The possibility to call dev_alloc_name in advance remains. This also fixes veth creation regression caused by `84c49d8c3e` Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-05 10:57:45 -07:00
David S. Miller	301102cc83	ipv6: Use flowi4->{daddr,saddr} in ipip6_tunnel_xmit(). Instead of rt->rt_{dst,src} Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-04 12:55:07 -07:00
David S. Miller	31e4543db2	ipv4: Make caller provide on-stack flow key to ip_route_output_ports(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-03 20:25:42 -07:00
Lucian Adrian Grijincu	ff538818f4	sysctl: net: call unregister_net_sysctl_table where needed ctl_table_headers registered with register_net_sysctl_table should have been unregistered with the equivalent unregister_net_sysctl_table Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-02 16:12:14 -07:00
Eric Dumazet	e67f88dd12	net: dont hold rtnl mutex during netlink dump callbacks Four years ago, Patrick made a change to hold rtnl mutex during netlink dump callbacks. I believe it was a wrong move. This slows down concurrent dumps, making good old /proc/net/ files faster than rtnetlink in some situations. This occurred to me because one "ip link show dev ..." was _very_ slow on a workload adding/removing network devices in background. All dump callbacks are able to use RCU locking now, so this patch does roughly a revert of commits : `1c2d670f36` : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks `6313c1e099` : [RTNETLINK]: Remove unnecessary locking in dump callbacks This let writers fight for rtnl mutex and readers going full speed. It also takes care of phonet : phonet_route_get() is now called from rcu read section. I renamed it to phonet_route_get_rcu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-02 15:26:28 -07:00
Ben Hutchings	ad246c992b	ipv4, ipv6, bonding: Restore control over number of peer notifications For backward compatibility, we should retain the module parameters and sysfs attributes to control the number of peer notifications (gratuitous ARPs and unsolicited NAs) sent after bonding failover. Also, it is possible for failover to take place even though the new active slave does not have link up, and in that case the peer notification should be deferred until it does. Change ipv4 and ipv6 so they do not automatically send peer notifications on bonding failover. Change the bonding driver to send separate NETDEV_NOTIFY_PEERS notifications when the link is up, as many times as requested. Since it does not directly control which protocols send notifications, make num_grat_arp and num_unsol_na aliases for a single parameter. Bump the bonding version number and update its documentation. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Acked-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-29 12:44:11 -07:00
David S. Miller	cf91166223	net: Use non-zero allocations in dst_alloc(). Make dst_alloc() and its users explicitly initialize the entire entry. The zero'ing done by kmem_cache_zalloc() was almost entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 22:26:00 -07:00
David S. Miller	5c1e6aa300	net: Make dst_alloc() take more explicit initializations. Now the dst->dev, dst->obsolete, and dst->flags values can be specified as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 22:25:59 -07:00
Shan Wei	96339d6c49	net:use help function of skb_checksum_start_offset to calculate offset Although these are equivalent, skb_checksum_start_offset() is more readable. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 13:28:57 -07:00
Eric Dumazet	f6d8bd051c	inet: add RCU protection to inet->opt We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We can't insert an rcu_head in struct ip_options since it's included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 13:16:35 -07:00
Steffen Klassert	c0a56e64ae	esp6: Fix scatterlist initialization When we use IPsec extended sequence numbers, we may overwrite the last scatterlist of the associated data by the scatterlist for the skb. This patch fixes this by placing the scatterlist for the skb right behind the last scatterlist of the associated data. esp4 does it already like that. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-26 12:46:04 -07:00
David S. Miller	2bd93d7af1	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Resolved logic conflicts causing a build failure due to drivers/net/r8169.c changes using a patch from Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-26 12:16:46 -07:00
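
The size estimate quoted in commit be281e554e (ipv6: reduce per device ICMP mib sizes) can be checked with a short calculation. The sketch below only reproduces the arithmetic from the commit message, (512+4) * sizeof(long) * number_of_possible_cpus * 2; the function name and parameters are hypothetical and are not kernel code.

```python
def percpu_icmp_mib_bytes(sizeof_long: int, possible_cpus: int) -> int:
    """Per-device estimate from the commit message:
    (512 + 4) * sizeof(long) * number_of_possible_cpus * 2."""
    return (512 + 4) * sizeof_long * possible_cpus * 2


# 32-bit kernel (sizeof(long) == 4 bytes) with 16 possible CPUs,
# the configuration named in the commit message.
total = percpu_icmp_mib_bytes(sizeof_long=4, possible_cpus=16)
print(total)              # bytes per IPv6-enabled device
print(total > 64 * 1024)  # compare against the "more than 64kbytes" claim
```

For that configuration the estimate comes to 66048 bytes, which is just over 64 KiB and matches the figure given in the message.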

2882 Commits