OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Herbert Xu	f945fa7ad9	[INET]: Fix truesize setting in ip_append_data As it is ip_append_data only counts page fragments to the skb that allocated it. As such it means that the first skb gets hit with a 4K charge even though it might have only used a fraction of it while all subsequent skb's that use the same page gets away with no charge at all. This bug was exposed by the UDP accounting patch. [ The wmem_alloc bumping needs to be moved with the truesize, noticed by Takahiro Yasui. -DaveM ] Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-23 03:11:43 -08:00
David S. Miller	1e34a11d55	[IPV4]: Add missing skb->truesize increment in ip_append_page(). And as noted by Takahiro Yasui, we thus need to bump the sk->sk_wmem_alloc at this spot as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-23 03:11:40 -08:00
Wang Chen	5b4d383a1a	[ICMP]: ICMP_MIB_OUTMSGS increment duplicated Commit "96793b482540f3a26e2188eaf75cb56b7829d3e3" (Add ICMPMsgStats MIB (RFC 4293)) made a mistake. In that patch, David L added a icmp_out_count() in ip_push_pending_frames(), remove icmp_out_count() from icmp_reply(). But he forgot to remove icmp_out_count() from icmp_send() too. Since icmp_send and icmp_reply will call icmp_push_reply, which will call ip_push_pending_frames, a duplicated increment happened in icmp_send. This patch remove the icmp_out_count from icmp_send too. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-21 03:39:45 -08:00
Eric Dumazet	8d3f099abe	[IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone() I noticed "ip route list" was slower than "cat /proc/net/route" on a machine with a full Internet routing table (214392 entries : Special thanks to Robert ;) ) This is similar to problem reported in commit `d8c9283089` ("[IPV4] ROUTE: ip_rt_dump() is unecessary slow") Fix is to avoid scanning the begining of fz_hash table, but directly seek to the right offset. Before patch : time ip route >/tmp/ROUTE real 0m1.285s user 0m0.712s sys 0m0.436s After patch # time ip route >/tmp/ROUTE real 0m0.835s user 0m0.692s sys 0m0.124s Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-20 20:31:39 -08:00
Joonwoo Park	6725033fa2	[IPV4] fib_trie: fix duplicated route issue http://bugzilla.kernel.org/show_bug.cgi?id=9493 The fib allows making identical routes with 'ip route replace'. This patch makes the fib return -EEXIST if replacement would cause duplication. Signed-off-by: Joonwoo Park <joonwpark81@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-20 20:31:38 -08:00
Joonwoo Park	bd566e7525	[IPV4] fib_hash: fix duplicated route issue http://bugzilla.kernel.org/show_bug.cgi?id=9493 The fib allows making identical routes with 'ip route replace'. This patch makes the fib return -EEXIST if replacement would cause duplication. Signed-off-by: Joonwoo Park <joonwpark81@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-20 20:31:37 -08:00
Eric Dumazet	0bcceadceb	[IPV4] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache In rt_cache_get_next(), no need to guard seq->private by a rcu_dereference() since seq is private to the thread running this function. Reading seq.private once (as guaranted bu rcu_dereference()) or several time if compiler really is dumb enough wont change the result. But we miss real spots where rcu_dereference() are needed, both in rt_cache_get_first() and rt_cache_get_next() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-10 03:55:57 -08:00
Brice Goglin	877364e60e	[LRO] Fix lro_mgr->features checks lro_mgr->features contains a bitmask of LRO_F_* values which are defined as power of two, not as bit indexes. They must be checked with x&LRO_F_FOO, not with test_bit(LRO_F_FOO,&x). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Acked-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-08 23:30:18 -08:00
Eric Dumazet	d8c9283089	[IPV4] ROUTE: ip_rt_dump() is unecessary slow I noticed "ip route list cache x.y.z.t" can be very slow. While strace-ing -T it I also noticed that first part of route cache is fetched quite fast : recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.000047> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.000042> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3740 <0.000055> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.000043> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\ 202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3732 <0.000053> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3708 <0.000052> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202 GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3680 <0.000041> while the part at the end of the table is more expensive: recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856> recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848> The following patch corrects this performance/latency problem, removing quadratic behavior. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-08 23:30:16 -08:00
Amos Waterland	92ffb85dd3	[IPV4] ipconfig: Fix regression in ip command line processing The recent changes for ip command line processing fixed some problems but unfortunately broke some common usage scenarios. In current 2.6.24-rc6 the following command line results in no IP address assignment, which is surely a regression: ip=10.0.2.15::10.0.2.2:255.255.255.0::eth0:off Please find below a patch that works for all cases I can find. Signed-off-by: Amos Waterland <apw@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-08 23:29:58 -08:00
Herbert Xu	f844c74fe0	[IPV4] raw: Strengthen check on validity of iph->ihl We currently check that iph->ihl is bounded by the real length and that the real length is greater than the minimum IP header length. However, we did not check the caes where iph->ihl is less than the minimum IP header length. This breaks because some ip_fast_csum implementations assume that which is quite reasonable. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-08 23:29:57 -08:00
Mark McLoughlin	44344b2a85	[INET]: Fix netdev renaming and inet address labels When re-naming an interface, the previous secondary address labels get lost e.g. $> brctl addbr foo $> ip addr add 192.168.0.1 dev foo $> ip addr add 192.168.0.2 dev foo label foo:00 $> ip addr show dev foo \| grep inet inet 192.168.0.1/32 scope global foo inet 192.168.0.2/32 scope global foo:00 $> ip link set foo name bar $> ip addr show dev bar \| grep inet inet 192.168.0.1/32 scope global bar inet 192.168.0.2/32 scope global bar:2 Turns out to be a simple thinko in inetdev_changename() - clearly we want to look at the address label, rather than the device name, for a suffix to retain. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-04 03:55:34 -08:00
Gavin McCullagh	2072c228c9	[TCP]: use non-delayed ACK for congestion control RTT When a delayed ACK representing two packets arrives, there are two RTT samples available, one for each packet. The first (in order of seq number) will be artificially long due to the delay waiting for the second packet, the second will trigger the ACK and so will not itself be delayed. According to rfc1323, the SRTT used for RTO calculation should use the first rtt, so receivers echo the timestamp from the first packet in the delayed ack. For congestion control however, it seems measuring delayed ack delay is not desirable as it varies independently of congestion. The patch below causes seq_rtt and last_ackt to be updated with any available later packet rtts which should have less (and hopefully zero) delack delay. The rtt value then gets passed to ca_ops->pkts_acked(). Where TCP_CONG_RTT_STAMP was set, effort was made to supress RTTs from within a TSO chunk (!fully_acked), using only the final ACK (which includes any TSO delay) to generate RTTs. This patch removes these checks so RTTs are passed for each ACK to ca_ops->pkts_acked(). For non-delay based congestion control (cubic, h-tcp), rtt is sometimes used for rtt-scaling. In shortening the RTT, this may make them a little less aggressive. Delay-based schemes (eg vegas, veno, illinois) should get a cleaner, more accurate congestion signal, particularly for small cwnds. The congestion control module can potentially also filter out bad RTTs due to the delayed ack alarm by looking at the associated cnt which (where delayed acking is in use) should probably be 1 if the alarm went off or greater if the ACK was triggered by a packet. Signed-off-by: Gavin McCullagh <gavin.mccullagh@nuim.ie> Acked-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-29 19:11:21 -08:00
Simon Horman	9cecd07c3f	[IPV4] Fix ip=dhcp regression David Brownell pointed out a regression in my recent "Fix ip command line processing" patch. It turns out to be a fairly blatant oversight on my part whereby ic_enable is never set, and thus autoconfiguration is never enabled. Clearly my testing was broken :-( The solution that I have is to set ic_enable to 1 if we hit ip_auto_config_setup(), which basically means that autoconfiguration is activated unless told otherwise. I then flip ic_enable to 0 if ip=off, ip=none, ip=::::::off or ip=::::::none using ic_proto_name(); The incremental patch is below, let me know if a non-incremental version is prepared, as I did as for the original patch to be reverted pending a fix. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-28 13:39:11 -08:00
Simon Horman	a6c05c3d06	[IPV4]: Fix ip command line processing. Recently the documentation in Documentation/nfsroot.txt was update to note that in fact ip=off and ip=::::::off as the latter is ignored and the default (on) is used. This was certainly a step in the direction of reducing confusion. But it seems to me that the code ought to be fixed up so that ip=::::::off actually turns off ip autoconfiguration. This patch also notes more specifically that ip=on (aka ip=::::::on) is the default. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-26 19:36:36 -08:00
Patrick McHardy	fae718ddaf	[NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility Some users do "modprobe ip_conntrack hashsize=...". Since we have the module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the hashsize parameter is unknown for nf_conntrack_ipv4 however and makes it fail. Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4. Note: the nf_conntrack message in the ringbuffer will display an incorrect hashsize since nf_conntrack is first pulled in as a dependency and calculates the size itself, then it gets changed through a call to nf_conntrack_set_hashsize(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-26 19:36:33 -08:00
Denis V. Lunev	d883a03671	[IPV4]: OOPS with NETLINK_FIB_LOOKUP netlink socket [ Regression added by changeset: `cd40b7d398` [NET]: make netlink user -> kernel interface synchronious -DaveM ] nl_fib_input re-reuses incoming skb to send the reply. This means that this packet will be freed twice, namely in: - netlink_unicast_kernel - on receive path Use clone to send as a cure, the caller is responsible for kfree_skb on error. Thanks to Alexey Dobryan, who originally found the problem. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-21 02:01:53 -08:00
Joe Perches	e00ccd4a78	[NETFILTER] ipv4: Spelling fixes Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-20 14:05:03 -08:00
Timo Teras	1d06916747	[IPV4] ip_gre: set mac_header correctly in receive path mac_header update in ipgre_recv() was incorrectly changed to skb_reset_mac_header() when it was introduced. Signed-off-by: Timo Teras <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-20 00:10:33 -08:00
Mark Ryden	e0260feddf	[IPV4] ARP: Remove not used code In arp_process() (net/ipv4/arp.c), there is unused code: definition and assignment of tha (target hw address ). Signed-off-by: Mark Ryden <markryde@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-19 23:38:11 -08:00
Satoru SATOH	488faa2ae3	[IPV4]: Make tcp_input_metrics() get minimum RTO via tcp_rto_min() tcp_input_metrics() refers to the built-time constant TCP_RTO_MIN regardless of configured minimum RTO with iproute2. Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-16 14:00:19 -08:00
Amos Waterland	f33e1d9fa2	[IPV4]: Updates to nfsroot documentation The difference between ip=off and ip=::::::off has been a cause of much confusion. Document how each behaves, and do not contradict ourselves by saying that "off" is the default when in fact "any" is the default and is descibed as being so lower in the file. Signed-off-by: Amos Waterland <apw@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-14 13:54:40 -08:00
Patrick McHardy	a18aa31b77	[NETFILTER]: ip_tables: fix compat copy race When copying entries to user, the kernel makes two passes through the data, first copying all the entries, then fixing up names and counters. On the second pass it copies the kernel and match data from userspace to the kernel again to find the corresponding structures, expecting that kernel pointers contained in the data are still valid. This is obviously broken, fix by avoiding the second pass completely and fixing names and counters while dumping the ruleset, using the kernel-internal data structures. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-14 13:54:35 -08:00
Thomas Graf	2017a72c07	[IPv4] ESP: Discard dummy packets introduced in rfc4303 RFC4303 introduces dummy packets with a nexthdr value of 59 to implement traffic confidentiality. Such packets need to be dropped silently and the payload may not be attempted to be parsed as it consists of random chunk. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-11 02:45:26 -08:00
Pavel Emelyanov	a4e65d36a9	[IPV4]: Swap the ifa allocation with the"ipv4_devconf_setall" call According to Herbert, the ipv4_devconf_setall should be called only when the ifa is added to the device. However, failed ifa allocation may bring things into inconsistent state. Move the call to ipv4_devconf_setall after the ifa allocation. Fits both net-2.6 (with offsets) and net-2.6.25 (cleanly). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-11 02:45:25 -08:00
Denis V. Lunev	56c99d0415	[IPV4]: Remove prototype of ip_rt_advice ip_rt_advice has been gone, so no need to keep prototype and debug message. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-07 01:07:38 -08:00
Mitsuru Chinen	7f53878dc2	[IPv4]: Reply net unreachable ICMP message IPv4 stack doesn't reply any ICMP destination unreachable message with net unreachable code when IP detagrams are being discarded because of no route could be found in the forwarding path. Incidentally, IPv6 stack replies such ICMPv6 message in the similar situation. Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-07 01:07:24 -08:00
Andrew Gallatin	621544eb8c	[LRO]: fix lro_gen_skb() alignment Add a field to the lro_mgr struct so that drivers can specify how much padding is required to align layer 3 headers when a packet is copied into a freshly allocated skb by inet_lro.c:lro_gen_skb(). Without padding, skbs generated by LRO will cause alignment warnings on architectures which require strict alignment (seen on sparc64). Myri10GE is updated to use this field. Signed-off-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:32 -08:00
Ilpo J�rvinen	4e67d876ce	[TCP]: NAGLE_PUSH seems to be a wrong way around The comment in tcp_nagle_test suggests that. This bug is very very old, even 2.4.0 seems to have it. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:31 -08:00
Ilpo J�rvinen	52d3408150	[TCP]: Move prior_in_flight collect to more robust place The previous location is after sacktag processing, which affects counters tcp_packets_in_flight depends on. This may manifest as wrong behavior if new SACK blocks are present and all is clear for call to tcp_cong_avoid, which in the case of tcp_reno_cong_avoid bails out early because it thinks that TCP is not limited by cwnd. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:30 -08:00
Ilpo J�rvinen	3e6f049e0c	[TCP] FRTO: Use of existing funcs make code more obvious & robust Though there's little need for everything that tcp_may_send_now does (actually, even the state had to be adjusted to pass some checks FRTO does not want to occur), it's more robust to let it make the decision if sending is allowed. State adjustments needed: - Make sure snd_cwnd limit is not hit in there - Disable nagle (if necessary) through the frto_counter == 2 The result of check for frto_counter in argument to call for tcp_enter_frto_loss can just be open coded, therefore there isn't need to store the previous frto_counter past tcp_may_send_now. In addition, returns can then be combined. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:29 -08:00
Pavel Emelyanov	4ac63ad6c5	[IPVS]: Fix sched registration race when checking for name collision. The register_ip_vs_scheduler() checks for the scheduler with the same name under the read-locked __ip_vs_sched_lock, then drops, takes it for writing and puts the scheduler in list. This is racy, since we can have a race window between the lock being re-locked for writing. The fix is to search the scheduler with the given name right under the write-locked __ip_vs_sched_lock. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:27 -08:00
Pavel Emelyanov	a014bc8f0f	[IPVS]: Don't leak sysctl tables if the scheduler registration fails. In case we load lblc or lblcr module we can leak some sysctl tables if the call to register_ip_vs_scheduler() fails. I've looked at the register_ip_vs_scheduler() code and saw, that the only reason to fail is the name collision, so I think that with some 3rd party schedulers this becomes a relevant issue. No? Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-12-05 05:37:26 -08:00
Herbert Xu	d523a328fb	[INET]: Fix inet_diag dead-lock regression The inet_diag register fix broke inet_diag module loading because the loaded module had to take the same mutex that's already held by the loader in order to register the new handler. This patch fixes it by introducing a separate mutex to protect the handling of handlers. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-12-03 15:51:25 +11:00
Stephen Hemminger	a357dde9df	[TCP] illinois: Incorrect beta usage Lachlan Andrew observed that my TCP-Illinois implementation uses the beta value incorrectly: The parameter beta in the paper specifies the amount to decrease by: that is, on loss, W <- W - betaW but in tcp_illinois_ssthresh() uses beta as the amount to decrease to: W <- betaW This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network, hurting performance. Note: since the base beta value is .5, it has no impact on a congested network. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-30 01:10:55 +11:00
Pavel Emelyanov	076931989f	[INET]: Fix inet_diag register vs rcv race The following race is possible when one cpu unregisters the handler while other one is trying to receive a message and call this one: CPU1: CPU2: inet_diag_rcv() inet_diag_unregister() mutex_lock(&inet_diag_mutex); netlink_rcv_skb(skb, &inet_diag_rcv_msg); if (inet_diag_table[nlh->nlmsg_type] == NULL) /* false handler is still registered / ... netlink_dump_start(idiagnl, skb, nlh, inet_diag_dump, NULL); cb = kzalloc(sizeof(cb), GFP_KERNEL); /* sleep here freeing memory * or preempt * or sleep later on nlk->cb_mutex / spin_lock(&inet_diag_register_lock); inet_diag_table[type] = NULL; ... spin_unlock(&inet_diag_register_lock); synchronize_rcu(); / CPU1 is sleeping - RCU quiescent * state is passed / return; / inet_diag_dump is finally called: / inet_diag_dump() handler = inet_diag_table[cb->nlh->nlmsg_type]; BUG_ON(handler == NULL); / OOPS! While we slept the unregister has set * handler to NULL :( */ Grep showed, that the register/unregister functions are called from init/fini module callbacks for tcp_/dccp_diag, so it's OK to use the inet_diag_mutex to synchronize manipulations with the inet_diag_table and the access to it. Besides, as Herbert pointed out, asynchronous dumps should hold this mutex as well, and thus, we provide the mutex as cb_mutex one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-30 00:08:14 +11:00
Adrian Bunk	3660019e5f	[IPV4]: Remove bogus ifdef mess in arp_process The #ifdef's in arp_process() were not only a mess, they were also wrong in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or CONFIG_NETDEV_10000=y) cases. Since they are not required this patch removes them. Also removed are some #ifdef's around #include's that caused compile errors after this change. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-26 23:17:53 +08:00
Ilpo Järvinen	7f9c33e515	[TCP] MTUprobe: Cleanup send queue check (no need to loop) The original code has striking complexity to perform a query which can be reduced to a very simple compare. FIN seqno may be included to write_seq but it should not make any significant difference here compared to skb->len which was used previously. One won't end up there with SYN still queued. Use of write_seq check guarantees that there's a valid skb in send_head so I removed the extra check. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: John Heffner <jheffner@psc.edu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-23 19:10:56 +08:00
Ilpo Järvinen	91cc17c0e5	[TCP]: MTUprobe: receiver window & data available checks fixed It seems that the checked range for receiver window check should begin from the first rather than from the last skb that is going to be included to the probe. And that can be achieved without reference to skbs at all, snd_nxt and write_seq provides the correct seqno already. Plus, it SHOULD account packets that are necessary to trigger fast retransmit [RFC4821]. Location of snd_wnd < probe_size/size_needed check is bogus because it will cause the other if() match as well (due to snd_nxt >= snd_una invariant). Removed dead obvious comment. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2007-11-23 19:08:16 +08:00
Pavel Emelyanov	d535a916cd	[IPVS]: Fix compiler warning about unused register_ip_vs_protocol This is silly, but I have turned the CONFIG_IP_VS to m, to check the compilation of one (recently sent) fix and set all the CONFIG_IP_VS_PROTO_XXX options to n to speed up the compilation. In this configuration the compiler warns me about CC [M] net/ipv4/ipvs/ip_vs_proto.o net/ipv4/ipvs/ip_vs_proto.c:49: warning: 'register_ip_vs_protocol' defined but not used Indeed. With no protocols selected there are no calls to this function - all are compiled out with ifdefs. Maybe the best fix would be to surround this call with ifdef-s or tune the Kconfig dependences, but I think that marking this register function as __used is enough. No? Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:44:01 -08:00
Jonas Danielsson	b4a9811c42	[ARP]: Fix arp reply when sender ip 0 Fix arp reply when received arp probe with sender ip 0. Send arp reply with target ip address 0.0.0.0 and target hardware address set to hardware address of requester. Previously sent reply with target ip address and target hardware address set to same as source fields. Signed-off-by: Jonas Danielsson <the.sator@gmail.com> Acked-by: Alexey Kuznetov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:38:16 -08:00
YOSHIFUJI Hideaki	354faf0977	[IPV4] TCPMD5: Use memmove() instead of memcpy() because we have overlaps. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:30:31 -08:00
YOSHIFUJI Hideaki	a80cc20da4	[IPV4] TCPMD5: Omit redundant NULL check for kfree() argument. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 17:30:06 -08:00
Evgeniy Polyakov	1f305323ff	[NETFILTER]: Fix kernel panic with REDIRECT target. When connection tracking entry (nf_conn) is about to copy itself it can have some of its extension users (like nat) as being already freed and thus not required to be copied. Actually looking at this function I suspect it was copied from nf_nat_setup_info() and thus bug was introduced. Report and testing from David <david@unsolicited.net>. [ Patrick McHardy states: I now understand whats happening: - new connection is allocated without helper - connection is REDIRECTed to localhost - nf_nat_setup_info adds NAT extension, but doesn't initialize it yet - nf_conntrack_alter_reply performs a helper lookup based on the new tuple, finds the SIP helper and allocates a helper extension, causing reallocation because of too little space - nf_nat_move_storage is called with the uninitialized nat extension So your fix is entirely correct, thanks a lot :) ] Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-20 04:27:35 -08:00
Joe Perches	464c4f184a	[IPV4]: Add missing "space" Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:46:29 -08:00
Sam Jansen	5487796f0c	[TCP]: Problem bug with sysctl_tcp_congestion_control function From: "Sam Jansen" <sjansen@google.com> sysctl_tcp_congestion_control seems to have a bug that prevents it from actually calling the tcp_set_default_congestion_control function. This is not so apparent because it does not return an error and generally the /proc interface is used to configure the default TCP congestion control algorithm. This is present in 2.6.18 onwards and probably earlier, though I have not inspected 2.6.15--2.6.17. sysctl_tcp_congestion_control calls sysctl_string and expects a successful return code of 0. In such a case it actually sets the congestion control algorithm with tcp_set_default_congestion_control. Otherwise, it returns the value returned by sysctl_string. This was correct in 2.6.14, as sysctl_string returned 0 on success. However, sysctl_string was updated to return 1 on success around about 2.6.15 and sysctl_tcp_congestion_control was not updated. Even though sysctl_tcp_congestion_control returns 1, do_sysctl_strategy converts this return code to '0', so the caller never notices the error. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:28:21 -08:00
Ilpo J�rvinen	6e42141009	[TCP] MTUprobe: fix potential sk_send_head corruption When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 23:24:09 -08:00
Simon Horman	9055fa1f3d	[IPVS]: Move remaining sysctl handlers over to CTL_UNNUMBERED Switch the remaining IPVS sysctl entries over to to use CTL_UNNUMBERED, I stronly doubt that anyone is using the sys_sysctl interface to these variables. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:51:13 -08:00
Simon Horman	9e103fa6bd	[IPVS]: Fix sysctl warnings about missing strategy in schedulers sysctl table check failed: /net/ipv4/vs/lblc_expiration .3.5.21.19 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/lblcr_expiration .3.5.21.20 Missing strategy Switch these entried over to use CTL_UNNUMBERED as clearly the sys_syscal portion wasn't working. This is along the same lines as Christian Borntraeger's patch that fixes up entries with no stratergy in net/ipv4/ipvs/ip_vs_ctl.c Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:50:21 -08:00
Christian Borntraeger	611cd55b15	[IPVS]: Fix sysctl warnings about missing strategy Running the latest git code I get the following messages during boot: sysctl table check failed: /net/ipv4/vs/drop_entry .3.5.21.4 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/drop_packet .3.5.21.5 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/secure_tcp .3.5.21.6 Missing strategy [...] sysctl table check failed: /net/ipv4/vs/sync_threshold .3.5.21.24 Missing strategy I removed the binary sysctl handler for those messages and also removed the definitions in ip_vs.h. The alternative would be to implement a proper strategy handler, but syscall sysctl is deprecated. There are other sysctl definitions that are commented out or work with the default sysctl_data strategy. I did not touch these. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-19 21:49:25 -08:00

1 2 3 4 5 ...

1818 Commits