Commit Graph

2882 Commits

Author SHA1 Message Date
Roy.Li 01b7806cdc ipv6: remove a rcu_read_lock in ndisc_constructor
in6_dev_get(dev) takes a reference on struct inet6_dev, we dont need
rcu locking in ndisc_constructor()

Signed-off-by: Roy.Li <rongqing.li@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-17 19:27:56 -04:00
Eric Dumazet 87fb4b7b53 net: more accurate skb truesize
skb truesize currently accounts for sk_buff struct and part of skb head.
kmalloc() roundings are also ignored.

Considering that skb_shared_info is larger than sk_buff, its time to
take it into account for better memory accounting.

This patch introduces SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.

At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.

Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
aligned to cache lines, as before.

Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But its a necessary step for a
more accurate memory accounting.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <ak@linux.intel.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-13 16:05:07 -04:00
Yan, Zheng cdaf557034 gro: refetch inet6_protos[] after pulling ext headers
ipv6_gro_receive() doesn't update the protocol ops after pulling
the ext headers. It looks like a typo.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-10 14:26:16 -04:00
David S. Miller 88c5100c28 Merge branch 'master' of github.com:davem330/net
Conflicts:
	net/batman-adv/soft-interface.c
2011-10-07 13:38:43 -04:00
Yan, Zheng 260fcbeb1a tcp: properly handle md5sig_pool references
tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
only hold one reference to md5sig_pool. but tcp_v4_md5_do_add()
increases use count of md5sig_pool for each peer. This patch
makes tcp_v4_md5_do_add() only increases use count for the first
tcp md5sig peer.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-04 23:31:24 -04:00
Yan, Zheng 676a1184e8 ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket
ipv6_ac_list and ipv6_fl_list from listening socket are inadvertently
shared with new socket created for connection.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-29 00:32:10 -04:00
Ben Greear 67928c4041 ipv6-multicast: Fix memory leak in IPv6 multicast.
If reg_vif_xmit cannot find a routing entry, be sure to
free the skb before returning the error.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 15:34:00 -04:00
Madalin Bucur fbe5818690 ipv6: check return value for dst_alloc
return value of dst_alloc must be checked before use

Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 15:32:06 -04:00
Ben Greear 2015de5fe2 ipv6-multicast: Fix memory leak in input path.
Have to free the skb before returning if we fail
the fib lookup.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 15:16:08 -04:00
Eric Dumazet b82d1bb4fd tcp: unalias tcp_skb_cb flags and ip_dsfield
struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)

Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfield

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 02:20:08 -04:00
David S. Miller 8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Roy Li 8603e33d01 ipv6: fix a possible double free
When calling snmp6_alloc_dev fails, the snmp6 relevant memory
are freed by snmp6_alloc_dev. Calling in6_dev_finish_destroy
will free these memory twice.

Double free will lead that undefined behavior occurs.

Signed-off-by: Roy Li <rongqing.li@windriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-20 15:10:16 -04:00
Eric Dumazet d24f22f3df ip6_tunnel: add optional fwmark inherit
Add IP6_TNL_F_USE_ORIG_FWMARK to ip6_tunnel, so that ip6_tnl_xmit2()
makes a route lookup taking into account skb->fwmark and doesnt cache
lookup result.

This permits more flexibility in policies and firewall setups.

To setup such a tunnel, "fwmark inherit" option should be added to "ip
-f inet6 tunnel" command.

Reported-by: Anders Franzen <Anders.Franzen@ericsson.com>
CC: Hans Schillström <hans.schillstrom@ericsson.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-20 14:50:00 -04:00
Yan, Zheng 8e2ec63917 ipv6: don't use inetpeer to store metrics for routes.
Current IPv6 implementation uses inetpeer to store metrics for
routes. The problem of inetpeer is that it doesn't take subnet
prefix length in to consideration. If two routes have the same
address but different prefix length, they share same inetpeer.
So changing metrics of one route also affects the other. The
fix is to allocate separate metrics storage for each route.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-17 00:57:26 -04:00
Tore Anderson 026359bc6e ipv6: Send ICMPv6 RSes only when RAs are accepted
This patch improves the logic determining when to send ICMPv6 Router
Solicitations, so that they are 1) always sent when the kernel is
accepting Router Advertisements, and 2) never sent when the kernel is
not accepting RAs. In other words, the operational setting of the
"accept_ra" sysctl is used.

The change also makes the special "Hybrid Router" forwarding mode
("forwarding" sysctl set to 2) operate exactly the same as the standard
Router mode (forwarding=1). The only difference between the two was
that RSes was being sent in the Hybrid Router mode only. The sysctl
documentation describing the special Hybrid Router mode has therefore
been removed.

Rationale for the change:

Currently, the value of forwarding sysctl is the only thing determining
whether or not to send RSes. If it has the value 0 or 2, they are sent,
otherwise they are not. This leads to inconsistent behaviour in the
following cases:

* accept_ra=0, forwarding=0
* accept_ra=0, forwarding=2
* accept_ra=1, forwarding=2
* accept_ra=2, forwarding=1

In the first three cases, the kernel will send RSes, even though it will
not accept any RAs received in reply. In the last case, it will not send
any RSes, even though it will accept and process any RAs received. (Most
routers will send unsolicited RAs periodically, so suppressing RSes in
the last case will merely delay auto-configuration, not prevent it.)

Also, it is my opinion that having the forwarding sysctl control RS
sending behaviour (completely independent of whether RAs are being
accepted or not) is simply not what most users would intuitively expect
to be the case.

Signed-off-by: Tore Anderson <tore@fud.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-16 19:14:41 -04:00
David S. Miller 52b9aca7ae Merge branch 'master' of ../netdev/ 2011-09-16 01:09:02 -04:00
Eric Dumazet 946cedccbd tcp: Change possible SYN flooding messages
"Possible SYN flooding on port xxxx " messages can fill logs on servers.

Change logic to log the message only once per listener, and add two new
SNMP counters to track :

TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.

Based on a prior patch from Tom Herbert, and suggestions from David.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 14:49:43 -04:00
David S. Miller 7858241655 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2011-08-30 17:43:56 -04:00
Maciej Żenczykowski ec0506dbe4 net: relax PKTINFO non local ipv6 udp xmit check
Allow transparent sockets to be less restrictive about
the source ip of ipv6 udp packets being sent.

Google-Bug-Id: 5018138
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: "Erik Kline" <ek@google.com>
CC: "Lorenzo Colitti" <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-30 17:39:01 -04:00
Florian Westphal c6675233f9 netfilter: nf_queue: reject NF_STOLEN verdicts from userspace
A userspace listener may send (bogus) NF_STOLEN verdict, which causes skb leak.

This problem was previously fixed via
64507fdbc2 (netfilter:
nf_queue: fix NF_STOLEN skb leak) but this had to be reverted because
NF_STOLEN can also be returned by a netfilter hook when iterating the
rules in nf_reinject.

Reject userspace NF_STOLEN verdict, as suggested by Michal Miroslaw.

This is complementary to commit fad5444043
(netfilter: avoid double free in nf_reinject).

Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-08-30 15:01:20 +02:00
Ian Campbell 408dadf03f net: ipv6: convert to SKB frag APIs
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 17:52:12 -07:00
Yan, Zheng e05c4ad3ed mcast: Fix source address selection for multicast listener report
Should check use count of include mode filter instead of total number
of include mode filters.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 17:46:15 -07:00
David S. Miller 823dcd2506 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-20 10:39:12 -07:00
Daniel Baluta 98e77438ae ipv6: Fix ipv6_getsockopt for IPV6_2292PKTOPTIONS
IPV6_2292PKTOPTIONS is broken for 32-bit applications running
in COMPAT mode on 64-bit kernels.

The same problem was fixed for IPv4 with the patch:
ipv4: Fix ip_getsockopt for IP_PKTOPTIONS,
commit dd23198e58

Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-19 03:19:07 -07:00
Tom Herbert bdeab99191 rps: Add flag to skb to indicate rxhash is based on L4 tuple
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet.  This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Lionel Elie Mamane c2bceb3d7f sit tunnels: propagate IPv6 transport class to IPv4 Type of Service
sit tunnels (IPv6 tunnel over IPv4) do not implement the "tos inherit"
case to copy the IPv6 transport class byte from the inner packet to
the IPv4 type of service byte in the outer packet. By contrast, ipip
tunnels and GRE tunnels do.

This patch, adapted from the similar code in net/ipv4/ipip.c and
net/ipv4/ip_gre.c, implements that.

This patch applies to 3.0.1, and has been tested on that version.

Signed-off-by: Lionel Elie Mamane <lionel@mamane.lu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-16 16:28:55 -07:00
Eric Dumazet 33d480ce6d net: cleanup some rcu_dereference_raw
RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-12 02:55:28 -07:00
Mike Waychison f0e3d0689d tcp: initialize variable ecn_ok in syncookies path
Using a gcc 4.4.3, warnings are emitted for a possibly uninitialized use
of ecn_ok.

This can happen if cookie_check_timestamp() returns due to not having
seen a timestamp.  Defaulting to ecn off seems like a reasonable thing
to do in this case, so initialized ecn_ok to false.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-10 21:59:57 -07:00
David S. Miller 19fd61785a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-07 23:20:26 -07:00
David S. Miller 6e5714eaf7 net: Compute protocol sequence numbers and fragment IDs using MD5.
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.

Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-06 18:33:19 -07:00
Max Matveev c15fea2d8c ipv6: check for IPv4 mapped addresses when connecting IPv6 sockets
When support for binding to 'mapped INADDR_ANY (::ffff.0.0.0.0)' was added
in 0f8d3c7ac3 the rest of the code
wasn't told so now it's possible to bind IPv6 datagram socket to
::ffff.0.0.0.0, connect it to another IPv4 address and it will all
work except for getsockhame() which does not return the local address
as expected.

To give getsockname() something to work with check for 'mapped INADDR_ANY'
when connecting and update the in-core source addresses appropriately.

Signed-off-by: Max Matveev <makc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-05 03:56:30 -07:00
Eric Dumazet f2c31e32b3 net: fix NULL dereferences in check_peer_redir()
Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-03 03:34:12 -07:00
Stephen Hemminger a9b3cd7f32 rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-02 04:29:23 -07:00
Lorenzo Colitti 76f793e3a4 ipv6: updates to privacy addresses per RFC 4941.
Update the code to handle some of the differences between
RFC 3041 and RFC 4941, which obsoletes it. Also a couple
of janitorial fixes.

- Allow router advertisements to increase the lifetime of
  temporary addresses. This was not allowed by RFC 3041,
  but is specified by RFC 4941. It is useful when RA
  lifetimes are lower than TEMP_{VALID,PREFERRED}_LIFETIME:
  in this case, the previous code would delete or deprecate
  addresses prematurely.

- Change the default of MAX_RETRY to 3 per RFC 4941.

- Add a comment to clarify that the preferred and valid
  lifetimes in inet6_ifaddr are relative to the timestamp.

- Shorten lines to 80 characters in a couple of places.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 18:05:00 -07:00
Eric Dumazet 89b0212697 ip6tnl: avoid touching dst refcount in ip6_tnl_xmit2()
Even using percpu stats, we still hit tunnel dst_entry refcount in
ip6_tnl_xmit2()

Since we are in RCU locked section, we can use skb_dst_set_noref() and
avoid these atomic operations, leaving dst shared on cpus.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 00:12:00 -07:00
Eric Dumazet 897dc80b95 ipv6: avoid a dst_entry refcount change in ipv6_destopt_rcv()
ipv6_destopt_rcv() runs with rcu_read_lock(), so there is no need to
take a temporay reference on dst_entry, even if skb is freed by
ip6_parse_tlv()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 00:12:00 -07:00
Eric Dumazet d14730b8e9 ipv6: use RCU in inet6_csk_xmit()
Use RCU to avoid changing dst_entry refcount in fast path.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 00:12:00 -07:00
Eric Dumazet cfdf76474e ipv6: some RCU conversions
ICMP and ND are not fast path, but still we can avoid changing idev
refcount, using RCU.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 00:12:00 -07:00
Jesper Juhl 91c66c6893 netfilter: ip_queue: Fix small leak in ipq_build_packet_message()
ipq_build_packet_message() in net/ipv4/netfilter/ip_queue.c and
net/ipv6/netfilter/ip6_queue.c contain a small potential mem leak as
far as I can tell.

We allocate memory for 'skb' with alloc_skb() annd then call
 nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh));

NLMSG_PUT is a macro
 NLMSG_PUT(skb, pid, seq, type, len) \
  		NLMSG_NEW(skb, pid, seq, type, len, 0)

that expands to NLMSG_NEW, which is also a macro which expands to:
 NLMSG_NEW(skb, pid, seq, type, len, flags) \
  	({	if (unlikely(skb_tailroom(skb) < (int)NLMSG_SPACE(len))) \
  			goto nlmsg_failure; \
  		__nlmsg_put(skb, pid, seq, type, len, flags); })

If we take the true branch of the 'if' statement and 'goto
nlmsg_failure', then we'll, at that point, return from
ipq_build_packet_message() without having assigned 'skb' to anything
and we'll leak the memory we allocated for it when it goes out of
scope.

Fix this by placing a 'kfree(skb)' at 'nlmsg_failure'.

I admit that I do not know how likely this to actually happen or even
if there's something that guarantees that it will never happen - I'm
not that familiar with this code, but if that is so, I've not been
able to spot it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-07-29 16:38:49 +02:00
Linus Torvalds d5eab9152a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  tg3: Remove 5719 jumbo frames and TSO blocks
  tg3: Break larger frags into 4k chunks for 5719
  tg3: Add tx BD budgeting code
  tg3: Consolidate code that calls tg3_tx_set_bd()
  tg3: Add partial fragment unmapping code
  tg3: Generalize tg3_skb_error_unmap()
  tg3: Remove short DMA check for 1st fragment
  tg3: Simplify tx bd assignments
  tg3: Reintroduce tg3_tx_ring_info
  ASIX: Use only 11 bits of header for data size
  ASIX: Simplify condition in rx_fixup()
  Fix cdc-phonet build
  bonding: reduce noise during init
  bonding: fix string comparison errors
  net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
  net: add IFF_SKB_TX_SHARED flag to priv_flags
  net: sock_sendmsg_nosec() is static
  forcedeth: fix vlans
  gianfar: fix bug caused by 87c288c6e9
  gro: Only reset frag0 when skb can be pulled
  ...
2011-07-28 05:58:19 -07:00
Arun Sharma 60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
YOSHIFUJI Hideaki 32019e651c ipv6: Do not leave router anycast address for /127 prefixes.
Original commit 2bda8a0c8af... "Disable router anycast
address for /127 prefixes" says:

|   No need for matching code in addrconf_leave_anycast() as it
|   will silently ignore any attempt to leave an unknown anycast
|   address.

After analysis, because 1) we may add two or more prefixes on the
same interface, or 2)user may have manually joined that anycast,
we may hit chances to have anycast address which as if we had
generated one by /127 prefix and we should not leave from subnet-
router anycast address unconditionally.

CC: Bjørn Mork <bjorn@mork.no>
CC: Brian Haley <brian.haley@hp.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-25 16:16:00 -07:00
Eric Dumazet 87c48fa3b4 ipv6: make fragment identifications less predictable
IPv6 fragment identification generation is way beyond what we use for
IPv4 : It uses a single generator. Its not scalable and allows DOS
attacks.

Now inetpeer is IPv6 aware, we can use it to provide a more secure and
scalable frag ident generator (per destination, instead of system wide)

This patch :
1) defines a new secure_ipv6_id() helper
2) extends inet_getid() to provide 32bit results
3) extends ipv6_select_ident() with a new dest parameter

Reported-by: Fernando Gont <fernando@gont.com.ar>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 21:25:58 -07:00
Eric Dumazet 21efcfa0ff ipv6: unshare inetpeers
We currently cow metrics a bit too soon in IPv6 case : All routes are
tied to a single inetpeer entry.

Change ip6_rt_copy() to get destination address as second argument, so
that we fill rt6i_dst before the dst_copy_metrics() call.

icmp6_dst_alloc() must set rt6i_dst before calling dst_metric_set(), or
else the cow is done while rt6i_dst is still NULL.

If orig route points to readonly metrics, we can share the pointer
instead of performing the memory allocation and copy.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 21:24:25 -07:00
David S. Miller d3aaeb38c4 net: Add ->neigh_lookup() operation to dst_ops
In the future dst entries will be neigh-less.  In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18 00:40:17 -07:00
David S. Miller 69cce1d140 net: Abstract dst->neighbour accesses behind helpers.
dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
David S. Miller 9cbb7ecbcf ipv6: Get rid of rt6i_nexthop macro.
It just makes it harder to see 1) what the code is doing
and 2) grep for all users of dst{->,.}neighbour

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
David S. Miller 8f40b161de neigh: Pass neighbour entry to output ops.
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.

We will also be able to make dst entries neigh-less.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:17 -07:00
David S. Miller 542d4d685f neigh: Kill ndisc_ops->queue_xmit
It is always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 18:30:59 -07:00
David S. Miller 47ec132a40 neigh: Kill neigh_ops->hh_output
It's always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:39:57 -07:00
David S. Miller 05e3aa0949 net: Create and use new helper, neigh_output().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:26:00 -07:00
David S. Miller a29282972c ipv6: Use calculated 'neigh' instead of re-evaluating dst->neighbour
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 14:30:47 -07:00
David S. Miller f6b72b6217 net: Embed hh_cache inside of struct neighbour.
Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14 07:53:20 -07:00
Bjørn Mork 2bda8a0c8a Disable router anycast address for /127 prefixes
RFC 6164 requires that routers MUST disable Subnet-Router anycast
for the prefix when /127 prefixes are used.

No need for matching code in addrconf_leave_anycast() as it
will silently ignore any attempt to leave an unknown anycast
address.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-07 04:15:10 -07:00
David S. Miller e12fe68ce3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-07-05 23:23:37 -07:00
Marcus Meissner c349a528cd net: bind() fix error return on wrong address family
Hi,

Reinhard Max also pointed out that the error should EAFNOSUPPORT according
to POSIX.

The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.

Other protocols error values in their af bind() methods in current mainline git as far
as a brief look shows:
	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip

Ciao, Marcus

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-04 21:37:41 -07:00
David S. Miller 957c665f37 ipv6: Don't put artificial limit on routing table size.
IPV6, unlike IPV4, doesn't have a routing cache.

Routing table entries, as well as clones made in response
to route lookup requests, all live in the same table.  And
all of these things are together collected in the destination
cache table for ipv6.

This means that routing table entries count against the garbage
collection limits, even though such entries cannot ever be reclaimed
and are added explicitly by the administrator (rather than being
created in response to lookups).

Therefore it makes no sense to count ipv6 routing table entries
against the GC limits.

Add a DST_NOCOUNT destination cache entry flag, and skip the counting
if it is set.  Use this flag bit in ipv6 when adding routing table
entries.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:30:43 -07:00
David S. Miller 11d53b4990 ipv6: Don't change dst->flags using assignments.
This blows away any flags already set in the entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:30:43 -07:00
Joe Perches 207ec0abbe ipv6: Reduce switch/case indent
Make the case labels the same indent as the switch.

git diff -w shows 80 column reflowing,
removal of a useless break after return, and moving
open brace after case instead of separate line.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 16:11:16 -07:00
Xufeng Zhang 9cfaa8def1 udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packet
Consider this scenario: When the size of the first received udp packet
is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags.
However, if checksum error happens and this is a blocking socket, it will
goto try_again loop to receive the next packet.  But if the size of the
next udp packet is smaller than receive buffer, MSG_TRUNC flag should not
be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before
receive the next packet, MSG_TRUNC is still set, which is wrong.

Fix this problem by clearing MSG_TRUNC flag when starting over for a
new packet.

Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 22:34:27 -07:00
Xufeng Zhang 32c90254ed ipv6/udp: Use the correct variable to determine non-blocking condition
udpv6_recvmsg() function is not using the correct variable to determine
whether or not the socket is in non-blocking operation, this will lead
to unexpected behavior when a UDP checksum error occurs.

Consider a non-blocking udp receive scenario: when udpv6_recvmsg() is
called by sock_common_recvmsg(), MSG_DONTWAIT bit of flags variable in
udpv6_recvmsg() is cleared by "flags & ~MSG_DONTWAIT" in this call:

    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                   flags & ~MSG_DONTWAIT, &addr_len);

i.e. with udpv6_recvmsg() getting these values:

	int noblock = flags & MSG_DONTWAIT
	int flags = flags & ~MSG_DONTWAIT

So, when udp checksum error occurs, the execution will go to
csum_copy_err, and then the problem happens:

    csum_copy_err:
            ...............
            if (flags & MSG_DONTWAIT)
                    return -EAGAIN;
            goto try_again;
            ...............

But it will always go to try_again as MSG_DONTWAIT has been cleared
from flags at call time -- only noblock contains the original value
of MSG_DONTWAIT, so the test should be:

            if (noblock)
                    return -EAGAIN;

This is also consistent with what the ipv4/udp code does.

Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 22:34:27 -07:00
David S. Miller 9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Eric Dumazet 1eddceadb0 net: rfs: enable RFS before first data packet is received
Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >>  			goto discard;
> >>
> >>  		if (nsk != sk) {
> >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
> >>  			if (tcp_child_process(sk, nsk, skb)) {
> >>  				rsk = nsk;
> >>  				goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit.  And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-17 15:27:31 -04:00
Nicolas Cavallari 2c38de4c1f netfilter: fix looped (broad|multi)cast's MAC handling
By default, when broadcast or multicast packet are sent from a local
application, they are sent to the interface then looped by the kernel
to other local applications, going throught netfilter hooks in the
process.

These looped packet have their MAC header removed from the skb by the
kernel looping code. This confuse various netfilter's netlink queue,
netlink log and the legacy ip_queue, because they try to extract a
hardware address from these packets, but extracts a part of the IP
header instead.

This patch prevent NFQUEUE, NFLOG and ip_QUEUE to include a MAC header
if there is none in the packet.

Signed-off-by: Nicolas Cavallari <cavallar@lri.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-16 17:27:04 +02:00
Greg Rose c7ac8679be rtnetlink: Compute and store minimum ifinfo dump size
The message size allocated for rtnl ifinfo dumps was limited to
a single page.  This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-06-09 20:38:07 -07:00
Jerry Chu 9ad7c049f0 tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side
This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.

It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.

The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00
stephen hemminger aee80b54b2 ipv6: generate link local address for GRE tunnel
Use same logic as SIT tunnel to handle link local address
for GRE tunnel. OSPFv3 requires link-local address to function.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08 17:05:30 -07:00
Marcus Meissner 5a079c305a net/ipv6: check for mistakenly passed in non-AF_INET6 sockaddrs
Same check as for IPv4, also do for IPv6.

(If you passed in a IPv4 sockaddr_in here, the sizeof check
 in the line before would have triggered already though.)

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-06 14:48:16 -07:00
Dave Jones d232b8dded netfilter: use unsigned variables for packet lengths in ip[6]_queue.
Netlink message lengths can't be negative, so use unsigned variables.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-06-06 01:37:16 +02:00
Pablo Neira Ayuso 88ed01d17b netfilter: nf_conntrack: fix ct refcount leak in l4proto->error()
This patch fixes a refcount leak of ct objects that may occur if
l4proto->error() assigns one conntrack object to one skbuff. In
that case, we have to skip further processing in nf_conntrack_in().

With this patch, we can also fix wrong return values (-NF_ACCEPT)
for special cases in ICMP[v6] that should not bump the invalid/error
statistic counters.

Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-06-06 01:37:02 +02:00
Eric Dumazet fb04883371 netfilter: add more values to enum ip_conntrack_info
Following error is raised (and other similar ones) :

net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’:
net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’
not in enumerated type ‘enum ip_conntrack_info’

gcc barfs on adding two enum values and getting a not enumerated
result :

case IP_CT_RELATED+IP_CT_IS_REPLY:

Add missing enum values

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-06-06 01:35:10 +02:00
Dan Rosenberg 71338aa7d0 net: convert %p usage to %pK
The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 01:13:12 -04:00
David S. Miller 6ac3f66492 ipv6: Fix return of xfrm6_tunnel_rcv()
Like ipv4, just return xfrm6_rcv_spi()'s return value directly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-24 01:11:51 -04:00
Florian Westphal 0f6c6392dc ipv6: copy prefsrc setting when copying route entry
commit c3968a857a
('ipv6: RTA_PREFSRC support for ipv6 route source address selection')
added support for ipv6 prefsrc as an alternative to ipv6 addrlabels,
but it did not work because the prefsrc entry was not copied.

Cc: Daniel Walter <sahne@0x90.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-21 02:05:22 -04:00
Linus Torvalds 06f4e926d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
  macvlan: fix panic if lowerdev in a bond
  tg3: Add braces around 5906 workaround.
  tg3: Fix NETIF_F_LOOPBACK error
  macvlan: remove one synchronize_rcu() call
  networking: NET_CLS_ROUTE4 depends on INET
  irda: Fix error propagation in ircomm_lmp_connect_response()
  irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
  irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
  be2net: Kill set but unused variable 'req' in lancer_fw_download()
  irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
  atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
  rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
  rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
  rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
  pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
  isdn: capi: Use pr_debug() instead of ifdefs.
  tg3: Update version to 3.119
  tg3: Apply rx_discards fix to 5719/5720
  ...

Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
2011-05-20 13:43:21 -07:00
Linus Torvalds eb04f2f04e Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
  Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
  net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
  batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
  batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
  batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
  net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
  net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
  net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
  perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
  perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
  net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
  net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
  net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
  net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
  security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
  net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
  net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
  net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
  ...
2011-05-19 18:14:34 -07:00
Eric Dumazet be281e554e ipv6: reduce per device ICMP mib sizes
ipv6 has per device ICMP SNMP counters, taking too much space because
they use percpu storage.

needed size per device is :
(512+4)*sizeof(long)*number_of_possible_cpus*2

On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of
memory per ipv6 enabled network device, taken in vmalloc pool.

Since ICMP messages are rare, just use shared counters (atomic_long_t)

Per network space ICMP counters are still using percpu memory, we might
also convert them to shared counters in a future patch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19 16:21:22 -04:00
David S. Miller 3c709f8fb4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-3.6
Conflicts:
	drivers/net/benet/be_main.c
2011-05-11 14:26:58 -04:00
David S. Miller 9bbc052d5e Merge branch 'pablo/nf-2.6-updates' of git://1984.lsi.us.es/net-2.6 2011-05-10 15:04:35 -07:00
Steffen Klassert 43a4dea4c9 xfrm: Assign the inner mode output function to the dst entry
As it is, we assign the outer modes output function to the dst entry
when we create the xfrm bundle. This leads to two problems on interfamily
scenarios. We might insert ipv4 packets into ip6_fragment when called
from xfrm6_output. The system crashes if we try to fragment an ipv4
packet with ip6_fragment. This issue was introduced with git commit
ad0081e4 (ipv6: Fragment locally generated tunnel-mode IPSec6 packets
as needed). The second issue is, that we might insert ipv4 packets in
netfilter6 and vice versa on interfamily scenarios.

With this patch we assign the inner mode output function to the dst entry
when we create the xfrm bundle. So xfrm4_output/xfrm6_output from the inner
mode is used and the right fragmentation and netfilter functions are called.
We switch then to outer mode with the output_finish functions.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-10 15:03:34 -07:00
Fernando Luis Vazquez Cao 4319cc0cf5 netfilter: IPv6: initialize TOS field in REJECT target module
The IPv6 header is not zeroed out in alloc_skb so we must initialize
it properly unless we want to see IPv6 packets with random TOS fields
floating around. The current implementation resets the flow label
but this could be changed if deemed necessary.

We stumbled upon this issue when trying to apply a mangle rule to
the RST packet generated by the REJECT target module.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-05-10 09:55:44 +02:00
David S. Miller d9d8da805d inet: Pass flowi to ->queue_xmit().
This allows us to acquire the exact route keying information from the
protocol, however that might be managed.

It handles all of the possibilities, from the simplest case of storing
the key in inet->cork.fl to the more complex setup SCTP has where
individual transports determine the flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-08 15:28:28 -07:00
Paul E. McKenney 11c476f31a net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
The RCU callback prl_entry_destroy_rcu() just calls kfree(), so we can
use kfree_rcu() instead of call_rcu().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:51:15 -07:00
Lai Jiangshan e3cbf28fa6 net,rcu: convert call_rcu(ipv6_mc_socklist_reclaim) to kfree_rcu()
The rcu callback ipv6_mc_socklist_reclaim() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(ipv6_mc_socklist_reclaim).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:51:02 -07:00
Lai Jiangshan e57859854a net,rcu: convert call_rcu(inet6_ifa_finish_destroy_rcu) to kfree_rcu()
The rcu callback inet6_ifa_finish_destroy_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(inet6_ifa_finish_destroy_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:50 -07:00
Lai Jiangshan 38f57d1a4b net,rcu: convert call_rcu(in6_dev_finish_destroy_rcu) to kfree_rcu()
The rcu callback in6_dev_finish_destroy_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(in6_dev_finish_destroy_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07 22:50:49 -07:00
David S. Miller bdc712b4c2 inet: Decrease overhead of on-stack inet_cork.
When we fast path datagram sends to avoid locking by putting
the inet_cork on the stack we use up lots of space that isn't
necessary.

This is because inet_cork contains a "struct flowi" which isn't
used in these code paths.

Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
Only the latter of which has the "struct flowi" and is what is
stored in inet_sock.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2011-05-06 15:37:57 -07:00
David S. Miller 7143b7d412 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tg3.c
2011-05-05 14:59:02 -07:00
Jiri Pirko 1c5cae815d net: call dev_alloc_name from register_netdevice
Force dev_alloc_name() to be called from register_netdevice() by
dev_get_valid_name(). That allows to remove multiple explicit
dev_alloc_name() calls.

The possibility to call dev_alloc_name in advance remains.

This also fixes veth creation regresion caused by
84c49d8c3e

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-05 10:57:45 -07:00
David S. Miller 301102cc83 ipv6: Use flowi4->{daddr,saddr} in ipip6_tunnel_xmit().
Instead of rt->rt_{dst,src}

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-04 12:55:07 -07:00
David S. Miller 31e4543db2 ipv4: Make caller provide on-stack flow key to ip_route_output_ports().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-03 20:25:42 -07:00
Lucian Adrian Grijincu ff538818f4 sysctl: net: call unregister_net_sysctl_table where needed
ctl_table_headers registered with register_net_sysctl_table should
have been unregistered with the equivalent unregister_net_sysctl_table

Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02 16:12:14 -07:00
Eric Dumazet e67f88dd12 net: dont hold rtnl mutex during netlink dump callbacks
Four years ago, Patrick made a change to hold rtnl mutex during netlink
dump callbacks.

I believe it was a wrong move. This slows down concurrent dumps, making
good old /proc/net/ files faster than rtnetlink in some situations.

This occurred to me because one "ip link show dev ..." was _very_ slow
on a workload adding/removing network devices in background.

All dump callbacks are able to use RCU locking now, so this patch does
roughly a revert of commits :

1c2d670f36 : [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
6313c1e099 : [RTNETLINK]: Remove unnecessary locking in dump callbacks

This let writers fight for rtnl mutex and readers going full speed.

It also takes care of phonet : phonet_route_get() is now called from rcu
read section. I renamed it to phonet_route_get_rcu()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-02 15:26:28 -07:00
Ben Hutchings ad246c992b ipv4, ipv6, bonding: Restore control over number of peer notifications
For backward compatibility, we should retain the module parameters and
sysfs attributes to control the number of peer notifications
(gratuitous ARPs and unsolicited NAs) sent after bonding failover.
Also, it is possible for failover to take place even though the new
active slave does not have link up, and in that case the peer
notification should be deferred until it does.

Change ipv4 and ipv6 so they do not automatically send peer
notifications on bonding failover.

Change the bonding driver to send separate NETDEV_NOTIFY_PEERS
notifications when the link is up, as many times as requested.  Since
it does not directly control which protocols send notifications, make
num_grat_arp and num_unsol_na aliases for a single parameter.  Bump
the bonding version number and update its documentation.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-29 12:44:11 -07:00
David S. Miller cf91166223 net: Use non-zero allocations in dst_alloc().
Make dst_alloc() and it's users explicitly initialize the entire
entry.

The zero'ing done by kmem_cache_zalloc() was almost entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 22:26:00 -07:00
David S. Miller 5c1e6aa300 net: Make dst_alloc() take more explicit initializations.
Now the dst->dev, dev->obsolete, and dst->flags values can
be specified as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 22:25:59 -07:00
Shan Wei 96339d6c49 net:use help function of skb_checksum_start_offset to calculate offset
Although these are equivalent, but the skb_checksum_start_offset() is more readable.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 13:28:57 -07:00
Eric Dumazet f6d8bd051c inet: add RCU protection to inet->opt
We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 13:16:35 -07:00
Steffen Klassert c0a56e64ae esp6: Fix scatterlist initialization
When we use IPsec extended sequence numbers, we may overwrite
the last scatterlist of the associated data by the scatterlist
for the skb. This patch fixes this by placing the scatterlist
for the skb right behind the last scatterlist of the associated
data. esp4 does it already like that.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-26 12:46:04 -07:00
David S. Miller 2bd93d7af1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Resolved logic conflicts causing a build failure due to
drivers/net/r8169.c changes using a patch from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-26 12:16:46 -07:00
Held Bernhard 0972ddb237 net: provide cow_metrics() methods to blackhole dst_ops
Since commit 62fa8a846d (net: Implement read-only protection and COW'ing
of metrics.) the kernel throws an oops.

[  101.620985] BUG: unable to handle kernel NULL pointer dereference at
           (null)
[  101.621050] IP: [<          (null)>]           (null)
[  101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0
[  101.621122] Oops: 0010 [#1] SMP
[  101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent
[  101.621192] CPU 2
[  101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc
l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64
twofish_common des_generic cbc ecb sha1_generic hmac af_key
iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc
evdev uhci_hcd ehci_hcd thermal
[  101.621552]
[  101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3
Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R
[  101.621637] RIP: 0010:[<0000000000000000>]  [<          (null)>]   (null)
[  101.621684] RSP: 0018:ffff88003ddeba60  EFLAGS: 00010202
[  101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX:
0000000000000020
[  101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI:
ffff88003ddb5600
[  101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09:
ffff88003ddb5600
[  101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12:
ffff88003ddb5600
[  101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15:
ffff88003ddebad8
[  101.621924] FS:  00007f06e4182700(0000) GS:ffff88007fd00000(0000)
knlGS:0000000000000000
[  101.621971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4:
00000000000006e0
[  101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  101.622129] Process openl2tpd (pid: 5129, threadinfo
ffff88003ddea000, task ffff88003de9a280)
[  101.622177] Stack:
[  101.622191]  ffffffff81447efa ffff88007d3ded80 ffff88003de9a280
ffff88007d3ded80
[  101.622245]  0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7
0000000000000212
[  101.622299]  ffff88003dcea000 ffff88003dcea188 ffffffff00000001
ffffffff81b7e480
[  101.622353] Call Trace:
[  101.622374]  [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210
[  101.622415]  [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510
[  101.622450]  [<ffffffff8127672a>] ? extract_buf+0x9a/0x140
[  101.622485]  [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70
[  101.622526]  [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810
[  101.622562]  [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130
[  101.622599]  [<ffffffff8109df58>] ? find_get_page+0x18/0x90
[  101.622633]  [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0
[  101.622668]  [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90
[  101.622706]  [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0
[  101.622739]  [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420
[  101.622774]  [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0
[  101.622810]  [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130
[  101.622844]  [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380
[  101.622880]  [<ffffffff81024af9>] ? do_page_fault+0x189/0x410
[  101.622915]  [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110
[  101.622952]  [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0
[  101.622986]  [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90
[  101.623024]  [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b
[  101.623060] Code:  Bad RIP value.
[  101.623090] RIP  [<          (null)>]           (null)
[  101.623125]  RSP <ffff88003ddeba60>
[  101.623146] CR2: 0000000000000000
[  101.650871] ---[ end trace ca3856a7d8e8dad4 ]---
[  101.651011] __sk_free: optmem leakage (160 bytes) detected.

The oops happens in dst_metrics_write_ptr()
include/net/dst.h:124: return dst->ops->cow_metrics(dst, p);

dst->ops->cow_metrics is NULL and causes the oops.

Provide cow_metrics() methods, like we did in commit 214f45c91b
(net: provide default_advmss() methods to blackhole dst_ops)

Signed-off-by: Held Bernhard <berny156@gmx.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-25 11:53:08 -07:00
Eric Dumazet b71d1d426d inet: constify ip headers and in6_addr
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22 11:04:14 -07:00
Thomas Egerer e965c05dab ipv6: Remove hoplimit initialization to -1
The changes introduced with git-commit a02e4b7d ("ipv6: Demark default
hoplimit as zero.") missed to remove the hoplimit initialization. As a
result, ipv6_get_mtu interprets the return value of dst_metric_raw
(-1) as 255 and answers ping6 with this hoplimit.  This patche removes
the line such that ping6 is answered with the hoplimit value
configured via sysctl.

Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-21 17:24:08 -07:00
Shan Wei a9cf73ea7f ipv6: udp: fix the wrong headroom check
At this point, skb->data points to skb_transport_header.
So, headroom check is wrong.

For some case:bridge(UFO is on) + eth device(UFO is off),
there is no enough headroom for IPv6 frag head.
But headroom check is always false.

This will bring about data be moved to there prior to skb->head,
when adding IPv6 frag header to skb.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-21 10:39:10 -07:00
David S. Miller 4805347c1e Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2011-04-19 11:24:06 -07:00
David S. Miller e1943424e4 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_ethtool.c
2011-04-19 00:21:33 -07:00
Ben Hutchings 7c89943236 bonding, ipv4, ipv6, vlan: Handle NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS
It is undesirable for the bonding driver to be poking into higher
level protocols, and notifiers provide a way to avoid that.  This does
mean removing the ability to configure reptitition of gratuitous ARPs
and unsolicited NAs.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-17 23:36:03 -07:00
Ben Hutchings f47b94646f ipv6: Send unsolicited neighbour advertismements when notified
The NETDEV_NOTIFY_PEERS notifier is a request to send such
advertisements following migration to a different physical link,
e.g. virtual machine migration.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-17 23:35:16 -07:00
David S. Miller b169f6db40 netfilter: ip6table_mangle: Fix set-but-unused variables.
The variable 'flowlabel' is set but unused in ip6t_mangle_out().

The intention here was to compare this key to the header value after
mangling, and trigger a route lookup on mismatch.

Make it so.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-17 17:06:15 -07:00
David S. Miller f3c85dd560 netfilter: ip6_tables: Fix set-but-unused variables.
The variable 'target' is set but unused in compat_copy_entry_from_user().

Just kill it off.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-17 17:04:48 -07:00
Daniel Walter c3968a857a ipv6: RTA_PREFSRC support for ipv6 route source address selection
[ipv6] Add support for RTA_PREFSRC

This patch allows a user to select the preferred source address
for a specific IPv6-Route. It can be set via a netlink message
setting RTA_PREFSRC to a valid IPv6 address which must be
up on the device the route will be bound to.

Signed-off-by: Daniel Walter <dwalter@barracuda.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-15 15:44:37 -07:00
Daniel Walter bd015928bb ipv6: ignore looped-back NA while dad is running
[ipv6] Ignore looped-back NAs while in Duplicate Address Detection

If we send an unsolicited NA shortly after bringing up an
IPv6 address, the duplicate address detection algorithm
fails and the ip stays in tentative mode forever.
This is due a missing check if the NA is looped-back to us.

Signed-off-by: Daniel Walter <dwalter@barracuda.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-15 15:43:55 -07:00
David S. Miller 3e8c806a08 Revert "tcp: disallow bind() to reuse addr/port"
This reverts commit c191a836a9.

It causes known regressions for programs that expect to be able to use
SO_REUSEADDR to shutdown a socket, then successfully rebind another
socket to the same ID.

Programs such as haproxy and amavisd expect this to work.

This should fix kernel bugzilla 32832.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-13 12:01:14 -07:00
Linus Torvalds c44eaf41a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
  net: Add support for SMSC LAN9530, LAN9730 and LAN89530
  mlx4_en: Restoring RX buffer pointer in case of failure
  mlx4: Sensing link type at device initialization
  ipv4: Fix "Set rt->rt_iif more sanely on output routes."
  MAINTAINERS: add entry for Xen network backend
  be2net: Fix suspend/resume operation
  be2net: Rename some struct members for clarity
  pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev
  dsa/mv88e6131: add support for mv88e6085 switch
  ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2)
  be2net: Fix a potential crash during shutdown.
  bna: Fix for handling firmware heartbeat failure
  can: mcp251x: Allow pass IRQ flags through platform data.
  smsc911x: fix mac_lock acquision before calling smsc911x_mac_read
  iwlwifi: accept EEPROM version 0x423 for iwl6000
  rt2x00: fix cancelling uninitialized work
  rtlwifi: Fix some warnings/bugs
  p54usb: IDs for two new devices
  wl12xx: fix potential buffer overflow in testmode nvs push
  zd1211rw: reset rx idle timer from tasklet
  ...
2011-04-11 07:27:24 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Neil Horman 47482f132a ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2)
properly record sk_rxhash in ipv6 sockets (v2)

Noticed while working on another project that flows to sockets which I had open
on a test systems weren't getting steered properly when I had RFS enabled.
Looking more closely I found that:

1) The affected sockets were all ipv6
2) They weren't getting steered because sk->sk_rxhash was never set from the
incomming skbs on that socket.

This was occuring because there are several points in the IPv4 tcp and udp code
which save the rxhash value when a new connection is established.  Those calls
to sock_rps_save_rxhash were never added to the corresponding ipv6 code paths.
This patch adds those calls.  Tested by myself to properly enable RFS
functionalty on ipv6.

Change notes:
v2:
	Filtered UDP to only arm RFS on bound sockets (Eric Dumazet)

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-06 13:07:09 -07:00
David S. Miller 9d93059497 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2011-04-05 14:21:11 -07:00
Boris Ostrovsky 738faca343 ipv6: Don't pass invalid dst_entry pointer to dst_release().
Make sure dst_release() is not called with error pointer. This is
similar to commit 4910ac6c52 ("ipv4:
Don't ip_rt_put() an error pointer in RAW sockets.").

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-04 13:07:26 -07:00
Eric Dumazet 7f5c6d4f66 netfilter: get rid of atomic ops in fast path
We currently use a percpu spinlock to 'protect' rule bytes/packets
counters, after various attempts to use RCU instead.

Lately we added a seqlock so that get_counters() can run without
blocking BH or 'writers'. But we really only need the seqcount in it.

Spinlock itself is only locked by the current/owner cpu, so we can
remove it completely.

This cleanups api, using correct 'writer' vs 'reader' semantic.

At replace time, the get_counters() call makes sure all cpus are done
using the old table.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-04-04 17:04:03 +02:00
Florian Westphal 0fae2e7740 netfilter: af_info: add 'strict' parameter to limit lookup to .oif
ipv6 fib lookup can set RT6_LOOKUP_F_IFACE flag to restrict search
to an interface, but this flag cannot be set via struct flowi.

Also, it cannot be set via ip6_route_output: this function uses the
passed sock struct to determine if this flag is required
(by testing for nonzero sk_bound_dev_if).

Work around this by passing in an artificial struct sk in case
'strict' argument is true.

This is required to replace the rt6_lookup call in xt_addrtype.c with
nf_afinfo->route().

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-04-04 17:00:54 +02:00
Florian Westphal 31ad3dd64e netfilter: af_info: add network namespace parameter to route hook
This is required to eventually replace the rt6_lookup call in
xt_addrtype.c with nf_afinfo->route().

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-04-04 16:56:29 +02:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Timo Teräs 93ca3bb5df net: gre: provide multicast mappings for ipv4 and ipv6
My commit 6d55cb91a0 (gre: fix hard header destination
address checking) broke multicast.

The reason is that ip_gre used to get ipgre_header() calls with
zero destination if we have NOARP or multicast destination. Instead
the actual target was decided at ipgre_tunnel_xmit() time based on
per-protocol dissection.

Instead of allowing the "abuse" of ->header() calls with invalid
destination, this creates multicast mappings for ip_gre. This also
fixes "ip neigh show nud noarp" to display the proper multicast
mappings used by the gre device.

Reported-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Doug Kehn <rdkehn@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-30 00:10:47 -07:00
Cesar Eduardo Barros 3e49e6d520 net: use CHECKSUM_NONE instead of magic number
Two places in the kernel were doing skb->ip_summed = 0.

Change both to skb->ip_summed = CHECKSUM_NONE, which is more readable.

Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-27 23:35:05 -07:00
Florian Westphal 9c7a4f9ce6 ipv6: ip6_route_output does not modify sk parameter, so make it const
This avoids explicit cast to avoid 'discards qualifiers'
compiler warning in a netfilter patch that i've been working on.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22 19:17:36 -07:00
Eric W. Biederman 9d2a8fa96a net ipv6: Fix duplicate /proc/sys/net/ipv6/neigh directory entries.
When I was fixing issues with unregisgtering tables under /proc/sys/net/ipv6/neigh
by adding a mount point it appears I missed a critical ordering issue, in the
ipv6 initialization.  I had not realized that ipv6_sysctl_register is called
at the very end of the ipv6 initialization and in particular after we call
neigh_sysctl_register from ndisc_init.

"neigh" needs to be initialized in ipv6_static_sysctl_register which is
the first ipv6 table to initialized, and definitely before ndisc_init.
This removes the weirdness of duplicate tables while still providing a
"neigh" mount point which prevents races in sysctl unregistering.

This was initially reported at https://bugzilla.kernel.org/show_bug.cgi?id=31232
Reported-by: sunkan@zappa.cx
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-21 18:23:34 -07:00
Eric Dumazet db856674ac netfilter: xtables: fix reentrancy
commit f3c5c1bfd4 (make ip_tables reentrant) introduced a race in
handling the stackptr restore, at the end of ipt_do_table()

We should do it before the call to xt_info_rdunlock_bh(), or we allow
cpu preemption and another cpu overwrites stackptr of original one.

A second fix is to change the underflow test to check the origptr value
instead of 0 to detect underflow, or else we allow a jump from different
hooks.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-20 15:40:06 +01:00
Linus Torvalds e16b396ce3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
  doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
  Update cpuset info & webiste for cgroups
  dcdbas: force SMI to happen when expected
  arch/arm/Kconfig: remove one to many l's in the word.
  asm-generic/user.h: Fix spelling in comment
  drm: fix printk typo 'sracth'
  Remove one to many n's in a word
  Documentation/filesystems/romfs.txt: fixing link to genromfs
  drivers:scsi Change printk typo initate -> initiate
  serial, pch uart: Remove duplicate inclusion of linux/pci.h header
  fs/eventpoll.c: fix spelling
  mm: Fix out-of-date comments which refers non-existent functions
  drm: Fix printk typo 'failled'
  coh901318.c: Change initate to initiate.
  mbox-db5500.c Change initate to initiate.
  edac: correct i82975x error-info reported
  edac: correct i82975x mci initialisation
  edac: correct commented info
  fs: update comments to point correct document
  target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
  ...

Trivial conflict in fs/eventpoll.c (spelling vs addition)
2011-03-18 10:37:40 -07:00
David S. Miller 31111c26d9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
2011-03-15 13:03:27 -07:00
Vasiliy Kulikov 6a8ab06077 ipv6: netfilter: ip6_tables: fix infoleak to userspace
Structures ip6t_replace, compat_ip6t_replace, and xt_get_revision are
copied from userspace.  Fields of these structs that are
zero-terminated strings are not checked.  When they are used as argument
to a format string containing "%s" in request_module(), some sensitive
information is leaked to userspace via argument of spawned modprobe
process.

The first bug was introduced before the git epoch;  the second was
introduced in 3bc3fe5e (v2.6.25-rc1);  the third is introduced by
6b7d31fc (v2.6.15-rc1).  To trigger the bug one should have
CAP_NET_ADMIN.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-03-15 13:37:13 +01:00
Steffen Klassert d212a4c290 esp6: Add support for IPsec extended sequence numbers
This patch adds IPsec extended sequence numbers support to esp6.
We use the authencesn crypto algorithm to handle esp with separate
encryption/authentication algorithms.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13 20:22:29 -07:00
Steffen Klassert 1ce3644ade xfrm: Use separate low and high order bits of the sequence numbers in xfrm_skb_cb
To support IPsec extended sequence numbers, we split the
output sequence numbers of xfrm_skb_cb in low and high order 32 bits
and we add the high order 32 bits to the input sequence numbers.
All users are updated accordingly.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13 20:22:28 -07:00
David S. Miller 1958b856c1 net: Put fl6_* macros to struct flowi6 and use them again.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:55 -08:00
David S. Miller 4c9483b2fb ipv6: Convert to use flowi6 where applicable.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:54 -08:00
David S. Miller 7e1dc7b6f7 net: Use flowi4 and flowi6 in xfrm layer.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:52 -08:00
David S. Miller 56bb8059e1 net: Break struct flowi out into AF specific instances.
Now we have struct flowi4, flowi6, and flowidn for each address
family.  And struct flowi is just a union of them all.

It might have been troublesome to convert flow_cache_uli_match() but
as it turns out this function is completely unused and therefore can
be simply removed.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:46 -08:00
David S. Miller 6281dcc94a net: Make flowi ports AF dependent.
Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:46 -08:00
David S. Miller 1d28f42c1b net: Put flowi_* prefix on AF independent members of struct flowi
I intend to turn struct flowi into a union of AF specific flowi
structs.  There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:44 -08:00
David S. Miller 78fbfd8a65 ipv4: Create and use route lookup helpers.
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:42 -08:00
David S. Miller 33175d84ee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10 14:26:00 -08:00
stephen hemminger 6dfbd87a20 ip6ip6: autoload ip6 tunnel
Add necessary alias to autoload ip6ip6 tunnel module.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-10 14:18:48 -08:00
David S. Miller bef6e7e768 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2011-03-10 14:00:44 -08:00
David S. Miller 7343ff31eb ipv6: Don't create clones of host routes.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=29252
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30462

In commit d80bc0fd26 ("ipv6: Always
clone offlink routes.") we forced the kernel to always clone offlink
routes.

The reason we do that is to make sure we never bind an inetpeer to a
prefixed route.

The logic turned on here has existed in the tree for many years,
but was always off due to a protecting CPP define.  So perhaps
it's no surprise that there is a logic bug here.

The problem is that we canot clone a route that is already a
host route (ie. has DST_HOST set).  Because if we do, an identical
entry already exists in the routing tree and therefore the
ip6_rt_ins() call is going to fail.

This sets off a series of failures and high cpu usage, because when
ip6_rt_ins() fails we loop retrying this operation a few times in
order to handle a race between two threads trying to clone and insert
the same host route at the same time.

Fix this by simply using the route as-is when DST_HOST is set.

Reported-by: slash@ac.auone-net.jp
Reported-by: Ernst Sjöstrand <ernstp@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09 19:55:25 -08:00
Vasiliy Kulikov 8909c9ad8f net: don't allow CAP_NET_ADMIN to load non-netdev kernel modules
Since a8f80e8ff9 any process with
CAP_NET_ADMIN may load any module from /lib/modules/.  This doesn't mean
that CAP_NET_ADMIN is a superset of CAP_SYS_MODULE as modules are
limited to /lib/modules/**.  However, CAP_NET_ADMIN capability shouldn't
allow anybody load any module not related to networking.

This patch restricts an ability of autoloading modules to netdev modules
with explicit aliases.  This fixes CVE-2011-1019.

Arnd Bergmann suggested to leave untouched the old pre-v2.6.32 behavior
of loading netdev modules by name (without any prefix) for processes
with CAP_SYS_MODULE to maintain the compatibility with network scripts
that use autoloading netdev modules by aliases like "eth0", "wlan0".

Currently there are only three users of the feature in the upstream
kernel: ipip, ip_gre and sit.

    root@albatros:~# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	fffffff800001000
    CapEff:	fffffff800001000
    CapBnd:	fffffff800001000
    root@albatros:~# modprobe xfs
    FATAL: Error inserting xfs
    (/lib/modules/2.6.38-rc6-00001-g2bf4ca3/kernel/fs/xfs/xfs.ko): Operation not permitted
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit
    sit: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep sit
    root@albatros:~# ifconfig sit0
    sit0      Link encap:IPv6-in-IPv4
	      NOARP  MTU:1480  Metric:1

    root@albatros:~# lsmod | grep sit
    sit                    10457  0
    tunnel4                 2957  1 sit

For CAP_SYS_MODULE module loading is still relaxed:

    root@albatros:~# grep Cap /proc/$$/status
    CapInh:	0000000000000000
    CapPrm:	ffffffffffffffff
    CapEff:	ffffffffffffffff
    CapBnd:	ffffffffffffffff
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    xfs                   745319  0

Reference: https://lkml.org/lkml/2011/2/24/203

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2011-03-10 10:25:19 +11:00
Hagen Paul Pfeifer 4b66fef9b5 mcast: net_device dev not used
ip6_mc_source(), ip6_mc_msfilter() as well as ip6_mc_msfget() declare
and assign dev but do not use the variable afterwards.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-07 15:51:13 -08:00
David S. Miller 0a0e9ae1bd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x.h
2011-03-03 21:27:42 -08:00
David S. Miller 29546a6404 ipv6: Use ERR_CAST in addrconf_dst_alloc.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03 12:10:37 -08:00
David S. Miller b23dd4fe42 ipv4: Make output route lookup return rtable directly.
Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02 14:31:35 -08:00
David S. Miller 452edd598f xfrm: Return dst directly from xfrm_lookup()
Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-02 13:27:41 -08:00
David S. Miller b42835dbe8 ipv6: Make icmp route lookup code a bit clearer.
The route lookup code in icmpv6_send() is slightly tricky as a result of
having to handle all of the requirements of RFC 4301 host relookups.

Pull the route resolution into a seperate function, so that the error
handling and route reference counting is hopefully easier to see and
contained wholly within this new routine.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 22:07:37 -08:00
David S. Miller 2774c131b1 xfrm: Handle blackhole route creation via afinfo.
That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:59:04 -08:00
David S. Miller 69ead7afdf ipv6: Normalize arguments to ip6_dst_blackhole().
Return a dst pointer which is potentitally error encoded.

Don't pass original dst pointer by reference, pass a struct net
instead of a socket, and elide the flow argument since it is
unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:45:33 -08:00
David S. Miller 80c0bc9e37 xfrm: Kill XFRM_LOOKUP_WAIT flag.
This can be determined from the flow flags instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:36:37 -08:00
David S. Miller a1414715f0 ipv6: Change final dst lookup arg name to "can_sleep"
Since it indicates whether we are invoked from a sleepable
context or not.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:32:04 -08:00
David S. Miller 5df65e5567 net: Add FLOWI_FLAG_CAN_SLEEP.
And set is in contexts where the route resolution can sleep.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:22:19 -08:00
David S. Miller 68d0c6d34d ipv6: Consolidate route lookup sequences.
Route lookups follow a general pattern in the ipv6 code wherein
we first find the non-IPSEC route, potentially override the
flow destination address due to ipv6 options settings, and then
finally make an IPSEC search using either xfrm_lookup() or
__xfrm_lookup().

__xfrm_lookup() is used when we want to generate a blackhole route
if the key manager needs to resolve the IPSEC rules (in this case
-EREMOTE is returned and the original 'dst' is left unchanged).

Otherwise plain xfrm_lookup() is used and when asynchronous IPSEC
resolution is necessary, we simply fail the lookup completely.

All of these cases are encapsulated into two routines,
ip6_dst_lookup_flow and ip6_sk_dst_lookup_flow.  The latter of which
handles unconnected UDP datagram sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 13:19:07 -08:00
Herbert Xu 5a2ef92023 inet: Remove unused sk_sndmsg_* from UFO
UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless.  It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 12:35:02 -08:00
Anders Berggren a693e69897 net: TX timestamps for IPv6 UDP packets
Enabling TX timestamps (SO_TIMESTAMPING) for IPv6 UDP packets, in
the same fashion as for IPv4. Necessary in order for NICs such as
Intel 82580 to timestamp IPv6 packets.

Signed-off-by: Anders Berggren <anders@halon.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-28 12:32:11 -08:00
Hagen Paul Pfeifer ddc3731fcb ipv6: ignore rtnl_unicast() return code
rtnl_unicast() return value is not of interest, we can silently ignore
it, save some instructions and four byte on the stack.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:23 -08:00
Hagen Paul Pfeifer e9476e95d8 ipv6: variable next is never used in this function
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:22 -08:00
Hagen Paul Pfeifer 96d796a38e ipv6: hash is calculated but not used afterwards
hash is declared and assigned but not used anymore. ipv6_addr_hash()
exhibit no side-effects.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:22 -08:00
Hagen Paul Pfeifer a5f5e3689c ipv6: totlen is declared and assigned but not used
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:21 -08:00
Lucian Adrian Grijincu c486da3439 sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush
Before this patch issuing these commands:

  fd = open("/proc/sys/net/ipv6/route/flush")
  unshare(CLONE_NEWNET)
  write(fd, "stuff")

would flush the newly created net, not the original one.

The equivalent ipv4 code is correct (stores the net inside ->extra1).
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 11:01:56 -08:00
David S. Miller 5e6b930f21 xfrm: Const'ify address arguments to ->dst_lookup()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 23:07:38 -08:00
David S. Miller 19bd62441c xfrm: Const'ify tmpl and address arguments to ->init_temprop()
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 23:07:37 -08:00
David S. Miller 8f029de281 xfrm: Mark flowi arg to xfrm_type->reject() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:59:59 -08:00
David S. Miller 73e5ebb20f xfrm: Mark flowi arg to ->init_tempsel() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:51:44 -08:00
David S. Miller 0c7b3eefb4 xfrm: Mark flowi arg to ->fill_dst() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:48:57 -08:00
David S. Miller 05d8402576 xfrm: Mark flowi arg to ->get_tos() const.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-22 17:47:10 -08:00
Shan Wei 089c34827e tcp: Remove debug macro of TCP_CHECK_TIMER
Now, TCP_CHECK_TIMER is not used for debuging, it does nothing.
And, it has been there for several years, maybe 6 years.

Remove it to keep code clearer.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-20 11:10:14 -08:00
David S. Miller da935c66ba Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/e1000e/netdev.c
	net/xfrm/xfrm_policy.c
2011-02-19 19:17:35 -08:00
David S. Miller ece639caa3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2011-02-19 16:42:37 -08:00
Eric Dumazet 214f45c91b net: provide default_advmss() methods to blackhole dst_ops
Commit 0dbaee3b37 (net: Abstract default ADVMSS behind an
accessor.) introduced a possible crash in tcp_connect_init(), when
dst->default_advmss() is called from dst_metric_advmss()

Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:39:01 -08:00
David S. Miller 3c7bd1a140 net: Add initial_ref arg to dst_alloc().
This allows avoiding multiple writes to the initial __refcnt.

The most simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted, the rest have been left
along and kept at the existing "0".

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:44:00 -08:00
Joerg Marx 0af320fb46 netfilter: ip6t_LOG: fix a flaw in printing the MAC
The flaw was in skipping the second byte in MAC header due to increasing
the pointer AND indexed access starting at '1'.

Signed-off-by: Joerg Marx <joerg.marx@secunet.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-02-17 16:23:40 +01:00
Jiri Kosina 0a9d59a246 Merge branch 'master' into for-next 2011-02-15 10:24:31 +01:00
David S. Miller 6431cbc25f inet: Create a mechanism for upward inetpeer propagation into routes.
If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:33:41 -08:00
David S. Miller 7a71ed899e inetpeer: Abstract address representation further.
Future changes will add caching information, and some of
these new elements will be addresses.

Since the family is implicit via the ->daddr.family member,
replicating the family in ever address we store is entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:22:28 -08:00
David S. Miller 8d13a2a9fb net: Kill NETEVENT_PMTU_UPDATE.
Nobody actually does anything in response to the event,
so just kill it off.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 16:17:55 -08:00
David S. Miller 92d8682926 inetpeer: Move ICMP rate limiting state into inet_peer entries.
Like metrics, the ICMP rate limiting bits are cached state about
a destination.  So move it into the inet_peer entries.

If an inet_peer cannot be bound (the reason is memory allocation
failure or similar), the policy is to allow.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-04 15:59:53 -08:00
David S. Miller bd4a6974cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-02-04 14:28:58 -08:00
David S. Miller e2d57766e6 net: Provide compat support for SIOCGETMIFCNT_IN6 and SIOCGETSGCNT_IN6.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-03 18:05:29 -08:00
Eric W. Biederman bf36076a67 net: Fix ipv6 neighbour unregister_sysctl_table warning
In my testing of 2.6.37 I was occassionally getting a warning about
sysctl table entries being unregistered in the wrong order.  Digging
in it turns out this dates back to the last great sysctl reorg done
where Al Viro introduced the requirement that sysctl directories
needed to be created before and destroyed after the files in them.

It turns out that in that great reorg /proc/sys/net/ipv6/neigh was
overlooked.  So this patch fixes that oversight and makes an annoying
warning message go away.

>------------[ cut here ]------------
>WARNING: at kernel/sysctl.c:1992 unregister_sysctl_table+0x134/0x164()
>Pid: 23951, comm: kworker/u:3 Not tainted 2.6.37-350888.2010AroraKernelBeta.fc14.x86_64 #1
>Call Trace:
> [<ffffffff8103e034>] warn_slowpath_common+0x80/0x98
> [<ffffffff8103e061>] warn_slowpath_null+0x15/0x17
> [<ffffffff810452f8>] unregister_sysctl_table+0x134/0x164
> [<ffffffff810e7834>] ? kfree+0xc4/0xd1
> [<ffffffff813439b2>] neigh_sysctl_unregister+0x22/0x3a
> [<ffffffffa02cd14e>] addrconf_ifdown+0x33f/0x37b [ipv6]
> [<ffffffff81331ec2>] ? skb_dequeue+0x5f/0x6b
> [<ffffffffa02ce4a5>] addrconf_notify+0x69b/0x75c [ipv6]
> [<ffffffffa02eb953>] ? ip6mr_device_event+0x98/0xa9 [ipv6]
> [<ffffffff813d2413>] notifier_call_chain+0x32/0x5e
> [<ffffffff8105bdea>] raw_notifier_call_chain+0xf/0x11
> [<ffffffff8133cdac>] call_netdevice_notifiers+0x45/0x4a
> [<ffffffff8133d2b0>] rollback_registered_many+0x118/0x201
> [<ffffffff8133d3af>] unregister_netdevice_many+0x16/0x6d
> [<ffffffff8133d571>] default_device_exit_batch+0xa4/0xb8
> [<ffffffff81337c42>] ? cleanup_net+0x0/0x194
> [<ffffffff81337a2a>] ops_exit_list+0x4e/0x56
> [<ffffffff81337d36>] cleanup_net+0xf4/0x194
> [<ffffffff81053318>] process_one_work+0x187/0x280
> [<ffffffff8105441b>] worker_thread+0xff/0x19f
> [<ffffffff8105431c>] ? worker_thread+0x0/0x19f
> [<ffffffff8105776d>] kthread+0x7d/0x85
> [<ffffffff81003824>] kernel_thread_helper+0x4/0x10
> [<ffffffff810576f0>] ? kthread+0x0/0x85
> [<ffffffff81003820>] ? kernel_thread_helper+0x0/0x10
>---[ end trace 8a7e9310b35e9486 ]---

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31 20:54:17 -08:00
Roland Dreier ec831ea72e net: Add default_mtu() methods to blackhole dst_ops
When an IPSEC SA is still being set up, __xfrm_lookup() will return
-EREMOTE and so ip_route_output_flow() will return a blackhole route.
This can happen in a sndmsg call, and after d33e455337 ("net: Abstract
default MTU metric calculation behind an accessor.") this leads to a
crash in ip_append_data() because the blackhole dst_ops have no
default_mtu() method and so dst_mtu() calls a NULL pointer.

Fix this by adding default_mtu() methods (that simply return 0, matching
the old behavior) to the blackhole dst_ops.

The IPv4 part of this patch fixes a crash that I saw when using an IPSEC
VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it
looks to be needed as well.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31 13:16:00 -08:00
David S. Miller 065825402c net: Store ipv4/ipv6 COW'd metrics in inetpeer cache.
Please note that the IPSEC dst entry metrics keep using
the generic metrics COW'ing mechanism using kmalloc/kfree.

This gives the IPSEC routes an opportunity to use metrics
which are unique to their encapsulated paths.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 14:59:31 -08:00
David S. Miller 1397e171f1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-27 14:59:08 -08:00
David S. Miller 8f2771f2b8 ipv6: Remove route peer binding assertions.
They are bogus.  The basic idea is that I wanted to make sure
that prefixed routes never bind to peers.

The test I used was whether RTF_CACHE was set.

But first of all, the RTF_CACHE flag is set at different spots
depending upon which ip6_rt_copy() caller you're talking about.

I've validated all of the code paths, and even in the future
where we bind peers more aggressively (for route metric COW'ing)
we never bind to prefix'd routes, only fully specified ones.
This even applies when addrconf or icmp6 routes are allocated.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 14:55:22 -08:00
David S. Miller 62fa8a846d net: Implement read-only protection and COW'ing of metrics.
Routing metrics are now copy-on-write.

Initially a route entry points it's metrics at a read-only location.
If a routing table entry exists, it will point there.  Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing.  Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads.  In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit.  But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline.  This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-26 20:51:05 -08:00
David S. Miller b4e69ac670 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-26 13:49:30 -08:00
David S. Miller 7cc2edb834 xfrm6: Don't forget to propagate peer into ipsec route.
Like ipv4, we have to propagate the ipv6 route peer into
the ipsec top-level route during instantiation.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-26 13:41:03 -08:00
David S. Miller 73a8bd74e2 ipv6: Revert 'administrative down' address handling changes.
This reverts the following set of commits:

d1ed113f16 ("ipv6: remove duplicate neigh_ifdown")
29ba5fed1b ("ipv6: don't flush routes when setting loopback down")
9d82ca98f7 ("ipv6: fix missing in6_ifa_put in addrconf")
2de7957072 ("ipv6: addrconf: don't remove address state on ifdown if the address is being kept")
8595805aaf ("IPv6: only notify protocols if address is compeletely gone")
27bdb2abcc ("IPv6: keep tentative addresses in hash table")
93fa159abe ("IPv6: keep route for tentative address")
8f37ada5b5 ("IPv6: fix race between cleanup and add/delete address")
84e8b803f1 ("IPv6: addrconf notify when address is unavailable")
dc2b99f71e ("IPv6: keep permanent addresses on admin down")

because the core semantic change to ipv6 address handling on ifdown
has broken some things, in particular "disable_ipv6" sysctl handling.

Stephen has made several attempts to get things back in working order,
but nothing has restored disable_ipv6 fully yet.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-25 12:49:08 -08:00
David S. Miller d80bc0fd26 ipv6: Always clone offlink routes.
Do not handle PMTU vs. route lookup creation any differently
wrt. offlink routes, always clone them.

Reported-by: PK <runningdoglackey@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 16:01:58 -08:00
Michał Mirosław 04ed3e741d net: change netdev->features to u32
Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 15:32:47 -08:00
Eric Dumazet f2eda47df4 ipv6: raw: rcu annotations
Remove sparse warnings, using a function typedef to be able to use __rcu
annotation on mh_filter pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:59:34 -08:00
Eric Dumazet 753ea8e962 net: ipv6: sit: fix rcu annotations
Fix minor __rcu annotations and remove sparse warnings

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20 16:59:33 -08:00
Eric Dumazet bced94ed5e netfilter: add a missing include in nf_conntrack_reasm.c
After commit ae90bdeaea (netfilter: fix compilation when conntrack is
disabled but tproxy is enabled) we have following warnings :

net/ipv6/netfilter/nf_conntrack_reasm.c:520:16: warning: symbol
'nf_ct_frag6_gather' was not declared. Should it be static?
net/ipv6/netfilter/nf_conntrack_reasm.c:591:6: warning: symbol
'nf_ct_frag6_output' was not declared. Should it be static?
net/ipv6/netfilter/nf_conntrack_reasm.c:612:5: warning: symbol
'nf_ct_frag6_init' was not declared. Should it be static?
net/ipv6/netfilter/nf_conntrack_reasm.c:640:6: warning: symbol
'nf_ct_frag6_cleanup' was not declared. Should it be static?

Fix this including net/netfilter/ipv6/nf_defrag_ipv6.h

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-01-20 21:00:38 +01:00
Linus Torvalds 1268afe676 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (41 commits)
  sctp: user perfect name for Delayed SACK Timer option
  net: fix can_checksum_protocol() arguments swap
  Revert "netlink: test for all flags of the NLM_F_DUMP composite"
  gianfar: Fix misleading indentation in startup_gfar()
  net/irda/sh_irda: return to RX mode when TX error
  net offloading: Do not mask out NETIF_F_HW_VLAN_TX for vlan.
  USB CDC NCM: tx_fixup() race condition fix
  ns83820: Avoid bad pointer deref in ns83820_init_one().
  ipv6: Silence privacy extensions initialization
  bnx2x: Update bnx2x version to 1.62.00-4
  bnx2x: Fix AER setting for BCM57712
  bnx2x: Fix BCM84823 LED behavior
  bnx2x: Mark full duplex on some external PHYs
  bnx2x: Fix BCM8073/BCM8727 microcode loading
  bnx2x: LED fix for BCM8727 over BCM57712
  bnx2x: Common init will be executed only once after POR
  bnx2x: Swap BCM8073 PHY polarity if required
  iwlwifi: fix valid chain reading from EEPROM
  ath5k: fix locking in tx_complete_poll_work
  ath9k_hw: do PA offset calibration only on longcal interval
  ...
2011-01-19 20:25:45 -08:00
Patrick McHardy 14f0290ba4 Merge branch 'master' of /repos/git/net-next-2.6 2011-01-19 23:51:37 +01:00
Jesper Juhl 42b16b3fbb Kill off warning: ‘inline’ is not at beginning of declaration
Fix a bunch of
	warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.

These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).

A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-01-19 15:43:08 +01:00
David S. Miller a5db219f4c Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-18 16:28:31 -08:00
Romain Francoise 2fdc1c8093 ipv6: Silence privacy extensions initialization
When a network namespace is created (via CLONE_NEWNET), the loopback
interface is automatically added to the new namespace, triggering a
printk in ipv6_add_dev() if CONFIG_IPV6_PRIVACY is set.

This is problematic for applications which use CLONE_NEWNET as
part of a sandbox, like Chromium's suid sandbox or recent versions of
vsftpd. On a busy machine, it can lead to thousands of useless
"lo: Disabled Privacy Extensions" messages appearing in dmesg.

It's easy enough to check the status of privacy extensions via the
use_tempaddr sysctl, so just removing the printk seems like the most
sensible solution.

Signed-off-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-18 16:13:49 -08:00
Linus Torvalds d018b6f4f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (47 commits)
  GRETH: resolve SMP issues and other problems
  GRETH: handle frame error interrupts
  GRETH: avoid writing bad speed/duplex when setting transfer mode
  GRETH: fixed skb buffer memory leak on frame errors
  GRETH: GBit transmit descriptor handling optimization
  GRETH: fix opening/closing
  GRETH: added raw AMBA vendor/device number to match against.
  cassini: Fix build bustage on x86.
  e1000e: consistent use of Rx/Tx vs. RX/TX/rx/tx in comments/logs
  e1000e: update Copyright for 2011
  e1000: Avoid unhandled IRQ
  r8169: keep firmware in memory.
  netdev: tilepro: Use is_unicast_ether_addr helper
  etherdevice.h: Add is_unicast_ether_addr function
  ks8695net: Use default implementation of ethtool_ops::get_link
  ks8695net: Disable non-working ethtool operations
  USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
  vxge: Remember to release firmware after upgrading firmware
  netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
  ipsec: update MAX_AH_AUTH_LEN to support sha512
  ...
2011-01-14 13:25:30 -08:00
Linus Torvalds 008d23e485 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
  Documentation/trace/events.txt: Remove obsolete sched_signal_send.
  writeback: fix global_dirty_limits comment runtime -> real-time
  ppc: fix comment typo singal -> signal
  drivers: fix comment typo diable -> disable.
  m68k: fix comment typo diable -> disable.
  wireless: comment typo fix diable -> disable.
  media: comment typo fix diable -> disable.
  remove doc for obsolete dynamic-printk kernel-parameter
  remove extraneous 'is' from Documentation/iostats.txt
  Fix spelling milisec -> ms in snd_ps3 module parameter description
  Fix spelling mistakes in comments
  Revert conflicting V4L changes
  i7core_edac: fix typos in comments
  mm/rmap.c: fix comment
  sound, ca0106: Fix assignment to 'channel'.
  hrtimer: fix a typo in comment
  init/Kconfig: fix typo
  anon_inodes: fix wrong function name in comment
  fix comment typos concerning "consistent"
  poll: fix a typo in comment
  ...

Fix up trivial conflicts in:
 - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
 - fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13 10:05:56 -08:00
Eric Dumazet 255d0dc340 netfilter: x_table: speedup compat operations
One iptables invocation with 135000 rules takes 35 seconds of cpu time
on a recent server, using a 32bit distro and a 64bit kernel.

We eventually trigger NMI/RCU watchdog.

INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)

COMPAT mode has quadratic behavior and consume 16 bytes of memory per
rule.

Switch the xt_compat algos to use an array instead of list, and use a
binary search to locate an offset in the sorted array.

This halves memory need (8 bytes per rule), and removes quadratic
behavior [ O(N*N) -> O(N*log2(N)) ]

Time of iptables goes from 35 s to 150 ms.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-01-13 12:05:12 +01:00
David S. Miller 464143c911 Merge branch 'master' of git://1984.lsi.us.es/net-2.6 2011-01-12 18:58:40 -08:00
Alexey Kuznetsov 72b43d0898 inet6: prevent network storms caused by linux IPv6 routers
Linux IPv6 forwards unicast packets, which are link layer multicasts...
The hole was present since day one. I was 100% this check is there, but it is not.

The problem shows itself, f.e. when Microsoft Network Load Balancer runs on a network.
This software resolves IPv6 unicast addresses to multicast MAC addresses.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-12 18:51:55 -08:00
Simon Horman fee1cc0895 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 into HEAD 2011-01-13 10:29:21 +09:00
KOVACS Krisztian 2fc72c7b84 netfilter: fix compilation when conntrack is disabled but tproxy is enabled
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.

This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.

Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-01-12 20:25:08 +01:00
David S. Miller 60dbb011df Merge branch 'master' of git://1984.lsi.us.es/net-2.6 2011-01-11 15:43:03 -08:00
Dang Hongwu 4b0ef1f223 ah: reload pointers to skb data after calling skb_cow_data()
skb_cow_data() may allocate a new data buffer, so pointers on
skb should be set after this function.

Bug was introduced by commit dff3bb06 ("ah4: convert to ahash")
and 8631e9bd ("ah6: convert to ahash").

Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com>
Acked-by: Krzysztof Witek <krzysztof.witek@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-11 14:03:10 -08:00
Eric Dumazet c191a836a9 tcp: disallow bind() to reuse addr/port
inet_csk_bind_conflict() logic currently disallows a bind() if
it finds a friend socket (a socket bound on same address/port)
satisfying a set of conditions :

1) Current (to be bound) socket doesnt have sk_reuse set
OR
2) other socket doesnt have sk_reuse set
OR
3) other socket is in LISTEN state

We should add the CLOSE state in the 3) condition, in order to avoid two
REUSEADDR sockets in CLOSE state with same local address/port, since
this can deny further operations.

Note : a prior patch tried to address the problem in a different (and
buggy) way. (commit fda48a0d7a tcp: bind() fix when many ports
are bound).

Reported-by: Gaspar Chilingarov <gasparch@gmail.com>
Reported-by: Daniel Baluta <daniel.baluta@gmail.com>
Tested-by: Daniel Baluta <daniel.baluta@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-11 14:03:07 -08:00
Eric Dumazet 83723d6071 netfilter: x_tables: dont block BH while reading counters
Using "iptables -L" with a lot of rules have a too big BH latency.
Jesper mentioned ~6 ms and worried of frame drops.

Switch to a per_cpu seqlock scheme, so that taking a snapshot of
counters doesnt need to block BH (for this cpu, but also other cpus).

This adds two increments on seqlock sequence per ipt_do_table() call,
its a reasonable cost for allowing "iptables -L" not block BH
processing.

Reported-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-01-10 20:11:38 +01:00
Jiri Kosina 4b7bd36470 Merge branch 'master' into for-next
Conflicts:
	MAINTAINERS
	arch/arm/mach-omap2/pm24xx.c
	drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
2010-12-22 18:57:02 +01:00
David S. Miller d9993be65a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-12-20 13:24:14 -08:00
David Stevens ad0081e43a ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed.
This patch modifies IPsec6 to fragment IPv6 packets that are
locally generated as needed.

This version of the patch only fragments in tunnel mode, so that fragment
headers will not be obscured by ESP in transport mode.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-19 20:22:23 -08:00
stephen hemminger d1ed113f16 ipv6: remove duplicate neigh_ifdown
When device is being set to down, neigh_ifdown was being called
twice. Once from addrconf notifier and once from ndisc notifier.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-18 22:01:17 -08:00
stephen hemminger bc3ef6605e ipv6: fib6_ifdown cleanup
Remove (unnecessary) casts to make code cleaner.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-18 22:01:16 -08:00
David S. Miller b4aa9e05a6 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x.h
	drivers/net/wireless/iwlwifi/iwl-1000.c
	drivers/net/wireless/iwlwifi/iwl-6000.c
	drivers/net/wireless/iwlwifi/iwl-core.h
	drivers/vhost/vhost.c
2010-12-17 12:27:22 -08:00
stephen hemminger 29ba5fed1b ipv6: don't flush routes when setting loopback down
When loopback device is being brought down, then keep the route table
entries because they are special. The entries in the local table for
linklocal routes and ::1 address should not be purged.

This is a sub optimal solution to the problem and should be replaced
by a better fix in future.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 18:26:26 -08:00
Octavian Purdila fcbdf09d96 net: fix nulls list corruptions in sk_prot_alloc
Special care is taken inside sk_port_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.

The patch fixes the following crash:

 BUG: unable to handle kernel paging request at fffffffffffffff0
 IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
 [<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
 [<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
 [<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
 [<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
 [<ffffffff812eda35>] udp_rcv+0x15/0x20
 [<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
 [<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
 [<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
 [<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
 [<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
 [<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0

Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 14:26:56 -08:00
Andrey Vagin d3052b557a ipv6: delete expired route in ip6_pmtu_deliver
The first big packets sent to a "low-MTU" client correctly
triggers the creation of a temporary route containing the reduced MTU.

But after the temporary route has expired, new ICMP6 "packet too big"
will be sent, rt6_pmtu_discovery will find the previous EXPIRED route
check that its mtu isn't bigger then in icmp packet and do nothing
before the temporary route will not deleted by gc.

I make the simple experiment:
while :; do
    time ( dd if=/dev/zero bs=10K count=1 | ssh hostname dd of=/dev/null ) || break;
done

The "time" reports real 0m0.197s if a temporary route isn't expired, but
it reports real 0m52.837s (!!!!) immediately after a temporare route has
expired.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-16 12:28:13 -08:00
KOVACS Krisztian ae90bdeaea netfilter: fix compilation when conntrack is disabled but tproxy is enabled
The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but
failed to update the #ifdef stanzas guarding the defragmentation related
fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c.

This patch adds the required #ifdefs so that IPv6 tproxy can truly be used
without connection tracking.

Original report:
http://marc.info/?l=linux-netdev&m=129010118516341&w=2

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-12-15 23:53:41 +01:00
David S. Miller d33e455337 net: Abstract default MTU metric calculation behind an accessor.
Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.

Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-14 13:01:14 -08:00
David S. Miller 0dbaee3b37 net: Abstract default ADVMSS behind an accessor.
Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates.  This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-13 12:52:14 -08:00
David S. Miller a02e4b7dae ipv6: Demark default hoplimit as zero.
This is for consistency with ipv4.  Using "-1" makes
no sense.

It was made this way a long time ago merely to be consistent
with how the ipv6 socket hoplimit "default" is stored.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-12 21:39:02 -08:00
David S. Miller 5170ae824d net: Abstract RTAX_HOPLIMIT metric accesses behind helper.
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-12 21:35:57 -08:00
David S. Miller abbf46ae0e ipv6: Use ip6_dst_hoplimit() instead of direct dst_metric() calls.
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-12 21:14:46 -08:00
Tobias Klauser 376d940ee9 inet6: Remove redundant unlikely()
IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-10 14:57:34 -08:00
Martin Willi 040253c931 xfrm: Traffic Flow Confidentiality for IPv6 ESP
Add TFC padding to all packets smaller than the boundary configured
on the xfrm state. If the boundary is larger than the PMTU, limit
padding to the PMTU.

Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-10 14:43:59 -08:00
Jiri Pirko c07224005d net/ipv6/udp.c: fix typo in flush_stack()
skb1 should be passed as parameter to sk_rcvqueues_full() here.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-10 14:05:09 -08:00
David S. Miller 457de4383e ipv6: Fix 'release_it' logic in tcp_v6_get_peer()
We accidently set it to "true" for the case where we
are using a route bound peer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-10 13:16:09 -08:00
Nicolas Dichtel 5f75a1042f ipv6: fix nl group when advertising a new link
New idev are advertised with NL group RTNLGRP_IPV6_IFADDR, but
should use RTNLGRP_IPV6_IFINFO.
Bug was introduced by commit 8d7a76c9.

Signed-off-by: Wang Xuefu <xuefu.wang@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-10 12:49:36 -08:00
Uwe Kleine-König a34f0b3139 fix comment typos concerning "consistent"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-10 16:04:28 +01:00
Eric Dumazet 68835aba4d net: optimize INET input path further
Followup of commit b178bb3dfc (net: reorder struct sock fields)

Optimize INET input path a bit further, by :

1) moving sk_refcnt close to sk_lock.

This reduces number of dirtied cache lines by one on 64bit arches (and
64 bytes cache line size).

2) moving inet_daddr & inet_rcv_saddr at the beginning of sk

(same cache line than hash / family / bound_dev_if / nulls_node)

This reduces number of accessed cache lines in lookups by one, and dont
increase size of inet and timewait socks.
inet and tw sockets now share same place-holder for these fields.

Before patch :

offsetof(struct sock, sk_refcnt) = 0x10
offsetof(struct sock, sk_lock) = 0x40
offsetof(struct sock, sk_receive_queue) = 0x60
offsetof(struct inet_sock, inet_daddr) = 0x270
offsetof(struct inet_sock, inet_rcv_saddr) = 0x274

After patch :

offsetof(struct sock, sk_refcnt) = 0x44
offsetof(struct sock, sk_lock) = 0x48
offsetof(struct sock, sk_receive_queue) = 0x68
offsetof(struct inet_sock, inet_daddr) = 0x0
offsetof(struct inet_sock, inet_rcv_saddr) = 0x4

compute_score() (udp or tcp) now use a single cache line per ignored
item, instead of two.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-09 20:05:58 -08:00
David S. Miller defb3519a6 net: Abstract away all dst_entry metrics accesses.
Use helper functions to hide all direct accesses, especially writes,
to dst_entry metrics values.

This will allow us to:

1) More easily change how the metrics are stored.

2) Implement COW for metrics.

In particular this will help us put metrics into the inetpeer
cache if that is what we end up doing.  We can make the _metrics
member a pointer instead of an array, initially have it point
at the read-only metrics in the FIB, and then on the first set
grab an inetpeer entry and point the _metrics member there.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
2010-12-09 10:46:36 -08:00
David S. Miller fe6c791570 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/ath/ath9k/ar9003_eeprom.c
	net/llc/af_llc.c
2010-12-08 13:47:38 -08:00
Shan Wei b672083ed3 ipv6: use ND_REACHABLE_TIME and ND_RETRANS_TIMER instead of magic number
ND_REACHABLE_TIME and ND_RETRANS_TIMER have defined
since v2.6.12-rc2, but never been used.
So use them instead of magic number.

This patch also changes original code style to read comfortably .

Thank YOSHIFUJI Hideaki for pointing it out.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-02 13:27:32 -08:00
David S. Miller db3949c450 tcp: Implement ipv6 ->get_peer() and ->tw_get_peer().
Now ipv6 timewait recycling is fully implemented and
enabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-02 12:14:37 -08:00
David S. Miller 493f377d6d tcp: Add timewait recycling bits to ipv6 connect code.
This will also improve handling of ipv6 tcp socket request
backlog when syncookies are not enabled.  When backlog
becomes very deep, last quarter of backlog is limited to
validated destinations.  Previously only ipv4 implemented
this logic, but now ipv6 does too.

Now we are only one step away from enabling timewait
recycling for ipv6, and that step is simply filling in
the implementation of tcp_v6_get_peer() and
tcp_v6_tw_get_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-02 12:14:29 -08:00
David S. Miller ae4694b2d3 ipv6: Create inet6_csk_route_req().
Brother of ipv4's inet_csk_route_req().

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-02 10:59:22 -08:00
David S. Miller ccb7c410dd timewait_sock: Create and use getpeer op.
The only thing AF-specific about remembering the timestamp
for a time-wait TCP socket is getting the peer.

Abstract that behind a new timewait_sock_ops vector.

Support for real IPV6 sockets is not filled in yet, but
curiously this makes timewait recycling start to work
for v4-mapped ipv6 sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 18:09:13 -08:00
David McCullough 6dcdd1b369 net/ipv6/sit.c: return unhandled skb to tunnel4_rcv
I found a problem using an IPv6 over IPv4 tunnel.  When CONFIG_IPV6_SIT
was enabled, the packets would be rejected as net/ipv6/sit.c was catching
all IPPROTO_IPV6 packets and returning an ICMP port unreachable error.

I think this patch fixes the problem cleanly.  I believe the code in
net/ipv4/tunnel4.c:tunnel4_rcv takes care of it properly if none of the
handlers claim the skb.

Signed-off-by: David McCullough <david_mccullough@mcafee.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 13:19:34 -08:00
Anders Franzen 381601e5bb Make the ip6_tunnel reflect the true mtu.
The ip6_tunnel always assumes it consumes 40 bytes (ip6 hdr) of the mtu of the
underlaying device. So for a normal ethernet bearer, the mtu of the ip6_tunnel is
1460.
However, when creating a tunnel the encap limit option is enabled by default, and it
consumes 8 bytes more, so the true mtu shall be 1452.

I dont really know if this breaks some statement in some RFC, so this is a request for
comments.

Signed-off-by: Anders Franzen <anders.franzen@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01 10:55:47 -08:00
David S. Miller 3f419d2d48 inet: Turn ->remember_stamp into ->get_peer in connection AF ops.
Then we can make a completely generic tcp_remember_stamp()
that uses ->get_peer() as a helper, minimizing the AF specific
code and minimizing the eventual code duplication when we implement
the ipv6 side of TW recycling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 12:28:06 -08:00
David S. Miller b341936380 ipv6: Add infrastructure to bind inet_peer objects to routes.
They are only allowed on cached ipv6 routes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30 12:27:11 -08:00
Jozsef Kadlecsik 82a39eb6b3 ipv6: Prepare the tree for un-inlined jhash.
jhash is widely used in the kernel and because the functions
are inlined, the cost in size is significant. Also, the new jhash
functions are slightly larger than the previous ones so better un-inline.
As a preparation step, the calls to the internal macros are replaced
with the plain jhash function calls.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-28 11:26:21 -08:00
Shan Wei d3c15cab21 ipv6: kill two unused macro definition
1. IPV6_TLV_TEL_DST_SIZE
This has not been using for several years since created.

2. RT6_INFO_LEN
commit 33120b30 kill all RT6_INFO_LEN's references, but only this definition remained.

commit 33120b30cc
Author: Alexey Dobriyan <adobriyan@sw.ru>
Date:   Tue Nov 6 05:27:11 2007 -0800

    [IPV6]: Convert /proc/net/ipv6_route to seq_file interface

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-28 10:47:17 -08:00
Thomas Graf cf7afbfeb8 rtnl: make link af-specific updates atomic
As David pointed out correctly, updates to af-specific attributes
are currently not atomic. If multiple changes are requested and
one of them fails, previous updates may have been applied already
leaving the link behind in a undefined state.

This patch splits the function parse_link_af() into two functions
validate_link_af() and set_link_at(). validate_link_af() is placed
to validate_linkmsg() check for errors as early as possible before
any changes to the link have been made. set_link_af() is called to
commit the changes later.

This method is not fail proof, while it is currently sufficient
to make set_link_af() inerrable and thus 100% atomic, the
validation function method will not be able to detect all error
scenarios in the future, there will likely always be errors
depending on states which are f.e. not protected by rtnl_mutex
and thus may change between validation and setting.

Also, instead of silently ignoring unknown address families and
config blocks for address families which did not register a set
function the errors EAFNOSUPPORT respectively EOPNOSUPPORT are
returned to avoid comitting 4 out of 5 update requests without
notifying the user.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-27 22:56:08 -08:00
Eric Dumazet 456b61bca8 ipv6: mcast: RCU conversion
ipv6_sk_mc_lock rwlock becomes a spinlock.

readers (inet6_mc_check()) now takes rcu_read_lock() instead of read
lock. Writers dont need to disable BH anymore.

struct ipv6_mc_socklist objects are reclaimed after one RCU grace
period.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-24 11:16:42 -08:00
Tracey Dent 4de58dfebe Net: ipv6: netfiliter: Makefile: Remove deprecated kbuild goal definitions
Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-22 08:16:12 -08:00
John Fastabend 88b2a9a3d9 ipv6: fix missing in6_ifa_put in addrconf
Fix ref count bug introduced by

commit 2de7957072
Author: Lorenzo Colitti <lorenzo@google.com>
Date:   Wed Oct 27 18:16:49 2010 +0000

ipv6: addrconf: don't remove address state on ifdown if the address
is being kept

Fix logic so that addrconf_ifdown() decrements the inet6_ifaddr
refcnt correctly with in6_ifa_put().

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-22 07:37:36 -08:00
David S. Miller 24912420e9 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bonding/bond_main.c
	net/core/net-sysfs.c
	net/ipv6/addrconf.c
2010-11-19 13:13:47 -08:00
Thomas Graf 18a31e1e28 ipv6: Expose reachable and retrans timer values as msecs
Expose reachable and retrans timer values in msecs instead of jiffies.
Both timer values are already exposed as msecs in the neighbour table
netlink interface.

The creation timestamp format with increased precision is kept but
cleaned up.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-18 12:08:36 -08:00
Thomas Graf 93908d1926 ipv6: Expose IFLA_PROTINFO timer values in msecs instead of jiffies
IFLA_PROTINFO exposes timer related per device settings in jiffies.
Change it to expose these values in msecs like the sysctl interface
does.

I did not find any users of IFLA_PROTINFO which rely on any of these
values and even if there are, they are likely already broken because
there is no way for them to reliably convert such a value to another
time format.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-18 11:05:01 -08:00
Changli Gao 5811662b15 net: use the macros defined for the members of flowi
Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17 12:27:45 -08:00
Thomas Graf b382b191ea ipv6: AF_INET6 link address family
IPv6 already exposes some address family data via netlink in the
IFLA_PROTINFO attribute if RTM_GETLINK request is sent with the
address family set to AF_INET6. We take over this format and
reuse all the code.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-17 11:28:26 -08:00
Eric Dumazet c31504dc0d udp: use atomic_inc_not_zero_hint
UDP sockets refcount is usually 2, unless an incoming frame is going to
be queued in receive or backlog queue.

Using atomic_inc_not_zero_hint() permits to reduce latency, because
processor issues less memory transactions.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-16 11:17:43 -08:00
John Fastabend 9d82ca98f7 ipv6: fix missing in6_ifa_put in addrconf
Fix ref count bug introduced by

commit 2de7957072
Author: Lorenzo Colitti <lorenzo@google.com>
Date:   Wed Oct 27 18:16:49 2010 +0000

ipv6: addrconf: don't remove address state on ifdown if the address
is being kept

Fix logic so that addrconf_ifdown() decrements the inet6_ifaddr
refcnt correctly with in6_ifa_put().

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-16 09:24:58 -08:00
Joe Perches 8a22c99a80 net/ipv6/mcast.c: Remove unnecessary semicolons
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-15 11:07:16 -08:00
Eric Dumazet be9e9163af netfilter: nf_ct_frag6_sysctl_table is static
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-11-15 18:18:29 +01:00
Jan Engelhardt b468645d72 netfilter: xt_LOG: do print MAC header on FORWARD
I am observing consistent behavior even with bridges, so let's unlock
this. xt_mac is already usable in FORWARD, too. Section 9 of
http://ebtables.sourceforge.net/br_fw_ia/br_fw_ia.html#section9 says
the MAC source address is changed, but my observation does not match
that claim -- the MAC header is retained.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
[Patrick; code inspection seems to confirm this]
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-11-15 11:23:06 +01:00
Ben Greear 4038565327 ipv6: Warn users if maximum number of routes is reached.
This gives users at least some clue as to what the problem
might be and how to go about fixing it.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-12 14:03:24 -08:00
Lorenzo Colitti 2de7957072 ipv6: addrconf: don't remove address state on ifdown if the address is being kept
Currently, addrconf_ifdown does not delete statically configured IPv6
addresses when the interface is brought down. The intent is that when
the interface comes back up the address will be usable again. However,
this doesn't actually work, because the system stops listening on the
corresponding solicited-node multicast address, so the address cannot
respond to neighbor solicitations and thus receive traffic. Also, the
code notifies the rest of the system that the address is being deleted
(e.g, RTM_DELADDR), even though it is not. Fix it so that none of this
state is updated if the address is being kept on the interface.

Tested: Added a statically configured IPv6 address to an interface,
started ping, brought link down, brought link up again. When link came
up ping kept on going and "ip -6 maddr" showed that the host was still
subscribed to there

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-12 13:44:24 -08:00
David S. Miller 7c13a0d9a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2010-11-12 11:04:26 -08:00
Shan Wei 22e091e525 netfilter: ipv6: fix overlap check for fragments
The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
and offset is int.

Without this patch, type conversion occurred to this expression, when
(FRAG6_CB(prev)->offset + prev->len) is less than offset.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-11-12 08:51:55 +01:00
Shan Wei f46421416f ipv6: fix overlap check for fragments
The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
and offset is int.

Without this patch, type conversion occurred to this expression, when
(FRAG6_CB(prev)->offset + prev->len) is less than offset.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-08 12:17:06 -08:00
Jan Engelhardt cccbe5ef85 netfilter: ip6_tables: fix information leak to userspace
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 18:55:39 -07:00
Xiaotian Feng 41bb78b4b9 net dst: fix percpu_counter list corruption and poison overwritten
There're some percpu_counter list corruption and poison overwritten warnings
in recent kernel, which is resulted by fc66f95c.

commit fc66f95c switches to use percpu_counter, in ip6_route_net_init, kernel
init the percpu_counter for dst entries, but, the percpu_counter is never destroyed
in ip6_route_net_exit. So if the related data is freed by kernel, the freed percpu_counter
is still on the list, then if we insert/remove other percpu_counter, list corruption
resulted. Also, if the insert/remove option modifies the ->prev,->next pointer of
the freed value, the poison overwritten is resulted then.

With the following patch, the percpu_counter list corruption and poison overwritten
warnings disappeared.

Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-03 18:50:07 -07:00
Eric Dumazet 870be39258 ipv6/udp: report SndbufErrors and RcvbufErrors
commit a18135eb93 (Add UDP_MIB_{SND,RCV}BUFERRORS handling.)
forgot to make the necessary changes in net/ipv6/proc.c to report
additional counters in /proc/net/snmp6

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-30 16:17:23 -07:00
Pavel Emelyanov 74b0b85b88 tunnels: Fix tunnels change rcu protection
After making rcu protection for tunnels (ipip, gre, sit and ip6) a bug
was introduced into the SIOCCHGTUNNEL code.

The tunnel is first unlinked, then addresses change, then it is linked
back probably into another bucket. But while changing the parms, the
hash table is unlocked to readers and they can lookup the improper tunnel.

Respective commits are b7285b79 (ipip: get rid of ipip_lock), 1507850b
(gre: get rid of ipgre_lock), 3a43be3c (sit: get rid of ipip6_lock) and
94767632 (ip6tnl: get rid of ip6_tnl_lock).

The quick fix is to wait for quiescent state to pass after unlinking,
but if it is inappropriate I can invent something better, just let me
know.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 14:20:08 -07:00
Eric Dumazet e0ad61ec86 net: add __rcu annotations to protocol
Add __rcu annotations to :
        struct net_protocol *inet_protos
        struct net_protocol *inet6_protos

And use appropriate casts to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:31 -07:00
Ursula Braun 853dc2e03d ipv6: fix refcnt problem related to POSTDAD state
After running this bonding setup script
    modprobe bonding miimon=100 mode=0 max_bonds=1
    ifconfig bond0 10.1.1.1/16
    ifenslave bond0 eth1
    ifenslave bond0 eth3
on s390 with qeth-driven slaves, modprobe -r fails with this message
    unregister_netdevice: waiting for bond0 to become free. Usage count = 1
due to twice detection of duplicate address.
Problem is caused by a missing decrease of ifp->refcnt in addrconf_dad_failure.
An extra call of in6_ifa_put(ifp) solves it.
Problem has been introduced with commit f2344a131b.

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27 11:37:30 -07:00
Glenn Wurster 7a876b0efc IPv6: Temp addresses are immediately deleted.
There is a bug in the interaction between ipv6_create_tempaddr and
addrconf_verify.  Because ipv6_create_tempaddr uses the cstamp and tstamp
from the public address in creating a private address, if we have not
received a router advertisement in a while, tstamp + temp_valid_lft might be
< now.  If this happens, the new address is created inside
ipv6_create_tempaddr, then the loop within addrconf_verify starts again and
the address is immediately deleted.  We are left with no temporary addresses
on the interface, and no more will be created until the public IP address is
updated.  To avoid this, set the expiry time to be the minimum of the time
left on the public address or the config option PLUS the current age of the
public interface.

Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-26 12:35:13 -07:00
Glenn Wurster aed65501e8 IPv6: Create temporary address if none exists.
If privacy extentions are enabled, but no current temporary address exists,
then create one when we get a router advertisement.

Signed-off-by: Glenn Wurster <gwurster@scs.carleton.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-26 12:35:12 -07:00
David S. Miller 7932c2e55c netfilter: Add missing CONFIG_SYSCTL checks in ipv6's nf_conntrack_reasm.c
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-26 09:08:53 -07:00
Eric Dumazet 0d7da9ddd9 net: add __rcu annotation to sk_filter
Add __rcu annotation to :
        (struct sock)->sk_filter

And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 14:18:28 -07:00
KOVACS Krisztian f6318e5588 netfilter: fix module dependency issues with IPv6 defragmentation, ip6tables and xt_TPROXY
One of the previous tproxy related patches split IPv6 defragmentation and
connection tracking, but did not correctly add Kconfig stanzas to handle the
new dependencies correctly. This patch fixes that by making the config options
mirror the setup we have for IPv4: a distinct config option for defragmentation
that is automatically selected by both connection tracking and
xt_TPROXY/xt_socket.

The patch also changes the #ifdefs enclosing IPv6 specific code in xt_socket
and xt_TPROXY: we only compile these in case we have ip6tables support enabled.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 13:58:36 -07:00
Eric Dumazet 6f0bcf1525 tunnels: add _rcu annotations
(struct ip6_tnl)->next is rcu protected :
(struct ip_tunnel)->next is rcu protected :
(struct xfrm6_tunnel)->next is rcu protected :

add __rcu annotation and proper rcu primitives.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-25 13:09:45 -07:00
Balazs Scheidler b889416b54 tproxy: Add missing CAP_NET_ADMIN check to ipv6 side
IP_TRANSPARENT requires root (more precisely CAP_NET_ADMIN privielges)
for IPV6.

However as I see right now this check was missed from the IPv6
implementation.

Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-24 16:07:50 -07:00
Anders Franzen 7e223de84b ip6_tunnel dont update the mtu on the route.
The ip6_tunnel device did not unset the flag,
IFF_XMIT_DST_RELEASE. This will make the dev layer
to release the dst before calling the tunnel.
The tunnel will not update any mtu/pmtu info, since
it does not have a dst on the skb.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-24 15:23:36 -07:00
David S. Miller 9941fb6276 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6 2010-10-21 08:21:34 -07:00
Balazs Scheidler 0a513f6af9 tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 16:10:03 +02:00
Balazs Scheidler 6c46862280 tproxy: added tproxy sockopt interface in the IPV6 layer
Support for IPV6_RECVORIGDSTADDR sockopt for UDP sockets were contributed by
Harry Mason.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 16:08:28 +02:00
Balazs Scheidler aa976fc011 tproxy: added udp6_lib_lookup function
Just like with IPv4, we need access to the UDP hash table to look up local
sockets, but instead of exporting the global udp_table, export a lookup
function.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 16:05:41 +02:00
Balazs Scheidler 88440ae70e tproxy: added const specifiers to udp lookup functions
The parameters for various UDP lookup functions were non-const, even though
they could be const. TProxy has some const references and instead of
downcasting it, I added const specifiers along the path.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 16:04:33 +02:00
Balazs Scheidler e97c3e278e tproxy: split off ipv6 defragmentation to a separate module
Like with IPv4, TProxy needs IPv6 defragmentation but does not
require connection tracking. Since defragmentation was coupled
with conntrack, I split off the two, creating an nf_defrag_ipv6 module,
similar to the already existing nf_defrag_ipv4.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 16:03:43 +02:00
Balazs Scheidler 093d282321 tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()
When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket).
Unfortunately, if you're using the TPROXY target to redirect skbs to a
transparent proxy that assumption is not true anymore and things break.

This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.

Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-21 13:06:43 +02:00
stephen hemminger 6f747aca5e xfrm6: make xfrm6_tunnel_free_spi local
Function only defined and used in one file.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-21 03:09:45 -07:00
Randy Dunlap 76b6717bc6 netfilter: fix kconfig unmet dependency warning
Fix netfilter kconfig unmet dependencies warning & spell out
"compatible" while there.

warning: (IP_NF_TARGET_TTL && NET && INET && NETFILTER && IP_NF_IPTABLES && NETFILTER_ADVANCED || IP6_NF_TARGET_HL && NET && INET && IPV6 && NETFILTER && IP6_NF_IPTABLES && NETFILTER_ADVANCED) selects NETFILTER_XT_TARGET_HL which has unmet direct dependencies ((IP_NF_MANGLE || IP6_NF_MANGLE) && NETFILTER_ADVANCED)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-18 11:13:30 +02:00
Eric Dumazet 10da66f755 fib: avoid false sharing on fib_table_hash
While doing profile analysis, I found fib_hash_table was sometime in a
cache line shared by a possibly often written kernel structure.

(CONFIG_IP_ROUTE_MULTIPATH || !CONFIG_IPV6_MULTIPLE_TABLES)

It's hard to detect because not easily reproductible.

Make sure we allocate a full cache line to keep this shared in all cpus
caches.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-16 11:13:23 -07:00
Eric Dumazet 2c1c00040a fib6: use FIB_LOOKUP_NOREF in fib6_rule_lookup()
Avoid two atomic ops on found rule in fib6_rule_lookup()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-16 11:13:21 -07:00
Jan Engelhardt 243bf6e29e netfilter: xtables: resolve indirect macros 3/3 2010-10-13 18:00:46 +02:00
Jan Engelhardt 87a2e70db6 netfilter: xtables: resolve indirect macros 2/3
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-10-13 18:00:41 +02:00
Jan Engelhardt 12b00c2c02 netfilter: xtables: resolve indirect macros 1/3
Many of the used macros are just there for userspace compatibility.
Substitute the in-kernel code to directly use the terminal macro
and stuff the defines into #ifndef __KERNEL__ sections.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
2010-10-13 18:00:36 +02:00
Eric Dumazet fc66f95c68 net dst: use a percpu_counter to track entries
struct dst_ops tracks number of allocated dst in an atomic_t field,
subject to high cache line contention in stress workload.

Switch to a percpu_counter, to reduce number of time we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read only fields.

Stress test :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)

Before:

real    0m51.179s
user    0m15.329s
sys     10m15.942s

After:

real	0m45.570s
user	0m15.525s
sys	9m56.669s

With a small reordering of struct neighbour fields, subject of a
following patch, (to separate refcnt from other read mostly fields)

real	0m41.841s
user	0m15.261s
sys	8m45.949s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11 13:06:53 -07:00
Eric Dumazet d6bf781712 net neigh: RCU conversion of neigh hash table
David

This is the first step for RCU conversion of neigh code.

Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.

Thanks

[PATCH net-next] net neigh: RCU conversion of neigh hash table

Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :

struct neigh_hash_table {
       struct neighbour        **hash_buckets;
       unsigned int            hash_mask;
       __u32                   hash_rnd;
       struct rcu_head         rcu;
};

And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.

This means the signature of (*hash)() function changed: We need to add a
third parameter with the actual hash_rnd value, since this is not
anymore a neigh_table field.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 14:54:36 -07:00
Eric Dumazet caf586e5f2 net: add a core netdev->rx_dropped counter
In various situations, a device provides a packet to our stack and we
drop it before it enters protocol stack :
- softnet backlog full (accounted in /proc/net/softnet_stat)
- bad vlan tag (not accounted)
- unknown/unregistered protocol (not accounted)

We can handle a per-device counter of such dropped frames at core level,
and automatically adds it to the device provided stats (rx_dropped), so
that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)

This is a generalization of commit 8990f468a (net: rx_dropped
accounting), thus reverting it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 14:47:55 -07:00
stephen hemminger c61393ea83 ipv6: make __ipv6_isatap_ifid static
Another exported symbol only used in one file

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-05 00:47:39 -07:00
David S. Miller 21a180cda0 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/Kconfig
	net/ipv4/tcp_timer.c
2010-10-04 11:56:38 -07:00
Eric Dumazet a8defca048 netfilter: ipt_LOG: add bufferisation to call printk() once
ipt_LOG & ip6t_LOG use lot of calls to printk() and use a lock in a hope
several cpus wont mix their output in syslog.

printk() being very expensive [1], its better to call it once, on a
prebuilt and complete line. Also, with mixed IPv4 and IPv6 trafic,
separate IPv4/IPv6 locks dont avoid garbage.

I used an allocation of a 1024 bytes structure, sort of seq_printf() but
with a fixed size limit.
Use a static buffer if dynamic allocation failed.

Emit a once time alert if buffer size happens to be too short.

[1]: printk() has various features like printk_delay()...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-10-04 20:56:05 +02:00