Commit Graph

402 Commits

Author SHA1 Message Date
David McCullough 4dc27d1cf3 net/ipv6/route.c: packets originating on device match lo
Fix to allow IPv6 packets originating locally to match rules with the "iff"
set to "lo".  This allows IPv6 rule matching work the same as it does for
IPv4.  From the iproute2 man page:

   iif NAME
		  select  the incoming device to match.  If the interface is loop‐
		  back, the rule only matches packets originating from this  host.
		  This  means that you may create separate routing tables for for‐
		  warded and local packets and, hence, completely segregate them.

Signed-off-by: David McCullough <david_mccullough@mcafee.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-25 23:54:32 -07:00
David S. Miller e486463e82 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/translation-table.c
	net/ipv6/route.c

qmi_wwan.c resolution provided by Bjørn Mork.

batman-adv conflict is dealing merely with the changes
of global function names to have a proper subsystem
prefix.

ipv6's route.c conflict is merely two side-by-side additions
of network namespace methods.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-25 15:50:32 -07:00
Thomas Graf d189634eca ipv6: Move ipv6 proc file registration to end of init order
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.

This opens up a short time frame to access fib_table_hash with its pants
down.

Move the registration of the proc files to a later point in the init
order to avoid the race.

Tested :-)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-18 18:38:50 -07:00
David S. Miller aee289baaa Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/route.c

Pull in 'net' again to get the revert of Thomas's change
which introduced regressions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-16 01:23:04 -07:00
David S. Miller e8803b6c38 Revert "ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route"
This reverts commit 2a0c451ade.

It causes crashes, because now ip6_null_entry is used before
it is initialized.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-16 01:12:19 -07:00
David S. Miller 42ae66c80d ipv6: Fix types of ip6_update_pmtu().
The mtu should be a __be32, not the mark.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 20:01:57 -07:00
David S. Miller 7e52b33bd5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/route.c

This deals with a merge conflict between the net-next addition of the
inetpeer network namespace ops, and Thomas Graf's bug fix in
2a0c451ade which makes sure we don't
register /proc/net/ipv6_route before it is actually safe to do so.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 15:51:55 -07:00
Thomas Graf 2a0c451ade ipv6: Prevent access to uninitialized fib_table_hash via /proc/net/ipv6_route
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.

This opens up a short time frame to access fib_table_hash with its pants
down.

fib6_init() as a whole can't be moved to an earlier position as it also
registers the rtnetlink message handlers which should be registered at
the end. Therefore split it into fib6_init() which is run early and
fib6_init_late() to register the rtnetlink message handlers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 15:30:15 -07:00
David S. Miller 81aded2467 ipv6: Handle PMTU in ICMP error handlers.
One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts
to handle the error pass the 32-bit info cookie in network byte order
whereas ipv4 passes it around in host byte order.

Like the ipv4 side, we have two helper functions.  One for when we
have a socket context and one for when we do not.

ip6ip6 tunnels are not handled here, because they handle PMTU events
by essentially relaying another ICMP packet-too-big message back to
the original sender.

This patch allows us to get rid of rt6_do_pmtu_disc().  It handles all
kinds of situations that simply cannot happen when we do the PMTU
update directly using a fully resolved route.

In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very
likely be removed or changed into a BUG_ON() check.  We should never
have a prefixed ipv6 route when we get there.

Another piece of strange history here is that TCP and DCCP, unlike in
ipv4, never invoke the update_pmtu() method from their ICMP error
handlers.  This is incredibly astonishing since this is the context
where we have the most accurate context in which to make a PMTU
update, namely we have a fully connected socket and associated cached
socket route.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-15 14:54:11 -07:00
David S. Miller 7b34ca2ac7 inet: Avoid potential NULL peer dereference.
We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.

Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 04:13:57 -07:00
David S. Miller 8b96d22d7a inet: Use FIB table peer roots in routes.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:10:54 -07:00
David S. Miller 97bab73f98 inet: Hide route peer accesses behind helpers.
We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11 02:08:47 -07:00
David S. Miller c0efc887dc inet: Pass inetpeer root into inet_getpeer*() interfaces.
Otherwise we reference potentially non-existing members when
ipv6 is disabled.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:12:36 -07:00
David S. Miller 2b823f7258 ipv6: Do not mark ipv6_inetpeer_ops as __net_initdata.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 19:00:16 -07:00
David S. Miller 56a6b248eb inet: Consolidate inetpeer_invalidate_tree() interfaces.
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:32:41 -07:00
David S. Miller c3426b4719 inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.c
Instead of net/ipv4/inetpeer.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09 16:27:05 -07:00
David S. Miller fbfe95a42e inet: Create and use rt{,6}_get_peer_create().
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 23:24:18 -07:00
Gao feng 54db0cc2ba inetpeer: add parameter net for inet_getpeer_v4,v6
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.

and modify some places to provide net for inet_getpeer_v[4,6]

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08 14:27:23 -07:00
Eric Dumazet a50feda546 ipv6: bool/const conversions phase2
Mostly bool conversions, some inline removals and const additions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-19 01:08:16 -04:00
Joe Perches f32138319c net: ipv6: Standardize prefixes for message logging
Add #define pr_fmt(fmt) as appropriate.

Add "IPv6: " to appropriate files.

Convert printk(KERN_<LEVEL> to pr_<level> (but not KERN_DEBUG).
Standardize on "%s: " not "%s(): " when emitting __func__.
Use "%s: ", __func__ instead of embedding function name.
Coalesce formats, align arguments.

ADDRCONF output is now prefixed with "IPv6: "

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-16 01:01:03 -04:00
Joe Perches e87cc4728f net: Convert net_ratelimit uses to net_<level>_ratelimited
Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-15 13:45:03 -04:00
David S. Miller 56845d78ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/atheros/atlx/atl1.c
	drivers/net/ethernet/atheros/atlx/atl1.h

Resolved a conflict between a DMA error bug fix and NAPI
support changes in the atl1 driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 13:19:04 -04:00
Eric Dumazet 95c9617472 net: cleanup unsigned to unsigned int
Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 12:44:40 -04:00
Gao feng 1716a96101 ipv6: fix problem with expired dst cache
If the ipv6 dst cache which copy from the dst generated by ICMPV6 RA packet.
this dst cache will not check expire because it has no RTF_EXPIRES flag.
So this dst cache will always be used until the dst gc run.

Change the struct dst_entry,add a union contains new pointer from and expires.
When rt6_info.rt6i_flags has no RTF_EXPIRES flag,the dst.expires has no use.
we can use this field to point to where the dst cache copy from.
The dst.from is only used in IPV6.

rt6_check_expired check if rt6_info.dst.from is expired.

ip6_rt_copy only set dst.from when the ort has flag RTF_ADDRCONF
and RTF_DEFAULT.then hold the ort.

ip6_dst_destroy release the ort.

Add some functions to operate the RTF_EXPIRES flag and expires(from) together.
and change the code to use these new adding functions.

Changes from v5:
modify ip6_route_add and ndisc_router_discovery to use new adding functions.

Only set dst.from when the ort has flag RTF_ADDRCONF
and RTF_DEFAULT.then hold the ort.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-13 12:58:29 -04:00
Shmulik Ladkani 2173bff5dc ipv6: Fix 'inet6_rtm_getroute' to release 'rt->dst' in case of 'alloc_skb' failure
In 72331bc [ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be
consistent with ipv4] the code of 'inet6_rtm_getroute()' was re-ordered
such that the reference to 'rt->dst' is incremented prior skb
allocation.

Hence, if 'alloc_skb()' fails, must drop a reference from 'rt->dst'.
Add the missing 'dst_release()' call.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-04 05:25:51 -04:00
David S. Miller c78679e8f3 ipv6: Stop using NLA_PUT*().
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-02 04:33:43 -04:00
Shmulik Ladkani 72331bc0cd ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
In IPv4, if an RTA_IIF attribute is specified within an RTM_GETROUTE
message, then a route is searched as if a packet was received on the
specified 'iif' interface.

However in IPv6, RTA_IIF is not interpreted in the same way:
'inet6_rtm_getroute()' always calls 'ip6_route_output()', regardless the
RTA_IIF attribute.

As a result, in IPv6 there's no way to use RTM_GETROUTE in order to look
for a route as if a packet was received on a specific interface.

Fix 'inet6_rtm_getroute()' so that RTA_IIF is interpreted as "lookup a
route as if a packet was received on the specified interface", similar
to IPv4's 'inet_rtm_getroute()' interpretation.

Reported-by: Ami Koren <amikoren@yahoo.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-01 17:29:40 -04:00
Eric Dumazet 94f826b807 net: fix a potential rcu_read_lock() imbalance in rt6_fill_node()
Commit f2c31e32b3 (net: fix NULL dereferences in check_peer_redir() )
added a regression in rt6_fill_node(), leading to rcu_read_lock()
imbalance.

Thats because NLA_PUT() can make a jump to nla_put_failure label.

Fix this by using nla_put()

Many thanks to Ben Greear for his help

Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-27 18:48:35 -04:00
David S. Miller 4da0bd7365 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-03-18 23:29:41 -04:00
Eric Dumazet 122bdf67f1 ipv6: fix icmp6_dst_alloc()
commit 87a115783 ( ipv6: Move xfrm_lookup() call down into
icmp6_dst_alloc().) forgot to convert one error path, leading
to crashes in mld_sendpack()

Many thanks to Dave Jones for providing a very complete bug report.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 01:53:42 -07:00
David S. Miller a7563f342d ipv6: Use ipv6_addr_any()
Suggested by YOSHIFUJI Hideaki.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-26 16:29:16 -05:00
David S. Miller 39232973b7 ipv4/ipv6: Prepare for new route gateway semantics.
In the future the ipv4/ipv6 route gateway will take on two types
of values:

1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case
   the neighbour must be obtained using the destination address in
   ipv4/ipv6 header as the lookup key.

2) Everything else, the actual nexthop route address.

So if the gateway is not inaddr-any we use it, otherwise we must use
the packet's destination address.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-26 15:22:32 -05:00
RongQing.Li 252c3d84ed ipv6: release idev when ip6_neigh_lookup failed in icmp6_dst_alloc
release idev when ip6_neigh_lookup failed in icmp6_dst_alloc

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-13 10:10:46 -08:00
Josh Hunt 32b293a53d IPv6: Avoid taking write lock for /proc/net/ipv6_route
During some debugging I needed to look into how /proc/net/ipv6_route
operated and in my digging I found its calling fib6_clean_all() which uses
"write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I
found this on 2.6.32, but reading the code I believe the same basic idea
exists currently. Looking at the rtnetlink code they are only calling
"read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize
reading from proc isn't the recommended way of fetching the ipv6 route
table; taking a write lock seems unnecessary and would probably cause
network performance issues.

To verify this I loaded up the ipv6 route table and then ran iperf in 3
cases:
  * doing nothing
  * reading ipv6 route table via proc
    (while :; do cat /proc/net/ipv6_route > /dev/null; done)
  * reading ipv6 route table via rtnetlink
    (while :; do ip -6 route show table all > /dev/null; done)

* Load the ipv6 route table up with:
  * for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done

* iperf commands:
  * client: iperf -i 1 -V -c <ipv6 addr>
  * server: iperf -V -s

* iperf results - 3 runs each (in Mbits/sec)
  * nothing: client: 927,927,927 server: 927,927,927
  * proc: client: 179,97,96,113 server: 142,112,133
  * iproute: client: 928,927,928 server: 927,927,927

lock_stat shows taking the write lock is causing the slowdown. Using this
info I decided to write a version of fib6_clean_all() which replaces
write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
this new function I see the same results as with my rtnetlink iperf test.

Signed-off-by: Josh Hunt <joshhunt00@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30 17:07:33 -05:00
David S. Miller 8ade06c616 ipv6: Fix neigh lookup using NULL device.
In some of the rt6_bind_neighbour() call sites, it hasn't hooked
up the rt->dst.dev pointer yet, so we'd deref a NULL pointer when
obtaining dev->ifindex for the neighbour hash function computation.

Just pass the netdevice explicitly in to fix this problem.

Reported-by: Bjarke Istrup Pedersen <gurligebis@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-29 18:51:57 -05:00
David S. Miller 346f870b8a ipv6: Report TCP timetstamp info in cacheinfo just like ipv4 does.
I missed this while adding ipv6 support to inet_peer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-29 15:22:33 -05:00
David S. Miller d191854282 ipv6: Kill rt6i_dev and rt6i_expires defines.
It just obscures that the netdevice pointer and the expires value are
implemented in the dst_entry sub-object of the ipv6 route.

And it makes grepping for dst_entry member uses much harder too.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-28 20:19:20 -05:00
David S. Miller f83c7790dc ipv6: Create fast inline ipv6 neigh lookup just like ipv4.
Also, create and use an rt6_bind_neighbour() in net/ipv6/route.c to
consolidate some common logic.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-28 15:41:23 -05:00
David S. Miller c159d30c59 ipv6: Kill useless route tracing bits in net/ipv6/route.c
RDBG() wasn't even used, and the messages printed by RT6_DEBUG() were
far from useful.  Just get rid of all this stuff, we can replace it
with something more suitable if we want.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-26 15:24:36 -05:00
David S. Miller c5e1fd8cca Merge branch 'nf-next' of git://1984.lsi.us.es/net-next 2011-12-25 02:21:45 -05:00
David S. Miller b26e478f8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fsl_pq_mdio.c
	net/batman-adv/translation-table.c
	net/ipv6/route.c
2011-12-16 02:11:14 -05:00
David S. Miller bb3c36863e ipv6: Check dest prefix length on original route not copied one in rt6_alloc_cow().
After commit 8e2ec63917 ("ipv6: don't
use inetpeer to store metrics for routes.") the test in rt6_alloc_cow()
for setting the ANYCAST flag is now wrong.

'rt' will always now have a plen of 128, because it is set explicitly
to 128 by ip6_rt_copy.

So to restore the semantics of the test, check the destination prefix
length of 'ort'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-13 17:35:06 -05:00
David S. Miller b43faac690 ipv6: If neigh lookup fails during icmp6 dst allocation, propagate error.
Don't just succeed with a route that has a NULL neighbour attached.
This follows the behavior of addrconf_dst_alloc().

Allowing this kind of route to end up with a NULL neigh attached will
result in packet drops on output until the route is somehow
invalidated, since nothing will meanwhile try to lookup the neigh
again.

A statistic is bumped for the case where we see a neigh-less route on
output, but the resulting packet drop is otherwise silent in nature,
and frankly it's a hard error for this to happen and ipv6 should do
what ipv4 does which is say something in the kernel logs.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-13 16:51:51 -05:00
David S. Miller 87a115783e ipv6: Move xfrm_lookup() call down into icmp6_dst_alloc().
And return error pointers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-06 17:04:13 -05:00
David S. Miller 8f0315190d ipv6: Make third arg to anycast_dst_alloc() bool.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-06 16:48:14 -05:00
David Miller 2721745501 net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
2011-12-05 15:20:19 -05:00
Florian Westphal ea6e574e34 ipv6: add ip6_route_lookup
like rt6_lookup, but allows caller to pass in flowi6 structure.
Will be used by the upcoming ipv6 netfilter reverse path filter
match.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-12-04 22:44:07 +01:00
David S. Miller 04a6f4417b ipv6: Kill ndisc_get_neigh() inline helper.
It's only used in net/ipv6/route.c and the NULL device check is
superfluous for all of the existing call sites.

Just expand the __ndisc_lookup_errno() call at each location.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-03 18:29:30 -05:00
David S. Miller 3830847396 ipv6: Various cleanups in route.c
1) x == NULL --> !x
2) x != NULL --> x
3) (x&BIT) --> (x & BIT)
4) (BIT1|BIT2) --> (BIT1 | BIT2)
5) proper argument and struct member alignment

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-03 18:02:47 -05:00
David S. Miller 6dec4ac4ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/inet_diag.c
2011-11-26 14:47:03 -05:00
Steffen Klassert 618f9bc74a net: Move mtu handling down to the protocol depended handlers
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:51 -05:00
Steffen Klassert ebb762f27f net: Rename the dst_opt default_mtu method to mtu
We plan to invoke the dst_opt->default_mtu() method unconditioally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Steffen Klassert 6b600b26c0 route: Use the device mtu as the default for blackhole routes
As it is, we return null as the default mtu of blackhole routes.
This may lead to a propagation of a bogus pmtu if the default_mtu
method of a blackhole route is invoked. So return dst->dev->mtu
as the default mtu instead.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Alexey Dobriyan 4e3fd7a06d net: remove ipv6_addr_copy()
C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22 16:43:32 -05:00
Matti Vaittinen d71314b4ac IPv6 routing, NLM_F_* flag support: warn if new route is created without NLM_F_CREATE
The support for NLM_F_* flags at IPv6 routing requests.

Warn if NLM_F_CREATE flag is not defined for RTM_NEWROUTE request,
creating new table. Later NLM_F_CREATE may be required for
new route creation.

Patch created against linux-3.2-rc1

Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-14 14:35:33 -05:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Paul Gortmaker bc3b2d7fb9 net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:30 -04:00
Gao feng 7011687f0f ipv6: fix route error binding peer in func icmp6_dst_alloc
in func icmp6_dst_alloc,dst_metric_set call ipv6_cow_metrics to set metric.
ipv6_cow_metrics may will call rt6_bind_peer to set rt6_info->rt6i_peer.
So,we should move ipv6_addr_copy before dst_metric_set to make sure rt6_bind_peer success.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-28 16:36:07 -04:00
Madalin Bucur fbe5818690 ipv6: check return value for dst_alloc
return value of dst_alloc must be checked before use

Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 15:32:06 -04:00
Yan, Zheng 8e2ec63917 ipv6: don't use inetpeer to store metrics for routes.
Current IPv6 implementation uses inetpeer to store metrics for
routes. The problem of inetpeer is that it doesn't take subnet
prefix length in to consideration. If two routes have the same
address but different prefix length, they share same inetpeer.
So changing metrics of one route also affects the other. The
fix is to allocate separate metrics storage for each route.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-17 00:57:26 -04:00
Eric Dumazet f2c31e32b3 net: fix NULL dereferences in check_peer_redir()
Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-03 03:34:12 -07:00
Eric Dumazet 21efcfa0ff ipv6: unshare inetpeers
We currently cow metrics a bit too soon in IPv6 case : All routes are
tied to a single inetpeer entry.

Change ip6_rt_copy() to get destination address as second argument, so
that we fill rt6i_dst before the dst_copy_metrics() call.

icmp6_dst_alloc() must set rt6i_dst before calling dst_metric_set(), or
else the cow is done while rt6i_dst is still NULL.

If orig route points to readonly metrics, we can share the pointer
instead of performing the memory allocation and copy.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 21:24:25 -07:00
David S. Miller d3aaeb38c4 net: Add ->neigh_lookup() operation to dst_ops
In the future dst entries will be neigh-less.  In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18 00:40:17 -07:00
David S. Miller 69cce1d140 net: Abstract dst->neighbour accesses behind helpers.
dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
David S. Miller 9cbb7ecbcf ipv6: Get rid of rt6i_nexthop macro.
It just makes it harder to see 1) what the code is doing
and 2) grep for all users of dst{->,.}neighbour

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
David S. Miller e12fe68ce3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-07-05 23:23:37 -07:00
David S. Miller 957c665f37 ipv6: Don't put artificial limit on routing table size.
IPV6, unlike IPV4, doesn't have a routing cache.

Routing table entries, as well as clones made in response
to route lookup requests, all live in the same table.  And
all of these things are together collected in the destination
cache table for ipv6.

This means that routing table entries count against the garbage
collection limits, even though such entries cannot ever be reclaimed
and are added explicitly by the administrator (rather than being
created in response to lookups).

Therefore it makes no sense to count ipv6 routing table entries
against the GC limits.

Add a DST_NOCOUNT destination cache entry flag, and skip the counting
if it is set.  Use this flag bit in ipv6 when adding routing table
entries.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:30:43 -07:00
David S. Miller 11d53b4990 ipv6: Don't change dst->flags using assignments.
This blows away any flags already set in the entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:30:43 -07:00
Greg Rose c7ac8679be rtnetlink: Compute and store minimum ifinfo dump size
The message size allocated for rtnl ifinfo dumps was limited to
a single page.  This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-06-09 20:38:07 -07:00
Florian Westphal 0f6c6392dc ipv6: copy prefsrc setting when copying route entry
commit c3968a857a
('ipv6: RTA_PREFSRC support for ipv6 route source address selection')
added support for ipv6 prefsrc as an alternative to ipv6 addrlabels,
but it did not work because the prefsrc entry was not copied.

Cc: Daniel Walter <sahne@0x90.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-21 02:05:22 -04:00
David S. Miller cf91166223 net: Use non-zero allocations in dst_alloc().
Make dst_alloc() and it's users explicitly initialize the entire
entry.

The zero'ing done by kmem_cache_zalloc() was almost entirely
redundant.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 22:26:00 -07:00
David S. Miller 5c1e6aa300 net: Make dst_alloc() take more explicit initializations.
Now the dst->dev, dev->obsolete, and dst->flags values can
be specified as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-28 22:25:59 -07:00
David S. Miller 2bd93d7af1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Resolved logic conflicts causing a build failure due to
drivers/net/r8169.c changes using a patch from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-26 12:16:46 -07:00
Held Bernhard 0972ddb237 net: provide cow_metrics() methods to blackhole dst_ops
Since commit 62fa8a846d (net: Implement read-only protection and COW'ing
of metrics.) the kernel throws an oops.

[  101.620985] BUG: unable to handle kernel NULL pointer dereference at
           (null)
[  101.621050] IP: [<          (null)>]           (null)
[  101.621084] PGD 6e53c067 PUD 3dd6a067 PMD 0
[  101.621122] Oops: 0010 [#1] SMP
[  101.621153] last sysfs file: /sys/devices/virtual/ppp/ppp/uevent
[  101.621192] CPU 2
[  101.621206] Modules linked in: l2tp_ppp pppox ppp_generic slhc
l2tp_netlink l2tp_core deflate zlib_deflate twofish_x86_64
twofish_common des_generic cbc ecb sha1_generic hmac af_key
iptable_filter snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device loop
snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_pcm snd_timer snd i2c_i801 iTCO_wdt psmouse soundcore snd_page_alloc
evdev uhci_hcd ehci_hcd thermal
[  101.621552]
[  101.621567] Pid: 5129, comm: openl2tpd Not tainted 2.6.39-rc4-Quad #3
Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R
[  101.621637] RIP: 0010:[<0000000000000000>]  [<          (null)>]   (null)
[  101.621684] RSP: 0018:ffff88003ddeba60  EFLAGS: 00010202
[  101.621716] RAX: ffff88003ddb5600 RBX: ffff88003ddb5600 RCX:
0000000000000020
[  101.621758] RDX: ffffffff81a69a00 RSI: ffffffff81b7ee61 RDI:
ffff88003ddb5600
[  101.621800] RBP: ffff8800537cd900 R08: 0000000000000000 R09:
ffff88003ddb5600
[  101.621840] R10: 0000000000000005 R11: 0000000000014b38 R12:
ffff88003ddb5600
[  101.621881] R13: ffffffff81b7e480 R14: ffffffff81b7e8b8 R15:
ffff88003ddebad8
[  101.621924] FS:  00007f06e4182700(0000) GS:ffff88007fd00000(0000)
knlGS:0000000000000000
[  101.621971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.622005] CR2: 0000000000000000 CR3: 0000000045274000 CR4:
00000000000006e0
[  101.622046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  101.622087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[  101.622129] Process openl2tpd (pid: 5129, threadinfo
ffff88003ddea000, task ffff88003de9a280)
[  101.622177] Stack:
[  101.622191]  ffffffff81447efa ffff88007d3ded80 ffff88003de9a280
ffff88007d3ded80
[  101.622245]  0000000000000001 ffff88003ddebbb8 ffffffff8148d5a7
0000000000000212
[  101.622299]  ffff88003dcea000 ffff88003dcea188 ffffffff00000001
ffffffff81b7e480
[  101.622353] Call Trace:
[  101.622374]  [<ffffffff81447efa>] ? ipv4_blackhole_route+0x1ba/0x210
[  101.622415]  [<ffffffff8148d5a7>] ? xfrm_lookup+0x417/0x510
[  101.622450]  [<ffffffff8127672a>] ? extract_buf+0x9a/0x140
[  101.622485]  [<ffffffff8144c6a0>] ? __ip_flush_pending_frames+0x70/0x70
[  101.622526]  [<ffffffff8146fbbf>] ? udp_sendmsg+0x62f/0x810
[  101.622562]  [<ffffffff813f98a6>] ? sock_sendmsg+0x116/0x130
[  101.622599]  [<ffffffff8109df58>] ? find_get_page+0x18/0x90
[  101.622633]  [<ffffffff8109fd6a>] ? filemap_fault+0x12a/0x4b0
[  101.622668]  [<ffffffff813fb5c4>] ? move_addr_to_kernel+0x64/0x90
[  101.622706]  [<ffffffff81405d5a>] ? verify_iovec+0x7a/0xf0
[  101.622739]  [<ffffffff813fc772>] ? sys_sendmsg+0x292/0x420
[  101.622774]  [<ffffffff810b994a>] ? handle_pte_fault+0x8a/0x7c0
[  101.622810]  [<ffffffff810b76fe>] ? __pte_alloc+0xae/0x130
[  101.622844]  [<ffffffff810ba2f8>] ? handle_mm_fault+0x138/0x380
[  101.622880]  [<ffffffff81024af9>] ? do_page_fault+0x189/0x410
[  101.622915]  [<ffffffff813fbe03>] ? sys_getsockname+0xf3/0x110
[  101.622952]  [<ffffffff81450c4d>] ? ip_setsockopt+0x4d/0xa0
[  101.622986]  [<ffffffff813f9932>] ? sockfd_lookup_light+0x22/0x90
[  101.623024]  [<ffffffff814b61fb>] ? system_call_fastpath+0x16/0x1b
[  101.623060] Code:  Bad RIP value.
[  101.623090] RIP  [<          (null)>]           (null)
[  101.623125]  RSP <ffff88003ddeba60>
[  101.623146] CR2: 0000000000000000
[  101.650871] ---[ end trace ca3856a7d8e8dad4 ]---
[  101.651011] __sk_free: optmem leakage (160 bytes) detected.

The oops happens in dst_metrics_write_ptr()
include/net/dst.h:124: return dst->ops->cow_metrics(dst, p);

dst->ops->cow_metrics is NULL and causes the oops.

Provide cow_metrics() methods, like we did in commit 214f45c91b
(net: provide default_advmss() methods to blackhole dst_ops)

Signed-off-by: Held Bernhard <berny156@gmx.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-25 11:53:08 -07:00
Eric Dumazet b71d1d426d inet: constify ip headers and in6_addr
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-22 11:04:14 -07:00
Thomas Egerer e965c05dab ipv6: Remove hoplimit initialization to -1
The changes introduced with git-commit a02e4b7d ("ipv6: Demark default
hoplimit as zero.") missed to remove the hoplimit initialization. As a
result, ipv6_get_mtu interprets the return value of dst_metric_raw
(-1) as 255 and answers ping6 with this hoplimit.  This patche removes
the line such that ping6 is answered with the hoplimit value
configured via sysctl.

Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-21 17:24:08 -07:00
Daniel Walter c3968a857a ipv6: RTA_PREFSRC support for ipv6 route source address selection
[ipv6] Add support for RTA_PREFSRC

This patch allows a user to select the preferred source address
for a specific IPv6-Route. It can be set via a netlink message
setting RTA_PREFSRC to a valid IPv6 address which must be
up on the device the route will be bound to.

Signed-off-by: Daniel Walter <dwalter@barracuda.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-15 15:44:37 -07:00
Florian Westphal 9c7a4f9ce6 ipv6: ip6_route_output does not modify sk parameter, so make it const
This avoids explicit cast to avoid 'discards qualifiers'
compiler warning in a netfilter patch that i've been working on.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-22 19:17:36 -07:00
David S. Miller 4c9483b2fb ipv6: Convert to use flowi6 where applicable.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:54 -08:00
David S. Miller 1d28f42c1b net: Put flowi_* prefix on AF independent members of struct flowi
I intend to turn struct flowi into a union of AF specific flowi
structs.  There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-12 15:08:44 -08:00
David S. Miller 33175d84ee Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x_cmn.c
2011-03-10 14:26:00 -08:00
David S. Miller 7343ff31eb ipv6: Don't create clones of host routes.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=29252
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30462

In commit d80bc0fd26 ("ipv6: Always
clone offlink routes.") we forced the kernel to always clone offlink
routes.

The reason we do that is to make sure we never bind an inetpeer to a
prefixed route.

The logic turned on here has existed in the tree for many years,
but was always off due to a protecting CPP define.  So perhaps
it's no surprise that there is a logic bug here.

The problem is that we canot clone a route that is already a
host route (ie. has DST_HOST set).  Because if we do, an identical
entry already exists in the routing tree and therefore the
ip6_rt_ins() call is going to fail.

This sets off a series of failures and high cpu usage, because when
ip6_rt_ins() fails we loop retrying this operation a few times in
order to handle a race between two threads trying to clone and insert
the same host route at the same time.

Fix this by simply using the route as-is when DST_HOST is set.

Reported-by: slash@ac.auone-net.jp
Reported-by: Ernst Sjöstrand <ernstp@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-09 19:55:25 -08:00
David S. Miller 0a0e9ae1bd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/bnx2x/bnx2x.h
2011-03-03 21:27:42 -08:00
David S. Miller 29546a6404 ipv6: Use ERR_CAST in addrconf_dst_alloc.
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-03 12:10:37 -08:00
David S. Miller 2774c131b1 xfrm: Handle blackhole route creation via afinfo.
That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:59:04 -08:00
David S. Miller 69ead7afdf ipv6: Normalize arguments to ip6_dst_blackhole().
Return a dst pointer which is potentitally error encoded.

Don't pass original dst pointer by reference, pass a struct net
instead of a socket, and elide the flow argument since it is
unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-01 14:45:33 -08:00
Hagen Paul Pfeifer e9476e95d8 ipv6: variable next is never used in this function
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 14:00:22 -08:00
Lucian Adrian Grijincu c486da3439 sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush
Before this patch issuing these commands:

  fd = open("/proc/sys/net/ipv6/route/flush")
  unshare(CLONE_NEWNET)
  write(fd, "stuff")

would flush the newly created net, not the original one.

The equivalent ipv4 code is correct (stores the net inside ->extra1).
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-25 11:01:56 -08:00
David S. Miller da935c66ba Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	Documentation/feature-removal-schedule.txt
	drivers/net/e1000e/netdev.c
	net/xfrm/xfrm_policy.c
2011-02-19 19:17:35 -08:00
Eric Dumazet 214f45c91b net: provide default_advmss() methods to blackhole dst_ops
Commit 0dbaee3b37 (net: Abstract default ADVMSS behind an
accessor.) introduced a possible crash in tcp_connect_init(), when
dst->default_advmss() is called from dst_metric_advmss()

Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-18 11:39:01 -08:00
David S. Miller 3c7bd1a140 net: Add initial_ref arg to dst_alloc().
This allows avoiding multiple writes to the initial __refcnt.

The most simplest cases of wanting an initial reference of "1"
in ipv4 and ipv6 have been converted, the rest have been left
along and kept at the existing "0".

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-17 15:44:00 -08:00
David S. Miller 6431cbc25f inet: Create a mechanism for upward inetpeer propagation into routes.
If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID.  When we create a cached route of any
kind, we remember the generation ID at the time of attachment.  Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated.  If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found.  Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache.  There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10 13:33:41 -08:00
David S. Miller 8d13a2a9fb net: Kill NETEVENT_PMTU_UPDATE.
Nobody actually does anything in response to the event,
so just kill it off.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-08 16:17:55 -08:00
David S. Miller bd4a6974cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-02-04 14:28:58 -08:00
Roland Dreier ec831ea72e net: Add default_mtu() methods to blackhole dst_ops
When an IPSEC SA is still being set up, __xfrm_lookup() will return
-EREMOTE and so ip_route_output_flow() will return a blackhole route.
This can happen in a sndmsg call, and after d33e455337 ("net: Abstract
default MTU metric calculation behind an accessor.") this leads to a
crash in ip_append_data() because the blackhole dst_ops have no
default_mtu() method and so dst_mtu() calls a NULL pointer.

Fix this by adding default_mtu() methods (that simply return 0, matching
the old behavior) to the blackhole dst_ops.

The IPv4 part of this patch fixes a crash that I saw when using an IPSEC
VPN; the IPv6 part is untested because I don't have an IPv6 VPN, but it
looks to be needed as well.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-31 13:16:00 -08:00
David S. Miller 065825402c net: Store ipv4/ipv6 COW'd metrics in inetpeer cache.
Please note that the IPSEC dst entry metrics keep using
the generic metrics COW'ing mechanism using kmalloc/kfree.

This gives the IPSEC routes an opportunity to use metrics
which are unique to their encapsulated paths.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 14:59:31 -08:00
David S. Miller 1397e171f1 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-01-27 14:59:08 -08:00
David S. Miller 8f2771f2b8 ipv6: Remove route peer binding assertions.
They are bogus.  The basic idea is that I wanted to make sure
that prefixed routes never bind to peers.

The test I used was whether RTF_CACHE was set.

But first of all, the RTF_CACHE flag is set at different spots
depending upon which ip6_rt_copy() caller you're talking about.

I've validated all of the code paths, and even in the future
where we bind peers more aggressively (for route metric COW'ing)
we never bind to prefix'd routes, only fully specified ones.
This even applies when addrconf or icmp6 routes are allocated.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27 14:55:22 -08:00
David S. Miller 62fa8a846d net: Implement read-only protection and COW'ing of metrics.
Routing metrics are now copy-on-write.

Initially a route entry points it's metrics at a read-only location.
If a routing table entry exists, it will point there.  Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing.  Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads.  In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit.  But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline.  This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-26 20:51:05 -08:00
David S. Miller d80bc0fd26 ipv6: Always clone offlink routes.
Do not handle PMTU vs. route lookup creation any differently
wrt. offlink routes, always clone them.

Reported-by: PK <runningdoglackey@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24 16:01:58 -08:00