neigh->nud_state and neigh->updated are under protection of
neigh->lock.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only user is cxgb3 driver.
old_neigh is used to check device change, but it must not happen
on redirect. In this sense, we can remove old_neigh argument.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I believe this commit from 2008 was incorrect:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=398bcbebb6f721ac308df1e3d658c0029bb74503
When CONFIG_IPV6_ROUTER_PREF is disabled, the kernel should follow
RFC4861 section 6.3.6: if no route is NUD_VALID, then traffic should be
sprayed across all routers (indirectly triggering NUD) until one of them
becomes NUD_VALID.
However, the following experiment demonstrates that this does not work:
1) Connect to an IPv6 network.
2) Change the router's MAC (and link-local) address.
The kernel will lock onto the first router and never try the new one, even
if the first becomes unreachable. This patch fixes the problem by
allowing rt6_check_neigh() to return 0; if all routers return 0, then
rt6_select() will fall back to round-robin behavior.
This patch should have no effect when CONFIG_IPV6_ROUTER_PREF=y.
Note that rt6_check_neigh() is only used in a boolean context, so I've
changed its return type accordingly.
Signed-off-by: Paul Marks <pmarks@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.
In general policy and network stack state changes are allowed while
resource control is left unchanged.
Allow the SIOCSIFADDR ioctl to add ipv6 addresses.
Allow the SIOCDIFADDR ioctl to delete ipv6 addresses.
Allow the SIOCADDRT ioctl to add ipv6 routes.
Allow the SIOCDELRT ioctl to delete ipv6 routes.
Allow creation of ipv6 raw sockets.
Allow setting the IPV6_JOIN_ANYCAST socket option.
Allow setting the IPV6_FL_A_RENEW parameter of the IPV6_FLOWLABEL_MGR
socket option.
Allow setting the IPV6_TRANSPARENT socket option.
Allow setting the IPV6_HOPOPTS socket option.
Allow setting the IPV6_RTHDRDSTOPTS socket option.
Allow setting the IPV6_DSTOPTS socket option.
Allow setting the IPV6_IPSEC_POLICY socket option.
Allow setting the IPV6_XFRM_POLICY socket option.
Allow sending packets with the IPV6_2292HOPOPTS control message.
Allow sending packets with the IPV6_2292DSTOPTS control message.
Allow sending packets with the IPV6_RTHDRDSTOPTS control message.
Allow setting the multicast routing socket options on non multicast
routing sockets.
Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, and SIOCDELTUNNEL ioctls for
setting up, changing and deleting tunnels over ipv6.
Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, SIOCDELTUNNEL ioctls for
setting up, changing and deleting ipv6 over ipv4 tunnels.
Allow the SIOCADDPRL, SIOCDELPRL, SIOCCHGPRL ioctls for adding,
deleting, and changing the potential router list for ISATAP tunnels.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.
- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.
Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.
This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
6431cbc25f(Create a mechanism for upward inetpeer propagation into routes)
introduces these codes, but this mechanism is never enabled since
rt6i_peer_genid always is zero whether it is not assigned or assigned by
rt6_peer_genid(). After 5943634fc5 (ipv4: Maintain redirect and PMTU info
in struct rtable again), the ipv4 related codes of this mechanism has been
removed, I think we maybe able to remove them now.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Eric, we could introduce a helper function
for ipv6 too, to avoid checking if rt is NULL before
dst_release().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib6_add_rt2node() will reject the nexthop if this flag is set, so
we perform the check only for the first nexthop.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a02e4b7dae4551(Demark default hoplimit as zero) only changes the
hoplimit checking condition and default value in ip6_dst_hoplimit, not
zeros all hoplimit default value.
Keep the zeroing ip6_template_metrics[RTAX_HOPLIMIT - 1] to force it as
const, cause as a37e6e344910(net: force dst_default_metrics to const
section)
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding by commit 51ebd31815 which adds the support of ECMP for IPv6.
Spotted-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each nexthop is added like a single route in the routing table. All routes
that have the same metric/weight and destination but not the same gateway
are considering as ECMP routes. They are linked together, through a list called
rt6i_siblings.
ECMP routes can be added in one shot, with RTA_MULTIPATH attribute or one after
the other (in both case, the flag NLM_F_EXCL should not be set).
The patch is based on a previous work from
Luc Saillard <luc.saillard@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
as we hold dst_entry before we call __ip6_del_rt,
so we should alse call dst_release not only return
-ENOENT when the rt6_info is ip6_null_entry.
and we already hold the dst entry, so I think it's
safe to call dst_release out of the write-read lock.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/team/team.c
drivers/net/usb/qmi_wwan.c
net/batman-adv/bat_iv_ogm.c
net/ipv4/fib_frontend.c
net/ipv4/route.c
net/l2tp/l2tp_netlink.c
The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.
qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.
With help from Antonio Quartulli.
Signed-off-by: David S. Miller <davem@davemloft.net>
If dst cache dst_a copies from dst_b, and dst_b copies from dst_c, check
if dst_a is expired or not, we should not end with dst_a->dst.from, dst_b,
we should check dst_c.
CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or
deleted, all dst should be invalidated.
To force the validation, dst entries should be created with ->obsolete set to
DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling
ip6_dst_alloc(), except for ip6_rt_copy().
As a consequence, we can remove the specific code in inet6_connection_sock.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
geting route info does not write rt->rt6i_table, so replace
write lock with read lock
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this commit:
commit 97cac0821a
Author: David S. Miller <davem@davemloft.net>
Date: Mon Jul 2 22:43:47 2012 -0700
ipv6: Store route neighbour in rt6_info struct.
we no longer use RCU to protect route neighbour.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's the same problem that previous fix about blackhole and prohibit routes.
When adding a throw route, it was handled like a classic route.
Moreover, it was only possible to add this kind of routes by specifying
an interface.
Before the patch:
$ ip route add throw 2001::2/128
RTNETLINK answers: No such device
$ ip route add throw 2001::2/128 dev eth0
$ ip -6 route | grep 2001::2
2001::2 dev eth0 metric 1024
After:
$ ip route add throw 2001::2/128
$ ip -6 route | grep 2001::2
throw 2001::2 dev lo metric 1024 error -11
Reported-by: Markus Stenberg <markus.stenberg@iki.fi>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding a blackhole or a prohibit route, they were handling like classic
routes. Moreover, it was only possible to add this kind of routes by specifying
an interface.
Bug already reported here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498
Before the patch:
$ ip route add blackhole 2001::1/128
RTNETLINK answers: No such device
$ ip route add blackhole 2001::1/128 dev eth0
$ ip -6 route | grep 2001
2001::1 dev eth0 metric 1024
After:
$ ip route add blackhole 2001::1/128
$ ip -6 route | grep 2001
blackhole 2001::1 dev lo metric 1024 error -22
v2: wrong patch
v3: add a field fc_type in struct fib6_config to store RTN_* type
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed out, there are places, that access net->loopback_dev->ifindex
and after ifindex generation is made per-net this value becomes constant
equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
it where appropriate.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When userspace use RTM_GETROUTE to dump route table, with an already
expired route entry, we always got an 'expires' value(2147157)
calculated base on INT_MAX.
The reason of this problem is in the following satement:
rt->dst.expires - jiffies < INT_MAX
gcc promoted the type of both sides of '<' to unsigned long, thus
a small negative value would be considered greater than INT_MAX.
With the help of Eric Dumazet, do the out of bound checks in
rtnl_put_cacheinfo(), _after_ conversion to clock_t.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.
Suggested by Joe Perches.
Signed-off-by: David S. Miller <davem@davemloft.net>
These patches implement the final mechanism necessary to really allow
us to go without the route cache in ipv4.
We need a place to have long-term storage of PMTU/redirect information
which is independent of the routes themselves, yet does not get us
back into a situation where we have to write to metrics or anything
like that.
For this we use an "next-hop exception" table in the FIB nexthops.
The one thing I desperately want to avoid is having to create clone
routes in the FIB trie for this purpose, because that is very
expensive. However, I'm willing to entertain such an idea later
if this current scheme proves to have downsides that the FIB trie
variant would not have.
In order to accomodate this any such scheme, we need to be able to
produce a full flow key at PMTU/redirect time. That required an
adjustment of the interface call-sites used to propagate these events.
For a PMTU/redirect with a fully specified socket, we pass that socket
and use it to produce the flow key.
Otherwise we use a passed in SKB to formulate the key. There are two
cases that need to be distinguished, ICMP message processing (in which
case the IP header is at skb->data) and output packet processing
(mostly tunnels, and in all such cases the IP header is at ip_hdr(skb)).
We also have to make the code able to handle the case where the dst
itself passed into the dst_ops->{update_pmtu,redirect} method is
invalidated. This matters for calls from sockets that have cached
that route. We provide a inet{,6} helper function for this purpose,
and edit SCTP specially since it caches routes at the transport rather
than socket level.
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used so that we can compose a full flow key.
Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.
In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Userspace implementations of network routing protocols sometimes need to
tell RA-originated IPv6 routes from other kernel routes to make proper
routing decisions. This makes most sense for RA routes with nexthops,
namely, default routes and Route Information routes.
The intended mean of preserving RA route origin in a netlink message is
through indicating RTPROT_RA as protocol code. Function rt6_fill_node()
tried to do that for default routes, but its test condition was taken
wrong. This change is modeled after the original mailing list posting
by Jeff Haran. It fixes the test condition for default route case and
sets the same behaviour for Route Information case (both types use
nexthops). Handling of the 3rd RA route type, Prefix Information, is
left unchanged, as it stands for interface connected routes (without
nexthops).
Signed-off-by: Denis Ovsienko <infrastation@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
We start initializing the struct rt6_info at the first field
behind the struct dst_enty. This is error prone because it
might leave a new field uninitialized. So start initializing
the struct rt6_info right behind the dst_entry.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sets things up so that we can have the protocol error handlers
call down into the ipv6 route code for redirects just as ipv4 already
does.
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed. TCP writes metrics, but now in it's own special
cache that does not dirty the route metrics. Therefore there is no
longer any reason to pre-cow metrics in this way.
Signed-off-by: David S. Miller <davem@davemloft.net>
git commit 97cac082 (ipv6: Store route neighbour in rt6_info struct)
added a neighbour pointer to rt6_info. Currently we don't initialize
this pointer at allocation time. We assume this pointer to be valid
if it is not a null pointer, so initialize it on allocation.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes for a simplified conversion away from dst_get_neighbour*().
All code outside of ipv6 will use neigh lookups via dst_neigh_lookup*().
Signed-off-by: David S. Miller <davem@davemloft.net>
Causes the handler to use the daddr in the ipv4/ipv6 header when
the route gateway is unspecified (local subnet).
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to allow IPv6 packets originating locally to match rules with the "iff"
set to "lo". This allows IPv6 rule matching work the same as it does for
IPv4. From the iproute2 man page:
iif NAME
select the incoming device to match. If the interface is loop‐
back, the rule only matches packets originating from this host.
This means that you may create separate routing tables for for‐
warded and local packets and, hence, completely segregate them.
Signed-off-by: David McCullough <david_mccullough@mcafee.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/qmi_wwan.c
net/batman-adv/translation-table.c
net/ipv6/route.c
qmi_wwan.c resolution provided by Bjørn Mork.
batman-adv conflict is dealing merely with the changes
of global function names to have a proper subsystem
prefix.
ipv6's route.c conflict is merely two side-by-side additions
of network namespace methods.
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.
This opens up a short time frame to access fib_table_hash with its pants
down.
Move the registration of the proc files to a later point in the init
order to avoid the race.
Tested :-)
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv6/route.c
Pull in 'net' again to get the revert of Thomas's change
which introduced regressions.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 2a0c451ade.
It causes crashes, because now ip6_null_entry is used before
it is initialized.
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv6/route.c
This deals with a merge conflict between the net-next addition of the
inetpeer network namespace ops, and Thomas Graf's bug fix in
2a0c451ade which makes sure we don't
register /proc/net/ipv6_route before it is actually safe to do so.
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ipv6_route reflects the contents of fib_table_hash. The proc
handler is installed in ip6_route_net_init() whereas fib_table_hash is
allocated in fib6_net_init() _after_ the proc handler has been installed.
This opens up a short time frame to access fib_table_hash with its pants
down.
fib6_init() as a whole can't be moved to an earlier position as it also
registers the rtnetlink message handlers which should be registered at
the end. Therefore split it into fib6_init() which is run early and
fib6_init_late() to register the rtnetlink message handlers.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts
to handle the error pass the 32-bit info cookie in network byte order
whereas ipv4 passes it around in host byte order.
Like the ipv4 side, we have two helper functions. One for when we
have a socket context and one for when we do not.
ip6ip6 tunnels are not handled here, because they handle PMTU events
by essentially relaying another ICMP packet-too-big message back to
the original sender.
This patch allows us to get rid of rt6_do_pmtu_disc(). It handles all
kinds of situations that simply cannot happen when we do the PMTU
update directly using a fully resolved route.
In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very
likely be removed or changed into a BUG_ON() check. We should never
have a prefixed ipv6 route when we get there.
Another piece of strange history here is that TCP and DCCP, unlike in
ipv4, never invoke the update_pmtu() method from their ICMP error
handlers. This is incredibly astonishing since this is the context
where we have the most accurate context in which to make a PMTU
update, namely we have a fully connected socket and associated cached
socket route.
Signed-off-by: David S. Miller <davem@davemloft.net>
We handle NULL in rt{,6}_set_peer but then our caller will try to pass
that NULL pointer into inet_putpeer() which isn't ready for it.
Fix this by moving the NULL check one level up, and then remove the
now unnecessary NULL check from inetpeer_ptr_set_peer().
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encode the pointer(s) into an unsigned long with one state bit.
The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.
Later the peer roots will be per-FIB table, and this change works to
facilitate that.
Signed-off-by: David S. Miller <davem@davemloft.net>
We only need one interface for this operation, since we always know
which inetpeer root we want to flush.
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one. So add an rt{,6}_get_peer_create()
for their sake.
There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.
Signed-off-by: David S. Miller <davem@davemloft.net>
add struct net as a parameter of inet_getpeer_v[4,6],
use net to replace &init_net.
and modify some places to provide net for inet_getpeer_v[4,6]
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly bool conversions, some inline removals and const additions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add #define pr_fmt(fmt) as appropriate.
Add "IPv6: " to appropriate files.
Convert printk(KERN_<LEVEL> to pr_<level> (but not KERN_DEBUG).
Standardize on "%s: " not "%s(): " when emitting __func__.
Use "%s: ", __func__ instead of embedding function name.
Coalesce formats, align arguments.
ADDRCONF output is now prefixed with "IPv6: "
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Standardize the net core ratelimited logging functions.
Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/atheros/atlx/atl1.c
drivers/net/ethernet/atheros/atlx/atl1.h
Resolved a conflict between a DMA error bug fix and NAPI
support changes in the atl1 driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use of "unsigned int" is preferred to bare "unsigned" in net tree.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the ipv6 dst cache which copy from the dst generated by ICMPV6 RA packet.
this dst cache will not check expire because it has no RTF_EXPIRES flag.
So this dst cache will always be used until the dst gc run.
Change the struct dst_entry,add a union contains new pointer from and expires.
When rt6_info.rt6i_flags has no RTF_EXPIRES flag,the dst.expires has no use.
we can use this field to point to where the dst cache copy from.
The dst.from is only used in IPV6.
rt6_check_expired check if rt6_info.dst.from is expired.
ip6_rt_copy only set dst.from when the ort has flag RTF_ADDRCONF
and RTF_DEFAULT.then hold the ort.
ip6_dst_destroy release the ort.
Add some functions to operate the RTF_EXPIRES flag and expires(from) together.
and change the code to use these new adding functions.
Changes from v5:
modify ip6_route_add and ndisc_router_discovery to use new adding functions.
Only set dst.from when the ort has flag RTF_ADDRCONF
and RTF_DEFAULT.then hold the ort.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In 72331bc [ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be
consistent with ipv4] the code of 'inet6_rtm_getroute()' was re-ordered
such that the reference to 'rt->dst' is incremented prior skb
allocation.
Hence, if 'alloc_skb()' fails, must drop a reference from 'rt->dst'.
Add the missing 'dst_release()' call.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.
Signed-off-by: David S. Miller <davem@davemloft.net>
In IPv4, if an RTA_IIF attribute is specified within an RTM_GETROUTE
message, then a route is searched as if a packet was received on the
specified 'iif' interface.
However in IPv6, RTA_IIF is not interpreted in the same way:
'inet6_rtm_getroute()' always calls 'ip6_route_output()', regardless the
RTA_IIF attribute.
As a result, in IPv6 there's no way to use RTM_GETROUTE in order to look
for a route as if a packet was received on a specific interface.
Fix 'inet6_rtm_getroute()' so that RTA_IIF is interpreted as "lookup a
route as if a packet was received on the specified interface", similar
to IPv4's 'inet_rtm_getroute()' interpretation.
Reported-by: Ami Koren <amikoren@yahoo.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f2c31e32b3 (net: fix NULL dereferences in check_peer_redir() )
added a regression in rt6_fill_node(), leading to rcu_read_lock()
imbalance.
Thats because NLA_PUT() can make a jump to nla_put_failure label.
Fix this by using nla_put()
Many thanks to Ben Greear for his help
Reported-by: Ben Greear <greearb@candelatech.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 87a115783 ( ipv6: Move xfrm_lookup() call down into
icmp6_dst_alloc().) forgot to convert one error path, leading
to crashes in mld_sendpack()
Many thanks to Dave Jones for providing a very complete bug report.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the future the ipv4/ipv6 route gateway will take on two types
of values:
1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case
the neighbour must be obtained using the destination address in
ipv4/ipv6 header as the lookup key.
2) Everything else, the actual nexthop route address.
So if the gateway is not inaddr-any we use it, otherwise we must use
the packet's destination address.
Signed-off-by: David S. Miller <davem@davemloft.net>
release idev when ip6_neigh_lookup failed in icmp6_dst_alloc
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During some debugging I needed to look into how /proc/net/ipv6_route
operated and in my digging I found its calling fib6_clean_all() which uses
"write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I
found this on 2.6.32, but reading the code I believe the same basic idea
exists currently. Looking at the rtnetlink code they are only calling
"read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize
reading from proc isn't the recommended way of fetching the ipv6 route
table; taking a write lock seems unnecessary and would probably cause
network performance issues.
To verify this I loaded up the ipv6 route table and then ran iperf in 3
cases:
* doing nothing
* reading ipv6 route table via proc
(while :; do cat /proc/net/ipv6_route > /dev/null; done)
* reading ipv6 route table via rtnetlink
(while :; do ip -6 route show table all > /dev/null; done)
* Load the ipv6 route table up with:
* for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done
* iperf commands:
* client: iperf -i 1 -V -c <ipv6 addr>
* server: iperf -V -s
* iperf results - 3 runs each (in Mbits/sec)
* nothing: client: 927,927,927 server: 927,927,927
* proc: client: 179,97,96,113 server: 142,112,133
* iproute: client: 928,927,928 server: 927,927,927
lock_stat shows taking the write lock is causing the slowdown. Using this
info I decided to write a version of fib6_clean_all() which replaces
write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
this new function I see the same results as with my rtnetlink iperf test.
Signed-off-by: Josh Hunt <joshhunt00@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some of the rt6_bind_neighbour() call sites, it hasn't hooked
up the rt->dst.dev pointer yet, so we'd deref a NULL pointer when
obtaining dev->ifindex for the neighbour hash function computation.
Just pass the netdevice explicitly in to fix this problem.
Reported-by: Bjarke Istrup Pedersen <gurligebis@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It just obscures that the netdevice pointer and the expires value are
implemented in the dst_entry sub-object of the ipv6 route.
And it makes grepping for dst_entry member uses much harder too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, create and use an rt6_bind_neighbour() in net/ipv6/route.c to
consolidate some common logic.
Signed-off-by: David S. Miller <davem@davemloft.net>
RDBG() wasn't even used, and the messages printed by RT6_DEBUG() were
far from useful. Just get rid of all this stuff, we can replace it
with something more suitable if we want.
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 8e2ec63917 ("ipv6: don't
use inetpeer to store metrics for routes.") the test in rt6_alloc_cow()
for setting the ANYCAST flag is now wrong.
'rt' will always now have a plen of 128, because it is set explicitly
to 128 by ip6_rt_copy.
So to restore the semantics of the test, check the destination prefix
length of 'ort'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't just succeed with a route that has a NULL neighbour attached.
This follows the behavior of addrconf_dst_alloc().
Allowing this kind of route to end up with a NULL neigh attached will
result in packet drops on output until the route is somehow
invalidated, since nothing will meanwhile try to lookup the neigh
again.
A statistic is bumped for the case where we see a neigh-less route on
output, but the resulting packet drop is otherwise silent in nature,
and frankly it's a hard error for this to happen and ipv6 should do
what ipv4 does which is say something in the kernel logs.
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
like rt6_lookup, but allows caller to pass in flowi6 structure.
Will be used by the upcoming ipv6 netfilter reverse path filter
match.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's only used in net/ipv6/route.c and the NULL device check is
superfluous for all of the existing call sites.
Just expand the __ndisc_lookup_errno() call at each location.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) x == NULL --> !x
2) x != NULL --> x
3) (x&BIT) --> (x & BIT)
4) (BIT1|BIT2) --> (BIT1 | BIT2)
5) proper argument and struct member alignment
Signed-off-by: David S. Miller <davem@davemloft.net>
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to invoke the dst_opt->default_mtu() method unconditioally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>