Introduce fib6_nh structure and move nexthop related data from
rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
lwtstate from a FIB lookup perspective are converted to use fib6_nh;
datapath references to dst version are left as is.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
and dst.error. Since dst is going to be removed, it can no longer be
relied on for FIB dumps so save the route type as fib6_type.
fc_type is set in current users based on the algorithm in rt6_fill_node:
- rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
- rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
- else fc_type = RTN_UNICAST
Similarly, fib6_type is set in the rt6_info templates based on the
RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass network namespace reference into route add, delete and get
functions.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass net namespace to fib6_update_sernum. It can not be marked const
as fib6_new_sernum will change ipv6.fib6_sernum.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move commonly used pattern of ip6_dst_store() usage to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.
There is no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Donald reported that IPv6 route leaking between VRFs is not working.
The root cause is the strict argument in the call to rt6_lookup when
validating the nexthop spec.
ip6_route_check_nh validates the gateway and device (if given) of a
route spec. It in turn could call rt6_lookup (e.g., lookup in a given
table did not succeed so it falls back to a full lookup) and if so
sets the strict argument to 1. That means if the egress device is given,
the route lookup needs to return a result with the same device. This
strict requirement does not work with VRFs (IPv4 or IPv6) because the
oif in the flow struct is overridden with the index of the VRF device
to trigger a match on the l3mdev rule and force the lookup to its table.
The right long term solution is to add an l3mdev index to the flow
struct such that the oif is not overridden. That solution will not
backport well, so this patch aims for a simpler solution to relax the
strict argument if the route spec device is an l3mdev slave. As done
in other places, use the FLOWI_FLAG_SKIP_NH_OIF to know that the
RT6_LOOKUP_F_IFACE flag needs to be removed.
Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not allow setting ipv6 routes from userspace if disable_ipv6 has been
enabled. The issue can be triggered using the following reproducer:
- sysctl net.ipv6.conf.all.disable_ipv6=1
- ip -6 route add a🅱️c:d::/64 dev em1
- ip -6 route show
a🅱️c:d::/64 dev em1 metric 1024 pref medium
Fix it checking disable_ipv6 value in ip6_route_info_create routine
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer the direct use of octal for permissions.
Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.
Miscellanea:
o Whitespace neatening around these conversions.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For multipath routes the ONLINK flag can be specified per nexthop in
rtnh_flags or globally in rtm_flags. Update ip6_route_multipath_add
to consider the ONLINK setting coming from rtnh_flags. Each loop over
nexthops the config for the sibling route is initialized to the global
config and then per nexthop settings overlayed. The flag is 'or'ed into
fib6_config to handle the ONLINK flag coming from either rtm_flags or
rtnh_flags.
Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_chk_addr_and_flags determines if an address is a local address and
optionally if it is an address on a specific device. For example, it is
called by ip6_route_info_create to determine if a given gateway address
is a local address. The address check currently does not consider L3
domains and as a result does not allow a route to be added in one VRF
if the nexthop points to an address in a second VRF. e.g.,
$ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23
Error: Invalid gateway address.
where 2001:db8:102::23 is an address on an interface in vrf r1.
ipv6_chk_addr_and_flags needs to allow callers to always pass in a device
with a separate argument to not limit the address to the specific device.
The device is used used to determine the L3 domain of interest.
To that end add an argument to skip the device check and update callers
to always pass a device where possible and use the new argument to mean
any address in the domain.
Update a handful of users of ipv6_chk_addr with a NULL dev argument. This
patch handles the change to these callers without adding the domain check.
ip6_validate_gw needs to handle 2 cases - one where the device is given
as part of the nexthop spec and the other where the device is resolved.
There is at least 1 VRF case where deferring the check to only after
the route lookup has resolved the device fails with an unintuitive error
"RTNETLINK answers: No route to host" as opposed to the preferred
"Error: Gateway can not be a local address." The 'no route to host'
error is because of the fallback to a full lookup. The check is done
twice to avoid this error.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move gateway validation code from ip6_route_info_create into
ip6_validate_gw. Code move plus adjustments to handle the potential
reset of dev and idev and to make checkpatch happy.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2018-03-13
1) Refuse to insert 32 bit userspace socket policies on 64
bit systems like we do it for standard policies. We don't
have a compat layer, so inserting socket policies from
32 bit userspace will lead to a broken configuration.
2) Make the policy hold queue work without the flowcache.
Dummy bundles are not chached anymore, so we need to
generate a new one on each lookup as long as the SAs
are not yet in place.
3) Fix the validation of the esn replay attribute. The
The sanity check in verify_replay() is bypassed if
the XFRM_STATE_ESN flag is not set. Fix this by doing
the sanity check uncoditionally.
From Florian Westphal.
4) After most of the dst_entry garbage collection code
is removed, we may leak xfrm_dst entries as they are
neither cached nor tracked somewhere. Fix this by
reusing the 'uncached_list' to track xfrm_dst entries
too. From Xin Long.
5) Fix a rcu_read_lock/rcu_read_unlock imbalance in
xfrm_get_tos() From Xin Long.
6) Fix an infinite loop in xfrm_get_dst_nexthop. On
transport mode we fetch the child dst_entry after
we continue, so this pointer is never updated.
Fix this by fetching it before we continue.
7) Fix ESN sequence number gap after IPsec GSO packets.
We accidentally increment the sequence number counter
on the xfrm_state by one packet too much in the ESN
case. Fix this by setting the sequence number to the
correct value.
8) Reset the ethernet protocol after decapsulation only if a
mac header was set. Otherwise it breaks configurations
with TUN devices. From Yossi Kuperman.
9) Fix __this_cpu_read() usage in preemptible code. Use
this_cpu_read() instead in ipcomp_alloc_tfms().
From Greg Hackmann.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, administrative MTU changes on a given netdevice are
not reflected on route exceptions for MTU-less routes, with a
set PMTU value, for that device:
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref medium
# ping6 -c 1 -q -s10000 2001:db8::b > /dev/null
# ip netns exec a ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
# ip link set dev vti_a mtu 3000
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
# ip link set dev vti_a mtu 9000
# ip -6 route get 2001:db8::b
2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
cache expires 571sec mtu 4926 pref medium
The first issue is that since commit fb56be83e4 ("net-ipv6: on
device mtu change do not add mtu to mtu-less routes") we don't
call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(),
which handles administrative MTU changes, if the regular route
is MTU-less.
However, PMTU exceptions should be always updated, as long as
RTAX_MTU is not locked. Keep the check for MTU-less main route,
as introduced by that commit, but, for exceptions,
call rt6_exceptions_update_pmtu() regardless of that check.
Once that is fixed, one problem remains: MTU changes are not
reflected if the new MTU is higher than the previous one,
because rt6_exceptions_update_pmtu() doesn't allow that. We
should instead allow PMTU increase if the old PMTU matches the
local MTU, as that implies that the old MTU was the lowest in the
path, and PMTU discovery might lead to different results.
The existing check in rt6_mtu_change_route() correctly took that
case into account (for regular routes only), so factor it out
and re-use it also in rt6_exceptions_update_pmtu().
While at it, fix comments style and grammar, and try to be a bit
more descriptive.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: fb56be83e4 ("net-ipv6: on device mtu change do not add mtu to mtu-less routes")
Fixes: f5bbe7ee79 ("ipv6: prepare rt6_mtu_change() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some operators prefer IPv6 path selection to use a standard 5-tuple
hash rather than just an L3 hash with the flow the label. To that end
add support to IPv6 for multipath hash policy similar to bf4e0a3db9
("net: ipv4: add support for ECMP hash policy choice"). The default
is still L3 which covers source and destination addresses along with
flow label and IPv6 protocol.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 does path selection for multipath routes deep in the lookup
functions. The next patch adds L4 hash option and needs the skb
for the forward path. To get the skb to the relevant FIB lookup
functions it needs to go through the fib rules layer, so add a
lookup_data argument to the fib_lookup_arg struct.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make rt6_multipath_hash more of a direct parallel to fib_multipath_hash
and reduce stack and overhead in the process: get_hash_from_flowi6 is
just a wrapper around __get_hash_from_flowi6 with another stack
allocation for flow_keys. Move setting the addresses, protocol and
label into rt6_multipath_hash and allow it to make the call to
flow_hash_from_keys.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Symmetry is good and allows easy comparison that ipv4 and ipv6 are
doing the same thing. To that end, change ip_multipath_l3_keys to
set addresses at the end after the icmp compares, and move the
initialization of ipv6 flow keys to rt6_multipath_hash.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dissect flow in fwd path if fib rules require it. Controlled by
a flag to avoid penatly for the common case. Flag is set when fib
rules with sport, dport and proto match that require flow dissect
are installed. Also passes the dissected hash keys to the multipath
hash function when applicable to avoid dissecting the flow again.
icmp packets will continue to use inner header for hash
calculations.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net->ipv6.peers is dereferenced in three places via inet_getpeer_v6(),
and it's used to handle skb. All the users of inet_getpeer_v6() do not
look like be able to be called from foreign net pernet_operations, so
we may mark them as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations create and destroy /proc entries
and safely may be converted and safely may be mark as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In early time, when freeing a xdst, it would be inserted into
dst_garbage.list first. Then if it's refcnt was still held
somewhere, later it would be put into dst_busy_list in
dst_gc_task().
When one dev was being unregistered, the dev of these dsts in
dst_busy_list would be set with loopback_dev and put this dev.
So that this dev's removal wouldn't get blocked, and avoid the
kmsg warning:
kernel:unregister_netdevice: waiting for veth0 to become \
free. Usage count = 2
However after Commit 52df157f17 ("xfrm: take refcnt of dst
when creating struct xfrm_dst bundle"), the xdst will not be
freed with dst gc, and this warning happens.
To fix it, we need to find these xdsts that are still held by
others when removing the dev, and free xdst's dev and set it
with loopback_dev.
But unfortunately after flow_cache for xfrm was deleted, no
list tracks them anymore. So we need to save these xdsts
somewhere to release the xdst's dev later.
To make this easier, this patch is to reuse uncached_list to
track xdsts, so that the dev refcnt can be released in the
event NETDEV_UNREGISTER process of fib_netdev_notifier.
Thanks to Florian, we could move forward this fix quickly.
Fixes: 52df157f17 ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
Reported-by: Jianlin Shi <jishi@redhat.com>
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
IPv4 uses set_lwt_redirect to set the lwtunnel redirect functions as
needed. Move it to lwtunnel.h as lwtunnel_set_redirect and change
IPv6 to also use it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of differences in how ipv4 and ipv6 handle fib lookups,
verification of nexthops with onlink flag need to default to the main
table rather than the local table used by IPv4. As it stands an
address within a connected route on device 1 can be used with
onlink on device 2. Updating the table properly rejects the route
due to the egress device mismatch.
Update the extack message as well to show it could be a device
mismatch for the nexthop spec.
Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Verification of nexthops with onlink flag need to handle unreachable
routes. The lookup is only intended to validate the gateway address
is not a local address and if the gateway resolves the egress device
must match the given device. Hence, hitting any default reject route
is ok.
Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In current route cache aging logic, if a route has both RTF_EXPIRE and
RTF_GATEWAY set, the route will only be removed if the neighbor cache
has no NTF_ROUTER flag. Otherwise, even if the route has expired, it
won't get deleted.
Fix this logic to always check if the route has expired first and then
do the gateway neighbor cache check if previous check decide to not
remove the exception entry.
Fixes: 1859bac04f ("ipv6: remove from fib tree aged out RTF_CACHE dst")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4 allow routes to be added with the RTNH_F_ONLINK flag.
The onlink option requires a gateway and a nexthop device. Any unicast
gateway is allowed (including IPv4 mapped addresses and unresolved
ones) as long as the gateway is not a local address and if it resolves
it must match the given device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
onlink verification needs to do a lookup in potentially different
table than the table in fib6_config and without the RT6_LOOKUP_F_IFACE
flag. Change ip6_nh_lookup_table to take table id and flags as input
arguments. Both verifications want to ignore link state, so add that
flag can stay in the lookup helper.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move existing code to validate nexthop into a helper. Follow on patch
adds support for nexthops marked with onlink, and this helper keeps
the complexity of ip6_route_info_create in check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 allows routes to be installed when the device is not up (admin up).
Worse, it does not mark it as LINKDOWN. IPv4 does not allow it and really
there is no reason for IPv6 to allow it, so check the flags and deny if
device is admin down.
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:
- if (de->proc_fops)
- inode->i_fop = de->proc_fops;
+ if (de->proc_fops) {
+ if (S_ISREG(inode->i_mode))
+ inode->i_fop = &proc_reg_file_ops;
+ else
+ inode->i_fop = de->proc_fops;
+ }
VFS stopped pinning module at this point.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Emil reported the following compiler errors:
net/ipv6/route.c: In function `rt6_sync_up`:
net/ipv6/route.c:3586: error: unknown field `nh_flags` specified in initializer
net/ipv6/route.c:3586: warning: missing braces around initializer
net/ipv6/route.c:3586: warning: (near initialization for `arg.<anonymous>`)
net/ipv6/route.c: In function `rt6_sync_down_dev`:
net/ipv6/route.c:3695: error: unknown field `event` specified in initializer
net/ipv6/route.c:3695: warning: missing braces around initializer
net/ipv6/route.c:3695: warning: (near initialization for `arg.<anonymous>`)
Problem is with the named initializers for the anonymous union members.
Fix this by adding curly braces around the initialization.
Fixes: 4c981e28d3 ("ipv6: Prepare to handle multiple netdev events")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Tested-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of hash-threshold instead of modulo-N makes it trivial to add
support for non-equal-cost multipath.
Instead of dividing the multipath hash function's output space equally
between the nexthops, each nexthop is assigned a region size which is
proportional to its weight.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that each nexthop stores its region boundary in the multipath hash
function's output space, we can use hash-threshold instead of modulo-N
in multipath selection.
This reduces the number of checks we need to perform during lookup, as
dead and linkdown nexthops are assigned a negative region boundary. In
addition, in contrast to modulo-N, only flows near region boundaries are
affected when a nexthop is added or removed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The hash thresholds assigned to IPv6 nexthops are in the range of
[-1, 2^31 - 1], where a negative value is assigned to nexthops that
should not be considered during multipath selection.
Therefore, in a similar fashion to IPv4, we need to use the upper
31-bits of the multipath hash for multipath selection.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before we convert IPv6 to use hash-threshold instead of modulo-N, we
first need each nexthop to store its region boundary in the hash
function's output space.
The boundary is calculated by dividing the output space equally between
the different active nexthops. That is, nexthops that are not dead or
linkdown.
The boundaries are rebalanced whenever a nexthop is added or removed to
a multipath route and whenever a nexthop becomes active or inactive.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.
Align IPv6 with IPv4 and have it conform to the same guidelines.
In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.
As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.
When this happens we need to be sure to bump the sernum of the nodes
where these are stored so that potential copies cached in sockets are
invalidated.
The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.
Have the function assume the lock is taken and make the single caller
take the lock itself.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now the RTNH_F_DEAD flag was only reported in route dump when
the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
dead routes were flushed otherwise.
The reliance on this sysctl is going to be removed, so we need to report
the flag regardless of the sysctl's value.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, dead routes are only present in the routing tables in case
the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are
flushed.
Subsequent patches are going to remove the reliance on this sysctl and
make IPv6 more consistent with IPv4.
Before this is done, we need to make sure dead routes are skipped during
route lookup, so as to not cause packet loss.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to previous patch, there is no need to check for the carrier of
the nexthop device when dumping the route and we can instead check for
the presence of the RTNH_F_LINKDOWN flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
need to dereference the nexthop device and check its carrier and instead
check for the presence of the flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is valid to install routes with a nexthop device that does not have a
carrier, so we need to make sure they're marked accordingly.
As explained in the previous patch, host and anycast routes are never
marked with the 'linkdown' flag.
Note that reject routes are unaffected, as these use the loopback device
which always has a carrier.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to IPv4, when the carrier of a netdev changes we should toggle
the 'linkdown' flag on all the nexthops using it as their nexthop
device.
This will later allow us to test for the presence of this flag during
route lookup and dump.
Up until commit 4832c30d54 ("net: ipv6: put host and anycast routes on
device with address") host and anycast routes used the loopback netdev
as their nexthop device and thus were not marked with the 'linkdown'
flag. The patch preserves this behavior and allows one to ping the local
address even when the nexthop device does not have a carrier and the
'ignore_routes_with_linkdown' sysctl is set.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.
Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.
Propagate the triggering event down, so that it could be used in later
patches.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
Clear these flags when the netdev comes back up.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev is put administratively down or unregistered all the
nexthops using it as their nexthop device should be marked with the
'dead' and 'linkdown' flags.
Currently, when a route is dumped its nexthop device is tested and the
flags are set accordingly. A similar check is performed during route
lookup.
Instead, we can simply mark the nexthops based on netdev events and
avoid checking the netdev's state during route dump and lookup.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By the time fib6_net_exit() is executed all the netdevs in the namespace
have been either unregistered or pushed back to the default namespace.
That is because pernet subsys operations are always ordered before
pernet device operations and therefore invoked after them during
namespace dismantle.
Thus, all the routing tables in the namespace are empty by the time
fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.
This allows us to simplify the condition in fib6_ifdown() as it's only
ever called with an actual netdev.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lots of overlapping changes. Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.
Include a necessary change by Jakub Kicinski, with log message:
====================
cls_bpf no longer takes care of offload tracking. Make sure
netdevsim performs necessary checks. This fixes a warning
caused by TC trying to remove a filter it has not added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, parameters such as oif and source address are not taken into
account during fibmatch lookup. Example (IPv4 for reference) before
patch:
$ ip -4 route show
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1
$ ip -6 route show
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium
fe80::/64 dev dummy0 proto kernel metric 256 pref medium
fe80::/64 dev dummy1 proto kernel metric 256 pref medium
$ ip -4 route get fibmatch 192.0.2.2 oif dummy0
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
$ ip -4 route get fibmatch 192.0.2.2 oif dummy1
RTNETLINK answers: No route to host
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
After:
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
RTNETLINK answers: Network is unreachable
The problem stems from the fact that the necessary route lookup flags
are not set based on these parameters.
Instead of duplicating the same logic for fibmatch, we can simply
resolve the original route from its copy and dump it instead.
Fixes: 18c3a61c42 ("net: ipv6: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One example of when an ICMPv6 packet is required to be looped back is
when a host acts as both a Multicast Listener and a Multicast Router.
A Multicast Router will listen on address ff02::16 for MLDv2 messages.
Currently, MLDv2 messages originating from a Multicast Listener running
on the same host as the Multicast Router are not being delivered to the
Multicast Router. This is due to dst.input being assigned the default
value of dst_discard.
This results in the packet being looped back but discarded before being
delivered to the Multicast Router.
This patch sets dst.input to ip6_input to ensure a looped back packet
is delivered to the Multicast Router.
Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes __rtnl_register and switches callers to either
rtnl_register or rtnl_register_module.
Also, rtnl_register() will now print an error if memory allocation
failed rather than panic the kernel.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first member of an IPSEC route bundle chain sets it's dst->path to
the underlying ipv4/ipv6 route that carries the bundle.
Stated another way, if one were to follow the xfrm_dst->child chain of
the bundle, the final non-NULL pointer would be the path and point to
either an ipv4 or an ipv6 route.
This is largely used to make sure that PMTU events propagate down to
the correct ipv4 or ipv6 route.
When we don't have the top of an IPSEC bundle 'dst->path == dst'.
Move it down into xfrm_dst and key off of dst->xfrm.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
The dst->from value is only used by ipv6 routes to track where
a route "came from".
Any time we clone or copy a core ipv6 route in the ipv6 routing
tables, we have the copy/clone's ->from point to the base route.
This is used to handle route expiration properly.
Only ipv6 uses this mechanism, and only ipv6 code references
it. So it is safe to move it into rt6_info.
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Florian reported a breakage with anycast routes due to commit
4832c30d54 ("net: ipv6: put host and anycast routes on device with
address"). Prior to this commit anycast routes were added against the
loopback device causing repetitive route entries with no insight into
why they existed. e.g.:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev lo proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
anycast fe80:: dev lo proto kernel metric 0 pref medium
The point of commit 4832c30d54 is to add the routes using the device
with the address which is causing the route to be added. e.g.,:
$ ip -6 ro ls table local type anycast
anycast 2001:db8:1:: dev eth1 proto kernel metric 0 pref medium
anycast 2001:db8:2:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth2 proto kernel metric 0 pref medium
anycast fe80:: dev eth1 proto kernel metric 0 pref medium
For traffic to work as it did before, the dst device needs to be switched
to the loopback when the copy is created similar to local routes.
Fixes: 4832c30d54 ("net: ipv6: put host and anycast routes on device with address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the 'ignore_routes_with_linkdown' sysctl is set, we should not
consider linkdown nexthops during route lookup.
While the code correctly verifies that the initially selected route
('match') has a carrier, it does not perform the same check in the
subsequent multipath selection, resulting in a potential packet loss.
In case the chosen route does not have a carrier and the sysctl is set,
choose the initially selected route.
Fixes: 35103d1117 ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make default TCP default congestion control to a per namespace
value. This changes default congestion control to a pointer to congestion ops
(rather than implicit as first element of available lsit).
The congestion control setting of new namespaces is inherited
from the current setting of the root namespace.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cached routes should only be created by the system when receiving pmtu
discovery or ip redirect msg. Users should not be allowed to create
cached routes.
Furthermore, after the patch series to move cached routes into exception
table, user added cached routes will trigger the following warning in
fib6_add():
WARNING: CPU: 0 PID: 2985 at net/ipv6/ip6_fib.c:1137
fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 2985 Comm: syzkaller320388 Not tainted 4.14.0-rc3+ #74
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
panic+0x1e4/0x417 kernel/panic.c:181
__warn+0x1c4/0x1d9 kernel/panic.c:542
report_bug+0x211/0x2d0 lib/bug.c:183
fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
RIP: 0010:fib6_add+0x20d9/0x2c10 net/ipv6/ip6_fib.c:1137
RSP: 0018:ffff8801cf09f6a0 EFLAGS: 00010297
RAX: ffff8801ce45e340 RBX: 1ffff10039e13eec RCX: ffff8801d749c814
RDX: 0000000000000000 RSI: ffff8801d749c700 RDI: ffff8801d749c780
RBP: ffff8801cf09fa08 R08: 0000000000000000 R09: ffff8801cf09f360
R10: ffff8801cf09f2d8 R11: 1ffff10039c8befb R12: 0000000000000001
R13: dffffc0000000000 R14: ffff8801d749c700 R15: ffffffff860655c0
__ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1011
ip6_route_add+0x148/0x1a0 net/ipv6/route.c:2782
ipv6_route_ioctl+0x4d5/0x690 net/ipv6/route.c:3291
inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:521
sock_do_ioctl+0x65/0xb0 net/socket.c:961
sock_ioctl+0x2c2/0x440 net/socket.c:1058
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe
So we fix this by failing the attemp to add cached routes from userspace
with returning EINVAL error.
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rt6_select(), fn->leaf could be pointing to net->ipv6.ip6_null_entry.
In this case, we should directly return instead of trying to carry on
with the rest of the process.
If not, we could crash at:
spin_lock_bh(&leaf->rt6i_table->rt6_lock);
because net->ipv6.ip6_null_entry does not have rt6i_table set.
Syzkaller recently reported following issue on net-next:
Use struct sctp_sack_info instead
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
sctp: [Deprecated]: syz-executor4 (pid 26496) Use of struct sctp_assoc_value in delayed_ack socket option.
Use struct sctp_sack_info instead
CPU: 1 PID: 26523 Comm: syz-executor6 Not tainted 4.14.0-rc4+ #85
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801d147e3c0 task.stack: ffff8801a4328000
RIP: 0010:debug_spin_lock_before kernel/locking/spinlock_debug.c:83 [inline]
RIP: 0010:do_raw_spin_lock+0x23/0x1e0 kernel/locking/spinlock_debug.c:112
RSP: 0018:ffff8801a432ed70 EFLAGS: 00010207
RAX: dffffc0000000000 RBX: 0000000000000018 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 0000000000000000 RDI: 000000000000001c
RBP: ffff8801a432ed90 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff8482b279 R12: ffff8801ce2ff3a0
sctp: [Deprecated]: syz-executor1 (pid 26546) Use of int in maxseg socket option.
Use struct sctp_assoc_value instead
R13: dffffc0000000000 R14: ffff8801d971e000 R15: ffff8801ce2ff0d8
FS: 00007f56e82f5700(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001ddbc22000 CR3: 00000001a4a04000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
_raw_spin_lock_bh+0x39/0x40 kernel/locking/spinlock.c:175
spin_lock_bh include/linux/spinlock.h:321 [inline]
rt6_select net/ipv6/route.c:786 [inline]
ip6_pol_route+0x1be3/0x3bd0 net/ipv6/route.c:1650
sctp: [Deprecated]: syz-executor1 (pid 26576) Use of int in maxseg socket option.
Use struct sctp_assoc_value instead
TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending cookies. Check SNMP counters.
ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1843
fib6_rule_lookup+0x9e/0x2a0 net/ipv6/ip6_fib.c:309
ip6_route_output_flags+0x1f1/0x2b0 net/ipv6/route.c:1871
ip6_route_output include/net/ip6_route.h:80 [inline]
ip6_dst_lookup_tail+0x4ea/0x970 net/ipv6/ip6_output.c:953
ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1076
sctp_v6_get_dst+0x675/0x1c30 net/sctp/ipv6.c:274
sctp_transport_route+0xa8/0x430 net/sctp/transport.c:287
sctp_assoc_add_peer+0x4fe/0x1100 net/sctp/associola.c:656
__sctp_connect+0x251/0xc80 net/sctp/socket.c:1187
sctp_connect+0xb4/0xf0 net/sctp/socket.c:4209
inet_dgram_connect+0x16b/0x1f0 net/ipv4/af_inet.c:541
SYSC_connect+0x20a/0x480 net/socket.c:1642
SyS_connect+0x24/0x30 net/socket.c:1623
entry_SYSCALL_64_fastpath+0x1f/0xbe
Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The perf traces for ipv6 routing code show a relevant cost around
trace_fib6_table_lookup(), even if no trace is enabled. This is
due to the fib6_table de-referencing currently performed by the
caller.
Let's the tracing code pay this overhead, passing to the trace
helper the table pointer. This gives small but measurable
performance improvement under UDP flood.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 2b760fcf5c ("ipv6: hook up exception table to store
dst cache") partially reverted the commit 1e2ea8ad37 ("ipv6: set
dst.obsolete when a cached route has expired").
As a result, RTF_CACHE dst referenced outside the fib tree will
not be removed until the next sernum change; dst_check() does not
fail on aged-out dst, and dst->__refcnt can't decrease: the aged
out dst will stay valid for a potentially unlimited time after the
timeout expiration.
This change explicitly removes RTF_CACHE dst from the fib tree when
aged out. The rt6_remove_exception() logic will then obsolete the
dst and other entities will drop the related reference on next
dst_check().
pMTU exceptions are not aged-out, and are removed from the exception
table only when the - usually considerably longer - ip6_rt_mtu_expires
timeout expires.
v1 -> v2:
- do not touch dst.obsolete in rt6_remove_exception(), not needed
v2 -> v3:
- take care of pMTU exceptions, too
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the commit 2b760fcf5c ("ipv6: hook up exception table
to store dst cache"), the fib6 gc is not started after the
creation of a RTF_CACHE via a redirect or pmtu update, since
fib6_add() isn't invoked anymore for such dsts.
We need the fib6 gc to run periodically to clean the RTF_CACHE,
or the dst will stay there forever.
Fix it by explicitly calling fib6_force_start_gc() on successful
exception creation. gc_args->more accounting will ensure that
the gc timer will run for whatever time needed to properly
clean the table.
v2 -> v3:
- clarified the commit message
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of the | operator always leads to true which looks rather
suspect to me. Fix this by using & instead to just check the
RTF_CACHE entry bit.
Detected by CoverityScan, CID#1457734, #1457747 ("Wrong operator used")
Fixes: 35732d01fe ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently rt6_ex is being dereferenced before it is null checked
hence there is a possible null dereference bug. Fix this by only
dereferencing rt6_ex after it has been null checked.
Detected by CoverityScan, CID#1457749 ("Dereference before null check")
Fixes: 81eb8447da ("ipv6: take care of rt6_stats")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
per cpu allocations are already zeroed, no need to clear them again.
Fixes: d52d3997f8 ("ipv6: Create percpu rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent patch removed the dst_free() on the allocated
dst_entry in ipv6_blackhole_route(). The dst_free() marked
the dst_entry as dead and added it to the gc list. I.e. it
was setup for a one time usage. As a result we may now have
a blackhole route cached at a socket on some IPsec scenarios.
This makes the connection unusable.
Fix this by marking the dst_entry directly at allocation time
as 'dead', so it is used only once.
Fixes: 587fea7411 ("ipv6: mark DST_NOGC and remove the operation of dst_free()")
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido reported following splat and provided a patch.
[ 122.221814] BUG: using smp_processor_id() in preemptible [00000000] code: sshd/2672
[ 122.221845] caller is debug_smp_processor_id+0x17/0x20
[ 122.221866] CPU: 0 PID: 2672 Comm: sshd Not tainted 4.14.0-rc3-idosch-next-custom #639
[ 122.221880] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[ 122.221893] Call Trace:
[ 122.221919] dump_stack+0xb1/0x10c
[ 122.221946] ? _atomic_dec_and_lock+0x124/0x124
[ 122.221974] ? ___ratelimit+0xfe/0x240
[ 122.222020] check_preemption_disabled+0x173/0x1b0
[ 122.222060] debug_smp_processor_id+0x17/0x20
[ 122.222083] ip6_pol_route+0x1482/0x24a0
...
I believe we can simplify this code path a bit, since we no longer
hold a read_lock and need to release it to avoid a dead lock.
By disabling BH, we make sure we'll prevent code re-entry and
rt6_get_pcpu_route()/rt6_make_pcpu_route() run on the same cpu.
Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, most of the rt6_stats are not hooked up correctly. As the
last part of this patch series, hook up all existing rt6_stats and add
one new stat fib_rt_uncache to indicate the number of routes in the
uncached list.
For details of the stats, please refer to the comments added in
include/net/ip6_fib.h.
Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
under a lock. So atomic_t is used for them.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With all the preparation work before, we are now ready to replace rwlock
with rcu and spinlock in fib6_table.
That means now all fib6_node in fib6_table are protected by rcu. And
when freeing fib6_node, call_rcu() is used to wait for the rcu grace
period before releasing the memory.
When accessing fib6_node, corresponding rcu APIs need to be used.
And all previous sessions protected by the write lock will now be
protected by the spin lock per table.
All previous sessions protected by read lock will now be protected by
rcu_read_lock().
A couple of things to note here:
1. As part of the work of replacing rwlock with rcu, the linked list of
fn->leaf now has to be rcu protected as well. So both fn->leaf and
rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
used when manipulating them.
2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
and is tagged with __rcu and rcu APIs are used in corresponding places.
Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
thread. This makes the issue a bit complicated. We think a valid
solution for it is to let rt6_select() grab the tb6_lock if it decides
to change it. As it is not in the normal operation and only happens when
there is no valid neighbor cache for the route, we think the performance
impact should be low.
3. fib6_walk_continue() has to be called with tb6_lock held even in the
route dumping related functions, e.g. inet6_dump_fib(),
fib6_tables_dump() and ipv6_route_seq_ops. It is because
fib6_walk_continue() makes modifications to the walker structure, and so
are fib6_repair_tree() and fib6_del_route(). In order to do proper
syncing between them, we need to let fib6_walk_continue() hold the lock.
We may be able to do further improvement on the way we do the tree walk
to get rid of the need for holding the spin lock. But not for now.
4. When fib6_del_route() removes a route from the tree, we no longer
mark rt->dst.rt6_next to NULL to make simultaneous reader be able to
further traverse the list with rcu. However, rt->dst.rt6_next is only
valid within this same rcu period. No one should access it later.
5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
performed before we publish this route (either by linking it to fn->leaf
or insert it in the list pointed by fn->leaf) just to be safe because as
soon as we publish the route, some read thread will be able to access it.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After rwlock is replaced with rcu and spinlock, fib6_lookup() could
potentially return an intermediate node if other thread is doing
fib6_del() on a route which is the only route on the node so that
fib6_repair_tree() will be called on this node and potentially assigns
fn->leaf to the its child's fn->leaf.
In order to detect this situation in rt6_select(), we have to check if
fn->fn_bit is consistent with the key length stored in the route. And
depending on if the fn is in the subtree or not, the key is either
rt->rt6i_dst or rt->rt6i_src.
If any inconsistency is found, that means the node no longer holds valid
routes in it. So net->ipv6.ip6_null_entry is returned.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If rwlock is replaced with rcu and spinlock, it is possible that the
reader thread will see fn->leaf as NULL in the following scenarios:
1. fib6_add() is in progress and we have already inserted a new node but
not yet inserted the route.
2. fib6_del_route() is in progress and we have already set fn->leaf to
NULL but not yet freed the node because of rcu grace period.
This patch makes sure all the reader threads check fn->leaf first before
using it. And together with later patch to grab rcu_read_lock() and
rcu_dereference() fn->leaf, it makes sure reader threads are safe when
accessing fn->leaf.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With rwlock, it is safe to call dst_hold() in the read thread because
read thread is guaranteed to be separated from write thread.
However, after we replace rwlock with rcu, it is no longer safe to use
dst_hold(). A dst might already have been deleted but is waiting for the
rcu grace period to pass before freeing the memory when a read thread is
trying to do dst_hold(). This could potentially cause double free issue.
So this commit replaces all dst_hold() with dst_hold_safe() in all read
thread to avoid this double free issue.
And in order to make the code more compact, a new function ip6_hold_safe()
is introduced. It calls dst_hold_safe() first, and if that fails, it will
either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
NULL according to the caller's need.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
called with only rcu held. That means rt6 route deletion could happen
simultaneously with rt6_make_pcpu_rt(). This could potentially cause
memory leak if rt6_release() is called right before rt6_make_pcpu_rt()
on the same route.
This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt()
to make sure rt6_release() will not get triggered while
rt6_make_pcpu_rt() is in progress. And rt6_release() is called after
rt6_make_pcpu_rt() is finished.
Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
very slim chance that fib6_purge_rt() will be triggered unnecessarily
when deleting a route if ip6_pol_route() running on another thread picks
this route as well and tries to make pcpu cache for it.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit makes use of the exception hash table implementation to
store dst caches created by pmtu discovery and ip redirect into the hash
table under the rt_info and no longer inserts these routes into fib6
tree.
This makes the fib6 tree only contain static configured routes and could
now be protected by rcu instead of a rw lock.
With this change, in the route lookup related functions, after finding
the rt6_info with the longest prefix, we also need to search for the
exception table before doing backtracking.
In the route delete function, if the route being deleted is not a dst
cache, deletion of this route also need to flush the whole hash table
under it. If it is a dst cache, then only delete the cached dst in the
hash table.
Note: for fib6_walk_continue() function, w->root now is always pointing
to a root node considering that fib6_prune_clones() is removed from the
code. So we add a WARN_ON() msg to make sure w->root always points to a
root node and also removed the update of w->root in fib6_repair_tree().
This is a prerequisite for later patch because we don't need to make
w->root as rcu protected when replacing rwlock with RCU.
Also, we remove all prune related variables as it is no longer used.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib6_locate() is used to find the fib6_node according to the passed in
prefix address key. It currently tries to find the fib6_node with the
exact match of the passed in key. However, when we move cached routes
into the exception table, fib6_locate() will fail to find the fib6_node
for it as the cached routes will be stored in the exception table under
the fib6_node with the longest prefix match of the cache's dst addr key.
This commit adds a new parameter to let the caller specify if it needs
exact match or longest prefix match.
Right now, all callers still does exact match when calling
fib6_locate(). It will be changed in later commit where exception table
is hooked up to store cached routes.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If all dst cache entries are stored in the exception table under the
main route, we have to go through them during fib6_age() when doing
garbage collecting.
Introduce a new function rt6_age_exception() which goes through all dst
entries in the exception table and remove those entries that are expired.
This function is called in fib6_age() so that all dst caches are also
garbage collected.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we move all cached dst into the exception table under the main route,
current rt6_clean_tohost() will no longer be able to access them.
This commit makes fib6_clean_tohost() to also go through all cached
routes in exception table and removes cached gateway routes to the
passed in gateway.
This is a preparation in order to move all cached routes into the
exception table.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we move all cached dst into the exception table under the main route,
current rt6_mtu_change() will no longer be able to access them.
This commit makes rt6_mtu_change_route() function to also go through all
cached routes in the exception table under the main route and do proper
updates on the mtu.
This is a preparation in order to move all cached routes into the
exception table.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After we move cached dst entries into the exception table under its
parent route, current fib6_remove_prefsrc() no longer can access them.
This commit makes fib6_remove_prefsrc() also go through all routes
in the exception table to remove the pref src.
This is a preparation patch in order to move all cached dst into the
exception table.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a hash table into struct rt6_info in order to store dst caches
created by pmtu discovery and ip redirect in ipv6 routing code.
APIs to add dst cache, delete dst cache, find dst cache and update
dst cache in the hash table are implemented and will be used in later
commits.
This is a preparation work to move all cache routes into the exception
table instead of getting inserted into the fib6 tree.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now it doesn't check for the cached route expiration in ipv6's
dst_ops->check(), because it trusts dst_gc that would clean the
cached route up when it's expired.
The problem is in dst_gc, it would clean the cached route only
when it's refcount is 1. If some other module (like xfrm) keeps
holding it and the module only release it when dst_ops->check()
fails.
But without checking for the cached route expiration, .check()
may always return true. Meanwhile, without releasing the cached
route, dst_gc couldn't del it. It will cause this cached route
never to expire.
This patch is to set dst.obsolete with DST_OBSOLETE_KILL in .gc
when it's expired, and check obsolete != DST_OBSOLETE_FORCE_CHK
in .check.
Note that this is even needed when ipv6 dst_gc timer is removed
one day. It would set dst.obsolete in .redirect and .update_pmtu
instead, and check for cached route expiration when getting it,
just like what ipv4 route does.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c5cff8561d adds rcu grace period before freeing fib6_node. This
generates a new sparse warning on rt->rt6i_node related code:
net/ipv6/route.c:1394:30: error: incompatible types in comparison
expression (different address spaces)
./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
expression (different address spaces)
This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
rcu API is used for it.
After this fix, sparse no longer generates the above warning.
Fixes: c5cff8561d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt_cookie might be used uninitialized, fix this by
initializing it.
Fixes: c5cff8561d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow our callers to influence the choice of ECMP link by honoring the
hash passed together with the flow info. This allows for special
treatment of ICMP errors which we would like to route over the same path
as the IPv6 datagram that triggered the error.
Also go through rt6_multipath_hash(), in the usual case when we aren't
dealing with an ICMP error, so that there is one central place where
multipath hash is computed.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 644d0e6569 ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has
turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes
sense to keep it around. Also remove the accompanying comment that has
become outdated.
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When forwarding or sending out an ICMPv6 error, look at the embedded
packet that triggered the error and compute a flow hash over its
headers.
This let's us route the ICMP error together with the flow it belongs to
when multipath (ECMP) routing is in use, which in turn makes Path MTU
Discovery work in ECMP load-balanced or anycast setups (RFC 7690).
Granted, end-hosts behind the ECMP router (aka servers) need to reflect
the IPv6 Flow Label for PMTUD to work.
The code is organized to be in parallel with ipv4 stack:
ip_multipath_l3_keys -> ip6_multipath_l3_keys
fib_multipath_hash -> rt6_multipath_hash
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently keep rt->rt6i_node pointing to the fib6_node for the route.
And some functions make use of this pointer to dereference the fib6_node
from rt structure, e.g. rt6_check(). However, as there is neither
refcount nor rcu taken when dereferencing rt->rt6i_node, it could
potentially cause crashes as rt->rt6i_node could be set to NULL by other
CPUs when doing a route deletion.
This patch introduces an rcu grace period before freeing fib6_node and
makes sure the functions that dereference it takes rcu_read_lock().
Note: there is no "Fixes" tag because this bug was there in a very
early stage.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One nagging difference between ipv4 and ipv6 is host routes for ipv6
addresses are installed using the loopback device or VRF / L3 Master
device. e.g.,
2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium
local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium
Using the loopback device is convenient -- necessary for local tx, but
has some nasty side effects, most notably setting the 'lo' device down
causes all host routes for all local IPv6 address to be removed from the
FIB and completely breaks IPv6 networking across all interfaces.
This patch puts FIB entries for IPv6 routes against the device. This
simplifies the routes in the FIB, for example by making dst->dev and
rt6i_idev->dev the same (a future patch can look at removing the device
reference taken for rt6i_idev for FIB entries).
When copies are made on FIB lookups, the cloned route has dst->dev
set to loopback (or the L3 master device). This is needed for the
local Tx of packets to local addresses.
With fib entries allocated against the real network device, the addrconf
code that reinserts host routes on admin up of 'lo' is no longer needed.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a lock around one of the assignments prevents gcc from
tracking the state of the local 'fibmatch' variable, so it can no
longer prove that 'dst' is always initialized, leading to a bogus
warning:
net/ipv6/route.c: In function 'inet6_rtm_getroute':
net/ipv6/route.c:3659:2: error: 'dst' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This moves the other assignment into the same lock to shut up the
warning.
Fixes: 121622dba8 ("ipv6: route: make rtm_getroute not assume rtnl is locked")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>