Add ability to return neighbour proxies list to caller if
it sent full ndmsg structure and has NTF_PROXY flag set.
Before this patch (and before iproute2 patches):
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
$
After it and with applied iproute2 patches:
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
2001::1 dev eth0 proxy
$
Compatibility with old versions of iproute2 is not broken,
kernel checks for incoming structure size and properly
works if old structure is came.
[v2]
* changed comments style.
* removed useless line with continue and curly bracket.
* changed incoming message size check from equal to more or
equal.
CC: davem@davemloft.net
CC: kuznet@ms2.inr.ac.ru
CC: netdev@vger.kernel.org
CC: xemul@parallels.com
Signed-off-by: Tony Zelenoff <antonz@parallels.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.
Which means we need 4 different hash randoms for ipv6.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 5c3ddec73d.
S390 qeth driver actually still uses the setup ops.
Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's simpler to just keep these things out until there is a real user
of them, so we can see what the needs actually are, rather than keep
these things around as useless overhead.
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
netdev->neigh_priv_len records the private area length.
This will trigger for neigh_table objects which set tbl->entry_size
to zero, and the first instances of this will be forthcoming.
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to alloc for device specific private areas for
neighbour entries, and in order to do that we have to move
away from the fixed allocation size enforced by using
neigh_table->kmem_cachep
As a nice side effect we can now use kfree_rcu().
Signed-off-by: David S. Miller <davem@davemloft.net>
Skip entries from foreign network namespaces.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> > ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit. The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.
Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.
[PATCH V5 net-next] neigh: new unresolved queue limits
unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.
$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
Whatever situations make this state legitimate when SMP
also would be legitimate when !SMP and f.e. preemption is
enabled.
This is dubious enough that we should just delete it entirely. If we
want to add debugging for neigh timer races, better more thorough
mechanisms are needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the artificial HZ latency on arp resolution.
Instead of firing a timer in one jiffy (up to 10 ms if HZ=100), lets
send the ARP message immediately.
Before patch :
# arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=9.91 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.061 ms
After patch :
$ arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.074 ms
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.
We will also be able to make dst entries neigh-less.
Signed-off-by: David S. Miller <davem@davemloft.net>
It's just taking on one of two possible values, either
neigh_ops->output or dev_queue_xmit(). And this is purely depending
upon whether nud_state has NUD_CONNECTED set or not.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that hh_cache entries are embedded inside of neighbour
entries, their lifetimes and accesses are now synchronous
to that of the encompassing neighbour object.
Therefore we don't need to hook up the blackhole op to
hh_output on destroy.
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:
1) dynamic allocation
2) attachment to dst->hh
3) refcounting
Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.
Signed-off-by: David S. Miller <davem@davemloft.net>
This never, ever, happens.
Neighbour entries are always tied to one address family, and therefore
one set of dst_ops, and therefore one dst_ops->protocol "hh_type"
value.
This capability was blindly imported by Alexey Kuznetsov when he wrote
the neighbour layer.
Signed-off-by: David S. Miller <davem@davemloft.net>
And mask the hash function result by simply shifting
down the "->hash_shift" most significant bits.
Currently which bits we use is arbitrary since jhash
produces entropy evenly across the whole hash function
result.
But soon we'll be using universal hashing functions,
and in those cases more entropy exists in the higher
bits than the lower bits, because they use multiplies.
Signed-off-by: David S. Miller <davem@davemloft.net>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
fix some minor issues and sparse (__rcu) warnings
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)
Dirtying takes place because of read_lock(&n->lock) and n->used writes.
Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new dst is used to send a frame, neigh_resolve_output() tries to
associate an struct hh_cache to this dst, calling neigh_hh_init() with
the neigh rwlock write locked.
Most of the time, hh_cache is already known and linked into neighbour,
so we find it and increment its refcount.
This patch changes the logic so that we call neigh_hh_init() with
neighbour lock read locked only, so that fast path can be run in
parallel by concurrent cpus.
This brings part of the speedup we got with commit c7d4426a98
(introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
removing one of the contention point that routers hit on multiqueue
enabled machines.
Further improvements would need to use a seqlock instead of an rwlock to
protect neigh->ha[], to not dirty neigh too often and remove two atomic
ops.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the second step for neighbour RCU conversion.
(first was commit d6bf7817 : RCU conversion of neigh hash table)
neigh_lookup() becomes lockless, but still take a reference on found
neighbour. (no more read_lock()/read_unlock() on tbl->lock)
struct neighbour gets an additional rcu_head field and is freed after an
RCU grace period.
Future work would need to eventually not take a reference on neighbour
for temporary dst (DST_NOCACHE), but this would need dst->_neighbour to
use a noref bit like we did for skb->_dst.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David
This is the first step for RCU conversion of neigh code.
Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.
Thanks
[PATCH net-next] net neigh: RCU conversion of neigh hash table
Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :
struct neigh_hash_table {
struct neighbour **hash_buckets;
unsigned int hash_mask;
__u32 hash_rnd;
struct rcu_head rcu;
};
And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.
This means the signature of (*hash)() function changed: We need to add a
third parameter with the actual hash_rnd value, since this is not
anymore a neigh_table field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_delete() and neigh_add() dont need to touch device refcount,
we hold RTNL when calling them, so device cannot disappear under us.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.
When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())
But we dont need to call neigh_hh_init() for dst that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on neigh->lock rwlock.
Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
inserted in route cache.
With the stress test bench, sending 160000000 frames on one neighbour,
results are :
Before patch:
real 2m28.406s
user 0m11.781s
sys 36m17.964s
After patch:
real 1m26.532s
user 0m12.185s
sys 20m3.903s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change "return (EXPR);" to "return EXPR;"
return is not a function, parentheses are not required.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When configuring DMVPN (GRE + openNHRP) and a GRE remote
address is configured a kernel Oops is observed. The
obserseved Oops is caused by a NULL header_ops pointer
(neigh->dev->header_ops) in neigh_update_hhs() when
void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
= neigh->dev->header_ops->cache_update;
is executed. The dev associated with the NULL header_ops is
the GRE interface. This patch guards against the
possibility that header_ops is NULL.
This Oops was first observed in kernel version 2.6.26.8.
Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 7fee226ad2 (net: add a noref bit on skb dst) missed one spot
where an skb is enqueued, with a possibly not refcounted dst entry.
__neigh_event_send() inserts skb into arp_queue, so we must make sure
dst entry is refcounted, or dst entry can be freed by garbage collector
after caller exits from rcu protected section.
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Annotates neigh_invalidate() with __releases() and __acquires() for
sparse sake.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls. This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.
Don't pass the binary sysctl path for neighour table entries
into neigh_sysctl_register. These parameters are no longer
used and so are just dead code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simpily pass 'struct neigh_table' with seq_file private pointer,
and save one dereference. Proc entry itself isn't interesting.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
mac80211: fix reorder buffer release
iwmc3200wifi: Enable wimax core through module parameter
iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
iwmc3200wifi: Coex table command does not expect a response
iwmc3200wifi: Update wiwi priority table
iwlwifi: driver version track kernel version
iwlwifi: indicate uCode type when fail dump error/event log
iwl3945: remove duplicated event logging code
b43: fix two warnings
ipw2100: fix rebooting hang with driver loaded
cfg80211: indent regulatory messages with spaces
iwmc3200wifi: fix NULL pointer dereference in pmkid update
mac80211: Fix TX status reporting for injected data frames
ath9k: enable 2GHz band only if the device supports it
airo: Fix integer overflow warning
rt2x00: Fix padding bug on L2PAD devices.
WE: Fix set events not propagated
b43legacy: avoid PPC fault during resume
b43: avoid PPC fault during resume
tcp: fix a timewait refcnt race
...
Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
kernel/sysctl_check.c
net/ipv4/sysctl_net_ipv4.c
net/ipv6/addrconf.c
net/sctp/sysctl.c
Generated with the following semantic patch
@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)
@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)
applied over {include,net,drivers/net}.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sys_sysctl is a compatiblity wrapper around /proc/sys
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables are unused, and can be
revmoed.
In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and it's callers have been modified not
to pass one.
Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Current neigh_periodic_timer() function is fired by timer IRQ, and
scans one hash bucket each round (very litle work in fact)
As we are supposed to scan whole hash table in 15 seconds, this means
neigh_periodic_timer() can be fired very often. (depending on the number
of concurrent hash entries we stored in this table)
Converting this to a workqueue permits scanning whole table, minimizing
icache pollution, and firing this work every 15 seconds, independantly
of hash table size.
This 15 seconds delay is not a hard number, as work is a deferrable one.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename lookup_neigh_params to lookup_neigh_parms as the struct is named
neigh_parms and all other functions dealing with the struct carry
neigh_parms in their names.
Signed-off-by: Tobias Klauser <klto@zhaw.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code errors out the INCOMPLETE neigh entry skb queue only from
the timer if maximum probes have been attempted and there has been no reply.
This also causes the transtion to FAILED state.
However, the neigh entry can be also updated via Netlink to inform that the
address is unavailable. Currently, neigh_update() just stops the timers and
leaves the pending skb's unreleased. This results that the clean up code in
the timer callback is never called, preventing also proper garbage collection.
This fixes neigh_update() to process the pending skb queue immediately if
INCOMPLETE -> FAILED state transtion occurs due to a Netlink request.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is possible to do just about everything with the arp table
from user space except treat an entry like you are using it. To that end
implement and a flag NTF_USE that when set in a netwlink update request
treats the neighbour table entry like the kernel does on the output path.
This allows user space applications to share the kernel's arp cache.
Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some pointless conditionals before kfree_skb().
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the return value of nlmsg_notify() as follows:
If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.
This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
neightbl_dump_info and neigh_dump_table can skip entries if the
*fill*info functions return an error. This results in an incomplete
dump ((invoked by netlink requests for RTM_GETNEIGHTBL or
RTM_GETNEIGH)
nidx and idx should not be incremented if the current entry was not
placed in the output buffer
Signed-off-by: Gautam Kachroo <gk@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In future all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids. So use that instead of NR_CPUS in iterators
and other comparisons.
This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves neigh_setup and hard_start_xmit into the network device ops
structure. For bisection, fix all the previously converted drivers as well.
Bonding driver took the biggest hit on this.
Added a prefetch of the hard_start_xmit in the fast path to try and reduce
any impact this would have.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using read_pnet() and write_pnet() in neighbour code ease the reading
of code.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
->pde isn't actually needed, since name is stashed in ->id.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I want to compile out proc_* and sysctl_* handlers totally and
stub them to NULL depending on config options, however usage of &
will prevent this, since taking adress of NULL pointer will break
compilation.
So, drop & in front of every ->proc_handler and every ->strategy
handler, it was never needed in fact.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
call_rcu() will unconditionally rewrite RCU head anyway.
Applies to
struct neigh_parms
struct neigh_table
struct net
struct cipso_v4_doi
struct in_ifaddr
struct in_device
rt->u.dst
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When pneigh entries exist, but the user's read buffer isn't sufficient to
hold them all, one of the pneigh entries will be missing from the results.
In neigh_get_idx_any, the number of elements which neigh_get_idx
encountered is not correctly subtracted from the position number before
the call to pneigh_get_idx. neigh_get_idx reduces the position by 1 for
each call to neigh_get_next, but it does not reduce it by one for the
first element (neigh_get_first). The patch alters the neigh_get_idx and
pneigh_get_idx functions to subtract one from pos, for the first element,
when pos is non-zero.
Signed-off-by: Chris Larson <clarson@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_seq_next won't be called both with *pos > 0 && v ==
SEQ_START_TOKEN, so there's no point calling neigh_get_idx when we're
on the start token, just call neigh_get_first directly.
Signed-off-by: Chris Larson <clarson@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in __neigh_event_send, if we have a neighbour entry which is in
NUD_INCOMPLETE state, we enqueue any outbound frames to that neighbour
to the neighbours arp_queue, which is default capped to a length of 3
skbs. If that queue exceeds its set length, it will drop an skb on
the queue to enqueue the newly arrived skb. This results in a drop
for which we have no statistics incremented. This patch adds an
unresolved_discards stat to /proc/net/stat/ndisc_cache to track these
lost frames.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make nlmsg_trim(), nlmsg_cancel(), genlmsg_cancel(), and
nla_nest_cancel() void functions.
Return -EMSGSIZE instead of -1 if the provided message buffer is not
big enough.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neighbor table time of last use information is returned in the
incorrect unit. Kernel to user space ABI's need to use USER_HZ (or
milliseconds), otherwise the application has to try and discover the
real system HZ value which is problematic. Linux has standardized on
keeping USER_HZ consistent (100hz) even when kernel is running
internally at some other value.
This change is small, but it breaks the ABI for older version of
iproute2 utilities. But these utilities are already broken since they
are looking at the psched_hz values which are completely different. So
let's just go ahead and fix both kernel and user space. Older
utilities will just print wrong values.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simply replace proc_create and further data assigned with proc_create_data.
Additionally, there is no need to assign NULL to PDE->data after creation,
/proc generic has already done this for us.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract hash function for pneigh entries from pneigh_lookup(),
__pneigh_lookup() and pneigh_delete() as pneigh_hash().
Extract core of pneigh_lookup() and __pneigh_lookup() as
__pneigh_lookup_1().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.
We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce neigh_parms/pneigh_entry inlines: neigh_parms_net(), pneigh_net().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Proxy neighbors do not have any reference counting, so any caller
of pneigh_lookup (unless it's a netlink triggered add/del routine)
should _not_ perform any actions on the found proxy entry.
There's one exception from this rule - the ipv6's ndisc_recv_ns()
uses found entry to check the flags for NTF_ROUTER.
This creates a race between the ndisc and pneigh_delete - after
the pneigh is returned to the caller, the nd_tbl.lock is dropped
and the deleting procedure may proceed.
One of the fixes would be to add a reference counting, but this
problem exists for ndisc only. Besides such a patch would be too
big for -rc4.
So I propose to introduce a __pneigh_lookup() which is supposed
to be called with the lock held and use it in ndisc code to check
the flags on alive pneigh entry.
Changes from v2:
As David noticed, Exported the __pneigh_lookup() to ipv6 module.
The checkpatch generates a warning on it, since the EXPORT_SYMBOL
does not follow the symbol itself, but in this file all the
exports come at the end, so I decided no to break this harmony.
Changes from v1:
Fixed comments from YOSHIFUJI - indentation of prototype in header
and the pndisc_check_router() name - and a compilation fix, pointed
by Daniel - the is_routed was (falsely) considered as uninitialized
by gcc.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_update sends skb from neigh->arp_queue while neigh_timer_handler
has increased skbs refcount and calls solicit with the
skb. neigh_timer_handler should not increase skbs refcount but make a
copy of the skb and do solicit with the copy.
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default ARP parameters should be findable regardless of the context.
Required to make inetdev_event working.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_sysctl_register should register sysctl entries inside correct namespace
to avoid naming conflict. Typical example is a loopback. Entries for it
present in all namespaces.
Required to make inetdev_event working.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use proc_create() to make sure that ->proc_fops be setup before gluing
PDE to main tree.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neigh_hash_grow() may update the tbl->hash_rnd value, which
is used in all tbl->hash callbacks to calculate the hashval.
Two lookup routines may race with this, since they call the
->hash callback without the tbl->lock held. Since the hash_rnd
is changed with this lock write-locked moving the calls to ->hash
under this lock read-locked closes this gap.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
release_net is missed on the error path in pneigh_lookup.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 69cc64d8d9.
It causes recursive locking in IPV6 because unlike other
neighbour layer clients, it even needs neighbour cache
entries to send neighbour soliciation messages :-(
We'll have to find another way to fix this race.
Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Blaschka provided the bug report and the initial suggested fix
for this bug. He also validated this version of this fix.
The problem is that the access to neigh->arp_queue is inconsistent, we
grab references when dropping the lock lock to call
neigh->ops->solicit() but this does not prevent other threads of
control from trying to send out that packet at the same time causing
corruptions because both code paths believe they have exclusive access
to the skb.
The best option seems to be to hold the write lock on neigh->lock
during the ->solicit() call. I looked at all of the ndisc_ops
implementations and this seems workable. The only case that needs
special care is the IPV4 ARP implementation of arp_solicit(). It
wants to take neigh->lock as a reader to protect the header entry in
neigh->ha during the emission of the soliciation. We can simply
remove the read lock calls to take care of that since holding the lock
as a writer at the caller providers a superset of the protection
afforded by the existing read locking.
The rest of the ->solicit() implementations don't care whether the
neigh is locked or not.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make them static.
[ Moved the inline before, instead of after, call sites. -DaveM ]
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need for this. It is declared in the neighbour.h
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Valid network device is always passed into neigh_param_alloc, so
remove extra checking for dev == NULL. Additionally, cleanup bogus
netns assignment.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
seq_open_net requires that first field of the seq->private data to be
struct seq_net_private. In reality this is a single pointer to a
struct net for now. The patch makes code consistent.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __acquires() and __releases() annotations to suppress some sparse
warnings.
example of warnings :
net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm actually surprised at how much was involved. At first glance it
appears that the neighbour table data structures are already split by
network device so all that should be needed is to modify the user
interface commands to filter the set of neighbours by the network
namespace of their devices.
However a couple things turned up while I was reading through the
code. The proxy neighbour table allows entries with no network
device, and the neighbour parms are per network device (except for the
defaults) so they now need a per network namespace default.
So I updated the two structures (which surprised me) with their very
own network namespace parameter. Updated the relevant lookup and
destroy routines with a network namespace parameter and modified the
code that interacts with users to filter out neighbour table entries
for devices of other namespaces.
I'm a little concerned that we can modify and display the global table
configuration and from all network namespaces. But this appears good
enough for now.
I keep thinking modifying the neighbour table to have per network
namespace instances of each table type would should be cleaner. The
hash table is already dynamically sized so there are it is not a
limiter. The default parameter would be straight forward to take care
of. However when I look at the how the network table is built and
used I still find some assumptions that there is only a single
neighbour table for each type of table in the kernel. The netlink
operations, neigh_seq_start, the non-core network users that call
neigh_lookup. So while it might be doable it would require more
refactoring than my current approach of just doing a little extra
filtering in the code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neigh_del_timer() looks sane - it removes the timer and
(conditionally) puts the neighbor. I expected, that the
neigh_add_timer() is symmetrical to the del one - i.e. it
holds the neighbor and arms the timer - but it turned out
that it was not so.
I think, that making them look symmetrical makes the code
more readable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The appropriate path is prepared right inside this function. It
is prepared similar to how the ctl tables were.
Since the path is modified, it is put on the stack, to avoid
possible races with multiple calls to neigh_sysctl_register() : it
is called by protocols and I didn't find any protection in this
case. Did I overlooked the rtnl lock?.
The stack growth of the neigh_sysctl_register() is 40 bytes. I
believe this is OK, since this is not that much and this function
is not called with the deep stack (device/protocols register).
The device's name is stored on the template to free it later.
This will help with the net namespaces, as each namespace should
have its own set of these ctls.
Besides, this saves ~350 bytes from the neigh template :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This mainly removes the err variable, as this call always
return the same error code (-ENOBUFS).
Besides, I moved the call to kmalloc() from the *t declaration
into the code (this is confusing when a variable is initialized
with the result of some call) and removed unneeded comment near
the error path.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.
Changes from v2:
- IPv6 addrlabel processing
Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.
Changes from v1:
- added IPv6 addrlabel protection
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9cd4002942 (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:
http://bugzilla.kernel.org/show_bug.cgi?id=9778
for example.
Signed-off-by: David S. Miller <davem@davemloft.net>
The neightbl_fill_parms() is called under the write-locked tbl->lock
and accesses the parms->dev. The negh_parm_release() calls the
dev_put(parms->dev) without this lock. This creates a tiny race window
on which the parms contains potentially stale dev pointer.
To fix this race it's enough to move the dev_put() upper under the
tbl->lock, but note, that the parms are held by neighbors and thus can
live after the neigh_parms_release() is called, so we still can have a
parm with bad dev pointer.
I didn't find where the neigh->parms->dev is accessed, but still think
that putting the dev is to be done in a place, where the parms are
really freed. Am I right with that?
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>