When showing device statistics use RCU rather than read_lock(&dev_base_lock)
Compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU to walk list of network devices in qdisc dump.
This could be optimized for large number of devices.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not need to use read_lock(&dev_base_lock), use RCU instead.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change udp6_portaddr_hash() to use ipv6_addr_v4mapped()
inline instead of ipv6_addr_type().
Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch rearranges the SIT DF bit handling using the new IPIP DF
code. The only externally visible effect should be the case where
PMTU is enabled and the MTU is exactly 1280 bytes. In this case the
previous code would send packets out with DF off while the new code
would set the DF bit. This is inline with RFC 4213.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
Signed-off-by: David S. Miller <davem@davemloft.net>
Apparently, inet6_dump_addr() is not able to handle more than
64 ipv6 addresses per device. We must break from inner loops
in case skb is full, or else cursor is put at the end of list.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When handling large number of netdevice, inet6_dump_ifinfo()
is very slow because it has O(N^2) complexity.
Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table, and RCU lookups.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use guard DECLARE_SOCKADDR in a few more places which allow
us to catch if the structure copied back is too big.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP bind() can be O(N^2) in some pathological cases.
Thanks to secondary hash tables, we can make it O(N)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
net/fsl_pq_mdio: add module license GPL
can: fix WARN_ON dump in net/core/rtnetlink.c:rtmsg_ifinfo()
can: should not use __dev_get_by_index() without locks
hisax: remove bad udelay call to fix build error on ARM
ipip: Fix handling of DF packets when pmtudisc is OFF
qlge: Set PCIe reset type for EEH to fundamental.
qlge: Fix early exit from mbox cmd complete wait.
ixgbe: fix traffic hangs on Tx with ioatdma loaded
ixgbe: Fix checking TFCS register for TXOFF status when DCB is enabled
ixgbe: Fix gso_max_size for 82599 when DCB is enabled
macsonic: fix crash on PowerBook 520
NET: cassini, fix lock imbalance
ems_usb: Fix byte order issues on big endian machines
be2net: Bug fix to send config commands to hardware after netdev_register
be2net: fix to set proper flow control on resume
netfilter: xt_connlimit: fix regression caused by zero family value
rt2x00: Don't queue ieee80211 work after USB removal
Revert "ipw2200: fix oops on missing firmware"
decnet: netdevice refcount leak
netfilter: nf_nat: fix NAT issue in 2.6.30.4+
...
something-bility is spelled as something-blity
so a grep for 'blit' would find these lines
this is so trivial that I didn't split it by subsystem / copy
additional maintainers - all changes are to comments
The only purpose is to get fewer false positives when grepping
around the kernel sources.
Signed-off-by: Dirk Hohndel <hohndel@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This fixes the following bug in the current implementation of
net/xfrm: SAD entries timeouts do not count the time spent by the machine
in the suspended state. This leads to the connectivity problems because
after resuming local machine thinks that the SAD entry is still valid, while
it has already been expired on the remote server.
The cause of this is very simple: the timeouts in the net/xfrm are bound to
the old mod_timer() timers. This patch reassigns them to the
CLOCK_REALTIME hrtimer.
I have been using this version of the patch for a few months on my
machines without any problems. Also run a few stress tests w/o any
issues.
This version of the patch uses tasklet_hrtimer by Peter Zijlstra
(commit 9ba5f0).
This patch is against 2.6.31.4. Please CC me.
Signed-off-by: Yury Polyanskiy <polyanskiy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds compat_ioctl support for SIOCWANDEV, which has
always been missing.
The definition of struct compat_ifreq was missing an
ifru_settings fields that is needed to support SIOCWANDEV,
so add that and clean up the whitespace damage in the
struct definition.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCGMIIPHY and SIOCGMIIREG return data through ifreq,
so it needs to be converted on the way out as well.
SIOCGIFPFLAGS is unused, but has the same problem in theory.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb_clone() fails, we should increment sk_drops and SNMP counters.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPV6 UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.
Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.
Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.
It's also a base for a future RCU conversion of multicast recption.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.
If too many sockets are listed, we take a look at the secondary
(port, address) hash chain.
We choose the shortest chain and proceed with a RCU lookup on the elected chain.
But, if we chose (port, address) chain, and fail to find a socket on given address,
we must try another lookup on (port, in6addr_any) chain to find sockets not bound
to a particular IP.
-> No extra cost for typical setups, where the first lookup will probabbly
be performed.
RCU lookups everywhere, we dont acquire spinlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.
If too many sockets are listed, we take a look at the secondary
(port, address) hash chain we added in previous patch.
We choose the shortest chain and proceed with a RCU lookup on the elected chain.
But, if we chose (port, address) chain, and fail to find a socket on given address,
we must try another lookup on (port, INADDR_ANY) chain to find socket not bound
to a particular IP.
-> No extra cost for typical setups, where the first lookup will probabbly
be performed.
RCU lookups everywhere, we dont acquire spinlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extends udp_table to contain a secondary hash table.
socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node
udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)
udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)
Note : No spinlock lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Union sk_hash with two u16 hashes for udp (no extra memory taken)
One 16 bits hash on (local port) value (the previous udp 'hash')
One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.
Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a counter in udp_hslot to keep an accurate count
of sockets present in chain.
This will permit to upcoming UDP lookup algo to chose
the shortest chain when secondary hash is added.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no good reason to not support userspace specifying the
network namespace during device creation, and it makes it easier
to create a network device and pass it to a child network namespace
with a well known name.
We have to be careful to ensure that the target network namespace
for the new device exists through the life of the call. To keep
that logic clear I have factored out the network namespace grabbing
logic into rtnl_link_get_net.
In addtion we need to continue to pass the source network namespace
to the rtnl_link_ops.newlink method so that we can find the base
device source network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
atalk_sum_partial can now use the rol16 function in bitops.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using RCU helps not touching device refcount in rawv6_bind()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bcm_proc_getifname() is called with RTNL and dev_base_lock
not held. It calls __dev_get_by_index() without locks, and
this is illegal (might crash)
Close the race by holding dev_base_lock and copying dev->name
in the protected section.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The x25 driver uses lock_kernel() implicitly through
its proto_ops wrapper. The makes the usage explicit
in order to get rid of that wrapper and to better document
the usage of the BKL.
The next step should be to get rid of the usage of the BKL
in x25 entirely, which requires understanding what data
structures need serialized accesses.
Cc: Henner Eisen <eis@baty.hanse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-x25@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The irda driver uses the BKL implicitly in its protocol
operations. Replace the wrapped proto_ops with explicit
lock_kernel() calls makes the usage more obvious and
shrinks the size of the object code.
The calls t lock_kernel() should eventually all be replaced
by other serialization methods, which requires finding out
The calls t lock_kernel() should eventually all be replaced
by other serialization methods, which requires finding out
which data actually needs protection.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Making the BKL usage explicit in ipx makes it more
obvious where it is used, reduces code size and helps
getting rid of the BKL in common code.
I did not analyse how to kill lock_kernel from ipx
entirely, this will involve either proving that it's not
needed, or replacing with a proper mutex or spinlock,
after finding out which data structures are protected
by the lock.
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Making the BKL usage explicit in appletalk makes it more
obvious where it is used, reduces code size and helps
getting rid of the BKL in common code.
I did not analyse how to kill lock_kernel from appletalk
entirely, this will involve either proving that it's not
needed, or replacing with a proper mutex or spinlock,
after finding out which data structures are protected
by the lock.
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
SPIN_LOCK_UNLOCKED is deprecated. Use DEFINE_SPINLOCK instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MII ioctls and SIOCSIFNAME need to go through ifsioc conversion,
which they never did so far. Some others are not implemented in the
native path, so we can just return -EINVAL directly.
Add IFSLAVE ioctls to the EINVAL list and move it to the end to
optimize the code path for the common case.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the original socket compat_ioctl code
from fs/compat_ioctl.c and converts the code from the copy
in net/socket.c into a single function. We add a few cycles
of runtime to compat_sock_ioctl() with the long switch()
statement, but gain some cycles in return by simplifying
the call chain to get there.
Due to better inlining, save 1.5kb of object size in the
process, and enable further savings:
before:
text data bss dec hex filename
13540 18008 2080 33628 835c obj/fs/compat_ioctl.o
14565 636 40 15241 3b89 obj/net/socket.o
after:
text data bss dec hex filename
8916 15176 2080 26172 663c obj/fs/compat_ioctl.o
20725 636 40 21401 5399 obj/net/socket.o
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
We must not have a compat ioctl handler for SIOCATALKDIFADDR
in common code, because the same number is used in other protocols
with different data structures.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes an identical copy of the socket compat_ioctl code
from fs/compat_ioctl.c to net/socket.c, as a preparation
for moving the functionality in a way that can be easily
reviewed.
The code is hidden inside of #if 0 and gets activated in the
patch that will make it work.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 2003 requires the outer header to have DF set if DF is set
on the inner header, even when PMTU discovery is off for the
tunnel. Our implementation does exactly that.
For this to work properly the IPIP gateway also needs to engate
in PMTU when the inner DF bit is set. As otherwise the original
host would not be able to carry out its PMTU successfully since
part of the path is only visible to the gateway.
Unfortunately when the tunnel PMTU discovery setting is off, we
do not collect the necessary soft state, resulting in blackholes
when the original host tries to perform PMTU discovery.
This problem is not reproducible on the IPIP gateway itself as
the inner packet usually has skb->local_df set. This is not
correctly cleared (an unrelated bug) when the packet passes
through the tunnel, which allows fragmentation to occur. For
hosts behind the IPIP gateway it is readily visible with a simple
ping.
This patch fixes the problem by performing PMTU discovery for
all packets with the inner DF bit set, regardless of the PMTU
discovery setting on the tunnel itself.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit v2.6.28-rc1~717^2~109^2~2 was slightly incomplete; not all
instances of par->match->family were changed to par->family.
References: http://bugzilla.netfilter.org/show_bug.cgi?id=610
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices require that all frames to a station
are flushed when that station goes into powersave
mode before being able to send frames to that
station again when it wakes up or polls -- all in
order to avoid reordering and too many or too few
frames being sent to the station when it polls.
Normally, this is the case unless the station
goes to sleep and wakes up very quickly again.
But in that case, frames for it may be pending
on the hardware queues, and thus races could
happen in the case of multiple hardware queues
used for QoS/WMM. Normally this isn't a problem,
but with the iwlwifi mechanism we need to make
sure the race doesn't happen.
This makes mac80211 able to cope with the race
with driver help by a new WLAN_STA_PS_DRIVER
per-station flag that can be controlled by the
driver and tells mac80211 whether it can transmit
frames or not. This flag must be set according to
very specific rules outlined in the documentation
for the function that controls it.
When we buffer new frames for the station, we
normally set the TIM bit right away, but while
the driver has blocked transmission to that sta
we need to avoid that as well since we cannot
respond to the station if it wakes up due to the
TIM bit. Once the driver unblocks, we can set
the TIM bit.
Similarly, when the station just wakes up, we
need to wait until all other frames are flushed
before we can transmit frames to that station,
so the same applies here, we need to wait for
the driver to give the OK.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The NETLINK_URELEASE notifier is only invoked for bound sockets, so
there is no need to check ->pid again.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add support for two more NL802154 commands: ADD_IFACE and DEL_IFACE,
thus allowing creation and removal of logic WPAN interfaces on the top
of wpan-phy.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Move all mac-related stuff to separate file so that ieee802154/netlink.c
contains only generic code.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
There is no real need to have ieee802154 interfaces separate
into several small modules, as neither of them has it's own use.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Follow the usual pattern of devices registration by adding new function
(wpan_phy_set_dev) that sets child->parent relationship and removing
parent argument from wpan_phy_register call.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
IEEE 802.15.4-2006 defines channel pages that hold channels (max 32 pages,
27 channels per page). Allow the driver to specify supported channels
on pages, other than the first one.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Set page to zero (for compatibility w/ devices supporting only first page).
Also init channel by default to -1 to disallow transfers for non-initialised
devices.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Conflicts:
drivers/net/usb/cdc_ether.c
All CDC ethernet devices of type USB_CLASS_COMM need to use
'&mbm_info'.
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on device refcount stuff, I found a device refcount leak
through DECNET.
This nasty bug can be used to hold refcounts on any !DECNET netdevice.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e36).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value and thus resulted breaking active FTP.
The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be more harder and expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/sock.c: In function 'sock_setsockopt':
net/core/sock.c:396: warning: 'index' may be used uninitialized in this function
net/core/sock.c:396: note: 'index' was declared here
GCC can't see that all paths initialize index, so just
set it to the default (0) and eliminate the specific
code block that handles the null device name string.
Signed-off-by: David S. Miller <davem@davemloft.net>
cur_pkt_size can be changed in proc fs while pktgen is running,
we better use a private field to get precise tx-bytes counter.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid dev_hold()/dev_put() in sock_bindtodevice()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending fragmentation expiration ICMP V4/V6 messages,
we can avoid touching device refcount, thanks to RCU
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We hold RTNL in tc_dump_tfilter(), we can avoid dev_hold()/dev_put()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid touching device refcount in sctp/ipv6, thanks to RCU
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use dev_get_by_name_rcu() to avoid dev_put() calls,
in sections already inside a rcu_read_lock()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add_del_if() is called with RTNL, we can use __dev_get_by_index()
instead of [dev_get_by_index() + dev_put()]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before calling capable(CAP_NET_RAW) check if this operations is on behalf
of the kernel or on behalf of userspace. Do not do the security check if
it is on behalf of the kernel.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace. This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct can_proto had a capability field which wasn't ever used. It is
dropped entirely.
struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While hunting dev_put() for net-next-2.6, I found a device refcount
leak in ROSE, ioctl(SIOCADDRT) error path.
Fix is to not touch device refcount, as we hold RTNL
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge code assumes ethernet addressing, so be more strict in
the what is allowed. This showed up when GRE had a bug and was not
using correct address format.
Add some more comments for increased clarity.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The variable 'other_way' gets initialized but is not read afterwards,
so remove it. Pass the right arguments to a pr_debug call.
While being at tidy up a bit and it fix this checkpatch warning:
WARNING: suspect code indent for conditional statements
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Trying to parse the option of a SYN packet that we have
no route entry for should just use global wide defaults
for route entry options.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling IPv4 specific inet_csk_route_req in tcp_check_req
is a bad idea and crashes machine on IPv6 connections, as reported
by Valdis Kletnieks
Also, all we are really interested in is the timestamp
option in the header, so calling tcp_parse_options()
with the "estab" set to false flag is an overkill as
it tries to parse half a dozen other TCP options.
We know whether timestamp should be enabled or not
using data from request_sock.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Tested-by: Valdis.Kletnieks@vt.edu
Signed-off-by: David S. Miller <davem@davemloft.net>
As pointed by Stephen Rothwell, commit c6d14c84 added a warning :
net/ipv4/devinet.c: In function 'inet_select_addr':
net/ipv4/devinet.c:902: warning: label 'out' defined but not used
delete unused 'out' label and do some cleanups as well
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal scan request mac80211 uses to
scan for IBSS networks was set up to contain
no channels at all because n_channels wasn't
set after setting up the channels array. Fix
this to properly scan for networks.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently, in IBSS mode, a single creator would go into
a loop trying to merge/scan. This happens because the IBSS timer is
rearmed on finishing a scan and the subsequent
timer invocation requests another scan immediately.
This patch fixes this issue by checking if we have just completed
a scan run trying to merge with other IBSS networks.
Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since we have a TODO item to make all station
management dependent on virtual interfaces, I
figured I'd start with pushing such a change
to drivers before more drivers start using the
ieee80211_find_sta() API with a hw pointer and
cause us grief later on.
For now continue exporting the old API in form
of ieee80211_find_sta_by_hw(), but discourage
its use strongly.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
commit 211a4d12abf86fe0df4cd68fc6327cbb58f56f81
Author: Johannes Berg <johannes@sipsolutions.net>
Date: Tue Oct 20 15:08:53 2009 +0900
cfg80211: sme: deauthenticate on assoc failure
accidentally introduced a dead variable, I had
changed the code to not need it while creating
the patch and it looks like I forgot to remove
the variable (and nobody else noticed either).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
nf_unregister_queue_handlers() already does a synchronize_rcu()
call, we dont need to do it again in callers.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Adds RCU management to the list of netdevices.
Convert some for_each_netdev() users to RCU version, if
it can avoid read_lock-ing dev_base_lock
Ie:
read_lock(&dev_base_loack);
for_each_netdev(net, dev)
some_action();
read_unlock(&dev_base_lock);
becomes :
rcu_read_lock();
for_each_netdev_rcu(net, dev)
some_action();
rcu_read_unlock();
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Another rcu conversion to avoid one dev_hold()/dev_put() pair
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These checks don't make sense anymore since rtnl_notify() cannot fail.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it is possible to request a scan on only
disabled channels, which could be problematic for
some drivers. Reject such scans, and also ignore
disabled channels that are given. This resuls in
the scan begin/end event only including channels
that are actually used.
This makes the mac80211 check for disabled channels
superfluous. At the same time, remove the no-IBSS
check from mac80211 -- nothing says that we should
not find any networks on channels that cannot be
used for an IBSS, even when operating in IBSS mode.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Neither of these commands should clear the current configuration value
if they return error. Furthermore, cfg80211_set_cipher_group() should
be able to handle IW_AUTH_CIPHER_NONE without reporting error.
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since sometimes mac80211 queues up a scan request
to only act on it later, it must be allowed to
(internally) cancel a not-yet-running scan, e.g.
when the interface is taken down. This condition
was missing since we always checked only the
local->scanning variable which isn't yet set in
that situation.
Reported-by: Luis R. Rodriguez <mcgrof@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The multi-line code in this macro wasn't wrapped
in do {} while (0) so we cannot use it in an if()
branch safely in the future -- fix that.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
For some strange reason the netif_running() check
ended up after the actual type change instead of
before, potentially causing all kinds of problems
if the interface is up while changing the type;
one of the problems manifests itself as a warning:
WARNING: at net/mac80211/iface.c:651 ieee80211_teardown_sdata+0xda/0x1a0 [mac80211]()
Hardware name: Aspire one
Pid: 2596, comm: wpa_supplicant Tainted: G W 2.6.31-10-generic #32-Ubuntu
Call Trace:
[] warn_slowpath_common+0x6d/0xa0
[] warn_slowpath_null+0x15/0x20
[] ieee80211_teardown_sdata+0xda/0x1a0 [mac80211]
[] ieee80211_if_change_type+0x4a/0xc0 [mac80211]
[] ieee80211_change_iface+0x61/0xa0 [mac80211]
[] cfg80211_wext_siwmode+0xc7/0x120 [cfg80211]
[] ioctl_standard_call+0x58/0xf0
(http://www.kerneloops.org/searchweek.php?search=ieee80211_teardown_sdata)
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
commit 211a4d12abf86fe0df4cd68fc6327cbb58f56f81
Author: Johannes Berg <johannes@sipsolutions.net>
Date: Tue Oct 20 15:08:53 2009 +0900
cfg80211: sme: deauthenticate on assoc failure
introduced a potential NULL pointer dereference that
some people have been hitting for some reason -- the
params.bssid pointer is not guaranteed to be non-NULL
for what seems to be a race between various ways of
reaching the same thing.
While I'm trying to analyse the problem more let's
first fix the crash. I think the real fix may be to
avoid doing _anything_ if it ended up being NULL, but
right now I'm not sure yet.
I think
http://bugzilla.kernel.org/show_bug.cgi?id=14342
might also be this issue.
Reported-by: Parag Warudkar <parag.lkml@gmail.com>
Tested-by: Parag Warudkar <parag.lkml@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The patch below also addresses a couple of other corner cases in readdir
seen with a large (e.g. 64k) msize. I'm not sure what people think of
my co-opting of fid->aux here. I'd be happy to rework if there's a better
way.
When the size of the user supplied buffer passed to readdir is smaller
than the data returned in one go by the 9P read request, v9fs_dir_readdir()
currently discards extra data so that, on the next call, a 9P read
request will be issued with offset < previous offset + bytes returned,
which voilates the constraint described in paragraph 3 of read(5) description.
This patch preseves the leftover data in fid->aux for use in the next call.
Signed-off-by: Jim Garlick <garlick@llnl.gov>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Avoids touching device refcount in datagram_send_ctl(), thanks to RCU
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoids touching device refcount in inet6_bind(), thanks to RCU
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using dev_get_by_index_rcu() in ip6_tnl_rcv_ctl() & ip6_tnl_xmit_ctl()
avoids touching device refcount.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- packet_sendmsg_spkt() can use dev_get_by_name_rcu() to avoid touching device refcount.
- packet_getname_spkt() & packet_getname() can use dev_get_by_index_rcu() to
avoid touching device refcount too.
tpacket_snd() & packet_snd() can not use RCU yet because they can sleep when
allocating skb.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All ioctls() implemented by dev_ifsioc_locked() :
SIOCGIFFLAGS, SIOCGIFMETRIC, SIOCGIFMTU, SIOCGIFHWADDR,
SIOCGIFSLAVE, SIOCGIFMAP, SIOCGIFINDEX & SIOCGIFTXQLEN
can use RCU lock instead of dev_base_lock rwlock
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid touching device refcount in icmp_send(),
using dev_get_by_index_rcu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use dev_get_by_index_rcu() instead of __dev_get_by_index() and
dev_base_lock rwlock
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I tested the recent unregister many changes and got a weird,
nasty and seemingly unrelasted kernel oops. Changing
unregister_netdevice_queue to use list_move_tail fixes
the problem for me.
ip link add type veth
rmmod veth
ls /sys/class/net/
showed one of the veth devices still present.
A subsequent ip link oopsed the box.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock
(and avoid touching netdevice refcount)
netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.
However, it adds a synchronize_rcu() call in dev_change_name()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move receive processing from event handler to a tasklet.
This should help prevent hangcheck timer from going off
when RDS is under heavy load.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This issue was discovered by HP's Pradeep and fixed in OFED
1.3, but not fixed in later versions, since the fix's implementation
was not immediately applyable to the later code. This patch should
do the trick for 1.4+ codebases.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove explicit destruction of passive connection when destroying
active end of the connection. The passive end is also on the
device's connection list, and will thus be cleaned up properly.
Panic was caused by trying to clean it up twice.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"At rds_ib_recv_refill_one(), it first executes atomic_read(&rds_ib_allocation)
for if-condition checking,
and then executes atomic_inc(&rds_ib_allocation) if the condition was
not satisfied.
However, if any other code which updates rds_ib_allocation executes
between these two atomic operation executions,
it seems that it may result race condition. (especially when
rds_ib_allocation + 1 == rds_ib_sysctl_max_recv_allocation)"
This patch fixes this by using atomic_inc_unless to eliminate the
possibility of allocating more than rds_ib_sysctl_max_recv_allocation
and then decrementing the count if the allocation fails. It also
makes an identical change to the iwarp transport.
Reported-by: Shin Hong <hongshin@gmail.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS currently supports a GET_MR sockopt to establish a
memory region (MR) for a chunk of memory. However, the fastreg
method ties a MR to a particular destination. The GET_MR_FOR_DEST
sockopt allows the remote machine to be specified, and thus
support for fastreg (aka FRWRs).
Note that this patch does *not* do all of this - it simply
implements the new sockopt in terms of the old one, so applications
can begin to use the new sockopt in preparation for cutover to
FRWRs.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not right to do something here when returning an
error, and hostapd should never have relied on it as
it only fixes up a small part of the problem anyway.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This variable is set once, and tested once.
However, the code path that can set it is
mutually exclusive with the code path that
tests it, so the test is always true. Thus
we also don't need to set it either and can
just remove the variable.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We drop nullfunc frames, but not qos-nullfunc frames,
even though those could be used for PS state control
as well.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When mac80211 is asked to buffer multicast frames
in AP mode, it will not set the flag indicating
that the frames should be sent after the DTIM
beacon for those frames buffered in software. Fix
this little inconsistency by always setting that
flag in the buffering code path.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This value is unused by mac80211, because it was only
be used by wireless extensions, and turned out to not
be useful there because the quality value needs to be
comparable between scan results and the current value
which is impossible when the qual value is calculated
taking into account noise, for example.
Since it is unused anyway, this patch deprecates it
in the hope that drivers will remove their sometimes
quite expensive calculations of the value.
I'm open to actual uses of the value, but the best
way of using it seems to be what the Intel drivers do
which should probably be generalised if we have noise
values from the hardware.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Compared to ieee80211_beacon_get(), the new function
ieee80211_beacon_get_tim() returns information on the
location and length of the TIM IE, which some drivers
need in order to generate the TIM on the device. The
old function, ieee80211_beacon_get(), becomes a small
static inline wrapper around the new one to not break
all drivers.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This fixed a BUG_ON in __skb_trim() when paged rx is used in
iwlwifi driver. Yes, the whole mac80211 stack doesn't support
paged SKB yet. But let's start the work slowly from small
code snippets.
Reported-and-tested-by: Abhijeet Kolekar <abhijeet.kolekar@intel.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
While there may be a case for a driver adding its
own bits of radiotap information, none currently
does. Also, drivers would have to copy the code
to generate the radiotap bits that now mac80211
generates. If some driver in the future needs to
add some driver-specific information I'd expect
that to be in a radiotap vendor namespace and we
can add a different way of passing such data up
and having mac80211 include it.
Additionally, rename IEEE80211_CONF_RADIOTAP to
IEEE80211_CONF_MONITOR since it's still used by
b43(legacy) to obtain per-frame timestamps.
The purpose of this patch is to simplify the RX
code in mac80211 to make it easier to add paged
skb support.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In
commit 601ae7f25a
Author: Bruno Randolf <br1@einfach.org>
Date: Thu May 8 19:22:43 2008 +0200
mac80211: make rx radiotap header more flexible
code was added that tried to align the radiotap header
position in memory based on the radiotap header length.
Quite obviously, that is completely useless.
Instead of trying to do that, use unaligned accesses
to generate the radiotap header. To properly do that,
we also need to mark struct ieee80211_radiotap_header
packed, but that is fine since it's already packed
(and it should be marked packed anyway since its a
wire format).
Cc: Bruno Randolf <br1@einfach.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's currently a very odd bug in mac80211 -- a
hardware scan that is done while the hardware is
really operating on 2.4 GHz will include CCK rates
in the probe request frame, even on 5 GHz (if the
driver uses the mac80211 IEs). Vice versa, if the
hardware is operating on 5 GHz the 2.4 GHz probe
requests will not include CCK rates even though
they should.
Fix this by splitting up cfg80211 scan requests by
band -- recalculating the IEs every time -- and
requesting only per-band scans from the driver.
Apparently this bug hasn't been a problem yet, but
it is imaginable that some older access points get
confused if confronted with such behaviour.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This comment hasn't been a real TODO item for a long
time now since we fixed that quite a while ago.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In TX path it was assumed that dynamic power save works only if
IEEE80211_HW_PS_NULLFUNC_STACK is set. But is not the case, there are
devices which have nullfunc support in hardware but need mac80211
to handle dynamic power save timers, TI's wl1251 is one of them.
The fix is to not check for IEEE80211_HW_PS_NULLFUNC_STACK in
is_dynamic_ps_enabled(), instead check IEEE80211_HW_SUPPORTS_PS and
IEEE80211_HW_SUPPORTS_DYNAMIC_PS flags and act accordingly.
Tested with wl1251.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Refactor dynamic power save checks to a function of it's own for better
readibility. No functional changes.
Signed-off-by: Kalle Valo <kalle.valo@nokia.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We can save a lot of code and pointers in the structs
by using debugfs_remove_recursive().
First, change cfg80211 to use debugfs_remove_recursive()
so that drivers do not need to clean up any files they
added to the per-wiphy debugfs (if and only if they are
ok to be accessed until after wiphy_unregister!).
Then also make mac80211 use debugfs_remove_recursive()
where necessary -- it need not remove per-wiphy files
as cfg80211 now removes those, but netdev etc. files
still need to be handled but can now be removed without
needing struct dentry pointers to all of them.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When HT debugging is enabled and we receive a DelBA
frame we print out the reason code in the wrong byte
order. Fix that so we don't get weird values printed.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The addba timer function acquires the sta spinlock,
but at the same time we try to del_timer_sync() it
under the spinlock which can produce deadlocks.
To fix this, always del_timer_sync() the timer in
ieee80211_process_addba_resp() and add it again
after checking the conditions, if necessary.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The IBSS code leaks a BSS struct after telling
cfg80211 about a given BSS by passing a frame.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This isn't beautifully abstracted, but it is simple,
simplifies uses and so far is only needed for the bonding driver.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Neulinger noticed that gretap devices get their MAC address
from the local IP address, which results in invalid MAC addresses
half of the time.
This is because gretap is still using the tunnel netdev ops rather
than the correct tap netdev ops struct.
This patch also fixes changelink to not clobber the MAC address
for the gretap case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Nathan Neulinger <nneul@mst.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.
Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small cleanup of __dev_get_by_name() and __dev_get_by_index()
to use hlist_for_each_entry() : They'll look like their _rcu variant.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The temporary copy of the VLAN group is not neccessary since the lower device
is already in the process of being unregistered, if it was neccessary the
memset of the global group would introduce a race condition.
With this removed, the changes to the original code are only a few lines, so
remove the new function and move the code back into vlan_device_event().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ax25_ctl_struct member `arg' is unsigned and cannot be less
than 0.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Policy routing is not looked up by mark on reverse path filtering.
This fixes it.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will allow drivers to adjust their receive path dynamically
based on whether GRO is being applied successfully.
Currently all in-tree callers ignore the return values of these
functions and do not need to be changed.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This clarifies which return and parameter types are GRO result codes
and not RX result codes.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (43 commits)
net: Fix 'Re: PACKET_TX_RING: packet size is too long'
netdev: usb: dm9601.c can drive a device not supported yet, add support for it
qlge: Fix firmware mailbox command timeout.
qlge: Fix EEH handling.
AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl (v2)
bonding: fix a race condition in calls to slave MII ioctls
virtio-net: fix data corruption with OOM
sfc: Set ip_summed correctly for page buffers passed to GRO
cnic: Fix L2CTX_STATUSB_NUM offset in context memory.
MAINTAINERS: rt2x00 list is moderated
airo: Reorder tests, check bounds before element
mac80211: fix for incorrect sequence number on hostapd injected frames
libertas spi: fix sparse errors
mac80211: trivial: fix spelling in mesh_hwmp
cfg80211: sme: deauthenticate on assoc failure
mac80211: keep auth state when assoc fails
mac80211: fix ibss joining
b43: add 'struct b43_wl' missing declaration
b43: Fix Bugzilla #14181 and the bug from the previous 'fix'
rt2x00: Fix crypto in TX frame for rt2800usb
...
This should make it possible to test for the existence of local
sockets in the INPUT path.
References: http://marc.info/?l=netfilter-devel&m=125380481517129&w=2
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently PACKET_TX_RING forces certain amount of every frame to remain
unused. This probably originates from an early version of the
PACKET_TX_RING patch that in fact used the extra space when the (since
removed) CONFIG_PACKET_MMAP_ZERO_COPY option was enabled. The current
code does not make any use of this extra space.
This patch removes the extra space reservation and lets userspace make
use of the full frame size.
Signed-off-by: Gabor Gombas <gombasg@sztaki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when we implement new protocol support we should keep such
a detail in mind (which is easy to forget about).
Lets introduce DECLARE_SOCKADDR helper which check if
storage unit is not overfowed at build time.
Eventually inet_getname is switched to use DECLARE_SOCKADDR
(to show example of usage).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some workloads hit dev_base_lock rwlock pretty hard.
We can use RCU lookups to avoid touching this rwlock.
netdevices are already freed after a RCU grace period, so this patch
adds no penalty at device dismantle time.
dev_ifname() converted to dev_get_by_index_rcu()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
optlen is unsigned so the `< 0' test is never true.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is data, the unsigned skb->len is greater than 0.
rt.sigdigits is unsigned as well, so the test `>= 0' is
always true, the other part of the test catches wrapped
values.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add and use no DSCAK bit in the features field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add and use no window scale bit in the features field.
Note that this is not the same as setting a window scale of 0
as would happen with window limit on route.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement querying and acting upon the no timestamp bit in the feature
field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement querying and acting upon the no sack bit in the features
field.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Augment raw_send_hdrinc to correct for incorrect ip header length values
A series of oopses was reported to me recently. Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.
I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.
Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call. In
raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid. Later during ipsec
encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers. During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl). Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.
This fix for this is pretty straightforward, simply validate the value of of
iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer
size, drop the frame and return -EINVAL. I just confirmed this fixes the
reported crashes.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements the netdev_ops.ndo_fcoe_get_wwn for VLAN device.
Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corrected a spelling error in a function name.
Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d519e17e2d
(net: export device speed and duplex via sysfs)
made the wrong assumption that netdev->ethtool_ops was always set.
This makes possible to crash kernel and let rtnl in locked state.
modprobe dummy
ip link set dummy0 up
(udev runs and crash)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use unregister_netdevice_many() to speedup master device unregister.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a list_head parameter to rtnl_link_ops->dellink() methods
allow us to queue devices on a list, in order to dismantle
them all at once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce rollback_registered_many() and unregister_netdevice_many()
rollback_registered_many() is able to perform necessary steps at device dismantle
time, factorizing two expensive synchronize_net() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patchs adds an unreg_list anchor to struct net_device, and
introduces an unregister_netdevice_queue() function, able to queue
a net_device to a list instead of immediately unregister it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes the mesh housekeeping timer work properly on big endian
systems.
Signed-off-by: Rui Paulo <rpaulo@gmail.com>
Signed-off-by: Javier Cardona <javier@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Mesh portals proxy traffic for nodes external to the mesh. When a
proxied frame is received by a mesh interface, it should update its mesh
portal table. This was only happening for unicast frames. With this
change we also learn about mesh portals from proxied multicast frames.
Signed-off-by: Javier Cardona <javier@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Replace netif_tx_{start,stop,wake}_all_queues with the single-queue
equivalents (i.e. netif_{start,stop,wake}_queue). Since we are down to
a single queue, these should peform slightly better.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It might be the case that __cfg80211_disconnected() has already
cleaned up wdev->current_bss() for us. The old code didn't catch
that situation and didn't warn needlessly.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Get rid of cookies in cfg80211_send_XXX() functions.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When hostapd injects a frame, e.g. an authentication or association
response, mac80211 looks for a suitable access point virtual interface
to associate the frame with based on its source address. This makes it
possible e.g. to correctly assign sequence numbers to the frames.
A small typo in the ethernet address comparison statement caused a
failure to find a suitable ap interface. Sequence numbers on such
frames where therefore left unassigned causing some clients
(especially windows-based 11b/g clients) to reject them and fail to
authenticate or associate with the access point. This patch fixes the
typo in the address comparison statement.
Signed-off-by: Björn Smedman <bjorn.smedman@venatech.se>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fix a typo in the description of hwmp_route_info_get(), no function
changes.
Signed-off-by: Andrey Yurovsky <andrey@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When the in-kernel SME gets an association failure from
the AP we don't deauthenticate, and thus get into a very
confused state which will lead to warnings later on. Fix
this by actually deauthenticating when the AP indicates
an association failure.
(Brought to you by the hacking session at Kernel Summit 2009 in Tokyo,
Japan. -- JWL)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When association fails, we should stay authenticated,
which in mac80211 is represented by the existence of
the mlme work struct, so we cannot free that, instead
we need to just set it to idle.
(Brought to you by the hacking session at Kernel Summit 2009 in Tokyo,
Japan. -- JWL)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Recent commit "mac80211: fix logic error ibss merge bssid check" fixed
joining of ibss cell when static bssid is provided. In this case
ifibss->bssid is set before the cell is joined and comparing that address
to a bss should thus always succeed. Unfortunately this change broke the
other case of joining a ibss cell without providing a static bssid where
the value of ifibss->bssid is not set before the cell is joined.
Since ifibss->bssid may be set before or after joining the cell we do not
learn anything by comparing it to a known bss. Remove this check.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We currently use a 16 bit field (vlan_tci) to store VLAN ID/PRIO on a skb.
Null value is used as a special value, meaning vlan tagging not enabled.
This forbids use of null vlan ID.
As pointed by David, some drivers use the 3 high order bits (PRIO)
As VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag, and
allow null VLAN ID.
In case future code really wants to use VLAN_CFI_MASK, we'll have to use
a bit outside of vlan_tci.
#define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT 13
#define VLAN_CFI_MASK 0x1000 /* Canonical Format Indicator */
#define VLAN_TAG_PRESENT VLAN_CFI_MASK
#define VLAN_VID_MASK 0x0fff /* VLAN Identifier */
Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While playing with pktgen, I realized IP ID was not filled and a
random value was taken, possibly leaking 2 bytes of kernel memory.
We can use an increasing ID, this can help diagnostics anyway.
Also clear packet payload, instead of leaking kernel memory.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When handling large number of netdevice, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.
Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table.
This considerably speedups "ip link" operations
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPIP tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm6_tunnels use one rwlock to protect their hash tables.
Plain and straightforward conversion to RCU locking to permit better SMP
performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT tunnels use one rwlock to protect their prl entries.
This first patch adds RCU locking for prl management,
with standard call_rcu() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b6b39e8f3f (tcp: Try to catch MSG_PEEK bug) added a printk()
to the WARN_ON() that's in tcp.c. This patch changes this combination
to WARN(); the advantage of WARN() is that the printk message shows up
inside the message, so that kerneloops.org will collect the message.
In addition, this gets rid of an extra if() statement.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
move virtrng_remove to .devexit.text
move virtballoon_remove to .devexit.text
virtio_blk: Revert serial number support
virtio: let header files include virtio_ids.h
virtio_blk: revert QUEUE_FLAG_VIRT addition
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (21 commits)
niu: VLAN_ETH_HLEN should be used to make sure that the whole MAC header was copied to the head buffer in the Vlan packets case
KS8851: Fix ks8851_set_rx_mode() for IFF_MULTICAST
KS8851: Fix MAC address write order
KS8851: Add soft reset at probe time
net: fix section mismatch in fec.c
net: Fix struct inet_timewait_sock bitfield annotation
tcp: Try to catch MSG_PEEK bug
net: Fix IP_MULTICAST_IF
bluetooth: static lock key fix
bluetooth: scheduling while atomic bug fix
tcp: fix TCP_DEFER_ACCEPT retrans calculation
tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
tcp: accept socket after TCP_DEFER_ACCEPT period
Revert "tcp: fix tcp_defer_accept to consider the timeout"
AF_UNIX: Fix deadlock on connecting to shutdown socket
ethoc: clear only pending irqs
ethoc: inline regs access
vmxnet3: use dev_dbg, fix build for CONFIG_BLOCK=n
virtio_net: use dev_kfree_skb_any() in free_old_xmit_skbs()
be2net: fix support for PCI hot plug
...
rtnl_getlink() & rtnl_setlink() run with RTNL held, we can use
__dev_get_by_index() and __dev_get_by_name() variants and avoid
dev_hold()/dev_put()
Adds to rtnl_getlink() the capability to find a device by its name,
not only by its index.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rusty,
commit 3ca4f5ca73
virtio: add virtio IDs file
moved all device IDs into a single file. While the change itself is
a very good one, it can break userspace applications. For example
if a userspace tool wanted to get the ID of virtio_net it used to
include virtio_net.h. This does no longer work, since virtio_net.h
does not include virtio_ids.h.
This patch moves all "#include <linux/virtio_ids.h>" from the C
files into the header files, making the header files compatible with
the old ones.
In addition, this patch exports virtio_ids.h to userspace.
CC: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use existing
macro to do the work.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can help being able to filter packets on their queue_mapping.
If filter performance is not good, we could add a "numqueue" field
in struct packet_type, so that netif_nit_deliver() and other functions
can directly ignore packets with not expected queue number.
Lets experiment this simple filter extension first.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We hold RTNL, we can use __dev_get_by_index() instead of dev_get_by_index()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing multiple captures, I found af_packet was dirtying cache line
containing its prot_hook.
This slow down machines where several cpus are necessary to handle capture
traffic, as each prot_hook is traversed for each packet coming in or out
the host.
This patches moves "struct packet_type prot_hook" to the end of
packet_sock, and uses a ____cacheline_aligned_in_smp to make sure
this remains shared by all cpus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tries to print out more information when we hit the
MSG_PEEK bug in tcp_recvmsg. It's been around since at least
2005 and it's about time that we finally fix it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.
Remove redundant test in ip_rt_frag_needed() (done by caller).
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow bpf to set a filter to drop packets that dont
match a specific mark
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.
Fix is to call dev_get_by_index()/dev_put().
If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6d01a026b7.
Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
with a more correct way to deal with this.
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, calls to wimax_rfkill() will be blocked until the device is
at least past the WIMAX_ST_UNINITIALIZED state, return -ENOMEDIUM when
the device is in the WIMAX_ST_DOWN state.
In parallel, wimax-tools would issue a wimax_rfkill(WIMAX_RF_QUERY)
call right after opening a handle with wimaxll_open() as means to
verify if the interface is really a WiMAX interface [newer kernel
version will have a call specifically for this].
The combination of these two facts is that in some cases, before the
driver has finalized initializing its device's firmware, a
wimaxll_open() call would fail, when it should not.
Thus, change the wimax_rfkill() code to allow queries when the device
is in WIMAX_ST_UNINITIALIZED state.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
It makes sense that the messaging pipe to the device can be used
before the device is fully ready, as long as it is registered with the
stack. Some debugging tools need it.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Add "debug" module options to all the wimax modules (including
drivers) so that the debug levels can be set upon kernel boot or
module load time.
This is needed as currently there was a limitation where the debug
levels could only be set when a device was succesfully
enumerated. This made it difficult to debug issues that made a device
not probe properly.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
The WiMAX stack assumes that all WiMAX devices are SW OFF when they
are initialized. The recent changes in the RFKILL stack thus cause an
initial call after rfkill_register(), because by default, rfkill
considers devices to be SW ON upon registration.
So call rfkill_init_sw_state() to set it to SW OFF so
rfkill_register() doesn't do that unnecessary step.
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
I found a deadlock bug in UNIX domain socket, which makes able to DoS
attack against the local machine by non-root users.
How to reproduce:
1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
namespace(*), and shutdown(2) it.
2. Repeat connect(2)ing to the listening socket from the other sockets
until the connection backlog is full-filled.
3. connect(2) takes the CPU forever. If every core is taken, the
system hangs.
PoC code: (Run as many times as cores on SMP machines.)
int main(void)
{
int ret;
int csd;
int lsd;
struct sockaddr_un sun;
/* make an abstruct name address (*) */
memset(&sun, 0, sizeof(sun));
sun.sun_family = PF_UNIX;
sprintf(&sun.sun_path[1], "%d", getpid());
/* create the listening socket and shutdown */
lsd = socket(AF_UNIX, SOCK_STREAM, 0);
bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
listen(lsd, 1);
shutdown(lsd, SHUT_RDWR);
/* connect loop */
alarm(15); /* forcely exit the loop after 15 sec */
for (;;) {
csd = socket(AF_UNIX, SOCK_STREAM, 0);
ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
if (-1 == ret) {
perror("connect()");
break;
}
puts("Connection OK");
}
return 0;
}
(*) Make sun_path[0] = 0 to use the abstruct namespace.
If a file-based socket is used, the system doesn't deadlock because
of context switches in the file system layer.
Why this happens:
Error checks between unix_socket_connect() and unix_wait_for_peer() are
inconsistent. The former calls the latter to wait until the backlog is
processed. Despite the latter returns without doing anything when the
socket is shutdown, the former doesn't check the shutdown state and
just retries calling the latter forever.
Patch:
The patch below adds shutdown check into unix_socket_connect(), so
connect(2) to the shutdown socket will return -ECONREFUSED.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last users of skb_icv_walk are converted to ahash now,
so skb_icv_walk is unused and can be removed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts ah6 to the new ahash interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts ah4 to the new ahash interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- skb_kill_datagram() can increment sk->sk_drops itself, not callers.
- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. GENL_MIN_ID is a valid id -> no need to start at
GENL_MIN_ID + 1.
2. Avoid going through the ids two times: If we start at
GENL_MIN_ID+1 (*or bigger*) and all ids are over!, the
code iterates through the list twice (*or lesser*).
3. Simplify code - no need to start at idx=0 which gets
reset to GENL_MIN_ID.
Patch on net-next-2.6. Reboot test shows that first id
passed to genl_register_family was 16, next two were
GENL_ID_GENERATE and genl_generate_id returned 17 & 18
(user level testing of same code shows expected values
across entire range of MIN/MAX).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
genl_register_family() doesn't need to call genl_family_find_byid
when GENL_ID_GENERATE is passed during register.
Patch on net-next-2.6, compile and reboot testing only.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicate sock_set_flag(sk, SOCK_ZAPPED) in iucv_sock_close,
which has been overlooked in September-commit
7514bab04e.
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of modifying sk->sk_ack_backlog directly, use respective
socket functions.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.
This adds sk_drops managment to many protocols that not cared yet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Phonet "universe" only has 64 addresses, so we keep a trivial flat
routing table.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looking at commit ebc3f64b86 it appears that this was intended
and not the original, equivalent to `if (facilities.reverse & ~0x81)'.
In x25_parse_facilities() that patch changed how facilities->reverse
was set. No other bits were set than 0x80 and/or 0x01.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_poll() can in some circumstances drop frames with incorrect checksums.
Problem is we now have to lock the socket while dropping frames, or risk
sk_forward corruption.
This bug is present since commit 95766fff6b
([UDP]: Add memory accounting.)
While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was trying to use TCP_DEFER_ACCEPT and noticed that if the
client does not talk, the connection is never accepted and
remains in SYN_RECV state until the retransmits expire, where
it finally is deleted. This is bad when some firewall such as
netfilter sits between the client and the server because the
firewall sees the connection in ESTABLISHED state while the
server will finally silently drop it without sending an RST.
This behaviour contradicts the man page which says it should
wait only for some time :
TCP_DEFER_ACCEPT (since Linux 2.4)
Allows a listener to be awakened only when data arrives
on the socket. Takes an integer value (seconds), this
can bound the maximum number of attempts TCP will
make to complete the connection. This option should not
be used in code intended to be portable.
Also, looking at ipv4/tcp.c, a retransmit counter is correctly
computed :
case TCP_DEFER_ACCEPT:
icsk->icsk_accept_queue.rskq_defer_accept = 0;
if (val > 0) {
/* Translate value in seconds to number of
* retransmits */
while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
val > ((TCP_TIMEOUT_INIT / HZ) <<
icsk->icsk_accept_queue.rskq_defer_accept))
icsk->icsk_accept_queue.rskq_defer_accept++;
icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;
==> rskq_defer_accept is used as a counter of retransmits.
But in tcp_minisocks.c, this counter is only checked. And in
fact, I have found no location which updates it. So I think
that what was intended was to decrease it in tcp_minisocks
whenever it is checked, which the trivial patch below does.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.
Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.
This takes into account comments made by:
. Paul Moore: sock_recvmsg is called only for the first datagram,
sock_recvmsg_nosec is used for the rest.
. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
works in the same fashion as the ppoll one.
If the underlying protocol returns a datagram with MSG_OOB set, this
will make recvmmsg return right away with as many datagrams (+ the OOB
one) it has received so far.
. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
datagrams and then recvmsg returns an error, recvmmsg will return
the successfully received datagrams, store the error and return it
in the next call.
This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ieee80211_rx() must be called with softirqs disabled
since the networking stack requires this for netif_rx()
and some code in mac80211 can assume that it can not
be processing its own tasklet and this call at the same
time.
It may be possible to remove this requirement after a
careful audit of mac80211 and doing any needed locking
improvements in it along with disabling softirqs around
netif_rx(). An alternative might be to push all packet
processing to process context in mac80211, instead of
to the tasklet, and add other synchronisation.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When a scan completes, we call ieee80211_sta_find_ibss(),
which is also called from other places. When the scan was
done in software, there's no problem as both run from the
single-threaded mac80211 workqueue and are thus serialised
against each other, but with hardware scan the completion
can be in a different context and race against callers of
this function from the workqueue (e.g. due to beacon RX).
So instead of calling ieee80211_sta_find_ibss() directly,
just arm the timer and have it fire, scheduling the work,
which will invoke ieee80211_sta_find_ibss() (if that is
appropriate in the current state).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
ipv6 sit: Set relay to 0.0.0.0 directly if relay_prefixlen == 0.
Do not use bit-shift if relay_prefixlen == 0;
relay_prefix << 32 does not result in 0.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Fix 6rd relay address.
Relay's address should be extracted from real IPv6 address
instead of configured prefix.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Ensure to initialize 6rd parameters.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This probably deserves to go into -stable.
Pedit will reject a policy that is large because it
uses the wrong structure in the policy validation.
This fixes it.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
The error unwinding code in set_netns has a bug
that will make it run into a BUG_ON if passed a
bad wiphy index, fix by not trying to unlock a
wiphy that doesn't exist.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
ethoc: limit the number of buffers to 128
ethoc: use system memory as buffer
ethoc: align received packet to make IP header at word boundary
ethoc: fix buffer address mapping
ethoc: fix typo to compute number of tx descriptors
au1000_eth: Duplicate test of RX_OVERLEN bit in update_rx_stats()
netxen: Fix Unlikely(x) > y
pasemi_mac: ethtool get settings fix
add maintainer for network drop monitor kernel service
tg3: Fix phylib locking strategy
rndis_host: support ETHTOOL_GPERMADDR
ipv4: arp_notify address list bug
gigaset: add kerneldoc comments
gigaset: correct debugging output selection
gigaset: improve error recovery
gigaset: fix device ERROR response handling
gigaset: announce if built with debugging
gigaset: handle isoc frame errors more gracefully
gigaset: linearize skb
gigaset: fix reject/hangup handling
...
Commit 9ef1d4c7c7 ("[NETLINK]: Missing
initializations in dumped data") introduced a typo in
initialization. This patch fixes this.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow enable/disable UFO on bridge device via ethtool
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that software UFO is supported, UFO can be enabled on master
devices like bridge, bond even though the attached device doesn't
support this feature in hardware.
This allows UFO to be used between KVM host and guest even when a
physical interface attached to the bridge doesn't support UFO.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.
4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.
Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper
Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.
dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason that cipso_v4_delopt() is not
defined as a static function.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function argument len was redeclarated within the
function. This patch fix the redeclaration of symbol 'len'.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Might as well use the ipv6_addr_set_v4mapped() inline we created last
year.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_route_redirect() to use ipv6_addr_copy().
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for route lookup using sk_mark on IPv4 listening sockets.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This continues the previous patch, by applying the same change to CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a redundancy in the CCID half-connection (hc) naming scheme:
* instead of 'hctx->tx_...', write 'hc->tx_...';
* instead of 'hcrx->rx_...', write 'hc->rx_...';
which works because the 'type' of the half-connection is encoded in the
'rx_' / 'tx_' prefixes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements the new naming scheme also for CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch starts a less problematic naming convention for CCID structs.
The old naming convention used 'hc{tx,rx}->ccid?hc{tx,rx}->...' as
recurring prefixes, which made the code
* hard to write (not easy to fit into 80 characters);
* hard to read (most of the space is occupied by prefixes).
The new naming scheme:
* struct entries for the TX socket are prefixed by 'tx_';
* and those for the RX socket are prefixed by 'rx_'.
The identifiers then remain distinguishable when grep-ing through the tree:
(a) RX/TX sockets are distinguished by the naming scheme,
(b) individual CCIDs are distinguished by filename (ccid{2,3,4}.{c,h}).
This first patch implements the scheme for CCID-2.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Everything including this header includes net/cfg80211.h, which
includes linux/netdevice.h, which includes linux/ethtool.h already. Why
slow-down the build, even a little bit?
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It's useful to provide firmware and hardware version to user space and have a
generic interface to retrieve them. Users can provide the version information
in bug reports etc.
Add fields for firmware and hardware version to struct wiphy.
(Dropped nl80211 bits for now and modified remaining bits in favor of
ethtool. -- JWL)
Cc: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Refactor wext to
* split out iwpriv handling
* split out iwspy handling
* split out procfs support
* allow cfg80211 to have wireless extensions compat code
w/o CONFIG_WIRELESS_EXT
After this, drivers need to
- select WIRELESS_EXT - for wext support
- select WEXT_PRIV - for iwpriv support
- select WEXT_SPY - for iwspy support
except cfg80211 -- which gets new hooks in wext-core.c
and can then get wext handlers without CONFIG_WIRELESS_EXT.
Wireless extensions procfs support is auto-selected
based on PROC_FS and anything that requires the wext core
(i.e. WIRELESS_EXT or CFG80211_WEXT).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Linux keeps scan results up to 15 seconds. This can be a problem for fast
moving clients: they get back stale data. But if the kernel reports the age
of the BSS items, then user-space can simply weed out old entries by itself.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
kfree_skb() should be used to free struct sk_buff pointers.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When receiving data frames, we can send them only to
the interface they belong to based on transmitting
station (this doesn't work for probe requests). Also,
don't try to handle other frames for AP_VLAN at all
since those interface should only receive data.
Additionally, the transmit side must check that the
station we're sending a frame to is actually on the
interface we're transmitting on, and not transmit
packets to functions that live on other interfaces,
so validate that as well.
Another bug fix is needed in sta_info.c where in the
VLAN case when adding/removing stations we overwrite
the sdata variable we still need.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This fixes a bug with arp_notify.
If arp_notify is enabled, kernel will crash if address is changed
and no IP address is assigned.
http://bugzilla.kernel.org/show_bug.cgi?id=14330
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All usages of structure net_proto_ops should be declared const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Friday 02 October 2009 20:53:51 you wrote:
> This is good although I would have shortened the name.
Ah, I knew I forgot something :) Here is v4.
tavi
>From 24d96d825b9fa832b22878cc6c990d5711968734 Mon Sep 17 00:00:00 2001
From: Octavian Purdila <opurdila@ixiacom.com>
Date: Fri, 2 Oct 2009 00:51:15 +0300
Subject: [PATCH] ipv6: new sysctl for sending TLLAO with unicast NAs
Neighbor advertisements responding to unicast neighbor solicitations
did not include the target link-layer address option. This patch adds
a new sysctl option (disabled by default) which controls whether this
option should be sent even with unicast NAs.
The need for this arose because certain routers expect the TLLAO in
some situations even as a response to unicast NS packets.
Moreover, RFC 2461 recommends sending this to avoid a race condition
(section 4.4, Target link-layer address)
Signed-off-by: Cosmin Ratiu <cratiu@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atis Elsts wrote:
> Not sure if there is need to fill the mark from skb in tunnel xmit functions. In any case, it's not done for GRE or IPIP tunnels at the moment.
Ok, I'll just drop that part, I'm not sure what should be done in this case.
> Also, in this patch you are doing that for SIT (v6-in-v4) tunnels only, and not doing it for v4-in-v6 or v6-in-v6 tunnels. Any reason for that?
I just sent that patch out too quickly, here's a better one with the updates.
Add support for IPv6 route lookups using sk_mark.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After updating firmware stored in flash, users may wish to reset the
relevant hardware and start the new firmware immediately. This should
not be completely automatic as it may be disruptive.
A selective reset may also be useful for debugging or diagnostics.
This adds a separate reset operation which takes flags indicating the
components to be reset. Drivers are allowed to reset only a subset of
those requested, and must indicate the actual subset. This allows the
use of generic component masks and some future expansion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski a écrit :
>
>
> Hmm... So you made me to do some "real" work here, and guess what?:
> there is one serious checkpatch warning! ;-) Plus, this new parameter
> should be added to the function description. Otherwise:
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Thanks,
> Jarek P.
>
> PS: I guess full "Don't" would show we really mean it...
Okay :) Here is the last round, before the night !
Thanks again
[RFC] pkt_sched: gen_estimator: Don't report fake rate estimators
We currently send TCA_STATS_RATE_EST elements to netlink users, even if no estimator
is running.
# tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 112833764978 bytes 1495081739 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
User has no way to tell if the "rate 0bit 0pps" is a real estimation, or a fake
one (because no estimator is active)
After this patch, tc command output is :
$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 561075 bytes 1196 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
We add a parameter to gnet_stats_copy_rate_est() function so that
it can use gen_estimator_active(bstats, r), as suggested by Jarek.
This parameter can be NULL if check is not necessary, (htb for
example has a mandatory rate estimator)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a followup on this area, thanks.
[RFC] af_packet: fill skb->mark at xmit
skb->mark may be used by classifiers, so fill it in case user
set a SO_MARK option on socket.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
deploy IPv6 unicast service to IPv4 sites to which it provides
customer premise equipment. Like 6to4, it utilizes stateless IPv6 in
IPv4 encapsulation in order to transit IPv4-only network
infrastructure. Unlike 6to4, a 6rd service provider uses an IPv6
prefix of its own in place of the fixed 6to4 prefix.
With this option enabled, the SIT driver offers 6rd functionality by
providing additional ioctl API to configure the IPv6 Prefix for in
stead of static 2002::/16 for 6to4.
Original patch was done by Alexandre Cassen <acassen@freebox.fr>
based on old Internet-Draft.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When routing daemon wants to enable forwarding of multicast traffic it
performs something like:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = 0,
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_addr = ip, /* <--- ip address of physical
interface, e.g. eth0 */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
This leads (in the kernel) to calling vif_add() function call which
search the (physical) device using assigned IP address:
dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
The current API (struct vifctl) does not allow to specify an
interface other way than using it's IP, and if there are more than a
single interface with specified IP only the first one will be found.
The attached patch (against 2.6.30.4) allows to specify an interface
by its index, instead of IP address:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = VIFF_USE_IFINDEX, /* NEW */
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_ifindex = if_nametoindex("eth0"), /* NEW */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
Signed-off-by: Ilia K. <mail4ilia@gmail.com>
=== modified file 'include/linux/mroute.h'
Signed-off-by: David S. Miller <davem@davemloft.net>
An incoming datagram must bring into cpu cache *lot* of cache lines,
in particular : (other parts omitted (hash chains, ip route cache...))
On 32bit arches :
offsetof(struct sock, sk_rcvbuf) =0x30 (read)
offsetof(struct sock, sk_lock) =0x34 (rw)
offsetof(struct sock, sk_sleep) =0x50 (read)
offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
offsetof(struct sock, sk_receive_queue)=0x74 (rw)
offsetof(struct sock, sk_forward_alloc)=0x98 (rw)
offsetof(struct sock, sk_callback_lock)=0xcc (rw)
offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
offsetof(struct sock, sk_filter) =0xf8 (read)
offsetof(struct sock, sk_socket) =0x138 (read)
offsetof(struct sock, sk_data_ready) =0x15c (read)
We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
with no fasync() structures. (socket->fasync_list ptr is probably already in cache
because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
This avoids one cache line load per incoming packet for common cases (no fasync())
We can leave (or even move in a future patch) sk->sk_socket in a cold location
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of drivers (recently including cfg80211-based ones)
assume that all wireless handlers, including statistics, can
sleep and they often also implicitly assume that the rtnl is
held around their invocation. This is almost always true now
except when reading from sysfs:
BUG: sleeping function called from invalid context at kernel/mutex.c:280
in_atomic(): 1, irqs_disabled(): 0, pid: 10450, name: head
2 locks held by head/10450:
#0: (&buffer->mutex){+.+.+.}, at: [<c10ceb99>] sysfs_read_file+0x24/0xf4
#1: (dev_base_lock){++.?..}, at: [<c12844ee>] wireless_show+0x1a/0x4c
Pid: 10450, comm: head Not tainted 2.6.32-rc3 #1
Call Trace:
[<c102301c>] __might_sleep+0xf0/0xf7
[<c1324355>] mutex_lock_nested+0x1a/0x33
[<f8cea53b>] wdev_lock+0xd/0xf [cfg80211]
[<f8cea58f>] cfg80211_wireless_stats+0x45/0x12d [cfg80211]
[<c13118d6>] get_wireless_stats+0x16/0x1c
[<c12844fe>] wireless_show+0x2a/0x4c
Fix this by using the rtnl instead of dev_base_lock.
Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports the link-speed (in Mbps) and duplex of an interface
via sysfs. This eliminates the need to use ethtool just to check the
link-speed. Not requiring 'ethtool' and not relying on the SIOCETHTOOL
ioctl should be helpful in an embedded environment where space is at a
premium as well.
NOTE: This patch also intentionally allows non-root users to check the link
speed and duplex -- something not possible with ethtool.
Here's some sample output:
# cat /sys/class/net/eth0/speed
100
# cat /sys/class/net/eth0/duplex
half
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: Not reported
Advertised auto-negotiation: No
Speed: 100Mb/s
Duplex: Half
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
Link detected: yes
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having to modify every non-mac80211 for device type assignment,
do this inside the netdev notifier callback of cfg80211. So all drivers
that integrate with cfg80211 will export a proper device type.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation before
before netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently dirty a cache line to update tunnel device stats
(tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
counters that already are present in cpu cache, in the cache
line shared with txq->_xmit_lock
This patch extends IPTUNNEL_XMIT() macro to use txq pointer
provided by the caller.
Also &tunnel->dev->stats can be replaced by &dev->stats
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB algorithim for IPV4 is set at compile time, but kernel goes through
the overhead of function call indirection at runtime. Save some
cycles by turning the indirect calls to direct calls to either
hash or trie code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Ancilliary data to better represent loss information
I've had a few requests recently to provide more detail regarding frame loss
during an AF_PACKET packet capture session. Specifically the requestors want to
see where in a packet sequence frames were lost, i.e. they want to see that 40
frames were lost between frames 302 and 303 in a packet capture file. In order
to do this we need:
1) The kernel to export this data to user space
2) The applications to make use of it
This patch addresses item (1). It does this by doing the following:
A) Anytime we drop a frame for which we would increment po->stats.tp_drops, we
also no increment a stats called po->stats.tp_gap.
B) Every time we successfully enqueue a frame to sk_receive_queue, we record the
value of po->stats.tp_gap in skb->mark. skb->cb would nominally be the place to
record this, but since all the space there is used up, we're overloading
skb->mark. Its safe to do since any enqueued packet is guaranteed to be
unshared at this point, and skb->mark isn't used for anything else in the rx
path to the application. After we record tp_gap in the skb, we zero
po->stats.tp_gap. This allows us to keep a counter of the number of frames lost
between any two enqueued packets
C) When the application goes to dequeue a frame from the packet socket, we look
at skb->mark for that frame. If it is non-zero, we add a cmsg chunk to the
msghdr of level SOL_PACKET and type PACKET_GAPDATA. Its a 32 bit integer that
represents the number of frames lost between this packet and the last previous
frame received.
Note there is a chance that if there is frame loss after a receive, and then the
socket is closed, some gap data might be lost. This is covered by the use of
the PACKET_AUXDATA socket option, which gives total loss data. With a bit of
math, the final gap can be determined that way.
I've tested this patch myself, and it works well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
include/linux/if_packet.h | 2 ++
net/packet/af_packet.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>
We can avoid two atomic ops on skb->users if packet is not going to be
sent to the device (because hardware txqueue is full)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can make icmp messages tx completion callback a litle bit faster.
Setting SOCK_USE_WRITE_QUEUE sk flag tells sock_wfree() to
not call sk_write_space() on a socket we know no thread is posssibly
waiting for write space. (on per cpu kernel internal icmp sockets only)
This avoids the sock_def_write_space() call and
read_lock(&sk->sk_callback_lock)/read_unlock(&sk->sk_callback_lock) calls
as well.
We avoid three atomic ops.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The in-tree implementations have all been converted to
get_sset_count().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit fd29cf72 (pktgen: convert to use ktime_t)
inadvertantly converted "delay" parameter from nanosec to microsec.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not currently possible to instruct pktgen to use one selected tx queue.
When Robert added multiqueue support in commit 45b270f8, he added
an interval (queue_map_min, queue_map_max), and his code doesnt take
into account the case of min = max, to select one tx queue exactly.
I suspect a high performance setup on a eight txqueue device wants
to use exactly eight cpus, and assign one tx queue to each sender.
This patchs makes pktgen select the right tx queue, not the first one.
Also updates Documentation to reflect Robert changes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits)
cnic: Fix NETDEV_UP event processing.
uvesafb/connector: Disallow unpliviged users to send netlink packets
pohmelfs/connector: Disallow unpliviged users to configure pohmelfs
dst/connector: Disallow unpliviged users to configure dst
dm/connector: Only process connector packages from privileged processes
connector: Removed the destruct_data callback since it is always kfree_skb()
connector/dm: Fixed a compilation warning
connector: Provide the sender's credentials to the callback
connector: Keep the skb in cn_callback_data
e1000e/igb/ixgbe: Don't report an error if devices don't support AER
net: Fix wrong sizeof
net: splice() from tcp to pipe should take into account O_NONBLOCK
net: Use sk_mark for routing lookup in more places
sky2: irqname based on pci address
skge: use unique IRQ name
IPv4 TCP fails to send window scale option when window scale is zero
net/ipv4/tcp.c: fix min() type mismatch warning
Kconfig: STRIP: Remove stale bits of STRIP help text
NET: mkiss: Fix typo
tg3: Remove prev_vlan_tag from struct tx_ring_info
...
tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag
Before this patch :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE);
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.
User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.
Many projects cannot use splice(tcp -> pipe) because of this flaw.
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEADhttp://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html
Linus introduced SPLICE_F_NONBLOCK in commit 29e350944f
(splice: add SPLICE_F_NONBLOCK flag )
It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.
Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only
This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.
Users will then call :
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK );
to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)
First version of this patch was submitted by Octavian Purdila
Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acknowledge TCP window scale support by inserting the proper option in SYN/ACK
and SYN headers even if our window scale is zero.
This fixes the following observed behavior:
1. Client sends a SYN with TCP window scaling option and non zero window scale
value to a Linux box.
2. Linux box notes large receive window from client.
3. Linux decides on a zero value of window scale for its part.
4. Due to compare against requested window scale size option, Linux does not to
send windows scale TCP option header on SYN/ACK at all.
With the following result:
Client box thinks TCP window scaling is not supported, since SYN/ACK had no
TCP window scale option, while Linux thinks that TCP window scaling is
supported (and scale might be non zero), since SYN had TCP window scale
option and we have a mismatched idea between the client and server
regarding window sizes.
Probably it also fixes up the following bug (not observed in practice):
1. Linux box opens TCP connection to some server.
2. Linux decides on zero value of window scale.
3. Due to compare against computed window scale size option, Linux does
not to set windows scale TCP option header on SYN.
With the expected result that the server OS does not use window scale option
due to not receiving such an option in the SYN headers, leading to suboptimal
performance.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/tcp.c: In function 'do_tcp_setsockopt':
net/ipv4/tcp.c:2050: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After last pktgen changes, delay handling is wrong.
pktgen actually sends packets at full line speed.
Fix is to update pkt_dev->next_tx even if spin() returns early,
so that next spin() calls have a chance to see a positive delay.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ax25_make_new, if kmemdup of digipeat returns an error, there would
be an oops in sk_free while calling sk_destruct, because sk_protinfo
is NULL at the moment; move sk->sk_destruct initialization after this.
BTW of reported-by: Bernard Pidoux F6BVP <f6bvp@free.fr>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 9b22ea5609
( net: fix packet socket delivery in rx irq handler )
We lost rx timestamping of packets received on accelerated vlans.
Effect is that tcpdump on real dev can show strange timings, since it gets rx timestamps
too late (ie at skb dequeueing time, not at skb queueing time)
14:47:26.986871 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 1
14:47:26.986786 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 1
14:47:27.986888 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 2
14:47:27.986781 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 2
14:47:28.986896 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 3
14:47:28.986780 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 3
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
port_mutex was unlocked twice.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When requesting all prl entries (kprl.addr == INADDR_ANY) and there are
more prl entries than there is space passed from userspace, the existing
code would always copy cmax+1 entries, which is more than can be handled.
This patch makes the kernel copy only exactly cmax entries.
Signed-off-by: Sascha Hlusiak <contact@saschahlusiak.de>
Acked-By: Fred L. Templin <Fred.L.Templin@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2b85a34e91
(net: No more expensive sock_hold()/sock_put() on each tx)
opens a window in sock_wfree() where another cpu
might free the socket we are working on.
A fix is to call sk->sk_write_space(sk) while still
holding a reference on sk.
Reported-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.
Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (33 commits)
sony-laptop: re-read the rfkill state when resuming from suspend
sony-laptop: check for rfkill hard block at load time
wext: add back wireless/ dir in sysfs for cfg80211 interfaces
wext: Add bound checks for copy_from_user
mac80211: improve/fix mlme messages
cfg80211: always get BSS
iwlwifi: fix 3945 ucode info retrieval after failure
iwlwifi: fix memory leak in command queue handling
iwlwifi: fix debugfs buffer handling
cfg80211: don't set privacy w/o key
cfg80211: wext: don't display BSSID unless associated
net: Add explicit bound checks in net/socket.c
bridge: Fix double-free in br_add_if.
isdn: fix netjet/isdnhdlc build errors
atm: dereference of he_dev->rbps_virt in he_init_group()
ax25: Add missing dev_put in ax25_setsockopt
Revert "sit: stateless autoconf for isatap"
net: fix double skb free in dcbnl
net: fix nlmsg len size for skb when error bit is set.
net: fix vlan_get_size to include vlan_flags size
...
Consider the following step-by step:
1. A STA authenticates and associates with the AP and exchanges
traffic.
2. The STA reports to the AP that it is going to PS state.
3. Some time later the STA device goes to the stand-by mode (not only
its wi-fi card, but the device itself) and drops the association state
without sending a disassociation frame.
4. The STA device wakes up and begins authentication with an
Auth frame as it hasn't been authenticated/associated previously.
At the step 4 the AP "remembers" the STA and considers it is still in
the PS state, so the AP buffers frames, which it has to send to the STA.
But the STA isn't actually in the PS state and so it neither checks
TIM bits nor reports to the AP that it isn't power saving.
Because of that authentication/[re]association fails.
To fix authentication/[re]association stage of this issue, Auth, Assoc
Resp and Reassoc Resp frames are transmitted disregarding of STA's power
saving state.
N.B. This patch doesn't fix further data frame exchange after
authentication/[re]association. A patch in hostapd is required to fix
that.
Signed-off-by: Igor Perminov <igor.perminov@inbox.ru>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The move away from having drivers assign wireless handlers,
in favour of making cfg80211 assign them, broke the sysfs
registration (the wireless/ dir went missing) because the
handlers are now assigned only after registration, which is
too late.
Fix this by special-casing cfg80211-based devices, all
of which are required to have an ieee80211_ptr, in the
sysfs code, and also using get_wireless_stats() to have
the same values reported as in procfs.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Tested-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The wireless extensions have a copy_from_user to a local stack
array "essid", but both me and gcc have failed to find where
the bounds for this copy are located in the code.
This patch adds some basic sanity checks for the copy length
to make sure that we don't overflow the stack buffer.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It's useful to know the MAC address when being
disassociated; fix a typo (missing colon) and
move some messages so we get them only when they
are actually taking effect.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Multiple problems were reported due to interaction
between wpa_supplicant and the wext compat code in
cfg80211, which appear to be due to it not getting
any bss pointer here when wpa_supplicant sets all
parameters -- do that now. We should still get the
bss after doing an extra scan, but that appears to
increase the time we need for connecting enough to
sometimes cause timeouts.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Hin-Tak Leung <hintak.leung@gmail.com>,
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When wpa_supplicant is used to connect to open networks,
it causes the wdev->wext.keys to point to key memory, but
that key memory is all empty. Only use privacy when there
is a default key to be used.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Tested-by: Kalle Valo <kalle.valo@iki.fi>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently, cfg80211's SIOCGIWAP implementation returns
the BSSID that the user set, even if the connection has
since been dropped due to other changes. It only should
return the current BSSID when actually connected.
Also do a small code cleanup.
Reported-by: Thomas H. Guenther <thomas.h.guenther@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Thomas H. Guenther <thomas.h.guenther@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The sys_socketcall() function has a very clever system for the copy
size of its arguments. Unfortunately, gcc cannot deal with this in
terms of proving that the copy_from_user() is then always in bounds.
This is the last (well 9th of this series, but last in the kernel) such
case around.
With this patch, we can turn on code to make having the boundary provably
right for the whole kernel, and detect introduction of new security
accidents of this type early on.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a potential double-kfree in net/bridge/br_if.c. If br_fdb_insert
fails, then the kobject is put back (which calls kfree due to the kobject
release), and then kfree is called again on the net_bridge_port. This
patch fixes the crash.
Thanks to Stephen Hemminger for the one-line fix.
Signed-off-by: Jeff Hansen <x@jeffhansen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ax25_setsockopt SO_BINDTODEVICE is missing a dev_put call in case of
success. Re-order code to fix this bug. While at it also reformat two
lines of code to comply with the Linux coding style.
Initial patch by Jarek Poplawski <jarkao2@gmail.com>.
Reported-by: Bernard Pidoux F6BVP <f6bvp@free.fr>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 645069299a.
While the code does not actually break anything, it does not completely follow
RFC5214 yet. After talking back with Fred L. Templin, I agree that completing the
ISATAP specific RS/RA code, would pollute the kernel a lot with code that is better
implemented in userspace.
The kernel should not send RS packages for ISATAP at all.
Signed-off-by: Sascha Hlusiak <contact@saschahlusiak.de>
Acked-by: Fred L. Templin <Fred.L.Templin@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink_unicast() calls kfree_skb even in the error case.
dcbnl calls netlink_unicast() which when it fails free's the
skb and returns an error value. dcbnl is free'ing the skb
again when this error occurs. This patch removes the double
free.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the nlmsg->len field is not set correctly in netlink_ack()
for ack messages that include the nlmsg of the error frame. This
corrects the length field passed to __nlmsg_put to use the correct
payload size.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix vlan_get_size to include vlan->flags. Currently, the
size of the vlan flags is not included in the nlmsg size.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ax25_cb_put after ax25_find_cb in ax25_ctl_ioctl.
Reported-by: Bernard Pidoux F6BVP <f6bvp@free.fr>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Reviewed-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit d136f1bd36,
there's a bug when unregistering a generic netlink family,
which is caught by the might_sleep() added in that commit:
BUG: sleeping function called from invalid context at net/netlink/af_netlink.c:183
in_atomic(): 1, irqs_disabled(): 0, pid: 1510, name: rmmod
2 locks held by rmmod/1510:
#0: (genl_mutex){+.+.+.}, at: [<ffffffff8138283b>] genl_unregister_family+0x2b/0x130
#1: (rcu_read_lock){.+.+..}, at: [<ffffffff8138270c>] __genl_unregister_mc_group+0x1c/0x120
Pid: 1510, comm: rmmod Not tainted 2.6.31-wl #444
Call Trace:
[<ffffffff81044ff9>] __might_sleep+0x119/0x150
[<ffffffff81380501>] netlink_table_grab+0x21/0x100
[<ffffffff813813a3>] netlink_clear_multicast_users+0x23/0x60
[<ffffffff81382761>] __genl_unregister_mc_group+0x71/0x120
[<ffffffff81382866>] genl_unregister_family+0x56/0x130
[<ffffffffa0007d85>] nl80211_exit+0x15/0x20 [cfg80211]
[<ffffffffa000005a>] cfg80211_exit+0x1a/0x40 [cfg80211]
Fix in the same way by grabbing the netlink table lock
before doing rcu_read_lock().
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems recursion field from "struct ip_tunnel" is not anymore needed.
recursion prevention is done at the upper level (in dev_queue_xmit()),
since we use HARD_TX_LOCK protection for tunnels.
This avoids a cache line ping pong on "struct ip_tunnel" : This structure
should be now mostly read on xmit and receive paths.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DOCPROC Documentation/DocBook/networking.xml
Warning(net/sunrpc/clnt.c:647): No description found for parameter 'req'
Warning(net/sunrpc/clnt.c:647): No description found for parameter 'tk_ops'
Warning(net/sunrpc/clnt.c:647): Excess function parameter 'ops' description in 'rpc_run_bc_task'
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we ever implement this, then we can stop returning an error.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocating a port number to a socket and hashing that socket shall be
an atomic operation with regards to other port allocation. Otherwise,
we could allocate a port that is already being allocated to another
socket.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous update did not resched in inner loop causing watchdogs.
Rewrite inner loop to:
* account for delays better with less clock calls
* more accurate timing of delay:
- only delay if packet was successfully sent
- if delay is 100ns and it takes 10ns to build packet then
account for that
* use wait_event_interruptible_timeout rather than open coding it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to man page of setsockopt, if optlen is not valid, kernel should return
-EINVAL. But a simple testcase as following, errno is 0, which means setsockopt
is successful.
addr.s_addr = inet_addr("192.1.2.3");
setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, &addr, 1);
printf("errno is %d\n", errno);
Xiaotian Feng(dfeng@redhat.com) caught the bug. We fix it firstly checking
the availability of optlen and then dealing with the logic like other options.
Reported-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[[resending with correct cc: - "vfs.kernel.org" just isn't right!]]
xprt->reestablish_timeout is used to cause TCP connection attempts to
back off if the connection fails so as not to hammer the network,
but to still allow immediate connections when there is no reason to
believe there is a problem.
It is not used for the first connection (when transport->sock is NULL)
but only on reconnects.
It is currently set:
a/ to 0 when xs_tcp_state_change finds a state of TCP_FIN_WAIT1
on the assumption that the client has closed the connection
so the reconnect should be immediate when needed.
b/ to at least XS_TCP_INIT_REEST_TO when xs_tcp_state_change
detects TCP_CLOSING or TCP_CLOSE_WAIT on the assumption that the
server closed the connection so a small delay at least is
required.
c/ as above when xs_tcp_state_change detects TCP_SYN_SENT, so that
it is never 0 while a connection has been attempted, else
the doubling will produce 0 and there will be no backoff.
d/ to double is value (up to a limit) when delaying a connection,
thus providing exponential backoff and
e/ to XS_TCP_INIT_REEST_TO in xs_setup_tcp as simple initialisation.
So you can see it is highly dependant on xs_tcp_state_change being
called as expected. However experimental evidence shows that
xs_tcp_state_change does not see all state changes.
("rpcdebug -m rpc trans" can help show what actually happens).
Results show:
TCP_ESTABLISHED is reported when a connection is made. TCP_SYN_SENT
is never reported, so rule 'c' above is never effective.
When the server closes the connection, TCP_CLOSE_WAIT and
TCP_LAST_ACK *might* be reported, and TCP_CLOSE is always
reported. This rule 'b' above will sometimes be effective, but
not reliably.
When the client closes the connection, it used to result in
TCP_FIN_WAIT1, TCP_FIN_WAIT2, TCP_CLOSE. However since commit
f75e674 (SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on
reconnect) we don't see *any* events on client-close. I think this
is because xs_restore_old_callbacks is called to disconnect
xs_tcp_state_change before the socket is closed.
In any case, rule 'a' no longer applies.
So all that is left are rule d, which successfully doubles the
timeout which is never rest, and rule e which initialises the timeout.
Even if the rules worked as expected, there would be a problem because
a successful connection does not reset the timeout, so a sequence
of events where the server closes the connection (e.g. during failover
testing) will cause longer and longer timeouts with no good reason.
This patch:
- sets reestablish_timeout to 0 in xs_close thus effecting rule 'a'
- sets it to 0 in xs_tcp_data_ready to ensure that a successful
connection resets the timeout
- sets it to at least XS_TCP_INIT_REEST_TO after it is doubled,
thus effecting rule c
I have not reimplemented rule b and the new version of rule c
seems sufficient.
I suspect other code in xs_tcp_data_ready needs to be revised as well.
For example I don't think connect_cookie is being incremented as often
as it should be.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
lguest: don't force VIRTIO_F_NOTIFY_ON_EMPTY
lguest: cleanup for map_switcher()
lguest: use PGDIR_SHIFT for PAE code to allow different PAGE_OFFSET
lguest: use set_pte/set_pmd uniformly for real page table entries
lguest: move panic notifier registration to its expected place.
virtio_blk: add support for cache flush
virtio: add virtio IDs file
virtio: get rid of redundant VIRTIO_ID_9P definition
virtio: make add_buf return capacity remaining
virtio_pci: minor MSI-X cleanups
When cfg80211 is instructed to connect, it always
uses the default WEP key for the privacy setting,
which clearly is wrong when using wpa_supplicant.
Don't overwrite the setting, and rely on it being
false when wpa_supplicant is not running, instead
set it to true when we have keys.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>