don't clone skb when skb isn't shared
When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
to clone a new skb. As the skb will be freed after this function returns, we
can use it freely once we get a reference to it.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/net/sch_generic.h | 11 +++++++++--
net/sched/act_mirred.c | 6 +++---
2 files changed, 12 insertions(+), 5 deletions(-)
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows use of ECN when syncookies are in effect by encoding ecn_ok
into the syn-ack tcp timestamp.
While at it, remove a uneeded #ifdef CONFIG_SYN_COOKIES.
With CONFIG_SYN_COOKIES=nm want_cookie is ifdef'd to 0 and gcc
removes the "if (0)".
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.
Callers can use __alignof__(type) to provide correct
alignment.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this patch is implementing IP_NODEFRAG option for IPv4 socket.
The reason is, there's no other way to send out the packet with user
customized header of the reassembly part.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a typo in include/net/netlink.h
should be finalize instead of finanlize
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit aa2ea0586d (tcp: fix outsegs stat for TSO segments) incorrectly
assumed SNMP_ADD_STATS() was used from BH context.
Fix this using mib[!in_softirq()] instead of mib[0]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously CAIF supported maximum transfer size of ~4050.
The transfer size is now calculated dynamically based on the
link layers mtu size.
Signed-off-by: Sjur Braendeland@stericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
CAIF Remote File Manager may send or receive more than 4050 bytes.
Due to this The CAIF RFM service have to support segmentation.
Signed-off-by: Sjur Braendeland@stericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Flow control is not used by all CAIF services.
The usage of flow control is now part of the gerneal
initialization function for CAIF Services.
Signed-off-by: Sjur Braendeland@stericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
In unix_skb_parms store pointers to struct pid and struct cred instead
of raw uid, gid, and pid values, then translate the credentials on
reception into values that are meaningful in the receiving processes
namespaces.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Start capturing not only the userspace pid, uid and gid values of the
sending process but also the struct pid and struct cred of the sending
process as well.
This is in preparation for properly supporting SCM_CREDENTIALS for
sockets that have different uid and/or pid namespaces at the different
ends.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use struct pid and struct cred to store the peer credentials on struct
sock. This gives enough information to convert the peer credential
information to a value relative to whatever namespace the socket is in
at the time.
This removes nasty surprises when using SO_PEERCRED on socket
connetions where the processes on either side are in different pid and
user namespaces.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reorder the fields in scm_cookie so they pack better on 64bit.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discard the ACK if we find options that do not match current sysctl
settings.
Previously it was possible to create a connection with sack, wscale,
etc. enabled even if the feature was disabled via sysctl.
Also remove an unneeded call to tcp_sack_reset() in
cookie_check_timestamp: Both call sites (cookie_v4_check,
cookie_v6_check) zero "struct tcp_options_received", hand it to
tcp_parse_options() (which does not change tcp_opt->num_sacks/dsack)
and then call cookie_check_timestamp().
Even if num_sacks/dsacks were changed, the structure is allocated on
the stack and after cookie_check_timestamp returns only a few selected
members are copied to the inet_request_sock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches.
Thats a bit unfortunate, since old size was exactly 64 bytes.
This can be solved, using an union between this rcu_head an four fields,
that are normally used only when a refcount is taken on inet_peer.
rcu_head is used only when refcnt=-1, right before structure freeing.
Add a inet_peer_refcheck() function to check this assertion for a while.
We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inetpeer currently uses an AVL tree protected by an rwlock.
It's possible to make most lookups use RCU
1) Add a struct rcu_head to struct inet_peer
2) add a lookup_rcu_bh() helper to perform lockless and opportunistic
lookup. This is a normal function, not a macro like lookup().
3) Add a limit to number of links followed by lookup_rcu_bh(). This is
needed in case we fall in a loop.
4) add an smp_wmb() in link_to_pool() right before node insert.
5) make unlink_from_pool() use atomic_cmpxchg() to make sure it can take
last reference to an inet_peer, since lockless readers could increase
refcount, even while we hold peers.lock.
6) Delay struct inet_peer freeing after rcu grace period so that
lookup_rcu_bh() cannot crash.
7) inet_getpeer() first attempts lockless lookup.
Note this lookup can fail even if target is in AVL tree, but a
concurrent writer can let tree in a non correct form.
If this attemps fails, lock is taken a regular lookup is performed
again.
8) convert peers.lock from rwlock to a spinlock
9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because
rcu_head adds 16 bytes on 64bit arches, doubling effective size (64 ->
128 bytes)
In a future patch, this is probably possible to revert this part, if rcu
field is put in an union to share space with rid, ip_id_count, tcp_ts &
tcp_ts_stamp. These fields being manipulated only with refcnt > 0.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ps-qos latency handling is broken. It uses predetermined latency values
to select specific dynamic PS timeouts. With common AP configurations, these
values overlap with beacon interval and are therefore essentially useless
(for network latencies less than the beacon interval, PSM is disabled.)
This patch remedies the problem by replacing the predetermined network latency
values with one high value (1900ms) which is used to go trigger full psm. For
backwards compatibility, the value 2000ms is still mapped to a dynamic ps
timeout of 100ms.
Currently also the mac80211 internal value for storing user space configured
dynamic PSM values is incorrectly in the driver visible ieee80211_conf struct.
Move it to the ieee80211_local struct.
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There is a circular locking dependency when configuring the
hardware ARP filters on association, occurring when flushing the mac80211
workqueue. This is what happens:
[ 92.026800] =======================================================
[ 92.030507] [ INFO: possible circular locking dependency detected ]
[ 92.030507] 2.6.34-04781-g2b2c009 #85
[ 92.030507] -------------------------------------------------------
[ 92.030507] modprobe/5225 is trying to acquire lock:
[ 92.030507] ((wiphy_name(local->hw.wiphy))){+.+.+.}, at: [<ffffffff8105b5c0>] flush_workq
ueue+0x0/0xb0
[ 92.030507]
[ 92.030507] but task is already holding lock:
[ 92.030507] (rtnl_mutex){+.+.+.}, at: [<ffffffff812b9ce2>] rtnl_lock+0x12/0x20
[ 92.030507]
[ 92.030507] which lock already depends on the new lock.
[ 92.030507]
[ 92.030507]
[ 92.030507] the existing dependency chain (in reverse order) is:
[ 92.030507]
[ 92.030507] -> #2 (rtnl_mutex){+.+.+.}:
[ 92.030507] [<ffffffff810761fb>] lock_acquire+0xdb/0x110
[ 92.030507] [<ffffffff81341754>] mutex_lock_nested+0x44/0x300
[ 92.030507] [<ffffffff812b9ce2>] rtnl_lock+0x12/0x20
[ 92.030507] [<ffffffffa022d47c>] ieee80211_assoc_done+0x6c/0xe0 [mac80211]
[ 92.030507] [<ffffffffa022f2ad>] ieee80211_work_work+0x31d/0x1280 [mac80211]
[ 92.030507] -> #1 ((&local->work_work)){+.+.+.}:
[ 92.030507] [<ffffffff810761fb>] lock_acquire+0xdb/0x110
[ 92.030507] [<ffffffff8105a51a>] worker_thread+0x22a/0x370
[ 92.030507] [<ffffffff8105ecc6>] kthread+0x96/0xb0
[ 92.030507] [<ffffffff81003a94>] kernel_thread_helper+0x4/0x10
[ 92.030507]
[ 92.030507] -> #0 ((wiphy_name(local->hw.wiphy))){+.+.+.}:
[ 92.030507] [<ffffffff81075fdc>] __lock_acquire+0x1c0c/0x1d50
[ 92.030507] [<ffffffff810761fb>] lock_acquire+0xdb/0x110
[ 92.030507] [<ffffffff8105b60e>] flush_workqueue+0x4e/0xb0
[ 92.030507] [<ffffffffa023ff7b>] ieee80211_stop_device+0x2b/0xb0 [mac80211]
[ 92.030507] [<ffffffffa0231635>] ieee80211_stop+0x3e5/0x680 [mac80211]
The locking in this case is quite complex. Fix the problem by rewriting the
way the hardware ARP filter list is handled - i.e. make a copy of the address
list to the bss_conf struct, and provide that list to the hardware driver
when needed.
The current patch will enable filtering also in promiscuous mode. This may need
to be changed in the future.
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch adds support to nl80211 and mac80211 to set basic rates when
joining/creating ibss network.
Original patch was posted by Johannes Berg on the linux-wireless posting list.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Teemu Paasikivi <ext-teemu.3.paasikivi@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Allow drivers to sleep, and indicate this in
the documentation. ath9k has some locking I
don't understand, so keep it safe and disable
BHs in it, all other drivers look fine with
the context change.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The non-irqsafe aggregation start/stop done
callbacks are currently only used by ath9k_htc,
and can cause callbacks into the driver again.
This might lead to locking issues, which will
only get worse as we modify locking. To avoid
trouble, remove the non-irqsafe versions and
change ath9k_htc to use those instead.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
gen_kill_estimator() API is incomplete or not well documented, since
caller should make sure an RCU grace period is respected before
freeing stats_lock.
This was partially addressed in commit 5d944c640b
(gen_estimator: deadlock fix), but same problem exist for all
gen_kill_estimator() users, if lock they use is not already RCU
protected.
A code review shows xt_RATEEST.c, act_api.c, act_police.c have this
problem. Other are ok because they use qdisc lock, already RCU
protected.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
remove useless union keyword in rtable, rt6_info and dn_route.
Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 66018506e1 (ip: Router Alert RCU conversion) introduced RCU
lookups to ip_call_ra_chain(). It missed proper deinit phase :
When ip_ra_control() deletes an ip_ra_chain, it should make sure
ip_call_ra_chain() users can not start to use socket during the rcu
grace period. It should also delay the sock_put() after the grace
period, or we risk a premature socket freeing and corruptions, as
raw sockets are not rcu protected yet.
This delay avoids using expensive atomic_inc_not_zero() in
ip_call_ra_chain().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use call_rcu rather than synchronize_rcu.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.
This patch converts it to a per_cpu variable.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.
This patch introduces IPS_UNTRACKED bit so that we can use later a
per_cpu untrack structure more easily.
A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.
Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.
nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Fix a whole bunch of kernel-doc warnings
and errors that crop up when running it on
mac80211 and cfg80211; the latter isn't
normally done so lots of bit-rot happened.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We currently dirty two cache lines in struct xt_rateest, this hurts SMP
performance.
This patch moves lock/bstats/rstats at beginning of structure so that
they share a single cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Straightforward conversion to RCU.
One rwlock becomes a spinlock, and is static.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch address a serious performance issue in reading the
TCP sockets table (/proc/net/tcp).
Reading the full table is done by a number of sequential read
operations. At each read operation, a seek is done to find the
last socket that was previously read. This seek operation requires
that the sockets in the table need to be counted up to the current
file position, and to count each of these requires taking a lock for
each non-empty bucket. The whole algorithm is O(n^2).
The fix is to cache the last bucket value, offset within the bucket,
and the file position returned by the last read operation. On the
next sequential read, the bucket and offset are used to find the
last read socket immediately without needing ot scan the previous
buckets the table. This algorithm t read the whole table is O(n).
The improvement offered by this patch is easily show by performing
cat'ing /proc/net/tcp on a machine with a lot of connections. With
about 182K connections in the table, I see the following:
- Without patch
time cat /proc/net/tcp > /dev/null
real 1m56.729s
user 0m0.214s
sys 1m56.344s
- With patch
time cat /proc/net/tcp > /dev/null
real 0m0.894s
user 0m0.290s
sys 0m0.594s
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm triggers a warning if dst_pop() drops a refcount
on a noref dst. This patch changes dst_pop() to
skb_dst_pop(). skb_dst_pop() drops the refcnt only
on a refcounted dst. Also we don't clone the child
dst_entry, so it is not refcounted and we can use
skb_dst_set_noref() in xfrm_output_one().
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
STA_NOTIFY_ADD and STA_NOTIFY_REMOVE have no users anymore,
and station addition/removal are indicated to drivers
using sta_add() and sta_remove(), which can sleep.
Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's no sense in letting anything but internal
mac80211 functions set the initiator to anything
but WLAN_BACK_INITIATOR, since WLAN_BACK_RECIPIENT
is only valid when we have received a frame from
the peer, which we react to directly in mac80211.
The debugfs code I recently added got this wrong
as well.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Some hardware allow extended filtering of ARP frames not intended for
the host. To perform such filtering, the hardware needs to know the current
IP address(es) of the host, bound to its interface.
Add support for ARP filtering to mac80211 by adding a new op to the driver
interface, allowing to configure the current IP addresses. This op is called
upon association with the currently configured address(es), and when
associated whenever the IP address(es) change.
This patch adds configuration of IPv4 addresses only, as IPv6 addresses don't
need ARP filtering.
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's a single use of this struct member, but
as it is write-only it clearly not necessary.
Thus we can free up some space here, even if we
don't need it right now it seems pointless to
carry around the variable.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
cleanup patch.
Use new __packed annotation in net/ and include/
(except netfilter)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending action frames, we want to verify
that we do that on the correct channel. However,
checking the channel type in addition can get in
the way, since the channel type could change on
the fly during an association, and it's not
useful to have the channel type anyway since it
has no effect on the transmission. Therefore,
make it optional to specify so that if wanted,
it can still be checked, but is not required.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There are more than a dozen occurrences of following code in the
IPv6 stack:
if (opt && opt->srcrt) {
struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
ipv6_addr_copy(&final, &fl.fl6_dst);
ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
final_p = &final;
}
Replace those with a helper. Note that the helper overrides final_p
in all cases. This is ok as final_p was previously initialized to
NULL when declared.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When many cpus compete for sending frames on a given qdisc, the qdisc
spinlock suffers from very high contention.
The cpu owning __QDISC_STATE_RUNNING bit has same priority to acquire
the lock, and cannot dequeue packets fast enough, since it must wait for
this lock for each dequeued packet.
One solution to this problem is to force all cpus spinning on a second
lock before trying to get the main lock, when/if they see
__QDISC_STATE_RUNNING already set.
The owning cpu then compete with at most one other cpu for the main
lock, allowing for higher dequeueing rate.
Based on a previous patch from Alexander Duyck. I added the heuristic to
avoid the atomic in fast path, and put the new lock far away from the
cache line used by the dequeue worker. Also try to release the busylock
lock as late as possible.
Tests with following script gave a boost from ~50.000 pps to ~600.000
pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
(A single netperf flow can reach ~800.000 pps on this platform)
for j in `seq 0 3`; do
for i in `seq 0 7`; do
netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
done
done
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__QDISC_STATE_RUNNING is always changed while qdisc lock is held.
We can avoid two atomic operations in xmit path, if we move this bit in
a new __state container.
Location of this __state container is carefully chosen so that fast path
only dirties one qdisc cache line.
THROTTLED bit could later be moved into this __state location too, to
avoid dirtying first qdisc cache line.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>