dev_deactivate() can skip rescheduling of a qdisc by qdisc_watchdog()
or other timer calling netif_schedule() after dev_queue_deactivate().
We prevent this by checking aliveness before scheduling the timer. Since
during deactivation the root qdisc is available only as qdisc_sleeping,
an additional accessor, qdisc_root_sleeping(), is created.
With feedback from Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All of the SCTP-AUTH socket options could cause a panic
if the extension is disabled and the API is invoked.
Additionally, the code assumed that certain pointers would
always be valid, which may not always be the case.
This patch hardens the API and addresses all of the crash
scenarios.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If dev_deactivate() is trying to quiesce the queue, it
is theoretically possible for another cpu to livelock
trying to process that queue. This happens because
dev_deactivate() grabs the queue spinlock as it checks
the queue state, whereas net_tx_action() does a trylock
and reschedules the qdisc if it hits the lock.
This breaks the livelock by adding a check on
__QDISC_STATE_DEACTIVATED to net_tx_action() when
the trylock fails.
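The decision described above can be sketched in userspace C. This is only
a model, not the kernel code: struct toy_qdisc, toy_should_reschedule()
and the bit value are hypothetical, and the got_lock argument stands in
for the result of the trylock in net_tx_action().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit value is illustrative; only the flag's meaning comes from the text. */
#define TOY_QDISC_DEACTIVATED (1u << 2)

struct toy_qdisc {
	unsigned int state;
};

/*
 * Hypothetical helper: after a trylock attempt, decide whether the qdisc
 * should be put back on the schedule list.
 */
static bool toy_should_reschedule(const struct toy_qdisc *q, bool got_lock)
{
	assert(q != NULL);

	if (got_lock)
		return false;	/* we hold the lock and run the qdisc now */

	/* Lock is contended: retry only if nobody is deactivating the qdisc. */
	return !(q->state & TOY_QDISC_DEACTIVATED);
}
```

Without the second test, a contended lock would always lead to a
reschedule, which is the loop the deactivating CPU can never outrun.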
Based upon feedback from Herbert Xu and Jarek Poplawski.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 1cfa26661a.
qdisc_destroy() runs fully under RTNL again and not from softint any
longer, so this change is no longer needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d4766692e7.
qdisc_destroy() now runs in RTNL fully again, so this
change is no longer needed.
Signed-off-by: David S. Miller <davem@davemloft.net>
...Last block local var got just deleted.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use incoming network tuple as seed for NAT port randomization.
This avoids concerns of leaking net_random() bits, and also gives better
port distribution. Don't have NAT server, compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
[ added missing EXPORT_SYMBOL_GPL ]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a GFP_KERNEL allocation while holding a spin lock with
bottom halves disabled in ctnetlink_change_helper().
This problem was introduced in 2.6.23 with the netfilter extension
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix allocation with GFP_KERNEL in ctnetlink_create_conntrack() under
read-side lock sections.
This problem was introduced in 2.6.25.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we create a conntrack that has NAT handling and a helper, the helper
is assigned twice. This happens because nf_nat_setup_info() - via
nf_conntrack_alter_reply() - sets the helper before ctnetlink, which
does not check whether the conntrack already has a helper because it
assumes that it is a brand new conntrack.
The fix moves the helper assignment before the setting of the status flags.
This avoids a bogus assertion in __nf_ct_ext_add (if netfilter assertions are
enabled) which checks that the conntrack must not be confirmed.
This problem was introduced in 2.6.23 with the netfilter extension
infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes matching of inverted destination address type.
Signed-off-by: Anders Grafström <grfstrm@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks are due to Wei Yongjun for the detailed analysis and description of this
bug at http://marc.info/?l=dccp&m=121739364909199&w=2
The problem is that invalid packets received by a client in state REQUEST cause
the retransmission timer for the DCCP-Request to be reset. This includes freeing
the Request-skb ( in dccp_rcv_request_sent_state_process() ). As a consequence,
* the arrival of further packets causes a double-free, triggering a panic(),
* the connection then may hang, since further retransmissions are blocked.
This patch changes the order of statements so that the retransmission timer is
reset, and the pending Request freed, only if a valid Response has arrived (or
the number of sysctl-retries has been exhausted).
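The reordering can be illustrated with a small userspace model. All names
here (struct toy_sock, toy_rcv_in_request_state()) are hypothetical, and
free() merely stands in for releasing the queued Request skb; this is a
sketch of the ordering, not the DCCP code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_sock {
	void *pending_request;		/* stands in for the queued Request */
	bool retransmit_timer_armed;
};

/* Returns true if a valid Response let the connection proceed. */
static bool toy_rcv_in_request_state(struct toy_sock *sk, bool valid_response)
{
	assert(sk != NULL);

	/* Invalid packet: leave the timer and the pending Request alone. */
	if (!valid_response)
		return false;

	/* Only now stop retransmitting and release the Request. */
	sk->retransmit_timer_armed = false;
	free(sk->pending_request);
	sk->pending_request = NULL;
	return true;
}
```

Because the early return comes first, any number of invalid packets leaves
the Request allocated exactly once and the timer still armed.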
Further changes:
----------------
To be on the safe side, replaced __kfree_skb with kfree_skb so that if due to
unexpected circumstances the sk_send_head is NULL the WARN_ON is used instead.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon reports by Denys Fedoryshchenko, and feedback
and help from Jarek Poplawski and Herbert Xu.
We always either:
1) Never made an external reference to this qdisc.
or
2) Did a dev_deactivate() which purged all asynchronous
references.
So do not lock the qdisc when we call qdisc_destroy(),
it's illegal anyways as when we drop the lock this is
free'd memory.
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc locks are initialized in the same function, qdisc_alloc(), so
lockdep can't distinguish tx qdisc lock from rx and reports "possible
recursive locking detected" when both these locks are taken, e.g. while
using act_mirred with ifb. This looks like a false positive. Anyway,
after this patch these locks will be reported more exactly.
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon initial discovery and patch by Jarek Poplawski.
The qdisc watchdogs can be attached to any qdisc, not just the root,
so make sure we schedule the correct one.
CBQ has a similar bug.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a needless probe request caused by a zero value in
sta->last_rx inside the ieee80211_associated flow.
Signed-off-by: Ron Rindjunsky <ron.rindjunsky@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Guard rfkill controllers attached to a rfkill class against state changes
after class suspend has been issued.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The Bluetooth entries for the MAINTAINERS file are a little bit too
much. Consolidate them into two entries. One for Bluetooth drivers and
another one for the Bluetooth subsystem.
Also the MODULE_AUTHOR should indicate the current maintainer of the
module and actually not the original author. Fix all Bluetooth modules
to provide current maintainer information.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The Bluetooth adapters and connections are best presented via a class
in sysfs. The removal of the links inside the Bluetooth class broke
assumptions by userspace programs on how to find attached adapters.
This patch creates adapters and connections as part of the Bluetooth
class, but it uses different device types to distinguish them. The
userspace programs can now easily navigate in the sysfs device tree.
The unused platform device and bus have been removed to keep the
code simple and clean.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Based upon a bug report by Josip Rodin.
Packet schedulers should return NET_XMIT_DROP only if
the packet really was dropped. If the packet does reach
the device after we return NET_XMIT_DROP then TCP can
crash because it depends upon the enqueue path return
values being accurate.
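A toy enqueue function shows the invariant. The queue type and
toy_enqueue() are hypothetical; the two return codes are defined locally
with the values the kernel uses for NET_XMIT_SUCCESS and NET_XMIT_DROP.

```c
#include <assert.h>
#include <stddef.h>

#define NET_XMIT_SUCCESS 0
#define NET_XMIT_DROP    1

#define TOY_QUEUE_LIMIT 2

struct toy_queue {
	int pkts[TOY_QUEUE_LIMIT];
	size_t len;
};

/*
 * The return value must match what happened to the packet: DROP only
 * when it was not queued, SUCCESS only when it was.
 */
static int toy_enqueue(struct toy_queue *q, int pkt)
{
	assert(q != NULL);

	if (q->len == TOY_QUEUE_LIMIT)
		return NET_XMIT_DROP;	/* packet really is discarded */

	q->pkts[q->len++] = pkt;
	return NET_XMIT_SUCCESS;	/* packet will reach the device */
}
```

A scheduler that queued the packet and still returned NET_XMIT_DROP would
break this contract, which is the situation the message warns about.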
Signed-off-by: David S. Miller <davem@davemloft.net>
When getting the receiving interface index while no message has been
received, the index of the device the socket is bound to should be returned.
RFC 3542:
Issuing getsockopt() for the above options will return the sticky
option value i.e., the value set with setsockopt(). If no sticky
option value has been set getsockopt() will return the following
values:
- For the IPV6_PKTINFO option, it will return an in6_pktinfo
structure with ipi6_addr being in6addr_any and ipi6_ifindex being
zero.
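A sketch of the resulting default value, assuming a hypothetical
struct toy_in6_pktinfo that mirrors the two fields named in the RFC text
and a hypothetical toy_default_pktinfo() helper. For an unbound socket the
bound device index is zero, so the result matches the RFC wording.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in for the two in6_pktinfo fields named in RFC 3542. */
struct toy_in6_pktinfo {
	unsigned char ipi6_addr[16];
	unsigned int ipi6_ifindex;
};

static void toy_default_pktinfo(struct toy_in6_pktinfo *info,
				unsigned int bound_dev_if)
{
	assert(info != NULL);

	memset(info, 0, sizeof(*info));		/* all-zero address: in6addr_any */
	info->ipi6_ifindex = bound_dev_if;	/* 0 when the socket is unbound */
}
```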
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use return value from inner qdisc requeue when value returned isn't
NET_XMIT_SUCCESS, instead of always returning NET_XMIT_DROP.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can now kill them synchronously with all of the
previous dev_deactivate() cures.
This makes netdev destruction and shutdown saner as
the qdiscs hold references to the device.
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Jarek Poplawski <jarkao2@gmail.com>
When we are destroying non-root qdiscs, we need to lock
the root of the qdisc tree, not the qdisc itself.
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition under which the previous qdisc has no more references
after we've attached &noop_qdisc is that RUNNING and SCHED
are both seen clear while holding the root lock.
So just make specifically that check in the polling loop, instead
of this overly complex "check without then check with lock held"
sequence.
Signed-off-by: David S. Miller <davem@davemloft.net>
Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate(). Now, if this flag is seen
as unset under root_lock, it means the qdisc can't be netif_scheduled.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This new state lets dev_deactivate() mark a qdisc as having been
deactivated.
dev_queue_xmit() and ing_filter() check for this bit and do not
try to process the qdisc if the bit is set.
dev_deactivate() polls the qdisc after setting the bit, waiting
for both __QDISC_STATE_RUNNING and __QDISC_STATE_SCHED to clear.
This isn't perfect yet, but subsequent changesets will make it so.
This part is just one piece of the puzzle.
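The three checks described above can be modelled with plain bit tests.
This is a userspace sketch with hypothetical names and illustrative bit
values; the real code uses atomic bit operations and locking that are
omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; only the flag names follow the text. */
#define TOY_STATE_RUNNING     (1u << 0)
#define TOY_STATE_SCHED       (1u << 1)
#define TOY_STATE_DEACTIVATED (1u << 2)

struct toy_qdisc {
	unsigned int state;
};

/* Deactivation side: mark the qdisc so that new users back off. */
static void toy_mark_deactivated(struct toy_qdisc *q)
{
	assert(q != NULL);
	q->state |= TOY_STATE_DEACTIVATED;
}

/* Transmit/ingress side: skip a qdisc that has been deactivated. */
static bool toy_may_process(const struct toy_qdisc *q)
{
	assert(q != NULL);
	return !(q->state & TOY_STATE_DEACTIVATED);
}

/* Deactivation side: keep polling while this returns true. */
static bool toy_still_busy(const struct toy_qdisc *q)
{
	assert(q != NULL);
	return (q->state & (TOY_STATE_RUNNING | TOY_STATE_SCHED)) != 0;
}
```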
Signed-off-by: David S. Miller <davem@davemloft.net>
There's an skb_copy_datagram_iovec() to copy out of a paged skb, but
nothing the other way around (because we don't do that).
We want to allocate big skbs in tun.c, so let's add the function.
It's a carbon copy of skb_copy_datagram_iovec() with enough changes to
be annoying.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_gso_segment didn't preserve some attributes in the original skb
such as the netfilter fields. This was harmless until they were used
which is the case for packets going through lo.
This patch makes it call __copy_skb_header which also picks up some
other missing attributes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add more ethtool generic operations to dump the bridge offload
settings.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let me first state that disabling the route cache hash rebuild
should not be done without extensive analysis on the risk profile
and careful deliberation.
However, there are times when this can be done safely or for
testing. For example, when you have mechanisms for ensuring
that offending parties do not exist in your network.
This patch lets the user disable the rebuild if the interval is
set to zero. This also incidentally fixes a divide-by-zero error
with name-spaces.
In addition, this patch makes the effect of an interval change
immediate rather than it taking effect at the next rebuild as
is currently the case.
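A sketch of why treating zero as "disabled" also removes the division by
zero. The helper name and the jitter formula (interval plus a random value
modulo the interval) are assumptions made for illustration, not taken from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: compute the delay until the next rebuild.
 * Returns false when rebuilding is disabled (interval == 0).
 */
static bool toy_next_rebuild_delay(unsigned long interval, unsigned long rnd,
				   unsigned long *delay)
{
	assert(delay != NULL);

	if (interval == 0)
		return false;	/* disabled; also avoids "rnd % 0" below */

	*delay = interval + rnd % interval;
	return true;
}
```

Checking the interval before any arithmetic is what makes the zero value
safe to expose as a user-visible setting.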
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bug with spin_lock_bh() inserted instead of spin_unlock_bh() by
some recent patch.
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_dev_get_saddr() blindly de-references dst_dev to get the network
namespace, but some callers might pass NULL. Change callers to pass a
namespace pointer instead.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the multicast socket per-namespace.
When a network namespace other than init_net is created and a
multicast packet is received, the kernel hangs or panics.
How to reproduce ?
* create a child network namespace
* create a veth virtual device pair
* ip link add type veth
* move one side of the pair to the child namespace
* ip link set netns <childpid> dev veth1
* ping -I veth0 224.0.0.1
The bug appears because ip_mc_init_dev() does not initialize the
multicast fields: it exits early when the namespace is not init_net.
BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695]
Modules linked in:
irq event stamp: 50350
hardirqs last enabled at (50349): [<c03ee949>] _spin_unlock_irqrestore+0x34/0x39
hardirqs last disabled at (50350): [<c03ec639>] schedule+0x9f/0x5ff
softirqs last enabled at (45712): [<c0374d4b>] ip_setsockopt+0x8e7/0x909
softirqs last disabled at (45710): [<c03ee682>] _spin_lock_bh+0x8/0x27
Pid: 2695, comm: avahi-daemon Not tainted (2.6.27-rc2-00029-g0872073 #3)
EIP: 0060:[<c03ee47c>] EFLAGS: 00000297 CPU: 0
EIP is at __read_lock_failed+0x8/0x10
EAX: c4f38810 EBX: c4f38810 ECX: 00000000 EDX: c04cc22e
ESI: fb0000e0 EDI: 00000011 EBP: 0f02000a ESP: c4e3faa0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 44618a40 CR3: 04e37000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02311f8>] ? _raw_read_lock+0x23/0x25
[<c0390666>] ? ip_check_mc+0x1c/0x83
[<c036d478>] ? ip_route_input+0x229/0xe92
[<c022e2e4>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c0104c9c>] ? do_IRQ+0x69/0x7d
[<c0102e64>] ? restore_nocheck_notrace+0x0/0xe
[<c036fdba>] ? ip_rcv+0x227/0x505
[<c0358764>] ? netif_receive_skb+0xfe/0x2b3
[<c03588d2>] ? netif_receive_skb+0x26c/0x2b3
[<c035af31>] ? process_backlog+0x73/0xbd
[<c035a8cd>] ? net_rx_action+0xc1/0x1ae
[<c01218a8>] ? __do_softirq+0x7b/0xef
[<c0121953>] ? do_softirq+0x37/0x4d
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c0122037>] ? local_bh_enable+0x96/0xab
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c012181e>] ? _local_bh_enable+0x79/0x88
[<c035fcb8>] ? neigh_resolve_output+0x20f/0x239
[<c0373118>] ? ip_finish_output+0x1df/0x209
[<c0373364>] ? ip_dev_loopback_xmit+0x62/0x66
[<c0371db5>] ? ip_local_out+0x15/0x17
[<c0372013>] ? ip_push_pending_frames+0x25c/0x2bb
[<c03891b8>] ? udp_push_pending_frames+0x2bb/0x30e
[<c038a189>] ? udp_sendmsg+0x413/0x51d
[<c038a1a9>] ? udp_sendmsg+0x433/0x51d
[<c038f927>] ? inet_sendmsg+0x35/0x3f
[<c034f092>] ? sock_sendmsg+0xb8/0xd1
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c034f238>] ? sys_sendmsg+0x18d/0x1f0
[<c0175e90>] ? pipe_write+0x3cb/0x3d7
[<c0170347>] ? do_sync_write+0xbe/0x105
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c03503b2>] ? sys_socketcall+0x176/0x1b0
[<c01085ea>] ? syscall_trace_enter+0x6c/0x7b
[<c0102e1a>] ? syscall_call+0x7/0xb
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen_kill_estimator() required rtnl_lock() protection, but since it is
moved to an RCU callback __qdisc_destroy() let's use est_lock instead.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon discussions with Jarek P. and Herbert Xu.
First, we're testing the wrong qdisc. We just reset the device
queue qdiscs to &noop_qdisc and checking its state is completely
pointless here.
We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more. And that would be
->qdisc_sleeping.
Because of how we propagate the sampled qdisc pointer down into
qdisc_run and friends via per-cpu ->output_queue and netif_schedule,
we also have to wait for the __QDISC_STATE_SCHED bit to clear.
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes introduced a bug in htb_delete(): cl->parent->children
counter update misses checking cl->parent for NULL, which is used for
root classes, so deleting them causes an oops.
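The missing check is small enough to show on a toy class tree. The struct
and function names are hypothetical; only the idea that root classes have
a NULL parent comes from the text.

```c
#include <assert.h>
#include <stddef.h>

struct toy_class {
	struct toy_class *parent;	/* NULL for root classes */
	unsigned int children;
};

/* Detach a class from its parent, updating the parent's child count. */
static void toy_class_unlink(struct toy_class *cl)
{
	assert(cl != NULL);

	if (cl->parent)			/* root classes have no parent */
		cl->parent->children--;
	cl->parent = NULL;
}
```

Dropping the "if" reproduces the bug: unlinking a root class would
dereference a NULL parent.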
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new multi-queue transmit code, it is possible to accidentally
make pktgen pick a non-existing tx queue simply by using a stale
script to drive pktgen. Access to this non-existing tx queue will
then trigger a bad memory access and kill the machine.
For example, setting "queue_map_max 2" will cause my machine to die
when accessing a garbage spinlock in the non-existing tx queue:
BUG: spinlock bad magic on CPU#0, kpktgend_0/564
lock: ffff88001ddf6718, .magic: ffffffff, .owner: /-1, .owner_cpu: 0
Pid: 564, comm: kpktgend_0 Not tainted 2.6.27-rc3 #35
Call Trace:
[<ffffffff803a1228>] spin_bug+0xa4/0xac
[<ffffffff803a1253>] _raw_spin_lock+0x23/0x123
[<ffffffff8055b06f>] _spin_lock_bh+0x17/0x1b
[<ffffffff804cb57d>] pktgen_thread_worker+0xa97/0x1002
[<ffffffff8022874d>] ? finish_task_switch+0x38/0x97
[<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36
[<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36
[<ffffffff804caae6>] ? pktgen_thread_worker+0x0/0x1002
[<ffffffff80241a40>] kthread+0x44/0x6d
[<ffffffff8020c399>] child_rip+0xa/0x11
[<ffffffff802419fc>] ? kthread+0x0/0x6d
[<ffffffff8020c38f>] ? child_rip+0x0/0x11
The attached patch adds some sanity checking to prevent
these sorts of configuration errors.
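One possible shape for such a check is to clamp the requested queue index
to the range the device actually has. The helper below is hypothetical,
and whether the patch clamps or rejects out-of-range values is not stated
here; clamping is an assumption.

```c
#include <assert.h>

/*
 * Hypothetical helper: map a user-supplied, zero-based queue index onto
 * a device with num_tx_queues transmit queues.
 */
static unsigned int toy_clamp_queue_map(unsigned int requested,
					unsigned int num_tx_queues)
{
	assert(num_tx_queues > 0);

	if (requested >= num_tx_queues)
		return num_tx_queues - 1;	/* highest valid index */
	return requested;
}
```

With a single-queue device, the "queue_map_max 2" setting from the example
above would be reduced to queue 0 instead of indexing past the array.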
Signed-off-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDMA_READ completions are kept on a separate queue from the general
I/O request queue. Since a separate lock is used to protect the RDMA_READ
completion queue, a race exists between the dto_tasklet and the
svc_rdma_recvfrom thread where the dto_tasklet sets the XPT_DATA
bit and adds I/O to the read-completion queue. Concurrently, the
recvfrom thread checks the generic queue, finds it empty and resets
the XPT_DATA bit. A subsequent svc_xprt_enqueue will fail to enqueue
the transport for I/O and cause the transport to "stall".
The fix is to protect both lists with the same lock and set the XPT_DATA
bit with this lock held.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Thanks to Eugene Teo for reporting this problem.
Signed-off-by: Eugene Teo <eugenete@kernel.sg>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Small fix removing an unnecessary intermediate variable.
Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flushing must consistently return ENOMEM on failure of any allocation
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flushing of actions has been broken since we changed
the semantics of netlink parsed tb[X] to mean X is an attribute type.
This makes the flushing work.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of error, the function rxrpc_get_transport returns an ERR
pointer, but never returns a NULL pointer. So after a call to this
function, a NULL test should be replaced by an IS_ERR test.
A simplified version of the semantic patch that makes this change is
as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@correct_null_test@
expression x,E;
statement S1, S2;
@@
x = rxrpc_get_transport(...)
<... when != x = E
if (
(
- x@p2 != NULL
+ ! IS_ERR ( x )
|
- x@p2 == NULL
+ IS_ERR( x )
)
)
S1
else S2
...>
? x = E;
// </smpl>
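The effect of the rewrite can be reproduced in userspace with a small
emulation of the error-pointer helpers. The three helpers below are
simplified versions modelled on the kernel's ERR_PTR(), PTR_ERR() and
IS_ERR(); toy_get_transport() is a hypothetical stand-in for a function
that reports failure through an error pointer and never returns NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct toy_transport {
	int id;
};

static struct toy_transport *toy_get_transport(bool fail)
{
	static struct toy_transport trans = { .id = 1 };

	return fail ? ERR_PTR(-ENOMEM) : &trans;
}

static int toy_use_transport(bool fail)
{
	struct toy_transport *t = toy_get_transport(fail);

	if (IS_ERR(t))		/* a "t == NULL" test would never fire */
		return (int)PTR_ERR(t);

	assert(t != NULL);
	return t->id;
}
```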
Signed-off-by: Julien Brunel <brunel@diku.dk>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
At a minimum, the wireless extensions ought to send
the name in addition to the ifindex.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's an internal implementation detail which we _should_ be free to change.
So we did, and it promptly broke.
The compiler should be able to work out when to use the __constant version
anyway.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan wrote:
> On Thu, Aug 07, 2008 at 07:00:56PM +0200, John Gumb wrote:
>> Scenario: no ipv6 default route set.
>
>> # ip -f inet6 route get fec0::1
>>
>> BUG: unable to handle kernel NULL pointer dereference at 00000000
>> IP: [<c0369b85>] rt6_fill_node+0x175/0x3b0
>> EIP is at rt6_fill_node+0x175/0x3b0
>
> 0xffffffff80424dd3 is in rt6_fill_node (net/ipv6/route.c:2191).
> 2186 } else
> 2187 #endif
> 2188 NLA_PUT_U32(skb, RTA_IIF, iif);
> 2189 } else if (dst) {
> 2190 struct in6_addr saddr_buf;
> 2191 ====> if (ipv6_dev_get_saddr(ip6_dst_idev(&rt->u.dst)->dev,
> ^^^^^^^^^^^^^^^^^^^^^^^^
> NULL
>
> 2192 dst, 0, &saddr_buf) == 0)
> 2193 NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
> 2194 }
The commit that changed this can't be reverted easily, but the patch
below works for me.
Fix NULL de-reference in rt6_fill_node() when there's no IPv6 input
device present in the dst entry.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since qdisc_stab_lock is used in qdisc_put_stab(), which is called in
BH context from __qdisc_destroy() RCU callback, softirq safe locking
is needed.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to align the coding styles of ip_vs_zero_stats() and
its child-function ip_vs_zero_estimator(), clear ip_vs_stats
members explicitly rather than doing a limited memset().
This was chosen over modifying ip_vs_zero_estimator() to use
memset() as it is more robust against changes in members
in the relevant structures. memset() would be preferred if
all members of the structure were to be cleared.
Cc: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
It's a global variable and automatically initialized to zero. And now we can
also initialize the lock at compile time.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
There's no reason for dynamically allocating an estimator object for every
stats object. Directly embed an estimator object into every stats object and
switch to using the kernel-provided list implementation. This makes the code
much simpler and faster, as we do not need to traverse the list of all
estimators to find the one belonging to a stats object. There's no need to use
an rwlock, as we only have one reader. Also reorder the members of the
estimator structure slightly to avoid padding overhead. This can't be done
with the stats object as the members are currently copied to our user space
object via memcpy() and changing it would break ABI.
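Embedding makes the lookup from estimator to stats object a constant-time
pointer adjustment. The sketch below uses hypothetical toy_* types and a
local container_of-style macro built on offsetof(); it only illustrates
the layout idea, not the IPVS structures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list_node {
	struct toy_list_node *next, *prev;
};

struct toy_estimator {
	struct toy_list_node list;	/* links all estimators together */
	unsigned long last_conns;
};

struct toy_stats {
	unsigned long conns;
	struct toy_estimator est;	/* embedded, no separate allocation */
};

/* Step back from a member pointer to the structure that contains it. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_stats *toy_est_to_stats(struct toy_estimator *est)
{
	assert(est != NULL);
	return toy_container_of(est, struct toy_stats, est);
}
```

No list traversal is needed to find the owner, which is where the
simplification described above comes from.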
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Being able to discard these functions saves a couple of bytes at runtime. The
cleanup functions can't be annotated with __exit as they are also called from
init functions.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
No need to do it at runtime and this saves a couple of bytes in the text
section.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
There is a slight chance for a deadlock in the estimator code. We can't call
del_timer_sync() while holding our lock, as the timer might be active and
spinning for the lock on another cpu. Work around this issue by using
try_to_del_timer_sync() and releasing the lock. We could actually delete the
timer outside of our lock, as the add and kill functions are only ever called
from userspace via [gs]etsockopt() and are serialized by a mutex, but better
make this explicit.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable <stable@kernel.org>
Acked-by: Simon Horman <horms@verge.net.au>
Commit 998e7a7680 ("ipvs: Use kthread_run()
instead of doing a double-fork via kernel_thread()") introduced a possible
deadlock in the sync code. We need to use the _bh versions for the lock, as the
lock is also accessed from a bottom half.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
The socket lock is there to protect the normal UDP receive path.
Encapsulation UDP sockets don't need that protection. In fact
the locking is deadly for them as they may contain another UDP
packet within, possibly with the same addresses.
Also the nested bit was copied from TCP. TCP needs it because
of accept(2) spawning sockets. This simply doesn't apply to UDP
so I've removed it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon bug reports by Stephen Hemminger.
We still had some cases using ->qdisc instead of ->qdisc_sleeping.
Also, qdisc_lookup() should return ingress qdiscs.
Signed-off-by: David S. Miller <davem@davemloft.net>
When an action is added several times with the same exact index
it gets deleted on every even-numbered attempt.
This fixes that issue.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The indentation in part of tcp_minisocks makes it look like one of the if
statements is much more important than it actually is.
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Bluetooth qualification for PAN demands testing with BNEP header
compression disabled. This is actually pretty stupid and the Linux
implementation outsmarts the test system since it compresses whenever
possible. So to pass qualification two new parameters have been added
to control the compression of source and destination headers.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Currently a mesh node will not forward a multicast frame if it is not subscribed
to the specific multicast address. This patch addresses the issue and fixes mesh
multicast forwarding.
Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Now we deal with mesh forwarding before the 802.11->802.3 conversion, thus
eliminating a few unnecessary steps. The next hop lookup is called from
ieee80211_master_start_xmit() instead of subif_start_xmit(). Until the next hop
is found, RA in the frame will be all zeroes for frames originating from the
device. For forwarded frames, RA will contain the TA of the received frame,
which will be necessary to send a path error if a next hop is not found.
Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
So far pktgen has had a restriction to only use one device per kernel
thread. With the new multiqueue architecture this is no longer adequate.
The patch below is an effort to remove this restriction by adding, in the
pktgen configuration, a tag to the device name a la eth0@0 etc. The tag is
used for the usual device config just as before. Also a new flag,
QUEUE_MAP_CPU, is introduced to mirror queue_map with the sending thread's
smp_processor_id().
An example: We use 4 CPUs to send to one 10g interface (eth0)
and we use the new tagging to send a mix of packet sizes, 64, 576 and
1500 bytes. Also we use TX queues according to smp_processor_id().
PGDEV=/proc/net/pktgen/kpktgend_0
pgset "add_device eth0@0"
PGDEV=/proc/net/pktgen/kpktgend_1
pgset "add_device eth0@1"
PGDEV=/proc/net/pktgen/kpktgend_2
pgset "add_device eth0@2"
PGDEV=/proc/net/pktgen/kpktgend_3
pgset "add_device eth0@3"
....
PGDEV=/proc/net/pktgen/eth0@0
pgset "pkt_size 64"
pgset "flag QUEUE_MAP_CPU"
PGDEV=/proc/net/pktgen/eth0@1
pgset "pkt_size 572"
pgset "flag QUEUE_MAP_CPU"
PGDEV=/proc/net/pktgen/eth0@2
pgset "pkt_size 1496"
PGDEV=/proc/net/pktgen/eth0@3
pgset "pkt_size 1496"
pgset "flag QUEUE_MAP_CPU"
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet_type specifies an active slave to bonding and not just any
interface, allow it to receive frames that came in on that interface.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Allow a packet_type that specifies the exact device to receive
even on inactive bonding slave devices. This is important for some
L2 protocols such as LLDP and FCoE. This can eventually be used
for the bonding special cases as well.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Otherwise subsequent changes need multiple return values.
Signed-off-by: Joe Eykholt <jre@nuovasystems.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Here's a revised version, based on Herbert's comments, of a fix for
the ipv4-inner, ipv6-outer interfamily ipsec beet mode. It fixes the
network header adjustment during interfamily, as well as makes sure
that we reserve enough room for the new ipv6 header if we might have
something else as the inner family. Also, the ipv4 pseudo header
construction was added.
Signed-off-by: Joakim Koskela <jookos@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's a revised version, based on Herbert's comments, of a fix for
the ipv6-inner, ipv4-outer interfamily ipsec beet mode. It fixes the
network header adjustment in interfamily, and doesn't reserve space
for the pseudo header anymore when we have ipv6 as the inner family.
Signed-off-by: Joakim Koskela <jookos@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting with 9043476f72 ("[PATCH]
sanitize proc_sysctl") we have two netfilter-related problems:
- WARNING: at kernel/sysctl.c:1966 unregister_sysctl_table+0xcc/0x103(),
caused by wrong order of init/fini calls
- net.netfilter is duplicated and has truncated set of records
Thanks to very useful guidelines from Al Viro, this patch fixes both
of them.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces dst_metric() with dst_mtu() in net/ipv6/route.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces dst_metric() with dst_mtu() in net/ipv4/route.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 49ffcf8f99 ("sysctl: update
sysctl_check_table"), setting struct ctl_table.procname = NULL no
longer works the way the AX.25 code expects, which causes the
AX.25 sysctl registration code to break if
CONFIG_AX25_DAMA_SLAVE is not set, as in some distribution kernels.
Kernel releases from 2.6.24 are affected.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_mac_count and src_mac_count patch from Eneas Hunguana
We have been sending one MAC address too many.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
Random flow generation has not worked. This fixes it.
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Stephen Hemminger <shemminger@vyatta.com>
Based upon original patch by Herbert Xu, which contained
the following problem description:
--------------------
When the forward delay is set to zero, we still delay the setting
of the forwarding state by one or possibly two timers depending
on whether STP is enabled. This could either turn out to be
instantaneous, or horribly slow depending on the load of the
machine.
As there is nothing preventing us from enabling forwarding straight
away, this patch eliminates this potential delay by executing the
code directly if the forward delay is zero.
The effect of this problem is that immediately after the carrier
comes on a port, the bridge will drop all packets received from
that port until it enters forwarding mode, thus causing unnecessary
packet loss.
Note that this patch doesn't fully remove the delay due to the
link watcher. We should also check the carrier state when we
are about to drop an incoming packet because the port is disabled.
But that's for another patch.
--------------------
This version of the fix takes a different approach, in that
it just does the state change directly.
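The idea can be sketched in a small userspace model. This is only an
illustration under stated assumptions, not the bridge code itself: the
toy_* names, the timer_armed field and the stp_enabled argument are
hypothetical stand-ins for the real port state and forward-delay timer.

```c
/* Hypothetical port states, loosely mirroring the bridge port states. */
enum toy_state { TOY_DISABLED, TOY_LISTENING, TOY_LEARNING, TOY_FORWARDING };

struct toy_port {
    enum toy_state state;
    int timer_armed;    /* 1 while a forward-delay timer is pending */
};

/*
 * Bring a port up. With a zero forward delay there is nothing to wait
 * for, so the state is changed directly and no timer is armed.
 * Otherwise the port starts in an intermediate state and relies on the
 * timer to advance it later (two steps with STP, one without).
 */
static void toy_make_forwarding(struct toy_port *p, unsigned forward_delay,
                                int stp_enabled)
{
    if (forward_delay == 0) {
        p->state = TOY_FORWARDING;
        p->timer_armed = 0;
        return;
    }
    p->state = stp_enabled ? TOY_LISTENING : TOY_LEARNING;
    p->timer_armed = 1;
}
```

With forward_delay == 0 the port forwards as soon as the function
returns, independent of how loaded the machine is.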
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following warning due to incompatible pointer
assignment:
net/bridge/br_netfilter.c: In function 'br_netfilter_rtable_init':
net/bridge/br_netfilter.c:116: warning: assignment from incompatible
pointer type
This warning is due to commit 4adf0af681
from July 30 (send correct MTU value in PMTU (revised)).
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> noticed that it would be nice to
handle NET_XMIT_BYPASS by NET_XMIT_SUCCESS with an internal qdisc flag
__NET_XMIT_BYPASS and to remove the mapping from dev_queue_xmit().
David Miller <davem@davemloft.net> spotted a serious bug in the first
version of this patch.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> noticed:
"The other problem that affects all qdiscs supporting actions is
TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS
even though the packet is not queued, corrupting upper qdiscs'
qlen counters."
and later explained:
"The reason why it translates it at all seems to be to not increase
the drops counter. Within a single qdisc this could be avoided by
other means easily, upper qdiscs would still increase the counter
when we return anything besides NET_XMIT_SUCCESS though.
This means we need a new NET_XMIT return value to indicate this to
the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN,
return that to upper qdiscs and translate it to NET_XMIT_SUCCESS
in dev_queue_xmit, similar to NET_XMIT_BYPASS."
David Miller <davem@davemloft.net> noticed:
"Maybe these NET_XMIT_* values being passed around should be a set of
bits. They could be composed of base meanings, combined with specific
attributes.
So you could say "NET_XMIT_DROP | __NET_XMIT_NO_DROP_COUNT"
The attributes get masked out by the top-level ->enqueue() caller,
such that the base meanings are the only thing that make their
way up into the stack. If it's only about communication within the
qdisc tree, let's simply code it that way."
This patch tries to implement these ideas.
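A minimal userspace sketch of the bit-set scheme discussed above. The
TOY_* names, the mask width and the toy_qdisc structure are assumptions
made for illustration; they are not the kernel's actual definitions.

```c
/* Base meanings: the only values the rest of the stack should see. */
#define TOY_XMIT_SUCCESS   0x00
#define TOY_XMIT_DROP      0x01
#define TOY_XMIT_MASK      0x0f   /* hypothetical width of the base part */

/* Qdisc-internal attributes, kept in bits above the mask. */
#define TOY_XMIT_F_STOLEN  0x10   /* consumed by an action, not queued */
#define TOY_XMIT_F_BYPASS  0x20

struct toy_qdisc {
    unsigned qlen;
    unsigned drops;
};

/* Strip the internal attributes, leaving only the base meaning. */
static int toy_xmit_base(int ret)
{
    return ret & TOY_XMIT_MASK;
}

/* A stolen packet was not queued, but it is not a drop either. */
static int toy_xmit_drop_count(int ret)
{
    return (ret & TOY_XMIT_F_STOLEN) ? 0 : 1;
}

/* Upper qdisc: account for the child's result, pass it on unchanged. */
static int toy_parent_enqueue(struct toy_qdisc *parent, int child_ret)
{
    if (child_ret == TOY_XMIT_SUCCESS)
        parent->qlen++;
    else
        parent->drops += toy_xmit_drop_count(child_ret);
    return child_ret;
}

/* Top-level caller: mask the attributes before returning to the stack. */
static int toy_top_level_enqueue(struct toy_qdisc *root, int child_ret)
{
    return toy_xmit_base(toy_parent_enqueue(root, child_ret));
}
```

A child returning TOY_XMIT_SUCCESS | TOY_XMIT_F_STOLEN leaves both qlen
and drops of the upper qdisc untouched, while the top-level caller still
sees plain TOY_XMIT_SUCCESS.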
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a few HW bug fixes.
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When joining an ad-hoc network, the user is currently required to specify
the channel. The network will not be joined otherwise, unless it happens
to be sitting on the currently active channel.
This patch implements automatic channel selection when the user has not
locked the interface onto a specific channel.
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch makes it possible for a driver to specify a maximal listen
interval. The possibility for the user to configure the listen interval
is not implemented yet; currently the maximum provided by the driver or
1 is used.
Mac80211 uses the config handler to pass the listen interval to the driver.
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch adds dtim_period to ieee80211_bss_conf. This allows the low
level driver to know the dtim_period and to plan power save accordingly.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Avoid the overhead of atomic increment/decrement on each received packet.
This helps performance of non-NAPI devices (like loopback).
Use cleanup function to walk queue on each cpu and clean out any
left over packets.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old code drops an IPv6 packet if ipfragok is not set. Since
ipfragok is obsolete and skb->local_df is used instead, this
check must be changed to test skb->local_df.
This patch fixes the problem and does not drop the packet if
skb->local_df is set to true.
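The changed check can be sketched as follows. This is a simplified
userspace model, not the IPv6 output path itself: struct toy_skb and
toy_must_drop() are hypothetical, and only the len and local_df fields
are modelled.

```c
/* Hypothetical stand-in for the two skb fields the check looks at. */
struct toy_skb {
    unsigned len;        /* packet length in bytes */
    unsigned local_df;   /* non-zero: local fragmentation is allowed */
};

/*
 * Return 1 if the packet has to be dropped as too big, 0 if it may be
 * sent. An oversized packet is acceptable as long as local_df is set,
 * because it can then be fragmented on the local host.
 */
static int toy_must_drop(const struct toy_skb *skb, unsigned mtu)
{
    return skb->len > mtu && !skb->local_df;
}
```

A 1500 byte packet on a 1280 byte path is dropped only when local_df is
clear.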
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipfragok flag controls whether the packet may be fragmented
either on the local host or beyond. The latter is only valid on
IPv4.
In fact, we never want to do the latter even on IPv4 when PMTU is
enabled. This is because even though we can't fragment packets
within SCTP due to the protocol's inherent faults, we can still
fragment it at the IP layer. By setting the DF bit we will improve
the PMTU process.
RFC 2960 only says that we SHOULD clear the DF bit in this case,
so we're compliant even if we set the DF bit. In fact RFC 4960
no longer has this statement.
Once we make this change, we only need to control the local
fragmentation. There is already a bit in the skb which controls
that, local_df. So this patch sets that instead of using the
ipfragok argument.
The only complication is that there isn't a struct sock object
per transport, so for IPv4 we have to resort to changing the
pmtudisc field for every packet. This should be safe though
as the protocol is single-threaded.
Note that after this patch we can remove ipfragok from the rest
of the stack too.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>