Commit Graph

500 Commits

Author SHA1 Message Date
Tejun Heo 29187a9eea workqueue: fix subtle pool management issue which can stall whole worker_pool
A worker_pool's forward progress is guaranteed by the fact that the
last idle worker assumes the manager role to create more workers and
summon the rescuers if creating workers doesn't succeed in timely
manner before proceeding to execute work items.

This manager role is implemented in manage_workers(), which indicates
whether the worker may proceed to work item execution with its return
value.  This is necessary because multiple workers may contend for the
manager role, and, if there already is a manager, others should
proceed to work item execution.

Unfortunately, the function also indicates that the worker may proceed
to work item execution if need_to_create_worker() is false at the head
of the function.  need_to_create_worker() tests the following
conditions.

	pending work items && !nr_running && !nr_idle

The first and third conditions are protected by pool->lock and thus
won't change while holding pool->lock; however, nr_running can change
asynchronously as other workers block and resume and while it's likely
to be zero, as someone woke this worker up in the first place, some
other workers could have become runnable inbetween making it non-zero.

If this happens, manage_worker() could return false even with zero
nr_idle making the worker, the last idle one, proceed to execute work
items.  If then all workers of the pool end up blocking on a resource
which can only be released by a work item which is pending on that
pool, the whole pool can deadlock as there's no one to create more
workers or summon the rescuers.

This patch fixes the problem by removing the early exit condition from
maybe_create_worker() and making manage_workers() return false iff
there's already another manager, which ensures that the last worker
doesn't start executing work items.

We can leave the early exit condition alone and just ignore the return
value but the only reason it was put there is because the
manage_workers() used to perform both creations and destructions of
workers and thus the function may be invoked while the pool is trying
to reduce the number of workers.  Now that manage_workers() is called
only when more workers are needed, the only case this early exit
condition is triggered is rare race conditions rendering it pointless.

Tested with simulated workload and modified workqueue code which
trigger the pool deadlock reliably without this patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Eric Sandeen <sandeen@sandeen.net>
Link: http://lkml.kernel.org/g/54B019F4.8030009@sandeen.net
Cc: Dave Chinner <david@fromorbit.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: stable@vger.kernel.org
2015-01-16 14:21:16 -05:00
NeilBrown 008847f66c workqueue: allow rescuer thread to do more work.
When there is serious memory pressure, all workers in a pool could be
blocked, and a new thread cannot be created because it requires memory
allocation.

In this situation a WQ_MEM_RECLAIM workqueue will wake up the
rescuer thread to do some work.

The rescuer will only handle requests that are already on ->worklist.
If max_requests is 1, that means it will handle a single request.

The rescuer will be woken again in 100ms to handle another max_requests
requests.

I've seen a machine (running a 3.0 based "enterprise" kernel) with
thousands of requests queued for xfslogd, which has a max_requests of
1, and is needed for retiring all 'xfs' write requests.  When one of
the worker pools gets into this state, it progresses extremely slowly
and possibly never recovers (only waited an hour or two).

With this patch we leave a pool_workqueue on mayday list
until it is clearly no longer in need of assistance.  This allows
all requests to be handled in a timely fashion.

We keep each pool_workqueue on the mayday list until
need_to_create_worker() is false, and no work for this workqueue is
found in the pool.

I have tested this in combination with a (hackish) patch which forces
all work items to be handled by the rescuer thread.  In that context
it significantly improves performance.  A similar patch for a 3.0
kernel significantly improved performance on a heavy work load.

Thanks to Jan Kara for some design ideas, and to Dongsu Park for
some comments and testing.

tj: Inverted the lock order between wq_mayday_lock and pool->lock with
    a preceding patch and simplified this patch.  Added comment and
    updated changelog accordingly.  Dongsu spotted missing get_pwq()
    in the simplified code.

Cc: Dongsu Park <dongsu.park@profitbricks.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-12-08 12:39:16 -05:00
Tejun Heo b2d829096b workqueue: invert the order between pool->lock and wq_mayday_lock
Currently, pool->lock nests inside pool->lock.  There's no inherent
reason for this order.  The only place where the two locks are held
together is pool_mayday_timeout() and it just got decided that way.

This nesting order turns out to complicate things with the planned
rescuer_thread() update.  Let's invert them.  This doesn't cause any
behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Dongsu Park <dongsu.park@profitbricks.com>
2014-12-08 12:39:16 -05:00
Tejun Heo 0479c8c549 workqueue: cosmetic update in rescuer_thread()
rescuer_thread() caches &rescuer->scheduled in a local variable
scheduled for convenience.  There's one WARN_ON_ONCE() which was using
&rescuer->scheduled directly.  Replace it with the local variable.

This patch causes no functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
2014-12-04 10:14:54 -05:00
Joe Lawrence 3e28e37720 workqueue: Use cond_resched_rcu_qs macro
Tidy up and use cond_resched_rcu_qs when calling cond_resched and
reporting potential quiescent state to RCU.  Splitting this change in
this way allows easy backporting to -stable for kernel versions not
having cond_resched_rcu_qs().

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-06 05:58:26 -07:00
Joe Lawrence 789cbbeca4 workqueue: Add quiescent state between work items
Similar to the stop_machine deadlock scenario on !PREEMPT kernels
addressed in b22ce2785d "workqueue: cond_resched() after processing
each work item", kworker threads requeueing back-to-back with zero jiffy
delay can stall RCU. The cond_resched call introduced in that fix will
yield only iff there are other higher priority tasks to run, so force a
quiescent RCU state between work items.

Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
Link: https://lkml.kernel.org/r/20140926105227.01325697@jlaw-desktop.mno.stratus.com
Link: https://lkml.kernel.org/r/20140929115445.40221d8e@jlaw-desktop.mno.stratus.com
Fixes: b22ce2785d ("workqueue: cond_resched() after processing each work item")
Cc: <stable@vger.kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-10-06 05:57:43 -07:00
Linus Torvalds f2a84170ed Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:

 - Major reorganization of percpu header files which I think makes
   things a lot more readable and logical than before.

 - percpu-refcount is updated so that it requires explicit destruction
   and can be reinitialized if necessary.  This was pulled into the
   block tree to replace the custom percpu refcnting implemented in
   blk-mq.

 - In the process, percpu and percpu-refcount got cleaned up a bit

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
  percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
  percpu-refcount: require percpu_ref to be exited explicitly
  percpu-refcount: use unsigned long for pcpu_count pointer
  percpu-refcount: add helpers for ->percpu_count accesses
  percpu-refcount: one bit is enough for REF_STATUS
  percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
  workqueue: stronger test in process_one_work()
  workqueue: clear POOL_DISASSOCIATED in rebind_workers()
  percpu: Use ALIGN macro instead of hand coding alignment calculation
  percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
  percpu: preffity percpu header files
  percpu: use raw_cpu_*() to define __this_cpu_*()
  percpu: reorder macros in percpu header files
  percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
  percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
  percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
  percpu: reorganize include/linux/percpu-defs.h
  percpu: move accessors from include/linux/percpu.h to percpu-defs.h
  percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
  percpu: introduce arch_raw_cpu_ptr()
  ...
2014-08-04 10:09:27 -07:00
Linus Torvalds c4c3f5fba0 Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Lai has been doing a lot of cleanups of workqueue and kthread_work.
  No significant behavior change.  Just a lot of cleanups all over the
  place.  Some are a bit invasive but overall nothing too dangerous"

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kthread_work: remove the unused wait_queue_head
  kthread_work: wake up worker only when the worker is idle
  workqueue: use nr_node_ids instead of wq_numa_tbl_len
  workqueue: remove the misnamed out_unlock label in get_unbound_pool()
  workqueue: remove the stale comment in pwq_unbound_release_workfn()
  workqueue: move rescuer pool detachment to the end
  workqueue: unfold start_worker() into create_worker()
  workqueue: remove @wakeup from worker_set_flags()
  workqueue: remove an unneeded UNBOUND test before waking up the next worker
  workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
  workqueue: alloc struct worker on its local node
  workqueue: reuse the already calculated pwq in try_to_grab_pending()
  workqueue: stronger test in process_one_work()
  workqueue: clear POOL_DISASSOCIATED in rebind_workers()
  workqueue: sanity check pool->cpu in wq_worker_sleeping()
  workqueue: clear leftover flags when detached
  workqueue: remove useless WARN_ON_ONCE()
  workqueue: use schedule_timeout_interruptible() instead of open code
  workqueue: remove the empty check in too_many_workers()
  workqueue: use "pool->cpu < 0" to stand for an unbound pool
2014-08-04 10:04:44 -07:00
Lai Jiangshan ddcb57e2ed workqueue: use nr_node_ids instead of wq_numa_tbl_len
They are the same and nr_node_ids is provided by the memory subsystem.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:10:39 -04:00
Lai Jiangshan 3fb1823c09 workqueue: remove the misnamed out_unlock label in get_unbound_pool()
After the locking was moved up to the caller of the get_unbound_pool(),
out_unlock label doesn't need to do any unlock operation and the name
became bad, so we just remove this label, and the only usage-site
"goto out_unlock" is subsituted to "return pool".

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:10:39 -04:00
Lai Jiangshan 29b1cb416a workqueue: remove the stale comment in pwq_unbound_release_workfn()
In 75ccf5950f ("workqueue: prepare flush_workqueue() for dynamic
creation and destrucion of unbound pool_workqueues"), a comment
about the synchronization for the pwq in pwq_unbound_release_workfn()
was added. The comment claimed the flush_mutex wasn't strictly
necessary, it was correct in that time, due to the pwq was protected
by workqueue_lock.

But it is incorrect now since the wq->flush_mutex was renamed to
wq->mutex and workqueue_lock was removed, the wq->mutex is strictly
needed. But the comment was miss-updated when the synchronization
was changed.

This patch removes the incorrect comments and doesn't add any new
comment to explain why wq->mutex is needed here, which is definitely
obvious and wq->pwqs_node has "WQ" notation in its definition which is
better comment.

The old commit mentioned above also introduced a comment in link_pwq()
about the synchronization. This comment is also removed in this patch
since the whole link_pwq() is proteced by wq->mutex.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:10:39 -04:00
Lai Jiangshan 13b1d625ef workqueue: move rescuer pool detachment to the end
In 51697d3939 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.

It is unnecessary.  worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.

So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.

tj: Minor description update.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:10:39 -04:00
Lai Jiangshan 051e185010 workqueue: unfold start_worker() into create_worker()
Simply unfold the code of start_worker() into create_worker() and
remove the original start_worker() and create_and_start_worker().

The only trade-off is the introduced overhead that the pool->lock
is released and regrabbed after the newly worker is started.
The overhead is acceptible since the manager is slow path.

And because this new locking behavior, the newly created worker
may grab the lock earlier than the manager and go to process
work items. In this case, the recheck need_to_create_worker() may be
true as expected and the manager goes to restart which is the
correct behavior.

tj: Minor updates to description and comments.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:10:39 -04:00
Lai Jiangshan 228f1d0018 workqueue: remove @wakeup from worker_set_flags()
worker_set_flags() has only two callers, each specifying %true and
%false for @wakeup.  Let's push the wake up to the caller and remove
@wakeup from worker_set_flags().  The caller can use the following
instead if wakeup is necessary:

	worker_set_flags();
	if (need_more_worker(pool))
 		wake_up_worker(pool);

This makes the code simpler.  This patch doesn't introduce behavior
changes.

tj: Updated description and comments.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 12:08:36 -04:00
Lai Jiangshan a489a03eca workqueue: remove an unneeded UNBOUND test before waking up the next worker
In process_one_work():

	if ((worker->flags & WORKER_UNBOUND) && need_more_worker(pool))
		wake_up_worker(pool);

the first test is unneeded.  Even if the first test is removed, it
doesn't affect the wake-up logic for WORKER_UNBOUND, and it will not
introduce any useless wake-ups for normal per-cpu workers since
nr_running is always >= 1.  It will introduce useless/redundant
wake-ups for CPU_INTENSIVE, but this case is rare and the next patch
will also remove this redundant wake-up.

tj: Minor updates to the description and comment.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-22 10:37:52 -04:00
Lai Jiangshan d8ca83e68c workqueue: wake regular worker if need_more_worker() when rescuer leave the pool
We don't need to wake up regular worker when nr_running==1,
so need_more_worker() is sufficient here.

And need_more_worker() gives us better readability due to the name of
"keep_working()" implies the rescuer should keep working now but
the rescuer is actually leaving.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-18 18:46:11 -04:00
Lai Jiangshan f7537df520 workqueue: alloc struct worker on its local node
When the create_worker() is called from non-manager, the struct worker
is allocated from the node of the caller which may be different from the
node of pool->node.

So we add a node ID argument for the alloc_worker() to ensure the
struct worker is allocated from the preferable node.

tj: @nid renamed to @node for consistency.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-15 11:11:20 -04:00
Lai Jiangshan 9c34a7042e workqueue: reuse the already calculated pwq in try_to_grab_pending()
try_to_grab_pending() was re-calculating the associated pwq using
get_work_pwq() when it already has it cached in a local varible and
the association can't change.  Reuse the local variable instead.

This doesn't introduce any functional changes.

tj: Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-11 11:14:50 -04:00
Yasuaki Ishimatsu 5a6024f160 workqueue: zero cpumask of wq_numa_possible_cpumask on init
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
   [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
   [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
   [<ffffffff811926f1>] new_slab+0x91/0x300
   [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
   [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
   [<ffffffff810a3c78>] ? load_balance+0x218/0x890
   [<ffffffff8101a679>] ? sched_clock+0x9/0x10
   [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
   [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
   [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
   [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
   [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
   [<ffffffff8105d0ec>] do_fork+0xbc/0x360
   [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
   [<ffffffff81086652>] kthreadd+0x2c2/0x300
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
   [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.

When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:

3592         /* if cpumask is contained inside a NUMA node, we belong to that node */
3593         if (wq_numa_enabled) {
3594                 for_each_node(node) {
3595                         if (cpumask_subset(pool->attrs->cpumask,
3596                                            wq_numa_possible_cpumask[node])) {
3597                                 pool->node = node;
3598                                 break;
3599                         }
3600                 }
3601         }

But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.

By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
2014-07-07 09:56:48 -04:00
Lai Jiangshan 85327af61d workqueue: stronger test in process_one_work()
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should
be the same as pool->cpu without any exception even during cpu-hotplug.

This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this won't hide
any possible bug which can be hit by old test.

tj: Minor description update and dropped the obvious comment.

CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01 17:40:14 -04:00
Lai Jiangshan 3de5e88485 workqueue: clear POOL_DISASSOCIATED in rebind_workers()
a9ab775bca ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") moved pool locking into rebind_workers() but left
"pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback().

There is nothing necessarily wrong with it, but there is no benefit
either.  Let's move it into rebind_workers() and achieve the following
benefits:

  1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers()
     as expected.

  2) we can guarantee that, when POOL_DISASSOCIATED is clear, the
     running workers of the pool are on the local CPU (pool->cpu).

tj: Minor description update.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-01 17:40:14 -04:00
Maxime Bizon bddbceb688 workqueue: fix dev_set_uevent_suppress() imbalance
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.

Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 226223ab3c
2014-06-23 14:40:49 -04:00
Lai Jiangshan 807407c0a2 workqueue: stronger test in process_one_work()
After the recent changes, when POOL_DISASSOCIATED is cleared, the
running worker's local CPU should be the same as pool->cpu without any
exception even during cpu-hotplug.  Update the sanity check in
process_one_work() accordingly.

This patch changes "(proposition_A && proposition_B && proposition_C)"
to "(proposition_B && proposition_C)", so if the old compound
proposition is true, the new one must be true too. so this will not
hide any possible bug which can be caught by the old test.

tj: Minor updates to the description.

CC: Jason J. Herne <jjherne@linux.vnet.ibm.com>
CC: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 15:43:33 -04:00
Lai Jiangshan f05b558d7e workqueue: clear POOL_DISASSOCIATED in rebind_workers()
The commit a9ab775bca ("workqueue: directly restore CPU affinity of
workers from CPU_ONLINE") moved the pool->lock into rebind_workers()
without also moving "pool->flags &= ~POOL_DISASSOCIATED".

There is nothing wrong with "pool->flags &= ~POOL_DISASSOCIATED" not
being moved together, but there isn't any benefit either. We move it
into rebind_workers() and achieve these benefits:

1) Better readability.  POOL_DISASSOCIATED is cleared in
   rebind_workers() as expected.

2) When POOL_DISASSOCIATED is cleared, we can ensure that all the
   running workers of the pool are on the local CPU (pool->cpu).

tj: Cosmetic updates to the code and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 15:43:33 -04:00
Lai Jiangshan 92b69f5091 workqueue: sanity check pool->cpu in wq_worker_sleeping()
In theory, pool->cpu is equals to @cpu in wq_worker_sleeping() after
worker->flags is checked.

And "pool->cpu != cpu" sanity check will help us if something wrong.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:32:27 -04:00
Lai Jiangshan b62c075194 workqueue: clear leftover flags when detached
When a worker is detached, the worker->flags may still have WORKER_UNBOUND
or WORKER_REBOUND, it is OK for all cases:
  1) if it is a normal worker, the worker will be dead, it is OK.
  2) if it is a rescuer, it may re-attach to a pool with this leftover flag[s],
     it is still correct except it may cause unneeded wakeup.

It is correct but not good, so we just remove the leftover flags.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:29:12 -04:00
Lai Jiangshan 25ef09586d workqueue: remove useless WARN_ON_ONCE()
The @cpu is fetched via smp_processor_id() in this function,
so the check is useless.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:28:20 -04:00
Lai Jiangshan e212f361fb workqueue: use schedule_timeout_interruptible() instead of open code
schedule_timeout_interruptible(CREATE_COOLDOWN) is exactly the same as
the original code.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:25:51 -04:00
Lai Jiangshan e6a9a77123 workqueue: remove the empty check in too_many_workers()
The commit ea1abd6197 ("workqueue: reimplement idle worker rebinding")
used a trick which simply removes all to-be-bound idle workers from the
idle list and lets them add themselves back after completing rebinding.

And this trick caused the @worker_pool->nr_idle may deviate than the actual
number of idle workers on @worker_pool->idle_list.  More specifically,
nr_idle may be non-zero while ->idle_list is empty.  All users of
->nr_idle and ->idle_list are audited.  The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.

The commit/trick was complicated due to it just tried to simplify an even
more complicated problem (workers had to rebind itself). But the commit
a9ab775bca ("workqueue: directly restore CPU affinity of workers
from CPU_ONLINE") fixed all these problems and the mentioned trick was
useless and is gone.

So, now the @worker_pool->nr_idle is exactly the actual number of workers
on @worker_pool->idle_list. too_many_workers() should recover as it was
before the trick. So we remove the empty check.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:24:40 -04:00
Lai Jiangshan 61d0fbb4b6 workqueue: use "pool->cpu < 0" to stand for an unbound pool
There is a piece of sanity checks code in the put_unbound_pool().
The meaning of this code is "if it is not an unbound pool, it will complain
and return" IIUC. But the code uses "pool->flags & POOL_DISASSOCIATED"
imprecisely due to a non-unbound pool may also have this flags.

We should use "pool->cpu < 0" to stand for an unbound pool, so we covert the
code to it.

There is no strictly wrong if we still keep "pool->flags & POOL_DISASSOCIATED"
here, but it is just a noise if we keep it:
  1) we focus on "unbound" here, not "[dis]association".
  2) "pool->cpu < 0" already implies "pool->flags & POOL_DISASSOCIATED".

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-19 12:05:00 -04:00
Linus Torvalds da85d191f5 Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Lai simplified worker destruction path and internal workqueue locking
  and there are some other minor changes.

  Except for the removal of some long-deprecated interfaces which
  haven't had any in-kernel user for quite a while, there shouldn't be
  any difference to workqueue users"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: remove the confusing POOL_FREEZING
  workqueue: rename first_worker() to first_idle_worker()
  workqueue: remove unused work_clear_pending()
  workqueue: remove unused WORK_CPU_END
  workqueue: declare system_highpri_wq
  workqueue: use generic attach/detach routine for rescuers
  workqueue: separate pool-attaching code out from create_worker()
  workqueue: rename manager_mutex to attach_mutex
  workqueue: narrow the protection range of manager_mutex
  workqueue: convert worker_idr to worker_ida
  workqueue: separate iteration role from worker_idr
  workqueue: destroy worker directly in the idle timeout handler
  workqueue: async worker destruction
  workqueue: destroy_worker() should destroy idle workers only
  workqueue: use manager lock only to protect worker_idr
  workqueue: Remove deprecated system_nrt[_freezable]_wq
  workqueue: Remove deprecated flush[_delayed]_work_sync()
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
2014-06-09 14:56:49 -07:00
Valdis Kletnieks 015af06e10 kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
This commit did an incorrect printk->pr_info conversion. If we were
converting to pr_info() we should lose the log_level parameter. The problem is
that this is called (indirectly) by show_regs_print_info(), which is called
with various log_levels (from _INFO clear to _EMERG). So we leave it as
a printk() call so the desired log_level is applied.

Not a full revert, as the other half of the patch is correct.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-28 10:22:34 -04:00
Lai Jiangshan 74b414ead1 workqueue: remove the confusing POOL_FREEZING
Currently, the global freezing state is propagated to worker_pools via
POOL_FREEZING and then to each workqueue; however, the middle step -
propagation through worker_pools - can be skipped as long as one or
more max_active adjustments happens for each workqueue after the
update to the global state is visible.  The global workqueue freezing
state and the max_active adjustments during workqueue creation and
[un]freezing are serialized with wq_pool_mutex, so it's trivial to
guarantee that max_actives stay in sync with global freezing state.

POOL_FREEZING is unnecessary and makes the code more confusing and
complicates freeze_workqueues_begin() and thaw_workqueues() by
requiring them to walk through all pools.

Remove POOL_FREEZING and use workqueue_freezing directly instead.

tj: Description and comment updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-22 11:35:51 -04:00
Lai Jiangshan 1037de36ed workqueue: rename first_worker() to first_idle_worker()
first_worker() actually returns the first idle workers, the name
first_idle_worker() which is self-commnet will be better.

All the callers of first_worker() expect it returns an idle worker,
the name first_idle_worker() with "idle" notation makes reviewers happier.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-22 11:35:51 -04:00
Ingo Molnar 65c2ce7004 Linux 3.15-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTfR2zAAoJEHm+PkMAQRiG3noH/2s+KUge3qO2M+AmxttUo74B
 +npAMdbqYR3MdEiwxYZfsHcMu4Ye/IKLcrh4pydB5hI2mdjtGkH1bnmia0f1ve/c
 Z/a0256+W8gWp7mcUBqSNztqLPAWa7wKOqNdLjj5idr1BSj6u8im+fQ9FBh2woki
 1fyYAuq/60lq4CMOKJvkA95V1Ome/jO+8tS4PguOgsCETQxCVFGurZcBbG3Mx5Y3
 v+ioCqeRc6GvxPFR6YngnTZCrsLxSRT3tnO2Qy5zX7dxjIQkCEbvIckpBQv01Y3R
 wNUaX+2Jae207igxrEv8CjmCFnmZFuUI15aWWCy6fOS/j8bjuk6ThYJO8N4ZBM0=
 =2ShG
 -----END PGP SIGNATURE-----

Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22 10:28:56 +02:00
Lai Jiangshan 51697d3939 workqueue: use generic attach/detach routine for rescuers
There are several problems with the code that rescuers use to bind
themselve to the target pool's cpumask.

  1) It is very different from how the normal workers bind to cpumask,
     increasing code complexity and maintenance overhead.

  2) The code of cpu-binding for rescuers is complicated.

  3) If one or more cpu hotplugs happen while a rescuer is processing
     its scheduled work items, the rescuer may not stay bound to the
     cpumask of the pool. This is an allowed behavior, but is still
     hairy. It will be better if the cpumask of the rescuer is always
     kept synchronized with the pool across cpu hotplugs.

Using generic attach/detach routine will solve the above problems and
results in much simpler code.

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:32 -04:00
Lai Jiangshan 4736cbf7a4 workqueue: separate pool-attaching code out from create_worker()
Currently, the code to attach a new worker to its pool is embedded in
create_worker().  Separating this code out will make the codes clearer
and will allow rescuers to share the code path later.

tj: Description and comment updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:32 -04:00
Lai Jiangshan 92f9c5c40c workqueue: rename manager_mutex to attach_mutex
manager_mutex is only used to protect the attaching for the pool
and the pool->workers list. It protects the pool->workers and operations
based on this list, such as:

	cpu-binding for the workers in the pool->workers
	the operations to set/clear WORKER_UNBOUND

So let's rename manager_mutex to attach_mutex to better reflect its
role. This patch is a pure rename.

tj: Minor command and description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:32 -04:00
Lai Jiangshan 4d757c5c81 workqueue: narrow the protection range of manager_mutex
In create_worker(), as pool->worker_ida now uses
ida_simple_get()/ida_simple_put() and doesn't require external
synchronization, it doesn't need manager_mutex.

struct worker allocation and kthread allocation are not visible by any
one before attached, so they don't need manager_mutex either.

The above operations are before the attaching operation which attaches
the worker to the pool. Between attaching and starting the worker, the
worker is already attached to the pool, so the cpu hotplug will handle
cpu-binding for the worker correctly and we don't need the
manager_mutex after attaching.

The conclusion is that only the attaching operation needs manager_mutex,
so we narrow the protection section of manager_mutex in create_worker().

Some comments about manager_mutex are removed, because we will rename
it to attach_mutex and add worker_attach_to_pool() later which will be
self-explanatory.

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:32 -04:00
Lai Jiangshan 7cda9aae05 workqueue: convert worker_idr to worker_ida
We no longer iterate workers via worker_idr and worker_idr is used
only for allocating/freeing ID, so we can convert it to worker_ida.

By using ida_simple_get/remove(), worker_ida doesn't require external
synchronization, so we don't need manager_mutex to protect it and the
ID-removal code is allowed to be moved out from
worker_detach_from_pool().

In a later patch, worker_detach_from_pool() will be used in rescuers
which don't have IDs, so we move the ID-removal code out from
worker_detach_from_pool() into worker_thread().

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:31 -04:00
Lai Jiangshan da028469ba workqueue: separate iteration role from worker_idr
worker_idr has the iteration (iterating for attached workers) and
worker ID duties. These two duties don't have to be tied together. We
can separate them and use a list for tracking attached workers and
iteration.

Before this separation, it wasn't possible to add rescuer workers to
worker_idr due to rescuer workers couldn't allocate ID dynamically
because ID-allocation depends on memory-allocation, which rescuer
can't depend on.

After separation, we can easily add the rescuer workers to the list for
iteration without any memory-allocation. It is required when we attach
the rescuer worker to the pool in later patch.

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:31 -04:00
Lai Jiangshan 3347fc9f36 workqueue: destroy worker directly in the idle timeout handler
Since destroy_worker() doesn't need to sleep nor require manager_mutex,
destroy_worker() can be directly called in the idle timeout
handler, it helps us remove POOL_MANAGE_WORKERS and
maybe_destroy_worker() and simplify the manage_workers()

After POOL_MANAGE_WORKERS is removed, worker_thread() doesn't
need to test whether it needs to manage after processed works.
So we can remove the test branch.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-05-20 10:59:31 -04:00
Lai Jiangshan 60f5a4bcf8 workqueue: async worker destruction
worker destruction includes these parts of code:
	adjust pool's stats
	remove the worker from idle list
	detach the worker from the pool
	kthread_stop() to wait for the worker's task exit
	free the worker struct

We can find out that there is no essential work to do after
kthread_stop(), which means destroy_worker() doesn't need to wait for
the worker's task exit, so we can remove kthread_stop() and free the
worker struct in the worker exiting path.

However, put_unbound_pool() still needs to sync the all the workers'
destruction before destroying the pool; otherwise, the workers may
access to the invalid pool when they are exiting.

So we also move the code of "detach the worker" to the exiting
path and let put_unbound_pool() to sync with this code via
detach_completion.

The code of "detach the worker" is wrapped in a new function
"worker_detach_from_pool()" although worker_detach_from_pool() is only
called once (in worker_thread()) after this patch, but we need to wrap
it for these reasons:

  1) The code of "detach the worker" is not short enough to unfold them
     in worker_thread().
  2) the name of "worker_detach_from_pool()" is self-comment, and we add
     some comments above the function.
  3) it will be shared by rescuer in later patch which allows rescuer
     and normal thread use the same attach/detach frameworks.

The worker id is freed when detaching which happens before the worker
is fully dead, but this id of the dying worker may be re-used for a
new worker, so the dying worker's task name is changed to
"worker/dying" to avoid two or several workers having the same name.

Since "detach the worker" is moved out from destroy_worker(),
destroy_worker() doesn't require manager_mutex, so the
"lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is
removed, and destroy_worker() is not protected by manager_mutex in
put_unbound_pool().

tj: Minor description updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:30 -04:00
Lai Jiangshan 73eb7fe73a workqueue: destroy_worker() should destroy idle workers only
We used to have the CPU online failure path where a worker is created
and then destroyed without being started. A worker was created for
the CPU coming online and if the online operation failed the created worker
was shut down without being started.  But this behavior was changed.
The first worker is created and started at the same time for the CPU coming
online.

It means that we had already ensured in the code that destroy_worker()
destroys only idle workers and we don't want to allow it to destroy
any non-idle worker in the future. Otherwise, it may be buggy and it
may be extremely hard to check. We should force destroy_worker() to
destroy only idle workers explicitly.

Since destroy_worker() destroys only idle workers, this patch does not
change any functionality. We just need to update the comments and the
sanity check code.

In the sanity check code, we will refuse to destroy the worker
if !(worker->flags & WORKER_IDLE).

If the worker entered idle which means it is already started,
so we remove the check of "worker->flags & WORKER_STARTED",
after this removal, WORKER_STARTED is totally unneeded,
so we remove WORKER_STARTED too.

In the comments for create_worker(), "Create a new worker which is bound..."
is changed to "... which is attached..." due to we change the name of this
behavior to attaching.

tj: Minor description / comment updates.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:30 -04:00
Lai Jiangshan 9625ab1727 workqueue: use manager lock only to protect worker_idr
worker_idr is highly bound to managers and is always/only accessed in manager
lock context. So we don't need pool->lock for it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20 10:59:30 -04:00
Fabian Frederick 2d916033a3 kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
tj: Refreshed on top of wq/for-3.16.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-12 13:59:35 -04:00
Daeseok Youn 534a3fbb3f workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
wq_update_unbound_numa(), when it's decided that the newly updated
cpumask equals the default, looks at whether the current pwq is
already the default one and skips setting pwq to the default one.
This extra step is unnecessary and we can always jump to use_dfl_pwq
instead. Simplify the code by removing the conditional.
This doesn't make any functional difference.

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-18 15:50:34 -04:00
Lai Jiangshan 77668c8b55 workqueue: fix a possible race condition between rescuer and pwq-release
There is a race condition between rescuer_thread() and
pwq_unbound_release_workfn().

Even after a pwq is scheduled for rescue, the associated work items
may be consumed by any worker.  If all of them are consumed before the
rescuer gets to them and the pwq's base ref was put due to attribute
change, the pwq may be released while still being linked on
@wq->maydays list making the rescuer dereference already freed pwq
later.

Make send_mayday() pin the target pwq until the rescuer is done with
it.

tj: Updated comment and patch description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.10+
2014-04-18 12:33:29 -04:00
Lai Jiangshan 4d595b866d workqueue: make rescuer_thread() empty wq->maydays list before exiting
After a @pwq is scheduled for emergency execution, other workers may
consume the affectd work items before the rescuer gets to them.  This
means that a workqueue many have pwqs queued on @wq->maydays list
while not having any work item pending or in-flight.  If
destroy_workqueue() executes in such condition, the rescuer may exit
without emptying @wq->maydays.

This currently doesn't cause any actual harm.  destroy_workqueue() can
safely destroy all the involved data structures whether @wq->maydays
is populated or not as nobody access the list once the rescuer exits.

However, this is nasty and makes future development difficult.  Let's
update rescuer_thread() so that it empties @wq->maydays after seeing
should_stop to guarantee that the list is empty on rescuer exit.

tj: Updated comment and patch description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.10+
2014-04-18 11:04:16 -04:00
Dongsheng Yang 8698a745d8 sched, treewide: Replace hardcoded nice values with MIN_NICE/MAX_NICE
Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com
Cc: devel@driverdev.osuosl.org
Cc: devicetree@vger.kernel.org
Cc: fcoe-devel@open-fcoe.org
Cc: linux390@de.ibm.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: nbd-general@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: openipmi-developer@lists.sourceforge.net
Cc: qla2xxx-upstream@qlogic.com
Cc: linux-arch@vger.kernel.org
[ Consolidated the patches, twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-18 12:07:24 +02:00