Commit Graph

437 Commits

Author SHA1 Message Date
Lai Jiangshan 12ee4fc67c workqueue: add missing POOL_FREEZING
get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing
is set and a new pool could go out of sync with the global freezing
state.

Fix it by adding POOL_FREEZING if workqueue_freezing.  wq_mutex is
already held so no further locking is necessary.  This also removes
the unused static variable warning when !CONFIG_FREEZER.

tj: Updated commit message.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-20 10:19:09 -07:00
Tejun Heo 7dbc725e47 workqueue: restore CPU affinity of unbound workers on CPU_ONLINE
With the recent addition of the custom attributes support, unbound
pools may have allowed cpumask which isn't full.  As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.

To remain compliant to the user-specified configuration, CPU affinity
needs to be restored when a CPU becomes online for an unbound pool
which doesn't currently have any online CPUs before.

This patch implement restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo a9ab775bca workqueue: directly restore CPU affinity of workers from CPU_ONLINE
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.

As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr().  Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound.  worker->rebind_work is queued to busy
workers.  Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind().  The manager isn't covered by either and
implements its own mechanism.

PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers.  This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.

There are a couple subtleties.  All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up.  This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.

Another subtlety is stems from matching @pool->nr_running with the
number of running unbound workers.  While DISASSOCIATED, all workers
are unbound and nr_running is zero.  As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals.  Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution.  The only change
needed for the workers is clearing REBOUND along with PREP.

* This patch leaves for_each_busy_worker() without any user.  Removed.

* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
  and rebind logic in manager_workers() removed.

* worker_thread() now looks at WORKER_DIE instead of testing whether
  @worker->entry is empty to determine whether it needs to do
  something special as dying is the only special thing now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo bd7c089eb2 workqueue: relocate rebind_workers()
rebind_workers() will be reimplemented in a way which makes it mostly
decoupled from the rest of worker management.  Move rebind_workers()
so that it's located with other CPU hotplug related functions.

This patch is pure function relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo 822d8405d1 workqueue: convert worker_pool->worker_ida to idr and implement for_each_pool_worker()
Make worker_ida an idr - worker_idr and use it to implement
for_each_pool_worker() which will be used to simplify worker rebinding
on CPU_ONLINE.

pool->worker_idr is protected by both pool->manager_mutex and
pool->lock so that it can be iterated while holding either lock.

* create_worker() allocates ID without installing worker pointer and
  installs the pointer later using idr_replace().  This is because
  worker ID is needed when creating the actual task to name it and the
  new worker shouldn't be visible to iterations before fully
  initialized.

* In destroy_worker(), ID removal is moved before kthread_stop().
  This is again to guarantee that only fully working workers are
  visible to for_each_pool_worker().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-19 13:45:21 -07:00
Tejun Heo 14a40ffccd sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself.  Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.

What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks.  In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.

This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.

This will allow simplifying workqueue worker CPU affinity management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-19 13:45:20 -07:00
Linus Torvalds b63dc123b2 Merge branch 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Lai's patch to fix highly unlikely but still possible workqueue stall
  during CPU hotunplug."

* 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix possible pool stall bug in wq_unbind_fn()
2013-03-18 18:47:07 -07:00
Tejun Heo 2e109a2855 workqueue: rename workqueue_lock to wq_mayday_lock
With the recent locking updates, the only thing protected by
workqueue_lock is workqueue->maydays list.  Rename workqueue_lock to
wq_mayday_lock.

This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo 794b18bc8a workqueue: separate out pool_workqueue locking into pwq_lock
This patch continues locking cleanup from the previous patch.  It
breaks out pool_workqueue synchronization from workqueue_lock into a
new spinlock - pwq_lock.  The followings are protected by pwq_lock.

* workqueue->pwqs
* workqueue->saved_max_active

The conversion is straight-forward.  workqueue_lock usages which cover
the above two are converted to pwq_lock.  New locking label PW added
for things protected by pwq_lock and FR is updated to mean flush_mutex
+ pwq_lock + sched-RCU.

This patch shouldn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo 5bcab3355a workqueue: separate out pool and workqueue locking into wq_mutex
Currently, workqueue_lock protects most shared workqueue resources -
the pools, workqueues, pool_workqueues, draining, ID assignments,
mayday handling and so on.  The coverage has grown organically and
there is no identified bottleneck coming from workqueue_lock, but it
has grown a bit too much and scheduled rebinding changes need the
pools and workqueues to be protected by a mutex instead of a spinlock.

This patch breaks out pool and workqueue synchronization from
workqueue_lock into a new mutex - wq_mutex.  The followings are
protected by wq_mutex.

* worker_pool_idr and unbound_pool_hash
* pool->refcnt
* workqueues list
* workqueue->flags, ->nr_drainers

Most changes are mostly straight-forward.  workqueue_lock is replaced
with wq_mutex where applicable and workqueue_lock lock/unlocks are
added where wq_mutex conversion leaves data structures not protected
by wq_mutex without locking.  irq / preemption flippings were added
where the conversion affects them.  Things worth noting are

* New WQ and WR locking lables added along with
  assert_rcu_or_wq_mutex().

* worker_pool_assign_id() now expects to be called under wq_mutex.

* create_mutex is removed from get_unbound_pool().  It now just holds
  wq_mutex.

This patch shouldn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo 7d19c5ce66 workqueue: relocate global variable defs and function decls in workqueue.c
They're split across debugobj code for some reason.  Collect them.

This patch is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:40 -07:00
Tejun Heo cd549687a7 workqueue: better define locking rules around worker creation / destruction
When a manager creates or destroys workers, the operations are always
done with the manager_mutex held; however, initial worker creation or
worker destruction during pool release don't grab the mutex.  They are
still correct as initial worker creation doesn't require
synchronization and grabbing manager_arb provides enough exclusion for
pool release path.

Still, let's make everyone follow the same rules for consistency and
such that lockdep annotations can be added.

Update create_and_start_worker() and put_unbound_pool() to grab
manager_mutex around thread creation and destruction respectively and
add lockdep assertions to create_worker() and destroy_worker().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo ebf44d16ec workqueue: factor out initial worker creation into create_and_start_worker()
get_unbound_pool(), workqueue_cpu_up_callback() and init_workqueues()
have similar code pieces to create and start the initial worker factor
those out into create_and_start_worker().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo bc3a1afc92 workqueue: rename worker_pool->assoc_mutex to ->manager_mutex
Manager operations are currently governed by two mutexes -
pool->manager_arb and ->assoc_mutex.  The former is used to decide who
gets to be the manager and the latter to exclude the actual manager
operations including creation and destruction of workers.  Anyone who
grabs ->manager_arb must perform manager role; otherwise, the pool
might stall.

Grabbing ->assoc_mutex blocks everyone else from performing manager
operations but doesn't require the holder to perform manager duties as
it's merely blocking manager operations without becoming the manager.

Because the blocking was necessary when [dis]associating per-cpu
workqueues during CPU hotplug events, the latter was named
assoc_mutex.  The mutex is scheduled to be used for other purposes, so
this patch gives it a more fitting generic name - manager_mutex - and
updates / adds comments to explain synchronization around the manager
role and operations.

This patch is pure rename / doc update.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 19:47:39 -07:00
Tejun Heo 8425e3d5bd workqueue: inline trivial wrappers
There's no reason to make these trivial wrappers full (exported)
functions.  Inline the followings.

 queue_work()
 queue_delayed_work()
 mod_delayed_work()
 schedule_work_on()
 schedule_work()
 schedule_delayed_work_on()
 schedule_delayed_work()
 keventd_up()

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo 611c92a020 workqueue: rename @id to @pi in for_each_each_pool()
Rename @id argument of for_each_pool() to @pi so that it doesn't get
reused accidentally when for_each_pool() is used in combination with
other iterators.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo c5aa87bbf4 workqueue: update comments and a warning message
* Update incorrect and add missing synchronization labels.

* Update incorrect or misleading comments.  Add new comments where
  clarification is necessary.  Reformat / rephrase some comments.

* drain_workqueue() can be used separately from destroy_workqueue()
  but its warning message was incorrectly referring to destruction.

Other than the warning message change, this patch doesn't make any
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:36 -07:00
Tejun Heo 983ca25e73 workqueue: fix max_active handling in init_and_link_pwq()
Since 9e8cd2f589 ("workqueue: implement apply_workqueue_attrs()"),
init_and_link_pwq() may be called to initialize a new pool_workqueue
for a workqueue which is already online, but the function was setting
pwq->max_active to wq->saved_max_active without proper
synchronization.

Fix it by calling pwq_adjust_max_active() under proper locking instead
of manually setting max_active.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Tejun Heo 699ce097ef workqueue: implement and use pwq_adjust_max_active()
Rename pwq_set_max_active() to pwq_adjust_max_active() and move
pool_workqueue->max_active synchronization and max_active
determination logic into it.

The new function should be called with workqueue_lock held for stable
workqueue->saved_max_active, determines the current max_active value
the target pool_workqueue should be using from @wq->saved_max_active
and the state of the associated pool, and applies it with proper
synchronization.

The current two users - workqueue_set_max_active() and
thaw_workqueues() - are updated accordingly.  In addition, the manual
freezing handling in __alloc_workqueue_key() and
freeze_workqueues_begin() are replaced with calls to
pwq_adjust_max_active().

This centralizes max_active handling so that it's less error-prone.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Tejun Heo 0fbd95aa8a workqueue: relocate pwq_set_max_active()
pwq_set_max_active() is gonna be modified and used during
pool_workqueue init.  Move it above init_and_link_pwq().

This patch is pure code reorganization and doesn't introduce any
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-13 16:51:35 -07:00
Tejun Heo e68035fb65 workqueue: convert to idr_alloc()
idr_get_new*() and friends are about to be deprecated.  Convert to the
new idr_alloc() interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-13 15:21:46 -07:00
Tejun Heo e626761691 workqueue: implement current_is_workqueue_rescuer()
Implement a function which queries whether it currently is running off
a workqueue rescuer.  This will be used to convert writeback to
workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-12 17:42:01 -07:00
Tejun Heo 226223ab3c workqueue: implement sysfs interface for workqueues
There are cases where workqueue users want to expose control knobs to
userland.  e.g. Unbound workqueues with custom attributes are
scheduled to be used for writeback workers and depending on
configuration it can be useful to allow admins to tinker with the
priority or allowed CPUs.

This patch implements workqueue_sysfs_register(), which makes the
workqueue visible under /sys/bus/workqueue/devices/WQ_NAME.  There
currently are two attributes common to both per-cpu and unbound pools
and extra attributes for unbound pools including nice level and
cpumask.

If alloc_workqueue*() is called with WQ_SYSFS,
workqueue_sysfs_register() is called automatically as part of
workqueue creation.  This is the preferred method unless the workqueue
user wants to apply workqueue_attrs before making the workqueue
visible to userland.

v2: Disallow exposing ordered workqueues as ordered workqueues can't
    be tuned in any way.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-12 11:37:07 -07:00
Tejun Heo 8719dceae2 workqueue: reject adjusting max_active or applying attrs to ordered workqueues
Adjusting max_active of or applying new workqueue_attrs to an ordered
workqueue breaks its ordering guarantee.  The former is obvious.  The
latter is because applying attrs creates a new pwq (pool_workqueue)
and there is no ordering constraint between the old and new pwqs.

Make apply_workqueue_attrs() and workqueue_set_max_active() trigger
WARN_ON() if those operations are requested on an ordered workqueue
and fail / ignore respectively.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo 618b01eb42 workqueue: make it clear that WQ_DRAINING is an internal flag
We're gonna add another internal WQ flag.  Let's make the distinction
clear.  Prefix WQ_DRAINING with __ and move it to bit 16.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo 9e8cd2f589 workqueue: implement apply_workqueue_attrs()
Implement apply_workqueue_attrs() which applies workqueue_attrs to the
specified unbound workqueue by creating a new pwq (pool_workqueue)
linked to worker_pool with the specified attributes.

A new pwq is linked at the head of wq->pwqs instead of tail and
__queue_work() verifies that the first unbound pwq has positive refcnt
before choosing it for the actual queueing.  This is to cover the case
where creation of a new pwq races with queueing.  As base ref on a pwq
won't be dropped without making another pwq the first one,
__queue_work() is guaranteed to make progress and not add work item to
a dead pwq.

init_and_link_pwq() is updated to return the last first pwq the new
pwq replaced, which is put by apply_workqueue_attrs().

Note that apply_workqueue_attrs() is almost identical to unbound pwq
part of alloc_and_link_pwqs().  The only difference is that there is
no previous first pwq.  apply_workqueue_attrs() is implemented to
handle such cases and replaces unbound pwq handling in
alloc_and_link_pwqs().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo c9178087ac workqueue: perform non-reentrancy test when queueing to unbound workqueues too
Because per-cpu workqueues have multiple pwqs (pool_workqueues) to
serve the CPUs, to guarantee that a single work item isn't queued on
one pwq while still executing another, __queue_work() takes a look at
the previous pool the target work item was on and if it's still
executing there, queue the work item on that pool.

To support changing workqueue_attrs on the fly, unbound workqueues too
will have multiple pwqs and thus need non-reentrancy test when
queueing.  This patch modifies __queue_work() such that the reentrancy
test is performed regardless of the workqueue type.

per_cpu_ptr(wq->cpu_pwqs, cpu) used to be used to determine the
matching pwq for the last pool.  This can't be used for unbound
workqueues and is replaced with worker->current_pwq which also happens
to be simpler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo 75ccf5950f workqueue: prepare flush_workqueue() for dynamic creation and destrucion of unbound pool_workqueues
Unbound pwqs (pool_workqueues) will be dynamically created and
destroyed with the scheduled unbound workqueue w/ custom attributes
support.  This patch synchronizes pwq linking and unlinking against
flush_workqueue() so that its operation isn't disturbed by pwqs coming
and going.

Linking and unlinking a pwq into wq->pwqs is now protected also by
wq->flush_mutex and a new pwq's work_color is initialized to
wq->work_color during linking.  This ensures that pwqs changes don't
disturb flush_workqueue() in progress and the new pwq's work coloring
stays in sync with the rest of the workqueue.

flush_mutex during unlinking isn't strictly necessary but it's simpler
to do it anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo 8864b4e59f workqueue: implement get/put_pwq()
Add pool_workqueue->refcnt along with get/put_pwq().  Both per-cpu and
unbound pwqs have refcnts and any work item inserted on a pwq
increments the refcnt which is dropped when the work item finishes.

For per-cpu pwqs the base ref is never dropped and destroy_workqueue()
frees the pwqs as before.  For unbound ones, destroy_workqueue()
simply drops the base ref on the first pwq.  When the refcnt reaches
zero, pwq_unbound_release_workfn() is scheduled on system_wq, which
unlinks the pwq, puts the associated pool and frees the pwq and wq as
necessary.  This needs to be done from a work item as put_pwq() needs
to be protected by pool->lock but release can't happen with the lock
held - e.g. put_unbound_pool() involves blocking operations.

Unbound pool->locks are marked with lockdep subclas 1 as put_pwq()
will schedule the release work item on system_wq while holding the
unbound pool's lock and triggers recursive locking warning spuriously.

This will be used to implement dynamic creation and destruction of
unbound pwqs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo d2c1d40487 workqueue: restructure __alloc_workqueue_key()
* Move initialization and linking of pool_workqueues into
  init_and_link_pwq().

* Make the failure path use destroy_workqueue() once pool_workqueue
  initialization succeeds.

These changes are to prepare for dynamic management of pool_workqueues
and don't introduce any functional changes.

While at it, convert list_del(&wq->list) to list_del_init() as a
precaution as scheduled changes will make destruction more complex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:04 -07:00
Tejun Heo 493008a8e4 workqueue: drop WQ_RESCUER and test workqueue->rescuer for NULL instead
WQ_RESCUER is superflous.  WQ_MEM_RECLAIM indicates that the user
wants a rescuer and testing wq->rescuer for NULL can answer whether a
given workqueue has a rescuer or not.  Drop WQ_RESCUER and test
wq->rescuer directly.

This will help simplifying __alloc_workqueue_key() failure path by
allowing it to use destroy_workqueue() on a partially constructed
workqueue, which in turn will help implementing dynamic management of
pool_workqueues.

While at it, clear wq->rescuer after freeing it in
destroy_workqueue().  This is a precaution as scheduled changes will
make destruction more complex.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:03 -07:00
Tejun Heo ac6104cdf8 workqueue: add pool ID to the names of unbound kworkers
There are gonna be multiple unbound pools.  Include pool ID in the
name of unbound kworkers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:03 -07:00
Tejun Heo f02ae73aaa workqueue: drop "std" from cpu_std_worker_pools and for_each_std_worker_pool()
All per-cpu pools are standard, so there's no need to use both "cpu"
and "std" and for_each_std_worker_pool() is confusing in that it can
be used only for per-cpu pools.

* s/cpu_std_worker_pools/cpu_worker_pools/

* s/for_each_std_worker_pool()/for_each_cpu_worker_pool()/

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:03 -07:00
Tejun Heo 7a62c2c87e workqueue: remove unbound_std_worker_pools[] and related helpers
Workqueue no longer makes use of unbound_std_worker_pools[].  All
unbound worker_pools are created dynamically and there's nothing
special about the standard ones.  With unbound_std_worker_pools[]
unused, workqueue no longer has places where it needs to treat the
per-cpu pools-cpu and unbound pools together.

Remove unbound_std_worker_pools[] and the helpers wrapping it to
present unified per-cpu and unbound standard worker_pools.

* for_each_std_worker_pool() now only walks through per-cpu pools.

* for_each[_online]_wq_cpu() which don't have any users left are
  removed.

* std_worker_pools() and std_worker_pool_pri() are unused and removed.

* get_std_worker_pool() is removed.  Its only user -
  alloc_and_link_pwqs() - only used it for per-cpu pools anyway.  Open
  code per_cpu access in alloc_and_link_pwqs() instead.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:03 -07:00
Tejun Heo 29c91e9912 workqueue: implement attribute-based unbound worker_pool management
This patch makes unbound worker_pools reference counted and
dynamically created and destroyed as workqueues needing them come and
go.  All unbound worker_pools are hashed on unbound_pool_hash which is
keyed by the content of worker_pool->attrs.

When an unbound workqueue is allocated, get_unbound_pool() is called
with the attributes of the workqueue.  If there already is a matching
worker_pool, the reference count is bumped and the pool is returned.
If not, a new worker_pool with matching attributes is created and
returned.

When an unbound workqueue is destroyed, put_unbound_pool() is called
which decrements the reference count of the associated worker_pool.
If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU
safe way.

Note that the standard unbound worker_pools - normal and highpri ones
with no specific cpumask affinity - are no longer created explicitly
during init_workqueues().  init_workqueues() only initializes
workqueue_attrs to be used for standard unbound pools -
unbound_std_wq_attrs[].  The pools are spawned on demand as workqueues
are created.

v2: - Comment added to init_worker_pool() explaining that @pool should
      be in a condition which can be passed to put_unbound_pool() even
      on failure.

    - pool->refcnt reaching zero and the pool being removed from
      unbound_pool_hash should be dynamic.  pool->refcnt is converted
      to int from atomic_t and now manipulated inside workqueue_lock.

    - Removed an incorrect sanity check on nr_idle in
      put_unbound_pool() which may trigger spuriously.

    All changes were suggested by Lai Jiangshan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:03 -07:00
Tejun Heo 7a4e344c56 workqueue: introduce workqueue_attrs
Introduce struct workqueue_attrs which carries worker attributes -
currently the nice level and allowed cpumask along with helper
routines alloc_workqueue_attrs() and free_workqueue_attrs().

Each worker_pool now carries ->attrs describing the attributes of its
workers.  All functions dealing with cpumask and nice level of workers
are updated to follow worker_pool->attrs instead of determining them
from other characteristics of the worker_pool, and init_workqueues()
is updated to set worker_pool->attrs appropriately for all standard
pools.

Note that create_worker() is updated to always perform set_user_nice()
and use set_cpus_allowed_ptr() combined with manual assertion of
PF_THREAD_BOUND instead of kthread_bind().  This simplifies handling
random attributes without affecting the outcome.

This patch doesn't introduce any behavior changes.

v2: Missing cpumask_var_t definition caused build failure on some
    archs.  linux/cpumask.h included.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:00 -07:00
Tejun Heo 4e1a1f9a05 workqueue: separate out init_worker_pool() from init_workqueues()
This will be used to implement unbound pools with custom attributes.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:00 -07:00
Tejun Heo 34a06bd6b6 workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_arb
POOL_MANAGING_WORKERS is used to synchronize the manager role.
Synchronizing among workers doesn't need blocking and that's why it's
implemented as a flag.

It got converted to a mutex a while back to add blocking wait from CPU
hotplug path - 6037315269 ("workqueue: use mutex for global_cwq
manager exclusion").  Later it turned out that synchronization among
workers and cpu hotplug need to be done separately.  Eventually,
POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got
morphed into workqueue->assoc_mutex - 552a37e936 ("workqueue: restore
POOL_MANAGING_WORKERS") and b2eb83d123 ("workqueue: rename
manager_mutex to assoc_mutex").

Now, we're gonna need to be able to lock out managers from
destroy_workqueue() to support multiple unbound pools with custom
attributes making it again necessary to be able to block on the
manager role.  This patch replaces POOL_MANAGING_WORKERS with
worker_pool->manager_arb.

This patch doesn't introduce any behavior changes.

v2: s/manager_mutex/manager_arb/

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-12 11:30:00 -07:00
Tejun Heo fa1b54e69b workqueue: update synchronization rules on worker_pool_idr
Make worker_pool_idr protected by workqueue_lock for writes and
sched-RCU protected for reads.  Lockdep assertions are added to
for_each_pool() and get_work_pool() and all their users are converted
to either hold workqueue_lock or disable preemption/irq.

worker_pool_assign_id() is updated to hold workqueue_lock when
allocating a pool ID.  As idr_get_new() always performs RCU-safe
assignment, this is enough on the writer side.

As standard pools are never destroyed, there's nothing to do on that
side.

The locking is superflous at this point.  This is to help
implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.

v2: Updated for_each_pwq() use if/else for the hidden assertion
    statement instead of just if as suggested by Lai.  This avoids
    confusing the following else clause.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:00 -07:00
Tejun Heo 76af4d9361 workqueue: update synchronization rules on workqueue->pwqs
Make workqueue->pwqs protected by workqueue_lock for writes and
sched-RCU protected for reads.  Lockdep assertions are added to
for_each_pwq() and first_pwq() and all their users are converted to
either hold workqueue_lock or disable preemption/irq.

alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for
consistency which isn't strictly necessary as the workqueue isn't
visible.  destroy_workqueue() isn't updated to sched-RCU release pwqs.
This is okay as the workqueue should have on users left by that point.

The locking is superflous at this point.  This is to help
implementation of unbound pools/pwqs with custom attributes.

This patch doesn't introduce any behavior changes.

v2: Updated for_each_pwq() use if/else for the hidden assertion
    statement instead of just if as suggested by Lai.  This avoids
    confusing the following else clause.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:00 -07:00
Tejun Heo 7fb98ea79c workqueue: replace get_pwq() with explicit per_cpu_ptr() accesses and first_pwq()
get_pwq() takes @cpu, which can also be WORK_CPU_UNBOUND, and @wq and
returns the matching pwq (pool_workqueue).  We want to move away from
using @cpu for identifying pools and pwqs for unbound pools with
custom attributes and there is only one user - workqueue_congested() -
which makes use of the WQ_UNBOUND conditional in get_pwq().  All other
users already know whether they're dealing with a per-cpu or unbound
workqueue.

Replace get_pwq() with explicit per_cpu_ptr(wq->cpu_pwqs, cpu) for
per-cpu workqueues and first_pwq() for unbound ones, and open-code
WQ_UNBOUND conditional in workqueue_congested().

Note that this makes workqueue_congested() behave sligntly differently
when @cpu other than WORK_CPU_UNBOUND is specified.  It ignores @cpu
for unbound workqueues and always uses the first pwq instead of
oopsing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:30:00 -07:00
Tejun Heo 420c0ddb1f workqueue: remove workqueue_struct->pool_wq.single
workqueue->pool_wq union is used to point either to percpu pwqs
(pool_workqueues) or single unbound pwq.  As the first pwq can be
accessed via workqueue->pwqs list, there's no reason for the single
pointer anymore.

Use list_first_entry(workqueue->pwqs) to access the unbound pwq and
drop workqueue->pool_wq.single pointer and the pool_wq union.  It
simplifies the code and eases implementing multiple unbound pools w/
custom attributes.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:59 -07:00
Tejun Heo d84ff0512f workqueue: consistently use int for @cpu variables
Workqueue is mixing unsigned int and int for @cpu variables.  There's
no point in using unsigned int for cpus - many of cpu related APIs
take int anyway.  Consistently use int for @cpu variables so that we
can use negative values to mark special ones.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:59 -07:00
Tejun Heo 493a1724fe workqueue: add wokrqueue_struct->maydays list to replace mayday cpu iterators
Similar to how pool_workqueue iteration used to be, raising and
servicing mayday requests is based on CPU numbers.  It's hairy because
cpumask_t may not be able to handle WORK_CPU_UNBOUND and cpumasks are
assumed to be always set on UP.  This is ugly and can't handle
multiple unbound pools to be added for unbound workqueues w/ custom
attributes.

Add workqueue_struct->maydays.  When a pool_workqueue needs rescuing,
it gets chained on the list through pool_workqueue->mayday_node and
rescuer_thread() consumes the list until it's empty.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:59 -07:00
Tejun Heo 24b8a84718 workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions
The three freeze/thaw related functions - freeze_workqueues_begin(),
freeze_workqueues_busy() and thaw_workqueues() - need to iterate
through all pool_workqueues of all freezable workqueues.  They did it
by first iterating pools and then visiting all pwqs (pool_workqueues)
of all workqueues and process it if its pwq->pool matches the current
pool.  This is rather backwards and done this way partly because
workqueue didn't have fitting iteration helpers and partly to avoid
the number of lock operations on pool->lock.

Workqueue now has fitting iterators and the locking operation overhead
isn't anything to worry about - those locks are unlikely to be
contended and the same CPU visiting the same set of locks multiple
times isn't expensive.

Restructure the three functions such that the flow better matches the
logical steps and pwq iteration is done using for_each_pwq() inside
workqueue iteration.

* freeze_workqueues_begin(): Setting of FREEZING is moved into a
  separate for_each_pool() iteration.  pwq iteration for clearing
  max_active is updated as described above.

* freeze_workqueues_busy(): pwq iteration updated as described above.

* thaw_workqueues(): The single for_each_wq_cpu() iteration is broken
  into three discrete steps - clearing FREEZING, restoring max_active,
  and kicking workers.  The first and last steps use for_each_pool()
  and the second step uses pwq iteration described above.

This makes the code easier to understand and removes the use of
for_each_wq_cpu() for walking pwqs, which can't support multiple
unbound pwqs which will be needed to implement unbound workqueues with
custom attributes.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:58 -07:00
Tejun Heo 1711696955 workqueue: introduce for_each_pool()
With the scheduled unbound pools with custom attributes, there will be
multiple unbound pools, so it wouldn't be able to use
for_each_wq_cpu() + for_each_std_worker_pool() to iterate through all
pools.

Introduce for_each_pool() which iterates through all pools using
worker_pool_idr and use it instead of for_each_wq_cpu() +
for_each_std_worker_pool() combination in freeze_workqueues_begin().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:58 -07:00
Tejun Heo 49e3cf44df workqueue: replace for_each_pwq_cpu() with for_each_pwq()
Introduce for_each_pwq() which iterates all pool_workqueues of a
workqueue using the recently added workqueue->pwqs list and replace
for_each_pwq_cpu() usages with it.

This is primarily to remove the single unbound CPU assumption from pwq
iteration for the scheduled unbound pools with custom attributes
support which would introduce multiple unbound pwqs per workqueue;
however, it also simplifies iterator users.

Note that pwq->pool initialization is moved to alloc_and_link_pwqs()
as that now is the only place which is explicitly handling the two pwq
types.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:58 -07:00
Tejun Heo 30cdf2496d workqueue: add workqueue_struct->pwqs list
Add workqueue_struct->pwqs list and chain all pool_workqueues
belonging to a workqueue there.  This will be used to implement
generic pool_workqueue iteration and handle multiple pool_workqueues
for the scheduled unbound pools with custom attributes.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:57 -07:00
Tejun Heo e904e6c266 workqueue: introduce kmem_cache for pool_workqueues
pool_workqueues need to be aligned to 1 << WORK_STRUCT_FLAG_BITS as
the lower bits of work->data are used for flags when they're pointing
to pool_workqueues.

Due to historical reasons, unbound pool_workqueues are allocated using
kzalloc() with sufficient buffer area for alignment and aligned
manually.  The original pointer is stored at the end which free_pwqs()
retrieves when freeing it.

There's no reason for this hackery anymore.  Set alignment of struct
pool_workqueue to 1 << WORK_STRUCT_FLAG_BITS, add kmem_cache for
pool_workqueues with proper alignment and replace the hacky alloc and
free implementation with plain kmem_cache_zalloc/free().

In case WORK_STRUCT_FLAG_BITS gets shrunk too much and makes fields of
pool_workqueues misaligned, trigger WARN if the alignment of struct
pool_workqueue becomes smaller than that of long long.

Note that assertion on IS_ALIGNED() is removed from alloc_pwqs().  We
already have another one in pwq init loop in __alloc_workqueue_key().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:57 -07:00
Tejun Heo e98d5b16cf workqueue: make workqueue_lock irq-safe
workqueue_lock will be used to synchronize areas which require
irq-safety and there isn't much benefit in keeping it not irq-safe.
Make it irq-safe.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:57 -07:00
Tejun Heo 6183c009f6 workqueue: make sanity checks less punshing using WARN_ON[_ONCE]()s
Workqueue has been using mostly BUG_ON()s for sanity checks, which
fail unnecessarily harshly when the assertion doesn't hold.  Most
assertions can converted to be less drastic such that things can limp
along instead of dying completely.  Convert BUG_ON()s to
WARN_ON[_ONCE]()s with softer failure behaviors - e.g. if assertion
check fails in destroy_worker(), trigger WARN and silently ignore
destruction request.

Most conversions are trivial.  Note that sanity checks in
destroy_workqueue() are moved above removal from workqueues list so
that it can bail out without side-effects if assertion checks fail.

This patch doesn't introduce any visible behavior changes during
normal operation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-03-12 11:29:57 -07:00
Lai Jiangshan eb2834285c workqueue: fix possible pool stall bug in wq_unbind_fn()
Since multiple pools per cpu have been introduced, wq_unbind_fn() has
a subtle bug which may theoretically stall work item processing.  The
problem is two-fold.

* wq_unbind_fn() depends on the worker executing wq_unbind_fn() itself
  to start unbound chain execution, which works fine when there was
  only single pool.  With multiple pools, only the pool which is
  running wq_unbind_fn() - the highpri one - is guaranteed to have
  such kick-off.  The other pool could stall when its busy workers
  block.

* The current code is setting WORKER_UNBIND / POOL_DISASSOCIATED of
  the two pools in succession without initiating work execution
  inbetween.  Because setting the flags requires grabbing assoc_mutex
  which is held while new workers are created, this could lead to
  stalls if a pool's manager is waiting for the previous pool's work
  items to release memory.  This is almost purely theoretical tho.

Update wq_unbind_fn() such that it sets WORKER_UNBIND /
POOL_DISASSOCIATED, goes over schedule() and explicitly kicks off
execution for a pool and then moves on to the next one.

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-03-08 15:18:28 -08:00
Lai Jiangshan b31041042a workqueue: better define synchronization rule around rescuer->pool updates
Rescuers visit different worker_pools to process work items from pools
under pressure.  Currently, rescuer->pool is updated outside any
locking and when an outsider looks at a rescuer, there's no way to
tell when and whether rescuer->pool is gonna change.  While this
doesn't currently cause any problem, it is nasty.

With recent worker_maybe_bind_and_lock() changes, we can move
rescuer->pool updates inside pool locks such that if rescuer->pool
equals a locked pool, it's guaranteed to stay that way until the pool
is unlocked.

Move rescuer->pool inside pool->lock.

This patch doesn't introduce any visible behavior difference.

tj: Updated the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-04 09:44:58 -08:00
Lai Jiangshan f36dc67b27 workqueue: change argument of worker_maybe_bind_and_lock() to @pool
worker_maybe_bind_and_lock() currently takes @worker but only cares
about @worker->pool.  This patch updates worker_maybe_bind_and_lock()
to take @pool instead of @worker.  This will be used to better define
synchronization rules regarding rescuer->pool updates.

This doesn't introduce any functional change.

tj: Updated the comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-04 09:44:58 -08:00
Lai Jiangshan f5faa0774e workqueue: use %current instead of worker->task in worker_maybe_bind_and_lock()
worker_maybe_bind_and_lock() uses both @worker->task and @current at
the same time.  As worker_maybe_bind_and_lock() can only be called by
the current worker task, they are always the same.

Update worker_maybe_bind_and_lock() to use %current consistently.

This doesn't introduce any functional change.

tj: Massaged the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-03-04 09:44:58 -08:00
Sasha Levin b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Konstantin Khlebnikov 1438ade567 workqueue: un-GPL function delayed_work_timer_fn()
commit d8e794dfd5 ("workqueue: set
delayed_work->timer function on initialization") exports function
delayed_work_timer_fn() only for GPL modules. This makes delayed-works
unusable for non-GPL modules, because initialization macro now requires
GPL symbol. For example schedule_delayed_work() available for non-GPL.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # 3.7
2013-02-19 10:09:13 -08:00
Tejun Heo 112202d909 workqueue: rename cpu_workqueue to pool_workqueue
workqueue has moved away from global_cwqs to worker_pools and with the
scheduled custom worker pools, wforkqueues will be associated with
pools which don't have anything to do with CPUs.  The workqueue code
went through significant amount of changes recently and mass renaming
isn't likely to hurt much additionally.  Let's replace 'cpu' with
'pool' so that it reflects the current design.

* s/struct cpu_workqueue_struct/struct pool_workqueue/
* s/cpu_wq/pool_wq/
* s/cwq/pwq/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:12 -08:00
Tejun Heo 8d03ecfe47 workqueue: reimplement is_chained_work() using current_wq_worker()
is_chained_work() was added before current_wq_worker() and implemented
its own ham-fisted way of finding out whether %current is a workqueue
worker - it iterates through all possible workers.

Drop the custom implementation and reimplement using
current_wq_worker().

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:10 -08:00
Tejun Heo 1dd638149f workqueue: fix is_chained_work() regression
c9e7cf273f ("workqueue: move busy_hash from global_cwq to
worker_pool") incorrectly converted is_chained_work() to use
get_gcwq() inside for_each_gcwq_cpu() while removing get_gcwq().

As cwq might not exist for all possible workqueue CPUs, @cwq can be
NULL and the following cwq deferences can lead to oops.

Fix it by using for_each_cwq_cpu() instead, which is the better one to
use anyway as we only need to check pools that the wq is associated
with.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-13 19:29:07 -08:00
Lai Jiangshan 8594fade39 workqueue: pick cwq instead of pool in __queue_work()
Currently, __queue_work() chooses the pool to queue a work item to and
then determines cwq from the target wq and the chosen pool.  This is a
bit backwards in that we can determine cwq first and simply use
cwq->pool.  This way, we can skip get_std_worker_pool() in queueing
path which will be a hurdle when implementing custom worker pools.

Update __queue_work() such that it chooses the target cwq and then use
cwq->pool instead of the other way around.  While at it, add missing
{} in an if statement.

This patch doesn't introduce any functional changes.

tj: The original patch had two get_cwq() calls - the first to
    determine the pool by doing get_cwq(cpu, wq)->pool and the second
    to determine the matching cwq from get_cwq(pool->cpu, wq).
    Updated the function such that it chooses cwq instead of pool and
    removed the second call.  Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 13:17:51 -08:00
Lai Jiangshan 54d5b7d079 workqueue: make get_work_pool_id() cheaper
get_work_pool_id() currently first obtains pool using get_work_pool()
and then return pool->id.  For an off-queue work item, this involves
obtaining pool ID from worker->data, performing idr_find() to find the
matching pool and then returning its pool->id which of course is the
same as the one which went into idr_find().

Just open code WORK_STRUCT_CWQ case and directly return pool ID from
work->data.

tj: The original patch dropped on-queue work item handling and renamed
    the function to offq_work_pool_id().  There isn't much benefit in
    doing so.  Handling it only requires a single if() and we need at
    least BUG_ON(), which is also a branch, even if we drop on-queue
    handling.  Open code WORK_STRUCT_CWQ case and keep the function in
    line with get_work_pool().  Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-07 13:14:20 -08:00
Tejun Heo e19e397a85 workqueue: move nr_running into worker_pool
As nr_running is likely to be accessed from other CPUs during
try_to_wake_up(), it was kept outside worker_pool; however, while less
frequent, other fields in worker_pool are accessed from other CPUs
for, e.g., non-reentrancy check.  Also, with recent pool related
changes, accessing nr_running matching the worker_pool isn't as simple
as it used to be.

Move nr_running inside worker_pool.  Keep it aligned to cacheline and
define CPU pools using DEFINE_PER_CPU_SHARED_ALIGNED().  This should
give at least the same cacheline behavior.

get_pool_nr_running() is replaced with direct pool->nr_running
accesses.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
2013-02-07 13:14:20 -08:00
Tejun Heo 1606283622 workqueue: cosmetic update in try_to_grab_pending()
With the recent is-work-queued-here test simplification, the nested
if() in try_to_grab_pending() can be collapsed.  Collapse it.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 0b3dae68ac workqueue: simplify is-work-item-queued-here test
Currently, determining whether a work item is queued on a locked pool
involves somewhat convoluted memory barrier dancing.  It goes like the
following.

* When a work item is queued on a pool, work->data is updated before
  work->entry is linked to the pending list with a wmb() inbetween.

* When trying to determine whether a work item is currently queued on
  a pool pointed to by work->data, it locks the pool and looks at
  work->entry.  If work->entry is linked, we then do rmb() and then
  check whether work->data points to the current pool.

This works because, work->data can only point to a pool if it
currently is or were on the pool and,

* If it currently is on the pool, the tests would obviously succeed.

* It it left the pool, its work->entry was cleared under pool->lock,
  so if we're seeing non-empty work->entry, it has to be from the work
  item being linked on another pool.  Because work->data is updated
  before work->entry is linked with wmb() inbetween, work->data update
  from another pool is guaranteed to be visible if we do rmb() after
  seeing non-empty work->entry.  So, we either see empty work->entry
  or we see updated work->data pointin to another pool.

While this works, it's convoluted, to put it mildly.  With recent
updates, it's now guaranteed that work->data points to cwq only while
the work item is queued and that updating work->data to point to cwq
or back to pool is done under pool->lock, so we can simply test
whether work->data points to cwq which is associated with the
currently locked pool instead of the convoluted memory barrier
dancing.

This patch replaces the memory barrier based "are you still here,
really?" test with much simpler "does work->data points to me?" test -
if work->data points to a cwq which is associated with the currently
locked pool, the work item is guaranteed to be queued on the pool as
work->data can start and stop pointing to such cwq only under
pool->lock and the start and stop coincide with queue and dequeue.

tj: Rewrote the comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 4468a00fd9 workqueue: make work->data point to pool after try_to_grab_pending()
We plan to use work->data pointing to cwq as the synchronization
invariant when determining whether a given work item is on a locked
pool or not, which requires work->data pointing to cwq only while the
work item is queued on the associated pool.

With delayed_work updated not to overload work->data for target
workqueue recording, the only case where we still have off-queue
work->data pointing to cwq is try_to_grab_pending() which doesn't
update work->data after stealing a queued work item.  There's no
reason for try_to_grab_pending() to not update work->data to point to
the pool instead of cwq, like the normal execution does.

This patch adds set_work_pool_and_keep_pending() which makes
work->data point to pool instead of cwq but keeps the pending bit
unlike set_work_pool_and_clear_pending() (surprise!).

After this patch, it's guaranteed that only queued work items point to
cwqs.

This patch doesn't introduce any visible behavior change.

tj: Renamed the new helper function to match
    set_work_pool_and_clear_pending() and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 60c057bca2 workqueue: add delayed_work->wq to simplify reentrancy handling
To avoid executing the same work item from multiple CPUs concurrently,
a work_struct records the last pool it was on in its ->data so that,
on the next queueing, the pool can be queried to determine whether the
work item is still executing or not.

A delayed_work goes through timer before actually being queued on the
target workqueue and the timer needs to know the target workqueue and
CPU.  This is currently achieved by modifying delayed_work->work.data
such that it points to the cwq which points to the target workqueue
and the last CPU the work item was on.  __queue_delayed_work()
extracts the last CPU from delayed_work->work.data and then combines
it with the target workqueue to create new work.data.

The only thing this rather ugly hack achieves is encoding the target
workqueue into delayed_work->work.data without using a separate field,
which could be a trade off one can make; unfortunately, this entangles
work->data management between regular workqueue and delayed_work code
by setting cwq pointer before the work item is actually queued and
becomes a hindrance for further improvements of work->data handling.

This can be easily made sane by adding a target workqueue field to
delayed_work.  While delayed_work is used widely in the kernel and
this does make it a bit larger (<5%), I think this is the right
trade-off especially given the prospect of much saner handling of
work->data which currently involves quite tricky memory barrier
dancing, and don't expect to see any measureable effect.

Add delayed_work->wq and drop the delayed_work->work.data overloading.

tj: Rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 038366c5cf workqueue: make work_busy() test WORK_STRUCT_PENDING first
Currently, work_busy() first tests whether the work has a pool
associated with it and if not, considers it idle.  This works fine
even for delayed_work.work queued on timer, as __queue_delayed_work()
sets cwq on delayed_work.work - a queued delayed_work always has its
cwq and thus pool associated with it.

However, we're about to update delayed_work queueing and this won't
hold.  Update work_busy() such that it tests WORK_STRUCT_PENDING
before the associated pool.  This doesn't make any noticeable behavior
difference now.

With work_pending() test moved, the function read a lot better with
"if (!pool)" test flipped to positive.  Flip it.

While at it, lose the comment about now non-existent reentrant
workqueues.

tj: Reorganized the function and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Lai Jiangshan 6be195886a workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END
Now that workqueue has moved away from gcwqs, workqueue no longer has
the need to have a CPU identifier indicating "no cpu associated" - we
now use WORK_OFFQ_POOL_NONE instead - and most uses of WORK_CPU_NONE
are gone.

The only left usage is as the end marker for for_each_*wq*()
iterators, where the name WORK_CPU_NONE is confusing w/o actual
WORK_CPU_NONE usages.  Similarly, WORK_CPU_LAST which equals
WORK_CPU_NONE no longer makes sense.

Replace both WORK_CPU_NONE and LAST with WORK_CPU_END.  This patch
doesn't introduce any functional difference.

tj: s/WORK_CPU_LAST/WORK_CPU_END/ and rewrote the description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-02-06 18:04:53 -08:00
Tejun Heo 706026c214 workqueue: post global_cwq removal cleanups
Remove remaining references to gcwq.

* __next_gcwq_cpu() steals __next_wq_cpu() name.  The original
  __next_wq_cpu() became __next_cwq_cpu().

* s/for_each_gcwq_cpu/for_each_wq_cpu/
  s/for_each_online_gcwq_cpu/for_each_online_wq_cpu/

* s/gcwq_mayday_timeout/pool_mayday_timeout/

* s/gcwq_unbind_fn/wq_unbind_fn/

* Drop references to gcwq in comments.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:34 -08:00
Tejun Heo e6e380ed92 workqueue: rename nr_running variables
Rename per-cpu and unbound nr_running variables such that they match
the pool variables.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:34 -08:00
Tejun Heo a60dc39c01 workqueue: remove global_cwq
global_cwq is now nothing but a container for per-cpu standard
worker_pools.  Declare the worker pools directly as
cpu/unbound_std_worker_pools[] and remove global_cwq.

* ____cacheline_aligned_in_smp moved from global_cwq to worker_pool.
  This probably would have made sense even before this change as we
  want each pool to be aligned.

* get_gcwq() is replaced with std_worker_pools() which returns the
  pointer to the standard pool array for a given CPU.

* __alloc_workqueue_key() updated to use get_std_worker_pool() instead
  of open-coding pool determination.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

v2: Joonsoo pointed out that it'd better to align struct worker_pool
    rather than the array so that every pool is aligned.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Joonsoo Kim <js1304@gmail.com>
2013-01-24 11:01:34 -08:00
Tejun Heo 4e8f0a6096 workqueue: remove worker_pool->gcwq
The only remaining user of pool->gcwq is std_worker_pool_pri().
Reimplement it using get_gcwq() and remove worker_pool->gcwq.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:34 -08:00
Tejun Heo 38db41d984 workqueue: replace for_each_worker_pool() with for_each_std_worker_pool()
for_each_std_worker_pool() takes @cpu instead of @gcwq.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:34 -08:00
Tejun Heo a1056305fa workqueue: make freezing/thawing per-pool
Instead of holding locks from both pools and then processing the pools
together, make freezing/thwaing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.

While this patch changes processing order across pools, order within
each pool remains the same.  As each pool is independent, this
shouldn't break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 94cf58bb29 workqueue: make hotplug processing per-pool
Instead of holding locks from both pools and then processing the pools
together, make hotplug processing per-pool - grab locks of one pool,
process it, release it and then proceed to the next pool.

rebind_workers() is updated to take and process @pool instead of @gcwq
which results in a lot of de-indentation.  gcwq_claim_assoc_and_lock()
and its counterpart are replaced with in-line per-pool locking.

While this patch changes processing order across pools, order within
each pool remains the same.  As each pool is independent, this
shouldn't break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo d565ed6309 workqueue: move global_cwq->lock to worker_pool
Move gcwq->lock to pool->lock.  The conversion is mostly
straight-forward.  Things worth noting are

* In many places, this removes the need to use gcwq completely.  pool
  is used directly instead.  get_std_worker_pool() is added to help
  some of these conversions.  This also leaves get_work_gcwq() without
  any user.  Removed.

* In hotplug and freezer paths, the pools belonging to a CPU are often
  processed together.  This patch makes those paths hold locks of all
  pools, with highpri lock nested inside, to keep the conversion
  straight-forward.  These nested lockings will be removed by
  following patches.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo ec22ca5eab workqueue: move global_cwq->cpu to worker_pool
Move gcwq->cpu to pool->cpu.  This introduces a couple places where
gcwq->pools[0].cpu is used.  These will soon go away as gcwq is
further reduced.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo c9e7cf273f workqueue: move busy_hash from global_cwq to worker_pool
There's no functional necessity for the two pools on the same CPU to
share the busy hash table.  It's also likely to be a bottleneck when
implementing pools with user-specified attributes.

This patch makes busy_hash per-pool.  The conversion is mostly
straight-forward.  Changes worth noting are,

* Large block of changes in rebind_workers() is moving the block
  inside for_each_worker_pool() as now there are separate hash tables
  for each pool.  This changes the order of operations but doesn't
  break anything.

* Thre for_each_worker_pool() loops in gcwq_unbind_fn() are combined
  into one.  This again changes the order of operaitons but doesn't
  break anything.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 7c3eed5cd6 workqueue: record pool ID instead of CPU in work->data when off-queue
Currently, when a work item is off-queue, work->data records the CPU
it was last on, which is used to locate the last executing instance
for non-reentrance, flushing, etc.

We're in the process of removing global_cwq and making worker_pool the
top level abstraction.  This patch makes work->data point to the pool
it was last associated with instead of CPU.

After the previous WORK_OFFQ_POOL_CPU and worker_poo->id additions,
the conversion is fairly straight-forward.  WORK_OFFQ constants and
functions are modified to record and read back pool ID instead.
worker_pool_by_id() is added to allow looking up pool from ID.
get_work_pool() replaces get_work_gcwq(), which is reimplemented using
get_work_pool().  get_work_pool_id() replaces work_cpu().

This patch shouldn't introduce any observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 9daf9e678d workqueue: add worker_pool->id
Add worker_pool->id which is allocated from worker_pool_idr.  This
will be used to record the last associated worker_pool in work->data.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 715b06b864 workqueue: introduce WORK_OFFQ_CPU_NONE
Currently, when a work item is off queue, high bits of its data
encodes the last CPU it was on.  This is scheduled to be changed to
pool ID, which will make it impossible to use WORK_CPU_NONE to
indicate no association.

This patch limits the number of bits which are used for off-queue cpu
number to 31 (so that the max fits in an int) and uses the highest
possible value - WORK_OFFQ_CPU_NONE - to indicate no association.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 35b6bb63b8 workqueue: make GCWQ_FREEZING a pool flag
Make GCWQ_FREEZING a pool flag POOL_FREEZING.  This patch doesn't
change locking - FREEZING on both pools of a CPU are set or clear
together while holding gcwq->lock.  It shouldn't cause any functional
difference.

This leaves gcwq->flags w/o any flags.  Removed.

While at it, convert BUG_ON()s in freeze_workqueue_begin() and
thaw_workqueues() to WARN_ON_ONCE().

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo 2464757086 workqueue: make GCWQ_DISASSOCIATED a pool flag
Make GCWQ_DISASSOCIATED a pool flag POOL_DISASSOCIATED.  This patch
doesn't change locking - DISASSOCIATED on both pools of a CPU are set
or clear together while holding gcwq->lock.  It shouldn't cause any
functional difference.

This is part of an effort to remove global_cwq and make worker_pool
the top level abstraction, which in turn will help implementing worker
pools with user-specified attributes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo e34cdddb03 workqueue: use std_ prefix for the standard per-cpu pools
There are currently two worker pools per cpu (including the unbound
cpu) and they are the only pools in use.  New class of pools are
scheduled to be added and some pool related APIs will be added
inbetween.  Call the existing pools the standard pools and prefix them
with std_.  Do this early so that new APIs can use std_ prefix from
the beginning.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:33 -08:00
Tejun Heo e2905b2912 workqueue: unexport work_cpu()
This function no longer has any external users.  Unexport it.  It will
be removed later on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-01-24 11:01:32 -08:00
Tejun Heo 2eaebdb33e workqueue: move struct worker definition to workqueue_internal.h
This will be used to implement an inline function to query whether
%current is a workqueue worker and, if so, allow determining which
work item it's executing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-18 14:05:55 -08:00
Tejun Heo ea138446e5 workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h
Workqueue wants to expose more interface internal to kernel/.  Instead
of adding a new header file, repurpose kernel/workqueue_sched.h.
Rename it to workqueue_internal.h and add include protector.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-01-18 14:05:55 -08:00
Tejun Heo 111c225a5f workqueue: set PF_WQ_WORKER on rescuers
PF_WQ_WORKER is used to tell scheduler that the task is a workqueue
worker and needs wq_worker_sleeping/waking_up() invoked on it for
concurrency management.  As rescuers never participate in concurrency
management, PF_WQ_WORKER wasn't set on them.

There's a need for an interface which can query whether %current is
executing a work item and if so which.  Such interface requires a way
to identify all tasks which may execute work items and PF_WQ_WORKER
will be used for that.  As all normal workers always have PF_WQ_WORKER
set, we only need to add it to rescuers.

As rescuers start with WORKER_PREP but never clear it, it's always
NOT_RUNNING and there's no need to worry about it interfering with
concurrency management even if PF_WQ_WORKER is set; however, unlike
normal workers, rescuers currently don't have its worker struct as
kthread_data().  It uses the associated workqueue_struct instead.
This is problematic as wq_worker_sleeping/waking_up() expect struct
worker at kthread_data().

This patch adds worker->rescue_wq and start rescuer kthreads with
worker struct as kthread_data and sets PF_WQ_WORKER on rescuers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-17 17:19:58 -08:00
Tejun Heo 023f27d3d6 workqueue: fix find_worker_executing_work() brekage from hashtable conversion
42f8570f43 ("workqueue: use new hashtable implementation") incorrectly
made busy workers hashed by the pointer value of worker instead of
work.  This broke find_worker_executing_work() which in turn broke a
lot of fundamental operations of workqueue - non-reentrancy and
flushing among others.  The flush malfunction triggered warning in
disk event code in Fengguang's automated test.

 write_dev_root_ (3265) used greatest stack depth: 2704 bytes left
 ------------[ cut here ]------------
 WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0x\
cf/0x108()
 Hardware name: Bochs
 Modules linked in:
 Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167
 Call Trace:
  [<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c
  [<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff816aea77>] disk_clear_events+0xcf/0x108
  [<ffffffff811bd8be>] check_disk_change+0x27/0x59
  [<ffffffff822e48e2>] cdrom_open+0x49/0x68b
  [<ffffffff81ab0291>] idecd_open+0x88/0xb7
  [<ffffffff811be58f>] __blkdev_get+0x102/0x3ec
  [<ffffffff811bea08>] blkdev_get+0x18f/0x30f
  [<ffffffff811bebfd>] blkdev_open+0x75/0x80
  [<ffffffff8118f510>] do_dentry_open+0x1ea/0x295
  [<ffffffff8118f5f0>] finish_open+0x35/0x41
  [<ffffffff8119c720>] do_last+0x878/0xa25
  [<ffffffff8119c993>] path_openat+0xc6/0x333
  [<ffffffff8119cf37>] do_filp_open+0x38/0x86
  [<ffffffff81190170>] do_sys_open+0x6c/0xf9
  [<ffffffff8119021e>] sys_open+0x21/0x23
  [<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
2012-12-19 11:24:06 -08:00
Tejun Heo a2c1c57be8 workqueue: consider work function when searching for busy work items
To avoid executing the same work item concurrenlty, workqueue hashes
currently busy workers according to their current work items and looks
up the the table when it wants to execute a new work item.  If there
already is a worker which is executing the new work item, the new item
is queued to the found worker so that it gets executed only after the
current execution finishes.

Unfortunately, a work item may be freed while being executed and thus
recycled for different purposes.  If it gets recycled for a different
work item and queued while the previous execution is still in
progress, workqueue may make the new work item wait for the old one
although the two aren't really related in any way.

In extreme cases, this false dependency may lead to deadlock although
it's extremely unlikely given that there aren't too many self-freeing
work item users and they usually don't wait for other work items.

To alleviate the problem, record the current work function in each
busy worker and match it together with the work item address in
find_worker_executing_work().  While this isn't complete, it ensures
that unrelated work items don't interact with each other and in the
very unlikely case where a twisted wq user triggers it, it's always
onto itself making the culprit easy to spot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Isakov <andy51@gmx.ru>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701
Cc: stable@vger.kernel.org
2012-12-18 10:56:14 -08:00
Sasha Levin 42f8570f43 workqueue: use new hashtable implementation
Switch workqueues to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the workqueues.

This patch depends on d9b482c ("hashtable: introduce a small and naive
hashtable") which was merged in v3.6.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-18 09:21:13 -08:00
Linus Torvalds e7b55b8fcd Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "Nothing exciting.  Just two trivial changes."

* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up()
  workqueue: trivial fix for return statement in work_busy()
2012-12-12 08:15:13 -08:00
Tejun Heo fc4b514f27 workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s
8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
megaraid - it allocated work_struct, casted it to delayed_work and
then pass that into queue_delayed_work().

Previously, this was okay because 0 @delay short-circuited to
queue_work() before doing anything with delayed_work.  8852aac25e
moved 0 @delay test into __queue_delayed_work() after sanity check on
delayed_work making megaraid trigger BUG_ON().

Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
BUG_ON() from incorrect use of delayed work"), this patch converts
BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
abusers, if there are more, trigger warning but don't crash the
machine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Xiaotian Feng <xtfeng@gmail.com>
2012-12-04 07:58:47 -08:00
Joonsoo Kim 3657600040 workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up()
Recently, workqueue code has gone through some changes and we found
some bugs related to concurrency management operations happening on
the wrong CPU.  When a worker is concurrency managed
(!WORKER_NOT_RUNNIG), it should be bound to its associated cpu and
woken up to that cpu.  Add WARN_ON_ONCE() to verify this.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-01 16:45:45 -08:00
Joonsoo Kim 999767beb1 workqueue: trivial fix for return statement in work_busy()
Return type of work_busy() is unsigned int.
There is return statement returning boolean value, 'false' in work_busy().
It is not problem, because 'false' may be treated '0'.
However, fixing it would make code robust.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-01 16:45:40 -08:00
Tejun Heo 8852aac25e workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay
8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
implemented mod_delayed_work[_on]() using the improved
try_to_grab_pending().  The function is later used, among others, to
replace [__]candel_delayed_work() + queue_delayed_work() combinations.

Unfortunately, a delayed_work item w/ zero @delay is handled slightly
differently by mod_delayed_work_on() compared to
queue_delayed_work_on().  The latter skips timer altogether and
directly queues it using queue_work_on() while the former schedules
timer which will expire on the closest tick.  This means, when @delay
is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
makes the target item immediately executable while
mod_delayed_work_on() may induce delay of upto a full tick.

This somewhat subtle difference breaks some of the converted users.
e.g. block queue plugging uses delayed_work for deferred processing
and uses mod_delayed_work_on() when the queue needs to be immediately
unplugged.  The above problem manifested as noticeably higher number
of context switches under certain circumstances.

The difference in behavior was caused by missing special case handling
for 0 delay in mod_delayed_work_on() compared to
queue_delayed_work_on().  Joonsoo Kim posted a patch to add it -
("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
The patch was queued for 3.8 but it was described as optimization and
I missed that it was a correctness issue.

As both queue_delayed_work_on() and mod_delayed_work_on() use
__queue_delayed_work() for queueing, it seems that the better approach
is to move the 0 delay special handling to the function instead of
duplicating it in mod_delayed_work_on().

Fix the problem by moving 0 delay special case handling from
queue_delayed_work_on() to __queue_delayed_work().  This replaces
Joonsoo's patch.

[1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU>
Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
LKML-Reference: <50A78AA9.5040904@iskon.hr>
Cc: Joonsoo Kim <js1304@gmail.com>
2012-12-01 16:43:18 -08:00
Mike Galbraith 412d32e6c9 workqueue: exit rescuer_thread() as TASK_RUNNING
A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
off, never to be seen again.  In the case where this occurred, an exiting
thread hit reiserfs homebrew conditional resched while holding a mutex,
bringing the box to its knees.

PID: 18105  TASK: ffff8807fd412180  CPU: 5   COMMAND: "kdmflush"
 #0 [ffff8808157e7670] schedule at ffffffff8143f489
 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
 #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
 #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
 #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
    [exception RIP: kernel_thread_helper]
    RIP: ffffffff8144a5c0  RSP: ffff8808157e7f58  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffffffff8107af60  RDI: ffff8803ee491d18
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2012-12-01 15:56:42 -08:00
Dan Magenheimer c0158ca64d workqueue: cancel_delayed_work() should return %false if work item is idle
57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
try_to_grab_pending()") made cancel_delayed_work() always return %true
unless someone else is also trying to cancel the work item, which is
broken - if the target work item is idle, the return value should be
%false.

try_to_grab_pending() indicates that the target work item was idle by
zero return value.  Use it for return.  Note that this brings
cancel_delayed_work() in line with __cancel_work_timer() in return
value handling.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>
2012-10-24 12:38:16 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00