Pull two workqueue fixes from Tejun Heo:
"A lockdep notation update so that nested work_on_cpu() invocations
don't lead to spurious lockdep warnings and fix for an unbound attr
bug which made what's shown in sysfs deviate from the actual ones.
Both patches have pretty limited scope"
* 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: copy workqueue_attrs with all fields
workqueue: allow work_on_cpu() to be called recursively
$echo '0' > /sys/bus/workqueue/devices/xxx/numa
$cat /sys/bus/workqueue/devices/xxx/numa
I got 1. It should be 0, the reason is copy_workqueue_attrs() called
in apply_workqueue_attrs() doesn't copy no_numa field.
Fix it by making copy_workqueue_attrs() copy ->no_numa too. This
would also make get_unbound_pool() set a pool's ->no_numa attribute
according to the workqueue attributes used when the pool was created.
While harmelss, as ->no_numa isn't a pool attribute, this is a bit
confusing. Clear it explicitly.
tj: Updated description and comments a bit.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
If the @fn call work_on_cpu() again, the lockdep will complain:
> [ INFO: possible recursive locking detected ]
> 3.11.0-rc1-lockdep-fix-a #6 Not tainted
> ---------------------------------------------
> kworker/0:1/142 is trying to acquire lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0
>
> but task is already holding lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock((&wfc.work));
> lock((&wfc.work));
>
> *** DEADLOCK ***
It is false-positive lockdep report. In this sutiation,
the two "wfc"s of the two work_on_cpu() are different,
they are both on stack. flush_work() can't be deadlock.
To fix this, we need to avoid the lockdep checking in this case,
thus we instroduce a internal __flush_work() which skip the lockdep.
tj: Minor comment adjustment.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the uses of the __cpuinit macros from C files in
the core kernel directories (kernel, init, lib, mm, and include)
that don't really have a specific maintainer.
[1] https://lkml.org/lkml/2013/5/20/589
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Pull workqueue changes from Tejun Heo:
"Surprisingly, Lai and I didn't break too many things implementing
custom pools and stuff last time around and there aren't any follow-up
changes necessary at this point.
The only change in this pull request is Viresh's patches to make some
per-cpu workqueues to behave as unbound workqueues dependent on a boot
param whose default can be configured via a config option. This leads
to higher processing overhead / lower bandwidth as more work items are
bounced across CPUs; however, it can lead to noticeable powersave in
certain configurations - ~10% w/ idlish constant workload on a
big.LITTLE configuration according to Viresh.
This is because per-cpu workqueues interfere with how the scheduler
perceives whether or not each CPU is idle by forcing pinned tasks on
them, which makes the scheduler's power-aware scheduling decisions
less effective.
Its effectiveness is likely less pronounced on homogenous
configurations and this type of optimization can probably be made
automatic; however, the changes are pretty minimal and the affected
workqueues are clearly marked, so it's an easy gain for some
configurations for the time being with pretty unintrusive changes."
* 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
fbcon: queue work on power efficient wq
block: queue work on power efficient wq
PHYLIB: queue work on system_power_efficient_wq
workqueue: Add system wide power_efficient workqueues
workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues
Commit 8425e3d5bd ("workqueue: inline trivial wrappers") changed
schedule_work() and schedule_delayed_work() to inline wrappers,
but these rely on some symbols that are EXPORT_SYMBOL_GPL, while
the original functions were EXPORT_SYMBOL. This has the effect of
changing the licensing requirement for these functions and making
them unavailable to non GPL modules.
Make them available again by removing the restriction on the
required symbols.
Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we fail to mutex_trylock(), we release the pool spin_lock and do
mutex_lock(). After that, we should regrab the pool spin_lock, but,
regrabbing is missed in current code. So correct it.
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch adds system wide workqueues aligned towards power saving. This is
done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to
'true'.
tj: updated comments a bit.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueues can be performance or power-oriented. Currently, most workqueues are
bound to the CPU they were created on. This gives good performance (due to cache
effects) at the cost of potentially waking up otherwise idle cores (Idle from
scheduler's perspective. Which may or may not be physically idle) just to
process some work. To save power, we can allow the work to be rescheduled on a
core that is already awake.
Workqueues created with the WQ_UNBOUND flag will allow some power savings.
However, we don't change the default behaviour of the system. To enable
power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to
be turned on. This option can also be overridden by the
workqueue.power_efficient boot parameter.
tj: Updated config description and comments. Renamed
CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
One of the problems that arise when converting dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonimizes each worker making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.
This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description. When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.
The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info(). print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields. It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.
The description is currently limited to 24bytes including the
terminating \0. worker->desc_valid and workder->desc[] are added and
the 64 bytes marker which was already incorrect before adding the new
fields is moved to the correct position.
Here's an example dump with writeback updated to set the bdi name as
worker desc.
Hardware name: Bochs
Modules linked in:
Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1
Workqueue: writeback bdi_writeback_workfn (flush-8:0)
ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8
ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0
ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08
Call Trace:
[<ffffffff81c61845>] dump_stack+0x19/0x1b
[<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0
...
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
destroy_workqueue() performs several sanity checks before proceeding
with destruction of a workqueue. One of the checks verifies that
refcnt of each pwq (pool_workqueue) is over 1 as at that point there
should be no in-flight work items and the only holder of pwq refs is
the workqueue itself.
This worked fine as a workqueue used to hold only one reference to its
pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity
for unbound workqueues"), a workqueue may hold multiple references to
its default pwq triggering this sanity check spuriously.
Fix it by not triggering the pwq->refcnt assertion on default pwqs.
An example spurious WARN trigger follows.
WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e()
Hardware name: 4286C12
Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video
Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29
Call Trace:
[<c04314a7>] warn_slowpath_common+0x7c/0x93
[<c04314e0>] warn_slowpath_null+0x22/0x24
[<c044796a>] destroy_workqueue+0x6a/0x13e
[<c056dc01>] ext4_put_super+0x43/0x2c4
[<c04fb7b8>] generic_shutdown_super+0x4b/0xb9
[<c04fb848>] kill_block_super+0x22/0x60
[<c04fb960>] deactivate_locked_super+0x2f/0x56
[<c04fc41b>] deactivate_super+0x2e/0x31
[<c050f1e6>] mntput_no_expire+0x103/0x108
[<c050fdce>] sys_umount+0x2a2/0x2c4
[<c050fe0e>] sys_oldumount+0x1e/0x20
[<c085ba4d>] sysenter_do_call+0x12/0x38
tj: Rewrote description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef
xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F
BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh
Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P
4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/
k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI=
=q15K
-----END PGP SIGNATURE-----
Merge tag 'v3.9-rc5' into wq/for-3.10
Writeback conversion to workqueue will be based on top of wq/for-3.10
branch to take advantage of custom attrs and NUMA support for unbound
workqueues. Mainline currently contains two commits which result in
non-trivial merge conflicts with wq/for-3.10 and because
block/for-3.10/core is based on v3.9-rc3 which contains one of the
conflicting commits, we need a pre-merge-window merge anyway. Let's
pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
from workqueue merge conflicts.
The two conflicts and their resolutions:
* e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes
worker_pool_assign_id() to use idr_alloc() instead of the old idr
interface. worker_pool_assign_id() goes through multiple locking
changes in wq/for-3.10 causing the following conflict.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
<<<<<<< HEAD
lockdep_assert_held(&wq_pool_mutex);
do {
if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
return -ENOMEM;
ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
} while (ret == -EAGAIN);
=======
mutex_lock(&worker_pool_idr_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0)
pool->id = ret;
mutex_unlock(&worker_pool_idr_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
return ret < 0 ? ret : 0;
}
We want locking from the former and idr_alloc() usage from the
latter, which can be combined to the following.
static int worker_pool_assign_id(struct worker_pool *pool)
{
int ret;
lockdep_assert_held(&wq_pool_mutex);
ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
if (ret >= 0) {
pool->id = ret;
return 0;
}
return ret;
}
* eb2834285c ("workqueue: fix possible pool stall bug in
wq_unbind_fn()") updated wq_unbind_fn() such that it has single
larger for_each_std_worker_pool() loop instead of two separate loops
with a schedule() call inbetween. wq/for-3.10 renamed
pool->assoc_mutex to pool->manager_mutex causing the following
conflict (earlier function body and comments omitted for brevity).
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
<<<<<<< HEAD
mutex_unlock(&pool->manager_mutex);
}
=======
mutex_unlock(&pool->assoc_mutex);
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
schedule();
<<<<<<< HEAD
for_each_cpu_worker_pool(pool, cpu)
=======
>>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
The resolution is mostly trivial. We want the control flow of the
latter with the rename of the former.
static void wq_unbind_fn(struct work_struct *work)
{
...
spin_unlock_irq(&pool->lock);
mutex_unlock(&pool->manager_mutex);
schedule();
atomic_set(&pool->nr_running, 0);
spin_lock_irq(&pool->lock);
wake_up_worker(pool);
spin_unlock_irq(&pool->lock);
}
}
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueues are now NUMA aware. Let's add some control knobs
and update sysfs interface accordingly.
* Add kernel param workqueue.numa_disable which disables NUMA affinity
globally.
* Replace sysfs file "pool_id" with "pool_ids" which contain
node:pool_id pairs. This change is userland-visible but "pool_id"
hasn't seen a release yet, so this is okay.
* Add a new sysf files "numa" which can toggle NUMA affinity on
individual workqueues. This is implemented as attrs->no_numa whichn
is special in that it isn't part of a pool's attributes. It only
affects how apply_workqueue_attrs() picks which pools to use.
After "pool_ids" change, first_pwq() doesn't have any user left.
Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, an unbound workqueue has single current, or first, pwq
(pool_workqueue) to which all new work items are queued. This often
isn't optimal on NUMA machines as workers may jump around across node
boundaries and work items get assigned to workers without any regard
to NUMA affinity.
This patch implements NUMA affinity for unbound workqueues. Instead
of mapping all entries of numa_pwq_tbl[] to the same pwq,
apply_workqueue_attrs() now creates a separate pwq covering the
intersecting CPUs for each NUMA node which has online CPUs in
@attrs->cpumask. Nodes which don't have intersecting possible CPUs
are mapped to pwqs covering whole @attrs->cpumask.
As CPUs come up and go down, the pool association is changed
accordingly. Changing pool association may involve allocating new
pools which may fail. To avoid failing CPU_DOWN, each workqueue
always keeps a default pwq which covers whole attrs->cpumask which is
used as fallback if pool creation fails during a CPU hotplug
operation.
This ensures that all work items issued on a NUMA node is executed on
the same node as long as the workqueue allows execution on the CPUs of
the node.
As this maps a workqueue to multiple pwqs and max_active is per-pwq,
this change the behavior of max_active. The limit is now per NUMA
node instead of global. While this is an actual change, max_active is
already per-cpu for per-cpu workqueues and primarily used as safety
mechanism rather than for active concurrency control. Concurrency is
usually limited from workqueue users by the number of concurrently
active work items and this change shouldn't matter much.
v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted
by Lai.
v3: The previous version incorrectly made a workqueue spanning
multiple nodes spread work items over all online CPUs when some of
its nodes don't have any desired cpus. Reimplemented so that NUMA
affinity is properly updated as CPUs go up and down. This problem
was spotted by Lai Jiangshan.
v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
however, wq may be freed at any time after dfl_pwq is put making
the clearing use-after-free. Clear wq->dfl_pwq before putting it.
v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
@pwq_tbl after success. Fixed.
Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
application of new attrs is excluded via CPU hotplug. Removed.
Documentation on CPU affinity guarantee on CPU_DOWN added.
All changes are suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Factor out lock pool, put_pwq(), unlock sequence into
put_pwq_unlocked(). The two existing places are converted and there
will be more with NUMA affinity support.
This is to prepare for NUMA affinity support for unbound workqueues
and doesn't introduce any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq()
is made safe to call multiple times. numa_pwq_tbl_install() links the
pwq, installs it into numa_pwq_tbl[] at the specified node and returns
the old entry.
@last_pwq is removed from link_pwq() as the return value of the new
function can be used instead.
This is to prepare for NUMA affinity support for unbound workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Use kmem_cache_alloc_node() with @pool->node instead of
kmem_cache_zalloc() when allocating a pool_workqueue so that it's
allocated on the same node as the associated worker_pool. As there's
no no kmem_cache_zalloc_node(), move zeroing to init_pwq().
This was suggested by Lai Jiangshan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Break init_and_link_pwq() into init_pwq() and link_pwq() and move
unbound-workqueue specific handling into apply_workqueue_attrs().
Also, factor out unbound pool and pool_workqueue allocation into
alloc_unbound_pwq().
This reorganization is to prepare for NUMA affinity and doesn't
introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, an unbound workqueue has only one "current" pool_workqueue
associated with it. It may have multple pool_workqueues but only the
first pool_workqueue servies new work items. For NUMA affinity, we
want to change this so that there are multiple current pool_workqueues
serving different NUMA nodes.
Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
points to the pool_workqueue to use for each possible node. This
replaces first_pwq() in __queue_work() and workqueue_congested().
numa_pwq_tbl[] is currently initialized to point to the same
pool_workqueue as first_pwq() so this patch doesn't make any behavior
changes.
v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
may be called only with wq->mutex held.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
them to the cacheline. These two fields are used in the work item
issue path and thus hot. The scheduled NUMA affinity support will add
dispatch table at the end of workqueue_struct and relocating these two
fields will allow us hitting only single cacheline on hot paths.
Note that wq->pwqs isn't moved although it currently is being used in
the work item issue path for unbound workqueues. The dispatch table
mentioned above will replace its use in the issue path, so it will
become cold once NUMA support is implemented.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently workqueue->name[] is of flexible length. We want to use the
flexible field for something more useful and there isn't much benefit
in allowing arbitrary name length anyway. Make it fixed len capping
at 24 bytes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, when exposing attrs of an unbound workqueue via sysfs, the
workqueue_attrs of first_pwq() is used as that should equal the
current state of the workqueue.
The planned NUMA affinity support will make unbound workqueues make
use of multiple pool_workqueues for different NUMA nodes and the above
assumption will no longer hold. Introduce workqueue->unbound_attrs
which records the current attrs in effect and use it for sysfs instead
of first_pwq()->attrs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
When worker tasks are created using kthread_create_on_node(),
currently only per-cpu ones have the matching NUMA node specified.
All unbound workers are always created with NUMA_NO_NODE.
Now that an unbound worker pool may have an arbitrary cpumask
associated with it, this isn't optimal. Add pool->node which is
determined by the pool's cpumask. If the pool's cpumask is contained
inside a NUMA node proper, the pool is associated with that node, and
all workers of the pool are created on that node.
This currently only makes difference for unbound worker pools with
cpumask contained inside single NUMA node, but this will serve as
foundation for making all unbound pools NUMA-affine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Currently, all workqueue workers which have negative nice value has
'H' postfixed to their names. This is necessary for per-cpu workers
as they use the CPU number instead of pool->id to identify the pool
and the 'H' postfix is the only thing distinguishing normal and
highpri workers.
As workers for unbound pools use pool->id, the 'H' postfix is purely
informational. TASK_COMM_LEN is 16 and after the static part and
delimiters, there are only five characters left for the pool and
worker IDs. We're expecting to have more unbound pools with the
scheduled NUMA awareness support. Let's drop the non-essential 'H'
postfix from unbound kworker name.
While at it, restructure kthread_create*() invocation to help future
NUMA related changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len
and wq_numa_possible_cpumask[] in preparation. The former is the
highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
each NUMA node.
This patch only introduces these. Future patches will make use of
them.
v2: NUMA initialization move into wq_numa_init(). Also, the possible
cpumask array is not created if there aren't multiple nodes on the
system. wq_numa_enabled bool added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
The scheduled NUMA affinity support for unbound workqueues would need
to walk workqueues list and pool related operations on each workqueue.
Move wq_pool_mutex locking out of get/put_unbound_pool() to their
callers so that pool operations can be performed while walking the
workqueues list, which is also protected by wq_pool_mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
29c91e9912 ("workqueue: implement attribute-based unbound worker_pool
management") implemented attrs based worker_pool matching. It tried
to avoid false negative when comparing cpumasks with custom hash
function; unfortunately, the hash and comparison functions fail to
ignore CPUs which are not possible. It incorrectly assumed that
bitmap_copy() skips leftover bits in the last word of bitmap and
cpumask_equal() ignores impossible CPUs.
This patch updates attrs->cpumask handling such that impossible CPUs
are properly ignored.
* Hash and copy functions no longer do anything special. They expect
their callers to clear impossible CPUs.
* alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
instead of setting all bits and explicit cpumask_setall() for
unbound_std_wq_attrs[] in init_workqueues() is dropped.
* apply_workqueue_attrs() is now responsible for ignoring impossible
CPUs. It makes a copy of @attrs and clears impossible CPUs before
doing anything else.
Signed-off-by: Tejun Heo <tj@kernel.org>
8864b4e59 ("workqueue: implement get/put_pwq()") implemented pwq
(pool_workqueue) refcnting which frees workqueue when the last pwq
goes away. It determined whether it was the last pwq by testing
wq->pwqs is empty. Unfortunately, the test was done outside wq->mutex
and multiple pwq release could race and try to free wq multiple times
leading to oops.
Test wq->pwqs emptiness while holding wq->mutex.
Signed-off-by: Tejun Heo <tj@kernel.org>
To simplify locking, the previous patches expanded wq->mutex to
protect all fields of each workqueue instance including the pwqs list
leaving pwq_lock without any user. Remove the unused pwq_lock.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
This patch makes wq->saved_max_active protected by wq->mutex instead
of pwq_lock. As pwq_lock locking around pwq_adjust_mac_active() is no
longer necessary, this patch also replaces pwq_lock lockings of
for_each_pwq() around pwq_adjust_max_active() to wq->mutex.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
init_and_link_pwq() and pwq_unbound_release_workfn() already grab
wq->mutex when adding or removing a pwq from wq->pwqs list. This
patch makes it official that the list is wq->mutex protected for
writes and updates readers accoridingly. Explicit IRQ toggles for
sched-RCU read-locking in flush_workqueue_prep_pwqs() and
drain_workqueues() are removed as the surrounding wq->mutex can
provide sufficient synchronization.
Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex()
and checks for wq->mutex too.
pwq_lock locking and assertion are not removed by this patch and a
couple of for_each_pwq() iterations are still protected by it.
They'll be removed by future patches.
tj: Rebased on top of the current dev branch. Updated description.
Folded in assert_rcu_or_wq_mutex() renaming from a later patch
along with associated comment updates.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We're expanding wq->mutex to cover all fields specific to each
workqueue with the end goal of replacing pwq_lock which will make
locking simpler and easier to understand.
wq->nr_drainers and ->flags are specific to each workqueue. Protect
->nr_drainers and ->flags with wq->mutex instead of pool_mutex.
tj: Rebased on top of the current dev branch. Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently pwq->flush_mutex protects many fields of a workqueue
including, especially, the pwqs list. We're going to expand this
mutex to protect most of a workqueue and eventually replace pwq_lock,
which will make locking simpler and easier to understand.
Drop the "flush_" prefix in preparation.
This patch is pure rename.
tj: Rebased on top of the current dev branch. Updated description.
Use WQ: and WR: instead of Q: and QR: for synchronization labels.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq->flush_mutex will be renamed to wq->mutex and cover all fields
specific to each workqueue and eventually replace pwq_lock, which will
make locking simpler and easier to understand.
Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex.
After the scheduled changes, wq_pool_mutex won't be protecting
anything specific to each workqueue instance anyway.
This patch is pure rename.
tj: s/wqs_mutex/wq_pool_mutex/. Rewrote description.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
If lockdep complains something for other subsystem, lockdep_is_held()
can be false negative, so we need to also test debug_locks before
triggering WARN.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
rcu_read_lock_sched() is better than preempt_disable() if the code is
protected by RCU_SCHED.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If pwq_adjust_max_active() changes max_active from 0 to
saved_max_active, it needs to wakeup worker. This is already done by
thaw_workqueues().
If pwq_adjust_max_active() increases max_active for an unbound wq,
while not strictly necessary for correctness, it's still desirable to
wake up a worker so that the requested concurrency level is reached
sooner.
Move wake_up_worker() call from thaw_workqueues() to
pwq_adjust_max_active() so that it can handle both of the above two
cases. This also makes thaw_workqueues() simpler.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We can test worker->recue_wq instead of reaching into
current_pwq->wq->rescuer and then comparing it to self.
tj: Commit message.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing
is set and a new pool could go out of sync with the global freezing
state.
Fix it by adding POOL_FREEZING if workqueue_freezing. wq_mutex is
already held so no further locking is necessary. This also removes
the unused static variable warning when !CONFIG_FREEZER.
tj: Updated commit message.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the recent addition of the custom attributes support, unbound
pools may have allowed cpumask which isn't full. As long as some of
CPUs in the cpumask are online, its workers will maintain cpus_allowed
as set on worker creation; however, once no online CPU is left in
cpus_allowed, the scheduler will reset cpus_allowed of any workers
which get scheduled so that they can execute.
To remain compliant to the user-specified configuration, CPU affinity
needs to be restored when a CPU becomes online for an unbound pool
which doesn't currently have any online CPUs before.
This patch implement restore_unbound_workers_cpumask(), which is
called from CPU_ONLINE for all unbound pools, checks whether the
coming up CPU is the first allowed online one, and, if so, invokes
set_cpus_allowed_ptr() with the configured cpumask on all workers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Rebinding workers of a per-cpu pool after a CPU comes online involves
a lot of back-and-forth mostly because only the task itself could
adjust CPU affinity if PF_THREAD_BOUND was set.
As CPU_ONLINE itself couldn't adjust affinity, it had to somehow
coerce the workers themselves to perform set_cpus_allowed_ptr(). Due
to the various states a worker can be in, this led to three different
paths a worker may be rebound. worker->rebind_work is queued to busy
workers. Idle ones are signaled by unlinking worker->entry and call
idle_worker_rebind(). The manager isn't covered by either and
implements its own mechanism.
PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE
itself now can manipulate CPU affinity of workers. This patch
replaces the existing rebind mechanism with direct one where
CPU_ONLINE iterates over all workers using for_each_pool_worker(),
restores CPU affinity, and clears WORKER_UNBOUND.
There are a couple subtleties. All bound idle workers should have
their runqueues set to that of the bound CPU; however, if the target
task isn't running, set_cpus_allowed_ptr() just updates the
cpus_allowed mask deferring the actual migration to when the task
wakes up. This is worked around by waking up idle workers after
restoring CPU affinity before any workers can become bound.
Another subtlety is stems from matching @pool->nr_running with the
number of running unbound workers. While DISASSOCIATED, all workers
are unbound and nr_running is zero. As workers become bound again,
nr_running needs to be adjusted accordingly; however, there is no good
way to tell whether a given worker is running without poking into
scheduler internals. Instead of clearing UNBOUND directly,
rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag -
REBOUND, which will later be cleared by the workers themselves while
preparing for the next round of work item execution. The only change
needed for the workers is clearing REBOUND along with PREP.
* This patch leaves for_each_busy_worker() without any user. Removed.
* idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work
and rebind logic in manager_workers() removed.
* worker_thread() now looks at WORKER_DIE instead of testing whether
@worker->entry is empty to determine whether it needs to do
something special as dying is the only special thing now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
rebind_workers() will be reimplemented in a way which makes it mostly
decoupled from the rest of worker management. Move rebind_workers()
so that it's located with other CPU hotplug related functions.
This patch is pure function relocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Make worker_ida an idr - worker_idr and use it to implement
for_each_pool_worker() which will be used to simplify worker rebinding
on CPU_ONLINE.
pool->worker_idr is protected by both pool->manager_mutex and
pool->lock so that it can be iterated while holding either lock.
* create_worker() allocates ID without installing worker pointer and
installs the pointer later using idr_replace(). This is because
worker ID is needed when creating the actual task to name it and the
new worker shouldn't be visible to iterations before fully
initialized.
* In destroy_worker(), ID removal is moved before kthread_stop().
This is again to guarantee that only fully working workers are
visible to for_each_pool_worker().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself. Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.
What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks. In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.
This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.
This will allow simplifying workqueue worker CPU affinity management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Pull workqueue fix from Tejun Heo:
"Lai's patch to fix highly unlikely but still possible workqueue stall
during CPU hotunplug."
* 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix possible pool stall bug in wq_unbind_fn()
With the recent locking updates, the only thing protected by
workqueue_lock is workqueue->maydays list. Rename workqueue_lock to
wq_mayday_lock.
This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>