linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	4264bc1371	Merge branch 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull two workqueue fixes from Tejun Heo: "A lockdep notation update so that nested work_on_cpu() invocations don't lead to spurious lockdep warnings and fix for an unbound attr bug which made what's shown in sysfs deviate from the actual ones. Both patches have pretty limited scope" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: copy workqueue_attrs with all fields workqueue: allow work_on_cpu() to be called recursively	2013-08-06 13:58:34 -07:00
Shaohua Li	2865a8fb44	workqueue: copy workqueue_attrs with all fields $echo '0' > /sys/bus/workqueue/devices/xxx/numa $cat /sys/bus/workqueue/devices/xxx/numa I got 1. It should be 0, the reason is copy_workqueue_attrs() called in apply_workqueue_attrs() doesn't copy no_numa field. Fix it by making copy_workqueue_attrs() copy ->no_numa too. This would also make get_unbound_pool() set a pool's ->no_numa attribute according to the workqueue attributes used when the pool was created. While harmelss, as ->no_numa isn't a pool attribute, this is a bit confusing. Clear it explicitly. tj: Updated description and comments a bit. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-08-01 08:36:40 -04:00
Lai Jiangshan	c2fda50966	workqueue: allow work_on_cpu() to be called recursively If the @fn call work_on_cpu() again, the lockdep will complain: > [ INFO: possible recursive locking detected ] > 3.11.0-rc1-lockdep-fix-a #6 Not tainted > --------------------------------------------- > kworker/0:1/142 is trying to acquire lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81077100>] flush_work+0x0/0xb0 > > but task is already holding lock: > ((&wfc.work)){+.+.+.}, at: [<ffffffff81075dd9>] process_one_work+0x169/0x610 > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock((&wfc.work)); > lock((&wfc.work)); > > * DEADLOCK * It is false-positive lockdep report. In this sutiation, the two "wfc"s of the two work_on_cpu() are different, they are both on stack. flush_work() can't be deadlock. To fix this, we need to avoid the lockdep checking in this case, thus we instroduce a internal __flush_work() which skip the lockdep. tj: Minor comment adjustment. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-07-24 12:24:25 -04:00
Paul Gortmaker	0db0628d90	kernel: delete __cpuinit usage from all core kernel files The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit `5e427ec2d0` ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2013-07-14 19:36:59 -04:00
Linus Torvalds	f317ff9eed	Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "Surprisingly, Lai and I didn't break too many things implementing custom pools and stuff last time around and there aren't any follow-up changes necessary at this point. The only change in this pull request is Viresh's patches to make some per-cpu workqueues to behave as unbound workqueues dependent on a boot param whose default can be configured via a config option. This leads to higher processing overhead / lower bandwidth as more work items are bounced across CPUs; however, it can lead to noticeable powersave in certain configurations - ~10% w/ idlish constant workload on a big.LITTLE configuration according to Viresh. This is because per-cpu workqueues interfere with how the scheduler perceives whether or not each CPU is idle by forcing pinned tasks on them, which makes the scheduler's power-aware scheduling decisions less effective. Its effectiveness is likely less pronounced on homogenous configurations and this type of optimization can probably be made automatic; however, the changes are pretty minimal and the affected workqueues are clearly marked, so it's an easy gain for some configurations for the time being with pretty unintrusive changes." * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: fbcon: queue work on power efficient wq block: queue work on power efficient wq PHYLIB: queue work on system_power_efficient_wq workqueue: Add system wide power_efficient workqueues workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues	2013-07-02 19:53:30 -07:00
Tejun Heo	1be0c25da5	workqueue: don't perform NUMA-aware allocations on offline nodes in wq_numa_init() wq_numa_init() builds per-node cpumasks which are later used to make unbound workqueues NUMA-aware. The cpumasks are allocated using alloc_cpumask_var_node() for all possible nodes. Unfortunately, on machines with off-line nodes, this leads to NUMA-aware allocations on existing bug offline nodes, which in turn triggers BUG in the memory allocation code. Fix it by using NUMA_NO_NODE for cpumask allocations for offline nodes. kernel BUG at include/linux/gfp.h:323! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0+ #1 Hardware name: ProLiant BL465c G7, BIOS A19 12/10/2011 task: ffff880234608000 ti: ffff880234602000 task.ti: ffff880234602000 RIP: 0010:[<ffffffff8117495d>] [<ffffffff8117495d>] new_slab+0x2ad/0x340 RSP: 0000:ffff880234603bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff880237404b40 RCX: 00000000000000d0 RDX: 0000000000000001 RSI: 0000000000000003 RDI: 00000000002052d0 RBP: ffff880234603c28 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: ffffffff812e3aa8 R12: 0000000000000001 R13: ffff8802378161c0 R14: 0000000000030027 R15: 00000000000040d0 FS: 0000000000000000(0000) GS:ffff880237800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff88043fdff000 CR3: 00000000018d5000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff880234603c28 0000000000000001 00000000000000d0 ffff8802378161c0 ffff880237404b40 ffff880237404b40 ffff880234603d28 ffffffff815edba1 ffff880237816140 0000000000000000 ffff88023740e1c0 Call Trace: [<ffffffff815edba1>] __slab_alloc+0x330/0x4f2 [<ffffffff81174b25>] kmem_cache_alloc_node_trace+0xa5/0x200 [<ffffffff812e3aa8>] alloc_cpumask_var_node+0x28/0x90 [<ffffffff81a0bdb3>] wq_numa_init+0x10d/0x1be [<ffffffff81a0bec8>] init_workqueues+0x64/0x341 [<ffffffff810002ea>] do_one_initcall+0xea/0x1a0 [<ffffffff819f1f31>] kernel_init_freeable+0xb7/0x1ec [<ffffffff815d50de>] kernel_init+0xe/0xf0 [<ffffffff815ff89c>] ret_from_fork+0x7c/0xb0 Code: 45 84 ac 00 00 00 f0 41 80 4d 00 40 e9 f6 fe ff ff 66 0f 1f 84 00 00 00 00 00 e8 eb 4b ff ff 49 89 c5 e9 05 fe ff ff <0f> 0b 4c 8b 73 38 44 89 ff 81 cf 00 00 20 00 4c 89 f6 48 c1 ee Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-Tested-by: Lingzhu Xiang <lxiang@redhat.com>	2013-05-15 14:24:24 -07:00
Marc Dionne	ad7b1f841f	workqueue: Make schedule_work() available again to non GPL modules Commit `8425e3d5bd` ("workqueue: inline trivial wrappers") changed schedule_work() and schedule_delayed_work() to inline wrappers, but these rely on some symbols that are EXPORT_SYMBOL_GPL, while the original functions were EXPORT_SYMBOL. This has the effect of changing the licensing requirement for these functions and making them unavailable to non GPL modules. Make them available again by removing the restriction on the required symbols. Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 11:52:51 -07:00
Joonsoo Kim	8f174b1175	workqueue: correct handling of the pool spin_lock When we fail to mutex_trylock(), we release the pool spin_lock and do mutex_lock(). After that, we should regrab the pool spin_lock, but, regrabbing is missed in current code. So correct it. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 11:48:15 -07:00
Viresh Kumar	0668106ca3	workqueue: Add system wide power_efficient workqueues This patch adds system wide workqueues aligned towards power saving. This is done by allocating them with WQ_UNBOUND flag if 'wq_power_efficient' is set to 'true'. tj: updated comments a bit. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 10:50:06 -07:00
Viresh Kumar	cee22a1505	workqueues: Introduce new flag WQ_POWER_EFFICIENT for power oriented workqueues Workqueues can be performance or power-oriented. Currently, most workqueues are bound to the CPU they were created on. This gives good performance (due to cache effects) at the cost of potentially waking up otherwise idle cores (Idle from scheduler's perspective. Which may or may not be physically idle) just to process some work. To save power, we can allow the work to be rescheduled on a core that is already awake. Workqueues created with the WQ_UNBOUND flag will allow some power savings. However, we don't change the default behaviour of the system. To enable power-saving behaviour, a new config option CONFIG_WQ_POWER_EFFICIENT needs to be turned on. This option can also be overridden by the workqueue.power_efficient boot parameter. tj: Updated config description and comments. Renamed CONFIG_WQ_POWER_EFFICIENT to CONFIG_WQ_POWER_EFFICIENT_DEFAULT. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-05-14 10:50:06 -07:00
Tejun Heo	d325185916	workqueue: workqueue_congested() shouldn't translate WORK_CPU_UNBOUND into node number `df2d5ae499` ("workqueue: map an unbound workqueues to multiple per-node pool_workqueues") made unbound workqueues to map to multiple per-node pool_workqueues and accordingly updated workqueue_contested() so that, for unbound workqueues, it maps the specified @cpu to the NUMA node number to obtain the matching pool_workqueue to query the congested state. Before this change, workqueue_congested() ignored @cpu for unbound workqueues as there was only one pool_workqueue and some users (fscache) called it with WORK_CPU_UNBOUND. After the commit, this causes the following oops as WORK_CPU_UNBOUND gets translated to garbage by cpu_to_node(). BUG: unable to handle kernel paging request at ffff8803598d98b8 IP: [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa PGD 2421067 PUD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 2689 Comm: cat Tainted: GF 3.9.0-fsdevel+ #4 task: ffff88003d801040 ti: ffff880025806000 task.ti: ffff880025806000 RIP: 0010:[<ffffffff81043b7e>] [<ffffffff81043b7e>] unbound_pwq_by_node+0xa1/0xfa RSP: 0018:ffff880025807ad8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff8800388a2400 RCX: 0000000000000003 RDX: ffff880025807fd8 RSI: ffffffff81a31420 RDI: ffff88003d8016e0 RBP: ffff880025807ae8 R08: ffff88003d801730 R09: ffffffffa00b4898 R10: ffffffff81044217 R11: ffff88003d801040 R12: 0000000064206e97 R13: ffff880036059d98 R14: ffff880038cc8080 R15: ffff880038cc82d0 FS: 00007f21afd9c740(0000) GS:ffff88003d100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8803598d98b8 CR3: 000000003df49000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff8800388a2400 0000000000000002 ffff880025807b18 ffffffff810442ce ffffffff81044217 ffff880000000002 ffff8800371b4080 ffff88003d112ec0 ffff880025807b38 ffffffffa00810b0 ffff880036059d88 ffff880036059be8 Call Trace: [<ffffffff810442ce>] workqueue_congested+0xb7/0x12c [<ffffffffa00810b0>] fscache_enqueue_object+0xb2/0xe8 [fscache] [<ffffffffa007facd>] __fscache_acquire_cookie+0x3b9/0x56c [fscache] [<ffffffffa00ad8fe>] nfs_fscache_set_inode_cookie+0xee/0x132 [nfs] [<ffffffffa009e112>] do_open+0x9/0xd [nfs] [<ffffffff810e804a>] do_dentry_open+0x175/0x24b [<ffffffff810e8298>] finish_open+0x41/0x51 Fix it by using smp_processor_id() if @cpu is WORK_CPU_UNBOUND. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Howells <dhowells@redhat.com> Tested-and-Acked-by: David Howells <dhowells@redhat.com>	2013-05-10 11:10:17 -07:00
Tejun Heo	3d1cb2059d	workqueue: include workqueue info when printing debug dump of a worker task One of the problems that arise when converting dedicated custom threadpool to workqueue is that the shared worker pool used by workqueue anonimizes each worker making it more difficult to identify what the worker was doing on which target from the output of sysrq-t or debug dump from oops, BUG() and friends. This patch implements set_worker_desc() which can be called from any workqueue work function to set its description. When the worker task is dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion and so on - the description will be printed out together with the workqueue name and the worker function pointer. The printing side is implemented by print_worker_info() which is called from functions in task dump paths - sched_show_task() and dump_stack_print_info(). print_worker_info() can be safely called on any task in any state as long as the task struct itself is accessible. It uses probe_*() functions to access worker fields. It may print garbage if something went very wrong, but it wouldn't cause (another) oops. The description is currently limited to 24bytes including the terminating \0. worker->desc_valid and workder->desc[] are added and the 64 bytes marker which was already incorrect before adding the new fields is moved to the correct position. Here's an example dump with writeback updated to set the bdi name as worker desc. Hardware name: Bochs Modules linked in: Pid: 7, comm: kworker/u9:0 Not tainted 3.9.0-rc1-work+ #1 Workqueue: writeback bdi_writeback_workfn (flush-8:0) ffffffff820a3ab0 ffff88000f6e9cb8 ffffffff81c61845 ffff88000f6e9cf8 ffffffff8108f50f 0000000000000000 0000000000000000 ffff88000cde16b0 ffff88000cde1aa8 ffff88001ee19240 ffff88000f6e9fd8 ffff88000f6e9d08 Call Trace: [<ffffffff81c61845>] dump_stack+0x19/0x1b [<ffffffff8108f50f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff8108f56a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81200150>] bdi_writeback_workfn+0x2a0/0x3b0 ... Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-30 17:04:02 -07:00
Wei Yongjun	cece95dfe5	workqueue: use kmem_cache_free() instead of kfree() memory allocated by kmem_cache_alloc() should be freed using kmem_cache_free(), not kfree(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-09 11:33:40 -07:00
Lai Jiangshan	5c529597e9	workqueue: avoid false negative WARN_ON() in destroy_workqueue() destroy_workqueue() performs several sanity checks before proceeding with destruction of a workqueue. One of the checks verifies that refcnt of each pwq (pool_workqueue) is over 1 as at that point there should be no in-flight work items and the only holder of pwq refs is the workqueue itself. This worked fine as a workqueue used to hold only one reference to its pwqs; however, since `4c16bd327c` ("workqueue: implement NUMA affinity for unbound workqueues"), a workqueue may hold multiple references to its default pwq triggering this sanity check spuriously. Fix it by not triggering the pwq->refcnt assertion on default pwqs. An example spurious WARN trigger follows. WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e() Hardware name: 4286C12 Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29 Call Trace: [<c04314a7>] warn_slowpath_common+0x7c/0x93 [<c04314e0>] warn_slowpath_null+0x22/0x24 [<c044796a>] destroy_workqueue+0x6a/0x13e [<c056dc01>] ext4_put_super+0x43/0x2c4 [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9 [<c04fb848>] kill_block_super+0x22/0x60 [<c04fb960>] deactivate_locked_super+0x2f/0x56 [<c04fc41b>] deactivate_super+0x2e/0x31 [<c050f1e6>] mntput_no_expire+0x103/0x108 [<c050fdce>] sys_umount+0x2a2/0x2c4 [<c050fe0e>] sys_oldumount+0x1e/0x20 [<c085ba4d>] sysenter_do_call+0x12/0x38 tj: Rewrote description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>	2013-04-04 07:54:01 -07:00
Tejun Heo	229641a6f1	Linux 3.9-rc5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQEcBAABAgAGBQJRWLTrAAoJEHm+PkMAQRiGe8oH/iMy48mecVWvxVZn74Tx3Cef xmW/PnAIj28EhSPqK49N/Ow6AfQToFKf7AP0ge20KAf5teTq95AY+tH74DAANt8F BjKXXTZiR5xwBvRkq7CR5wDcCvEcBAAz8fgTEd6SEDB2d2VXFf5eKdKUqt1avTCh Z6Hup5kuwX+ddtwY2DCBXtp2n6fL0Rm5yLzY1A3OOBye1E7VyLTF7M5BR603Q44P 4kRLxn8+R7jy3hTuZIhAeoS8TKUoBwVk7DmKxEzrhTHZVOmvwE9lEHybRnIyOpd/ k1JnbRbiPsLsCVFOn10SQkGDAIk00lro3tuWP2C1ljERiD/OOh5Ui9nXYAhMkbI= =q15K -----END PGP SIGNATURE----- Merge tag 'v3.9-rc5' into wq/for-3.10 Writeback conversion to workqueue will be based on top of wq/for-3.10 branch to take advantage of custom attrs and NUMA support for unbound workqueues. Mainline currently contains two commits which result in non-trivial merge conflicts with wq/for-3.10 and because block/for-3.10/core is based on v3.9-rc3 which contains one of the conflicting commits, we need a pre-merge-window merge anyway. Let's pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer from workqueue merge conflicts. The two conflicts and their resolutions: * `e68035fb65` ("workqueue: convert to idr_alloc()") in mainline changes worker_pool_assign_id() to use idr_alloc() instead of the old idr interface. worker_pool_assign_id() goes through multiple locking changes in wq/for-3.10 causing the following conflict. static int worker_pool_assign_id(struct worker_pool pool) { int ret; <<<<<<< HEAD lockdep_assert_held(&wq_pool_mutex); do { if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL)) return -ENOMEM; ret = idr_get_new(&worker_pool_idr, pool, &pool->id); } while (ret == -EAGAIN); ======= mutex_lock(&worker_pool_idr_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) pool->id = ret; mutex_unlock(&worker_pool_idr_mutex); >>>>>>> `c67bf5361e` return ret < 0 ? ret : 0; } We want locking from the former and idr_alloc() usage from the latter, which can be combined to the following. static int worker_pool_assign_id(struct worker_pool pool) { int ret; lockdep_assert_held(&wq_pool_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) { pool->id = ret; return 0; } return ret; } * `eb2834285c` ("workqueue: fix possible pool stall bug in wq_unbind_fn()") updated wq_unbind_fn() such that it has single larger for_each_std_worker_pool() loop instead of two separate loops with a schedule() call inbetween. wq/for-3.10 renamed pool->assoc_mutex to pool->manager_mutex causing the following conflict (earlier function body and comments omitted for brevity). static void wq_unbind_fn(struct work_struct work) { ... spin_unlock_irq(&pool->lock); <<<<<<< HEAD mutex_unlock(&pool->manager_mutex); } ======= mutex_unlock(&pool->assoc_mutex); >>>>>>> `c67bf5361e` schedule(); <<<<<<< HEAD for_each_cpu_worker_pool(pool, cpu) ======= >>>>>>> `c67bf5361e` atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } The resolution is mostly trivial. We want the control flow of the latter with the rename of the former. static void wq_unbind_fn(struct work_struct work) { ... spin_unlock_irq(&pool->lock); mutex_unlock(&pool->manager_mutex); schedule(); atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-01 18:45:36 -07:00
Tejun Heo	d55262c4d1	workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity Unbound workqueues are now NUMA aware. Let's add some control knobs and update sysfs interface accordingly. * Add kernel param workqueue.numa_disable which disables NUMA affinity globally. * Replace sysfs file "pool_id" with "pool_ids" which contain node:pool_id pairs. This change is userland-visible but "pool_id" hasn't seen a release yet, so this is okay. * Add a new sysf files "numa" which can toggle NUMA affinity on individual workqueues. This is implemented as attrs->no_numa whichn is special in that it isn't part of a pool's attributes. It only affects how apply_workqueue_attrs() picks which pools to use. After "pool_ids" change, first_pwq() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:38 -07:00
Tejun Heo	4c16bd327c	workqueue: implement NUMA affinity for unbound workqueues Currently, an unbound workqueue has single current, or first, pwq (pool_workqueue) to which all new work items are queued. This often isn't optimal on NUMA machines as workers may jump around across node boundaries and work items get assigned to workers without any regard to NUMA affinity. This patch implements NUMA affinity for unbound workqueues. Instead of mapping all entries of numa_pwq_tbl[] to the same pwq, apply_workqueue_attrs() now creates a separate pwq covering the intersecting CPUs for each NUMA node which has online CPUs in @attrs->cpumask. Nodes which don't have intersecting possible CPUs are mapped to pwqs covering whole @attrs->cpumask. As CPUs come up and go down, the pool association is changed accordingly. Changing pool association may involve allocating new pools which may fail. To avoid failing CPU_DOWN, each workqueue always keeps a default pwq which covers whole attrs->cpumask which is used as fallback if pool creation fails during a CPU hotplug operation. This ensures that all work items issued on a NUMA node is executed on the same node as long as the workqueue allows execution on the CPUs of the node. As this maps a workqueue to multiple pwqs and max_active is per-pwq, this change the behavior of max_active. The limit is now per NUMA node instead of global. While this is an actual change, max_active is already per-cpu for per-cpu workqueues and primarily used as safety mechanism rather than for active concurrency control. Concurrency is usually limited from workqueue users by the number of concurrently active work items and this change shouldn't matter much. v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted by Lai. v3: The previous version incorrectly made a workqueue spanning multiple nodes spread work items over all online CPUs when some of its nodes don't have any desired cpus. Reimplemented so that NUMA affinity is properly updated as CPUs go up and down. This problem was spotted by Lai Jiangshan. v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it; however, wq may be freed at any time after dfl_pwq is put making the clearing use-after-free. Clear wq->dfl_pwq before putting it. v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and @pwq_tbl after success. Fixed. Retry loop in wq_update_unbound_numa_attrs() isn't necessary as application of new attrs is excluded via CPU hotplug. Removed. Documentation on CPU affinity guarantee on CPU_DOWN added. All changes are suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:36 -07:00
Tejun Heo	dce90d47c4	workqueue: introduce put_pwq_unlocked() Factor out lock pool, put_pwq(), unlock sequence into put_pwq_unlocked(). The two existing places are converted and there will be more with NUMA affinity support. This is to prepare for NUMA affinity support for unbound workqueues and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	1befcf3073	workqueue: introduce numa_pwq_tbl_install() Factor out pool_workqueue linking and installation into numa_pwq_tbl[] from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq() is made safe to call multiple times. numa_pwq_tbl_install() links the pwq, installs it into numa_pwq_tbl[] at the specified node and returns the old entry. @last_pwq is removed from link_pwq() as the return value of the new function can be used instead. This is to prepare for NUMA affinity support for unbound workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	e50aba9aea	workqueue: use NUMA-aware allocation for pool_workqueues Use kmem_cache_alloc_node() with @pool->node instead of kmem_cache_zalloc() when allocating a pool_workqueue so that it's allocated on the same node as the associated worker_pool. As there's no no kmem_cache_zalloc_node(), move zeroing to init_pwq(). This was suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	f147f29eb7	workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() Break init_and_link_pwq() into init_pwq() and link_pwq() and move unbound-workqueue specific handling into apply_workqueue_attrs(). Also, factor out unbound pool and pool_workqueue allocation into alloc_unbound_pwq(). This reorganization is to prepare for NUMA affinity and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	df2d5ae499	workqueue: map an unbound workqueues to multiple per-node pool_workqueues Currently, an unbound workqueue has only one "current" pool_workqueue associated with it. It may have multple pool_workqueues but only the first pool_workqueue servies new work items. For NUMA affinity, we want to change this so that there are multiple current pool_workqueues serving different NUMA nodes. Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and points to the pool_workqueue to use for each possible node. This replaces first_pwq() in __queue_work() and workqueue_congested(). numa_pwq_tbl[] is currently initialized to point to the same pool_workqueue as first_pwq() so this patch doesn't make any behavior changes. v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function may be called only with wq->mutex held. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	2728fd2f09	workqueue: move hot fields of workqueue_struct to the end Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align them to the cacheline. These two fields are used in the work item issue path and thus hot. The scheduled NUMA affinity support will add dispatch table at the end of workqueue_struct and relocating these two fields will allow us hitting only single cacheline on hot paths. Note that wq->pwqs isn't moved although it currently is being used in the work item issue path for unbound workqueues. The dispatch table mentioned above will replace its use in the issue path, so it will become cold once NUMA support is implemented. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:35 -07:00
Tejun Heo	ecf6881ff3	workqueue: make workqueue->name[] fixed len Currently workqueue->name[] is of flexible length. We want to use the flexible field for something more useful and there isn't much benefit in allowing arbitrary name length anyway. Make it fixed len capping at 24 bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:34 -07:00
Tejun Heo	6029a91829	workqueue: add workqueue->unbound_attrs Currently, when exposing attrs of an unbound workqueue via sysfs, the workqueue_attrs of first_pwq() is used as that should equal the current state of the workqueue. The planned NUMA affinity support will make unbound workqueues make use of multiple pool_workqueues for different NUMA nodes and the above assumption will no longer hold. Introduce workqueue->unbound_attrs which records the current attrs in effect and use it for sysfs instead of first_pwq()->attrs. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:34 -07:00
Tejun Heo	f3f90ad469	workqueue: determine NUMA node of workers accourding to the allowed cpumask When worker tasks are created using kthread_create_on_node(), currently only per-cpu ones have the matching NUMA node specified. All unbound workers are always created with NUMA_NO_NODE. Now that an unbound worker pool may have an arbitrary cpumask associated with it, this isn't optimal. Add pool->node which is determined by the pool's cpumask. If the pool's cpumask is contained inside a NUMA node proper, the pool is associated with that node, and all workers of the pool are created on that node. This currently only makes difference for unbound worker pools with cpumask contained inside single NUMA node, but this will serve as foundation for making all unbound pools NUMA-affine. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:34 -07:00
Tejun Heo	e3c916a4c7	workqueue: drop 'H' from kworker names of unbound worker pools Currently, all workqueue workers which have negative nice value has 'H' postfixed to their names. This is necessary for per-cpu workers as they use the CPU number instead of pool->id to identify the pool and the 'H' postfix is the only thing distinguishing normal and highpri workers. As workers for unbound pools use pool->id, the 'H' postfix is purely informational. TASK_COMM_LEN is 16 and after the static part and delimiters, there are only five characters left for the pool and worker IDs. We're expecting to have more unbound pools with the scheduled NUMA awareness support. Let's drop the non-essential 'H' postfix from unbound kworker name. While at it, restructure kthread_create*() invocation to help future NUMA related changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:32 -07:00
Tejun Heo	bce903809a	workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len and wq_numa_possible_cpumask[] in preparation. The former is the highest NUMA node ID + 1 and the latter is masks of possibles CPUs for each NUMA node. This patch only introduces these. Future patches will make use of them. v2: NUMA initialization move into wq_numa_init(). Also, the possible cpumask array is not created if there aren't multiple nodes on the system. wq_numa_enabled bool added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:32 -07:00
Tejun Heo	a892cacc7f	workqueue: move pwq_pool_locking outside of get/put_unbound_pool() The scheduled NUMA affinity support for unbound workqueues would need to walk workqueues list and pool related operations on each workqueue. Move wq_pool_mutex locking out of get/put_unbound_pool() to their callers so that pool operations can be performed while walking the workqueues list, which is also protected by wq_pool_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:32 -07:00
Tejun Heo	4862125b02	workqueue: fix memory leak in apply_workqueue_attrs() apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs in its success path. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-04-01 11:23:31 -07:00
Tejun Heo	13e2e55601	workqueue: fix unbound workqueue attrs hashing / comparison `29c91e9912` ("workqueue: implement attribute-based unbound worker_pool management") implemented attrs based worker_pool matching. It tried to avoid false negative when comparing cpumasks with custom hash function; unfortunately, the hash and comparison functions fail to ignore CPUs which are not possible. It incorrectly assumed that bitmap_copy() skips leftover bits in the last word of bitmap and cpumask_equal() ignores impossible CPUs. This patch updates attrs->cpumask handling such that impossible CPUs are properly ignored. * Hash and copy functions no longer do anything special. They expect their callers to clear impossible CPUs. * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask instead of setting all bits and explicit cpumask_setall() for unbound_std_wq_attrs[] in init_workqueues() is dropped. * apply_workqueue_attrs() is now responsible for ignoring impossible CPUs. It makes a copy of @attrs and clears impossible CPUs before doing anything else. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-01 11:23:31 -07:00
Tejun Heo	bc0caf099d	workqueue: fix race condition in unbound workqueue free path `8864b4e59` ("workqueue: implement get/put_pwq()") implemented pwq (pool_workqueue) refcnting which frees workqueue when the last pwq goes away. It determined whether it was the last pwq by testing wq->pwqs is empty. Unfortunately, the test was done outside wq->mutex and multiple pwq release could race and try to free wq multiple times leading to oops. Test wq->pwqs emptiness while holding wq->mutex. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-04-01 11:23:31 -07:00
Lai Jiangshan	b592760547	workqueue: remove pwq_lock which is no longer used To simplify locking, the previous patches expanded wq->mutex to protect all fields of each workqueue instance including the pwqs list leaving pwq_lock without any user. Remove the unused pwq_lock. tj: Rebased on top of the current dev branch. Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-25 16:57:19 -07:00
Lai Jiangshan	a357fc0326	workqueue: protect wq->saved_max_active with wq->mutex We're expanding wq->mutex to cover all fields specific to each workqueue with the end goal of replacing pwq_lock which will make locking simpler and easier to understand. This patch makes wq->saved_max_active protected by wq->mutex instead of pwq_lock. As pwq_lock locking around pwq_adjust_mac_active() is no longer necessary, this patch also replaces pwq_lock lockings of for_each_pwq() around pwq_adjust_max_active() to wq->mutex. tj: Rebased on top of the current dev branch. Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-25 16:57:19 -07:00
Lai Jiangshan	b09f4fd39c	workqueue: protect wq->pwqs and iteration with wq->mutex We're expanding wq->mutex to cover all fields specific to each workqueue with the end goal of replacing pwq_lock which will make locking simpler and easier to understand. init_and_link_pwq() and pwq_unbound_release_workfn() already grab wq->mutex when adding or removing a pwq from wq->pwqs list. This patch makes it official that the list is wq->mutex protected for writes and updates readers accoridingly. Explicit IRQ toggles for sched-RCU read-locking in flush_workqueue_prep_pwqs() and drain_workqueues() are removed as the surrounding wq->mutex can provide sufficient synchronization. Also, assert_rcu_or_pwq_lock() is renamed to assert_rcu_or_wq_mutex() and checks for wq->mutex too. pwq_lock locking and assertion are not removed by this patch and a couple of for_each_pwq() iterations are still protected by it. They'll be removed by future patches. tj: Rebased on top of the current dev branch. Updated description. Folded in assert_rcu_or_wq_mutex() renaming from a later patch along with associated comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-25 16:57:18 -07:00
Lai Jiangshan	87fc741e94	workqueue: protect wq->nr_drainers and ->flags with wq->mutex We're expanding wq->mutex to cover all fields specific to each workqueue with the end goal of replacing pwq_lock which will make locking simpler and easier to understand. wq->nr_drainers and ->flags are specific to each workqueue. Protect ->nr_drainers and ->flags with wq->mutex instead of pool_mutex. tj: Rebased on top of the current dev branch. Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-25 16:57:18 -07:00
Lai Jiangshan	3c25a55daa	workqueue: rename wq->flush_mutex to wq->mutex Currently pwq->flush_mutex protects many fields of a workqueue including, especially, the pwqs list. We're going to expand this mutex to protect most of a workqueue and eventually replace pwq_lock, which will make locking simpler and easier to understand. Drop the "flush_" prefix in preparation. This patch is pure rename. tj: Rebased on top of the current dev branch. Updated description. Use WQ: and WR: instead of Q: and QR: for synchronization labels. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-25 16:57:17 -07:00
Lai Jiangshan	68e13a67dd	workqueue: rename wq_mutex to wq_pool_mutex wq->flush_mutex will be renamed to wq->mutex and cover all fields specific to each workqueue and eventually replace pwq_lock, which will make locking simpler and easier to understand. Rename wq_mutex to wq_pool_mutex to avoid confusion with wq->mutex. After the scheduled changes, wq_pool_mutex won't be protecting anything specific to each workqueue instance anyway. This patch is pure rename. tj: s/wqs_mutex/wq_pool_mutex/. Rewrote description. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-25 16:57:17 -07:00
Lai Jiangshan	519e3c1163	workqueue: avoid false negative in assert_manager_or_pool_lock() If lockdep complains something for other subsystem, lockdep_is_held() can be false negative, so we need to also test debug_locks before triggering WARN. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 11:21:34 -07:00
Lai Jiangshan	881094532e	workqueue: use rcu_read_lock_sched() instead for accessing pwq in RCU rcu_read_lock_sched() is better than preempt_disable() if the code is protected by RCU_SCHED. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 11:00:57 -07:00
Lai Jiangshan	951a078a52	workqueue: kick a worker in pwq_adjust_max_active() If pwq_adjust_max_active() changes max_active from 0 to saved_max_active, it needs to wakeup worker. This is already done by thaw_workqueues(). If pwq_adjust_max_active() increases max_active for an unbound wq, while not strictly necessary for correctness, it's still desirable to wake up a worker so that the requested concurrency level is reached sooner. Move wake_up_worker() call from thaw_workqueues() to pwq_adjust_max_active() so that it can handle both of the above two cases. This also makes thaw_workqueues() simpler. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 10:52:30 -07:00
Lai Jiangshan	6a092dfd51	workqueue: simplify current_is_workqueue_rescuer() We can test worker->recue_wq instead of reaching into current_pwq->wq->rescuer and then comparing it to self. tj: Commit message. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 10:40:25 -07:00
Lai Jiangshan	12ee4fc67c	workqueue: add missing POOL_FREEZING get_unbound_pool() forgot to set POOL_FREEZING if workqueue_freezing is set and a new pool could go out of sync with the global freezing state. Fix it by adding POOL_FREEZING if workqueue_freezing. wq_mutex is already held so no further locking is necessary. This also removes the unused static variable warning when !CONFIG_FREEZER. tj: Updated commit message. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-20 10:19:09 -07:00
Tejun Heo	7dbc725e47	workqueue: restore CPU affinity of unbound workers on CPU_ONLINE With the recent addition of the custom attributes support, unbound pools may have allowed cpumask which isn't full. As long as some of CPUs in the cpumask are online, its workers will maintain cpus_allowed as set on worker creation; however, once no online CPU is left in cpus_allowed, the scheduler will reset cpus_allowed of any workers which get scheduled so that they can execute. To remain compliant to the user-specified configuration, CPU affinity needs to be restored when a CPU becomes online for an unbound pool which doesn't currently have any online CPUs before. This patch implement restore_unbound_workers_cpumask(), which is called from CPU_ONLINE for all unbound pools, checks whether the coming up CPU is the first allowed online one, and, if so, invokes set_cpus_allowed_ptr() with the configured cpumask on all workers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-19 13:45:21 -07:00
Tejun Heo	a9ab775bca	workqueue: directly restore CPU affinity of workers from CPU_ONLINE Rebinding workers of a per-cpu pool after a CPU comes online involves a lot of back-and-forth mostly because only the task itself could adjust CPU affinity if PF_THREAD_BOUND was set. As CPU_ONLINE itself couldn't adjust affinity, it had to somehow coerce the workers themselves to perform set_cpus_allowed_ptr(). Due to the various states a worker can be in, this led to three different paths a worker may be rebound. worker->rebind_work is queued to busy workers. Idle ones are signaled by unlinking worker->entry and call idle_worker_rebind(). The manager isn't covered by either and implements its own mechanism. PF_THREAD_BOUND has been relaced with PF_NO_SETAFFINITY and CPU_ONLINE itself now can manipulate CPU affinity of workers. This patch replaces the existing rebind mechanism with direct one where CPU_ONLINE iterates over all workers using for_each_pool_worker(), restores CPU affinity, and clears WORKER_UNBOUND. There are a couple subtleties. All bound idle workers should have their runqueues set to that of the bound CPU; however, if the target task isn't running, set_cpus_allowed_ptr() just updates the cpus_allowed mask deferring the actual migration to when the task wakes up. This is worked around by waking up idle workers after restoring CPU affinity before any workers can become bound. Another subtlety is stems from matching @pool->nr_running with the number of running unbound workers. While DISASSOCIATED, all workers are unbound and nr_running is zero. As workers become bound again, nr_running needs to be adjusted accordingly; however, there is no good way to tell whether a given worker is running without poking into scheduler internals. Instead of clearing UNBOUND directly, rebind_workers() replaces UNBOUND with another new NOT_RUNNING flag - REBOUND, which will later be cleared by the workers themselves while preparing for the next round of work item execution. The only change needed for the workers is clearing REBOUND along with PREP. * This patch leaves for_each_busy_worker() without any user. Removed. * idle_worker_rebind(), busy_worker_rebind_fn(), worker->rebind_work and rebind logic in manager_workers() removed. * worker_thread() now looks at WORKER_DIE instead of testing whether @worker->entry is empty to determine whether it needs to do something special as dying is the only special thing now. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-19 13:45:21 -07:00
Tejun Heo	bd7c089eb2	workqueue: relocate rebind_workers() rebind_workers() will be reimplemented in a way which makes it mostly decoupled from the rest of worker management. Move rebind_workers() so that it's located with other CPU hotplug related functions. This patch is pure function relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-19 13:45:21 -07:00
Tejun Heo	822d8405d1	workqueue: convert worker_pool->worker_ida to idr and implement for_each_pool_worker() Make worker_ida an idr - worker_idr and use it to implement for_each_pool_worker() which will be used to simplify worker rebinding on CPU_ONLINE. pool->worker_idr is protected by both pool->manager_mutex and pool->lock so that it can be iterated while holding either lock. * create_worker() allocates ID without installing worker pointer and installs the pointer later using idr_replace(). This is because worker ID is needed when creating the actual task to name it and the new worker shouldn't be visible to iterations before fully initialized. * In destroy_worker(), ID removal is moved before kthread_stop(). This is again to guarantee that only fully working workers are visible to for_each_pool_worker(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-19 13:45:21 -07:00
Tejun Heo	14a40ffccd	sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY PF_THREAD_BOUND was originally used to mark kernel threads which were bound to a specific CPU using kthread_bind() and a task with the flag set allows cpus_allowed modifications only to itself. Workqueue is currently abusing it to prevent userland from meddling with cpus_allowed of workqueue workers. What we need is a flag to prevent userland from messing with cpus_allowed of certain kernel tasks. In kernel, anyone can (incorrectly) squash the flag, and, for worker-type usages, restricting cpus_allowed modification to the task itself doesn't provide meaningful extra proection as other tasks can inject work items to the task anyway. This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY. sched_setaffinity() checks the flag and return -EINVAL if set. set_cpus_allowed_ptr() is no longer affected by the flag. This will allow simplifying workqueue worker CPU affinity management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>	2013-03-19 13:45:20 -07:00
Linus Torvalds	b63dc123b2	Merge branch 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "Lai's patch to fix highly unlikely but still possible workqueue stall during CPU hotunplug." * 'for-3.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix possible pool stall bug in wq_unbind_fn()	2013-03-18 18:47:07 -07:00
Tejun Heo	2e109a2855	workqueue: rename workqueue_lock to wq_mayday_lock With the recent locking updates, the only thing protected by workqueue_lock is workqueue->maydays list. Rename workqueue_lock to wq_mayday_lock. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:40 -07:00
Tejun Heo	794b18bc8a	workqueue: separate out pool_workqueue locking into pwq_lock This patch continues locking cleanup from the previous patch. It breaks out pool_workqueue synchronization from workqueue_lock into a new spinlock - pwq_lock. The followings are protected by pwq_lock. * workqueue->pwqs * workqueue->saved_max_active The conversion is straight-forward. workqueue_lock usages which cover the above two are converted to pwq_lock. New locking label PW added for things protected by pwq_lock and FR is updated to mean flush_mutex + pwq_lock + sched-RCU. This patch shouldn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:40 -07:00
Tejun Heo	5bcab3355a	workqueue: separate out pool and workqueue locking into wq_mutex Currently, workqueue_lock protects most shared workqueue resources - the pools, workqueues, pool_workqueues, draining, ID assignments, mayday handling and so on. The coverage has grown organically and there is no identified bottleneck coming from workqueue_lock, but it has grown a bit too much and scheduled rebinding changes need the pools and workqueues to be protected by a mutex instead of a spinlock. This patch breaks out pool and workqueue synchronization from workqueue_lock into a new mutex - wq_mutex. The followings are protected by wq_mutex. * worker_pool_idr and unbound_pool_hash * pool->refcnt * workqueues list * workqueue->flags, ->nr_drainers Most changes are mostly straight-forward. workqueue_lock is replaced with wq_mutex where applicable and workqueue_lock lock/unlocks are added where wq_mutex conversion leaves data structures not protected by wq_mutex without locking. irq / preemption flippings were added where the conversion affects them. Things worth noting are * New WQ and WR locking lables added along with assert_rcu_or_wq_mutex(). * worker_pool_assign_id() now expects to be called under wq_mutex. * create_mutex is removed from get_unbound_pool(). It now just holds wq_mutex. This patch shouldn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:40 -07:00
Tejun Heo	7d19c5ce66	workqueue: relocate global variable defs and function decls in workqueue.c They're split across debugobj code for some reason. Collect them. This patch is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:40 -07:00
Tejun Heo	cd549687a7	workqueue: better define locking rules around worker creation / destruction When a manager creates or destroys workers, the operations are always done with the manager_mutex held; however, initial worker creation or worker destruction during pool release don't grab the mutex. They are still correct as initial worker creation doesn't require synchronization and grabbing manager_arb provides enough exclusion for pool release path. Still, let's make everyone follow the same rules for consistency and such that lockdep annotations can be added. Update create_and_start_worker() and put_unbound_pool() to grab manager_mutex around thread creation and destruction respectively and add lockdep assertions to create_worker() and destroy_worker(). This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:39 -07:00
Tejun Heo	ebf44d16ec	workqueue: factor out initial worker creation into create_and_start_worker() get_unbound_pool(), workqueue_cpu_up_callback() and init_workqueues() have similar code pieces to create and start the initial worker factor those out into create_and_start_worker(). This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:39 -07:00
Tejun Heo	bc3a1afc92	workqueue: rename worker_pool->assoc_mutex to ->manager_mutex Manager operations are currently governed by two mutexes - pool->manager_arb and ->assoc_mutex. The former is used to decide who gets to be the manager and the latter to exclude the actual manager operations including creation and destruction of workers. Anyone who grabs ->manager_arb must perform manager role; otherwise, the pool might stall. Grabbing ->assoc_mutex blocks everyone else from performing manager operations but doesn't require the holder to perform manager duties as it's merely blocking manager operations without becoming the manager. Because the blocking was necessary when [dis]associating per-cpu workqueues during CPU hotplug events, the latter was named assoc_mutex. The mutex is scheduled to be used for other purposes, so this patch gives it a more fitting generic name - manager_mutex - and updates / adds comments to explain synchronization around the manager role and operations. This patch is pure rename / doc update. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 19:47:39 -07:00
Tejun Heo	8425e3d5bd	workqueue: inline trivial wrappers There's no reason to make these trivial wrappers full (exported) functions. Inline the followings. queue_work() queue_delayed_work() mod_delayed_work() schedule_work_on() schedule_work() schedule_delayed_work_on() schedule_delayed_work() keventd_up() Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:36 -07:00
Tejun Heo	611c92a020	workqueue: rename @id to @pi in for_each_each_pool() Rename @id argument of for_each_pool() to @pi so that it doesn't get reused accidentally when for_each_pool() is used in combination with other iterators. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:36 -07:00
Tejun Heo	c5aa87bbf4	workqueue: update comments and a warning message * Update incorrect and add missing synchronization labels. * Update incorrect or misleading comments. Add new comments where clarification is necessary. Reformat / rephrase some comments. * drain_workqueue() can be used separately from destroy_workqueue() but its warning message was incorrectly referring to destruction. Other than the warning message change, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:36 -07:00
Tejun Heo	983ca25e73	workqueue: fix max_active handling in init_and_link_pwq() Since `9e8cd2f589` ("workqueue: implement apply_workqueue_attrs()"), init_and_link_pwq() may be called to initialize a new pool_workqueue for a workqueue which is already online, but the function was setting pwq->max_active to wq->saved_max_active without proper synchronization. Fix it by calling pwq_adjust_max_active() under proper locking instead of manually setting max_active. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:35 -07:00
Tejun Heo	699ce097ef	workqueue: implement and use pwq_adjust_max_active() Rename pwq_set_max_active() to pwq_adjust_max_active() and move pool_workqueue->max_active synchronization and max_active determination logic into it. The new function should be called with workqueue_lock held for stable workqueue->saved_max_active, determines the current max_active value the target pool_workqueue should be using from @wq->saved_max_active and the state of the associated pool, and applies it with proper synchronization. The current two users - workqueue_set_max_active() and thaw_workqueues() - are updated accordingly. In addition, the manual freezing handling in __alloc_workqueue_key() and freeze_workqueues_begin() are replaced with calls to pwq_adjust_max_active(). This centralizes max_active handling so that it's less error-prone. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:35 -07:00
Tejun Heo	0fbd95aa8a	workqueue: relocate pwq_set_max_active() pwq_set_max_active() is gonna be modified and used during pool_workqueue init. Move it above init_and_link_pwq(). This patch is pure code reorganization and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-13 16:51:35 -07:00
Tejun Heo	e68035fb65	workqueue: convert to idr_alloc() idr_get_new*() and friends are about to be deprecated. Convert to the new idr_alloc() interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-03-13 15:21:46 -07:00
Tejun Heo	e626761691	workqueue: implement current_is_workqueue_rescuer() Implement a function which queries whether it currently is running off a workqueue rescuer. This will be used to convert writeback to workqueue. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 17:42:01 -07:00
Tejun Heo	226223ab3c	workqueue: implement sysfs interface for workqueues There are cases where workqueue users want to expose control knobs to userland. e.g. Unbound workqueues with custom attributes are scheduled to be used for writeback workers and depending on configuration it can be useful to allow admins to tinker with the priority or allowed CPUs. This patch implements workqueue_sysfs_register(), which makes the workqueue visible under /sys/bus/workqueue/devices/WQ_NAME. There currently are two attributes common to both per-cpu and unbound pools and extra attributes for unbound pools including nice level and cpumask. If alloc_workqueue*() is called with WQ_SYSFS, workqueue_sysfs_register() is called automatically as part of workqueue creation. This is the preferred method unless the workqueue user wants to apply workqueue_attrs before making the workqueue visible to userland. v2: Disallow exposing ordered workqueues as ordered workqueues can't be tuned in any way. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 11:37:07 -07:00
Tejun Heo	8719dceae2	workqueue: reject adjusting max_active or applying attrs to ordered workqueues Adjusting max_active of or applying new workqueue_attrs to an ordered workqueue breaks its ordering guarantee. The former is obvious. The latter is because applying attrs creates a new pwq (pool_workqueue) and there is no ordering constraint between the old and new pwqs. Make apply_workqueue_attrs() and workqueue_set_max_active() trigger WARN_ON() if those operations are requested on an ordered workqueue and fail / ignore respectively. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	618b01eb42	workqueue: make it clear that WQ_DRAINING is an internal flag We're gonna add another internal WQ flag. Let's make the distinction clear. Prefix WQ_DRAINING with __ and move it to bit 16. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	9e8cd2f589	workqueue: implement apply_workqueue_attrs() Implement apply_workqueue_attrs() which applies workqueue_attrs to the specified unbound workqueue by creating a new pwq (pool_workqueue) linked to worker_pool with the specified attributes. A new pwq is linked at the head of wq->pwqs instead of tail and __queue_work() verifies that the first unbound pwq has positive refcnt before choosing it for the actual queueing. This is to cover the case where creation of a new pwq races with queueing. As base ref on a pwq won't be dropped without making another pwq the first one, __queue_work() is guaranteed to make progress and not add work item to a dead pwq. init_and_link_pwq() is updated to return the last first pwq the new pwq replaced, which is put by apply_workqueue_attrs(). Note that apply_workqueue_attrs() is almost identical to unbound pwq part of alloc_and_link_pwqs(). The only difference is that there is no previous first pwq. apply_workqueue_attrs() is implemented to handle such cases and replaces unbound pwq handling in alloc_and_link_pwqs(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	c9178087ac	workqueue: perform non-reentrancy test when queueing to unbound workqueues too Because per-cpu workqueues have multiple pwqs (pool_workqueues) to serve the CPUs, to guarantee that a single work item isn't queued on one pwq while still executing another, __queue_work() takes a look at the previous pool the target work item was on and if it's still executing there, queue the work item on that pool. To support changing workqueue_attrs on the fly, unbound workqueues too will have multiple pwqs and thus need non-reentrancy test when queueing. This patch modifies __queue_work() such that the reentrancy test is performed regardless of the workqueue type. per_cpu_ptr(wq->cpu_pwqs, cpu) used to be used to determine the matching pwq for the last pool. This can't be used for unbound workqueues and is replaced with worker->current_pwq which also happens to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	75ccf5950f	workqueue: prepare flush_workqueue() for dynamic creation and destrucion of unbound pool_workqueues Unbound pwqs (pool_workqueues) will be dynamically created and destroyed with the scheduled unbound workqueue w/ custom attributes support. This patch synchronizes pwq linking and unlinking against flush_workqueue() so that its operation isn't disturbed by pwqs coming and going. Linking and unlinking a pwq into wq->pwqs is now protected also by wq->flush_mutex and a new pwq's work_color is initialized to wq->work_color during linking. This ensures that pwqs changes don't disturb flush_workqueue() in progress and the new pwq's work coloring stays in sync with the rest of the workqueue. flush_mutex during unlinking isn't strictly necessary but it's simpler to do it anyway. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	8864b4e59f	workqueue: implement get/put_pwq() Add pool_workqueue->refcnt along with get/put_pwq(). Both per-cpu and unbound pwqs have refcnts and any work item inserted on a pwq increments the refcnt which is dropped when the work item finishes. For per-cpu pwqs the base ref is never dropped and destroy_workqueue() frees the pwqs as before. For unbound ones, destroy_workqueue() simply drops the base ref on the first pwq. When the refcnt reaches zero, pwq_unbound_release_workfn() is scheduled on system_wq, which unlinks the pwq, puts the associated pool and frees the pwq and wq as necessary. This needs to be done from a work item as put_pwq() needs to be protected by pool->lock but release can't happen with the lock held - e.g. put_unbound_pool() involves blocking operations. Unbound pool->locks are marked with lockdep subclas 1 as put_pwq() will schedule the release work item on system_wq while holding the unbound pool's lock and triggers recursive locking warning spuriously. This will be used to implement dynamic creation and destruction of unbound pwqs. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	d2c1d40487	workqueue: restructure __alloc_workqueue_key() * Move initialization and linking of pool_workqueues into init_and_link_pwq(). * Make the failure path use destroy_workqueue() once pool_workqueue initialization succeeds. These changes are to prepare for dynamic management of pool_workqueues and don't introduce any functional changes. While at it, convert list_del(&wq->list) to list_del_init() as a precaution as scheduled changes will make destruction more complex. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:04 -07:00
Tejun Heo	493008a8e4	workqueue: drop WQ_RESCUER and test workqueue->rescuer for NULL instead WQ_RESCUER is superflous. WQ_MEM_RECLAIM indicates that the user wants a rescuer and testing wq->rescuer for NULL can answer whether a given workqueue has a rescuer or not. Drop WQ_RESCUER and test wq->rescuer directly. This will help simplifying __alloc_workqueue_key() failure path by allowing it to use destroy_workqueue() on a partially constructed workqueue, which in turn will help implementing dynamic management of pool_workqueues. While at it, clear wq->rescuer after freeing it in destroy_workqueue(). This is a precaution as scheduled changes will make destruction more complex. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:03 -07:00
Tejun Heo	ac6104cdf8	workqueue: add pool ID to the names of unbound kworkers There are gonna be multiple unbound pools. Include pool ID in the name of unbound kworkers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:03 -07:00
Tejun Heo	f02ae73aaa	workqueue: drop "std" from cpu_std_worker_pools and for_each_std_worker_pool() All per-cpu pools are standard, so there's no need to use both "cpu" and "std" and for_each_std_worker_pool() is confusing in that it can be used only for per-cpu pools. * s/cpu_std_worker_pools/cpu_worker_pools/ * s/for_each_std_worker_pool()/for_each_cpu_worker_pool()/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:03 -07:00
Tejun Heo	7a62c2c87e	workqueue: remove unbound_std_worker_pools[] and related helpers Workqueue no longer makes use of unbound_std_worker_pools[]. All unbound worker_pools are created dynamically and there's nothing special about the standard ones. With unbound_std_worker_pools[] unused, workqueue no longer has places where it needs to treat the per-cpu pools-cpu and unbound pools together. Remove unbound_std_worker_pools[] and the helpers wrapping it to present unified per-cpu and unbound standard worker_pools. * for_each_std_worker_pool() now only walks through per-cpu pools. * for_each[_online]_wq_cpu() which don't have any users left are removed. * std_worker_pools() and std_worker_pool_pri() are unused and removed. * get_std_worker_pool() is removed. Its only user - alloc_and_link_pwqs() - only used it for per-cpu pools anyway. Open code per_cpu access in alloc_and_link_pwqs() instead. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:03 -07:00
Tejun Heo	29c91e9912	workqueue: implement attribute-based unbound worker_pool management This patch makes unbound worker_pools reference counted and dynamically created and destroyed as workqueues needing them come and go. All unbound worker_pools are hashed on unbound_pool_hash which is keyed by the content of worker_pool->attrs. When an unbound workqueue is allocated, get_unbound_pool() is called with the attributes of the workqueue. If there already is a matching worker_pool, the reference count is bumped and the pool is returned. If not, a new worker_pool with matching attributes is created and returned. When an unbound workqueue is destroyed, put_unbound_pool() is called which decrements the reference count of the associated worker_pool. If the refcnt reaches zero, the worker_pool is destroyed in sched-RCU safe way. Note that the standard unbound worker_pools - normal and highpri ones with no specific cpumask affinity - are no longer created explicitly during init_workqueues(). init_workqueues() only initializes workqueue_attrs to be used for standard unbound pools - unbound_std_wq_attrs[]. The pools are spawned on demand as workqueues are created. v2: - Comment added to init_worker_pool() explaining that @pool should be in a condition which can be passed to put_unbound_pool() even on failure. - pool->refcnt reaching zero and the pool being removed from unbound_pool_hash should be dynamic. pool->refcnt is converted to int from atomic_t and now manipulated inside workqueue_lock. - Removed an incorrect sanity check on nr_idle in put_unbound_pool() which may trigger spuriously. All changes were suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:03 -07:00
Tejun Heo	7a4e344c56	workqueue: introduce workqueue_attrs Introduce struct workqueue_attrs which carries worker attributes - currently the nice level and allowed cpumask along with helper routines alloc_workqueue_attrs() and free_workqueue_attrs(). Each worker_pool now carries ->attrs describing the attributes of its workers. All functions dealing with cpumask and nice level of workers are updated to follow worker_pool->attrs instead of determining them from other characteristics of the worker_pool, and init_workqueues() is updated to set worker_pool->attrs appropriately for all standard pools. Note that create_worker() is updated to always perform set_user_nice() and use set_cpus_allowed_ptr() combined with manual assertion of PF_THREAD_BOUND instead of kthread_bind(). This simplifies handling random attributes without affecting the outcome. This patch doesn't introduce any behavior changes. v2: Missing cpumask_var_t definition caused build failure on some archs. linux/cpumask.h included. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <fengguang.wu@intel.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:00 -07:00
Tejun Heo	4e1a1f9a05	workqueue: separate out init_worker_pool() from init_workqueues() This will be used to implement unbound pools with custom attributes. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:00 -07:00
Tejun Heo	34a06bd6b6	workqueue: replace POOL_MANAGING_WORKERS flag with worker_pool->manager_arb POOL_MANAGING_WORKERS is used to synchronize the manager role. Synchronizing among workers doesn't need blocking and that's why it's implemented as a flag. It got converted to a mutex a while back to add blocking wait from CPU hotplug path - `6037315269` ("workqueue: use mutex for global_cwq manager exclusion"). Later it turned out that synchronization among workers and cpu hotplug need to be done separately. Eventually, POOL_MANAGING_WORKERS is restored and workqueue->manager_mutex got morphed into workqueue->assoc_mutex - `552a37e936` ("workqueue: restore POOL_MANAGING_WORKERS") and `b2eb83d123` ("workqueue: rename manager_mutex to assoc_mutex"). Now, we're gonna need to be able to lock out managers from destroy_workqueue() to support multiple unbound pools with custom attributes making it again necessary to be able to block on the manager role. This patch replaces POOL_MANAGING_WORKERS with worker_pool->manager_arb. This patch doesn't introduce any behavior changes. v2: s/manager_mutex/manager_arb/ Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-12 11:30:00 -07:00
Tejun Heo	fa1b54e69b	workqueue: update synchronization rules on worker_pool_idr Make worker_pool_idr protected by workqueue_lock for writes and sched-RCU protected for reads. Lockdep assertions are added to for_each_pool() and get_work_pool() and all their users are converted to either hold workqueue_lock or disable preemption/irq. worker_pool_assign_id() is updated to hold workqueue_lock when allocating a pool ID. As idr_get_new() always performs RCU-safe assignment, this is enough on the writer side. As standard pools are never destroyed, there's nothing to do on that side. The locking is superflous at this point. This is to help implementation of unbound pools/pwqs with custom attributes. This patch doesn't introduce any behavior changes. v2: Updated for_each_pwq() use if/else for the hidden assertion statement instead of just if as suggested by Lai. This avoids confusing the following else clause. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:00 -07:00
Tejun Heo	76af4d9361	workqueue: update synchronization rules on workqueue->pwqs Make workqueue->pwqs protected by workqueue_lock for writes and sched-RCU protected for reads. Lockdep assertions are added to for_each_pwq() and first_pwq() and all their users are converted to either hold workqueue_lock or disable preemption/irq. alloc_and_link_pwqs() is updated to use list_add_tail_rcu() for consistency which isn't strictly necessary as the workqueue isn't visible. destroy_workqueue() isn't updated to sched-RCU release pwqs. This is okay as the workqueue should have on users left by that point. The locking is superflous at this point. This is to help implementation of unbound pools/pwqs with custom attributes. This patch doesn't introduce any behavior changes. v2: Updated for_each_pwq() use if/else for the hidden assertion statement instead of just if as suggested by Lai. This avoids confusing the following else clause. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:00 -07:00
Tejun Heo	7fb98ea79c	workqueue: replace get_pwq() with explicit per_cpu_ptr() accesses and first_pwq() get_pwq() takes @cpu, which can also be WORK_CPU_UNBOUND, and @wq and returns the matching pwq (pool_workqueue). We want to move away from using @cpu for identifying pools and pwqs for unbound pools with custom attributes and there is only one user - workqueue_congested() - which makes use of the WQ_UNBOUND conditional in get_pwq(). All other users already know whether they're dealing with a per-cpu or unbound workqueue. Replace get_pwq() with explicit per_cpu_ptr(wq->cpu_pwqs, cpu) for per-cpu workqueues and first_pwq() for unbound ones, and open-code WQ_UNBOUND conditional in workqueue_congested(). Note that this makes workqueue_congested() behave sligntly differently when @cpu other than WORK_CPU_UNBOUND is specified. It ignores @cpu for unbound workqueues and always uses the first pwq instead of oopsing. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:30:00 -07:00
Tejun Heo	420c0ddb1f	workqueue: remove workqueue_struct->pool_wq.single workqueue->pool_wq union is used to point either to percpu pwqs (pool_workqueues) or single unbound pwq. As the first pwq can be accessed via workqueue->pwqs list, there's no reason for the single pointer anymore. Use list_first_entry(workqueue->pwqs) to access the unbound pwq and drop workqueue->pool_wq.single pointer and the pool_wq union. It simplifies the code and eases implementing multiple unbound pools w/ custom attributes. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:59 -07:00
Tejun Heo	d84ff0512f	workqueue: consistently use int for @cpu variables Workqueue is mixing unsigned int and int for @cpu variables. There's no point in using unsigned int for cpus - many of cpu related APIs take int anyway. Consistently use int for @cpu variables so that we can use negative values to mark special ones. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:59 -07:00
Tejun Heo	493a1724fe	workqueue: add wokrqueue_struct->maydays list to replace mayday cpu iterators Similar to how pool_workqueue iteration used to be, raising and servicing mayday requests is based on CPU numbers. It's hairy because cpumask_t may not be able to handle WORK_CPU_UNBOUND and cpumasks are assumed to be always set on UP. This is ugly and can't handle multiple unbound pools to be added for unbound workqueues w/ custom attributes. Add workqueue_struct->maydays. When a pool_workqueue needs rescuing, it gets chained on the list through pool_workqueue->mayday_node and rescuer_thread() consumes the list until it's empty. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:59 -07:00
Tejun Heo	24b8a84718	workqueue: restructure pool / pool_workqueue iterations in freeze/thaw functions The three freeze/thaw related functions - freeze_workqueues_begin(), freeze_workqueues_busy() and thaw_workqueues() - need to iterate through all pool_workqueues of all freezable workqueues. They did it by first iterating pools and then visiting all pwqs (pool_workqueues) of all workqueues and process it if its pwq->pool matches the current pool. This is rather backwards and done this way partly because workqueue didn't have fitting iteration helpers and partly to avoid the number of lock operations on pool->lock. Workqueue now has fitting iterators and the locking operation overhead isn't anything to worry about - those locks are unlikely to be contended and the same CPU visiting the same set of locks multiple times isn't expensive. Restructure the three functions such that the flow better matches the logical steps and pwq iteration is done using for_each_pwq() inside workqueue iteration. * freeze_workqueues_begin(): Setting of FREEZING is moved into a separate for_each_pool() iteration. pwq iteration for clearing max_active is updated as described above. * freeze_workqueues_busy(): pwq iteration updated as described above. * thaw_workqueues(): The single for_each_wq_cpu() iteration is broken into three discrete steps - clearing FREEZING, restoring max_active, and kicking workers. The first and last steps use for_each_pool() and the second step uses pwq iteration described above. This makes the code easier to understand and removes the use of for_each_wq_cpu() for walking pwqs, which can't support multiple unbound pwqs which will be needed to implement unbound workqueues with custom attributes. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:58 -07:00
Tejun Heo	1711696955	workqueue: introduce for_each_pool() With the scheduled unbound pools with custom attributes, there will be multiple unbound pools, so it wouldn't be able to use for_each_wq_cpu() + for_each_std_worker_pool() to iterate through all pools. Introduce for_each_pool() which iterates through all pools using worker_pool_idr and use it instead of for_each_wq_cpu() + for_each_std_worker_pool() combination in freeze_workqueues_begin(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:58 -07:00
Tejun Heo	49e3cf44df	workqueue: replace for_each_pwq_cpu() with for_each_pwq() Introduce for_each_pwq() which iterates all pool_workqueues of a workqueue using the recently added workqueue->pwqs list and replace for_each_pwq_cpu() usages with it. This is primarily to remove the single unbound CPU assumption from pwq iteration for the scheduled unbound pools with custom attributes support which would introduce multiple unbound pwqs per workqueue; however, it also simplifies iterator users. Note that pwq->pool initialization is moved to alloc_and_link_pwqs() as that now is the only place which is explicitly handling the two pwq types. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:58 -07:00
Tejun Heo	30cdf2496d	workqueue: add workqueue_struct->pwqs list Add workqueue_struct->pwqs list and chain all pool_workqueues belonging to a workqueue there. This will be used to implement generic pool_workqueue iteration and handle multiple pool_workqueues for the scheduled unbound pools with custom attributes. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:57 -07:00
Tejun Heo	e904e6c266	workqueue: introduce kmem_cache for pool_workqueues pool_workqueues need to be aligned to 1 << WORK_STRUCT_FLAG_BITS as the lower bits of work->data are used for flags when they're pointing to pool_workqueues. Due to historical reasons, unbound pool_workqueues are allocated using kzalloc() with sufficient buffer area for alignment and aligned manually. The original pointer is stored at the end which free_pwqs() retrieves when freeing it. There's no reason for this hackery anymore. Set alignment of struct pool_workqueue to 1 << WORK_STRUCT_FLAG_BITS, add kmem_cache for pool_workqueues with proper alignment and replace the hacky alloc and free implementation with plain kmem_cache_zalloc/free(). In case WORK_STRUCT_FLAG_BITS gets shrunk too much and makes fields of pool_workqueues misaligned, trigger WARN if the alignment of struct pool_workqueue becomes smaller than that of long long. Note that assertion on IS_ALIGNED() is removed from alloc_pwqs(). We already have another one in pwq init loop in __alloc_workqueue_key(). This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:57 -07:00
Tejun Heo	e98d5b16cf	workqueue: make workqueue_lock irq-safe workqueue_lock will be used to synchronize areas which require irq-safety and there isn't much benefit in keeping it not irq-safe. Make it irq-safe. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:57 -07:00
Tejun Heo	6183c009f6	workqueue: make sanity checks less punshing using WARN_ON[_ONCE]()s Workqueue has been using mostly BUG_ON()s for sanity checks, which fail unnecessarily harshly when the assertion doesn't hold. Most assertions can converted to be less drastic such that things can limp along instead of dying completely. Convert BUG_ON()s to WARN_ON[_ONCE]()s with softer failure behaviors - e.g. if assertion check fails in destroy_worker(), trigger WARN and silently ignore destruction request. Most conversions are trivial. Note that sanity checks in destroy_workqueue() are moved above removal from workqueues list so that it can bail out without side-effects if assertion checks fail. This patch doesn't introduce any visible behavior changes during normal operation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-03-12 11:29:57 -07:00
Lai Jiangshan	eb2834285c	workqueue: fix possible pool stall bug in wq_unbind_fn() Since multiple pools per cpu have been introduced, wq_unbind_fn() has a subtle bug which may theoretically stall work item processing. The problem is two-fold. * wq_unbind_fn() depends on the worker executing wq_unbind_fn() itself to start unbound chain execution, which works fine when there was only single pool. With multiple pools, only the pool which is running wq_unbind_fn() - the highpri one - is guaranteed to have such kick-off. The other pool could stall when its busy workers block. * The current code is setting WORKER_UNBIND / POOL_DISASSOCIATED of the two pools in succession without initiating work execution inbetween. Because setting the flags requires grabbing assoc_mutex which is held while new workers are created, this could lead to stalls if a pool's manager is waiting for the previous pool's work items to release memory. This is almost purely theoretical tho. Update wq_unbind_fn() such that it sets WORKER_UNBIND / POOL_DISASSOCIATED, goes over schedule() and explicitly kicks off execution for a pool and then moves on to the next one. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2013-03-08 15:18:28 -08:00
Lai Jiangshan	b31041042a	workqueue: better define synchronization rule around rescuer->pool updates Rescuers visit different worker_pools to process work items from pools under pressure. Currently, rescuer->pool is updated outside any locking and when an outsider looks at a rescuer, there's no way to tell when and whether rescuer->pool is gonna change. While this doesn't currently cause any problem, it is nasty. With recent worker_maybe_bind_and_lock() changes, we can move rescuer->pool updates inside pool locks such that if rescuer->pool equals a locked pool, it's guaranteed to stay that way until the pool is unlocked. Move rescuer->pool inside pool->lock. This patch doesn't introduce any visible behavior difference. tj: Updated the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 09:44:58 -08:00
Lai Jiangshan	f36dc67b27	workqueue: change argument of worker_maybe_bind_and_lock() to @pool worker_maybe_bind_and_lock() currently takes @worker but only cares about @worker->pool. This patch updates worker_maybe_bind_and_lock() to take @pool instead of @worker. This will be used to better define synchronization rules regarding rescuer->pool updates. This doesn't introduce any functional change. tj: Updated the comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 09:44:58 -08:00
Lai Jiangshan	f5faa0774e	workqueue: use %current instead of worker->task in worker_maybe_bind_and_lock() worker_maybe_bind_and_lock() uses both @worker->task and @current at the same time. As worker_maybe_bind_and_lock() can only be called by the current worker task, they are always the same. Update worker_maybe_bind_and_lock() to use %current consistently. This doesn't introduce any functional change. tj: Massaged the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-03-04 09:44:58 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Konstantin Khlebnikov	1438ade567	workqueue: un-GPL function delayed_work_timer_fn() commit `d8e794dfd5` ("workqueue: set delayed_work->timer function on initialization") exports function delayed_work_timer_fn() only for GPL modules. This makes delayed-works unusable for non-GPL modules, because initialization macro now requires GPL symbol. For example schedule_delayed_work() available for non-GPL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # 3.7	2013-02-19 10:09:13 -08:00
Tejun Heo	112202d909	workqueue: rename cpu_workqueue to pool_workqueue workqueue has moved away from global_cwqs to worker_pools and with the scheduled custom worker pools, wforkqueues will be associated with pools which don't have anything to do with CPUs. The workqueue code went through significant amount of changes recently and mass renaming isn't likely to hurt much additionally. Let's replace 'cpu' with 'pool' so that it reflects the current design. * s/struct cpu_workqueue_struct/struct pool_workqueue/ * s/cpu_wq/pool_wq/ * s/cwq/pwq/ This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-13 19:29:12 -08:00
Tejun Heo	8d03ecfe47	workqueue: reimplement is_chained_work() using current_wq_worker() is_chained_work() was added before current_wq_worker() and implemented its own ham-fisted way of finding out whether %current is a workqueue worker - it iterates through all possible workers. Drop the custom implementation and reimplement using current_wq_worker(). Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-13 19:29:10 -08:00
Tejun Heo	1dd638149f	workqueue: fix is_chained_work() regression `c9e7cf273f` ("workqueue: move busy_hash from global_cwq to worker_pool") incorrectly converted is_chained_work() to use get_gcwq() inside for_each_gcwq_cpu() while removing get_gcwq(). As cwq might not exist for all possible workqueue CPUs, @cwq can be NULL and the following cwq deferences can lead to oops. Fix it by using for_each_cwq_cpu() instead, which is the better one to use anyway as we only need to check pools that the wq is associated with. Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-13 19:29:07 -08:00
Lai Jiangshan	8594fade39	workqueue: pick cwq instead of pool in __queue_work() Currently, __queue_work() chooses the pool to queue a work item to and then determines cwq from the target wq and the chosen pool. This is a bit backwards in that we can determine cwq first and simply use cwq->pool. This way, we can skip get_std_worker_pool() in queueing path which will be a hurdle when implementing custom worker pools. Update __queue_work() such that it chooses the target cwq and then use cwq->pool instead of the other way around. While at it, add missing {} in an if statement. This patch doesn't introduce any functional changes. tj: The original patch had two get_cwq() calls - the first to determine the pool by doing get_cwq(cpu, wq)->pool and the second to determine the matching cwq from get_cwq(pool->cpu, wq). Updated the function such that it chooses cwq instead of pool and removed the second call. Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-07 13:17:51 -08:00
Lai Jiangshan	54d5b7d079	workqueue: make get_work_pool_id() cheaper get_work_pool_id() currently first obtains pool using get_work_pool() and then return pool->id. For an off-queue work item, this involves obtaining pool ID from worker->data, performing idr_find() to find the matching pool and then returning its pool->id which of course is the same as the one which went into idr_find(). Just open code WORK_STRUCT_CWQ case and directly return pool ID from work->data. tj: The original patch dropped on-queue work item handling and renamed the function to offq_work_pool_id(). There isn't much benefit in doing so. Handling it only requires a single if() and we need at least BUG_ON(), which is also a branch, even if we drop on-queue handling. Open code WORK_STRUCT_CWQ case and keep the function in line with get_work_pool(). Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-07 13:14:20 -08:00
Tejun Heo	e19e397a85	workqueue: move nr_running into worker_pool As nr_running is likely to be accessed from other CPUs during try_to_wake_up(), it was kept outside worker_pool; however, while less frequent, other fields in worker_pool are accessed from other CPUs for, e.g., non-reentrancy check. Also, with recent pool related changes, accessing nr_running matching the worker_pool isn't as simple as it used to be. Move nr_running inside worker_pool. Keep it aligned to cacheline and define CPU pools using DEFINE_PER_CPU_SHARED_ALIGNED(). This should give at least the same cacheline behavior. get_pool_nr_running() is replaced with direct pool->nr_running accesses. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com>	2013-02-07 13:14:20 -08:00
Tejun Heo	1606283622	workqueue: cosmetic update in try_to_grab_pending() With the recent is-work-queued-here test simplification, the nested if() in try_to_grab_pending() can be collapsed. Collapse it. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-02-06 18:04:53 -08:00
Lai Jiangshan	0b3dae68ac	workqueue: simplify is-work-item-queued-here test Currently, determining whether a work item is queued on a locked pool involves somewhat convoluted memory barrier dancing. It goes like the following. * When a work item is queued on a pool, work->data is updated before work->entry is linked to the pending list with a wmb() inbetween. * When trying to determine whether a work item is currently queued on a pool pointed to by work->data, it locks the pool and looks at work->entry. If work->entry is linked, we then do rmb() and then check whether work->data points to the current pool. This works because, work->data can only point to a pool if it currently is or were on the pool and, * If it currently is on the pool, the tests would obviously succeed. * It it left the pool, its work->entry was cleared under pool->lock, so if we're seeing non-empty work->entry, it has to be from the work item being linked on another pool. Because work->data is updated before work->entry is linked with wmb() inbetween, work->data update from another pool is guaranteed to be visible if we do rmb() after seeing non-empty work->entry. So, we either see empty work->entry or we see updated work->data pointin to another pool. While this works, it's convoluted, to put it mildly. With recent updates, it's now guaranteed that work->data points to cwq only while the work item is queued and that updating work->data to point to cwq or back to pool is done under pool->lock, so we can simply test whether work->data points to cwq which is associated with the currently locked pool instead of the convoluted memory barrier dancing. This patch replaces the memory barrier based "are you still here, really?" test with much simpler "does work->data points to me?" test - if work->data points to a cwq which is associated with the currently locked pool, the work item is guaranteed to be queued on the pool as work->data can start and stop pointing to such cwq only under pool->lock and the start and stop coincide with queue and dequeue. tj: Rewrote the comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-06 18:04:53 -08:00
Lai Jiangshan	4468a00fd9	workqueue: make work->data point to pool after try_to_grab_pending() We plan to use work->data pointing to cwq as the synchronization invariant when determining whether a given work item is on a locked pool or not, which requires work->data pointing to cwq only while the work item is queued on the associated pool. With delayed_work updated not to overload work->data for target workqueue recording, the only case where we still have off-queue work->data pointing to cwq is try_to_grab_pending() which doesn't update work->data after stealing a queued work item. There's no reason for try_to_grab_pending() to not update work->data to point to the pool instead of cwq, like the normal execution does. This patch adds set_work_pool_and_keep_pending() which makes work->data point to pool instead of cwq but keeps the pending bit unlike set_work_pool_and_clear_pending() (surprise!). After this patch, it's guaranteed that only queued work items point to cwqs. This patch doesn't introduce any visible behavior change. tj: Renamed the new helper function to match set_work_pool_and_clear_pending() and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-06 18:04:53 -08:00
Lai Jiangshan	60c057bca2	workqueue: add delayed_work->wq to simplify reentrancy handling To avoid executing the same work item from multiple CPUs concurrently, a work_struct records the last pool it was on in its ->data so that, on the next queueing, the pool can be queried to determine whether the work item is still executing or not. A delayed_work goes through timer before actually being queued on the target workqueue and the timer needs to know the target workqueue and CPU. This is currently achieved by modifying delayed_work->work.data such that it points to the cwq which points to the target workqueue and the last CPU the work item was on. __queue_delayed_work() extracts the last CPU from delayed_work->work.data and then combines it with the target workqueue to create new work.data. The only thing this rather ugly hack achieves is encoding the target workqueue into delayed_work->work.data without using a separate field, which could be a trade off one can make; unfortunately, this entangles work->data management between regular workqueue and delayed_work code by setting cwq pointer before the work item is actually queued and becomes a hindrance for further improvements of work->data handling. This can be easily made sane by adding a target workqueue field to delayed_work. While delayed_work is used widely in the kernel and this does make it a bit larger (<5%), I think this is the right trade-off especially given the prospect of much saner handling of work->data which currently involves quite tricky memory barrier dancing, and don't expect to see any measureable effect. Add delayed_work->wq and drop the delayed_work->work.data overloading. tj: Rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-06 18:04:53 -08:00
Lai Jiangshan	038366c5cf	workqueue: make work_busy() test WORK_STRUCT_PENDING first Currently, work_busy() first tests whether the work has a pool associated with it and if not, considers it idle. This works fine even for delayed_work.work queued on timer, as __queue_delayed_work() sets cwq on delayed_work.work - a queued delayed_work always has its cwq and thus pool associated with it. However, we're about to update delayed_work queueing and this won't hold. Update work_busy() such that it tests WORK_STRUCT_PENDING before the associated pool. This doesn't make any noticeable behavior difference now. With work_pending() test moved, the function read a lot better with "if (!pool)" test flipped to positive. Flip it. While at it, lose the comment about now non-existent reentrant workqueues. tj: Reorganized the function and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-06 18:04:53 -08:00
Lai Jiangshan	6be195886a	workqueue: replace WORK_CPU_NONE/LAST with WORK_CPU_END Now that workqueue has moved away from gcwqs, workqueue no longer has the need to have a CPU identifier indicating "no cpu associated" - we now use WORK_OFFQ_POOL_NONE instead - and most uses of WORK_CPU_NONE are gone. The only left usage is as the end marker for for_each_wq() iterators, where the name WORK_CPU_NONE is confusing w/o actual WORK_CPU_NONE usages. Similarly, WORK_CPU_LAST which equals WORK_CPU_NONE no longer makes sense. Replace both WORK_CPU_NONE and LAST with WORK_CPU_END. This patch doesn't introduce any functional difference. tj: s/WORK_CPU_LAST/WORK_CPU_END/ and rewrote the description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-02-06 18:04:53 -08:00
Tejun Heo	706026c214	workqueue: post global_cwq removal cleanups Remove remaining references to gcwq. * __next_gcwq_cpu() steals __next_wq_cpu() name. The original __next_wq_cpu() became __next_cwq_cpu(). * s/for_each_gcwq_cpu/for_each_wq_cpu/ s/for_each_online_gcwq_cpu/for_each_online_wq_cpu/ * s/gcwq_mayday_timeout/pool_mayday_timeout/ * s/gcwq_unbind_fn/wq_unbind_fn/ * Drop references to gcwq in comments. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:34 -08:00
Tejun Heo	e6e380ed92	workqueue: rename nr_running variables Rename per-cpu and unbound nr_running variables such that they match the pool variables. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:34 -08:00
Tejun Heo	a60dc39c01	workqueue: remove global_cwq global_cwq is now nothing but a container for per-cpu standard worker_pools. Declare the worker pools directly as cpu/unbound_std_worker_pools[] and remove global_cwq. * ____cacheline_aligned_in_smp moved from global_cwq to worker_pool. This probably would have made sense even before this change as we want each pool to be aligned. * get_gcwq() is replaced with std_worker_pools() which returns the pointer to the standard pool array for a given CPU. * __alloc_workqueue_key() updated to use get_std_worker_pool() instead of open-coding pool determination. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. v2: Joonsoo pointed out that it'd better to align struct worker_pool rather than the array so that every pool is aligned. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Joonsoo Kim <js1304@gmail.com>	2013-01-24 11:01:34 -08:00
Tejun Heo	4e8f0a6096	workqueue: remove worker_pool->gcwq The only remaining user of pool->gcwq is std_worker_pool_pri(). Reimplement it using get_gcwq() and remove worker_pool->gcwq. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:34 -08:00
Tejun Heo	38db41d984	workqueue: replace for_each_worker_pool() with for_each_std_worker_pool() for_each_std_worker_pool() takes @cpu instead of @gcwq. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:34 -08:00
Tejun Heo	a1056305fa	workqueue: make freezing/thawing per-pool Instead of holding locks from both pools and then processing the pools together, make freezing/thwaing per-pool - grab locks of one pool, process it, release it and then proceed to the next pool. While this patch changes processing order across pools, order within each pool remains the same. As each pool is independent, this shouldn't break anything. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	94cf58bb29	workqueue: make hotplug processing per-pool Instead of holding locks from both pools and then processing the pools together, make hotplug processing per-pool - grab locks of one pool, process it, release it and then proceed to the next pool. rebind_workers() is updated to take and process @pool instead of @gcwq which results in a lot of de-indentation. gcwq_claim_assoc_and_lock() and its counterpart are replaced with in-line per-pool locking. While this patch changes processing order across pools, order within each pool remains the same. As each pool is independent, this shouldn't break anything. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	d565ed6309	workqueue: move global_cwq->lock to worker_pool Move gcwq->lock to pool->lock. The conversion is mostly straight-forward. Things worth noting are * In many places, this removes the need to use gcwq completely. pool is used directly instead. get_std_worker_pool() is added to help some of these conversions. This also leaves get_work_gcwq() without any user. Removed. * In hotplug and freezer paths, the pools belonging to a CPU are often processed together. This patch makes those paths hold locks of all pools, with highpri lock nested inside, to keep the conversion straight-forward. These nested lockings will be removed by following patches. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	ec22ca5eab	workqueue: move global_cwq->cpu to worker_pool Move gcwq->cpu to pool->cpu. This introduces a couple places where gcwq->pools[0].cpu is used. These will soon go away as gcwq is further reduced. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	c9e7cf273f	workqueue: move busy_hash from global_cwq to worker_pool There's no functional necessity for the two pools on the same CPU to share the busy hash table. It's also likely to be a bottleneck when implementing pools with user-specified attributes. This patch makes busy_hash per-pool. The conversion is mostly straight-forward. Changes worth noting are, * Large block of changes in rebind_workers() is moving the block inside for_each_worker_pool() as now there are separate hash tables for each pool. This changes the order of operations but doesn't break anything. * Thre for_each_worker_pool() loops in gcwq_unbind_fn() are combined into one. This again changes the order of operaitons but doesn't break anything. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	7c3eed5cd6	workqueue: record pool ID instead of CPU in work->data when off-queue Currently, when a work item is off-queue, work->data records the CPU it was last on, which is used to locate the last executing instance for non-reentrance, flushing, etc. We're in the process of removing global_cwq and making worker_pool the top level abstraction. This patch makes work->data point to the pool it was last associated with instead of CPU. After the previous WORK_OFFQ_POOL_CPU and worker_poo->id additions, the conversion is fairly straight-forward. WORK_OFFQ constants and functions are modified to record and read back pool ID instead. worker_pool_by_id() is added to allow looking up pool from ID. get_work_pool() replaces get_work_gcwq(), which is reimplemented using get_work_pool(). get_work_pool_id() replaces work_cpu(). This patch shouldn't introduce any observable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	9daf9e678d	workqueue: add worker_pool->id Add worker_pool->id which is allocated from worker_pool_idr. This will be used to record the last associated worker_pool in work->data. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	715b06b864	workqueue: introduce WORK_OFFQ_CPU_NONE Currently, when a work item is off queue, high bits of its data encodes the last CPU it was on. This is scheduled to be changed to pool ID, which will make it impossible to use WORK_CPU_NONE to indicate no association. This patch limits the number of bits which are used for off-queue cpu number to 31 (so that the max fits in an int) and uses the highest possible value - WORK_OFFQ_CPU_NONE - to indicate no association. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	35b6bb63b8	workqueue: make GCWQ_FREEZING a pool flag Make GCWQ_FREEZING a pool flag POOL_FREEZING. This patch doesn't change locking - FREEZING on both pools of a CPU are set or clear together while holding gcwq->lock. It shouldn't cause any functional difference. This leaves gcwq->flags w/o any flags. Removed. While at it, convert BUG_ON()s in freeze_workqueue_begin() and thaw_workqueues() to WARN_ON_ONCE(). This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	2464757086	workqueue: make GCWQ_DISASSOCIATED a pool flag Make GCWQ_DISASSOCIATED a pool flag POOL_DISASSOCIATED. This patch doesn't change locking - DISASSOCIATED on both pools of a CPU are set or clear together while holding gcwq->lock. It shouldn't cause any functional difference. This is part of an effort to remove global_cwq and make worker_pool the top level abstraction, which in turn will help implementing worker pools with user-specified attributes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	e34cdddb03	workqueue: use std_ prefix for the standard per-cpu pools There are currently two worker pools per cpu (including the unbound cpu) and they are the only pools in use. New class of pools are scheduled to be added and some pool related APIs will be added inbetween. Call the existing pools the standard pools and prefix them with std_. Do this early so that new APIs can use std_ prefix from the beginning. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:33 -08:00
Tejun Heo	e2905b2912	workqueue: unexport work_cpu() This function no longer has any external users. Unexport it. It will be removed later on. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2013-01-24 11:01:32 -08:00
Tejun Heo	2eaebdb33e	workqueue: move struct worker definition to workqueue_internal.h This will be used to implement an inline function to query whether %current is a workqueue worker and, if so, allow determining which work item it's executing. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2013-01-18 14:05:55 -08:00
Tejun Heo	ea138446e5	workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h Workqueue wants to expose more interface internal to kernel/. Instead of adding a new header file, repurpose kernel/workqueue_sched.h. Rename it to workqueue_internal.h and add include protector. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>	2013-01-18 14:05:55 -08:00
Tejun Heo	111c225a5f	workqueue: set PF_WQ_WORKER on rescuers PF_WQ_WORKER is used to tell scheduler that the task is a workqueue worker and needs wq_worker_sleeping/waking_up() invoked on it for concurrency management. As rescuers never participate in concurrency management, PF_WQ_WORKER wasn't set on them. There's a need for an interface which can query whether %current is executing a work item and if so which. Such interface requires a way to identify all tasks which may execute work items and PF_WQ_WORKER will be used for that. As all normal workers always have PF_WQ_WORKER set, we only need to add it to rescuers. As rescuers start with WORKER_PREP but never clear it, it's always NOT_RUNNING and there's no need to worry about it interfering with concurrency management even if PF_WQ_WORKER is set; however, unlike normal workers, rescuers currently don't have its worker struct as kthread_data(). It uses the associated workqueue_struct instead. This is problematic as wq_worker_sleeping/waking_up() expect struct worker at kthread_data(). This patch adds worker->rescue_wq and start rescuer kthreads with worker struct as kthread_data and sets PF_WQ_WORKER on rescuers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2013-01-17 17:19:58 -08:00
Tejun Heo	023f27d3d6	workqueue: fix find_worker_executing_work() brekage from hashtable conversion `42f8570f43` ("workqueue: use new hashtable implementation") incorrectly made busy workers hashed by the pointer value of worker instead of work. This broke find_worker_executing_work() which in turn broke a lot of fundamental operations of workqueue - non-reentrancy and flushing among others. The flush malfunction triggered warning in disk event code in Fengguang's automated test. write_dev_root_ (3265) used greatest stack depth: 2704 bytes left ------------[ cut here ]------------ WARNING: at /c/kernel-tests/src/stable/block/genhd.c:1574 disk_clear_events+0x\ cf/0x108() Hardware name: Bochs Modules linked in: Pid: 3328, comm: ata_id Not tainted 3.7.0-01930-gbff6343 #1167 Call Trace: [<ffffffff810997c4>] warn_slowpath_common+0x83/0x9c [<ffffffff810997f7>] warn_slowpath_null+0x1a/0x1c [<ffffffff816aea77>] disk_clear_events+0xcf/0x108 [<ffffffff811bd8be>] check_disk_change+0x27/0x59 [<ffffffff822e48e2>] cdrom_open+0x49/0x68b [<ffffffff81ab0291>] idecd_open+0x88/0xb7 [<ffffffff811be58f>] __blkdev_get+0x102/0x3ec [<ffffffff811bea08>] blkdev_get+0x18f/0x30f [<ffffffff811bebfd>] blkdev_open+0x75/0x80 [<ffffffff8118f510>] do_dentry_open+0x1ea/0x295 [<ffffffff8118f5f0>] finish_open+0x35/0x41 [<ffffffff8119c720>] do_last+0x878/0xa25 [<ffffffff8119c993>] path_openat+0xc6/0x333 [<ffffffff8119cf37>] do_filp_open+0x38/0x86 [<ffffffff81190170>] do_sys_open+0x6c/0xf9 [<ffffffff8119021e>] sys_open+0x21/0x23 [<ffffffff82c1c3d9>] system_call_fastpath+0x16/0x1b Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Sasha Levin <sasha.levin@oracle.com>	2012-12-19 11:24:06 -08:00
Tejun Heo	a2c1c57be8	workqueue: consider work function when searching for busy work items To avoid executing the same work item concurrenlty, workqueue hashes currently busy workers according to their current work items and looks up the the table when it wants to execute a new work item. If there already is a worker which is executing the new work item, the new item is queued to the found worker so that it gets executed only after the current execution finishes. Unfortunately, a work item may be freed while being executed and thus recycled for different purposes. If it gets recycled for a different work item and queued while the previous execution is still in progress, workqueue may make the new work item wait for the old one although the two aren't really related in any way. In extreme cases, this false dependency may lead to deadlock although it's extremely unlikely given that there aren't too many self-freeing work item users and they usually don't wait for other work items. To alleviate the problem, record the current work function in each busy worker and match it together with the work item address in find_worker_executing_work(). While this isn't complete, it ensures that unrelated work items don't interact with each other and in the very unlikely case where a twisted wq user triggers it, it's always onto itself making the culprit easy to spot. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Andrey Isakov <andy51@gmx.ru> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701 Cc: stable@vger.kernel.org	2012-12-18 10:56:14 -08:00
Sasha Levin	42f8570f43	workqueue: use new hashtable implementation Switch workqueues to use the new hashtable implementation. This reduces the amount of generic unrelated code in the workqueues. This patch depends on `d9b482c` ("hashtable: introduce a small and naive hashtable") which was merged in v3.6. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-18 09:21:13 -08:00
Linus Torvalds	e7b55b8fcd	Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "Nothing exciting. Just two trivial changes." * 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up() workqueue: trivial fix for return statement in work_busy()	2012-12-12 08:15:13 -08:00
Tejun Heo	fc4b514f27	workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s `8852aac25e` ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in megaraid - it allocated work_struct, casted it to delayed_work and then pass that into queue_delayed_work(). Previously, this was okay because 0 @delay short-circuited to queue_work() before doing anything with delayed_work. `8852aac25e` moved 0 @delay test into __queue_delayed_work() after sanity check on delayed_work making megaraid trigger BUG_ON(). Although megaraid is already fixed by `c1d390d8e6` ("megaraid: fix BUG_ON() from incorrect use of delayed work"), this patch converts BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such abusers, if there are more, trigger warning but don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Xiaotian Feng <xtfeng@gmail.com>	2012-12-04 07:58:47 -08:00
Joonsoo Kim	3657600040	workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up() Recently, workqueue code has gone through some changes and we found some bugs related to concurrency management operations happening on the wrong CPU. When a worker is concurrency managed (!WORKER_NOT_RUNNIG), it should be bound to its associated cpu and woken up to that cpu. Add WARN_ON_ONCE() to verify this. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-01 16:45:45 -08:00
Joonsoo Kim	999767beb1	workqueue: trivial fix for return statement in work_busy() Return type of work_busy() is unsigned int. There is return statement returning boolean value, 'false' in work_busy(). It is not problem, because 'false' may be treated '0'. However, fixing it would make code robust. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-12-01 16:45:40 -08:00
Tejun Heo	8852aac25e	workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay `8376fe22c7` ("workqueue: implement mod_delayed_work[_on]()") implemented mod_delayed_work[_on]() using the improved try_to_grab_pending(). The function is later used, among others, to replace [__]candel_delayed_work() + queue_delayed_work() combinations. Unfortunately, a delayed_work item w/ zero @delay is handled slightly differently by mod_delayed_work_on() compared to queue_delayed_work_on(). The latter skips timer altogether and directly queues it using queue_work_on() while the former schedules timer which will expire on the closest tick. This means, when @delay is zero, that [__]cancel_delayed_work() + queue_delayed_work_on() makes the target item immediately executable while mod_delayed_work_on() may induce delay of upto a full tick. This somewhat subtle difference breaks some of the converted users. e.g. block queue plugging uses delayed_work for deferred processing and uses mod_delayed_work_on() when the queue needs to be immediately unplugged. The above problem manifested as noticeably higher number of context switches under certain circumstances. The difference in behavior was caused by missing special case handling for 0 delay in mod_delayed_work_on() compared to queue_delayed_work_on(). Joonsoo Kim posted a patch to add it - ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1]. The patch was queued for 3.8 but it was described as optimization and I missed that it was a correctness issue. As both queue_delayed_work_on() and mod_delayed_work_on() use __queue_delayed_work() for queueing, it seems that the better approach is to move the 0 delay special handling to the function instead of duplicating it in mod_delayed_work_on(). Fix the problem by moving 0 delay special case handling from queue_delayed_work_on() to __queue_delayed_work(). This replaces Joonsoo's patch. [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU> Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr> LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu> LKML-Reference: <50A78AA9.5040904@iskon.hr> Cc: Joonsoo Kim <js1304@gmail.com>	2012-12-01 16:43:18 -08:00
Mike Galbraith	412d32e6c9	workqueue: exit rescuer_thread() as TASK_RUNNING A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling off, never to be seen again. In the case where this occurred, an exiting thread hit reiserfs homebrew conditional resched while holding a mutex, bringing the box to its knees. PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush" #0 [ffff8808157e7670] schedule at ffffffff8143f489 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs] #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs] #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f [exception RIP: kernel_thread_helper] RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2012-12-01 15:56:42 -08:00
Dan Magenheimer	c0158ca64d	workqueue: cancel_delayed_work() should return %false if work item is idle `57b30ae77b` ("workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()") made cancel_delayed_work() always return %true unless someone else is also trying to cancel the work item, which is broken - if the target work item is idle, the return value should be %false. try_to_grab_pending() indicates that the target work item was idle by zero return value. Use it for return. Note that this brings cancel_delayed_work() in line with __cancel_work_timer() in return value handling. Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>	2012-10-24 12:38:16 -07:00
Linus Torvalds	033d9959ed	Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...	2012-10-02 09:54:49 -07:00
Tejun Heo	7c6e72e46c	workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() `e0aecdd874` ("workqueue: use irqsafe timer for delayed_work") made try_to_grab_pending() safe to use from irq context but forgot to remove WARN_ON_ONCE(in_irq()). Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>	2012-09-20 10:03:19 -07:00
Lai Jiangshan	70369b117a	workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue_set_max_active() may increase ->max_active without activating delayed works and may make the activation order differ from the queueing order. Both aren't strictly bugs but the resulting behavior could be a bit odd. To make things more consistent, use cwq_set_max_active() helper which immediately makes use of the newly increased max_mactive if there are delayed work items and also keeps the activation order. tj: Slight update to description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-19 10:40:48 -07:00
Lai Jiangshan	9f4bd4cddb	workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() Using a helper instead of open code makes thaw_workqueues() clearer. The helper will also be used by the next patch. tj: Slight update to comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-19 10:40:48 -07:00
Tejun Heo	ed48ece27c	workqueue: reimplement work_on_cpu() using system_wq The existing work_on_cpu() implementation is hugely inefficient. It creates a new kthread, execute that single function and then let the kthread die on each invocation. Now that system_wq can handle concurrent executions, there's no advantage of doing this. Reimplement work_on_cpu() using system_wq which makes it simpler and way more efficient. stable: While this isn't a fix in itself, it's needed to fix a workqueue related bug in cpufreq/powernow-k8. AFAICS, this shouldn't break other existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org	2012-09-19 10:13:12 -07:00
Lai Jiangshan	b3f9f405a2	workqueue: remove @delayed from cwq_dec_nr_in_flight() @delayed is now always false for all callers, remove it. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 10:40:00 -07:00
Lai Jiangshan	3aa6249759	workqueue: fix possible stall on try_to_grab_pending() of a delayed work item Currently, when try_to_grab_pending() grabs a delayed work item, it leaves its linked work items alone on the delayed_works. The linked work items are always NO_COLOR and will cause future cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and may cause the whole cwq to stall. For example, state: cwq->max_active = 1, cwq->nr_active = 1 one work in cwq->pool, many in cwq->delayed_works. step1: try_to_grab_pending() removes a work item from delayed_works but leaves its NO_COLOR linked work items on it. step2: Later on, cwq_activate_first_delayed() activates the linked work item increasing ->nr_active. step3: cwq->nr_active = 1, but all activated work items of the cwq are NO_COLOR. When they finish, cwq->nr_active will not be decreased due to NO_COLOR, and no further work items will be activated from cwq->delayed_works. the cwq stalls. Fix it by ensuring the target work item is activated before stealing PENDING in try_to_grab_pending(). This ensures that all the linked work items are activated without incorrectly bumping cwq->nr_active. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org	2012-09-18 10:40:00 -07:00
Lai Jiangshan	a5b4e57d7c	workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so hotcpu_notifier() fits better than cpu_notifier(). When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same. When HOTPLUG_CPU=n, if we use cpu_notifier(), workqueue_cpu_down_callback() will be called during boot to do nothing, and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded after boot. If we use hotcpu_notifier(), we can avoid the no-op call of workqueue_cpu_down_callback() and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at build time: $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier text data bss dec hex filename 18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier 18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:23 -07:00
Lai Jiangshan	9fdf9b73d6	workqueue: use __cpuinit instead of __devinit for cpu callbacks For workqueue hotplug callbacks, it makes less sense to use __devinit which discards the memory after boot if !HOTPLUG. __cpuinit, which discards the memory after boot if !HOTPLUG_CPU fits better. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:23 -07:00
Lai Jiangshan	b2eb83d123	workqueue: rename manager_mutex to assoc_mutex Now that manager_mutex's role has changed from synchronizing manager role to excluding hotplug against manager, the name is misleading. As it is protecting the CPU-association of the gcwq now, rename it to assoc_mutex. This patch is pure rename and doesn't introduce any functional change. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:23 -07:00
Lai Jiangshan	5f7dabfd5c	workqueue: WORKER_REBIND is no longer necessary for idle rebinding Now both worker destruction and idle rebinding remove the worker from idle list while it's still idle, so list_empty(&worker->entry) can be used to test whether either is pending and WORKER_DIE to distinguish between the two instead making WORKER_REBIND unnecessary. Use list_empty(&worker->entry) to determine whether destruction or rebinding is pending. This simplifies worker state transitions. WORKER_REBIND is not needed anymore. Remove it. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:23 -07:00
Lai Jiangshan	eab6d82843	workqueue: WORKER_REBIND is no longer necessary for busy rebinding Because the old unbind/rebinding implementation wasn't atomic w.r.t. GCWQ_DISASSOCIATED manipulation which is protected by global_cwq->lock, we had to use two flags, WORKER_UNBOUND and WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with back-to-back CPU hotplug operations; otherwise, completion of rebinding while another unbinding is in progress could clear UNBIND prematurely. Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED, there's no need to use two flags. Just one is enough. Don't use WORKER_REBIND for busy rebinding. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:22 -07:00
Lai Jiangshan	ea1abd6197	workqueue: reimplement idle worker rebinding Currently rebind_workers() uses rebinds idle workers synchronously before proceeding to requesting busy workers to rebind. This is necessary because all workers on @worker_pool->idle_list must be bound before concurrency management local wake-ups from the busy workers take place. Unfortunately, the synchronous idle rebinding is quite complicated. This patch reimplements idle rebinding to simplify the code path. Rather than trying to make all idle workers bound before rebinding busy workers, we simply remove all to-be-bound idle workers from the idle list and let them add themselves back after completing rebinding (successful or not). As only workers which finished rebinding can on on the idle worker list, the idle worker list is guaranteed to have only bound workers unless CPU went down again and local wake-ups are safe. After the change, @worker_pool->nr_idle may deviate than the actual number of idle workers on @worker_pool->idle_list. More specifically, nr_idle may be non-zero while ->idle_list is empty. All users of ->nr_idle and ->idle_list are audited. The only affected one is too_many_workers() which is updated to check %false if ->idle_list is empty regardless of ->nr_idle. After this patch, rebind_workers() no longer performs the nasty idle-rebind retries which require temporary release of gcwq->lock, and both unbinding and rebinding are atomic w.r.t. global_cwq->lock. worker->idle_rebind and global_cwq->rebind_hold are now unnecessary and removed along with the definition of struct idle_rebind. Changed from V1: 1) remove unlikely from too_many_workers(), ->idle_list can be empty anytime, even before this patch, no reason to use unlikely. 2) fix a small rebasing mistake. (which is from rebasing the orignal fixing patch to for-next) 3) add a lot of comments. 4) clear WORKER_REBIND unconditionaly in idle_worker_rebind() tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-18 09:59:22 -07:00
Tejun Heo	6c1423ba5d	Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.7 This merge is necessary as Lai's CPU hotplug restructuring series depends on the CPU hotplug bug fixes in for-3.6-fixes. The merge creates one trivial conflict between the following two commits. `96e65306b8` "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic" `e2b6a6d570` "workqueue: use system_highpri_wq for highpri workers in rebind_workers()" Both add local variable definitions to the same block and can be merged in any order. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-17 16:09:09 -07:00
Lai Jiangshan	960bd11bf2	workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn() busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed (CPU is down again). This used to be okay because the flag wasn't used for anything else. However, after `25511a477` "workqueue: reimplement CPU online rebinding to handle idle workers", WORKER_REBIND is also used to command idle workers to rebind. If not cleared, the worker may confuse the next CPU_UP cycle by having REBIND spuriously set or oops / get stuck by prematurely calling idle_worker_rebind(). WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5 00() Hardware name: Bochs Modules linked in: test_wq(O-) Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3 Call Trace: [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500 [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 ---[ end trace e977cf20f4661968 ]--- BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: test_wq(O-) CPU 0 Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff810b3db0>] [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580 R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900) Stack: ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010 ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340 Call Trace: [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b RIP [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP <ffff88001e1c9de0> CR2: 0000000000000000 There was no reason to keep WORKER_REBIND on failure in the first place - WORKER_UNBOUND is guaranteed to be set in such cases preventing incorrectly activating concurrency management. Always clear WORKER_REBIND. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-17 15:42:31 -07:00
Lai Jiangshan	ee378aa49b	workqueue: fix possible idle worker depletion across CPU hotplug To simplify both normal and CPU hotplug paths, worker management is prevented while CPU hoplug is in progress. This is achieved by CPU hotplug holding the same exclusion mechanism used by workers to ensure there's only one manager per pool. If someone else seems to be performing the manager role, workers proceed to execute work items. CPU hotplug using the same mechanism can lead to idle worker depletion because all workers could proceed to execute work items while CPU hotplug is in progress and CPU hotplug itself wouldn't actually perform the worker management duty - it doesn't guarantee that there's an idle worker left when it releases management. This idle worker depletion, under extreme circumstances, can break forward-progress guarantee and thus lead to deadlock. This patch fixes the bug by using separate mechanisms for manager exclusion among workers and hotplug exclusion. For manager exclusion, POOL_MANAGING_WORKERS which was restored by the previous patch is used. pool->manager_mutex is now only used for exclusion between the elected manager and CPU hotplug. The elected manager won't proceed without holding pool->manager_mutex. This ensures that the worker which won the manager position can't skip managing while CPU hotplug is in progress. It will block on manager_mutex and perform management after CPU hotplug is complete. Note that hotplug may happen while waiting for manager_mutex. A manager isn't either on idle or busy list and thus the hoplug code can't unbind/rebind it. Make the manager handle its own un/rebinding. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-10 10:05:54 -07:00
Lai Jiangshan	552a37e936	workqueue: restore POOL_MANAGING_WORKERS This patch restores POOL_MANAGING_WORKERS which was replaced by pool->manager_mutex by `6037315269` "workqueue: use mutex for global_cwq manager exclusion". There's a subtle idle worker depletion bug across CPU hotplug events and we need to distinguish an actual manager and CPU hotplug preventing management. POOL_MANAGING_WORKERS will be used for the former and manager_mutex the later. This patch just lays POOL_MANAGING_WORKERS on top of the existing manager_mutex and doesn't introduce any synchronization changes. The next patch will update it. Note that this patch fixes a non-critical anomaly where too_many_workers() may return %true spuriously while CPU hotplug is in progress. While the issue could schedule idle timer spuriously, it didn't trigger any actual misbehavior. tj: Rewrote patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-09-10 10:04:54 -07:00
Tejun Heo	ec58815ab0	workqueue: fix possible deadlock in idle worker rebinding Currently, rebind_workers() and idle_worker_rebind() are two-way interlocked. rebind_workers() waits for idle workers to finish rebinding and rebound idle workers wait for rebind_workers() to finish rebinding busy workers before proceeding. Unfortunately, this isn't enough. The second wait from idle workers is implemented as follows. wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND)); rebind_workers() clears WORKER_REBIND, wakes up the idle workers and then returns. If CPU hotplug cycle happens again before one of the idle workers finishes the above wait_event(), rebind_workers() will repeat the first part of the handshake - set WORKER_REBIND again and wait for the idle worker to finish rebinding - and this leads to deadlock because the idle worker would be waiting for WORKER_REBIND to clear. This is fixed by adding another interlocking step at the end - rebind_workers() now waits for all the idle workers to finish the above WORKER_REBIND wait before returning. This ensures that all rebinding steps are complete on all idle workers before the next hotplug cycle can happen. This problem was diagnosed by Lai Jiangshan who also posted a patch to fix the issue, upon which this patch is based. This is the minimal fix and further patches are scheduled for the next merge window to simplify the CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>	2012-09-05 16:10:15 -07:00
Tejun Heo	90beca5de5	workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function This doesn't make any functional difference and is purely to help the next patch to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>	2012-09-05 16:10:14 -07:00
Lai Jiangshan	96e65306b8	workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic The compiler may compile the following code into TWO write/modify instructions. worker->flags &= ~WORKER_UNBOUND; worker->flags \|= WORKER_REBIND; so the other CPU may temporarily see worker->flags which doesn't have either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup prematurely. Fix it by using single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update it to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2012-09-04 17:04:45 -07:00
Tejun Heo	57b30ae77b	workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() cancel_delayed_work() can't be called from IRQ handlers due to its use of del_timer_sync() and can't cancel work items which are already transferred from timer to worklist. Also, unlike other flush and cancel functions, a canceled delayed_work would still point to the last associated cpu_workqueue. If the workqueue is destroyed afterwards and the work item is re-used on a different workqueue, the queueing code can oops trying to dereference already freed cpu_workqueue. This patch reimplements cancel_delayed_work() using try_to_grab_pending() and set_work_cpu_and_clear_pending(). This allows the function to be called from IRQ handlers and makes its behavior consistent with other flush / cancel functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2012-08-21 13:18:24 -07:00
Tejun Heo	e0aecdd874	workqueue: use irqsafe timer for delayed_work Up to now, for delayed_works, try_to_grab_pending() couldn't be used from IRQ handlers because IRQs may happen while delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN. This patch makes delayed_work use the new TIMER_IRQSAFE flag for delayed_work->timer. This makes try_to_grab_pending() and thus mod_delayed_work_on() safe to call from IRQ handlers. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-21 13:18:24 -07:00
Tejun Heo	ae930e0f4e	workqueue: gut system_nrt[_freezable]_wq() Now that all workqueues are non-reentrant, system[_freezable]_wq() are equivalent to system_nrt[_freezable]_wq(). Replace the latter with wrappers around system[_freezable]_wq(). The wrapping goes through inline functions so that __deprecated can be added easily. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-20 14:51:23 -07:00
Tejun Heo	606a5020b9	workqueue: gut flush[_delayed]_work_sync() Now that all workqueues are non-reentrant, flush[_delayed]_work_sync() are equivalent to flush[_delayed]_work(). Drop the separate implementation and make them thin wrappers around flush[_delayed]_work(). * start_flush_work() no longer takes @wait_executing as the only left user - flush_work() - always sets it to %true. * __cancel_work_timer() uses flush_work() instead of wait_on_work(). Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-20 14:51:23 -07:00
Tejun Heo	dbf2576e37	workqueue: make all workqueues non-reentrant By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API. * There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use non-reentrant/ordered workqueue but this is error-prone. * flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning. * flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address competing queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce. * Similar problems exist for flush_delayed_work[_sync](). Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers. This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and if a work item is queued and executing, both are on the same CPU. The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item. This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and they don't jump across CPUs all the time. In a micro benchmark to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs a number (10M) of times with busy_hash table densely populated, the difference was around 3%. While the overhead is measureable, it is only visible in pathological cases and the difference isn't huge. This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-20 14:51:23 -07:00
Valentin Ilie	044c782ce3	workqueue: fix checkpatch issues Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>	2012-08-20 13:37:07 -07:00
Joonsoo Kim	7635d2fd7f	workqueue: use system_highpri_wq for unbind_work To speed cpu down processing up, use system_highpri_wq. As scheduling priority of workers on it is higher than system_wq and it is not contended by other normal works on this cpu, work on it is processed faster than system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:16 -07:00
Joonsoo Kim	e2b6a6d570	workqueue: use system_highpri_wq for highpri workers in rebind_workers() In rebind_workers(), we do inserting a work to rebind to cpu for busy workers. Currently, in this case, we use only system_wq. This makes a possible error situation as there is mismatch between cwq->pool and worker->pool. To prevent this, we should use system_highpri_wq for highpri worker to match theses. This implements it. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:15 -07:00
Joonsoo Kim	1aabe902ca	workqueue: introduce system_highpri_wq Commit `3270476a6c` ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker or highpri worker. But, we don't consider this difference in rebind_workers(), we use just system_wq for highpri worker. It makes mismatch between cwq->pool and worker->pool. It doesn't make error in current implementation, but possible in the future. Now, we introduce system_highpri_wq to use proper cwq for highpri workers in rebind_workers(). Following patch fix this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:15 -07:00
Joonsoo Kim	e42986de48	workqueue: change value of lcpu in __queue_delayed_work_on() We assign cpu id into work struct's data field in __queue_delayed_work_on(). In current implementation, when work is come in first time, current running cpu id is assigned. If we do __queue_delayed_work_on() with CPU A on CPU B, __queue_work() invoked in delayed_work_timer_fn() go into the following sub-optimal path in case of WQ_NON_REENTRANT. gcwq = get_gcwq(cpu); if (wq->flags & WQ_NON_REENTRANT && (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) { Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND. It is sufficient to prevent to go into sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:15 -07:00
Joonsoo Kim	b75cac9368	workqueue: correct req_cpu in trace_workqueue_queue_work() When we do tracing workqueue_queue_work(), it records requested cpu. But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, requested cpu is changed as local cpu. In case of @wq->flag & WQ_UNBOUND, above change is not occured, therefore it is reasonable to correct it. Use temporary local variable for storing requested cpu. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:15 -07:00
Joonsoo Kim	330dad5b9c	workqueue: use enum value to set array size of pools in gcwq Commit `3270476a6c` ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent size of pools, definition of worker_pool in gcwq doesn't use it. Using it makes code robust and prevent future mistakes. So change code to use this enum value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-16 14:21:15 -07:00
Tejun Heo	23657bb192	workqueue: add missing wmb() in clear_work_data() Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com>	2012-08-13 17:08:19 -07:00
Tejun Heo	1265057fa0	workqueue: fix CPU binding of flush_delayed_work[_sync]() delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2012-08-13 16:27:55 -07:00
Tejun Heo	8376fe22c7	workqueue: implement mod_delayed_work[_on]() Workqueue was lacking a mechanism to modify the timeout of an already pending delayed_work. delayed_work users have been working around this using several methods - using an explicit timer + work item, messing directly with delayed_work->timer, and canceling before re-queueing, all of which are error-prone and/or ugly. This patch implements mod_delayed_work[_on]() which behaves similarly to mod_timer() - if the delayed_work is idle, it's queued with the given delay; otherwise, its timeout is modified to the new value. Zero @delay guarantees immediate execution. v2: Updated to reflect try_to_grab_pending() changes. Now safe to be called from bh context. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com>	2012-08-03 10:30:47 -07:00
Tejun Heo	bbb68dfaba	workqueue: mark a work item being canceled as such There can be two reasons try_to_grab_pending() can fail with -EAGAIN. One is when someone else is queueing or deqeueing the work item. With the previous patches, it is guaranteed that PENDING and queued state will soon agree making it safe to busy-retry in this case. The other is if multiple __cancel_work_timer() invocations are racing one another. __cancel_work_timer() grabs PENDING and then waits for running instances of the target work item on all CPUs while holding PENDING and !queued. try_to_grab_pending() invoked from another task will keep returning -EAGAIN while the current owner is waiting. Not distinguishing the two cases is okay because __cancel_work_timer() is the only user of try_to_grab_pending() and it invokes wait_on_work() whenever grabbing fails. For the first case, busy looping should be fine but wait_on_work() doesn't cause any critical problem. For the latter case, the new contender usually waits for the same condition as the current owner, so no unnecessarily extended busy-looping happens. Combined, these make __cancel_work_timer() technically correct even without irq protection while grabbing PENDING or distinguishing the two different cases. While the current code is technically correct, not distinguishing the two cases makes it difficult to use try_to_grab_pending() for other purposes than canceling because it's impossible to tell whether it's safe to busy-retry grabbing. This patch adds a mechanism to mark a work item being canceled. try_to_grab_pending() now disables irq on success and returns -EAGAIN to indicate that grabbing failed but PENDING and queued states are gonna agree soon and it's safe to busy-loop. It returns -ENOENT if the work item is being canceled and it may stay PENDING && !queued for arbitrary amount of time. __cancel_work_timer() is modified to mark the work canceling with WORK_OFFQ_CANCELING after grabbing PENDING, thus making try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't necessary for correctness but makes it consistent with other future users of try_to_grab_pending(). v2: try_to_grab_pending() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Updated so that try_to_grab_pending() disables irq on success rather than requiring preemption disabled by the caller. This makes busy-looping easier and will allow try_to_grap_pending() to be used from bh/irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>	2012-08-03 10:30:46 -07:00
Tejun Heo	36e227d242	workqueue: reorganize try_to_grab_pending() and __cancel_timer_work() * Use bool @is_dwork instead of @timer and let try_to_grab_pending() use to_delayed_work() to determine the delayed_work address. * Move timer handling from __cancel_work_timer() to try_to_grab_pending(). * Make try_to_grab_pending() use -EAGAIN instead of -1 for busy-looping and drop the ret local variable. * Add proper function comment to try_to_grab_pending(). This makes the code a bit easier to understand and will ease further changes. This patch doesn't make any functional change. v2: Use @is_dwork instead of @timer. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:46 -07:00
Tejun Heo	7beb2edf44	workqueue: factor out __queue_delayed_work() from queue_delayed_work_on() This is to prepare for mod_delayed_work[_on]() and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:46 -07:00
Tejun Heo	b549007727	workqueue: introduce WORK_OFFQ_FLAG_* Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise, WORK_STRUCT_CWQ is clear and the bits contain the last CPU number - either a real CPU number or one of WORK_CPU_. Scheduled addition of mod_delayed_work[_on]() requires an additional flag, which is used only while a work item is off queue. There are more than enough bits to represent off-queue CPU number on both 32 and 64bits. This patch introduces WORK_OFFQ_FLAG_ which occupy the lower part of the @work->data high bits while off queue. This patch doesn't define any actual OFFQ flag yet. Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to make room for OFFQ flags. To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON() to check that there are enough bits to accomodate off-queue CPU number is added. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:46 -07:00
Tejun Heo	bf4ede014e	workqueue: move try_to_grab_pending() upwards try_to_grab_pending() will be used by to-be-implemented mod_delayed_work[_on](). Move try_to_grab_pending() and related functions above queueing functions. This patch only moves functions around. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:46 -07:00
Tejun Heo	715f130080	workqueue: fix zero @delay handling of queue_delayed_work_on() If @delay is zero and the dealyed_work is idle, queue_delayed_work() queues it for immediate execution; however, queue_delayed_work_on() lacks this logic and always goes through timer regardless of @delay. This patch moves 0 @delay handling logic from queue_delayed_work() to queue_delayed_work_on() so that both functions behave the same. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:46 -07:00
Tejun Heo	57469821fd	workqueue: unify local CPU queueing handling Queueing functions have been using different methods to determine the local CPU. * queue_work() superflously uses get/put_cpu() to acquire and hold the local CPU across queue_work_on(). * delayed_work_timer_fn() uses smp_processor_id(). * queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu which is interpreted as the local CPU. * flush_delayed_work[_sync]() were using raw_smp_processor_id(). * __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the target workqueue is bound one but nobody uses this. This patch converts all functions to uniformly use %WORK_CPU_UNBOUND to indicate local CPU and use the local binding feature of __queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling in __queue_work(). Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:45 -07:00
Tejun Heo	d8e794dfd5	workqueue: set delayed_work->timer function on initialization delayed_work->timer.function is currently initialized during queue_delayed_work_on(). Export delayed_work_timer_fn() and set delayed_work timer function during delayed_work initialization together with other fields. This ensures the timer function is always valid on an initialized delayed_work. This is to help mod_delayed_work() implementation. To detect delayed_work users which diddle with the internal timer, trigger WARN if timer function doesn't match on queue. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:45 -07:00
Tejun Heo	8930caba3d	workqueue: disable irq while manipulating PENDING Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access to the target work item. They first try to claim the bit and proceed with queueing only after that succeeds and there's a window between PENDING being set and the actual queueing where the task can be interrupted or preempted. There's also a similar window in process_one_work() when clearing PENDING. A work item is dequeued, gcwq->lock is released and then PENDING is cleared and the worker might get interrupted or preempted between releasing gcwq->lock and clearing PENDING. cancel[_delayed]_work_sync() tries to claim or steal PENDING. The function assumes that a work item with PENDING is either queued or in the process of being [de]queued. In the latter case, it busy-loops until either the work item loses PENDING or is queued. If canceling coincides with the above described interrupts or preemptions, the canceling task will busy-loop while the queueing or executing task is preempted. This patch keeps irq disabled across claiming PENDING and actual queueing and moves PENDING clearing in process_one_work() inside gcwq->lock so that busy looping from PENDING && !queued doesn't wait for interrupted/preempted tasks. Note that, in process_one_work(), setting last CPU and clearing PENDING got merged into single operation. This removes possible long busy-loops and will allow using try_to_grab_pending() from bh and irq contexts. v2: __queue_work() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Disable irq instead of preemption. IRQ will be disabled while grabbing gcwq->lock later anyway and this allows using try_to_grab_pending() from bh and irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com>	2012-08-03 10:30:45 -07:00
Tejun Heo	959d1af8cf	workqueue: add missing smp_wmb() in process_one_work() WORK_STRUCT_PENDING is used to claim ownership of a work item and process_one_work() releases it before starting execution. When someone else grabs PENDING, all pre-release updates to the work item should be visible and all updates made by the new owner should happen afterwards. Grabbing PENDING uses test_and_set_bit() and thus has a full barrier; however, clearing doesn't have a matching wmb. Given the preceding spin_unlock and use of clear_bit, I don't believe this can be a problem on an actual machine and there hasn't been any related report but it still is theretically possible for clear_pending to permeate upwards and happen before work->entry update. Add an explicit smp_wmb() before work_clear_pending(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org	2012-08-03 10:30:45 -07:00
Tejun Heo	d4283e9378	workqueue: make queueing functions return bool All queueing functions return 1 on success, 0 if the work item was already pending. Update them to return bool instead. This signifies better that they don't return 0 / -errno. This is cleanup and doesn't cause any functional difference. While at it, fix comment opening for schedule_work_on(). Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:44 -07:00
Tejun Heo	0a13c00e9d	workqueue: reorder queueing functions so that _on() variants are on top Currently, queue/schedule[_delayed]_work_on() are located below the counterpart without the _on postifx even though the latter is usually implemented using the former. Swap them. This is cleanup and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-08-03 10:30:44 -07:00
Tejun Heo	6fec10a1a5	workqueue: fix spurious CPU locality WARN from process_one_work() `25511a4776` "workqueue: reimplement CPU online rebinding to handle idle workers" added CPU locality sanity check in process_one_work(). It triggers if a worker is executing on a different CPU without UNBOUND or REBIND set. This works for all normal workers but rescuers can trigger this spuriously when they're serving the unbound or a disassociated global_cwq - rescuers don't have either flag set and thus its gcwq->cpu can be a different value including %WORK_CPU_UNBOUND. Fix it by additionally testing %GCWQ_DISASSOCIATED. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Refence: <20120721213656.GA7783@linux.vnet.ibm.com>	2012-07-22 10:16:34 -07:00
Tejun Heo	8db25e7891	workqueue: simplify CPU hotplug code With trustee gone, CPU hotplug code can be simplified. * gcwq_claim/release_management() now grab and release gcwq lock too respectively and gained _and_lock and _and_unlock postfixes. * All CPU hotplug logic was implemented in workqueue_cpu_callback() which was called by workqueue_cpu_up/down_callback() for the correct priority. This was because up and down paths shared a lot of logic, which is no longer true. Remove workqueue_cpu_callback() and move all hotplug logic into the two actual callbacks. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:28 -07:00
Tejun Heo	628c78e7ea	workqueue: remove CPU offline trustee With the previous changes, a disassociated global_cwq now can run as an unbound one on its own - it can create workers as necessary to drain remaining works after the CPU has been brought down and manage the number of workers using the usual idle timer mechanism making trustee completely redundant except for the actual unbinding operation. This patch removes the trustee and let a disassociated global_cwq manage itself. Unbinding is moved to a work item (for CPU affinity) which is scheduled and flushed from CPU_DONW_PREPARE. This patch moves nr_running clearing outside gcwq and manager locks to simplify the code. As nr_running is unused at the point, this is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	3ce6337730	workqueue: don't butcher idle workers on an offline CPU Currently, during CPU offlining, after all pending work items are drained, the trustee butchers all workers. Also, on CPU onlining failure, workqueue_cpu_callback() ensures that the first idle worker is destroyed. Combined, these guarantee that an offline CPU doesn't have any worker for it once all the lingering work items are finished. This guarantee isn't really necessary and makes CPU on/offlining more expensive than needs to be, especially for platforms which use CPU hotplug for powersaving. This patch lets offline CPUs removes idle worker butchering from the trustee and let a CPU which failed onlining keep the created first worker. The first worker is created if the CPU doesn't have any during CPU_DOWN_PREPARE and started right away. If onlining succeeds, the rebind_workers() call in CPU_ONLINE will rebind it like any other workers. If onlining fails, the worker is left alone till the next try. This makes CPU hotplugs cheaper by allowing global_cwqs to keep workers across them and simplifies code. Note that trustee doesn't re-arm idle timer when it's done and thus the disassociated global_cwq will keep all workers until it comes back online. This will be improved by further patches. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	25511a4776	workqueue: reimplement CPU online rebinding to handle idle workers Currently, if there are left workers when a CPU is being brough back online, the trustee kills all idle workers and scheduled rebind_work so that they re-bind to the CPU after the currently executing work is finished. This works for busy workers because concurrency management doesn't try to wake up them from scheduler callbacks, which require the target task to be on the local run queue. The busy worker bumps concurrency counter appropriately as it clears WORKER_UNBOUND from the rebind work item and it's bound to the CPU before returning to the idle state. To reduce CPU on/offlining overhead (as many embedded systems use it for powersaving) and simplify the code path, workqueue is planned to be modified to retain idle workers across CPU on/offlining. This patch reimplements CPU online rebinding such that it can also handle idle workers. As noted earlier, due to the local wakeup requirement, rebinding idle workers is tricky. All idle workers must be re-bound before scheduler callbacks are enabled. This is achieved by interlocking idle re-binding. Idle workers are requested to re-bind and then hold until all idle re-binding is complete so that no bound worker starts executing work item. Only after all idle workers are re-bound and parked, CPU_ONLINE proceeds to release them and queue rebind work item to busy workers thus guaranteeing scheduler callbacks aren't invoked until all idle workers are ready. worker_rebind_fn() is renamed to busy_worker_rebind_fn() and idle_worker_rebind() for idle workers is added. Rebinding logic is moved to rebind_workers() and now called from CPU_ONLINE after flushing trustee. While at it, add CPU sanity check in worker_thread(). Note that now a worker may become idle or the manager between trustee release and rebinding during CPU_ONLINE. As the previous patch updated create_worker() so that it can be used by regular manager while unbound and this patch implements idle re-binding, this is safe. This prepares for removal of trustee and keeping idle workers across CPU hotplugs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	bc2ae0f5bb	workqueue: drop @bind from create_worker() Currently, create_worker()'s callers are responsible for deciding whether the newly created worker should be bound to the associated CPU and create_worker() sets WORKER_UNBOUND only for the workers for the unbound global_cwq. Creation during normal operation is always via maybe_create_worker() and @bind is true. For workers created during hotplug, @bind is false. Normal operation path is planned to be used even while the CPU is going through hotplug operations or offline and this static decision won't work. Drop @bind from create_worker() and decide whether to bind by looking at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND autmatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED while create_worker() is in progress, the flag is now allowed to be changed only while holding all manager_mutexes on the global_cwq. This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's back. CPU_ONLINE no longer clears DISASSOCIATED before flushing trustee, which clears DISASSOCIATED before rebinding remaining workers if asked to release. For cases where trustee isn't around, CPU_ONLINE clears DISASSOCIATED after flushing trustee. Also, now, first_idle has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE while binding it. These convolutions will soon be removed by further simplification of CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	6037315269	workqueue: use mutex for global_cwq manager exclusion POOL_MANAGING_WORKERS is used to ensure that at most one worker takes the manager role at any given time on a given global_cwq. Trustee later hitched on it to assume manager adding blocking wait for the bit. As trustee already needed a custom wait mechanism, waiting for MANAGING_WORKERS was rolled into the same mechanism. Trustee is scheduled to be removed. This patch separates out MANAGING_WORKERS wait into per-pool mutex. Workers use mutex_trylock() to test for manager role and trustee uses mutex_lock() to claim manager roles. gcwq_claim/release_management() helpers are added to grab and release manager roles of all pools on a global_cwq. gcwq_claim_management() always grabs pool manager mutexes in ascending pool index order and uses pool index as lockdep subclass. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	403c821d45	workqueue: ROGUE workers are UNBOUND workers Currently, WORKER_UNBOUND is used to mark workers for the unbound global_cwq and WORKER_ROGUE is used to mark workers for disassociated per-cpu global_cwqs. Both are used to make the marked worker skip concurrency management and the only place they make any difference is in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling idle timer, which can easily be replaced with trustee state testing. This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops WORKER_ROGUE. This is to prepare for removing trustee and handling disassociated global_cwqs as unbound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:27 -07:00
Tejun Heo	f2d5a0ee06	workqueue: drop CPU_DYING notifier operation Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED. This was necessary because workqueue's CPU_DOWN_PREPARE happened before other DOWN_PREPARE notifiers and workqueue needed to stay associated across the rest of DOWN_PREPARE. After the previous patch, workqueue's DOWN_PREPARE happens after others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and let the trustee set GCWQ_DISASSOCIATED after disabling concurrency management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:26 -07:00
Tejun Heo	6575820221	workqueue: perform cpu down operations from low priority cpu_notifier() Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2012-07-17 12:39:26 -07:00
Tejun Heo	3270476a6c	workqueue: reimplement WQ_HIGHPRI using a separate worker_pool WQ_HIGHPRI was implemented by queueing highpri work items at the head of the global worklist. Other than queueing at the head, they weren't handled differently; unfortunately, this could lead to execution latency of a few seconds on heavily loaded systems. Now that workqueue code has been updated to deal with multiple worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using a separate worker_pool. NR_WORKER_POOLS is bumped to two and gcwq->pools[0] is used for normal pri work items and ->pools[1] for highpri. Highpri workers get -20 nice level and has 'H' suffix in their names. Note that this change increases the number of kworkers per cpu. POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain wakeup code in process_one_work() are no longer used and removed. This allows proper prioritization of highpri work items and removes high execution latency of highpri work items. v2: nr_running indexing bug in get_pool_nr_running() fixed. v3: Refreshed for the get_pool_nr_running() update in the previous patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josh Hunt <joshhunt00@gmail.com> LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com>	2012-07-13 22:24:45 -07:00
Tejun Heo	4ce62e9e30	workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool() Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code paths which need to manipulate all pools in a gcwq to use them. NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates over only @gcwq->pool. Note that nr_running is per-pool property and converted to an array with NR_WORKER_POOLS elements and renamed to pool_nr_running. Note that get_pool_nr_running() currently assumes 0 index. The next patch will make use of non-zero index. The changes in this patch are mechanical and don't caues any functional difference. This is to prepare for multiple pools per gcwq. v2: nr_running indexing bug in get_pool_nr_running() fixed. v3: Pointer to array is stupid. Don't use it in get_pool_nr_running() as suggested by Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-13 22:16:44 -07:00

... 2 3 4 5 6 ...

579 Commits