linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Martin K. Petersen	f31dc1cd49	block: Consolidate command flag and queue limit checks for merges - blk_check_merge_flags() verifies that cmd_flags / bi_rw are compatible. This function is called for both req-req and req-bio merging. - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used to query the maximum sector count for a given request or queue. The calls will return the right value from the queue limits given the type of command (RW, discard, write same, etc.) Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-20 14:31:41 +02:00
Martin K. Petersen	e2a60da74f	block: Clean up special command handling logic Remove special-casing of non-rw fs style requests (discard). The nomerge flags are consolidated in blk_types.h, and rq_mergeable() and bio_mergeable() have been modified to use them. bio_is_rw() is used in place of bio_has_data() a few places. This is done to to distinguish true reads and writes from other fs type requests that carry a payload (e.g. write same). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-20 14:31:38 +02:00
Jaehoon Chung	e32463b2f7	block: remove the duplicated setting for congestion_threshold Before call the blk_queue_congestion_threshold(), the blk_queue_congestion_threshold() is already called at blk_queue_make_rquest(). Because this code is the duplicated, it has removed. Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 12:44:10 +02:00
Kent Overstreet	bf800ef181	block: Add bio_clone_bioset(), bio_clone_kmalloc() Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch adedd. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:39 +02:00
Kent Overstreet	4254bba17d	block: Kill bi_destructor Now that we've got generic code for freeing bios allocated from bio pools, this isn't needed anymore. This patch also makes bio_free() static, since without bi_destructor there should be no need for it to be called anywhere else. bio_free() is now only called from bio_put, so we can refactor those a bit - move some code from bio_put() to bio_free() and kill the redundant bio->bi_next = NULL. v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL v7: No #define BIO_KMALLOC_POOL anymore Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:39 +02:00
Kent Overstreet	1e2a410ff7	block: Ues bi_pool for bio_integrity_alloc() Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset() and drop bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Yi Zou	37d7b34f05	block: rate-limit the error message from failing commands When performing a cable pull test w/ active stress I/O using fio over a dual port Intel 82599 FCoE CNA, w/ 256LUNs on one port and about 32LUNs on the other, it is observed that the system becomes not usable due to scsi-ml being busy printing the error messages for all the failing commands. I don't believe this problem is specific to FCoE and these commands are anyway failing due to link being down (DID_NO_CONNECT), just rate-limit the messages here to solve this issue. v2->v1: use __ratelimit() as Tomas Henzl mentioned as the proper way for rate-limit per function. However, in this case, the failed i/o gets to blk_end_request_err() and then blk_update_request(), which also has to be rate-limited, as added in the v2 of this patch. v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip. Signed-off-by: Yi Zou <yi.zou@intel.com> Cc: www.Open-FCoE.org <devel@open-fcoe.org> Cc: Tomas Henzl <thenzl@redhat.com> Cc: <linux-scsi@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-08-30 16:26:25 -07:00
Tejun Heo	136b5721d7	workqueue: deprecate __cancel_delayed_work() Now that cancel_delayed_work() can be safely called from IRQ handlers, there's no reason to use __cancel_delayed_work(). Use cancel_delayed_work() instead of __cancel_delayed_work() and mark the latter deprecated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Roland Dreier <roland@kernel.org> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>	2012-08-21 13:18:24 -07:00
Tejun Heo	e7c2f96744	workqueue: use mod_delayed_work() instead of __cancel + queue Now that mod_delayed_work() is safe to call from IRQ handlers, __cancel_delayed_work() followed by queue_delayed_work() can be replaced with mod_delayed_work(). Most conversions are straight-forward except for the following. * net/core/link_watch.c: linkwatch_schedule_work() was doing a quite elaborate dancing around its delayed_work. Collapse it such that linkwatch_work is queued for immediate execution if LW_URGENT and existing timer is kept otherwise. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>	2012-08-21 13:18:24 -07:00
NeilBrown	74018dc306	blk: pass from_schedule to non-request unplug functions. This will allow md/raid to know why the unplug was called, and will be able to act according - if !from_schedule it is safe to perform tasks which could themselves schedule. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:15 +02:00
Shaohua Li	2a7d5559b3	block: stack unplug MD raid1 prepares to dispatch request in unplug callback. If make_request in low level queue also uses unplug callback to dispatch request, the low level queue's unplug callback will not be called. Recheck the callback list helps this case. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:15 +02:00
NeilBrown	9cbb175088	blk: centralize non-request unplug handling. Both md and umem has similar code for getting notified on an blk_finish_plug event. Centralize this code in block/ and allow each driver to provide its distinctive difference. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-07-31 09:08:14 +02:00
Tejun Heo	a051661ca6	blkcg: implement per-blkg request allocation Currently, request_queue has one request_list to allocate requests from regardless of blkcg of the IO being issued. When the unified request pool is used up, cfq proportional IO limits become meaningless - whoever grabs the next request being freed wins the race regardless of the configured weights. This can be easily demonstrated by creating a blkio cgroup w/ very low weight, put a program which can issue a lot of random direct IOs there and running a sequential IO from a different cgroup. As soon as the request pool is used up, the sequential IO bandwidth crashes. This patch implements per-blkg request_list. Each blkg has its own request_list and any IO allocates its request from the matching blkg making blkcgs completely isolated in terms of request allocation. * Root blkcg uses the request_list embedded in each request_queue, which was renamed to @q->root_rl from @q->rq. While making blkcg rl handling a bit harier, this enables avoiding most overhead for root blkcg. * Queue fullness is properly per request_list but bdi isn't blkcg aware yet, so congestion state currently just follows the root blkcg. As writeback isn't aware of blkcg yet, this works okay for async congestion but readahead may get the wrong signals. It's better than blkcg completely collapsing with shared request_list but needs to be improved with future changes. * After this change, each block cgroup gets a full request pool making resource consumption of each cgroup higher. This makes allowing non-root users to create cgroups less desirable; however, note that allowing non-root users to directly manage cgroups is already severely broken regardless of this patch - each block cgroup consumes kernel memory and skews IO weight (IO weights are not hierarchical). v2: queue-sysfs.txt updated and patch description udpated as suggested by Vivek. v3: blk_get_rl() wasn't checking error return from blkg_lookup_create() and may cause oops on lookup failure. Fix it by falling back to root_rl on blkg lookup failures. This problem was spotted by Rakesh Iyer <rni@google.com>. v4: Updated to accomodate `458f27a982` "block: Avoid missed wakeup in request waitqueue". blk_drain_queue() now wakes up waiters on all blkg->rl on the target queue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-26 18:42:49 -04:00
Tejun Heo	5b788ce3e2	block: prepare for multiple request_lists Request allocation is about to be made per-blkg meaning that there'll be multiple request lists. * Make queue full state per request_list. blk_queue_full() functions are renamed to blk_rl_full() and takes @rl instead of @q. * Rename blk_init_free_list() to blk_init_rl() and make it take @rl instead of @q. Also add @gfp_mask parameter. * Add blk_exit_rl() instead of destroying rl directly from blk_release_queue(). * Add request_list->q and make request alloc/free functions - blk_free_request(), [__]freed_request(), __get_request() - take @rl instead of @q. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-25 11:53:52 +02:00
Tejun Heo	8a5ecdd428	block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated to use q->nr_rqs[] instead of q->rq.count[]. These counters separates queue-wide request statistics from the request list and allow implementation of per-queue request allocation. While at it, properly indent fields of struct request_list. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-25 11:53:51 +02:00
Tejun Heo	7f4b35d155	block: allocate io_context upfront Block layer very lazy allocation of ioc. It waits until the moment ioc is absolutely necessary; unfortunately, that time could be inside queue lock and __get_request() performs unlock - try alloc - retry dancing. Just allocate it up-front on entry to block layer. We're not saving the rain forest by deferring it to the last possible moment and complicating things unnecessarily. This patch is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-25 11:53:50 +02:00
Tejun Heo	a06e05e6af	block: refactor get_request[_wait]() Currently, there are two request allocation functions - get_request() and get_request_wait(). The former tries to allocate a request once and the latter keeps retrying until it succeeds. The latter wraps the former and keeps retrying until allocation succeeds. The combination of two functions deliver fallible non-wait allocation, fallible wait allocation and unfailing wait allocation. However, given that forward progress is guaranteed, fallible wait allocation isn't all that useful and in fact nobody uses it. This patch simplifies the interface as follows. * get_request() is renamed to __get_request() and is only used by the wrapper function. * get_request_wait() is renamed to get_request(). It now takes @gfp_mask and retries iff it contains %__GFP_WAIT. This patch doesn't introduce any functional change and is to prepare for further updates to request allocation path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-25 11:53:49 +02:00
Tejun Heo	a91a5ac685	mempool: add @gfp_mask to mempool_create_node() mempool_create_node() currently assumes %GFP_KERNEL. Its only user, blk_init_free_list(), is about to be updated to use other allocation flags - add @gfp_mask argument to the function. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-25 11:53:47 +02:00
Asias He	5e5cfac0c6	block: Mitigate lock unbalance caused by lock switching Commit `777eb1bf15` disconnects externally supplied queue_lock before blk_drain_queue(). Switching the lock would introduce lock unbalance because theads which have taken the external lock might unlock the internal lock in the during the queue drain. This patch mitigate this by disconnecting the lock after the queue draining since queue draining makes a lot of request_queue users go away. However, please note, this patch only makes the problem less likely to happen. Anyone who still holds a ref might try to issue a new request on a dead queue after the blk_cleanup_queue() finishes draining, the lock unbalance might still happen in this case. ===================================== [ BUG: bad unlock balance detected! ] 3.4.0+ #288 Not tainted ------------------------------------- fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at: [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380 but there are no more locks to release! other info that might help us debug this: 1 lock held by fio/17706: #0: (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>] get_request_wait+0x19a/0x250 stack backtrace: Pid: 17706, comm: fio Not tainted 3.4.0+ #288 Call Trace: [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380 [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100 [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330 [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0 [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0 [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0 [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380 [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380 [<ffffffff810e0079>] lock_release+0xd9/0x250 [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40 [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380 [<ffffffff81328faa>] generic_make_request+0xca/0x100 [<ffffffff81329056>] submit_bio+0x76/0xf0 [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60 [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70 [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0 [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80 [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60 [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80 [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60 [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80 [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760 [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0 [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0 [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0 [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0 [<ffffffff811e8250>] ? aio_fsync+0x30/0x30 [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0 [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80 [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff811eae50>] sys_io_submit+0x10/0x20 [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b Changes since v2: Update commit log to explain how the code is still broken even if we delay the lock switching after the drain. Changes since v1: Update commit log as Tejun suggested. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-15 08:46:22 +02:00
Asias He	458f27a982	block: Avoid missed wakeup in request waitqueue After hot-unplug a stressed disk, I found that rl->wait[] is not empty while rl->count[] is empty and there are theads still sleeping on get_request after the queue cleanup. With simple debug code, I found there are exactly nr_sleep - nr_wakeup of theads in D state. So there are missed wakeup. $ dmesg \| grep nr_sleep [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173 $ vmstat 1 1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0 To quote Tejun: Ah, okay, freed_request() wakes up single waiter with the assumption that after the wakeup there will at least be one successful allocation which in turn will continue the wakeup chain until the wait list is empty - ie. waiter wakeup is dependent on successful request allocation happening after each wakeup. With queue marked dead, any woken up waiter fails the allocation path, so the wakeup chaining is lost and we're left with hung waiters. What we need is wake_up_all() after drain completion. This patch fixes the missed wakeup by waking up all the theads which are sleeping on wait queue after queue drain. Changes in v2: Drop waitqueue_active() optimization Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Asias He <asias@redhat.com> Fixed a bug by me, where stacked devices would oops on calling blk_drain_queue() since ->rq.wait[] do not get initialized unless it's a full queue setup. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-15 08:45:25 +02:00
Jens Axboe	0b7877d4ee	Linux 3.4-rc5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2 MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9 JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0= =4d4w -----END PGP SIGNATURE----- Merge tag 'v3.4-rc5' into for-3.5/core The core branch is behind driver commits that we want to build on for 3.5, hence I'm pulling in a later -rc. Linux 3.4-rc5 Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-05-01 14:29:55 +02:00
Tejun Heo	aaf7c68068	block: fix elvpriv allocation failure handling Request allocation is mempool backed to guarantee forward progress under memory pressure; unfortunately, this property got broken while adding elvpriv data. Failures during elvpriv allocation, including ioc and icq creation failures, currently make get_request() fail as whole. There's no forward progress guarantee for these allocations - they may fail indefinitely under memory pressure stalling IO and deadlocking the system. This patch updates get_request() such that elvpriv allocation failure doesn't make the whole function fail. If elvpriv allocation fails, the allocation is degraded into !ELVPRIV. This will force the request to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc failures should be rare (nothing is per-request) and anything is better than deadlocking. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:40 +02:00
Tejun Heo	29e2b09ab5	block: collapse blk_alloc_request() into get_request() Allocation failure handling in get_request() is about to be updated. To ease the update, collapse blk_alloc_request() into get_request(). This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:40 +02:00
Tejun Heo	b82d4b197c	blkcg: make request_queue bypassing on allocation With the previous change to guarantee bypass visiblity for RCU read lock regions, entering bypass mode involves non-trivial overhead and future changes are scheduled to make use of bypass mode during init path. Combined it may end up adding noticeable delay during boot. This patch makes request_queue start its life in bypass mode, which is ended on queue init completion at the end of blk_init_allocated_queue(), and updates blk_queue_bypass_start() such that draining and RCU synchronization are performed only when the queue actually enters bypass mode. This avoids unnecessarily switching in and out of bypass mode during init avoiding the overhead and any nasty surprises which may step from leaving bypass mode on half-initialized queues. The boot time overhead was pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	80fd99792b	blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing Currently, blkg_lookup() doesn't check @q bypass state. This patch updates blk_queue_bypass_start() to do synchronize_rcu() before returning and updates blkg_lookup() to check blk_queue_bypass() and return %NULL if bypassing. This ensures blkg_lookup() returns %NULL if @q is bypassing. This is to guarantee that nobody is accessing policy data while @q is bypassing, which is necessary to allow replacing blkio_cgroup->pd[] in place on policy [de]activation. v2: Added more comments explaining bypass guarantees as suggested by Vivek. v3: Added more comments explaining why there's no synchronize_rcu() in blk_cleanup_queue() as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Shaohua Li	1b2e19f17e	block: make auto block plug flush threshold per-disk based We do auto block plug flush to reduce latency, the threshold is 16 requests. This works well if the task is accessing one or two drives. The problem is if the task is accessing a raid 0 device and the raid disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have heavy lock contention. This patch makes the threshold per-disk based. The latency should be still ok accessing one or two drives. The setup with application accessing a lot of drives in the meantime uaually is big machine, avoiding lock contention is more important, because any contention will actually increase latency. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-06 11:37:47 -06:00
Dan Carpenter	00380a404f	block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL We should use the GFP flags that the caller specified instead of picking our own. All the callers specify GFP_KERNEL so this doesn't make a difference to how the kernel runs, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-23 09:58:54 +01:00
Tejun Heo	852c788f83	block: implement bio_associate_current() IO scheduling and cgroup are tied to the issuing task via io_context and cgroup of %current. Unfortunately, there are cases where IOs need to be routed via a different task which makes scheduling and cgroup limit enforcement applied completely incorrectly. For example, all bios delayed by blk-throttle end up being issued by a delayed work item and get assigned the io_context of the worker task which happens to serve the work item and dumped to the default block cgroup. This is double confusing as bios which aren't delayed end up in the correct cgroup and makes using blk-throttle and cfq propio together impossible. Any code which punts IO issuing to another task is affected which is getting more and more common (e.g. btrfs). As both io_context and cgroup are firmly tied to task including userland visible APIs to manipulate them, it makes a lot of sense to match up tasks to bios. This patch implements bio_associate_current() which associates the specified bio with %current. The bio will record the associated ioc and blkcg at that point and block layer will use the recorded ones regardless of which task actually ends up issuing the bio. bio release puts the associated ioc and blkcg. It grabs and remembers ioc and blkcg instead of the task itself because task may already be dead by the time the bio is issued making ioc and blkcg inaccessible and those are all block layer cares about. elevator_set_req_fn() is updated such that the bio elvdata is being allocated for is available to the elevator. This doesn't update block cgroup policies yet. Further patches will implement the support. -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in rq_ioc() to fix build breakage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	24acfc34fb	block: interface update for ioc/icq creation functions Make the following interface updates to prepare for future ioc related changes. * create_io_context() returning ioc only works for %current because it doesn't increment ref on the ioc. Drop @task parameter from it and always assume %current. * Make create_io_context_slowpath() return 0 or -errno and rename it to create_task_io_context(). * Make ioc_create_icq() take @ioc as parameter instead of assuming that of %current. The caller, get_request(), is updated to create ioc explicitly and then pass it into ioc_create_icq(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	b679281a64	block: restructure get_request() get_request() is structured a bit unusually in that failure path is inlined in the usual flow with goto labels atop and inside it. Relocate the error path to the end of the function. This is to prepare for icq handling changes in get_request() and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	e8989fae38	blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	4eef304998	blkcg: move per-queue blkg list heads and counters to queue and blkg Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	5efd611351	blkcg: add blkcg_{init\|drain\|exit}_queue() Currently block core calls directly into blk-throttle for init, drain and exit. This patch adds blkcg_{init\|drain\|exit}_queue() which wraps the blk-throttle functions. This is to give more control and visiblity to blkcg core layer for proper layering. Further patches will add logic common to blkcg policies to the functions. While at it, collapse blk_throtl_release() into blk_throtl_exit(). There's no reason to keep them separate. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	f51b802c17	blkcg: use the usual get blkg path for root blkio_group For root blkg, blk_throtl_init() was using throtl_alloc_tg() explicitly and cfq_init_queue() was manually initializing embedded cfqd->root_group, adding unnecessarily different code paths to blkg handling. Make both use the usual blkio_group get functions - throtl_get_tg() and cfq_get_cfqg() - for the root blkio_group too. Note that blk_throtl_init() callsite is pushed downwards in blk_alloc_queue_node() so that @q is sufficiently initialized for throtl_get_tg(). This simplifies root blkg handling noticeably for cfq and will allow further modularization of blkcg API. -v2: Vivek pointed out that using cfq_get_cfqg() won't work if CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out initialization of base part of cfqg into cfq_init_cfqg_base() and alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	6ecf23afab	block: extend queue bypassing to cover blkcg policies Extend queue bypassing such that dying queue is always bypassing and blk-throttle is drained on bypass. With blkcg policies updated to test blk_queue_bypass() instead of blk_queue_dead(), this ensures that no bio or request is held by or going through blkcg policies on a bypassing queue. This will be used to implement blkg cleanup on elevator switches and policy changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	d732580b4e	block: implement blk_queue_bypass_start/end() Rename and extend elv_queisce_start/end() to blk_queue_bypass_start/end() which are exported and supports nesting via @q->bypass_depth. Also add blk_queue_bypass() to test bypass state. This will be further extended and used for blkio_group management. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	b855b04a0b	block: blk-throttle should be drained regardless of q->elevator Currently, blk_cleanup_queue() doesn't call elv_drain_elevator() if q->elevator doesn't exist; however, bio based drivers don't have elevator initialized but can still use blk-throttle. This patch moves q->elevator test inside blk_drain_queue() such that only elv_drain_elevator() is skipped if !q->elevator. -v2: loop can have registered queue which has NULL request_fn. Make sure we don't call into __blk_run_queue() in such cases. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Vivek Goyal <vgoyal@redhat.com> Fold in bug fix from Vivek. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:24:55 +01:00
Tejun Heo	07c2bd3735	block: don't call elevator callbacks for plug merges Plug merge calls two elevator callbacks outside queue lock - elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although attempt_plug_merge() suggests that elevator is guaranteed to be there through the existing request on the plug list, nothing prevents plug merge from calling into dying or initializing elevator. For regular merges, bypass ensures elvpriv count to reach zero, which in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check ELVPRIV, and, as the requests haven't gone through elevator insertion yet, it doesn't have SOFTBARRIER set allowing merges on a bypassed queue. This, for example, leads to the following crash during elevator switch. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 PGD 112cbc067 PUD 115d5c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU 1 Modules linked in: deadline_iosched Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0 RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8 RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708 R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708 FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0) Stack: ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1 ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e Call Trace: [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60 [<ffffffff81391bf1>] elv_try_merge+0x21/0x40 [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390 [<ffffffff81396a5a>] generic_make_request+0xca/0x100 [<ffffffff81396b04>] submit_bio+0x74/0x100 [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450 [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60 [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff811986b2>] do_sync_read+0xe2/0x120 [<ffffffff81199345>] vfs_read+0xc5/0x180 [<ffffffff81199501>] sys_read+0x51/0x90 [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b There are multiple ways to fix this including making plug merge check ELVPRIV; however, * Calling into elevator outside queue lock is confusing and error-prone. * Requests on plug list aren't known to the elevator. They aren't on the elevator yet, so there's no elevator specific state to update. * Given the nature of plug merges - collecting bio's for the same purpose from the same issuer - elevator specific restrictions aren't applicable. So, simply don't call into elevator methods from plug merge by moving elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and using blk_try_merge() in attempt_plug_merge(). This is based on Jens' patch to skip elevator_allow_merge_fn() from plug merge. Note that this makes per-cgroup merged stats skip plug merging. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4F16F3CA.90904@kernel.dk> Original-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 09:19:42 +01:00
Tejun Heo	050c8ea80e	block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions blk_rq_merge_ok() is the elevator-neutral part of merge eligibility test. blk_try_merge() determines merge direction and expects the caller to have tested elv_rq_merge_ok() previously. elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls elv_iosched_allow_merge(). elv_try_merge() is removed and the two callers are updated to call elv_rq_merge_ok() explicitly followed by blk_try_merge(). While at it, make rq_merge_ok() functions return bool. This is to prepare for plug merge update and doesn't introduce any behavior change. This is based on Jens' patch to skip elevator_allow_merge_fn() from plug merge. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4F16F3CA.90904@kernel.dk> Original-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 09:19:38 +01:00
Tejun Heo	11a3122f6c	block: strip out locking optimization in put_io_context() put_io_context() performed a complex trylock dancing to avoid deferring ioc release to workqueue. It was also broken on UP because trylock was always assumed to succeed which resulted in unbalanced preemption count. While there are ways to fix the UP breakage, even the most pathological microbench (forced ioc allocation and tight fork/exit loop) fails to show any appreciable performance benefit of the optimization. Strip it out. If there turns out to be workloads which are affected by this change, simpler optimization from the discussion thread can be applied later. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-07 07:51:30 +01:00
Shaohua Li	05c30b9551	block: fix NULL icq_cache reference Vivek reported a kernel crash: [ 94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c [ 94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200 [ 94.218004] PGD 13abda067 PUD 137d52067 PMD 0 [ 94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 94.218004] CPU 0 [ 94.218004] Modules linked in: [last unloaded: scsi_wait_scan] [ 94.218004] [ 94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch [ 94.218004] RIP: 0010:[<ffffffff81142fae>] [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200 [ 94.218004] RSP: 0018:ffff88013fc03de0 EFLAGS: 00010006 [ 94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b [ 94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae [ 94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001 [ 94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246 [ 94.218004] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000 [ 94.218004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0 [ 94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020) [ 94.218004] Stack: [ 94.218004] 0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00 [ 94.218004] 0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6 [ 94.218004] ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240 [ 94.218004] Call Trace: [ 94.218004] <IRQ> [ 94.218004] [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20 [ 94.218004] [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420 [ 94.218004] [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250 [ 94.218004] [<ffffffff810405ee>] __do_softirq+0xce/0x3e0 [ 94.218004] [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100 [ 94.218004] [<ffffffff81090104>] ? tick_program_event+0x24/0x30 [ 94.218004] [<ffffffff8183ed1c>] call_softirq+0x1c/0x30 [ 94.218004] [<ffffffff8100422d>] do_softirq+0x8d/0xc0 [ 94.218004] [<ffffffff81040c3e>] irq_exit+0xae/0xe0 [ 94.218004] [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99 [ 94.218004] [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80 Once a queue is quiesced, it's not supposed to have any elvpriv data or icq's, and elevator switching depends on that. Request alloc path followed the rule for elvpriv data but forgot apply it to icq's leading to the following crash during elevator switch. Fix it by not allocating icq's if ELVPRIV is not set for the request. Reported-by: Vivek Goyal <vgoyal@redhat.com> Tested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-19 09:20:10 +01:00
Linus Torvalds	b3c9dd182e	Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)	2012-01-15 12:24:45 -08:00
Tejun Heo	4eabc94125	block: don't kick empty queue in blk_drain_queue() While probing, fd sets up queue, probes hardware and tears down the queue if probing fails. In the process, blk_drain_queue() kicks the queue which failed to finish initialization and fd is unhappy about that. floppy0: no floppy controllers found ------------[ cut here ]------------ WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0() Hardware name: To Be Filled By O.E.M. VFS: do_fd_request called on non-open device Modules linked in: Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2 Call Trace: [<ffffffff81039a6a>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81039b41>] warn_slowpath_fmt+0x41/0x50 [<ffffffff813d657f>] do_fd_request+0xbf/0xd0 [<ffffffff81322b95>] blk_drain_queue+0x65/0x80 [<ffffffff81322c93>] blk_cleanup_queue+0xe3/0x1a0 [<ffffffff818a809d>] floppy_init+0xdeb/0xe28 [<ffffffff818a72b2>] ? daring+0x6b/0x6b [<ffffffff810002af>] do_one_initcall+0x3f/0x170 [<ffffffff81884b34>] kernel_init+0x9d/0x11e [<ffffffff810317c2>] ? schedule_tail+0x22/0xa0 [<ffffffff815dbb14>] kernel_thread_helper+0x4/0x10 [<ffffffff81884a97>] ? start_kernel+0x2be/0x2be [<ffffffff815dbb10>] ? gs_change+0xb/0xb Avoid it by making blk_drain_queue() kick queue iff dispatch queue has something on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ralf Hildebrandt <Ralf.Hildebrandt@charite.de> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Tested-by: Sergei Trofimovich <slyich@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-15 20:03:04 +01:00
Tejun Heo	f1f8cc9465	block, cfq: move icq creation and rq->elv.icq association to block core Now block layer knows everything necessary to create and associate icq's with requests. Move ioc_create_icq() to blk-ioc.c and update get_request() such that, if elevator_type->icq_size is set, requests are automatically associated with their matching icq's before elv_set_request(). io_context reference is also managed by block core on request alloc/free. * Only ioprio/cgroup changed handling remains from cfq_get_cic(). Collapsed into cfq_set_request(). * This removes queue kicking on icq allocation failure (for now). As icq allocation failure is rare and the only effect of queue kicking achieved was possibily accelerating queue processing, this change shouldn't be noticeable. There is a larger underlying problem. Unlike request allocation, icq allocation is not guaranteed to succeed eventually after retries. The number of icq is unbound and thus mempool can't be the solution either. This effectively adds allocation dependency on memory free path and thus possibility of deadlock. This usually wouldn't happen because icq allocation is not a hot path and, even when the condition triggers, it's highly unlikely that none of the writeback workers already has icq. However, this is still possible especially if elevator is being switched under high memory pressure, so we better get it fixed. Probably the only solution is just bypassing elevator and appending to dispatch queue on any elevator allocation failure. * Comment added to explain how icq's are managed and synchronized. This completes cleanup of io_context interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	a612fddf0d	block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq Most of icq management is about to be moved out of cfq into blk-ioc. This patch prepares for it. * Move cfqd->icq_list to request_queue->icq_list * Make request explicitly point to icq instead of through elevator private data. ->elevator_private[3] is replaced with sub struct elv which contains icq pointer and priv[2]. cfq is updated accordingly. * Meaningless clearing of ->elevator_private[0] removed from elv_set_request(). At that point in code, the field was guaranteed to be %NULL anyway. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	f2dbd76a0a	block, cfq: replace current_io_context() with create_io_context() When called under queue_lock, current_io_context() triggers lockdep warning if it hits allocation path. This is because io_context installation is protected by task_lock which is not IRQ safe, so it triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock deadlock warning. Given the restriction, accessor + creator rolled into one doesn't work too well. Drop current_io_context() and let the users access task->io_context directly inside queue_lock combined with explicit creation using create_io_context(). Future ioc updates will further consolidate ioc access and the create interface will be unexported. While at it, relocate ioc internal interface declarations in blk.h and add section comments before and after. This patch does not introduce functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	09ac46c429	block: misc updates to blk_get_queue() * blk_get_queue() is peculiar in that it returns 0 on success and 1 on failure instead of 0 / -errno or boolean. Update it such that it returns %true on success and %false on failure. * Make sure the caller checks for the return value. * Separate out __blk_get_queue() which doesn't check whether @q is dead and put it in blk.h. This will be used later. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	a73f730d01	block, cfq: move cfqd->cic_index to q->id cfq allocates per-queue id using ida and uses it to index cic radix tree from io_context. Move it to q->id and allocate on queue init and free on queue release. This simplifies cfq a bit and will allow for further improvements of io context life-cycle management. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	8ba61435d7	block: add missing blk_queue_dead() checks blk_insert_cloned_request(), blk_execute_rq_nowait() and blk_flush_plug_list() either didn't check whether the queue was dead or did it without holding queue_lock. Update them so that dead state is checked while holding queue_lock. AFAICS, this plugs all holes (requeue doesn't matter as the request is transitioning atomically from in_flight to queued). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	481a7d6479	block: fix drain_all condition in blk_drain_queue() When trying to drain all requests, blk_drain_queue() checked only q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This patch updates blk_drain_queue() such that it looks at all the counters and queues so that request_queue is actually empty on completion. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	34f6055c80	block: add blk_queue_dead() There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead() macro and use it. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Tejun Heo	1ba64edef6	block, sx8: kill blk_insert_request() The only user left for blk_insert_request() is sx8 and it can be trivially switched to use blk_execute_rq_nowait() - special requests aren't included in io stat and sx8 doesn't use block layer tagging. Switch sx8 and kill blk_insert_requeset(). This patch doesn't introduce any functional difference. Only compile tested. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
Mike Snitzer	5151412dd4	block: initialize request_queue's numa node during struct request_queue is allocated with __GFP_ZERO so its "node" field is zero before initialization. This causes an oops if node 0 is offline in the page allocator because its zonelists are not initialized. From Dave Young's dmesg: SRAT: Node 1 PXM 2 0-d0000000 SRAT: Node 1 PXM 2 100000000-330000000 SRAT: Node 0 PXM 1 330000000-630000000 Initmem setup node 1 0000000000000000-000000000affb000 ... Built 1 zonelists in Node order, mobility grouping on. ... BUG: unable to handle kernel paging request at 0000000000001c08 IP: [<ffffffff8111c355>] __alloc_pages_nodemask+0xb5/0x870 and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on zonelist->_zonerefs. The fix is to initialize q->node at the time of allocation so the correct node is passed to the slab allocator later. Since blk_init_allocated_queue_node() is no longer needed, merge it with blk_init_allocated_queue(). [rientjes@google.com: changelog, initializing q->node] Cc: stable@vger.kernel.org [2.6.37+] Reported-by: Dave Young <dyoung@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Dave Young <dyoung@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-23 10:59:13 +01:00
Shaohua Li	019ceb7d5d	block: add missed trace_block_plug After flush plug list, the list has no request, so we need to add a trace_block_plug(). Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-16 09:21:50 +01:00
Shaohua Li	3540d5e89b	block: avoid unnecessary plug list flush get_request_wait() could sleep and flush the plug list. If the list is already flushed, don't flush again. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-16 09:21:50 +01:00
Tejun Heo	6dd9ad7df2	block: don't call blk_drain_queue() if elevator is not up blk_cleanup_queue() may be called before elevator is set up on a queue which triggers the following oops. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590 Bochs Bochs RIP: 0010:[<ffffffff8125a69c>] [<ffffffff8125a69c>] elv_drain_elevator+0x1c/0x70 ... Call Trace: [<ffffffff8125da92>] blk_drain_queue+0x42/0x70 [<ffffffff8125db90>] blk_cleanup_queue+0xd0/0x1c0 [<ffffffff81469640>] md_free+0x50/0x70 [<ffffffff8126f43b>] kobject_release+0x8b/0x1d0 [<ffffffff81270d56>] kref_put+0x36/0xa0 [<ffffffff8126f2b7>] kobject_put+0x27/0x60 [<ffffffff814693af>] mddev_delayed_delete+0x2f/0x40 [<ffffffff81083450>] process_one_work+0x100/0x3b0 [<ffffffff8108527f>] worker_thread+0x15f/0x3a0 [<ffffffff81089937>] kthread+0x87/0x90 [<ffffffff81621834>] kernel_thread_helper+0x4/0x10 Fix it by making blk_cleanup_queue() check whether q->elevator is set up before invoking blk_drain_queue. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-11-03 18:52:11 +01:00
Jens Axboe	83157223de	Merge branch 'for-linus' into for-3.2/core	2011-10-24 16:24:38 +02:00
Jeff Moyer	e67b77c791	blk-flush: move the queue kick into A dm-multipath user reported[1] a problem when trying to boot a kernel with commit `4853abaae7` (block: fix flush machinery for stacking drivers with differring flush flags) applied. It turns out that an empty flush request can be sent into blk_insert_flush. When the BUG_ON was fixed to allow for this, I/O on the underlying device would stall. The reason is that blk_insert_cloned_request does not kick the queue. In the aforementioned commit, I had added a special case to kick the queue if data was sent down but the queue flags did not require a flush. A better solution is to push the queue kick up into blk_insert_cloned_request. This patch, along with a follow-on which fixes the BUG_ON, fixes the issue reported. [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html Reported-by: Christophe Saout <christophe@saout.de> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Stable note: 3.1 Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:24:31 +02:00
Tao Ma	9562ad9ab3	block: Remove the control of complete cpu from bio. bio originally has the functionality to set the complete cpu, but it is broken. Chirstoph said that "This code is unused, and from the all the discussions lately pretty obviously broken. The only thing keeping it serves is creating more confusion and possibly more bugs." And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine with leaving cpu control to the request based drivers, they are the only ones that can toggle the setting anyway". So this patch tries to remove all the work of controling complete cpu from a bio. Cc: Shaohua Li <shaohua.li@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-24 16:11:30 +02:00
Tejun Heo	c9a929dde3	block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown request_queue is refcounted but actually depdends on lifetime management from the queue owner - on blk_cleanup_queue(), block layer expects that there's no request passing through request_queue and no new one will. This is fundamentally broken. The queue owner (e.g. SCSI layer) doesn't have a way to know whether there are other active users before calling blk_cleanup_queue() and other users (e.g. bsg) don't have any guarantee that the queue is and would stay valid while it's holding a reference. With delay added in blk_queue_bio() before queue_lock is grabbed, the following oops can be easily triggered when a device is removed with in-flight IOs. sd 0:0:1:0: [sdb] Stopping disk ata1.01: disabled general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs RIP: 0010:[<ffffffff8137d651>] [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100 ... Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80) ... Call Trace: [<ffffffff8137d774>] elv_merge+0x84/0xe0 [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400 [<ffffffff813838ea>] generic_make_request+0xca/0x100 [<ffffffff81383994>] submit_bio+0x74/0x100 [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0 [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40 [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60 [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff8118c1ca>] do_sync_read+0xda/0x120 [<ffffffff8118ce55>] vfs_read+0xc5/0x180 [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0 [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b This happens because blk_queue_cleanup() destroys the queue and elevator whether IOs are in progress or not and DEAD tests are sprinkled in the request processing path without proper synchronization. Similar problem exists for blk-throtl. On queue cleanup, blk-throtl is shutdown whether it has requests in it or not. Depending on timing, it either oopses or throttled bios are lost putting tasks which are waiting for bio completion into eternal D state. The way it should work is having the usual clear distinction between shutdown and release. Shutdown drains all currently pending requests, marks the queue dead, and performs partial teardown of the now unnecessary part of the queue. Even after shutdown is complete, reference holders are still allowed to issue requests to the queue although they will be immmediately failed. The rest of teardown happens on release. This patch makes the following changes to make blk_queue_cleanup() behave as proper shutdown. * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and queue_lock. * Unsynchronized DEAD check in generic_make_request_checks() removed. This couldn't make any meaningful difference as the queue could die after the check. * blk_drain_queue() updated such that it can drain all requests and is now called during cleanup. * blk_throtl updated such that it checks DEAD on grabbing queue_lock, drains all throttled bios during cleanup and free td when queue is released. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:42:16 +02:00
Tejun Heo	bd87b5898a	block: drop @tsk from attempt_plug_merge() and explain sync rules attempt_plug_merge() accesses elevator without holding queue_lock and may call into ->elevator_bio_merge_fn(). The elvator is guaranteed to be valid because it's accessed iff the plugged list has requests and elevator is never exited with live requests, so as long as the elevator method can deal with unlocked access, this is safe. Explain the sync rules around attempt_plug_merge() and drop the unnecessary @tsk parameter. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:08 +02:00
Tejun Heo	da8303c63b	block: make get_request[_wait]() fail if queue is dead Currently get_request[_wait]() allocates request whether queue is dead or not. This patch makes get_request[_wait]() return NULL if @q is dead. blk_queue_bio() is updated to fail the submitted bio if request allocation fails. While at it, add docbook comments for get_request[_wait](). Note that the current code has rather unclear (there are spurious DEAD tests scattered around) assumption that the owner of a queue guarantees that no request travels block layer if the queue is dead and this patch in itself doesn't change much; however, this will allow fixing the broken assumption in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:05 +02:00
Tejun Heo	bc16a4f933	block: reorganize throtl_get_tg() and blk_throtl_bio() blk_throtl_bio() and throtl_get_tg() have rather unusual interface. * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV), and drops queue_lock in the latter case. Different locking context depending on return value is error-prone and DEAD state is scheduled to be protected by queue_lock anyway. Move DEAD check inside queue_lock and return valid tg or NULL. * blk_throtl_bio() indicates return status both with its return value and in/out param *@bio. The former is used to indicate whether queue is found to be dead during throtl processing. The latter whether the bio is throttled. There's no point in returning DEAD check result from blk_throtl_bio(). The queue can die after blk_throtl_bio() is finished but before make_request_fn() grabs queue lock. Make it take @bio instead and return boolean result indicating whether the request is throttled or not. This patch doesn't cause any visible functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:33:01 +02:00
Tejun Heo	e3c78ca524	block: reorganize queue draining Reorganize queue draining related code in preparation of queue exit changes. * Factor out actual draining from elv_quiesce_start() to blk_drain_queue(). * Make elv_quiesce_start/end() responsible for their own locking. * Replace open-coded ELVSWITCH clearing in elevator_switch() with elv_quiesce_end(). This patch doesn't cause any visible functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:32:38 +02:00
Tejun Heo	75eb6c372d	block: pass around REQ_* flags instead of broken down booleans during request alloc/free blk_alloc_request() and freed_request() take different combinations of REQ_* @flags, @priv and @is_sync when @flags is superset of the latter two. Make them take @flags only. This cleans up the code a bit and will ease updating allocation related REQ_* flags. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:31:22 +02:00
Jens Axboe	5c04b426f2	Merge branch 'v3.1-rc10' into for-3.2/core Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-10-19 14:30:42 +02:00
Hannes Reinecke	777eb1bf15	block: Free queue resources at blk_release_queue() A kernel crash is observed when a mounted ext3/ext4 filesystem is physically removed. The problem is that blk_cleanup_queue() frees up some resources eg by calling elevator_exit(), which are not checked for in normal operation. So we should rather move these calls to the destructor function blk_release_queue() as at that point all remaining references are gone. However, in doing so we have to ensure that any externally supplied queue_lock is disconnected as the driver might free up the lock after the call of blk_cleanup_queue(), Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-28 08:07:01 -06:00
Suresh Jayaraman	75df713627	block: document blk-plug Thus spake Andrew Morton: "And I have the usual maintainability whine. If someone comes up to vmscan.c and sees it calling blk_start_plug(), how are they supposed to work out why that call is there? They go look at the blk_start_plug() definition and it is undocumented. I think we can do better than this?" Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens Axboe and from an earlier attempt by Shaohua Li to document blk-plug. [akpm@linux-foundation.org: grammatical and spelling tweaks] Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-21 10:00:16 +02:00
Christoph Hellwig	27a84d54c0	block: refactor generic_make_request Move all the checks performed on a bio into a new helper, and call it as soon as bio is submitted even if it is a re-submission from ->make_request. We explicitly mark the new helper as beeing non-inlined as the stack usage for printing the block device name in the failure case is quite high and this a patch where we have to be extremely conservative about stack usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-09-15 14:01:40 +02:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
Jens Axboe	c20e8de27f	block: rename __make_request() to blk_queue_bio() Now that it's exported, lets put it in a more sane namespace. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:31 +02:00
Christoph Hellwig	166e1f901b	block: export __make_request Avoid the hacks need for request based device mappers currently by simply exporting the symbol instead of trying to get it through the back door. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:27 +02:00
Shaohua Li	56ebdaf2fa	block: simplify force plug flush code a little bit Cleaning up the code a little bit. attempt_plug_merge() traverses the plug list anyway, we can do the request counting there, so stack size is reduced a little bit. The motivation here is I suspect if we should count the requests for each queue (task could handle multiple disks in the meantime), but my test doesn't show it's worthy doing. If somebody proves we should do it, below change will make that more easier. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-24 16:04:34 +02:00
Shaohua Li	a632716275	block: change force plug flush call order Do blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request at the tail. New request can't be merged to existing requests, but later new requests might be merged with this new one. If blk_flush_plug_list() is done later, the merge doesn't happen. Believe it or not, this fixes a 10% regression running sysbench workload. Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-24 16:04:32 +02:00
Linus Torvalds	5ccc38740a	Merge branch 'for-linus' of git://git.kernel.dk/linux-block * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits) Revert "cfq: Remove special treatment for metadata rqs." block: fix flush machinery for stacking drivers with differring flush flags block: improve rq_affinity placement blktrace: add FLUSH/FUA support Move some REQ flags to the common bio/request area allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH xen/blkback: Make description more obvious. cfq-iosched: Add documentation about idling block: Make rq_affinity = 1 work as expected block: swim3: fix unterminated of_device_id table block/genhd.c: remove useless cast in diskstats_show() drivers/cdrom/cdrom.c: relax check on dvd manufacturer value drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse bsg-lib: add module.h include cfq-iosched: Reduce linked group count upon group destruction blk-throttle: correctly determine sync bio loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices loop: add management interface for on-demand device allocation loop: replace linked list of allocated devices with an idr index ...	2011-08-19 10:47:07 -07:00
Jeff Moyer	4853abaae7	block: fix flush machinery for stacking drivers with differring flush flags Commit `ae1b153962`, block: reimplement FLUSH/FUA to support merge, introduced a performance regression when running any sort of fsyncing workload using dm-multipath and certain storage (in our case, an HP EVA). The test I ran was fs_mark, and it dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out that dm-multipath always advertised flush+fua support, and passed commands on down the stack, where those flags used to get stripped off. The above commit changed that behavior: static inline struct request __elv_next_request(struct request_queue q) { struct request rq; while (1) { - while (!list_empty(&q->queue_head)) { + if (!list_empty(&q->queue_head)) { rq = list_entry_rq(q->queue_head.next); - if (!(rq->cmd_flags & (REQ_FLUSH \| REQ_FUA)) \|\| - (rq->cmd_flags & REQ_FLUSH_SEQ)) - return rq; - rq = blk_do_flush(q, rq); - if (rq) - return rq; + return rq; } Note that previously, a command would come in here, have REQ_FLUSH\|REQ_FUA set, and then get handed off to blk_do_flush: struct request blk_do_flush(struct request_queue q, struct request rq) { unsigned int fflags = q->flush_flags; /* may change, cache it */ bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA; bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH); bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA); unsigned skip = 0; ... if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) { rq->cmd_flags &= ~REQ_FLUSH; if (!has_fua) rq->cmd_flags &= ~REQ_FUA; return rq; } So, the flush machinery was bypassed in such cases (q->flush_flags == 0 && rq->cmd_flags & (REQ_FLUSH\|REQ_FUA)). Now, however, we don't get into the flush machinery at all. Instead, __elv_next_request just hands a request with flush and fua bits set to the scsi_request_fn, even if the underlying request_queue does not support flush or fua. The agreed upon approach is to fix the flush machinery to allow stacking. While this isn't used in practice (since there is only one request-based dm target, and that target will now reflect the flush flags of the underlying device), it does future-proof the solution, and make it function as designed. In order to make this work, I had to add a field to the struct request, inside the flush structure (to store the original req->end_io). Shaohua had suggested overloading the union with rb_node and completion_data, but the completion data is used by device mapper and can also be used by other drivers. So, I didn't see a way around the additional field. I tested this patch on an HP EVA with both ext4 and xfs, and it recovers the lost performance. Comments and other testers, as always, are appreciated. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-08-15 21:37:25 +02:00
Akinobu Mita	dd48c085c1	fault-injection: add ability to export fault_attr in arbitrary directory init_fault_attr_dentries() is used to export fault_attr via debugfs. But it can only export it in debugfs root directory. Per Forlin is working on mmc_fail_request which adds support to inject data errors after a completed host transfer in MMC subsystem. The fault_attr for mmc_fail_request should be defined per mmc host and export it in debugfs directory per mmc host like /sys/kernel/debug/mmc0/mmc_fail_request. init_fault_attr_dentries() doesn't help for mmc_fail_request. So this introduces fault_create_debugfs_attr() which is able to create a directory in the arbitrary directory and replace init_fault_attr_dentries(). [akpm@linux-foundation.org: extraneous semicolon, per Randy] Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Tested-by: Per Forlin <per.forlin@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-08-03 14:25:20 -10:00
Akinobu Mita	b2c9cd3793	fail_make_request: cleanup should_fail_request This changes should_fail_request() to more usable wrapper function of should_fail(). It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in the middle of a function. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:46 -07:00
Jens Axboe	11ccf116d0	block: fix warning with calling smp_processor_id() in preemptible section After commit `5757a6d7` introduced an unsafe calling of smp_processor_id(), with preempt debuggin turned on we spew a lot of: BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514 caller is __make_request+0x1b8/0x308 [<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308) [<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558) [<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138) [<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c) [<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224) [<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94) [<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8) Fix this by just using raw_smp_processor_id(), it's just a hint after all. There's no pinning of the CPU or accessing per-cpu structures involved. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-26 15:01:15 +02:00
Linus Torvalds	096a705bbc	Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...	2011-07-25 10:33:36 -07:00
Dan Williams	5757a6d76c	block: strict rq_affinity Some systems benefit from completions always being steered to the strict requester cpu rather than the looser "per-socket" steering that blk_cpu_to_group() attempts by default. This is because the first CPU in the group mask ends up being completely overloaded with work, while the others (including the original submitter) has power left to spare. Allow the strict mode to be set by writing '2' to the sysfs control file. This is identical to the scheme used for the nomerges file, where '2' is a more aggressive setting than just being turned on. echo 2 > /sys/block/<bdev>/queue/rq_affinity Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-23 20:44:25 +02:00
James Bottomley	bfe159a512	[SCSI] fix crash in scsi_dispatch_cmd() USB surprise removal of sr is triggering an oops in scsi_dispatch_command(). What seems to be happening is that USB is hanging on to a queue reference until the last close of the upper device, so the crash is caused by surprise remove of a mounted CD followed by attempted unmount. The problem is that USB doesn't issue its final commands as part of the SCSI teardown path, but on last close when the block queue is long gone. The long term fix is probably to make sr do the teardown in the same way as sd (so remove all the lower bits on ejection, but keep the upper disk alive until last close of user space). However, the current oops can be simply fixed by not allowing any commands to be sent to a dead queue. Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>	2011-07-21 14:21:18 -07:00
Shaohua Li	55c022bbdd	block: avoid building too big plug list When I test fio script with big I/O depth, I found the total throughput drops compared to some relative small I/O depth. The reason is the thread accumulates big requests in its plug list and causes some delays (surely this depends on CPU speed). I thought we'd better have a threshold for requests. When a threshold reaches, this means there is no request merge and queue lock contention isn't severe when pushing per-task requests to queue, so the main advantages of blk plug don't exist. We can force a plug list flush in this case. With this, my test throughput actually increases and almost equals to small I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list() for big I/O depth. The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to reduce lock contention to me. But I'm open here, 32 is ok in my test too. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-07-08 08:19:20 +02:00
Jens Axboe	d86e0e83b3	block: export blk_{get,put}_queue() We need them in SCSI to fix a bug, but currently they are not exported to modules. Export them. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-27 07:45:45 +02:00
Luca Tettamanti	700c4f3325	block: remove unused variable in bio_attempt_front_merge() sector is never read inside the function. Signed-off-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-26 21:07:26 +02:00
Vivek Goyal	95cf3dd9db	block: call elv_bio_merged() when merged Commit `73c1010119` ("block: initial patch for on-stack per-task plugging") removed calls to elv_bio_merged() when @bio merged with @req. Re-add them. This in turn will update merged stats in associated group. That should be safe as long as request has got reference to the blkio_group. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Divyesh Shah <dpshah@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-23 10:02:19 +02:00
Jens Axboe	771949d03b	block: get rid of on-stack plugging debug checks We don't need them anymore, so kill: - REQ_ON_PLUG checks in various places - !rq_mergeable() check in plug merging Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:52:16 +02:00
Vivek Goyal	f469a7b4d5	blk-cgroup: Allow sleeping while dynamically allocating a group Currently, all the cfq_group or throtl_group allocations happen while we are holding ->queue_lock and sleeping is not allowed. Soon, we will move to per cpu stats and also need to allocate the per group stats. As one can not call alloc_percpu() from atomic context as it can sleep, we need to drop ->queue_lock, allocate the group, retake the lock and continue processing. In throttling code, I check the queue DEAD flag again to make sure that driver did not call blk_cleanup_queue() in the mean time. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-20 20:34:52 +02:00
Shaohua Li	3ec717b7ca	block: don't delay blk_run_queue_async Let's check a scenario: 1. blk_delay_queue(q, SCSI_QUEUE_DELAY); 2. blk_run_queue_async(); the second one will became a noop, because q->delay_work already has WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after SCSI_QUEUE_DELAY. But blk_run_queue_async actually hopes the delayed work runs immediately. Fix this by doing a cancel on potentially pending delayed work before queuing an immediate run of the workqueue. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-05-18 12:24:03 +02:00
Jens Axboe	d350e6b6e8	block: remove stale kerneldoc member from __blk_run_queue() We don't pass in a 'force_kblockd' anymore, get rid of the stsale comment. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:34:14 +02:00
Jens Axboe	c21e6beba8	block: get rid of QUEUE_FLAG_REENTER We are currently using this flag to check whether it's safe to call into ->request_fn(). If it is set, we punt to kblockd. But we get a lot of false positives and excessive punts to kblockd, which hurts performance. The only real abuser of this infrastructure is SCSI. So export the async queue run and convert SCSI over to use that. There's room for improvement in that SCSI need not always use the async call, but this fixes our performance issue and they can fix that up in due time. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-19 13:32:46 +02:00
Jens Axboe	bd900d4580	block: kill blk_flush_plug_list() export With all drivers and file systems converted, we only have in-core use of this function. So remove the export. Reporteed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 22:06:57 +02:00
Christoph Hellwig	24ecfbe27f	block: add blk_run_queue_async Instead of overloading __blk_run_queue to force an offload to kblockd add a new blk_run_queue_async helper to do it explicitly. I've kept the blk_queue_stopped check for now, but I suspect it's not needed as the check we do when the workqueue items runs should be enough. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 11:41:33 +02:00
Jens Axboe	4521cc4ed5	block: blk_delay_queue() should use kblockd workqueue Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 11:36:39 +02:00
Jens Axboe	99e22598e9	block: drop queue lock before calling __blk_run_queue() for kblockd punt If we know we are going to punt to kblockd, we can drop the queue lock before calling into __blk_run_queue() since it only does a safe bit test and a workqueue call. Since kblockd needs to grab this very lock as one of the first things it does, it's a good optimization to drop the lock before waking kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:59:55 +02:00
Jens Axboe	b4cb290e0a	Revert "block: add callback function for unplug notification" MD can't use this since it really requires us to be able to keep more than a single piece of state for the unplug. Commit `048c9374` added the required support for MD, so get rid of this now unused code. This reverts commit `f75664570d`. Conflicts: block/blk-core.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:54:05 +02:00
NeilBrown	048c9374a7	block: Enhance new plugging support to support general callbacks md/raid requires an unplug callback, but as it does not uses requests the current code cannot provide one. So allow arbitrary callbacks to be attached to the blk_plug. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-18 09:52:22 +02:00
Jens Axboe	49cac01e1f	block: make unplug timer trace event correspond to the schedule() unplug It's a pretty close match to what we had before - the timer triggering would mean that nobody unplugged the plug in due time, in the new scheme this matches very closely what the schedule() unplug now is. It's essentially the difference between an explicit unplug (IO unplug) or an implicit unplug (timer unplug, we scheduled with pending IO queued). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-16 13:51:05 +02:00
Jens Axboe	f6603783f9	block: only force kblockd unplugging from the schedule() path For the explicit unplugging, we'd prefer to kick things off immediately and not pay the penalty of the latency to switch to kblockd. So let blk_finish_plug() do the run inline, while the implicit-on-schedule-out unplug will punt to kblockd. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-15 15:49:07 +02:00
Christoph Hellwig	88b996cd06	block: cleanup the block plug helper functions It's a bit of a mess currently. task->plug is being cleared and reset in __blk_finish_plug(), and blk_finish_plug() is testing for a NULL plug which cannot happen even from schedule() anymore since it uses blk_needs_flush_plug() to determine whether to call into this function at all. So get rid of some of the cruft. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-04-15 15:20:10 +02:00

1 2 3 4 5 ...

385 Commits