OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Minfei Huang	8dc23658b7	dm: return correct error code in dm_resume()'s retry loop dm_resume() will return success (0) rather than -EINVAL if !dm_suspended_md() upon retry within dm_resume(). Reset the error code at the start of dm_resume()'s retry loop. Also, remove a useless assignment at the end of dm_resume(). Fixes: `ffcc393641` ("dm: enhance internal suspend and resume interface") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-09-14 13:56:38 -04:00
Jens Axboe	1eff9d322a	block: rename bio bi_rw to bi_opf Since commit `63a4cc2486`, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Mike Snitzer	eaf9a7361f	dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING Otherwise, there is potential for both DMF_SUSPENDED* and DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is definitely _not_ a valid state. This fix, in conjuction with "dm rq: fix the starting and stopping of blk-mq queues", addresses the potential for request-based DM multipath's __multipath_map() to see !dm_noflush_suspending() during suspend. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2016-08-02 16:21:37 -04:00
Linus Torvalds	f0c98ebc57	libnvdimm for 4.8 1/ Replace pcommit with ADR / directed-flushing: The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. 2/ On-demand ARS (address range scrub): Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. 3/ Support for the Microsoft DSM (device specific method) command format. 4/ Support for EDK2/OVMF virtual disk device memory ranges. 5/ Various fixes and cleanups across the subsystem. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV cN3edFVIdxvZeMgM5Ubq =xCBG -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date\|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...	2016-07-28 17:38:16 -07:00
Linus Torvalds	f7e6816994	- initially based on Jens' 'for-4.8/core' (given all the flag churn) and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits that DM depends on to provide its DAX support - clean up the bio-based vs request-based DM core code by moving the request-based DM core code out to dm-rq.[hc] - reinstate bio-based support in the DM multipath target (done with the idea that fast storage like NVMe over Fabrics could benefit) -- while preserving support for request_fn and blk-mq request-based DM mpath - SCSI and DM multipath persistent reservation fixes that were coordinated with Martin Petersen. - the DM raid target saw the most extensive change this cycle; it now provides reshape and takeover support (by layering ontop of the corresponding MD capabilities) - DAX support for DM core and the linear, stripe and error targets - A DM thin-provisioning block discard vs allocation race fix that addresses potential for corruption - A stable fix for DM verity-fec's block calculation during decode - A few cleanups and fixes to DM core and various targets -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXkRZmAAoJEMUj8QotnQNat2wH/i4LpkoGI5tI6UhyKWxRkzJp vKaJ0zuZ2Ez73DucJujNuvaiyHq1IjHD5pfr8JQO3E8ygDkRC2KjF2O8EXp0Has6 U1uLahQej72MAs0ZJTpvfE+JiY6qyIl4K+xxuPmYm2f2S5TWTIgOetYjJQmcMlQo Y8zFfcDYn4Dv5rMdvDT4+1ePETxq74wcBwTxyW3OAbHE1f0JjsUGdMKzXB1iTWcM VjLjWI//ETfFdIlDO0w2Qbd90aLUjmTR2k67RGnbPj5kNUNikv/X6iiY32KERR/0 vMiiJ7JS+a44P7FJqCMoAVM/oBYFiSNpS4LYevOgHb0G0ikF8kaSeqBPC6sMYvg= =uYt9 -----END PGP SIGNATURE----- Merge tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - initially based on Jens' 'for-4.8/core' (given all the flag churn) and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits that DM depends on to provide its DAX support - clean up the bio-based vs request-based DM core code by moving the request-based DM core code out to dm-rq.[hc] - reinstate bio-based support in the DM multipath target (done with the idea that fast storage like NVMe over Fabrics could benefit) -- while preserving support for request_fn and blk-mq request-based DM mpath - SCSI and DM multipath persistent reservation fixes that were coordinated with Martin Petersen. - the DM raid target saw the most extensive change this cycle; it now provides reshape and takeover support (by layering ontop of the corresponding MD capabilities) - DAX support for DM core and the linear, stripe and error targets - a DM thin-provisioning block discard vs allocation race fix that addresses potential for corruption - a stable fix for DM verity-fec's block calculation during decode - a few cleanups and fixes to DM core and various targets * tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (73 commits) dm: allow bio-based table to be upgraded to bio-based with DAX support dm snap: add fake origin_direct_access dm stripe: add DAX support dm error: add DAX support dm linear: add DAX support dm: add infrastructure for DAX support dm thin: fix a race condition between discarding and provisioning a block dm btree: fix a bug in dm_btree_find_next_single() dm raid: fix random optimal_io_size for raid0 dm raid: address checkpatch.pl complaints dm: call PR reserve/unreserve on each underlying device sd: don't use the ALL_TG_PT bit for reservations dm: fix second blk_delay_queue() parameter to be in msec units not jiffies dm raid: change logical functions to actually return bool dm raid: use rdev_for_each in status dm raid: use rs->raid_disks to avoid memory leaks on free dm raid: support delta_disks for raid1, fix table output dm raid: enhance reshape check and factor out reshape setup dm raid: allow resize during recovery dm raid: fix rs_is_recovering() to allow for lvextend ...	2016-07-26 17:12:11 -07:00
Toshi Kani	545ed20e6d	dm: add infrastructure for DAX support Change mapped device to implement direct_access function, dm_blk_direct_access(), which calls a target direct_access function. 'struct target_type' is extended to have target direct_access interface. This function limits direct accessible size to the dm_target's limit with max_io_len(). Add dm_table_supports_dax() to iterate all targets and associated block devices to check for DAX support. To add DAX support to a DM target the target must only implement the direct_access function. Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped device supports DAX and is bio based. This new type is used to assure that all target devices have DAX support and remain that way after QUEUE_FLAG_DAX is set in mapped device. At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting DM_TYPE_DAX_BIO_BASED to the type. Any subsequent table load to the mapped device must have the same type, or else it fails per the check in table_load(). Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-20 23:49:49 -04:00
Christoph Hellwig	70246286e9	block: get rid of bio_rw and READA These two are confusing leftover of the old world order, combining values of the REQ_OP_ and REQ_ namespaces. For callers that don't special case we mostly just replace bi_rw with bio_data_dir or op_is_write, except for the few cases where a switch over the REQ_OP_ values makes more sense. Any check for READA is replaced with an explicit check for REQ_RAHEAD. Also remove the READA alias for REQ_RAHEAD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-20 17:37:01 -06:00
Christoph Hellwig	9c72bad1f3	dm: call PR reserve/unreserve on each underlying device So far we tried to rely on the SCSI 'all target ports' bit to register all path, but for many setups this didn't work properly as the different paths are seen as separate initiators to the target instead of multiple ports of the same initiator. Because of that we'll stop setting the 'all target ports' bit in SCSI, and let device mapper handle iterating over the device for each path and register them manually. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-07-18 15:37:35 -04:00
Mike Snitzer	e83068a5fa	dm mpath: add optional "queue_mode" feature Allow a user to specify an optional feature 'queue_mode <mode>' where <mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based, request_fn rq-based, and blk-mq rq-based respectively. If the queue_mode feature isn't specified the default for the "multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y it'll default to mode "mq". This new queue_mode feature introduces the ability for each multipath device to have its own queue_mode (whereas before this feature all multipath devices effectively had to have the same queue_mode). This commit also goes a long way to eliminate the awkward (ab)use of DM_TYPE_*, the associated filter_md_type() and other relatively fragile and difficult to maintain code. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-06-10 15:16:02 -04:00
Mike Snitzer	4cc96131af	dm: move request-based code out to dm-rq.[hc] Add some seperation between bio-based and request-based DM core code. 'struct mapped_device' and other DM core only structures and functions have been moved to dm-core.h and all relevant DM core .c files have been updated to include dm-core.h rather than dm.h DM targets should _never_ include dm-core.h! [block core merge conflict resolution from Stephen Rothwell] Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>	2016-06-10 15:15:44 -04:00
Mike Christie	28a8f0d317	block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	3a5e02ced1	block, drivers: add REQ_OP_FLUSH operation This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	c2df40dfb8	drivers: use req op accessor The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	e6047149db	dm: use bio op accessors Separate the op from the rq_flag_bits and have dm set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	528ec5abe6	dm: pass dm stats data dir instead of bi_rw It looks like dm stats cares about the data direction (READ vs WRITE) and does not need the bio/request flags. Commands like REQ_FLUSH, REQ_DISCARD and REQ_WRITE_SAME are currently always set with REQ_WRITE, so the extra check for REQ_DISCARD in dm_stats_account_io is not needed. This patch has it use the bio and request data_dir helpers instead of accessing the bi_rw/cmd_flags directly. This makes the next patches that remove the operation from the cmd_flags and bi_rw easier, because we will no longer have the REQ_WRITE bit set for operations like discards. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Snitzer	cfae7529b5	dm: remove unused mapped_device argument from free_tio() Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-05-05 15:25:49 -04:00
Mikulas Patocka	072623de1f	dm: fix dm_target_io leak if clone_bio() returns an error Commit `c80914e81e` ("dm: return error if bio_integrity_clone() fails in clone_bio()") changed clone_bio() such that if it does return error then the alloc_tio() created resources (both the bio that was allocated to be a clone and the containing dm_target_io struct) will leak. Fix this by calling free_tio() in __clone_and_map_data_bio()'s clone_bio() error path. Fixes: `c80914e81e` ("dm: return error if bio_integrity_clone() fails in clone_bio()") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-04-11 11:49:09 -04:00
Bryn M. Reeves	98dbc9c6c6	dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request() An "old" (.request_fn) DM 'struct request' stores a pointer to the associated 'struct dm_rq_target_io' in rq->special. dm_requeue_original_request(), previously named dm_requeue_unmapped_original_request(), called dm_unprep_request() to reset rq->special to NULL. But rq_end_stats() would go on to hit a NULL pointer deference because its call to tio_from_request() returned NULL. Fix this by calling rq_end_stats() _before_ dm_unprep_request() Signed-off-by: Bryn M. Reeves <bmr@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `e262f34741` ("dm stats: add support for request-based DM devices") Cc: stable@vger.kernel.org # 4.2+	2016-03-14 17:04:34 -04:00
Mike Snitzer	c80914e81e	dm: return error if bio_integrity_clone() fails in clone_bio() clone_bio() now checks if bio_integrity_clone() returned an error rather than just drop it on the floor. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:10 -05:00
Bob Liu	e233d800a9	dm: drop unnecessary assignment of md->queue md->queue and q are the same thing in dm_old_init_request_queue() and dm_mq_init_request_queue(). Also drop the temporary 'struct request_queue *q' in dm_old_init_request_queue(). Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:07 -05:00
Mike Snitzer	032482fda4	dm: reorder 'struct mapped_device' members to fix alignment and holes Saves 16 bytes by eliminating 4 4byte holes but more importantly: numerous members that crossed cachelines were fixed. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:07 -05:00
Mike Snitzer	1d3aa6f683	dm: remove dummy definition of 'struct dm_table' Change the map pointer in 'struct mapped_device' from 'struct dm_table __rcu ' to 'void __rcu ' to avoid the need for the dummy definition. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:06 -05:00
Mike Snitzer	115485e83f	dm: add 'dm_numa_node' module parameter Allows user to control which NUMA node the memory for DM device structures (e.g. mapped_device, request_queue, gendisk, blk_mq_tag_set) is allocated from. Defaults to NUMA_NO_NODE (-1). Allowable range is from -1 until the last online NUMA node id. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-03-10 17:12:06 -05:00
Mike Snitzer	591ddcfc4b	dm: allow immutable request-based targets to use blk-mq pdu This will allow DM multipath to use a portion of the blk-mq pdu space for target data (e.g. struct dm_mpath_io). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:37 -05:00
Mike Snitzer	30187e1d48	dm: rename target's per_bio_data_size to per_io_data_size Request-based DM will also make use of per_bio_data_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:37 -05:00
Mike Snitzer	eca7ee6dc0	dm: distinquish old .request_fn (dm-old) vs dm-mq request-based DM Rename various methods to have either a "dm_old" or "dm_mq" prefix. Improve code comments to assist with understanding the duality of code that handles both "dm_old" and "dm_mq" cases. It is no much easier to quickly look at the code and _know_ that a given method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:34:33 -05:00
Mike Snitzer	c5248f79f3	dm: remove support for stacking dm-mq on .request_fn device(s) Remove all fiddley code that propped up this support for a blk-mq request-queue ontop of all .request_fn devices. Testing has proven this niche request-based dm-mq mode to be buggy, when testing fault tolerance with DM multipath, and there is no point trying to preserve it. Should help improve efficiency of pure dm-mq code and make code maintenance less delicate. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:33:46 -05:00
Mike Snitzer	818c5f3bef	dm: fix a couple locking issues with use of block interfaces old_stop_queue() was checking blk_queue_stopped() without holding the q->queue_lock. dm_requeue_original_request() needed to check blk_queue_stopped(), with q->queue_lock held, before calling blk_mq_kick_requeue_list(). And a side-effect of that change is start_queue() must also call blk_mq_kick_requeue_list(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 22:33:09 -05:00
Mike Snitzer	1c357a1e86	dm: allocate blk_mq_tag_set rather than embed in mapped_device The blk_mq_tag_set is only needed for dm-mq support. There is point wasting space in 'struct mapped_device' for non-dm-mq devices. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> # check kzalloc return	2016-02-22 12:07:14 -05:00
Mike Snitzer	faad87df4b	dm: add 'dm_mq_nr_hw_queues' and 'dm_mq_queue_depth' module params Allow user to change these values via module params or sysfs. 'dm_mq_nr_hw_queues' defaults to 1 (max 32). 'dm_mq_queue_depth' defaults to 2048 (up from 64, which proved far too small under moderate sized workloads -- the dm-multipath device would continuously block waiting for tags (requests) to become available). The maximum is BLK_MQ_MAX_DEPTH (currently 10240). Keep in mind the total number of pre-allocated requests per request-based dm-mq device is 'dm_mq_nr_hw_queues' * 'dm_mq_queue_depth' (currently 2048). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 12:07:10 -05:00
Mike Snitzer	c91852ff08	dm: optimize dm_request_fn() DM multipath is the only request-based DM target -- which only supports tables with a single target that is immutable. Leverage this fact in dm_request_fn(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:22 -05:00
Mike Snitzer	16f122661d	dm: optimize dm_mq_queue_rq() DM multipath is the only dm-mq target. But that aside, request-based DM only supports tables with a single target that is immutable. Leverage this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in the mapped_device when the table was made active. This saves the need to even take the read-side of the SRCU via dm_{get,put}_live_table. If the active DM table does not have an immutable target (e.g. "error" target was swapped in) then fallback to the slow-path where the target is looked up from the live table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:22 -05:00
Mike Snitzer	e522c03905	dm: cleanup dm_any_congested() The request-based DM support for checking queue congestion doesn't require access to the live DM table. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:20 -05:00
Mike Snitzer	ae6ad75e5c	dm: remove unused dm_get_rq_mapinfo() Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-22 11:06:20 -05:00
Mike Snitzer	6acfe68bac	dm: fix excessive dm-mq context switching Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower than if an underlying null_blk device were used directly. One of the reasons for this drop in performance is that blk_insert_clone_request() was calling blk_mq_insert_request() with @async=true. This forced the use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues which ushered in ping-ponging between process context (fio in this case) and kblockd's kworker to submit the cloned request. The ftrace function_graph tracer showed: kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... kworker-2013 => fio-12190 fio-12190 => kworker-2013 ... Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to _not_ use kblockd to submit the cloned requests isn't enough to eliminate the observed context switches. In addition to this dm-mq specific blk-core fix, there are 2 DM core fixes to dm-mq that (when paired with the blk-core fix) completely eliminate the observed context switching: 1) don't blk_mq_run_hw_queues in blk-mq request completion Motivated by desire to reduce overhead of dm-mq, punting to kblockd just increases context switches. In my testing against a really fast null_blk device there was no benefit to running blk_mq_run_hw_queues() on completion (and no other blk-mq driver does this). So hopefully this change doesn't induce the need for yet another revert like commit `621739b00e` ! 2) use blk_mq_complete_request() in dm_complete_request() blk_complete_request() doesn't offer the traditional q->mq_ops vs .request_fn branching pattern that other historic block interfaces do (e.g. blk_get_request). Using blk_mq_complete_request() for blk-mq requests is important for performance. It should be noted that, like blk_complete_request(), blk_mq_complete_request() doesn't natively handle partial completions -- but the request-based DM-multipath target does provide the required partial completion support by dm.c:end_clone_bio() triggering requeueing of the request via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE. dm-mq fix #2 is _much_ more important than #1 for eliminating the context switches. Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475 After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472 With these changes multithreaded async read IOPs improved from ~950K to ~1350K for this dm-mq stacked on null_blk test-case. The raw read IOPs of the underlying null_blk device for the same workload is ~1950K. Fixes: `7fb4898e0` ("block: add blk-mq support to blk_insert_cloned_request()") Fixes: `bfebd1cdb` ("dm: add full blk-mq support to request-based DM") Cc: stable@vger.kernel.org # 4.1+ Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk>	2016-02-22 11:04:40 -05:00
Mike Snitzer	956a402580	dm: fix sparse "unexpected unlock" warnings in ioctl code Rename dm_get_live_table_for_ioctl to dm_grab_bdev_for_ioctl and have it do the dm_{get,put}_live_table() rather than split those operations. The dm_grab_bdev_for_ioctl() callers only care about the block_device associated with a singleton DM device so there isn't any need to retain a reference to the live DM table. It is sufficient to: 1) dm_get_live_table() 2) bdgrab() the bdev associated with the singleton table's target 3) dm_put_live_table() 4) bdput() the bdev Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:51 -05:00
Mike Snitzer	664820265d	dm: do not return target from dm_get_live_table_for_ioctl() None of the callers actually used the returned target. Also, just reuse bdev pointer passed to dm_blk_ioctl(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:51 -05:00
Mike Snitzer	4328daa2e7	dm: fix dm_rq_target_io leak on faults with .request_fn DM w/ blk-mq paths Using request-based DM mpath configured with the following stacking (.request_fn DM mpath ontop of scsi-mq paths): echo Y > /sys/module/scsi_mod/parameters/use_blk_mq echo N > /sys/module/dm_mod/parameters/use_blk_mq 'struct dm_rq_target_io' would leak if a request is requeued before a blk-mq clone is allocated (or fails to allocate). free_rq_tio() wasn't being called. kmemleak reported: unreferenced object 0xffff8800b90b98c0 (size 112): comm "kworker/7:1H", pid 5692, jiffies 4295056109 (age 78.589s) hex dump (first 32 bytes): 00 d0 5c 2c 03 88 ff ff 40 00 bf 01 00 c9 ff ff ..\,....@....... e0 d9 b1 34 00 88 ff ff 00 00 00 00 00 00 00 00 ...4............ backtrace: [<ffffffff81672b6e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811dbb63>] kmem_cache_alloc+0xc3/0x1e0 [<ffffffff8117eae5>] mempool_alloc_slab+0x15/0x20 [<ffffffff8117ec1e>] mempool_alloc+0x6e/0x170 [<ffffffffa00029ac>] dm_old_prep_fn+0x3c/0x180 [dm_mod] [<ffffffff812fbd78>] blk_peek_request+0x168/0x290 [<ffffffffa0003e62>] dm_request_fn+0xb2/0x1b0 [dm_mod] [<ffffffff812f66e3>] __blk_run_queue+0x33/0x40 [<ffffffff812f9585>] blk_delay_work+0x25/0x40 [<ffffffff81096fff>] process_one_work+0x14f/0x3d0 [<ffffffff81097715>] worker_thread+0x125/0x4b0 [<ffffffff8109ce88>] kthread+0xd8/0xf0 [<ffffffff8167cb8f>] ret_from_fork+0x3f/0x70 [<ffffffffffffffff>] 0xffffffffffffffff crash> struct -o dm_rq_target_io struct dm_rq_target_io { ... } SIZE: 112 Fixes: `e5863d9ad7` ("dm: allocate requests in target when stacking on blk-mq devices") Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2016-02-21 20:27:50 -05:00
Mike Snitzer	647a20d5ca	dm: do not reuse dm_blk_ioctl block_device input as local variable (Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for targets' .prepare_ioctl to fail if they go on to check the bdev for !NULL. Fixes: `e56f81e0b0` ("dm: refactor ioctl handling") Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-11-17 14:18:49 -05:00
Junichi Nomura	5bbbfdf685	dm: fix ioctl retry termination with signal dm-mpath retries ioctl, when no path is readily available and the device is configured to queue I/O in such a case. If you want to stop the retry before multipathd decides to turn off queueing mode, you could send signal for the process to exit from the loop. However the check of fatal signal has not carried along when commit `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") moved the loop from dm-mpath to dm core. As a result, we can't terminate such a process in the retry loop. Easy reproducer of the situation is: # dmsetup create mp --table '0 1024 multipath 0 0 0 0' # dmsetup message mp 0 'queue_if_no_path' # sg_inq /dev/mapper/mp then you should be able to terminate sg_inq by pressing Ctrl+C. Fixes: `6c182cd88d` ("dm mpath: fix ioctl deadlock when no paths") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-11-17 14:04:32 -05:00
Jens Axboe	dece16353e	block: change ->make_request_fn() and users to return a queue cookie No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>	2015-11-07 10:40:46 -07:00
Linus Torvalds	e0700ce709	- Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJWOp04AAoJEMUj8QotnQNaFLsH/AhMEH/jI1ObOfy4J1Wy4rOx ujJT91uS/s0H3pc9cGKQYnuGpFkX6WWU4wMiabIyiTn4sAsoXaflfIGutivLiDJr HfecrMrGZgnP4ZlpPPB02BmlxFbcPW8yzAU4ma38xBgQ+Pu30RO/HkvX/2vKOppG qwPop/XsNxq3KXgFGM44ToytM6c/MPGluhuvOwbaacAO1HviMuen9qsVjk4kwcf3 jGYTbEPHATxyu5/6oKDTkQTYhzdwg3B2qHCiKMGw3l1kXhaQLFcaOivOLV8Sf3xh bj1070pkGe9OpqaVzMnwDtJ8rnsBl/Nt4wj9oiQPxbX71GYZAmcMIYn9WEkcKFI= =AR2D -----END PGP SIGNATURE----- Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "Smaller set of DM changes for this merge. I've based these changes on Jens' for-4.4/reservations branch because the associated DM changes required it. - Revert a dm-multipath change that caused a regression for unprivledged users (e.g. kvm guests) that issued ioctls when a multipath device had no available paths. - Include Christoph's refactoring of DM's ioctl handling and add support for passing through persistent reservations with DM multipath. - All other changes are very simple cleanups" * tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm switch: simplify conditional in alloc_region_table() dm delay: document that offsets are specified in sectors dm delay: capitalize the start of an delay_ctr() error message dm delay: Use DM_MAPIO macros instead of open-coded equivalents dm linear: remove redundant target name from error messages dm persistent data: eliminate unnecessary return values dm: eliminate unused "bioset" process for each bio-based DM device dm: convert ffs to __ffs dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() dm: add support for passing through persistent reservations dm: refactor ioctl handling Revert "dm mpath: fix stalls when handling invalid ioctls" dm: initialize non-blk-mq queue data before queue is used	2015-11-04 21:19:53 -08:00
Linus Torvalds	527d1529e3	Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk	2015-11-04 20:51:48 -08:00
Mikulas Patocka	dbba42d8a9	dm: eliminate unused "bioset" process for each bio-based DM device Commit `54efd50bfd` ("block: make generic_make_request handle arbitrarily sized bios") makes it possible for block devices to process large bios. In doing so that commit allocates a new queue->bio_split bioset for each block device, this bioset is used for allocating bios when the driver needs to split large bios. Each bioset allocates a workqueue process, thus the above commit increases the number of processes allocated per block device. DM doesn't need the queue->bio_split bioset, thus we can deallocate it. This reduces the number of allocated processes per bio-based DM device from 3 to 2. Also remove the call to blk_queue_split(), it is not needed because DM does its own splitting. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:02 -04:00
Julia Lawall	6f65985e26	dm: drop NULL test before kmem_cache_destroy() and mempool_destroy() Remove DM's unneeded NULL tests before calling these destroy functions, now that they check for NULL, thanks to these v4.3 commits: `3942d2991` ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()") `4e3ca3e03` ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()") The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\\|mempool_destroy\\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:06:00 -04:00
Christoph Hellwig	71cdb6978a	dm: add support for passing through persistent reservations This adds support to pass through persistent reservation requests similar to the existing ioctl handling, and with the same limitations, e.g. devices may only have a single target attached. This is mostly intended for multipathing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Christoph Hellwig	e56f81e0b0	dm: refactor ioctl handling This moves the call to blkdev_ioctl and the argument checking to DM core code, and only leaves a callout to find the block device to operate on in the targets. This simplifies the code and allows us to pass through ioctl-like command using other methods in the next patch. Also split out a helper around calling the prepare_ioctl method that will be reused for persistent reservation handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-31 19:05:59 -04:00
Mikulas Patocka	ad5f498f61	dm: initialize non-blk-mq queue data before queue is used Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") moves the initialization of the fields backing_dev_info.congested_fn, backing_dev_info.congested_data and queuedata from the function dm_init_md_queue (that is called when the device is created) to dm_init_old_md_queue (that is called after the device type is determined). There is no locking when accessing these variables, thus it is possible for other parts of the kernel to briefly see this data in a transient state (e.g. queue->backing_dev_info.congested_fn initialized and md->queue->backing_dev_info.congested_data uninitialized, resulting in passing an incorrect parameter to the function dm_any_congested). This queue data is left initialized for blk-mq devices even though they that don't use it. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v4.1+	2015-10-29 22:09:40 -04:00
Dan Williams	9609b9942b	md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:37 -06:00
Junichi Nomura	50887bd139	dm: fix request-based dm error reporting end_clone_bio() is a endio callback for clone bio and should check and save the clone's bi_error for error reporting. However, `4246a0b63b` ("block: add a bi_error field to struct bio") changed the function to check the original bio's bi_error, which is 0. Without this fix, clone's error is ignored and reported to the original request as success. Thus data corruption will be observed. Fixes: `4246a0b63b` ("block: add a bi_error field to struct bio") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-10-06 10:08:16 -04:00
Junichi Nomura	2a708cff93	dm: fix AB-BA deadlock in __dm_destroy() __dm_destroy() takes io_barrier SRCU lock (dm_get_live_table) and suspend_lock in reverse order. Doing so can cause AB-BA deadlock: __dm_destroy dm_swap_table --------------------------------------------------- mutex_lock(suspend_lock) dm_get_live_table() srcu_read_lock(io_barrier) dm_sync_table() synchronize_srcu(io_barrier) .. waiting for dm_put_live_table() mutex_lock(suspend_lock) .. waiting for suspend_lock Fix this by taking the locks in proper order. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Fixes: `ab7c7bb6f4` ("dm: hold suspend_lock while suspending device during device deletion") Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-10-01 10:40:20 -04:00
Linus Torvalds	1e1a4e8f43	Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper update from Mike Snitzer: - a couple small cleanups in dm-cache, dm-verity, persistent-data's dm-btree, and DM core. - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio prison cells - a 4.2-stable fix that adds feature reporting for the dm-stats features added in 4.2 - improve DM-snapshot to not invalidate the on-disk snapshot if snapshot device write overflow occurs; but a write overflow triggered through the origin device will still invalidate the snapshot. - optimize DM-thinp's async discard submission a bit now that late bio splitting has been included in block core. - switch DM-cache's SMQ policy lock from using a mutex to a spinlock; improves performance on very low latency devices (eg. NVMe SSD). - document DM RAID 4/5/6's discard support [ I did not pull the slab changes, which weren't appropriate for this tree, and weren't obviously the right thing to do anyway. At the very least they need some discussion and explanation before getting merged. Because not pulling the actual tagged commit but doing a partial pull instead, this merge commit thus also obviously is missing the git signature from the original tag ] * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix use after freeing migrations dm cache: small cleanups related to deferred prison cell cleanup dm cache: fix leaking of deferred bio prison cells dm raid: document RAID 4/5/6 discard support dm stats: report precise_timestamps and histogram in @stats_list output dm thin: optimize async discard submission dm snapshot: don't invalidate on-disk image on snapshot write overflow dm: remove unlikely() before IS_ERR() dm: do not override error code returned from dm_get_device() dm: test return value for DM_MAPIO_SUBMITTED dm verity: remove unused mempool dm cache: move wake_waker() from free_migrations() to where it is needed dm btree remove: remove unused function get_nr_entries() dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling() dm cache policy smq: change the mutex to a spinlock	2015-09-02 16:35:26 -07:00
Linus Torvalds	1081230b74	Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...	2015-09-02 13:10:25 -07:00
Kent Overstreet	8ae126660f	block: kill merge_bvec_fn() completely As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:57 -06:00
Kent Overstreet	54efd50bfd	block: make generic_make_request handle arbitrarily sized bios The way the block layer is currently written, it goes to great lengths to avoid having to split bios; upper layer code (such as bio_add_page()) checks what the underlying device can handle and tries to always create bios that don't need to be split. But this approach becomes unwieldy and eventually breaks down with stacked devices and devices with dynamic limits, and it adds a lot of complexity. If the block layer could split bios as needed, we could eliminate a lot of complexity elsewhere - particularly in stacked drivers. Code that creates bios can then create whatever size bios are convenient, and more importantly stacked drivers don't have to deal with both their own bio size limitations and the limitations of the (potentially multiple) devices underneath them. In the future this will let us delete merge_bvec_fn and a bunch of other code. We do this by adding calls to blk_queue_split() to the various make_request functions that need it - a few can already handle arbitrary size bios. Note that we add the call _after_ any call to blk_queue_bounce(); this means that blk_queue_split() and blk_recalc_rq_segments() don't need to be concerned with bouncing affecting segment merging. Some make_request_fn() callbacks were simple enough to audit and verify they don't need blk_queue_split() calls. The skipped ones are: * nfhd_make_request (arch/m68k/emu/nfblock.c) * axon_ram_make_request (arch/powerpc/sysdev/axonram.c) * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c) * brd_make_request (ramdisk - drivers/block/brd.c) * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c) * loop_make_request * null_queue_bio * bcache's make_request fns Some others are almost certainly safe to remove now, but will be left for future patches. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-08-13 12:31:33 -06:00
Mikulas Patocka	ab37844d61	dm: test return value for DM_MAPIO_SUBMITTED In properly written code we should not assume that DM_MAPIO_SUBMITTED is zero. We should test the return value for DM_MAPIO_SUBMITTED rather than testing it for zero. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-08-12 11:32:20 -04:00
Mike Snitzer	bd4aaf8f9b	dm: fix dm_merge_bvec regression on 32 bit systems A DM regression on 32 bit systems was reported against v4.2-rc3 here: https://lkml.org/lkml/2015/7/29/401 Fix this by reverting both commit `1c220c69` ("dm: fix casting bug in dm_merge_bvec()") and `148e51ba` ("dm: improve documentation and code clarity in dm_merge_bvec"). This combined revert is done to eliminate the possibility of a partial revert in stable@ kernels. In hindsight the correct fix, at the time `1c220c69` was applied to fix the regression that `148e51ba` introduced, should've been to simply revert `148e51ba`. Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Tested-by: Adam Williamson <awilliam@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+	2015-08-03 22:49:59 -04:00
Christoph Hellwig	4246a0b63b	block: add a bi_error field to struct bio Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-07-29 08:55:15 -06:00
Mikulas Patocka	b06075a98d	dm: fix use after free crash due to incorrect cleanup sequence Linux 4.2-rc1 Commit `0f20972f7b` ("dm: factor out a common cleanup_mapped_device()") moved a common cleanup code to a separate function. Unfortunately, that commit incorrectly changed the order of cleanup, so that it destroys the mapped_device's srcu structure 'io_barrier' before destroying its workqueue. The function that is executed on the workqueue (dm_wq_work) uses the srcu structure, thus it may use it after being freed. That results in a crash in the LVM test suite's mirror-vgreduce-removemissing.sh test. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `0f20972f7b` ("dm: factor out a common cleanup_mapped_device()") Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-07-13 09:14:11 -04:00
Mike Snitzer	621739b00e	Revert "dm: only run the queue on completion if congested or no requests pending" This reverts commit `9a0e609e3f`. (Resolved a conflict during revert due to commit `bfebd1cdb4` that came after) This revert is motivated by a couple failure reports on request-based DM multipath testbeds: 1) Netapp reported that their multipath fault injection test under heavy IO load can stall longer than 300 seconds. 2) IBM reported elevated lock contention in their testbed (likely due to increased back pressure due to IO not being dispatched as quickly): https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+	2015-07-08 16:16:07 -04:00
Linus Torvalds	22165fa798	- Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn +WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA= =QY48 -----END PGP SIGNATURE----- Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: "Apologies for not pressing this request-based DM partial completion issue further, it was an oversight on my part. We'll have to get it fixed up properly and revisit for a future release. - Revert block and DM core changes the removed request-based DM's ability to handle partial request completions -- otherwise with the current SCSI LLDs these changes could lead to silent data corruption. - Fix two DM version bumps that were missing from the initial 4.2 DM pull request (enabled userspace lvm2 to know certain changes have been made)" * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache policy smq: fix "default" version to be 1.4.0 dm: bump the ioctl version to 4.32.0 Revert "block, dm: don't copy bios for request clones" Revert "dm: do not allocate any mempools for blk-mq request-based DM"	2015-06-26 12:35:01 -07:00
Mike Snitzer	78d8e58a08	Revert "block, dm: don't copy bios for request clones" This reverts commit `5f1b670d0b`. Justification for revert as reported in this dm-devel post: https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html this change should not be pushed to mainline yet. Firstly, Christoph has a newer version of the patch that fixes silent data corruption problem: https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html And the new version still depends on LLDDs to always complete requests to the end when error happens, while block API doesn't enforce such a requirement. If the assumption is ever broken, the inconsistency between request and bio (e.g. rq->__sector and rq->bio) will cause silent data corruption: https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-26 10:11:58 -04:00
Mike Snitzer	4e6e36c371	Revert "dm: do not allocate any mempools for blk-mq request-based DM" This reverts commit `cbc4e3c135`. Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-26 10:11:07 -04:00
Linus Torvalds	6597ac8a51	- DM core cleanups - blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy - smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0 4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww= =DKCC -----END PGP SIGNATURE----- Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM core cleanups: * blk-mq request-based DM no longer uses any mempools now that partial completions are no longer handled as part of cloned requests - DM raid cleanups and support for MD raid0 - DM cache core advances and a new stochastic-multi-queue (smq) cache replacement policy * smq is the new default dm-cache policy - DM thinp cleanups and much more efficient large discard support - DM statistics support for request-based DM and nanosecond resolution timestamps - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt * tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits) dm stats: add support for request-based DM devices dm stats: collect and report histogram of IO latencies dm stats: support precise timestamps dm stats: fix divide by zero if 'number_of_areas' arg is zero dm cache: switch the "default" cache replacement policy from mq to smq dm space map metadata: fix occasional leak of a metadata block on resize dm thin metadata: fix a race when entering fail mode dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages dm thin: range discard support dm thin metadata: add dm_thin_remove_range() dm thin metadata: add dm_thin_find_mapped_range() dm btree: add dm_btree_remove_leaves() dm stats: Use kvfree() in dm_kvfree() dm cache: age and write back cache entries even without active IO dm cache: prefix all DMERR and DMINFO messages with cache device name dm cache: add fail io mode and needs_check flag dm cache: wake the worker thread every time we free a migration object dm cache: add stochastic-multi-queue (smq) policy dm cache: boost promotion of blocks that will be overwritten dm cache: defer whole cells ...	2015-06-25 16:34:39 -07:00
Linus Torvalds	e4bc13adfd	Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach\|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...	2015-06-25 16:00:17 -07:00
Linus Torvalds	bfffa1cc9d	Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...	2015-06-25 14:29:53 -07:00
Mikulas Patocka	e262f34741	dm stats: add support for request-based DM devices This makes it possible to use dm stats with DM multipath. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-06-17 12:40:41 -04:00
Tejun Heo	4452226ea2	writeback: move backing_dev_info->state into bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-06-02 08:33:34 -06:00
Mike Snitzer	0f20972f7b	dm: factor out a common cleanup_mapped_device() Introduce a single common method for cleaning up a DM device's mapped_device. No functional change, just eliminates duplication of delicate mapped_device cleanup code. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:58 -04:00
Mike Snitzer	2d76fff18f	dm: cleanup methods that requeue requests More often than not a request that is requeued _is_ mapped (meaning the clone request is allocated and clone->q is initialized). Rename dm_requeue_unmapped_original_request() to avoid potential confusion due to function name containing "unmapped". Also, remove dm_requeue_unmapped_request() since callers can easily call the dm_requeue_original_request() directly. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:57 -04:00
Mike Snitzer	cbc4e3c135	dm: do not allocate any mempools for blk-mq request-based DM Do not allocate the io_pool mempool for blk-mq request-based DM (DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools(). Also refine __bind_mempools() to have more precise awareness of which mempools each type of DM device uses -- avoids mempool churn when reloading DM tables (particularly for DM_TYPE_REQUEST_BASED). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 14:18:57 -04:00
Mike Snitzer	183f7802e7	Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2	2015-05-29 14:17:16 -04:00
Joe Thornber	1c220c69ce	dm: fix casting bug in dm_merge_bvec() dm_merge_bvec() was originally added in f6fccb ("dm: introduce merge_bvec_fn"). In that commit a value in sectors is converted to bytes using << 9, and then assigned to an int. This code made assumptions about the value of BIO_MAX_SECTORS. A later commit 148e51 ("dm: improve documentation and code clarity in dm_merge_bvec") was meant to have no functional change but it removed the use of BIO_MAX_SECTORS in favor of using queue_max_sectors(). At this point the cast from sector_t to int resulted in a zero value. The fallout being dm_merge_bvec() would only allow a single page to be added to a bio. This interim fix is minimal for the benefit of stable@ because the more comprehensive cleanup of passing a sector_t to all DM targets' merge function will impact quite a few DM targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.19+	2015-05-29 13:41:16 -04:00
Mike Snitzer	e5d8de32cc	dm: fix false warning in free_rq_clone() for unmapped requests When stacking request-based dm device on non blk-mq device and device-mapper target could not map the request (error target is used, multipath target with all paths down, etc), the WARN_ON_ONCE() in free_rq_clone() will trigger when it shouldn't. The warning was added by commit `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request"). But free_rq_clone() with clone->q == NULL is valid usage for the case where dm_kill_unmapped_request() initiates request cleanup. Fix this false warning by just removing the WARN_ON -- it only generated false positives and was never useful in catching the intended case (completing clone request not being mapped e.g. clone->q being NULL). Fixes: `aa6df8d` ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-29 11:07:36 -04:00
Mike Snitzer	45714fbed4	dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the DM blk-mq device's .queue_rq. This cleans up the previous convoluted handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK (even though it wasn't) and then run blk_mq_requeue_request() followed by blk_mq_kick_requeue_list(). Also, document that DM blk-mq ontop of old request_fn devices cannot fail in clone_rq() since the clone request is preallocated as part of the pdu. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-27 17:37:23 -04:00
Junichi Nomura	3a1407559a	dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED When stacking request-based DM on blk_mq device, request cloning and remapping are done in a single call to target's clone_and_map_rq(). The clone is allocated and valid only if clone_and_map_rq() returns DM_MAPIO_REMAPPED. The "IS_ERR(clone)" check in map_request() does not cover all the !DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices are not ready or unavailable, clone_and_map_rq() may return DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this by explicitly checking for a return that is not DM_MAPIO_REMAPPED in map_request(). Without this fix, DM core may call setup_clone() for a NULL clone and oops like this: BUG: unable to handle kernel NULL pointer dereference at 0000000000000068 IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137 ... CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1 ... Call Trace: [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod] [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24 [<ffffffff81071fdd>] kthread+0xe6/0xee [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b [<ffffffff814c2d98>] ret_from_fork+0x58/0x90 [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b Fixes: `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.0+	2015-05-27 09:48:51 -04:00
Junichi Nomura	4ae9944d13	dm: run queue on re-queue Without kicking queue, requeued request may stay forever in the queue if there are no other I/O activities to the device. The original error had been in v2.6.39 with commit `7eaceaccab` ("block: remove per-queue plugging"), which replaced conditional plugging by periodic runqueue. Commit `9d1deb83d4` in v4.1-rc1 removed the periodic runqueue and the problem started to manifest. Fixes: `9d1deb83d4` ("dm: don't schedule delayed run of the queue if nothing to do") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-05-26 09:57:36 -04:00
Christoph Hellwig	5f1b670d0b	block, dm: don't copy bios for request clones Currently dm-multipath has to clone the bios for every request sent to the lower devices, which wastes cpu cycles and ties down memory. This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio to not complete bios attached to a request, which we set on clone requests similar to bios in a flush sequence. With this change I/O errors on a path failure only get propagated to dm-multipath, which can then either resubmit the I/O or complete the bios on the original request. I've done some basic testing of this on a Linux target with ALUA support, and it survives path failures during I/O nicely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-05-22 08:58:57 -06:00
Mike Snitzer	aa6df8dd28	dm: fix free_rq_clone() NULL pointer when requeueing unmapped request Commit `022333427a` ("dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check before testing clone->q->mq_ops. It was an oversight to discontinue that check for 1 of the 2 use-cases for free_rq_clone(): 1) free_rq_clone() called when an unmapped original request is requeued 2) free_rq_clone() called in the request-based IO completion path The clone->q check made sense for case #1 but not for #2. However, we cannot just reinstate the check as it'd mask a serious bug in the IO completion case #2 -- no in-flight request should have an uninitialized request_queue (basic block layer refcounting _should_ ensure this). The NULL pointer seen for case #1 is detailed here: https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html Fix this free_rq_clone() NULL pointer by simply checking if the mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is blk-mq) rather than checking clone->q->mq_ops. This avoids the need to dereference clone->q, but a WARN_ON_ONCE is added to let us know if an uninitialized clone request is being completed. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
Christoph Hellwig	3e6180f0c8	dm: only initialize the request_queue once Commit `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") didn't properly account for the need to short-circuit re-initializing DM's blk-mq request_queue if it was already initialized. Otherwise, reloading a blk-mq request-based DM table (either manually or via multipathd) resulted in errors, see: https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html Fix is to only initialize the request_queue on the initial table load (when the mapped_device type is assigned). This is better than having dm_init_request_based_blk_mq_queue() return early if the queue was already initialized because it elevates the constraint to a more meaningful location in DM core. As such the pre-existing early return in dm_init_request_based_queue() can now be removed. Fixes: `bfebd1cdb4` ("dm: add full blk-mq support to request-based DM") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-30 10:25:21 -04:00
Sami Tolvanen	65ff5b7ddf	dm verity: add error handling modes for corrupted blocks Add device specific modes to dm-verity to specify how corrupted blocks should be handled. The following modes are defined: - DM_VERITY_MODE_EIO is the default behavior, where reading a corrupted block results in -EIO. - DM_VERITY_MODE_LOGGING only logs corrupted blocks, but does not block the read. - DM_VERITY_MODE_RESTART calls kernel_restart when a corrupted block is discovered. In addition, each mode sends a uevent to notify userspace of corruption and to allow further recovery actions. The driver defaults to previous behavior (DM_VERITY_MODE_EIO) and other modes can be enabled with an additional parameter to the verity table. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:22 -04:00
Mike Snitzer	17e149b8f7	dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr Request-based DM's blk-mq support defaults to off; but a user can easily change the default using the dm_mod.use_blk_mq module/boot option. Also, you can check what mode a given request-based DM device is using with: cat /sys/block/dm-X/dm/use_blk_mq This change enabled further cleanup and reduced work (e.g. the md->io_pool and md->rq_pool isn't created if using blk-mq). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:17 -04:00
Mike Snitzer	022333427a	dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq dm_mq_queue_rq() is in atomic context so care must be taken to not sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations and dm-mpath's call to blk_get_request(). In the future the bioset allocations will hopefully go away (by removing support for partial completions of bios in a cloned request). Also prepare for supporting DM blk-mq ontop of old-style request_fn device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread will still be used to queue work if blk-mq is used ontop of old-style request_fn device(s). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:17 -04:00
Mike Snitzer	bfebd1cdb4	dm: add full blk-mq support to request-based DM Commit `e5863d9ad` ("dm: allocate requests in target when stacking on blk-mq devices") served as the first step toward fully utilizing blk-mq in request-based DM -- it enabled stacking an old-style (request_fn) request_queue ontop of the underlying blk-mq device(s). That first step didn't improve performance of DM multipath ontop of fast blk-mq devices (e.g. NVMe) because the top-level old-style request_queue was severely limited by the queue_lock. The second step offered here enables stacking a blk-mq request_queue ontop of the underlying blk-mq device(s). This unlocks significant performance gains on fast blk-mq devices, Keith Busch tested on his NVMe testbed and offered this really positive news: "Just providing a performance update. All my fio tests are getting roughly equal performance whether accessed through the raw block device or the multipath device mapper (~470k IOPS). I could only push ~20% of the raw iops through dm before this conversion, so this latest tree is looking really solid from a performance standpoint." Signed-off-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Keith Busch <keith.busch@intel.com>	2015-04-15 12:10:16 -04:00
Mike Snitzer	0ce65797a7	dm: impose configurable deadline for dm_request_fn's merge heuristic Otherwise, for sequential workloads, the dm_request_fn can allow excessive request merging at the expense of increased service time. Add a per-device sysfs attribute to allow the user to control how long a request, that is a reasonable merge candidate, can be queued on the request queue. The resolution of this request dispatch deadline is in microseconds (ranging from 1 to 100000 usecs), to set a 20us deadline: echo 20 > /sys/block/dm-7/dm/rq_based_seq_io_merge_deadline The dm_request_fn's merge heuristic and associated extra accounting is disabled by default (rq_based_seq_io_merge_deadline is 0). This sysfs attribute is not applicable to bio-based DM devices so it will only ever report 0 for them. By allowing a request to remain on the queue it will block others requests on the queue. But introducing a short dequeue delay has proven very effective at enabling certain sequential IO workloads on really fast, yet IOPS constrained, devices to build up slightly larger IOs -- yielding 90+% throughput improvements. Having precise control over the time taken to wait for larger requests to build affords control beyond that of waiting for certain IO sizes to accumulate (which would require a deadline anyway). This knob will only ever make sense with sequential IO workloads and the particular value used is storage configuration specific. Given the expected niche use-case for when this knob is useful it has been deemed acceptable to expose this relatively crude method for crafting optimal IO on specific storage -- especially given the solution is simple yet effective. In the context of DM multipath, it is advisable to tune this sysfs attribute to a value that offers the best performance for the common case (e.g. if 4 paths are expected active, tune for that; if paths fail then performance may be slightly reduced). Alternatives were explored to have request-based DM autotune this value (e.g. if/when paths fail) but they were quickly deemed too fragile and complex to warrant further design and development time. If this problem proves more common as faster storage emerges we'll have to look at elevating a generic solution into the block core. Tested-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:15 -04:00
Mike Snitzer	de3ec86dff	dm: don't start current request if it would've merged with the previous Request-based DM's dm_request_fn() is so fast to pull requests off the queue that steps need to be taken to promote merging by avoiding request processing if it makes sense. If the current request would've merged with previous request let the current request stay on the queue longer. Suggested-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:14 -04:00
Mike Snitzer	d548b34b06	dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms Commit `7eaceaccab` ("block: remove per-queue plugging") didn't justify DM's use of a 100ms delay; such an extended delay is a liability when there is reason to re-kick the queue. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:13 -04:00
Mike Snitzer	9d1deb83d4	dm: don't schedule delayed run of the queue if nothing to do In request-based DM's dm_request_fn(), if blk_peek_request() returns NULL just return. Avoids unnecessary blk_delay_queue(). Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:13 -04:00
Mike Snitzer	9a0e609e3f	dm: only run the queue on completion if congested or no requests pending On really fast storage it can be beneficial to delay running the request_queue to allow the elevator more opportunity to merge requests. Otherwise, it has been observed that requests are being sent to q->request_fn much quicker than is ideal on IOPS-bound backends. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:10:12 -04:00
Mike Snitzer	ff36ab3458	dm: remove request-based logic from make_request_fn wrapper The old dm_request() method used for q->make_request_fn had a branch for request-based DM support but it isn't needed given that dm_init_request_based_queue() sets it to the standard blk_queue_bio() anyway. Cleanup dm_init_md_queue() to be DM device-type agnostic and have dm_setup_md_queue() properly finish queue setup based on DM device-type (bio-based vs request-based). A followup block patch can be made to remove the export for blk_queue_bio() now that DM no longer calls it directly. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-04-15 12:08:48 -04:00
Mike Snitzer	d56b9b28a4	dm: remove request-based DM queue's lld_busy_fn hook DM multipath is the only caller of blk_lld_busy() -- which calls a queue's lld_busy_fn hook. Request-based DM doesn't support stacking multipath devices so there is no reason to register the lld_busy_fn hook on a multipath device's queue using blk_queue_lld_busy(). As such, remove functions dm_lld_busy and dm_table_any_busy_target. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Mike Snitzer	52b09914af	dm: remove unnecessary wrapper around blk_lld_busy There is no need for DM to export a wrapper around the already exported blk_lld_busy(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Mike Snitzer	09c2d53101	dm: rename __dm_get_reserved_ios() helper to __dm_get_module_param() __dm_get_module_param() could be useful for future DM module parameters besides those related to "reserved_ios". Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-31 12:03:49 -04:00
Mike Snitzer	63a4f065ec	dm: fix add_disk() NULL pointer due to race with free_dev() Commit `c4db59d31e` ("fs: don't reassign dirty inodes to default_backing_dev_info") exposed DM to a latent race in free_dev() vs add_disk() in relation to management of the device's minor number. Fix this by refactoring free_dev() to match cleanup order of the alloc_dev() error path. Move cleanup of the gendisk, queue, and bdev to _before_ the cleanup of the idr managed minor number. Also, purely due to cleanup that fell out during the free_dev() audit: - adjust dm_blk_close() to access the gendisk's private_data under the _minor_lock spinlock. - move __dm_destroy()'s dm_get_live_table() call out from under the _minor_lock spinlock. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1202449 Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-03-23 18:14:00 -04:00
Mikulas Patocka	09ee96b214	dm snapshot: suspend merging snapshot when doing exception handover The "dm snapshot: suspend origin when doing exception handover" commit fixed a exception store handover bug associated with pending exceptions to the "snapshot-origin" target. However, a similar problem exists in snapshot merging. When snapshot merging is in progress, we use the target "snapshot-merge" instead of "snapshot-origin". Consequently, during exception store handover, we must find the snapshot-merge target and suspend its associated mapped_device. To avoid lockdep warnings, the target must be suspended and resumed without holding _origins_lock. Introduce a dm_hold() function that grabs a reference on a mapped_device, but unlike dm_get(), it doesn't crash if the device has the DMF_FREEING flag set, it returns an error in this case. In snapshot_resume() we grab the reference to the origin device using dm_hold() while holding _origins_lock (_origins_lock guarantees that the device won't disappear). Then we release _origins_lock, suspend the device and grab _origins_lock again. NOTE to stable@ people: When backporting to kernels 3.18 and older, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:53:16 -05:00
Mikulas Patocka	b735fede8d	dm snapshot: suspend origin when doing exception handover In the function snapshot_resume we perform exception store handover. If there is another active snapshot target, the exception store is moved from this target to the target that is being resumed. The problem is that if there is some pending exception, it will point to an incorrect exception store after that handover, causing a crash due to dm-snap-persistent.c:get_exception()'s BUG_ON. This bug can be triggered by repeatedly changing snapshot permissions with "lvchange -p r" and "lvchange -p rw" while there are writes on the associated origin device. To fix this bug, we must suspend the origin device when doing the exception store handover to make sure that there are no pending exceptions: - introduce _origin_hash that keeps track of dm_origin structures. - introduce functions __lookup_dm_origin, __insert_dm_origin and __remove_dm_origin that manipulate the origin hash. - modify snapshot_resume so that it calls dm_internal_suspend_fast() and dm_internal_resume_fast() on the origin device. NOTE to stable@ people: When backporting to kernels 3.12-3.18, use dm_internal_suspend and dm_internal_resume instead of dm_internal_suspend_fast and dm_internal_resume_fast. When backporting to kernels older than 3.12, you need to pick functions dm_internal_suspend and dm_internal_resume from the commit `fd2ed4d252`. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:49:47 -05:00
Mikulas Patocka	ab7c7bb6f4	dm: hold suspend_lock while suspending device during device deletion __dm_destroy() must take the suspend_lock so that its presuspend and postsuspend calls do not race with an internal suspend. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-27 14:09:23 -05:00
Mikulas Patocka	2bec1f4a88	dm: fix a race condition in dm_get_md The function dm_get_md finds a device mapper device with a given dev_t, increases the reference count and returns the pointer. dm_get_md calls dm_find_md, dm_find_md takes _minor_lock, finds the device, tests that the device doesn't have DMF_DELETING or DMF_FREEING flag, drops _minor_lock and returns pointer to the device. dm_get_md then calls dm_get. dm_get calls BUG if the device has the DMF_FREEING flag, otherwise it increments the reference count. There is a possible race condition - after dm_find_md exits and before dm_get is called, there are no locks held, so the device may disappear or DMF_FREEING flag may be set, which results in BUG. To fix this bug, we need to call dm_get while we hold _minor_lock. This patch renames dm_find_md to dm_get_md and changes it so that it calls dm_get while holding the lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2015-02-18 09:41:19 -05:00
Linus Torvalds	802ea9d864	- Most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. Early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJU3NRnAAoJEMUj8QotnQNavG0H/3yogMcHvKg9H+w0WmUQdwhN w99Wj3nkquAw2sm9yahKlAMBNY53iu/LHmC6/PaTpJetgdH7y1foTrRa0qjyeB2D DgNr8mOzxSxzX6CX9V8JMwqzky9XoG2IOt/7FeQQOpMqp4T1M2zgvbZtpl0lK/f3 lNaNBFpl+47NbGssD/WbtfI4Yy3hX0u406yGmQN5DxRyGTWD2AFqpA76g2mp8vrp wmw259gPr4oLhj3pDc0GkuiVn59ZR2Zp+2gs0jD5uKlDL84VP/nE+WNB+ny1Mnmt cOg8Q+W6/OosL66MKBHNsF0QS6DXNo5UvsN9fHGa5IUJw7Tsa11ZEPKHZGEbQw4= =RiN2 -----END PGP SIGNATURE----- Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - The most significant change this cycle is request-based DM now supports stacking ontop of blk-mq devices. This blk-mq support changes the model request-based DM uses for cloning a request to relying on calling blk_get_request() directly from the underlying blk-mq device. An early consumer of this code is Intel's emerging NVMe hardware; thanks to Keith Busch for working on, and pushing for, these changes. - A few other small fixes and cleanups across other DM targets. * tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues dm snapshot: remove unnecessary NULL checks before vfree() calls dm mpath: simplify failure path of dm_multipath_init() dm thin metadata: remove unused dm_pool_get_data_block_size() dm ioctl: fix stale comment above dm_get_inactive_table() dm crypt: update url in CONFIG_DM_CRYPT help text dm bufio: fix time comparison to use time_after_eq() dm: use time_in_range() and time_after() dm raid: fix a couple integer overflows dm table: train hybrid target type detection to select blk-mq if appropriate dm: allocate requests in target when stacking on blk-mq devices dm: prepare for allocating blk-mq clone requests in target dm: submit stacked requests in irq enabled context dm: split request structure out from dm_rq_target_io structure dm: remove exports for request-based interfaces without external callers	2015-02-12 16:36:31 -08:00
Linus Torvalds	3e12cefbe1	Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "This contains: - A series from Christoph that cleans up and refactors various parts of the REQ_BLOCK_PC handling. Contributions in that series from Dongsu Park and Kent Overstreet as well. - CFQ: - A bug fix for cfq for realtime IO scheduling from Jeff Moyer. - A stable patch fixing a potential crash in CFQ in OOM situations. From Konstantin Khlebnikov. - blk-mq: - Add support for tag allocation policies, from Shaohua. This is a prep patch enabling libata (and other SCSI parts) to use the blk-mq tagging, instead of rolling their own. - Various little tweaks from Keith and Mike, in preparation for DM blk-mq support. - Minor little fixes or tweaks from me. - A double free error fix from Tony Battersby. - The partition 4k issue fixes from Matthew and Boaz. - Add support for zero+unprovision for blkdev_issue_zeroout() from Martin" * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits) block: remove unused function blk_bio_map_sg block: handle the null_mapped flag correctly in blk_rq_map_user_iov blk-mq: fix double-free in error path block: prevent request-to-request merging with gaps if not allowed blk-mq: make blk_mq_run_queues() static dm: fix multipath regression due to initializing wrong request cfq-iosched: handle failure of cfq group allocation block: Quiesce zeroout wrapper block: rewrite and split __bio_copy_iov() block: merge __bio_map_user_iov into bio_map_user_iov block: merge __bio_map_kern into bio_map_kern block: pass iov_iter to the BLOCK_PC mapping functions block: add a helper to free bio bounce buffer pages block: use blk_rq_map_user_iov to implement blk_rq_map_user block: simplify bio_map_kern block: mark blk-mq devices as stackable block: keep established cmd_flags when cloning into a blk-mq request block: add blk-mq support to blk_insert_cloned_request() block: require blk_rq_prep_clone() be given an initialized clone request blk-mq: add tag allocation policy ...	2015-02-12 14:13:23 -08:00
Mike Snitzer	e5863d9ad7	dm: allocate requests in target when stacking on blk-mq devices For blk-mq request-based DM the responsibility of allocating a cloned request is transfered from DM core to the target type. Doing so enables the cloned request to be allocated from the appropriate blk-mq request_queue's pool (only the DM target, e.g. multipath, can know which block device to send a given cloned request to). Care was taken to preserve compatibility with old-style block request completion that requires request-based DM _not_ acquire the clone request's queue lock in the completion path. As such, there are now 2 different request-based DM target_type interfaces: 1) the original .map_rq() interface will continue to be used for non-blk-mq devices -- the preallocated clone request is passed in from DM core. 2) a new .clone_and_map_rq() and .release_clone_rq() will be used for blk-mq devices -- blk_get_request() and blk_put_request() are used respectively from these hooks. dm_table_set_type() was updated to detect if the request-based target is being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set. DM core disallows switching the DM table's type after it is set. This means that there is no mixing of non-blk-mq and blk-mq devices within the same request-based DM table. [This patch was started by Keith and later heavily modified by Mike] Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 13:06:47 -05:00
Keith Busch	466d89a6bc	dm: prepare for allocating blk-mq clone requests in target For blk-mq request-based DM the responsibility of allocating a cloned request will be transfered from DM core to the target type. To prepare for conditionally using this new model the original request's 'special' now points to the dm_rq_target_io because the clone is allocated later in the block layer rather than in DM core. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 13:06:47 -05:00
Keith Busch	2eb6e1e3aa	dm: submit stacked requests in irq enabled context Switch to having request-based DM enqueue all prep'ed requests into work processed by another thread. This allows request-based DM to invoke block APIs that assume interrupt enabled context (e.g. blk_get_request) and is a prerequisite for adding blk-mq support to request-based DM. The new kernel thread is only initialized for request-based DM devices. multipath_map() is now always in irq enabled context so change multipath spinlock (m->lock) locking to always disable interrupts. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 13:06:47 -05:00
Mike Snitzer	1ae49ea2cf	dm: split request structure out from dm_rq_target_io structure Request-based DM support for blk-mq devices requires that dm_rq_target_io structures not be allocated with an embedded request structure. The request-based DM target (e.g. dm-multipath) must allocate the request from the blk-mq devices' request_queue using blk_get_request(). The unfortunate side-effect of this change is old-style request-based DM support will no longer use contiguous memory for the dm_rq_target_io and request structures for each clone. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 13:06:47 -05:00
Mike Snitzer	dbf9782c10	dm: remove exports for request-based interfaces without external callers Remove exports for dm_dispatch_request, dm_requeue_unmapped_request, and dm_kill_unmapped_request. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-02-09 12:59:48 -05:00
Mike Snitzer	db507b3ffd	dm: fix multipath regression due to initializing wrong request Commit febf715 ("block: require blk_rq_prep_clone() be given an initialized clone request") introduced a regression by calling blk_rq_init() on the original request rather than the clone request that is passed to setup_clone(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `febf71588c` ("block: require blk_rq_prep_clone() be given an initialized clone request") Signed-off-by: Jens Axboe <axboe@fb.com>	2015-02-09 10:46:08 -07:00
Keith Busch	febf71588c	block: require blk_rq_prep_clone() be given an initialized clone request Prepare to allow blk_rq_prep_clone() to accept clone requests that were allocated from blk-mq request queues. As such the blk_rq_prep_clone() caller must first initialize the clone request. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-28 09:44:11 -07:00
Mikulas Patocka	96b26c8c64	dm: fix handling of multiple internal suspends Commit `ffcc393641` ("dm: enhance internal suspend and resume interface") attempted to handle multiple internal suspends on the same device, but it did that incorrectly. When these functions are called in this order on the same device the device is no longer suspended, but it should be: dm_internal_suspend_noflush dm_internal_suspend_noflush dm_internal_resume Fix this bug by maintaining an 'internal_suspend_count' and resuming the device when this count drops to zero. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2015-01-24 14:50:08 -05:00
zhendong chen	5164bece16	dm: fix missed error code if .end_io isn't implemented by target_type In bio-based DM's clone_endio(), when target_type doesn't implement .end_io (e.g. linear) r will be always be initialized 0. So if a WRITE SAME bio fails WRITE SAME will not be disabled as intended. Fix this by initializing r to error, rather than 0, in clone_endio(). Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: `7eee4ae2db` ("dm: disable WRITE SAME if it fails") Cc: stable@vger.kernel.org	2014-12-17 12:31:13 -05:00
Linus Torvalds	9ea18f8cab	Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ...	2014-12-13 14:22:26 -08:00
Gu Zheng	18c0b223cf	md: use generic io stats accounting functions to simplify io stat accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:05:16 -07:00
Eric Dumazet	a12f5d48bd	dm: use rcu_dereference_protected instead of rcu_dereference rcu_dereference() should be used in sections protected by rcu_read_lock. For writers, holding some kind of mutex or lock, rcu_dereference_protected() is the way to go, adding explicit lockdep bits. In __unbind(), we are the last user of this mapped device, so can use the constant '1' instead of a lockdep_is_held(), not consistent with other uses of rcu_dereference_protected() which use md->suspend_lock mutex. Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `33423974bf` ("dm: Use rcu_dereference() for accessing rcu pointer") Cc: Pranith Kumar <bobby.prani@gmail.com> [snitzer: allow lines longer than 80 columns, refine subject] Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-23 20:32:45 -05:00
Mike Snitzer	ffcc393641	dm: enhance internal suspend and resume interface Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast -- dm-stats will continue using these methods to avoid all the extra suspend/resume logic that is not needed in order to quickly flush IO. Introduce dm_internal_suspend_noflush() variant that actually calls the mapped_device's target callbacks -- otherwise target-specific hooks are avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common code between dm_internal_{suspend_noflush,resume} and dm_{suspend,resume} was factored out as __dm_{suspend,resume}. Update dm_internal_{suspend_noflush,resume} to always take and release the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to be cleared. Add lockdep annotation to dm_suspend() and dm_resume(). The existing DM_SUSPEND_FLAG remains unchanged. DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and cleared by dm_internal_resume(). Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device was already suspended when dm_internal_suspend_noflush() was called -- this can be thought of as a "nested suspend". A "nested suspend" can occur with legacy userspace dm-thin code that might suspend all active thin volumes before suspending the pool for resize. But otherwise, in the normal dm-thin-pool suspend case moving forward: the thin-pool will have DM_SUSPEND_FLAG set and all active thins from that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set. Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with debugging (e.g. 'dmsetup info' will report an internally suspended device accordingly). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 12:31:17 -05:00
Mike Snitzer	d67ee213fa	dm: add presuspend_undo hook to target_type The DM thin-pool target now must undo the changes performed during pool_presuspend() so introduce presuspend_undo hook in target_type. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-11-19 11:24:59 -05:00
Mike Snitzer	4d341d8216	dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl No point checking if the device is suspended if the current target doesn't even implement .ioctl Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-19 11:24:56 -05:00
Hannes Reinecke	41abc4e1af	dm: do not call dm_sync_table() when creating new devices When creating new devices dm_sync_table() calls synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be flushed. This causes a latency overhead that is especially noticeable when creating lots of devices. And all of this is pointless as there are no old maps to be disconnected, and hence no stale pointers which would need to be cleared up. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Pranith Kumar	6fa9952097	dm: sparse: Annotate field with __rcu for checking Annotate the map field with __rcu since this is a rcu pointer which is checked by sparse. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Pranith Kumar	33423974bf	dm: Use rcu_dereference() for accessing rcu pointer The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference() while accessing it. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:29 -05:00
Mike Snitzer	148e51baf8	dm: improve documentation and code clarity in dm_merge_bvec These code changes do not introduce a functional change. But bio_add_page() will never attempt to build up a bio larger than queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by queue_max_sectors(). Therefore, there is no point in allowing dm_merge_bvec() to answer "how many sectors can a bio have at this offset?" with anything larger than queue_max_sectors(). Using queue_max_sectors() rather than BIO_MAX_SECTORS serves to more accurately convey the limits that are being imposed. Also, use unlikely() to clarify the fact that the defensive code in dm_merge_bvec() relative to max_size going negative shouldn't ever happen -- if it does happen there is a bug in the block layer for requesting larger than dm_merge_bvec()'s initial response for a given offset. Also, update a comment in dm_merge_bvec() relative to max_hw_sectors_kb. And fix empty newline whitespace. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-11-10 15:25:27 -05:00
Benjamin Marzinski	86f1152b11	dm: allow active and inactive tables to share dm_devs Until this change, when loading a new DM table, DM core would re-open all of the devices in the DM table. Now, DM core will avoid redundant device opens (and closes when destroying the old table) if the old table already has a device open using the same mode. This is achieved by managing reference counts on the table_devices that DM core now stores in the mapped_device structure (rather than in the dm_table structure). So a mapped_device's active and inactive dm_tables' dm_dev lists now just point to the dm_devs stored in the mapped_device's table_devices list. This improvement in DM core's device reference counting has the side-effect of fixing a long-standing limitation of the multipath target: a DM multipath table couldn't include any paths that were unusable (failed). For example: if all paths have failed and you add a new, working, path to the table; you can't use it since the table load would fail due to it still containing failed paths. Now a re-load of a multipath table can include failed devices and when those devices become active again they can be used instantly. The device list code in dm.c isn't a straight copy/paste from the code in dm-table.c, but it's very close (aside from some variable renames). One subtle difference is that find_table_device for the tables_devices list will only match devices with the same name and mode. This is because we don't want to upgrade a device's mode in the active table when an inactive table is loaded. Access to the mapped_device structure's tables_devices list requires a mutex (tables_devices_lock), so that tables cannot be created and destroyed concurrently. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:35 -04:00
Junichi Nomura	3d8aab2d2c	dm: use bioset_create_nobvec() Since DM core uses bio_clone_fast() for both bio-based and request-based DM devices there is no need for DM's bioset to have a bvec mempool. With this patch, on arch with 4KB page for example, memory usage will be reduced by 64KB for each bio-based DM device and 1MB for each request-based DM device. For example, when you create 10,000 bio-based DM devices and 1,000 request-based DM devices, memory usage of biovec under no load is: # grep biovec /proc/slabinfo biovec-256 418068 418068 4096 ... biovec-128 0 0 2048 ... biovec-64 0 0 1024 ... biovec-16 0 0 256 ... With this patch series applied, the usage becomes: # grep biovec /proc/slabinfo biovec-256 116 116 4096 ... biovec-128 0 0 2048 ... biovec-64 0 0 1024 ... biovec-16 0 0 256 ... So 4096 * (418068 - 116) = 1.6GB of memory is saved in this example. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:34 -04:00
Junichi Nomura	997782735c	dm: remove nr_iovecs parameter from alloc_tio() alloc_tio() uses bio_alloc_bioset() to allocate a clone-bio for a bio. alloc_tio() takes the number of bvecs to allocate for the clone-bio. However, with v3.14's immutable biovec changes DM now uses __bio_clone_fast() and no longer needs to allocate bvecs. In practice, the 'nr_iovecs' passed to alloc_tio() is always effectively 0. __clone_and_map_simple_bio() looked like it was passing non-zero nr_iovecs, but its value was always within the range of inline bvecs and no allocation actually happened. If allocation happened, the BUG_ON() in __bio_clone_fast() would've triggered. Remove the nr_iovecs parameter from alloc_tio() to prevent possible future bio_alloc_bioset() mis-use of a new bioset interface that will no longer allow bvecs to be allocated. Also fix extra whitespace before the __bio_clone_fast() call in __clone_and_map_simple_bio(). Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-10-05 20:03:34 -04:00
Mikulas Patocka	acfe0ad74d	dm: allocate a special workqueue for deferred device removal The commit `2c140a246d` ("dm: allow remove to be deferred") introduced a deferred removal feature for the device mapper. When this feature is used (by passing a flag DM_DEFERRED_REMOVE to DM_DEV_REMOVE_CMD ioctl) and the user tries to remove a device that is currently in use, the device will be removed automatically in the future when the last user closes it. Device mapper used the system workqueue to perform deferred removals. However, some targets (dm-raid1, dm-mpath, dm-stripe) flush work items scheduled for the system workqueue from their destructor. If the destructor itself is called from the system workqueue during deferred removal, it introduces a possible deadlock - the workqueue tries to flush itself. Fix this possible deadlock by introducing a new workqueue for deferred removals. We allocate just one workqueue for all dm targets. The ability of dm targets to process IOs isn't dependent on deferred removal of unused targets, so a deadlock due to shared workqueue isn't possible. Also, cleanup local_init() to eliminate potential for returning success on failure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.13+	2014-07-10 16:44:13 -04:00
Linus Torvalds	0e04c641b1	. Add dm_accept_partial_bio interface to DM core to allow DM targets to only process a portion of a bio, the remainder being sent in the next bio. This enables the old dm snapshot-origin target to only split write bios on chunk boundaries, read bios are now sent to the origin device unchanged. . Add DM core support for disabling WRITE SAME if the underlying SCSI layer disables it due to command failure. . Reduce lock contention in DM's bio-prison. . A few small cleanups and fixes to dm-thin and dm-era. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJTmavIAAoJEMUj8QotnQNay/EIAJI6lEFlQK3gG830+Yaw1m2U 7mnnX/Rd/N6RnccyhmFqk7Xu0REM7gJEgWicTQjR58La2DKFi072N0mwgHgHfB8f 1oOvUKN5Nb/a1CmRcVzSO0sbYcJn9I1r+k0buqfFHivU68wuedG+MrVya3YzOjvC 63MQiu4+3icDprcToxn+etz75FhrFps5QAsS0cH6t1VZqFCGIzxqgUKgY8zGo1CH P9hkYpJhhJe2aDh4vlFvpFYVFXt9zPoR+MkqXFW6Dn9GbR36gGaldTwnyhapla4N XYHDEdETcTqm2srOM5wW+jD0p+Id9/Lmd5Ld5J2zyARCYn/EG/SDCbiaVqOxEnc= =tv1B -----END PGP SIGNATURE----- Merge tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: "This pull request is later than I'd have liked because I was waiting for some performance data to help finally justify sending the long-standing dm-crypt cpu scalability improvements upstream. Unfortunately we came up short, so those dm-crypt changes will continue to wait, but it seems we're not far off. . Add dm_accept_partial_bio interface to DM core to allow DM targets to only process a portion of a bio, the remainder being sent in the next bio. This enables the old dm snapshot-origin target to only split write bios on chunk boundaries, read bios are now sent to the origin device unchanged. . Add DM core support for disabling WRITE SAME if the underlying SCSI layer disables it due to command failure. . Reduce lock contention in DM's bio-prison. . A few small cleanups and fixes to dm-thin and dm-era" * tag 'dm-3.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: update discard_granularity to reflect the thin-pool blocksize dm bio prison: implement per bucket locking in the dm_bio_prison hash table dm: remove symbol export for dm_set_device_limits dm: disable WRITE SAME if it fails dm era: check for a non-NULL metadata object before closing it dm thin: return ENOSPC instead of EIO when error_if_no_space enabled dm thin: cleanup noflush_work to use a proper completion dm snapshot: do not split read bios sent to snapshot-origin target dm snapshot: allocate a per-target structure for snapshot-origin target dm: introduce dm_accept_partial_bio dm: change sector_count member in clone_info from sector_t to unsigned	2014-06-12 13:33:29 -07:00
Mike Snitzer	11f0431be2	dm: remove symbol export for dm_set_device_limits There is no need for code other than DM core to use dm_set_device_limits so remove its EXPORT_SYMBOL_GPL. Also, cleanup a couple whitespace nits. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-06-04 09:46:34 -04:00
Mike Snitzer	7eee4ae2db	dm: disable WRITE SAME if it fails Add DM core support for disabling WRITE SAME on first failure to both request-based and bio-based targets. The need to disable WRITE SAME stems from SCSI enabling it by default but then disabling it when it fails. When SCSI does this it returns "permanent target failure, do not retry" using -EREMOTEIO. Update DM core to only disable WRITE SAME on failure if the returned error is -EREMOTEIO. Commit `f84cb8a4` ("dm mpath: disable WRITE SAME if it fails") implemented multipath specific disabling of WRITE SAME if it fails. However, as that commit detailed, the multipath-only solution doesn't go far enough if bio-based DM targets are stacked ontop of the request-based dm-multipath target (as is commonly done using dm-linear to support partitions on multipath devices, via kpartx). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Alex Chen <alex.chen@huawei.com>	2014-06-04 09:45:52 -04:00
Linus Torvalds	776edb5931	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__() arch,doc: Convert smp_mb__() arch,xtensa: Convert smp_mb__() arch,x86: Convert smp_mb__() arch,tile: Convert smp_mb__() arch,sparc: Convert smp_mb__() arch,sh: Convert smp_mb__() arch,score: Convert smp_mb__() arch,s390: Convert smp_mb__() arch,powerpc: Convert smp_mb__() arch,parisc: Convert smp_mb__() arch,openrisc: Convert smp_mb__() arch,mn10300: Convert smp_mb__() arch,mips: Convert smp_mb__() arch,metag: Convert smp_mb__() arch,m68k: Convert smp_mb__() arch,m32r: Convert smp_mb__() arch,ia64: Convert smp_mb__() ...	2014-06-03 12:57:53 -07:00
Mikulas Patocka	1dd40c3ecd	dm: introduce dm_accept_partial_bio The function dm_accept_partial_bio allows the target to specify how many sectors of the current bio it will process. If the target only wants to accept part of the bio, it calls dm_accept_partial_bio and the DM core sends the rest of the data in next bio. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-06-03 13:44:06 -04:00
Mikulas Patocka	e0d6609a5f	dm: change sector_count member in clone_info from sector_t to unsigned It is impossible to create bios with 2^23 or more sectors (the size is stored as a 32-bit byte count in the bio). So we convert some sector_t values to unsigned integers. This is needed for the next commit ("dm: introduce dm_accept_partial_bio") that replaces integer value arguments with pointers, so the size of the integer must match. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-06-03 13:44:06 -04:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Jens Axboe	b4f42e2831	block: remove struct request buffer member This was used in the olden days, back when onions were proper yellow. Basically it mapped to the current buffer to be transferred. With highmem being added more than a decade ago, most drivers map pages out of a bio, and rq->buffer isn't pointing at anything valid. Convert old style drivers to just use bio_data(). For the discard payload use case, just reference the page in the bio. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-04-15 14:03:02 -06:00
Mike Snitzer	9974fa2c6a	dm table: add dm_table_run_md_queue_async Introduce dm_table_run_md_queue_async() to run the request_queue of the mapped_device associated with a request-based DM table. Also add dm_md_get_queue() wrapper to extract the request_queue from a mapped_device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:24 -04:00
Monam Agarwal	9cdb852004	dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind Replace rcu_assign_pointer(p, NULL) with RCU_INIT_POINTER(p, NULL). The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL). Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Mikulas Patocka	bfc6d41cee	dm: stop using bi_private Device mapper uses the bio structure's bi_private field as a pointer to dm_target_io or dm_rq_clone_bio_info. But a bio structure is embedded in the dm_target_io and dm_rq_clone_bio_info structures, so the pointer to the structure that contains the bio can be found with the container_of() macro. Remove the use of bi_private and use container_of() instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Mikulas Patocka	d70ab4fb72	dm: remove dm_get_mapinfo Remove dm_get_mapinfo() because no target uses it. Targets can allocate per-bio data using ti->per_bio_data_size, this is much more flexible than union map_info. Leave union map_info only for the request-based multipath target's use. Also delete the unused "unsigned long long ll" field of union map_info. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Mikulas Patocka	2995fa78e4	dm sysfs: fix a module unload race This reverts commit `be35f48610` ("dm: wait until embedded kobject is released before destroying a device") and provides an improved fix. The kobject release code that calls the completion must be placed in a non-module file, otherwise there is a module unload race (if the process calling dm_kobject_release is preempted and the DM module unloaded after the completion is triggered, but before dm_kobject_release returns). To fix this race, this patch moves the completion code to dm-builtin.c which is always compiled directly into the kernel if BLK_DEV_DM is selected. The patch introduces a new dm_kobject_holder structure, its purpose is to keep the completion and kobject in one place, so that it can be accessed from non-module code without the need to export the layout of struct mapped_device to that code. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-14 23:23:04 -05:00
Mikulas Patocka	be35f48610	dm: wait until embedded kobject is released before destroying a device There may be other parts of the kernel holding a reference on the dm kobject. We must wait until all references are dropped before deallocating the mapped_device structure. The dm_kobject_release method signals that all references are dropped via completion. But dm_kobject_release doesn't free the kobject (which is embedded in the mapped_device structure). This is the sequence of operations: * when destroying a DM device, call kobject_put from dm_sysfs_exit * wait until all users stop using the kobject, when it happens the release method is called * the release method signals the completion and should return without delay * the dm device removal code that waits on the completion continues * the dm device removal code drops the dm_mod reference the device had * the dm device removal code frees the mapped_device structure that contains the kobject Using kobject this way should avoid the module unload race that was mentioned at the beginning of this thread: https://lkml.org/lkml/2014/1/4/83 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-01-07 21:01:43 -05:00
Mikulas Patocka	1ddd641ddc	dm: remove pointless kobject comparison in dm_get_from_kobject The comparison is always true and the compiler optimizes it out anyway. Milan offered additional context relative to the original commit `784aae735d` ("dm: add name and uuid to sysfs") which introduced the code: "I think it is just relict of some experiments before I committed this simple embedded sysfs kobj handling". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-07 13:22:32 -05:00
Kent Overstreet	1c3b13e64c	dm: Refactor for new bio cloning/splitting We need to convert the dm code to the new bvec_iter primitives which respect bi_bvec_done; they also allow us to drastically simplify dm's bio splitting code. Also, it's no longer necessary to save/restore the bvec array anymore - driver conversions for immutable bvecs are done, so drivers should never be modifying it. Also kill bio_sector_offset(), dm was the only user and it doesn't make much sense anymore. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Reviewed-by: Mike Snitzer <snitzer@redhat.com>	2013-11-23 22:33:55 -08:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Mikulas Patocka	2c140a246d	dm: allow remove to be deferred This patch allows the removal of an open device to be deferred until it is closed. (Previously such a removal attempt would fail.) The deferred remove functionality is enabled by setting the flag DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or DM_REMOVE_ALL ioctl. On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if the device was removed immediately or flagged to be removed on close - if the flag is clear, the device was removed. On return from DM_DEV_STATUS and other ioctls, the flag DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on closure. A device that is scheduled to be deleted can be revived using the message "@cancel_deferred_remove". This message clears the DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2013-11-09 18:20:22 -05:00
Mike Snitzer	e8603136cb	dm: add reserved_bio_based_ios module parameter Allow user to change the number of IOs that are reserved by bio-based DM's mempools by writing to this file: /sys/module/dm_mod/parameters/reserved_bio_based_ios The default value is RESERVED_BIO_BASED_IOS (16). The maximum allowed value is RESERVED_MAX_IOS (1024). Export dm_get_reserved_bio_based_ios() for use by DM targets and core code. Switch to sizing dm-io's mempool and bioset using DM core's configurable 'reserved_bio_based_ios'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com>	2013-09-23 10:42:24 -04:00
Mike Snitzer	f47908269f	dm: add reserved_rq_based_ios module parameter Allow user to change the number of IOs that are reserved by request-based DM's mempools by writing to this file: /sys/module/dm_mod/parameters/reserved_rq_based_ios The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum allowed value is RESERVED_MAX_IOS (1024). Export dm_get_reserved_rq_based_ios() for use by DM targets and core code. Switch to sizing dm-mpath's mempool using DM core's configurable 'reserved_rq_based_ios'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>	2013-09-23 10:42:24 -04:00
Mike Snitzer	6cfa58573f	dm: lower bio-based mempool reservation Bio-based device mapper processing doesn't need larger mempools (like request-based DM does), so lower the number of reserved entries for bio-based operation. 16 was already used for bio-based DM's bioset but mistakenly wasn't used for it's _io_cache. Formalize difference between bio-based and request-based defaults by introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS. (based on older code from Mikulas Patocka) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>	2013-09-23 10:42:23 -04:00
Mike Snitzer	f84cb8a46a	dm mpath: disable WRITE SAME if it fails Workaround the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it. The WRITE SAME heuristics, with both the original commit `5db44863b6` ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit `66c28f971` ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same which results in 'max_write_same_sectors' in device's queue_limits to be set to 0). When a device is stacked ontop of such a SCSI device any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) it will result in actual IO errors to the upper layers. This fix doesn't help configurations that have additional devices stacked ontop of the mpath device (e.g. LVM created linear DM devices ontop). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time. Before this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. device-mapper: multipath: Failing path 8:112. end_request: I/O error, dev dm-6, sector 4616 dm-6: WRITE SAME failed. Manually zeroing. end_request: I/O error, dev dm-6, sector 4616 end_request: I/O error, dev dm-6, sector 5640 end_request: I/O error, dev dm-6, sector 6664 end_request: I/O error, dev dm-6, sector 7688 end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. end_request: I/O error, dev dm-6, sector 524296 Aborting journal on device dm-6-8. end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 33553920 After this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 0 It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: stable@vger.kernel.org # 3.10+	2013-09-20 10:36:34 -04:00
Mikulas Patocka	fd2ed4d252	dm: add statistics support Support the collection of I/O statistics on user-defined regions of a DM device. If no regions are defined no statistics are collected so there isn't any performance impact. Only bio-based DM devices are currently supported. Each user-defined region specifies a starting sector, length and step. Individual statistics will be collected for each step-sized area within the range specified. The I/O statistics counters for each step-sized area of a region are in the same format as /sys/block/*/stat or /proc/diskstats but extra counters (12 and 13) are provided: total time spent reading and writing in milliseconds. All these counters may be accessed by sending the @stats_print message to the appropriate DM device via dmsetup. The creation of DM statistics will allocate memory via kmalloc or fallback to using vmalloc space. At most, 1/4 of the overall system memory may be allocated by DM statistics. The admin can see how much memory is used by reading /sys/module/dm_mod/parameters/stats_current_allocated_bytes See Documentation/device-mapper/statistics.txt for more details. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-09-05 20:46:06 -04:00
Mike Snitzer	00c4fc3b1f	dm ioctl: increase granularity of type_lock when loading table Hold the mapped device's type_lock before calling populate_table() since it is where the table's type is determined based on the specified targets. There is no need to allow concurrent table loads to race to establish the table's targets or type. This eliminates the need to grab the lock in dm_table_set_type(). Also verify that the type_lock is held in both dm_set_md_type() and dm_get_md_type(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-09-05 20:46:06 -04:00
Tejun Heo	670368a8dd	dm: stop using WQ_NON_REENTRANT `dbf2576e37` ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2013-08-23 09:02:13 -04:00
Mikulas Patocka	2a7faeb176	dm: optimize reorder structure This reorder actually improves performance by 20% (from 39.1s to 32.8s) on x86-64 quad core Opteron. I have no explanation for this, possibly it makes some other entries are better cache-aligned. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:18 +01:00
Mikulas Patocka	83d5e5b0af	dm: optimize use SRCU and RCU This patch removes "io_lock" and "map_lock" in struct mapped_device and "holders" in struct dm_table and replaces these mechanisms with sleepable-rcu. Previously, the code would call "dm_get_live_table" and "dm_table_put" to get and release table. Now, the code is changed to call "dm_get_live_table" and "dm_put_live_table". dm_get_live_table locks sleepable-rcu and dm_put_live_table unlocks it. dm_get_live_table_fast/dm_put_live_table_fast can be used instead of dm_get_live_table/dm_put_live_table. These *_fast functions use non-sleepable RCU, so the caller must not block between them. If the code changes active or inactive dm table, it must call dm_sync_table before destroying the old table. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:18 +01:00
Hannes Reinecke	6c182cd88d	dm mpath: fix ioctl deadlock when no paths When multipath needs to retry an ioctl the reference to the current live table needs to be dropped. Otherwise a deadlock occurs when all paths are down: - dm_blk_ioctl takes a reference to the current table and spins in multipath_ioctl(). - A new table is being loaded, but upon resume the process hangs in dm_table_destroy() waiting for references to drop to zero. With this patch the reference to the old table is dropped prior to retry, thereby avoiding the deadlock. Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-07-10 23:41:15 +01:00
Al Viro	db2a144bed	block_device_operations->release() should return void The value passed is 0 in all but "it can never happen" cases (and those only in a couple of drivers) and it would've been lost on the way out anyway, even if something tried to pass something meaningful. Just don't bother. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-05-07 02:16:21 -04:00
Linus Torvalds	0a82a8d132	Revert "block: add missing block_bio_complete() tracepoint" This reverts commit `3a366e614d`. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-18 09:00:26 -07:00
Alasdair G Kergon	b0d8ed4d96	dm: add target num_write_bios fn Add a num_write_bios function to struct target. If an instance of a target sets this, it will be queried before the target's mapping function is called on a write bio, and the response controls the number of copies of the write bio that the target will receive. This provides a convenient way for a target to send the same data to more than one device. The new cache target uses this in writethrough mode, to send the data both to the cache and the backing device. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:49 +00:00
Jun'ichi Nomura	5f01520415	dm: merge io_pool and tio_pool This patch merges io_pool and tio_pool into io_pool and cleans up related functions. Though device-mapper used to have 2 pools of objects for each dm device, the use of bioset frontbad for per-bio data has shrunk the number of pools to 1 for both bio-based and request-based device types. (See `c0820cf5` "dm: introduce per_bio_data" and `94818742` "dm: Use bioset's front_pad for dm_rq_clone_bio_info") So dm no longer has to maintain 2 different pointers. No functional changes. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:48 +00:00
Jun'ichi Nomura	23e5083b4d	dm: remove unused _rq_bio_info_cache Remove _rq_bio_info_cache, which is no longer used. No functional changes. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:48 +00:00
Mike Christie	87eb5b21d9	dm: fix limits initialization when there are no data devices dm_calculate_queue_limits will first reset the provided limits to defaults using blk_set_stacking_limits; whereby defeating the purpose of retaining the original live table's limits -- as was intended via commit `3ae7065616` ("dm: retain table limits when swapping to new table with no devices"). Fix this improper limits initialization (in the no data devices case) by avoiding the call to dm_calculate_queue_limits. [patch header revised by Mike Snitzer] Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # v3.6+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:48 +00:00
Alasdair G Kergon	e4c938111f	dm: refactor bio cloning Refactor part of the bio splitting and cloning code to try to make it easier to understand. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:47 +00:00
Alasdair G Kergon	14fe594d67	dm: rename bio cloning functions Rename functions involved in splitting and cloning bios. The sequence of functions is now: (1) __split_and_process* - entry point that selects the processing strategy (2) __send* - prepare the details for each bio needed and loop through them (3) __clone_and_map* - creates a clone and maps it Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:47 +00:00
Alasdair G Kergon	55a62eef8d	dm: rename request variables to bios Use 'bio' in the name of variables and functions that deal with bios rather than 'request' to avoid confusion with the normal block layer use of 'request'. No functional changes. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:47 +00:00
Alasdair G Kergon	bd2a49b86d	dm: clean up clone_bio Remove the no-longer-used struct bio_set argument from clone_bio and split_bvec. Use tio->ti in __map_bio() instead of passing in ti. Factor out some code for setting up cloned bios. Take target_request_nr as a parameter to alloc_tio(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:46 +00:00
Jun'ichi Nomura	16245bdc9d	dm: do not replace bioset for request based dm This patch fixes a regression introduced in v3.8, which causes oops like this when dm-multipath is used: general protection fault: 0000 [#1] SMP RIP: 0010:[<ffffffff810fe754>] [<ffffffff810fe754>] mempool_free+0x24/0xb0 Call Trace: <IRQ> [<ffffffff81187417>] bio_put+0x97/0xc0 [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod] [<ffffffff81185efd>] bio_endio+0x1d/0x30 [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0 [<ffffffff811f2f68>] blk_update_request+0x118/0x520 [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0 [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80 [<ffffffff811f34d0>] blk_end_request+0x10/0x20 [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod] [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod] [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod] [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0 [<ffffffff81044551>] __do_softirq+0xf1/0x250 [<ffffffff8142ee8c>] call_softirq+0x1c/0x30 [<ffffffff8100420d>] do_softirq+0x8d/0xc0 [<ffffffff81044885>] irq_exit+0xd5/0xe0 [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0 [<ffffffff814257af>] common_interrupt+0x6f/0x6f <EOI> [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp] [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod] [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod] [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50 [<ffffffff811f1f69>] blk_delay_work+0x29/0x40 [<ffffffff81059003>] process_one_work+0x1c3/0x5c0 [<ffffffff8105b22e>] worker_thread+0x15e/0x440 [<ffffffff8106164b>] kthread+0xdb/0xe0 [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0 The regression was introduced by the change `c0820cf5` "dm: introduce per_bio_data", where dm started to replace bioset during table replacement. For bio-based dm, it is good because clone bios do not exist during the table replacement. For request-based dm, however, (not-yet-mapped) clone bios may stay in request queue and survive during the table replacement. So freeing the old bioset could cause the oops in bio_put(). Since the size of front_pad may change only with bio-based dm, it is not necessary to replace bioset for request-based dm. Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2013-03-01 22:45:44 +00:00
Linus Torvalds	ee89f81252	Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch\|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front\|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...	2013-02-28 12:52:24 -08:00
Tejun Heo	c9d76be696	dm: convert to idr_alloc() Convert to the much saner new idr interface. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:17 -08:00
Tejun Heo	adaedbd9fe	dm: don't use idr_remove_all() idr_destroy() can destroy idr by itself and idr_remove_all() is being deprecated. Drop its usage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:13 -08:00
Alasdair G Kergon	fe7af2d3ba	dm: fix write same requests counting When processing write same requests, fix dm to send the configured number of WRITE SAME requests to the target rather than the number of discards, which is not always the same. Device-mapper WRITE SAME support was introduced by commit `23508a96cd` ("dm: add WRITE SAME support"). Signed-off-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com>	2013-01-31 14:23:36 +00:00
Tejun Heo	3a366e614d	block: add missing block_bio_complete() tracepoint bio completion didn't kick block_bio_complete TP. Only dm was explicitly triggering the TP on IO completion. This makes block_bio_complete TP useless for tracers which want to know about bios, and all other bio based drivers skip generating blktrace completion events. This patch makes all bio completions via bio_endio() generate block_bio_complete TP. * Explicit trace_block_bio_complete() invocation removed from dm and the trace point is unexported. * @rq dropped from trace_block_bio_complete(). bios may fly around w/o queue associated. Verifying and accessing the assocaited queue belongs to TP probes. * blktrace now gets both request and bio completions. Make it ignore bio completions if request completion path is happening. This makes all bio based drivers generate blktrace completion events properly and makes the block_bio_complete TP actually useful. v2: With this change, block_bio_complete TP could be invoked on sg commands which have bio's with %NULL bi_bdev. Update TP assignment code to check whether bio->bi_bdev is %NULL before dereferencing. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Namhyung Kim <namhyung@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-01-14 15:00:36 +01:00
Mikulas Patocka	7de3ee57da	dm: remove map_info This patch removes map_info from bio-based device mapper targets. map_info is still used for request-based targets. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-12-21 20:23:41 +00:00
Mikulas Patocka	ddbd658f64	dm: move target request nr to dm_target_io This patch moves target_request_nr from map_info to dm_target_io and makes it accessible with dm_bio_get_target_request_nr. This patch is a preparation for the next patch that removes map_info. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-12-21 20:23:39 +00:00
Mikulas Patocka	c0820cf5ad	dm: introduce per_bio_data Introduce a field per_bio_data_size in struct dm_target. Targets can set this field in the constructor. If a target sets this field to a non-zero value, "per_bio_data_size" bytes of auxiliary data are allocated for each bio submitted to the target. These data can be used for any purpose by the target and help us improve performance by removing some per-target mempools. Per-bio data is accessed with dm_per_bio_data. The argument data_size must be the same as the value per_bio_data_size in dm_target. If the target has a pointer to per_bio_data, it can get a pointer to the bio with dm_bio_from_per_bio_data() function (data_size must be the same as the value passed to dm_per_bio_data). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-12-21 20:23:38 +00:00
Mike Snitzer	23508a96cd	dm: add WRITE SAME support WRITE SAME bios have a payload that contain a single page. When cloning WRITE SAME bios DM has no need to modify the bi_io_vec attributes (and doing so would be detrimental). DM need only alter the start and end of the WRITE SAME bio accordingly. Rather than duplicate __clone_and_map_discard, factor out a common function that is also used by __clone_and_map_write_same. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-12-21 20:23:37 +00:00
Jens Axboe	a8c32a5c98	dm: fix deadlock with request based dm and queue request_fn recursion Request based dm attempts to re-run the request queue off the request completion path. If used with a driver that potentially does end_io from its request_fn, we could deadlock trying to recurse back into request dispatch. Fix this by punting the request queue run to kblockd. Tested to fix a quickly reproducible deadlock in such a scenario. Cc: stable@kernel.org Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-11-23 14:32:54 +01:00
Mikulas Patocka	dba141601d	dm: store dm_target_io in bio front_pad Use the recently-added bio front_pad field to allocate struct dm_target_io. Prior to this patch, dm_target_io was allocated from a mempool. For each dm_target_io, there is exactly one bio allocated from a bioset. This patch merges these two allocations into one allocation: we create a bioset with front_pad equal to the size of dm_target_io so that every bio allocated from the bioset has sizeof(struct dm_target_io) bytes before it. We allocate a bio and use the bytes before the bio as dm_target_io. _tio_cache is removed and the tio_pool mempool is now only used for request-based devices. This idea was introduced by Kent Overstreet. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: tj@kernel.org Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-10-12 21:02:15 +01:00
Linus Torvalds	ce40be7a82	Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit 6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...	2012-10-11 09:04:23 +09:00
Mike Snitzer	3ae7065616	dm: retain table limits when swapping to new table with no devices Add a safety net that will re-use the DM device's existing limits in the event that DM device has a temporary table that doesn't have any component devices. This is to reduce the chance that requests not respecting the hardware limits will reach the device. DM recalculates queue limits based only on devices which currently exist in the table. This creates a problem in the event all devices are temporarily removed such as all paths being lost in multipath. DM will reset the limits to the maximum permissible, which can then assemble requests which exceed the limits of the paths when the paths are restored. The request will fail the blk_rq_check_limits() test when sent to a path with lower limits, and will be retried without end by multipath. This became a much bigger issue after v3.6 commit `fe86cdcef` ("block: do not artificially constrain max_sectors for stacking drivers"). Reported-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-09-26 23:45:45 +01:00
Mike Snitzer	ba1cbad93d	dm: handle requests beyond end of device instead of using BUG_ON The access beyond the end of device BUG_ON that was introduced to dm_request_fn via commit `29e4013de7` ("dm: implement REQ_FLUSH/FUA support for request-based dm") was an overly drastic (but simple) response to this situation. I have received a report that this BUG_ON was hit and now think it would be better to use dm_kill_unmapped_request() to fail the clone and original request with -EIO. map_request() will assign the valid target returned by dm_table_find_target to tio->ti. But when the target isn't valid tio->ti is never assigned (because map_request isn't called); so add a check for tio->ti != NULL to dm_done(). Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@vger.kernel.org # v2.6.37+ Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-09-26 23:45:42 +01:00
Kent Overstreet	bf800ef181	block: Add bio_clone_bioset(), bio_clone_kmalloc() Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch adedd. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:39 +02:00
Kent Overstreet	9481874231	dm: Use bioset's front_pad for dm_rq_clone_bio_info Previously, dm_rq_clone_bio_info needed to be freed by the bio's destructor to avoid a memory leak in the blk_rq_prep_clone() error path. This gets rid of a memory allocation and means we can kill dm_rq_bio_destructor. The _rq_bio_info_cache kmem cache is unused now and needs to be deleted, but due to the way io_pool is used and overloaded this looks not quite trivial so I'm leaving it for a later patch. v6: Fix comment on struct dm_rq_clone_bio_info, per Tejun Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Alasdair Kergon <agk@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Kent Overstreet	1e2a410ff7	block: Ues bi_pool for bio_integrity_alloc() Now that bios keep track of where they were allocated from, bio_integrity_alloc_bioset() becomes redundant. Remove bio_integrity_alloc_bioset() and drop bio_set argument from the related functions and make them use bio->bi_pool. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Kent Overstreet	395c72a707	block: Generalized bio pool freeing With the old code, when you allocate a bio from a bio pool you have to implement your own destructor that knows how to find the bio pool the bio was originally allocated from. This adds a new field to struct bio (bi_pool) and changes bio_alloc_bioset() to use it. This makes various bio destructors unnecessary, so they're then deleted. v6: Explain the temporary if statement in bio_put Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Nicholas Bellinger <nab@linux-iscsi.org> CC: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:38 +02:00
Mikulas Patocka	7acf0277ce	dm: introduce split_discard_requests This patch introduces a new variable split_discard_requests. It can be set by targets so that discard requests are split on max_io_len boundaries. When split_discard_requests is not set, discard requests are only split on boundaries between targets, as was the case before this patch. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:03 +01:00
Mike Snitzer	542f903814	dm: support non power of two target max_io_len Remove the restriction that limits a target's specified maximum incoming I/O size to be a power of 2. Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'. Change it from sector_t to uint32_t, which is plenty big enough, and introduce a wrapper function dm_set_target_max_io_len() to set it. Use sector_div() to process it now that it is not necessarily a power of 2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-07-27 15:08:00 +01:00
Hannes Reinecke	4d7b38b7d9	dm: clear bi_end_io on remapping failure As a precaution, set bi_end_io to NULL when failing to remap. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2012-03-28 18:41:25 +01:00
Al Viro	ff01bb4832	fs: move code out of buffer.c Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export kill_bdev as well, so brd doesn't have to open code it. Reduce buffer_head.h requirement accordingly. Removed a rather large comment from invalidate_bdev, as it looked a bit obsolete to bother moving. The small comment replacing it says enough. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:07 -05:00
Linus Torvalds	b4fdcb02f1	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c	2011-11-04 17:06:58 -07:00
Alasdair G Kergon	3cf2e4ba74	dm: export dm get md Export dm_get_md() for the new thin provisioning target to use. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-10-31 20:19:06 +00:00
Alasdair G Kergon	36a0456fbf	dm table: add immutable feature Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed with any other target type, and once loaded into a device, it cannot be replaced with a table containing a different type. The thin provisioning pool device will use this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-10-31 20:19:04 +00:00
Namhyung Kim	fbdc86f3bd	dm: remove superfluous smp_mb Since set_current_state() contains a memory barrier in it, an additional barrier isn't needed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-10-31 20:18:56 +00:00
Namhyung Kim	71a16736a1	dm: use local printk ratelimit printk_ratelimit() shares global ratelimiting state with all other subsystems, so its usage is discouraged. Instead, define and use dm's local state. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-10-31 20:18:54 +00:00
Christoph Hellwig	5a7bbad27a	block: remove support for bio remapping from ->make_request There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:12:01 +02:00
Jens Axboe	c20e8de27f	block: rename __make_request() to blk_queue_bio() Now that it's exported, lets put it in a more sane namespace. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:31 +02:00
Christoph Hellwig	166e1f901b	block: export __make_request Avoid the hacks need for request based device mappers currently by simply exporting the symbol instead of trying to get it through the back door. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-09-12 12:08:27 +02:00
Mike Snitzer	ed8b752bcc	dm table: set flush capability based on underlying devices DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities regardless of whether or not a given DM device's underlying devices also advertised a need for them. Block's flush-merge changes from 2.6.39 have proven to be more costly for DM devices. Performance regressions have been reported even when DM's underlying devices do not advertise that they have a write cache. Fix the performance regressions by configuring a DM device's flushing capabilities based on those of the underlying devices' capabilities. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:08 +01:00
Mikulas Patocka	d5b9dd04bd	dm: ignore merge_bvec for snapshots when safe Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate whether the device can accept bios larger than the size its merge function returns. When set, use this to send large bios to snapshots which can split them if necessary. Snapshot I/O may be significantly fragmented and this approach seems to improve peformance. Before the patch, dm_set_device_limits restricted bio size to page size if the underlying device had a merge function and the target didn't provide a merge function. After the patch, dm_set_device_limits restricts bio size to page size if the underlying device has a merge function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't provide a merge function. The snapshot target can't provide a merge function because when the merge function is called, it is impossible to determine where the bio will be remapped. Previously this led us to impose a 4k limit, which we can now remove if the snapshot store is located on a device without a merge function. Together with another patch for optimizing full chunk writes, it improves performance from 29MB/s to 40MB/s when writing to the filesystem on snapshot store. If the snapshot store is placed on a non-dm device with a merge function (such as md-raid), device mapper still limits all bios to page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:04 +01:00
Mike Snitzer	936688d7eb	dm table: fix discard support Remove 'discards_supported' from the dm_table structure. The same information can be easily discovered from the table's target(s) in dm_table_supports_discards(). Before this fix dm_table_supports_discards() would skip checking the individual targets' 'discards_supported' flag if any one target in the table didn't set num_discard_requests > 0. Now the per-target 'discards_supported' flag is effective at insuring the final DM device advertises discard support. But, to be clear, targets that don't support discards (!num_discard_requests) will not receive discard requests. Also DMWARN if a target sets 'discards_supported' override but forgets to set 'num_discard_requests'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Alasdair G Kergon	d15b774c29	dm: fix idr leak on module removal Destroy _minor_idr when unloading the core dm module. (Found by kmemleak.) Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2011-08-02 12:32:01 +01:00
Shaohua Li	1e9bb8808a	block: fix non-atomic access to genhd inflight structures After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 08:35:35 +01:00
Martin K. Petersen	a91a2785b2	block: Require subsystems to explicitly allocate bio_set integrity mempool MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:11:05 +01:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00

... 2 3 4 5 6 ...

563 Commits