linux-sg2042

Commit Graph

Author	SHA1	Message	Date
NeilBrown	e2f23b606b	md: avoid oops on unload if some process is in poll or select. If md-mod is unloaded while some process is in poll() or select(), then that process maintains a pointer to md_event_waiters, and when it tries to unlink from that list, it will oops. The procfs infrastructure ensures that ->poll won't be called after remove_proc_entry, but doesn't provide a wait_queue_head for us to use, and the waitqueue code doesn't provide a way to remove all listeners from a waitqueue. So we need to: 1/ make sure no further references to md_event_waiters are taken (by setting md_unloading) 2/ wake up all processes currently waiting, and 3/ wait until all those processes have disconnected from our wait_queue_head. Reported-by: "majianpeng" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 14:42:34 +10:00
NeilBrown	da1aab3dca	md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails. When performing a user-request check/repair (MD_RECOVERY_REQUEST is set) on a raid1, we allocate multiple bios each with their own set of pages. If the page allocations for one bio fail, we currently do not free the pages allocated for the previous bios, nor do we free the bio itself. This patch frees all the already-allocated pages, and makes sure that all the bios are freed as well. This bug can cause a memory leak which can ultimately OOM a machine. It was introduced in 3.10-rc1. Fixes: `a07876064a` Cc: Kent Overstreet <koverstreet@google.com> Cc: stable@vger.kernel.org (3.10+) Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 14:42:23 +10:00
NeilBrown	035328c202	md/bitmap: don't abuse i_writecount for bitmap files. md bitmap code currently tries to use i_writecount to stop any other process from writing to our bitmap file. But that is really an abuse and has bit-rotted so locking is all wrong. So discard that - root should be allowed to shoot self in foot. Still use it in a much less intrusive way to stop the same file being used as bitmap on two different arrays, and apply other checks to ensure the file is at least vaguely usable for bitmap storage (is regular, is open for write. Support for ->bmap is already checked elsewhere). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de>	2014-04-09 12:26:59 +10:00
Joe Thornber	b10ebd34cc	dm thin: fix rcu_read_lock being held in code that can sleep Commit `c140e1c4e2` ("dm thin: use per thin device deferred bio lists") introduced the use of an rculist for all active thin devices. The use of rcu_read_lock() in process_deferred_bios() can result in a BUG if a dm_bio_prison_cell must be allocated as a side-effect of bio_detain(): BUG: sleeping function called from invalid context at mm/mempool.c:203 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u8:0 3 locks held by kworker/u8:0/6: #0: ("dm-" "thin"){.+.+..}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #1: ((&pool->worker)){+.+...}, at: [<ffffffff8106be42>] process_one_work+0x192/0x550 #2: (rcu_read_lock){.+.+..}, at: [<ffffffff816360b5>] do_worker+0x5/0x4d0 We can't process deferred bios with the rcu lock held, since dm_bio_prison_cell allocation may block if the bio-prison's cell mempool is exhausted. To fix: - Introduce a refcount and completion field to each thin_c - Add thin_get/put methods for adjusting the refcount. If the refcount hits zero then the completion is triggered. - Initialise refcount to 1 when creating thin_c - When iterating the active_thins list we thin_get() whilst the rcu lock is held. - After the rcu lock is dropped we process the deferred bios for that thin. - When destroying a thin_c we thin_put() and then wait for the completion -- to avoid a race between the worker thread iterating from that thin_c and destroying the thin_c. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-04-08 10:18:35 -04:00
Joe Thornber	5e3283e292	dm thin: irqsave must always be used with the pool->lock spinlock Commit `c140e1c4e2` ("dm thin: use per thin device deferred bio lists") incorrectly stopped disabling irqs when taking the pool's spinlock. Irqs must be disabled when taking the pool's spinlock otherwise a thread could spin_lock(), then get interrupted to service thin_endio() in interrupt context, which would then deadlock in spin_lock_irqsave(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-04-08 10:10:51 -04:00
Linus Torvalds	04535d273e	Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - Fix dm-cache corruption caused by discard_block_size > cache_block_size - Fix a lock-inversion detected by LOCKDEP in dm-cache - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios error path - Fix corruption due to non-atomic transaction commit which allowed a metadata superblock to be written before all other metadata was successfully written -- this is common to all targets that use the persistent-data library's transaction manager (dm-thinp, dm-cache and dm-era). - Various small cleanups in the DM core - Add the dm-era target which is useful for keeping track of which blocks were written within a user defined period of time called an 'era'. Use cases include tracking changed blocks for backup software, and partially invalidating the contents of a cache to restore cache coherency after rolling back a vendor snapshot. - Improve the on-disk layout of multithreaded writes to the dm-thin-pool by splitting the pool's deferred bio list to be a per-thin device list and then sorting that list using an rb_tree. The subsequent read throughput of the data written via multiple threads improved by ~70%. - Simplify the multipath target's handling of queuing IO by pushing requests back to the request queue rather than queueing the IO internally. * tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm cache: fix a lock-inversion dm thin: sort the per thin deferred bios using an rb_tree dm thin: use per thin device deferred bio lists dm thin: simplify pool_is_congested dm thin: fix dangling bio in process_deferred_bios error path dm mpath: print more useful warnings in multipath_message() dm-mpath: do not activate failed paths dm mpath: remove extra nesting in map function dm mpath: remove map_io() dm mpath: reduce memory pressure when requeuing dm mpath: remove process_queued_ios() dm mpath: push back requests instead of queueing dm table: add dm_table_run_md_queue_async dm mpath: do not call pg_init when it is already running dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind dm: stop using bi_private dm: remove dm_get_mapinfo dm: make dm_table_alloc_md_mempools static dm: take care to copy the space map roots before locking the superblock dm transaction manager: fix corruption due to non-atomic transaction commit ...	2014-04-05 18:49:31 -07:00
Joe Thornber	0596661f0a	dm cache: fix a lock-inversion When suspending a cache the policy is walked and the individual policy hints written to the metadata via sync_metadata(). This led to this lock order: policy->lock cache_metadata->root_lock When loading the cache target the policy is populated while the metadata lock is held: cache_metadata->root_lock policy->lock Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by ensuring the cache_metadata root_lock is held whilst all the hints are written, rather than being repeatedly locked while policy->lock is held (as was the case with each callout that policy_walk_mappings() made to the old save_hint() method). Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove locking correctness") build option. However, it is not clear how the LOCKDEP reported paths can lead to a deadlock since the two paths, suspending a target and loading a target, never occur at the same time. But that doesn't mean the same lock-inversion couldn't have occurred elsewhere. Reported-by: Marian Csontos <mcsontos@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-04-04 14:53:05 -04:00
Mike Snitzer	67324ea188	dm thin: sort the per thin deferred bios using an rb_tree A thin-pool will allocate blocks using FIFO order for all thin devices which share the thin-pool. Because of this simplistic allocation the thin-pool's space can become fragmented quite easily; especially when multiple threads are requesting blocks in parallel. Sort each thin device's deferred_bio_list based on logical sector to help reduce fragmentation of the thin-pool's ondisk layout. The following tables illustrate the realized gains/potential offered by sorting each thin device's deferred_bio_list. An "io size"-sized random read of the device would result in "seeks/io" fragments being read, with an average "distance/seek" between each fragment. Data was written to a single thin device using multiple threads via iozone (8 threads, 64K for both the block_size and io_size). unsorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.013 11m 64k 0.065 11m 256k 0.274 10m 1m 1.109 10m 4m 4.411 10m 16m 17.097 11m 64m 60.055 13m 256m 148.798 25m 1g 809.929 21m sorted: io size seeks/io distance/seek -------------------------------------- 4k 0.000 0b 16k 0.000 1g 64k 0.001 1g 256k 0.003 1g 1m 0.011 1g 4m 0.045 1g 16m 0.181 1g 64m 0.747 1011m 256m 3.299 1g 1g 14.373 1g Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-04-04 14:53:03 -04:00
Linus Torvalds	b33ce44299	Merge branch 'for-3.15/drivers' of git://git.kernel.dk/linux-block Pull block driver update from Jens Axboe: "On top of the core pull request, here's the pull request for the driver related changes for 3.15. It contains: - Improvements for msi-x registration for block drivers (mtip32xx, skd, cciss, nvme) from Alexander Gordeev. - A round of cleanups and improvements for drbd from Andreas Gruenbacher and Rashika Kheria. - A round of cleanups and improvements for bcache from Kent. - Removal of sleep_on() and friends in DAC960, ataflop, swim3 from Arnd Bergmann. - Bug fix for a bug in the mtip32xx async completion code from Sam Bradshaw. - Bug fix for accidentally bouncing IO on 32-bit platforms with mtip32xx from Felipe Franciosi" * 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits) bcache: remove nested function usage bcache: Kill bucket->gc_gen bcache: Kill unused freelist bcache: Rework btree cache reserve handling bcache: Kill btree_io_wq bcache: btree locking rework bcache: Fix a race when freeing btree nodes bcache: Add a real GC_MARK_RECLAIMABLE bcache: Add bch_keylist_init_single() bcache: Improve priority_stats bcache: Better alloc tracepoints bcache: Kill dead cgroup code bcache: stop moving_gc marking buckets that can't be moved. bcache: Fix moving_pred() bcache: Fix moving_gc deadlocking with a foreground write bcache: Fix discard granularity bcache: Fix another bug recovering from unclean shutdown bcache: Fix a bug recovering from unclean shutdown bcache: Fix a journalling reclaim after recovery bug bcache: Fix a null ptr deref in journal replay ...	2014-04-01 19:43:53 -07:00
Linus Torvalds	675c354a95	Merge tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver patches from Greg KH: "Here's the big char/misc driver updates for 3.15-rc1. Lots of various things here, including the new mcb driver subsystem. All of these have been in linux-next for a while" * tag 'char-misc-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (118 commits) extcon: Move OF helper function to extcon core and change function name extcon: of: Remove unnecessary function call by using the name of device_node extcon: gpio: Use SIMPLE_DEV_PM_OPS macro extcon: palmas: Use SIMPLE_DEV_PM_OPS macro mei: don't use deprecated DEFINE_PCI_DEVICE_TABLE macro mei: amthif: fix checkpatch error mei: client.h fix checkpatch errors mei: use cl_dbg where appropriate mei: fix Unnecessary space after function pointer name mei: report consistently copy_from/to_user failures mei: drop pr_fmt macros mei: make me hw headers private to me hw. mei: fix memory leak of pending write cb objects mei: me: do not reset when less than expected data is received drivers: mcb: Fix build error discovered by 0-day bot cs5535-mfgpt: Simplify dependencies spmi: pm: drop bus-level PM suspend/resume routines spmi: pmic_arb: make selectable on ARCH_QCOM Drivers: hv: vmbus: Increase the limit on the number of pfns we can handle pch_phub: Report error writing MAC back to user ...	2014-04-01 16:13:21 -07:00
Mike Snitzer	c140e1c4e2	dm thin: use per thin device deferred bio lists The thin-pool previously only had a single deferred_bios list that would collect bios for all thin devices in the pool. Split this per-pool deferred_bios list out to per-thin deferred_bios_list -- doing so enables increased parallelism when processing deferred bios. And now that each thin device has its own deferred_bios_list we can sort all bios in the list using logical sector. The requeue code in error handling path is also cleaner as a side-effect. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-03-31 14:14:15 -04:00
Mike Snitzer	760fe67e53	dm thin: simplify pool_is_congested The pool is congested if the pool is in PM_OUT_OF_DATA_SPACE mode. This is more explicit/clear/efficient than inferring whether or not the pool is congested by checking if retry_on_resume_list is empty. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-03-31 10:05:51 -04:00
Mike Snitzer	fe76cd88e6	dm thin: fix dangling bio in process_deferred_bios error path If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2014-03-28 14:37:02 -04:00
Jose Castillo	a356e42620	dm mpath: print more useful warnings in multipath_message() The warning message "Unrecognised multipath message received" is displayed in two different situations in multipath_message(): when the number of arguments passed is invalid and when the string passed in argv[0] is not recognized. Make it easier to identify where the problem is by making these warnings more specific with additional context for each case. Signed-off-by: Jose Castillo <jcastillo@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:25 -04:00
Hannes Reinecke	3a01750964	dm-mpath: do not activate failed paths activate_path() is run without a lock, so the path might be set to failed before activate_path() had a chance to run. This patch adds a check for ->active in activate_path() to avoid unnecessary overhead by calling functions which are known to be failing. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:25 -04:00
Mike Snitzer	9bf59a611a	dm mpath: remove extra nesting in map function Return early for case when no path exists, and when the pathgroup isn't ready. This eliminates the need for extra nesting for the common case. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de>	2014-03-27 16:56:25 -04:00
Hannes Reinecke	36fcffcc65	dm mpath: remove map_io() multipath_map() is now just a wrapper around map_io(), so we can rename map_io() to multipath_map(). Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:25 -04:00
Hannes Reinecke	e3bde04f1e	dm mpath: reduce memory pressure when requeuing When multipath needs to requeue I/O in the block layer the per-request context shouldn't be allocated, as it will be freed immediately afterwards anyway. Avoiding this memory allocation will reduce memory pressure during requeuing. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:25 -04:00
Hannes Reinecke	3e9f1be1b4	dm mpath: remove process_queued_ios() process_queued_ios() has served 3 functions: 1) select pg and pgpath if none is selected 2) start pg_init if requested 3) dispatch queued IOs when pg is ready Basically, a call to queue_work(process_queued_ios) can be replaced by dm_table_run_md_queue_async(), which runs request queue and ends up calling map_io(), which does 1), 2) and 3). Exception is when !pg_ready() (which means either pg_init is running or requested), then multipath_busy() prevents map_io() being called from request_fn. If pg_init is running, it should be ok as long as pg_init_done() does the right thing when pg_init is completed, I.e.: restart pg_init if !pg_ready() or call dm_table_run_md_queue_async() to kick map_io(). If pg_init is requested, we have to make sure the request is detected and pg_init will be started. pg_init is requested in 3 places: a) __choose_pgpath() in map_io() b) __choose_pgpath() in multipath_ioctl() c) pg_init retry in pg_init_done() a) is ok because map_io() calls __pg_init_all_paths(), which does 2). b) needs a call to __pg_init_all_paths(), which does 2). c) needs a call to __pg_init_all_paths(), which does 2). So this patch removes process_queued_ios() and ensures that __pg_init_all_paths() is called at the appropriate locations. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:24 -04:00
Hannes Reinecke	e809917735	dm mpath: push back requests instead of queueing There is no reason why multipath needs to queue requests internally for queue_if_no_path or pg_init; we should rather push them back onto the request queue. And while we're at it we can simplify the conditional statement in map_io() to make it easier to read. Since mpath no longer does internal queuing of I/O the table info no longer emits the internal queue_size. Instead it displays 1 if queuing is being used or 0 if it is not. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:24 -04:00
Mike Snitzer	9974fa2c6a	dm table: add dm_table_run_md_queue_async Introduce dm_table_run_md_queue_async() to run the request_queue of the mapped_device associated with a request-based DM table. Also add dm_md_get_queue() wrapper to extract the request_queue from a mapped_device. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:24 -04:00
Hannes Reinecke	17f4ff45b5	dm mpath: do not call pg_init when it is already running This patch moves condition checks as a preparation of following patches and has no effect on behaviour. process_queued_ios() is the only caller of __pg_init_all_paths() and 2 condition checks are moved from outside to inside without side effects. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>	2014-03-27 16:56:24 -04:00
Monam Agarwal	9cdb852004	dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind Replace rcu_assign_pointer(p, NULL) with RCU_INIT_POINTER(p, NULL). The rcu_assign_pointer() ensures that the initialization of a structure is carried out before storing a pointer to that structure. And in the case of the NULL pointer, there is no structure to initialize. So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL). Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Mikulas Patocka	bfc6d41cee	dm: stop using bi_private Device mapper uses the bio structure's bi_private field as a pointer to dm_target_io or dm_rq_clone_bio_info. But a bio structure is embedded in the dm_target_io and dm_rq_clone_bio_info structures, so the pointer to the structure that contains the bio can be found with the container_of() macro. Remove the use of bi_private and use container_of() instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Mikulas Patocka	d70ab4fb72	dm: remove dm_get_mapinfo Remove dm_get_mapinfo() because no target uses it. Targets can allocate per-bio data using ti->per_bio_data_size, this is much more flexible than union map_info. Leave union map_info only for the request-based multipath target's use. Also delete the unused "unsigned long long ll" field of union map_info. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:24 -04:00
Mikulas Patocka	473c36dfee	dm: make dm_table_alloc_md_mempools static Make the function dm_table_alloc_md_mempools static because it is not called from another file. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:23 -04:00
Joe Thornber	5a32083d03	dm: take care to copy the space map roots before locking the superblock In theory copying the space map root can fail, but in practice it never does because we're careful to check what size buffer is needed. But make certain we're able to copy the space map roots before locking the superblock. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # drop dm-era and dm-cache changes as needed	2014-03-27 16:56:23 -04:00
Joe Thornber	a9d45396f5	dm transaction manager: fix corruption due to non-atomic transaction commit The persistent-data library used by dm-thin, dm-cache, etc is transactional. If anything goes wrong, such as an io error when writing new metadata or a power failure, then we roll back to the last transaction. Atomicity when committing a transaction is achieved by: a) Never overwriting data from the previous transaction. b) Writing the superblock last, after all other metadata has hit the disk. This commit and the following commit ("dm: take care to copy the space map roots before locking the superblock") fix a bug associated with (b). When committing it was possible for the superblock to still be written in spite of an io error occurring during the preceding metadata flush. With these commits we're careful not to take the write lock out on the superblock until after the metadata flush has completed. Change the transaction manager's semantics for dm_tm_commit() to assume all data has been flushed _before_ the single superblock that is passed in. As a prerequisite, split the block manager's block unlocking and flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now the unlocking must be done separately. This issue was discovered by forcing io errors at the crucial time using dm-flakey. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-03-27 16:56:23 -04:00
Heinz Mauelshagen	64ab346a36	dm cache: remove remainder of distinct discard block size Discard block size not being equal to cache block size causes data corruption by erroneously avoiding migrations in issue_copy() because the discard state is being cleared for a group of cache blocks when it should not. Completely remove all code that enabled a distinction between the cache block size and discard block size. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:23 -04:00
Mike Snitzer	d132cc6d9e	dm cache: prevent corruption caused by discard_block_size > cache_block_size If the discard block size is larger than the cache block size we will not properly quiesce IO to a region that is about to be discarded. This results in a race between a cache migration where no copy is needed, and a write to an adjacent cache block that's within the same large discard block. Work around this by limiting the discard_block_size to cache_block_size. Also limit the max_discard_sectors to cache_block_size. A more comprehensive fix that introduces range locking support in the bio_prison and proper quiescing of a discard range that spans multiple cache blocks is already in development. Reported-by: Morgan Mears <Morgan.Mears@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org	2014-03-27 16:56:23 -04:00
Joe Thornber	428e469864	dm bitset: only flush the current word if it has been dirtied This change offers a big performance boost for dm-era. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:23 -04:00
Joe Thornber	eec40579d8	dm: add era target dm-era is a target that behaves similarly to the linear target. In addition it keeps track of which blocks were written within a user defined period of time called an 'era'. Each era target instance maintains the current era as a monotonically increasing 32-bit counter. Use cases include tracking changed blocks for backup software, and partially invalidating the contents of a cache to restore cache coherency after rolling back a vendor snapshot. dm-era is primarily expected to be paired with the dm-cache target. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-27 16:56:23 -04:00
John Sheu	cb85114956	bcache: remove nested function usage Uninlined nested functions can cause crashes when using ftrace, as they don't follow the normal calling convention and confuse the ftrace function graph tracer as it examines the stack. Also, nested functions are supported as a gcc extension, but may fail on other compilers (e.g. llvm). Signed-off-by: John Sheu <john.sheu@gmail.com>	2014-03-18 12:39:28 -07:00
Kent Overstreet	3a2fd9d509	bcache: Kill bucket->gc_gen gc_gen was a temporary used to recalculate last_gc, but since we only need bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can just update last_gc directly. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:24:54 -07:00
Kent Overstreet	2531d9ee61	bcache: Kill unused freelist This was originally added as an optimization that for various reasons isn't needed anymore, but it does add a lot of nasty corner cases (and it was responsible for some recently fixed bugs). Just get rid of it now. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:23:36 -07:00
Kent Overstreet	0a63b66db5	bcache: Rework btree cache reserve handling This changes the bucket allocation reserves to use _real_ reserves - separate freelists - instead of watermarks, which if nothing else makes the current code saner to reason about and is going to be important in the future when we add support for multiple btrees. It also adds btree_check_reserve(), which checks (and locks) the reserves for both bucket allocation and memory allocation for btree nodes; the old code just kinda sorta assumed that since (e.g. for btree node splits) it had the root locked and that meant no other threads could try to make use of the same reserve; this technically should have been ok for memory allocation (we should always have a reserve for memory allocation (the btree node cache is used as a reserve and we preallocate it)), but multiple btrees will mean that locking the root won't be sufficient anymore, and for the bucket allocation reserve it was technically possible for the old code to deadlock. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:23:35 -07:00
Kent Overstreet	56b30770b2	bcache: Kill btree_io_wq With the locking rework in the last patch, this shouldn't be needed anymore - btree_node_write_work() only takes b->write_lock which is never held for very long. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:23:35 -07:00
Kent Overstreet	2a285686c1	bcache: btree locking rework Add a new lock, b->write_lock, which is required to actually modify - or write - a btree node; this lock is only held for short durations. This means we can write out a btree node without taking b->lock, which _is_ held for long durations - solving a deadlock when btree_flush_write() (from the journalling code) is called with a btree node locked. Right now this just occurs in bch_btree_set_root(), but with an upcoming journalling rework is going to happen a lot more. This also turns b->lock into more of a read/intent lock instead of a read/write lock - but not completely, since it still blocks readers. May turn it into a real intent lock at some point in the future. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:23:35 -07:00
Kent Overstreet	05335cff9f	bcache: Fix a race when freeing btree nodes This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free() puts the bucket on the unused freelist, where it can be reused right away without any ordering requirements. It would be better to wait on at least a journal write to go down before reusing the bucket. bch_btree_set_root() does this, and inserting into non leaf nodes is completely synchronous so we should be ok, but future patches are just going to get rid of the unused freelist - it was needed in the past for various reasons but shouldn't be anymore. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:23:34 -07:00
Kent Overstreet	4fe6a81670	bcache: Add a real GC_MARK_RECLAIMABLE This means the garbage collection code can better check for data and metadata pointers to the same buckets. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:36 -07:00
Kent Overstreet	c13f3af924	bcache: Add bch_keylist_init_single() This will potentially save us an allocation when we've got inode/dirent bkeys that don't fit in the keylist's inline keys. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:36 -07:00
Kent Overstreet	1575402052	bcache: Improve priority_stats Break down data into clean data/dirty data/metadata. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:35 -07:00
Kent Overstreet	7159b1ad3d	bcache: Better alloc tracepoints Change the invalidate tracepoint to indicate how much data we're invalidating, and change the alloc tracepoints to indicate what offset they're for. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:35 -07:00
Kent Overstreet	3f5e0a34da	bcache: Kill dead cgroup code This hasn't been used or even enabled in ages. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:35 -07:00
Nicholas Swenson	3f6ef38110	bcache: stop moving_gc marking buckets that can't be moved. Signed-off-by: Nicholas Swenson <nks@daterainc.com>	2014-03-18 12:22:34 -07:00
Kent Overstreet	10d9dcf6ee	bcache: Fix moving_pred() Avoid a potential null pointer deref (e.g. from check keys for cache misses) Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:34 -07:00
Nicholas Swenson	da415a096f	bcache: Fix moving_gc deadlocking with a foreground write Deadlock happened because a foreground write slept, waiting for a bucket to be allocated. Normally the gc would mark buckets available for invalidation. But the moving_gc was stuck waiting for outstanding writes to complete. These writes used the bcache_wq, the same queue foreground writes used. This fix gives moving_gc its own work queue, so it can still finish moving even if foreground writes are stuck waiting for allocation. It also makes the work queue a parameter to the data_insert path, so moving_gc can use its workqueue for writes. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:33 -07:00
Kent Overstreet	90db6919f5	bcache: Fix discard granularity blk_stack_limits() doesn't like a discard granularity of 0. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:33 -07:00
Kent Overstreet	487dded86e	bcache: Fix another bug recovering from unclean shutdown The on disk bucket gens are allowed to be out of date, when we reuse buckets that didn't have any live data in them. To deal with this, the initial gc has to update the bucket gen when we find a pointer gen newer than the bucket's gen. Unfortunately we weren't doing this for pointers in the journal that we're about to replay. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:33 -07:00
Kent Overstreet	0bd143fd80	bcache: Fix a bug recovering from unclean shutdown The code to fixup incorrect bucket prios incorrectly did not skip btree node freeing keys Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:22:32 -07:00
Kent Overstreet	27201cfdaa	bcache: Fix a journalling reclaim after recovery bug On recovery we weren't correctly keeping track of what journal buckets had open journal entries, thus it was possible for them to be overwritten until we'd written all new journal entries. Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-18 12:21:48 -07:00
Kent Overstreet	65ddf45a31	bcache: Fix a null ptr deref in journal replay Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-17 19:01:03 -07:00
Kent Overstreet	4fa03402cd	bcache: Fix a lockdep splat in an error path Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-03-17 18:59:09 -07:00
Heinz Mauelshagen	e893fba90c	dm cache: fix access beyond end of origin device In order to avoid wasting cache space a partial block at the end of the origin device is not cached. Unfortunately, the check for such a partial block at the end of the origin device was flawed. Fix accesses beyond the end of the origin device that occurred due to attempted promotion of an undetected partial block by: - initializing the per bio data struct to allow cache_end_io to work properly - recognizing access to the partial block at the end of the origin device - avoiding out of bounds access to the discard bitset Otherwise, users can experience errors like the following: attempt to access beyond end of device dm-5: rw=0, want=20971520, limit=20971456 ... device-mapper: cache: promotion failed; couldn't copy block Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-03-12 13:52:00 -04:00
Heinz Mauelshagen	8b9d966665	dm cache: fix truncation bug when copying a block to/from >2TB fast device During demotion or promotion to a cache's >2TB fast device we must not truncate the cache block's associated sector to 32bits. The 32bit temporary result of from_cblock() caused a 32bit multiplication when calculating the sector of the fast device in issue_copy_real(). Use an intermediate 64bit type to store the 32bit from_cblock() to allow for proper 64bit multiplication. Here is an example of how this bug manifests on an ext4 filesystem: EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt. JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-03-12 13:49:27 -04:00
Joe Thornber	cebc2de44d	dm space map metadata: fix refcount decrement below 0 which caused corruption This has been a relatively long-standing issue that wasn't nailed down until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014, see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html From that report: "When decreasing the reference count of a metadata block with its reference count equals 3, we will call dm_btree_remove() to remove this enrty from the B+tree which keeps the reference count info in metadata device. The B+tree will try to rebalance the entry of the child nodes in each node it traversed, and the rebalance process contains the following steps. (1) Finding the corresponding children in current node (shadow_current(s)) (2) Shadow the children block (issue BOP_INC) (3) redistribute keys among children, and free children if necessary (issue BOP_DEC) Since the update of a metadata block's reference count could be recursive, we will stash these reference count update operations in smm->uncommitted and then process them in a FILO fashion. The problem is that step(3) could free the children which is created in step(2), so the BOP_DEC issued in step(3) will be carried out before the BOP_INC issued in step(2) since these BOPs will be processed in FILO fashion. Once the BOP_DEC from step(3) tries to decrease the reference count of newly shadow block, it will report failure for its reference equals 0 before decreasing. It looks like we can solve this issue by processing these BOPs in a FIFO fashion instead of FILO." Commit `5b564d80` ("dm space map: disallow decrementing a reference count below zero") changed the code to report an error for this temporary refcount decrement below zero. 
So what was previously a harmless invalid refcount became a hard failure due to the new error path: device-mapper: space map common: unable to decrement a reference count below 0 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22 device-mapper: thin: 253:6: switching pool to read-only mode This bug is in dm persistent-data code that is common to the DM thin and cache targets. So any users of those targets should apply this fix. Fix this by applying recursive space map operations in FIFO order rather than FILO. Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801 Reported-by: Apollon Oikonomopoulos <apoikos@debian.org> Reported-by: edwillam1007@gmail.com Reported-by: Teng-Feng Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.13+	2014-03-07 12:02:47 -05:00
Joe Thornber	738211f70a	dm thin: fix noflush suspend IO queueing i) by the time DM core calls the postsuspend hook the dm_noflush flag has been cleared. So the old thin_postsuspend did nothing. We need to use the presuspend hook instead. ii) There was a race between bios leaving DM core and arriving in the deferred queue. thin_presuspend now sets a 'requeue' flag causing all bios destined for that thin to be requeued back to DM core. Then it requeues all held IO, and all IO on the deferred queue (destined for that thin). Finally postsuspend clears the 'requeue' flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-05 15:26:59 -05:00
Joe Thornber	18adc57779	dm thin: fix deadlock in __requeue_bio_list The spin lock in requeue_io() was held for too long, allowing deadlock. Don't worry, due to other issues addressed in the following "dm thin: fix noflush suspend IO queueing" commit, this code was never called. Fix this by taking the spin lock for a much shorter period of time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-05 15:26:58 -05:00
Joe Thornber	3e1a069909	dm thin: fix out of data space handling Ideally a thin pool would never run out of data space; the low water mark would trigger userland to extend the pool before we completely run out of space. However, many small random IOs to unprovisioned space can consume data space at an alarming rate. Adjust your low water mark if you're frequently seeing "out-of-data-space" mode. Before this fix, if data space ran out the pool would be put in PM_READ_ONLY mode which also aborted the pool's current metadata transaction (data loss for any changes in the transaction). This had a side-effect of needlessly compromising data consistency. And retry of queued unserviceable bios, once the data pool was resized, could initiate changes to potentially inconsistent pool metadata. Now when the pool's data space is exhausted transition to a new pool mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data may not be allocated. This allows users to remove thin volumes or discard data to recover data space. The pool is no longer put in PM_READ_ONLY mode in response to the pool running out of data space. And PM_READ_ONLY mode no longer aborts the pool's current metadata transaction. Also, set_pool_mode() will now notify userspace when the pool mode is changed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-05 15:26:58 -05:00
Mike Snitzer	07f2b6e038	dm thin: ensure user takes action to validate data and metadata consistency If a thin metadata operation fails the current transaction will abort, thereby creating the potential for IO layers up the stack (e.g. filesystems) to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's superblock which: 1) requires the user verify the thin metadata is consistent (e.g. use thin_check, etc) 2) suggests the user verify the thin data is consistent (e.g. use fsck) The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is to run thin_repair. On metadata operation failure: abort current metadata transaction, set pool in read-only mode, and now set the needs_check flag. As part of this change, constraints are introduced or relaxed: * don't allow a pool to transition to write mode if needs_check is set * don't allow data or metadata space to be resized if needs_check is set * if a thin pool's metadata space is exhausted: the kernel will now force the user to take the pool offline for repair before the kernel will allow the metadata space to be extended. Also, update Documentation to include information about when the thin provisioning target commits metadata, how it handles metadata failures and running out of space. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>	2014-03-05 15:25:35 -05:00
Mike Snitzer	cdc2b41584	dm thin: synchronize the pool mode during suspend Commit `b5330655` ("dm thin: handle metadata failures more consistently") increased potential for the pool's mode to be changed in response to metadata operation failures. When the pool mode is changed it isn't synchronized with the mode in pool_features stored in the target's context (ti->private) that is used as the basis for (re)establishing the pool mode during resume via bind_control_target. It is important that we synchronize the pool mode when it is changed otherwise the pool may experience an unexpected mode transition on the next resume (especially if there was no new table load). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-03-04 11:17:51 -05:00
Mikulas Patocka	2c945820ca	dm snapshot: fix metadata corruption Commit `55494bf294` ("dm snapshot: use dm-bufio") broke snapshots. Before that 3.14-rc1 commit, loading a snapshot's list of exceptions involved reading exception areas one by one into ps->area and inserting those exceptions into the hash table. Commit `55494bf294` changed it so that dm-bufio with prefetch is used to load exceptions in batches. Exceptions are loaded correctly, but ps->area is left uninitialized. When a new exception is allocated, it is stored in this uninitialized ps->area which will be written to the disk. This causes metadata corruption. Fix this corruption by copying the last area that was read via dm-bufio into ps->area. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-03 17:58:13 -05:00
Mike Snitzer	c64d240df3	dm: fix Kconfig indentation Since DM_DEBUG_BLOCK_STACK_TRACING is a DM_PERSISTENT_DATA config option move it from drivers/md/Kconfig to drivers/md/persistent-data/Kconfig. Doing so fixes indentation for other DM config options. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-03-03 17:31:07 -05:00
Greg Kroah-Hartman	aa074c1c80	Merge 3.14-rc5 into char-misc-next We want these fixes in here as well.	2014-03-02 19:53:09 -08:00
Heinz Mauelshagen	14f398ca2f	dm cache mq: fix memory allocation failure for large cache devices The memory allocated for the multiqueue policy's hash table doesn't need to be physically contiguous. Use vzalloc() instead of kzalloc(). Fedora has been carrying this fix since 10/10/2013. Failure seen during creation of a 10TB cached device with a 2048 sector block size and 411GB cache size: dmsetup: page allocation failure: order:9, mode:0x10c0d0 CPU: 11 PID: 29235 Comm: dmsetup Not tainted 3.10.4 #3 Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1a 12/30/2011 000000000010c0d0 ffff880090941898 ffffffff81387ab4 ffff880090941928 ffffffff810bb26f 0000000000000009 000000000010c0d0 ffff880090941928 ffffffff81385dbc ffffffff815f3840 ffffffff00000000 000002000010c0d0 Call Trace: [<ffffffff81387ab4>] dump_stack+0x19/0x1b [<ffffffff810bb26f>] warn_alloc_failed+0x110/0x124 [<ffffffff81385dbc>] ? __alloc_pages_direct_compact+0x17c/0x18e [<ffffffff810bda2e>] __alloc_pages_nodemask+0x6c7/0x75e [<ffffffff810bdad7>] __get_free_pages+0x12/0x3f [<ffffffff810ea148>] kmalloc_order_trace+0x29/0x88 [<ffffffff810ec1fd>] __kmalloc+0x36/0x11b [<ffffffffa031eeed>] ? mq_create+0x1dc/0x2cf [dm_cache_mq] [<ffffffffa031efc0>] mq_create+0x2af/0x2cf [dm_cache_mq] [<ffffffffa0314605>] dm_cache_policy_create+0xa7/0xd2 [dm_cache] [<ffffffffa0312530>] ? cache_ctr+0x245/0xa13 [dm_cache] [<ffffffffa031263e>] cache_ctr+0x353/0xa13 [dm_cache] [<ffffffffa012b916>] dm_table_add_target+0x227/0x2ce [dm_mod] [<ffffffffa012e8e4>] table_load+0x286/0x2ac [dm_mod] [<ffffffffa012e65e>] ? dev_wait+0x8a/0x8a [dm_mod] [<ffffffffa012e324>] ctl_ioctl+0x39a/0x3c2 [dm_mod] [<ffffffffa012e35a>] dm_ctl_ioctl+0xe/0x12 [dm_mod] [<ffffffff81101181>] vfs_ioctl+0x21/0x34 [<ffffffff811019d3>] do_vfs_ioctl+0x3b1/0x3f4 [<ffffffff810f4d2e>] ? ____fput+0x9/0xb [<ffffffff81050b6c>] ? 
task_work_run+0x7e/0x92 [<ffffffff81101a68>] SyS_ioctl+0x52/0x82 [<ffffffff81391d92>] system_call_fastpath+0x16/0x1b Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-02-28 12:18:29 -05:00
Heinz Mauelshagen	e0d849fad7	dm cache: fix truncation bug when mapping I/O to >2TB fast device When remapping a block to the cache's fast device that is larger than 2TB we must not truncate the destination sector to 32bits. The 32bit temporary result of from_cblock() was being overflowed in remap_to_cache() due to the logical left shift. Use an intermediate 64bit type to store the 32bit from_cblock() result to fix the overflow. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-02-28 09:23:02 -05:00
Mike Snitzer	7d48935eff	dm thin: allow metadata space larger than supported to go unused It was always intended that a user could provide a thin metadata device that is larger than the max supported by the on-disk format. The extra space would just go unused. Unfortunately that never worked. If the user attempted to use a larger metadata device on creation they would get an error like the following: device-mapper: space map common: space map too large device-mapper: transaction manager: couldn't create metadata space map device-mapper: thin metadata: tm_create_with_sm failed device-mapper: table: 252:17: thin-pool: Error creating metadata object device-mapper: ioctl: error adding target to table Fix this by allowing the initial metadata space map creation to cap its size at the max number of blocks supported (DM_SM_METADATA_MAX_BLOCKS). get_metadata_dev_size() must also impose DM_SM_METADATA_MAX_BLOCKS (via THIN_METADATA_MAX_SECTORS), otherwise extending metadata would cap at THIN_METADATA_MAX_SECTORS_WARNING (which is larger than supported). Also, the calculation for THIN_METADATA_MAX_SECTORS didn't account for the sizeof the disk_bitmap_header. So the supported maximum metadata size is a bit smaller (reduced from 33423360 to 33292800 sectors). Lastly, remove the "excess space will not be used" warning message from get_metadata_dev_size(); it resulted in printing the warning multiple times. Factor out warn_if_metadata_device_too_big(), call it from pool_ctr() and maybe_resize_metadata_dev(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-02-27 11:49:08 -05:00
Hannes Reinecke	a1989b3300	dm mpath: fix stalls when handling invalid ioctls An invalid ioctl will never be valid, irrespective of whether multipath has active paths or not. So for invalid ioctls we do not have to wait for multipath to activate any paths, but can rather return an error code immediately. This fix resolves numerous instances of: udevd[]: worker [] unexpectedly returned with status 0x0100 that have been seen during testing. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org	2014-02-26 09:44:44 -05:00
Kent Overstreet	dabb443340	bcache: Fix a shutdown bug Shutdown wasn't cancelling/waiting on journal_write_work() Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-02-25 18:42:49 -08:00
Kent Overstreet	1b4eaf3d38	bcache: Fix flash_dev_cache_miss() for real this time The code was using sectors to count the number of sectors it was zeroing... but then it passed it to bio_advance()... after it had been set to 0. Amusing... Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-02-25 18:41:11 -08:00
Mike Snitzer	1acacc0784	dm thin: fix the error path for the thin device constructor dm_pool_close_thin_device() must be called if dm_set_target_max_io_len() fails in thin_ctr(). Otherwise __pool_destroy() will fail because the pool will still have an open thin device: device-mapper: thin metadata: attempt to close pmd when 1 device(s) are still open device-mapper: thin: __pool_destroy: dm_pool_metadata_close() failed. Also, must establish error code if failing thin_ctr() because the pool is in fail_io mode. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2014-02-24 11:41:18 -05:00
Kent Overstreet	85cbe1f88c	bcache: Fix another compiler warning on m68k Use a bigger hammer this time Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org>	2014-02-18 08:55:05 -08:00
Greg Kroah-Hartman	ba4b60e85d	Merge 3.14-rc3 into char-misc-next We need the fixes here for future mei and other patches. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-18 08:09:40 -08:00
Mikulas Patocka	f3a44fe060	dm raid1: fix immutable biovec related BUG when retrying read bio When restoring bi_end_io, increase bi_remaining before retrying the bio to avoid BUG_ON(atomic_read(&bio->bi_remaining) <= 0) in bio_endio(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-02-18 10:48:57 -05:00
Mikulas Patocka	d73f990729	dm io: fix I/O to multiple destinations Commit `003b5c5719` ("block: Convert drivers to immutable biovecs") broke dm-mirror due to dm-io breakage. dm-io had three possible iterators (DM_IO_PAGE_LIST, DM_IO_BVEC, DM_IO_VMA) that iterate over pages where the I/O should be performed. The switch to immutable biovecs changed the DM_IO_BVEC iterator to DM_IO_BIO. Before this change the iterator stored the pointer to a bio vector in the dpages structure. The iterator incremented the pointer in the dpages structure as it advanced over the pages. After the immutable biovecs change, the DM_IO_BIO iterator stores a pointer to the bio in the dpages structure and uses bio_advance to change the bio as it advances. The problem is that the function dispatch_io stores the content of the dpages structure into the variable old_pages and restores it before issuing I/O to each of the devices. Before the change, the statement "dp = old_pages;" restored the iterator to its starting position. After the change, struct dpages holds a pointer to the bio, thus the statement "dp = old_pages;" doesn't restore the iterator. Consequently, in the context of dm-mirror: only the first mirror leg is written correctly, the kernel locks up when trying to write the other mirror legs because the number of sectors to write in the where->count variable doesn't match the number of sectors returned by the iterator. This patch fixes the bug by partially reverting the original patch - it changes the code so that struct dpages holds a pointer to the bio vector, so that the statement "*dp = old_pages;" restores the iterator correctly. The field "context_u" holds the offset from the beginning of the current bio vector entry, just like the "bio->bi_iter.bi_bvec_done" field. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-02-17 11:00:05 -05:00
Mike Snitzer	4d1662a30d	dm thin: avoid metadata commit if a pool's thin devices haven't changed Commit `905e51b` ("dm thin: commit outstanding data every second") introduced a periodic commit. This commit occurs regardless of whether any thin devices have made changes. Fix the periodic commit to check if any of a pool's thin devices have changed using dm_pool_changed_this_transaction(). Reported-by: Alexander Larsson <alexl@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org	2014-02-17 11:00:05 -05:00
Mike Snitzer	80ae49aaed	dm cache: do not add migration to completed list before unhooking bio When completing an overwrite bio, in overwrite_endio(), the associated migration should not be added to the 'completed_migrations' until the bio's fields are restored with dm_unhook_bio(). Otherwise, do_worker() can race to process 'completed_migrations' before dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is unlikely to cause any problems given the current code but should be fixed on the basis of correctness. Also, the cache's spinlock only needs to be held when manipulating the 'completed_migrations' list -- other changes don't need protection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>	2014-02-17 11:00:05 -05:00
Mike Snitzer	c6eda5e81c	dm cache: move hook_info into common portion of per_bio_data structure Commit `c9d28d5d` ("dm cache: promotion optimisation for writes") incorrectly placed the 'hook_info' member in the writethrough-only portion of the per_bio_data structure. Given that the overwrite optimization may be used for writeback the 'hook_info' member must be placed above the 'cache' member of the per_bio_data structure. Any members above 'cache' are available from both writeback and writethrough modes' per_bio_data structure. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Cc: stable@vger.kernel.org # 3.13+	2014-02-17 11:00:05 -05:00
Linus Torvalds	bd3813d52d	Two bugfixes for md both tagged for -stable -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAUvxBNjnsnt1WYoG5AQLymhAAnKznI2YhFVqK21mpo1l2JDSkwxqIBvBZ hcW24zF6dNU4cJFmRQqOeL2AkzHWSqX4/J/DGXvI9wFll1CkdNs+UVQJ12Pod3gK gTDmqRCe/x+bQxrOR5VfyKv0slia12vn9mqfDd2mX41wcr7ceHsdHbemPhgIcUCC WLERQi9Yn/Eb2+rltTzZ3XaHwIlIozqZ0yRZ6wH45iyuk+uiholEJjJp8LOWpzTe rKE4s5qd1NAAJsrMHZ11mZWq/4VtgYJ3AcWVXVWqBPxmlI0FnBPU/KVpJkAcrVjB N6tqmR1/nHcrGlaOgWSS6UfNGVMe3L2HJpaIdjTM65Tdb+WFpEPevTy9qYsLC3Ic zV/KmErUtSFMJKYBr9YyRnSpXtnSDo8BeRsWJm9ZaA5UV9yUVBNwWDFNFP/Bkqze v4wLMRj54U5fjRZBq/PaFbk/A2nDCkGHC4uZCgJ+Mwhoo6rxpho/oKBjBBlmpw3q 4Q0yWgZ8F/ZWFUrGzi1TY3tdYrl3yCOpZ3l5aRTtTqlU3aVShIIiKCKDvs2v8l6h C5igUbnW5BtsMMCOwdULc/lHgN3vMbJEA+7YdmeouDEY5QAk0O6nxan3y+cbtC5u F+++tkWzSQZJRGhdAxdAXsABYfHiR7Wnft96+iMpnQYbm35CdYYwlOhhl0iI/+Ec FcpDXOz9faA= =J3I5 -----END PGP SIGNATURE----- Merge tag 'md/3.14-fixes' of git://neil.brown.name/md Pull md fixes from Neil Brown: "Two bugfixes for md both tagged for -stable" * tag 'md/3.14-fixes' of git://neil.brown.name/md: md/raid5: Fix CPU hotplug callback registration md/raid1: restore ability for check and repair to fix read errors.	2014-02-14 12:48:16 -08:00
Oleg Nesterov	789b5e0315	md/raid5: Fix CPU hotplug callback registration Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below: get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); register_cpu_notifier(&foobar_cpu_notifier); put_online_cpus(); This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations). Interestingly, the raid5 code can actually prevent double initialization and hence can use the following simplified form of callback registration: register_cpu_notifier(&foobar_cpu_notifier); get_online_cpus(); for_each_online_cpu(cpu) init_cpu(cpu); put_online_cpus(); A hotplug operation that occurs between registering the notifier and calling get_online_cpus(), won't disrupt anything, because the code takes care to perform the memory allocations only once. So reorganize the code in raid5 this way to fix the deadlock with callback registration. Cc: linux-raid@vger.kernel.org Cc: stable@vger.kernel.org (v2.6.32+) Fixes: `36d1c6476b` Signed-off-by: Oleg Nesterov <oleg@redhat.com> [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-02-13 13:46:45 +11:00
David Fries	ac8f73305e	connector: add portid to unicast in addition to broadcasting This allows replying only to the requestor portid while still supporting broadcasting. Pass 0 to portid for the previous behavior. Signed-off-by: David Fries <David@Fries.net> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2014-02-07 15:40:17 -08:00
NeilBrown	1877db7558	md/raid1: restore ability for check and repair to fix read errors. Commit `30bc9b5387` ("md/raid1: fix bio handling problems in process_checks()") moved the bio_reset() to a point before where BIO_UPTODATE is checked, so that check now always reports that the bio is uptodate, even if it is not. This causes process_check() to sometimes treat read-errors as successful matches so the good data isn't written out. This patch preserves the flag until it is needed. Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed an even worse bug). So suitable for any -stable since 3.10. Reported-and-tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: stable@vger.kernel.org (3.10+) Fixed: `30bc9b5387` Signed-off-by: NeilBrown <neilb@suse.de>	2014-02-05 12:26:04 +11:00
Jens Axboe	96d2e8b5e2	Merge branch 'bcache-for-3.14' of git://evilpiepirate.org/~kent/linux-bcache into for-linus	2014-01-30 12:57:55 -07:00
Linus Torvalds	53d8ab29f8	Merge branch 'for-3.14/drivers' of git://git.kernel.dk/linux-block Pull block IO driver changes from Jens Axboe: - bcache update from Kent Overstreet. - two bcache fixes from Nicholas Swenson. - cciss pci init error fix from Andrew. - underflow fix in the parallel IDE pg_write code from Dan Carpenter. I'm sure the 1 (or 0) users of that are now happy. - two PCI related fixes for sx8 from Jingoo Han. - floppy init fix for first block read from Jiri Kosina. - pktcdvd error return miss fix from Julia Lawall. - removal of IRQF_SHARED from the SEGA Dreamcast CD-ROM code from Michael Opdenacker. - comment typo fix for the loop driver from Olaf Hering. - potential oops fix for null_blk from Raghavendra K T. - two fixes from Sam Bradshaw (Micron) for the mtip32xx driver, fixing an OOM problem and a problem with handling security locked conditions * 'for-3.14/drivers' of git://git.kernel.dk/linux-block: (47 commits) mg_disk: Spelling s/finised/finished/ null_blk: Null pointer deference problem in alloc_page_buffers mtip32xx: Correctly handle security locked condition mtip32xx: Make SGL container per-command to eliminate high order dma allocation drivers/block/loop.c: fix comment typo in loop_config_discard drivers/block/cciss.c:cciss_init_one(): use proper errnos drivers/block/paride/pg.c: underflow bug in pg_write() drivers/block/sx8.c: remove unnecessary pci_set_drvdata() drivers/block/sx8.c: use module_pci_driver() floppy: bail out in open() if drive is not responding to block0 read bcache: Fix auxiliary search trees for key size > cacheline size bcache: Don't return -EINTR when insert finished bcache: Improve bucket_prio() calculation bcache: Add bch_bkey_equal_header() bcache: update bch_bkey_try_merge bcache: Move insert_fixup() to btree_keys_ops bcache: Convert sorting to btree_keys bcache: Convert debug code to btree_keys bcache: Convert btree_iter to struct btree_keys bcache: Refactor bset_tree sysfs stats ...	2014-01-30 11:40:10 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_vec series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Nicholas Swenson	e3b4825b85	bcache: bugfix - gc thread now gets woken when cache is full Signed-off-by: Nicholas Swenson <nks@daterainc.com>	2014-01-29 13:06:42 -08:00
Kent Overstreet	3572324af0	bcache: Minor fixes from kbuild robot Signed-off-by: Kent Overstreet <kmo@daterainc.com>	2014-01-29 13:06:41 -08:00
Darrick J. Wong	9471744767	bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED The BUG_ON at the end of __bch_btree_mark_key can be triggered due to an integer overflow error: BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13); ... SET_GC_SECTORS_USED(g, min_t(unsigned, GC_SECTORS_USED(g) + KEY_SIZE(k), (1 << 14) - 1)); BUG_ON(!GC_SECTORS_USED(g)); In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide. While the SET_ code tries to ensure that the field doesn't overflow by clamping it to (1<<14)-1 == 16383, this is incorrect because 16383 requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() = 8192, the SET_ statement tries to store 8192 into a 13-bit field. In a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON. Therefore, create a field width constant and a max value constant, and use those to create the bitfield and check the inputs to SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have BUG_ON checks for too-large values, but that's a separate patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2014-01-29 13:06:15 -08:00
Linus Torvalds	5c85121bf6	md updates for 3.14 All bug fixes, two tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAUuG8WTnsnt1WYoG5AQKW7A//V93TYUiUAG6zNUFrNZjuoXP0ym3jlpkH eIIFdcV7rr0Irgtd8+s9cW8Cjsbq3d/vMbFlwP1Co32mnCnFFojeKtCvM9GqkYrH o4Zr1nVAVYzKO4awByK3wBT9WbzEc/XlDgYQIpZExIYeZdzLOm6HyvlRbcE86Ug5 QoGYOUlLu4LUZmFgB9zQ7JM0GACV5pS1afSObtACj2t2x5GVHNU84u+M+D8urPXO wnf+AIAzquh5F+8MX+DxmMEUaSzUHf8fXOM3jYVbzPI71SpaHssL4SwBn+j4I/8/ SCSqeIh7qMSuqUy63/iHKCy5qAgNuRdL9fYlOTkpxzHm81Ddj8u7fySsApggVOa2 yeKTkSRlsMFeu+LiGKNi/fINVxboaoYJVZ2DTNtKxSuW2VL2aPNz1Qjq4QnR3nSI LpaB3VeVKdMsH8Em1a8cgZWcjo5YFAcNtUnJq2fvj9VZ3SJNw4ZoKDL+l718iGao xIwAXMSafAHQVAAaNVFkwrea13TeOyxikY5Ra4vWfm+Fw8TzmYq5DqO0zaILwdAJ 2FnNj2/2y3hk2K7qBcEvjjEakxPlTwzrzxZMfJDRMuQLqvrjbXiMGOnWzgl1D/9x 4/uPjeFZLG7byxmIyg4Y83NkPgkWnRPpGK98r26pUH1UgnRF0a5aUFXQk7rsQrU6 noRkZ9EPD8s= =d81E -----END PGP SIGNATURE----- Merge tag 'md/3.14' of git://neil.brown.name/md Pull md updates from Neil Brown: "All bug fixes, two tagged for -stable" * tag 'md/3.14' of git://neil.brown.name/md: md/raid5: close recently introduced race in stripe_head management. md/raid5: fix long-standing problem with bitmap handling on write failure. md: check command validity early in md_ioctl(). md: ensure metadata is writen after raid level change. md/raid10: avoid fullsync when not necessary. md: allow a partially recovered device to be hot-added to an array. md: Change handling of save_raid_disk and metadata update during recovery.	2014-01-24 17:41:50 -08:00
Linus Torvalds	fe41c2c018	A set of device-mapper changes for 3.14. A lot of attention was paid to improving the thin-provisioning target's handling of metadata operation failures and running out of space. A new 'error_if_no_space' feature was added to allow users to error IOs rather than queue them when either the data or metadata space is exhausted. Additional fixes/features include: - a few fixes to properly support thin metadata device resizing - a solution for reliably waiting for a DM device's embedded kobject to be released before destroying the device - old dm-snapshot is updated to use the dm-bufio interface to take advantage of readahead capabilities that improve snapshot activation - new dm-cache target tunables to control how quickly data is promoted to the cache (fast) device - improved write efficiency of cluster mirror target by combining userspace flush and mark requests -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJS4GClAAoJEMUj8QotnQNacdEH/2ES5k5itUQRY9jeI+u2zYNP vdsRTYf+97+B3jpRvpWbMt4kxT2tjaQbkxJ+iKRHy2MBLFUgq8ruH1RS/Q5VbDeg 6i6ol8mpNxhlvo/KTMxXqRcWDSxShiMfhz2lXC2bJ7M4sP/iiH85s4Pm4YQ59jpd OIX7qj36m/cV/le9YQbexJEEsaj+3genbzL26wyyvtG/rT9fWnXa7clj2gqTdToG YCEBCRf5FH9X6W/Oc50nMw5n2dt/MRmPre/MAlOjemeaosB0WJiKaswM25rnvHp0 JnhxQ2K2C5KIKAWIfwPOImdb9zWW7p1dIRLsS8nHBUQr0BF5VRkmvpnYH4qBtcc= =e7e0 -----END PGP SIGNATURE----- Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper changes from Mike Snitzer: "A lot of attention was paid to improving the thin-provisioning target's handling of metadata operation failures and running out of space. A new 'error_if_no_space' feature was added to allow users to error IOs rather than queue them when either the data or metadata space is exhausted. 
Additional fixes/features include: - a few fixes to properly support thin metadata device resizing - a solution for reliably waiting for a DM device's embedded kobject to be released before destroying the device - old dm-snapshot is updated to use the dm-bufio interface to take advantage of readahead capabilities that improve snapshot activation - new dm-cache target tunables to control how quickly data is promoted to the cache (fast) device - improved write efficiency of cluster mirror target by combining userspace flush and mark requests" * tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm log userspace: allow mark requests to piggyback on flush requests dm space map metadata: fix bug in resizing of thin metadata dm cache: add policy name to status output dm thin: fix pool feature parsing dm sysfs: fix a module unload race dm snapshot: use dm-bufio prefetch dm snapshot: use dm-bufio dm snapshot: prepare for switch to using dm-bufio dm snapshot: use GFP_KERNEL when initializing exceptions dm cache: add block sizes and total cache blocks to status output dm btree: add dm_btree_find_lowest_key dm space map metadata: fix extending the space map dm space map common: make sure new space is used during extend dm: wait until embedded kobject is released before destroying a device dm: remove pointless kobject comparison in dm_get_from_kobject dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() dm cache policy mq: introduce three promotion threshold tunables dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD dm thin: fix set_pool_mode exposed pool operation races dm thin: eliminate the no_free_space flag ...	2014-01-22 20:17:48 -08:00
Dongmao Zhang	5066a4df1f	dm log userspace: allow mark requests to piggyback on flush requests In the cluster evironment, cluster write has poor performance because userspace_flush() has to contact a userspace program (cmirrord) for clear/mark/flush requests. But both mark and flush requests require cmirrord to communicate the message to all the cluster nodes for each flush call. This behaviour is really slow. To address this we now merge mark and flush requests together to reduce the kernel-userspace-kernel time. We allow a new directive, "integrated_flush" that can be used to instruct the kernel log code to combine flush and mark requests when directed by userspace. If not directed by userspace (due to an older version of the userspace code perhaps), the kernel will function as it did previously - preserving backwards compatibility. Additionally, flush requests are performed lazily when only clear requests exist. Signed-off-by: Dongmao Zhang <dmzhang@suse.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-21 23:46:27 -05:00
Linus Torvalds	f075e0f699	Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The bulk of changes are cleanups and preparations for the upcoming kernfs conversion. - cgroup_event mechanism which is and will be used only by memcg is moved to memcg. - pidlist handling is updated so that it can be served by seq_file. Also, the list is not sorted if sane_behavior. cgroup documentation explicitly states that the file is not sorted but it has been for quite some time. - All cgroup file handling now happens on top of seq_file. This is to prepare for kernfs conversion. In addition, all operations are restructured so that they map 1-1 to kernfs operations. - Other cleanups and low-pri fixes" * 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (40 commits) cgroup: trivial style updates cgroup: remove stray references to css_id doc: cgroups: Fix typo in doc/cgroups cgroup: fix fail path in cgroup_load_subsys() cgroup: fix missing unlock on error in cgroup_load_subsys() cgroup: remove for_each_root_subsys() cgroup: implement for_each_css() cgroup: factor out cgroup_subsys_state creation into create_css() cgroup: combine css handling loops in cgroup_create() cgroup: reorder operations in cgroup_create() cgroup: make for_each_subsys() useable under cgroup_root_mutex cgroup: css iterations and css_from_dir() are safe under cgroup_mutex cgroup: unify pidlist and other file handling cgroup: replace cftype->read_seq_string() with cftype->seq_show() cgroup: attach cgroup_open_file to all cgroup files cgroup: generalize cgroup_pidlist_open_file cgroup: unify read path so that seq_file is always used cgroup: unify cgroup_write_X64() and cgroup_write_string() cgroup: remove cftype->read(), ->read_map() and ->write() hugetlb_cgroup: convert away from cftype->read() ...	2014-01-21 17:51:34 -08:00
NeilBrown	7da9d450ab	md/raid5: close recently introduced race in stripe_head management. As release_stripe and __release_stripe decrement ->count and then manipulate ->lru both under ->device_lock, it is important that get_active_stripe() increments ->count and clears ->lru also under ->device_lock. However we currently list_del_init ->lru under the lock, but increment the ->count outside the lock. This can lead to races and list corruption. So move the atomic_inc(&sh->count) up inside the ->device_lock protected region. Note that we still increment ->count without device lock in the case where get_free_stripe() was called, and in fact don't take ->device_lock at all in that path. This is safe because if the stripe_head can be found by get_free_stripe, then the hash lock assures us the no-one else could possibly be calling release_stripe() at the same time. Fixes: `566c09c534` Cc: stable@vger.kernel.org (3.13) Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-22 11:45:03 +11:00
Joe Thornber	fca028438f	dm space map metadata: fix bug in resizing of thin metadata This bug was introduced in commit `7e664b3dec` ("dm space map metadata: fix extending the space map"). When extending a dm-thin metadata volume we: - Switch the space map into a simple bootstrap mode, which allocates all space linearly from the newly added space. - Add new bitmap entries for the new space - Increment the reference counts for those newly allocated bitmap entries - Commit changes to disk - Switch back out of bootstrap mode. But, the disk commit may allocate space itself, if so this fact will be lost when switching out of bootstrap mode. The bug exhibited itself as an error when the bitmap_root, with an erroneous ref count of 0, was subsequently decremented as part of a later disk commit. This would cause the disk commit to fail, and thinp to enter read_only mode. The metadata was not damaged (thin_check passed). The fix is to put the increments + commit into a loop, running until the commit has not allocated extra space. In practise this loop only runs twice. With this fix the following device mapper testsuite test passes: dmtest run --suite thin-provisioning -n thin_remove_works_after_resize Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # depends on commit `7e664b3dec`	2014-01-21 12:15:01 -05:00
Linus Torvalds	d3bad75a6d	Driver core / sysfs patches for 3.14-rc1 Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc. This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing.) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier.) All of this has been in linux-next with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7 UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3 =0pc0 -----END PGP SIGNATURE----- Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc) This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. 
The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier) All of this has been in linux-next with no reported issues" * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits) kernfs: associate a new kernfs_node with its parent on creation kernfs: add struct dentry declaration in kernfs.h kernfs: fix get_active failure handling in kernfs_seq_() Revert "kernfs: fix get_active failure handling in kernfs_seq_()" Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq" Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()" Revert "kernfs: remove KERNFS_REMOVED" Revert "kernfs: restructure removal path to fix possible premature return" Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()" Revert "kernfs: remove kernfs_addrm_cxt" Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed" Revert "kernfs: implement kernfs_{de\|re}activate[_self]()" Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers" Revert "pci: use device_remove_file_self() instead of device_schedule_callback()" Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()" Revert "s390: use device_remove_file_self() instead of device_schedule_callback()" Revert "sysfs, driver-core: remove unused {sysfs\|device}_schedule_callback_owner()" Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()" kernfs: remove unnecessary NULL check in __kernfs_remove() drivers/base: provide an infrastructure for componentised subsystems ...	2014-01-20 15:49:44 -08:00
Mike Snitzer	2e68c4e6ca	dm cache: add policy name to status output The cache's policy may have been established using the "default" alias, which is currently the "mq" policy but the default policy may change in the future. It is useful to know exactly which policy is being used. Add a 'real' member to the dm_cache_policy_type structure and have the "default" dm_cache_policy_type point to the real "mq" dm_cache_policy_type. Update dm_cache_policy_get_name() to check if real is set, if so report the name of the real policy (not the alias). Requested-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-16 13:44:11 -05:00
Mike Snitzer	74aa45c33c	dm thin: fix pool feature parsing Commit `787a996cb2` ("dm thin: add error_if_no_space feature") mistakenly forgot to increase the number of feature args supported. Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2014-01-15 21:16:24 -05:00
NeilBrown	9f97e4b128	md/raid5: fix long-standing problem with bitmap handling on write failure. Before a write starts we set a bit in the write-intent bitmap. When the write completes we clear that bit if the write was successful to all devices. However if the write wasn't fully successful we should not clear the bit. If the faulty drive is subsequently re-added, the fact that the bit is still set ensure that we will re-write the data that is missing. This logic is mediated by the STRIPE_DEGRADED flag - we only clear the bitmap bit when this flag is not set. Currently we correctly set the flag if a write starts when some devices are failed or missing. But we do not set the flag if some device failed during the write attempt. This is wrong and can result in clearing the bit inappropriately. So: set the flag when a write fails. This bug has been present since bitmaps were introduces, so the fix is suitable for any -stable kernel. Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org> Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-16 09:35:38 +11:00
Nicolas Schichan	cb335f88eb	md: check command validity early in md_ioctl(). Verify that the cmd parameter passed to md_ioctl() is valid before doing anything. This fixes mddev->hold_active being set to 0 when an invalid ioctl command is passed to md_ioctl() before the array has been configured. Clearing mddev->hold_active in that case can lead to a livelock situation when an invalid ioctl number is given to md_ioctl() by a process when the mddev is currently being opened by another process: Process 1 Process 2 --------- --------- md_alloc() mddev_find() -> returns a new mddev with hold_active == UNTIL_IOCTL add_disk() -> sends KOBJ_ADD uevent (sees KOBJ_ADD uevent for device) md_open() md_ioctl(INVALID_IOCTL) -> returns ENODEV and clears mddev->hold_active md_release() md_put() -> deletes the mddev as hold_active is 0 md_open() mddev_find() -> returns a newly allocated mddev with mddev->gendisk == NULL -> returns with ERESTARTSYS (kernel restarts the open syscall) Signed-off-by: Nicolas Schichan <nschichan@freebox.fr> Signed-off-by: NeilBrown <neilb@suse.de>	2014-01-16 08:55:00 +11:00
Linus Torvalds	1a60864fc1	md: half a dozen bug fixes for 3.13 All of these fix real bugs the people have hit, and are tagged for -stable. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIVAwUAUtYZqznsnt1WYoG5AQK50g//XuqVR/esIpGR+knf+1sD3Zk85Vp33kGL 2UfbQbi40q/uLjBhJhOSkx/sYGw1Eo255vNX+yMVjYT9F+xbhI8vlLfecqx5Fk5J M+JH1sM7E2T79boFLoOBGSl/qppsQsPHa3p87FmFHQrrAuEMIbFiP98MnQjdSiv4 Cu9cAR7x7njepHeMXBFiV7URaYtCHAXR9iMdkebkKIFlfND8w2QYD+LWo3SzBKs9 jTrSBJRpXLHE+bZLOQPhAryb7nWkcT1R7N0vsVMQKcq1o6ZiRNnk/B9xNtV34hkc 5zwTPe/d5AsV6Tsxg0dSs7xcBn/A+F5lg8fzdOhyE1F13COmB7sepjPTMPAy/oP1 zjyPwnnWkHMDUW2usf3aqPMt+LGMofRCJHXjkqpMgIWQ96SQUY8F9PPxchkUCsx/ A38I+vXl2jGDHh/DFSduef3sDOF6TYyKyLteJftyny96dc1RutrZSbHPdrkDz1YQ 6zcyvpv0FexiXITrLg70FG8fnRMK91ZfHrmuzVP7tpm2TyeIfDriLhTAIXAcXHOT l22a1bNj4shFfztnD0CbH6nY/iJM7ov0x5+IyG5/iYbipon02MenQeV9km6JVwQb OCGHYCTswiFSduX1E1ru52dHXifbANWgzcUH0sjGQ0YZNmxvPRBWDjB1H2J1auzW J8T10qimw1w= =uvyl -----END PGP SIGNATURE----- Merge tag 'md/3.13-fixes' of git://neil.brown.name/md Pull late md fixes from Neil Brown: "Half a dozen md bug fixes. All of these fix real bugs the people have hit, and are tagged for -stable. Sorry they are late .... Christmas holidays and all that. Hopefully they can still squeak into 3.13" * tag 'md/3.13-fixes' of git://neil.brown.name/md: md: fix problem when adding device to read-only array with bitmap. md/raid10: fix bug when raid10 recovery fails to recover a block. md/raid5: fix a recently broken BUG_ON(). md/raid1: fix request counting bug in new 'barrier' code. md/raid10: fix two bugs in handling of known-bad-blocks. md/raid5: Fix possible confusion when multiple write errors occur.	2014-01-15 15:07:36 +07:00

3274 Commits
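
Note on commit 9471744767 (bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED): the arithmetic described in that message can be reproduced in a few lines of C. This is a minimal sketch rather than the kernel code. The constant names, the helper functions, and the use of a mask to stand in for a 13-bit bitfield store are all assumptions made for illustration; only the field width (13 bits) and the two clamp values come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; the real definitions live in bcache.h. */
#define GC_SECTORS_USED_BITS 13u
#define GC_SECTORS_USED_MAX  ((1u << GC_SECTORS_USED_BITS) - 1u)

/* The largest value a 13-bit field can hold is 8191, not 16383. */
static_assert(GC_SECTORS_USED_MAX == 8191u, "13-bit field maximum");

/* Model a store into a 13-bit field: bits above bit 12 are discarded. */
static inline unsigned store_13bit(unsigned v)
{
    return v & GC_SECTORS_USED_MAX;
}

static inline unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/* Behaviour described as buggy: clamp to (1 << 14) - 1, which needs 14 bits. */
static unsigned add_sectors_buggy(unsigned used, unsigned key_size)
{
    return store_13bit(min_u(used + key_size, (1u << 14) - 1u));
}

/* Behaviour described as the fix: clamp to the maximum the field can hold. */
static unsigned add_sectors_fixed(unsigned used, unsigned key_size)
{
    return store_13bit(min_u(used + key_size, GC_SECTORS_USED_MAX));
}
```

With a running total of 8000 sectors and a key of size 192, the sum is 8192. The buggy clamp lets 8192 through, the 13-bit store truncates it to zero, and a check like BUG_ON(!GC_SECTORS_USED(g)) would then fire. The fixed clamp saturates at 8191 instead.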