When creating dm-cache with the default policy, it will call
request_module("dm-cache-default") to register the default policy.
But the "dm-cache-default" alias was left referring to the MQ policy.
Fix this by moving the module alias to SMQ.
Fixes: bccab6a0 (dm cache: switch the "default" cache replacement policy from mq to smq)
Signed-off-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When using nested btrees, the top leaves of the top levels contain
block addresses for the root of the next tree down. If we shadow a
shared leaf node the leaf values (sub tree roots) should be incremented
accordingly.
This is only an issue if there is metadata sharing in the top levels.
Which only occurs if metadata snapshots are being used (as is possible
with dm-thinp). And could result in a block from the thinp metadata
snap being reused early, thus corrupting the thinp metadata snap.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The device details and mapping trees were just being decremented
before. Now btree_del() is called to do a deep delete.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
- Fix for a 4.2 DM thinp discard regression due to inability to properly
delete a range of blocks in a data mapping btree.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVxNokAAoJEMUj8QotnQNaclUIALSoAHN1B47juyNgEjkDQI8x
LodGbto3gLIigsYTrVMH13ohwVaekErsYbe65L8wQT5vyNcXbHTs6YRyCfVAdEhb
4/hmGkVaRtitnHwxrzUv+ycCb1Gj/rtSGdt264AiSMAL2cD/ucL1aLYhkiEdGp9k
hv5qKxjFyzSK0NVAuKn5B8tNtNGVMQv4faTiWKs5JVA4a8RERD2B4+nDROJY3SwM
zgHUPMsPAUb4uphwa0q+qhhrGZWAspS43fCjVBbjXi9lSeUV0TpeYoG4tV7A8T+o
jP9xq4kCLbZ7PzYjMWp6s30+H741Wi1AkBx0NkS7Aoz9kt+SBMlTMFXojPP5Pks=
=AkR0
-----END PGP SIGNATURE-----
Merge tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- stable fix for a dm_merge_bvec() regression on 32 bit Fedora systems.
- fix for a 4.2 DM thinp discard regression due to inability to
properly delete a range of blocks in a data mapping btree.
* tag 'dm-4.2-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm btree remove: fix bug in remove_one()
dm: fix dm_merge_bvec regression on 32 bit systems
remove_one() was not incrementing the key for the beginning of the
range, so not all entries were being removed. This resulted in
discards that were not unmapping all blocks.
Fixes: 4ec331c3ea ("dm btree: add dm_btree_remove_leaves()")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A DM regression on 32 bit systems was reported against v4.2-rc3 here:
https://lkml.org/lkml/2015/7/29/401
Fix this by reverting both commit 1c220c69 ("dm: fix casting bug in
dm_merge_bvec()") and 148e51ba ("dm: improve documentation and code
clarity in dm_merge_bvec"). This combined revert is done to eliminate
the possibility of a partial revert in stable@ kernels.
In hindsight the correct fix, at the time 1c220c69 was applied to fix
the regression that 148e51ba introduced, should've been to simply revert
148e51ba.
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Adam Williamson <awilliam@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.
This should not be allowed.
So add a test to keep max_nr_stripes above min_nr_stripes.
Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.
Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b66)
Reported-by: Tomas Papan <tomas.papan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".
5769 file = kmalloc(sizeof(*file), GFP_NOIO);
5770 if (!file)
5771 return -ENOMEM;
This structure is copied to user space at the end of the function.
5786 if (err == 0 &&
5787 copy_to_user(arg, file, sizeof(*file)))
5788 err = -EFAULT
But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.
5775 /* bitmap disabled, zero the first byte and copy out */
5776 if (!mddev->bitmap_info.file)
5777 file->pathname[0] = '\0';
Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
raid1_end_read_request() assumes that the In_sync bits are consistent
with the ->degaded count.
raid1_spare_active updates the In_sync bit before the ->degraded count
and so exposes an inconsistency, as does error()
So extend the spinlock in raid1_spare_active() and error() to hide those
inconsistencies.
This should probably be part of
Commit: 34cab6f420 ("md/raid1: fix test for 'was read error from
last working device'.")
as it addresses the same issue. It fixes the same bug and should go
to -stable for same reasons.
Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
Commit 665022d72f ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:
[<ffffffff81651457>] schedule+0x37/0x80
[<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
[<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
[<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
[<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
[<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
[<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
[<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
[<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
[<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
[<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
[<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
[<ffffffff811fd689>] SyS_ioctl+0x79/0x90
[<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
[<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71
Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after. This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 386cb7cdee.
Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.
Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.
Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cryptsetup moved to gitlab. This is a leftover from commit e44f23b32d
(dm crypt: update URLs to new cryptsetup project page, 2015-04-05).
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
static analysis by cppcheck has found a check on alloc_bitset that
always evaluates as false and hence never finds an allocation failure:
[drivers/md/dm-cache-policy-smq.c:1689]: (warning) Logical conjunction
always evaluates to false: !EXPR && EXPR.
Fix this by removing the incorrect mq->cache_hit_bits check
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Several are tagged for -stable.
A few aren't because they are not very, serious or
because they are in the 'experimental' cluster code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY
DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U
9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9
3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih
6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E
xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF
Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX
LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw
CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa
Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ
0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2
YTz6qKpYPgB1VzK2PoBG
=6ZCK
-----END PGP SIGNATURE-----
Merge tag 'md/4.2-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Some md fixes for 4.2
Several are tagged for -stable.
A few aren't because they are not very, serious or because they are in
the 'experimental' cluster code"
* tag 'md/4.2-fixes' of git://neil.brown.name/md:
md/raid5: clear R5_NeedReplace when no longer needed.
Fix read-balancing during node failure
md-cluster: fix bitmap sub-offset in bitmap_read_sb
md: Return error if request_module fails and returns positive value
md: Skip cluster setup in case of error while reading bitmap
md/raid1: fix test for 'was read error from last working device'.
md: Skip cluster setup for dm-raid
md: flush ->event_work before stopping array.
md/raid10: always set reshape_safe when initializing reshape_position.
md/raid5: avoid races when changing cache size.
This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.
So clear it when it isn't need, which includes if the replacement
device has failed.
Signed-off-by: NeilBrown <neilb@suse.com>
During a node failure, We need to suspend read balancing so that the
reads are directed to the first device and stale data is not read.
Suspending writes is not required because these would be recorded and
synced eventually.
A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
area_resyncing() will respond true for the entire devices if this
flag is set and the request type is READ. The flag is cleared
in recover_done().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reported-By: David Teigland <teigland@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for
the first bitmap read. However, when multiple bitmaps need to be opened
by the same node, it ends up corrupting the offset. Fix it by using a
local variable.
Also, bitmap_read_sb is not required in bitmap_copy_from_slot since
it is called in bitmap_create. Remove bitmap_read_sb().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
request_module() can return 256 (process exited) in some cases,
which is not as specified in the documentation before the
request_module() definition. Convert the error to -ENOENT.
The positive error number results in bitmap_create() returning
a value that is meant to be an error but doesn't look like one,
so it is dereferenced as a point and causes a crash.
(not needed for stable as this is "experimental" code)
Fixes: edb39c9ded ("Introduce md_cluster_operations to handle cluster functions")
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If the bitmap read fails, the error code set is -EINVAL. However,
we don't check for errors and go ahead with cluster_setup.
Skip the cluster setup in case of error.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
When we get a read error from the last working device, we don't
try to repair it, and don't fail the device. We simple report a
read error to the caller.
However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device. However a spare which is rebuilding
would be non-faulty but so not the only working device.
So change the test from "!Faulty" to "In_sync". If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.
This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0
Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
There is a bug that the bitmap superblock isn't initialised properly for
dm-raid, so a new field can have garbage in new fields.
(dm-raid does initialisation in the kernel - md initialised the
superblock in mdadm).
This means that for dm-raid we cannot currently trust the new ->nodes
field. So:
- use __GFP_ZERO to initialise the superblock properly for all new
arrays
- initialise all fields in bitmap_info in bitmap_new_disk_sb
- ignore ->nodes for dm arrays (yes, this is a hack)
This bug exposes dm-raid to bug in the (still experimental) md-cluster
code, so it is suitable for -stable. It does cause crashes.
References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
Cc: stable@vger.kernel.org (v4.1)
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
The 'event_work' worker used by dm-raid may still be running
when the array is stopped. This can result in an oops.
So flush the workqueue on which it is run after detaching
and before destroying the device.
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
Fixes: 9d09e663d5 ("dm: raid456 basic support")
'reshape_position' tracks where in the reshape we have reached.
'reshape_safe' tracks where in the reshape we have safely recorded
in the metadata.
These are compared to determine when to update the metadata.
So it is important that reshape_safe is initialised properly.
Currently it isn't. When starting a reshape from the beginning
it usually has the correct value by luck. But when reducing the
number of devices in a RAID10, it has the wrong value and this leads
to the metadata not being updated correctly.
This can lead to corruption if the reshape is not allowed to complete.
This patch is suitable for any -stable kernel which supports RAID10
reshape, which is 3.5 and later.
Fixes: 3ea7daa5d7 ("md/raid10: add reshape support")
Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
Signed-off-by: NeilBrown <neilb@suse.com>
Cache size can grow or shrink due to various pressures at
any time. So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.
So introduce a mutex. auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.
This bug can result in some stripes not being freed when an
array is stopped. This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.
Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NeilBrown <neilb@suse.com>
increase and adversely impact both throughput and system load
- Fix for a use after free bug in DM core's device cleanup
- A couple DM btree removal fixes (used by dm-thinp)
- A DM thinp fix for order-5 allocation failure
- A DM thinp fix to not degrade to read-only metadata mode when in
out-of-data-space mode for longer than the 'no_space_timeout'
- Fix a long-standing oversight in both dm-thinp and dm-cache by
now exporting 'needs_check' in status if it was set in metadata
- Fix an embarrassing dm-cache busy-loop that caused worker threads to
eat cpu even if no IO was actively being issued to the cache device
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVqUfSAAoJEMUj8QotnQNa+RUH+wXrHCGI6J7RHIXVd5igP9K0
yFZGEnLZe6Ebt5CACLcKn/qN0g97iwCrlcxFt+1Gj/GbW1GIQzs7vg38La3PZxWZ
jAkI3JMY816bP1x3VK1HtMsk2gRaE/hh0gxK5pPLB9a+ZdEsz9UML0rs+JseOdn3
n+454dhwOyChwz7zFEbpn+mfjoruFScGX0Y2qaSHBV/xMhmExpthw9V1yFC2v2tW
8cAHOMDLNLHhR5adF9YxjZH8wILbyYK9oPy3iGhj/TF/Dx7saWYG4UlnL5xIOLsB
5WK9gRrJJ/Wf0FsDdN88AaY4Bdpj4esS2JeTZpvujxeBb7ZNeJoCUqyzggURv/c=
=hCjo
-----END PGP SIGNATURE-----
Merge tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- revert a request-based DM core change that caused IO latency to
increase and adversely impact both throughput and system load
- fix for a use after free bug in DM core's device cleanup
- a couple DM btree removal fixes (used by dm-thinp)
- a DM thinp fix for order-5 allocation failure
- a DM thinp fix to not degrade to read-only metadata mode when in
out-of-data-space mode for longer than the 'no_space_timeout'
- fix a long-standing oversight in both dm-thinp and dm-cache by now
exporting 'needs_check' in status if it was set in metadata
- fix an embarrassing dm-cache busy-loop that caused worker threads to
eat cpu even if no IO was actively being issued to the cache device
* tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache: avoid calls to prealloc_free_structs() if possible
dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
dm cache: do not wake_worker() in free_migration()
dm cache: display 'needs_check' in status if it is set
dm thin: display 'needs_check' in status if it is set
dm thin: stay in out-of-data-space mode once no_space_timeout expires
dm: fix use after free crash due to incorrect cleanup sequence
Revert "dm: only run the queue on completion if congested or no requests pending"
dm btree: silence lockdep lock inversion in dm_btree_del()
dm thin: allocate the cell_sort_array dynamically
dm btree remove: fix bug in redistribute3
If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is currently no way to see that the needs_check flag has been set
in the metadata. Display 'needs_check' in the cache status if it is set
in the cache metadata.
Also, update cache documentation.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There is currently no way to see that the needs_check flag has been set
in the metadata. Display 'needs_check' in the thin-pool status if it is
set in the thinp metadata.
Also, update thinp documentation.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This fixes an issue where running out of data space would cause the
thin-pool's metadata to become read-only. There was no reason to make
metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a
misguided way to error all queued and future write IOs. We can
accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to
PM_OUT_OF_DATA_SPACE with error_if_no_space enabled.
Otherwise, the use of PM_READ_ONLY could cause a race where commit() was
started before the PM_READ_ONLY transition but dm_pool_commit_metadata()
would go on to fail because the block manager had transitioned to
read-only. The return of -EPERM from dm_pool_commit_metadata(), due to
attempting to commit while in read-only mode, caused the thin-pool to
set 'needs_check' because a metadata_operation_failed(). This needless
cascade of failures makes life for users more difficult than needed.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Linux 4.2-rc1 Commit 0f20972f7b ("dm: factor out a common
cleanup_mapped_device()") moved a common cleanup code to a separate
function. Unfortunately, that commit incorrectly changed the order of
cleanup, so that it destroys the mapped_device's srcu structure
'io_barrier' before destroying its workqueue.
The function that is executed on the workqueue (dm_wq_work) uses the srcu
structure, thus it may use it after being freed. That results in a
crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 0f20972f7b ("dm: factor out a common cleanup_mapped_device()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
This reverts commit 9a0e609e3f.
(Resolved a conflict during revert due to commit bfebd1cdb4 that came
after)
This revert is motivated by a couple failure reports on request-based DM
multipath testbeds:
1) Netapp reported that their multipath fault injection test under heavy
IO load can stall longer than 300 seconds.
2) IBM reported elevated lock contention in their testbed (likely due to
increased back pressure due to IO not being dispatched as quickly):
https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
Allocate memory using GFP_NOIO when deleting a btree. dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Given the pool's cell_sort_array holds 8192 pointers it triggers an
order 5 allocation via kmalloc. This order 5 allocation is prone to
failure as system memory gets more fragmented over time.
Fix this by allocating the cell_sort_array using vmalloc.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
redistribute3() shares entries out across 3 nodes. Some entries were
being moved the wrong way, breaking the ordering. This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.
For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html
Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.
Fix the S: status entry while there.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kvfree() instead of open-coding it.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A mixed bag
- a few bug fixes
- some performance improvement that decrease lock contention
- some clean-up
Nothing major.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0
3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y
LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb
T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR
DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8
enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb
kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5
A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx
/phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4
iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI
39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy
mjSiGDF2bxkT1/YcjELD
=sXTI
-----END PGP SIGNATURE-----
Merge tag 'md/4.2' of git://neil.brown.name/md
Pull md updates from Neil Brown:
"A mixed bag
- a few bug fixes
- some performance improvement that decrease lock contention
- some clean-up
Nothing major"
* tag 'md/4.2' of git://neil.brown.name/md:
md: clear Blocked flag on failed devices when array is read-only.
md: unlock mddev_lock on an error path.
md: clear mddev->private when it has been freed.
md: fix a build warning
md/raid5: ignore released_stripes check
md/raid5: per hash value and exclusive wait_for_stripe
md/raid5: split wait_for_stripe and introduce wait_for_quiescent
wait: introduce wait_event_exclusive_cmd
md: convert to kstrto*()
md/raid10: make sync_request_write() call bio_copy_data()
ability to handle partial request completions -- otherwise with the
current SCSI LLDs these changes could lead to silent data corruption.
- Fix two DM version bumps that were missing from the initial 4.2 DM
pull request (enabled userspace lvm2 to know certain changes have been
made).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr
PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn
+WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH
DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc
siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf
rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA=
=QY48
-----END PGP SIGNATURE-----
Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Apologies for not pressing this request-based DM partial completion
issue further, it was an oversight on my part. We'll have to get it
fixed up properly and revisit for a future release.
- Revert block and DM core changes the removed request-based DM's
ability to handle partial request completions -- otherwise with the
current SCSI LLDs these changes could lead to silent data
corruption.
- Fix two DM version bumps that were missing from the initial 4.2 DM
pull request (enabled userspace lvm2 to know certain changes have
been made)"
* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache policy smq: fix "default" version to be 1.4.0
dm: bump the ioctl version to 4.32.0
Revert "block, dm: don't copy bios for request clones"
Revert "dm: do not allocate any mempools for blk-mq request-based DM"
Merge second patchbomb from Andrew Morton:
- most of the rest of MM
- lots of misc things
- procfs updates
- printk feature work
- updates to get_maintainer, MAINTAINERS, checkpatch
- lib/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
exit,stats: /* obey this comment */
coredump: add __printf attribute to cn_*printf functions
coredump: use from_kuid/kgid when formatting corename
fs/reiserfs: remove unneeded cast
NILFS2: support NFSv2 export
fs/befs/btree.c: remove unneeded initializations
fs/minix: remove unneeded cast
init/do_mounts.c: add create_dev() failure log
kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
fs/efs: femove unneeded cast
checkpatch: emit "NOTE: <types>" message only once after multiple files
checkpatch: emit an error when there's a diff in a changelog
checkpatch: validate MODULE_LICENSE content
checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
checkpatch: fix processing of MEMSET issues
checkpatch: suggest using ether_addr_equal*()
checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
checkpatch: remove local from codespell path
checkpatch: add --showfile to allow input via pipe to show filenames
...
Commit bccab6a0 ("dm cache: switch the "default" cache replacement
policy from mq to smq") should've incremented the "default" policy's
version number to 1.4.0 rather than reverting to version 1.0.0.
Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This reverts commit 5f1b670d0b.
Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html
this change should not be pushed to mainline yet.
Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html
And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There's no point in starting over when we meet a '/'. This also
eliminates a stack variable and a little .text.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- blk-mq request-based DM no longer uses any mempools now that partial
completions are no longer handled as part of cloned requests
- DM raid cleanups and support for MD raid0
- DM cache core advances and a new stochastic-multi-queue (smq) cache
replacement policy
- smq is the new default dm-cache policy
- DM thinp cleanups and much more efficient large discard support
- DM statistics support for request-based DM and nanosecond resolution
timestamps
- Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0
4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH
mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga
HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U
KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv
e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww=
=DKCC
-----END PGP SIGNATURE-----
Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core cleanups:
* blk-mq request-based DM no longer uses any mempools now that
partial completions are no longer handled as part of cloned
requests
- DM raid cleanups and support for MD raid0
- DM cache core advances and a new stochastic-multi-queue (smq) cache
replacement policy
* smq is the new default dm-cache policy
- DM thinp cleanups and much more efficient large discard support
- DM statistics support for request-based DM and nanosecond resolution
timestamps
- Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
dm stats: add support for request-based DM devices
dm stats: collect and report histogram of IO latencies
dm stats: support precise timestamps
dm stats: fix divide by zero if 'number_of_areas' arg is zero
dm cache: switch the "default" cache replacement policy from mq to smq
dm space map metadata: fix occasional leak of a metadata block on resize
dm thin metadata: fix a race when entering fail mode
dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
dm thin: range discard support
dm thin metadata: add dm_thin_remove_range()
dm thin metadata: add dm_thin_find_mapped_range()
dm btree: add dm_btree_remove_leaves()
dm stats: Use kvfree() in dm_kvfree()
dm cache: age and write back cache entries even without active IO
dm cache: prefix all DMERR and DMINFO messages with cache device name
dm cache: add fail io mode and needs_check flag
dm cache: wake the worker thread every time we free a migration object
dm cache: add stochastic-multi-queue (smq) policy
dm cache: boost promotion of blocks that will be overwritten
dm cache: defer whole cells
...
Pull cgroup writeback support from Jens Axboe:
"This is the big pull request for adding cgroup writeback support.
This code has been in development for a long time, and it has been
simmering in for-next for a good chunk of this cycle too. This is one
of those problems that has been talked about for at least half a
decade, finally there's a solution and code to go with it.
Also see last weeks writeup on LWN:
http://lwn.net/Articles/648292/"
* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
writeback, blkio: add documentation for cgroup writeback support
vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
writeback: do foreign inode detection iff cgroup writeback is enabled
v9fs: fix error handling in v9fs_session_init()
bdi: fix wrong error return value in cgwb_create()
buffer: remove unusued 'ret' variable
writeback: disassociate inodes from dying bdi_writebacks
writeback: implement foreign cgroup inode bdi_writeback switching
writeback: add lockdep annotation to inode_to_wb()
writeback: use unlocked_inode_to_wb transaction in inode_congested()
writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
writeback: implement [locked_]inode_to_wb_and_lock_list()
writeback: implement foreign cgroup inode detection
writeback: make writeback_control track the inode being written back
writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
writeback: implement memcg writeback domain based throttling
writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
writeback: implement memcg wb_domain
writeback: update wb_over_bg_thresh() to use wb_domain aware operations
...