Commit Graph

781454 Commits

Author SHA1 Message Date
Josef Bacik 5e65a20341 blk-wbt: wake up all when we scale up, not down
Tetsuo brought to my attention that I screwed up the scale_up/scale_down
helpers when I factored out the rq-qos code.  We need to wake up all the
waiters when we add slots for requests to make, not when we shrink the
slots.  Otherwise we'll end up things waiting forever.  This was a
mistake and simply puts everything back the way it was.

cc: stable@vger.kernel.org
Fixes: a79050434b ("blk-rq-qos: refactor out common elements of blk-wbt")
eported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-11 13:31:28 -06:00
Jens Axboe 133424a207 Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-linus
Pull NVMe fix from Christoph.

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvme: properly propagate errors in nvme_mpath_init
2018-09-28 09:41:40 -06:00
Juergen Gross 6c76786740 xen/blkfront: correct purging of persistent grants
Commit a46b53672b ("xen/blkfront: cleanup
stale persistent grants") introduced a regression as purged persistent
grants were not pu into the list of free grants again. Correct that.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 09:40:39 -06:00
Jens Axboe 15c2068876 Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
Fix didn't work for all cases, reverting to add a (hopefully)
better fix.

This reverts commit f151ba989d.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 09:40:17 -06:00
Ilya Dryomov 587562d0c7 blk-mq: I/O and timer unplugs are inverted in blktrace
trace_block_unplug() takes true for explicit unplugs and false for
implicit unplugs.  schedule() unplugs are implicit and should be
reported as timer unplugs.  While correct in the legacy code, this has
been inverted in blk-mq since 4.11.

Cc: stable@vger.kernel.org
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 13:12:44 -06:00
Guoju Fang 0f843e65d9 bcache: add separate workqueue for journal_write to avoid deadlock
After write SSD completed, bcache schedules journal_write work to
system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM
flag. system_wq is also a bound wq, and there may be no idle kworker on
current processor. Creating a new kworker may unfortunately need to
reclaim memory first, by shrinking cache and slab used by vfs, which
depends on bcache device. That's a deadlock.

This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM
flag. It's rescuer thread will work to avoid the deadlock.

Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 09:47:01 -06:00
Boris Ostrovsky f151ba989d xen/blkfront: When purging persistent grants, keep them in the buffer
Commit a46b53672b ("xen/blkfront: cleanup stale persistent grants")
added support for purging persistent grants when they are not in use. As
part of the purge, the grants were removed from the grant buffer, This
eventually causes the buffer to become empty, with BUG_ON triggered in
get_free_grant(). This can be observed even on an idle system, within
20-30 minutes.

We should keep the grants in the buffer when purging, and only free the
grant ref.

Fixes: a46b53672b ("xen/blkfront: cleanup stale persistent grants")
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-27 08:26:38 -06:00
Damien Le Moal 854f31ccdd block: fix deadline elevator drain for zoned block devices
When the deadline scheduler is used with a zoned block device, writes
to a zone will be dispatched one at a time. This causes the warning
message:

deadline: forced dispatching is broken (nr_sorted=X), please report this

to be displayed when switching to another elevator with the legacy I/O
path while write requests to a zone are being retained in the scheduler
queue.

Prevent this message from being displayed when executing
elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
loop until all writes are dispatched and completed, resulting in the
desired elevator queue drain without extensive modifications to the
deadline code itself to handle forced-dispatch calls.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: 8dc8146f9c ("deadline-iosched: Introduce zone locking support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 19:57:24 -06:00
Keith Busch 530ca2c9bd blk-mq: Allow blocking queue tag iter callbacks
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5bbbbe4d6 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-25 20:17:59 -06:00
Susobhan Dey bb830add19 nvme: properly propagate errors in nvme_mpath_init
Signed-off-by: Susobhan Dey <susobhan.dey@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-25 16:21:40 -07:00
Omar Sandoval b57e99b4b8 block: use nanosecond resolution for iostat
Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.

Fixes: 522a777566 ("block: consolidate struct request timestamp fields")
Cc: stable@vger.kernel.org
Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:26:59 -06:00
Jens Axboe d611aaf336 Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-linus
Pull NVMe fix from Christoph.

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvme: count all ANA groups for ANA Log page
2018-09-20 09:10:38 -06:00
Andy Whitcroft 65eea8edc3 floppy: Do not copy a kernel pointer to user memory in FDGETPRM ioctl
The final field of a floppy_struct is the field "name", which is a pointer
to a string in kernel memory.  The kernel pointer should not be copied to
user memory.  The FDGETPRM ioctl copies a floppy_struct to user memory,
including this "name" field.  This pointer cannot be used by the user
and it will leak a kernel address to user-space, which will reveal the
location of kernel code and data and undermine KASLR protection.

Model this code after the compat ioctl which copies the returned data
to a previously cleared temporary structure on the stack (excluding the
name pointer) and copy out to userspace from there.  As we already have
an inparam union with an appropriate member and that memory is already
cleared even for read only calls make use of that as a temporary store.

Based on an initial patch by Brian Belleville.

CVE-2018-7755
Signed-off-by: Andy Whitcroft <apw@canonical.com>

Broke up long line.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 09:09:48 -06:00
Jens Axboe 7ce5c8cd75 libata: mask swap internal and hardware tag
hen we're comparing the hardware completion mask passed in from the
driver with the internal tag pending mask, we need to account for the
fact that the internal tag is different from the hardware tag. If not,
then we can end up either prematurely completing the internal tag (since
it's not set in the hw mask), or simply flag an error:

ata2: illegal qc_active transition (100000000->00000001)

If the internal tag is set, then swap that with the hardware tag in this
case before comparing with what the hardware reports.

Fixes: 28361c4036 ("libata: add extra internal command")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=201151
Cc: stable@vger.kernel.org
Reported-by: Paul Sbarra <sbarra.paul@gmail.com>
Tested-by: Paul Sbarra <sbarra.paul@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 08:30:55 -06:00
Hannes Reinecke be1277f5eb nvme: count all ANA groups for ANA Log page
When issuing a short read on the ANA log page the number of groups
should not change, even though the final returned data might contain
less groups than that number.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[switched to a for loop]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-17 15:49:40 +02:00
Jens Axboe b228ba1cb9 null_blk: fix zoned support for non-rq based operation
The supported added for zones in null_blk seem to assume that only rq
based operation is possible. But this depends on the queue_mode setting,
if this is set to 0, then cmd->bio is what we need to be operating on.
Right now any attempt to load null_blk with queue_mode=0 will
insta-crash, since cmd->rq is NULL and null_handle_cmd() assumes it to
always be set.

Make the zoned code deal with bio's instead, or pass in the
appropriate sector/nr_sectors instead.

Fixes: ca4b2a0119 ("null_blk: add zone support")
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-12 18:21:11 -06:00
Jens Axboe 01c5f85aeb blk-cgroup: increase number of supported policies
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.

Bump to 5 policies, and also add a warning for when we have exceeded
the global amount. If we have to touch this again, we should switch
to a dynamic scheme instead.

Reported-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-11 10:59:53 -06:00
Jens Axboe bf93585ee1 Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-linus
Pull single NVMe fix from Christoph.

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet-rdma: fix possible bogus dereference under heavy load
2018-09-10 08:16:56 -06:00
Konstantin Khlebnikov d5274b3cd6 block: bfq: swap puts in bfqg_and_blkg_put
Fix trivial use-after-free. This could be last reference to bfqg.

Fixes: 8f9bebc33d ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 11:32:58 -06:00
Mikulas Patocka 8b2ded1c94 block: don't warn when doing fsync on read-only devices
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.

The patch b089cfd95d ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org	# 4.18
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-05 16:14:36 -06:00
Sagi Grimberg 8407879c4e nvmet-rdma: fix possible bogus dereference under heavy load
Currently we always repost the recv buffer before we send a response
capsule back to the host. Since ordering is not guaranteed for send
and recv completions, it is posible that we will receive a new request
from the host before we got a send completion for the response capsule.

Today, we pre-allocate 2x rsps the length of the queue, but in reality,
under heavy load there is nothing that is really preventing the gap to
expand until we exhaust all our rsps.

To fix this, if we don't have any pre-allocated rsps left, we dynamically
allocate a rsp and make sure to free it when we are done. If under memory
pressure we fail to allocate a rsp, we silently drop the command and
wait for the host to retry.

Reported-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
[hch: dropped a superflous assignment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-09-05 12:18:01 -07:00
Jens Axboe bc811f05d7 nbd: don't allow invalid blocksize settings
syzbot reports a divide-by-zero off the NBD_SET_BLKSIZE ioctl.
We need proper validation of the input here. Not just if it's
zero, but also if the value is a power-of-2 and in a valid
range. Add that.

Cc: stable@vger.kernel.org
Reported-by: syzbot <syzbot+25dbecbec1e62c6b0dd4@syzkaller.appspotmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-04 11:54:58 -06:00
Dennis Zhou (Facebook) 3111885015 blkcg: use tryget logic when associating a blkg with a bio
There is a very small change a bio gets caught up in a really
unfortunate race between a task migration, cgroup exiting, and itself
trying to associate with a blkg. This is due to css offlining being
performed after the css->refcnt is killed which triggers removal of
blkgs that reach their blkg->refcnt of 0.

To avoid this, association with a blkg should use tryget and fallback to
using the root_blkg.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:58 -06:00
Dennis Zhou (Facebook) 59b57717ff blkcg: delay blkg destruction until after writeback has finished
Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the cgwb_refcnt count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies. Here, the base
     cgwb_refcnt is put back.
  2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
     and handles destruction of blkgs. This is where the css reference
     held by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:56 -06:00
Dennis Zhou (Facebook) 6b06546206 Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
This reverts commit 4c6994806f.

Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkg's pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:54 -06:00
Jens Axboe 52bd456a66 Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph.

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: free workqueue object if module init fails
  nvme-fcloop: Fix dropped LS's to removed target port
  nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event
2018-08-29 11:05:20 -06:00
Scott Bauer 8f3fafc9c2 cdrom: Fix info leak/OOB read in cdrom_ioctl_drive_status
Like d88b6d04: "cdrom: information leak in cdrom_ioctl_media_changed()"

There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Scott Bauer <sbauer@plzdonthack.me>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-29 08:09:20 -06:00
Chaitanya Kulkarni 04db0e5ec5 nvmet: free workqueue object if module init fails
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28 08:40:44 +02:00
James Smart afd299ca99 nvme-fcloop: Fix dropped LS's to removed target port
When a targetport is removed from the config, fcloop will avoid calling
the LS done() routine thinking the targetport is gone. This leaves the
initiator reset/reconnect hanging as it waits for a status on the
Create_Association LS for the reconnect.

Change the filter in the LS callback path. If tport null (set when
failed validation before "sending to remote port"), be sure to call
done. This was the main bug. But, continue the logic that only calls
done if tport was set but there is no remoteport (e.g. case where
remoteport has been removed, thus host doesn't expect a completion).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28 08:40:43 +02:00
Michal Wnukowski f1ed3df20d nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event
In many architectures loads may be reordered with older stores to
different locations.  In the nvme driver the following two operations
could be reordered:

 - Write shadow doorbell (dbbuf_db) into memory.
 - Read EventIdx (dbbuf_ei) from memory.

This can result in a potential race condition between driver and VM host
processing requests (if given virtual NVMe controller has a support for
shadow doorbell).  If that occurs, then the NVMe controller may decide to
wait for MMIO doorbell from guest operating system, and guest driver may
decide not to issue MMIO doorbell on any of subsequent commands.

This issue is purely timing-dependent one, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "Oracle IO
Numbers" (orion) that is shipped with Oracle DB:

orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
	concat -write 40 -duration 120 -matrix row -testname nvme_test

Where nvme_test is a .lun file that contains a list of NVMe block
devices to run test against. Limiting number of vCPUs assigned to given
VM instance seems to increase chances for this bug to occur. On test
environment with VM that got 4 NVMe drives and 1 vCPU assigned the
virtual NVMe controller hang could be observed within 10-20 minutes.
That correspond to about 400-500k IO operations processed (or about
100GB of IO read/writes).

Orion tool was used as a validation and set to run in a loop for 36
hours (equivalent of pushing 550M IO operations). No issues were
observed. That suggest that the patch fixes the issue.

Fixes: f9f38e3338 ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <wnukowski@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-28 08:40:42 +02:00
John Pittman db193954ed block: bsg: move atomic_t ref_count variable to refcount API
Currently, variable ref_count within the bsg_device struct is of
type atomic_t.  For variables being used as reference counters,
the refcount API should be used instead of atomic.  The newer
refcount API works to prevent counter overflows and use-after-free
bugs.  So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:17:02 -06:00
Chengguang Xu 62d2a19407 block: remove unnecessary condition check
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:16:06 -06:00
Linus Walleij 46cb52ad41 ata: ftide010: Add a quirk for SQ201
The DMA is broken on this specific device for some unknown
reason (probably badly designed or plain broken interface
electronics) and will only work with PIO. Other users of
the same hardware does not have this problem.

Add a specific quirk so that this Gemini device gets
DMA turned off. Also fix up some code around passing the
port information around in probe while we're at it.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 14:25:54 -06:00
Jens Axboe b0a84beb2e blk-wbt: remove dead code
We already note and mark discard and swap IO from bio_to_wbt_flags().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 13:32:12 -06:00
Jens Axboe 057d3ccf93 Merge branch 'stable/for-jens-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus
Pull Xen block driver fixes from Konrad:

"Fix for flushing out persistent pages at a deterministic rate"

* 'stable/for-jens-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/blkback: remove unused pers_gnts_lock from struct xen_blkif_ring
  xen/blkback: move persistent grants flags to bool
  xen/blkfront: reorder tests in xlblk_init()
  xen/blkfront: cleanup stale persistent grants
  xen/blkback: don't keep persistent grants too long
2018-08-27 11:27:32 -06:00
Jens Axboe 38cfb5a45e blk-wbt: improve waking of tasks
We have two potential issues:

1) After commit 2887e41b91, we only wake one process at the time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:26 -06:00
Jens Axboe 061a542753 blk-wbt: abstract out end IO completion handler
Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:24 -06:00
Juergen Gross 6f2f39ad1a xen/blkback: remove unused pers_gnts_lock from struct xen_blkif_ring
pers_gnts_lock isn't being used anywhere. Remove it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-08-27 12:12:04 -04:00
Juergen Gross d77ff24e7f xen/blkback: move persistent grants flags to bool
The struct persistent_gnt flags member is meant to be a bitfield of
different flags. There is only PERSISTENT_GNT_ACTIVE flag left, so
convert it to a bool named "active".

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-08-27 12:12:04 -04:00
Juergen Gross 4bcddbae01 xen/blkfront: reorder tests in xlblk_init()
In case we don't want pv block devices we should not test parameters
for sanity and eventually print out error messages. So test precluding
conditions before checking parameters.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-08-27 12:12:03 -04:00
Juergen Gross a46b53672b xen/blkfront: cleanup stale persistent grants
Add a periodic cleanup function to remove old persistent grants which
are no longer in use on the backend side. This avoids starvation in
case there are lots of persistent grants for a device which no longer
is involved in I/O business.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-08-27 12:12:03 -04:00
Juergen Gross 973e5405f2 xen/blkback: don't keep persistent grants too long
Persistent grants are allocated until a threshold per ring is being
reached. Those grants won't be freed until the ring is being destroyed
meaning there will be resources kept busy which might no longer be
used.

Instead of freeing only persistent grants until the threshold is
reached add a timestamp and remove all persistent grants not having
been in use for a minute.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-08-27 12:12:03 -04:00
Linus Torvalds b8dcdab36f for-linus-20180825
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAluBbOMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqIBD/0e2/kyEE5UbIPeFK2bmrjDnbmo7Iq7Scb3
 MQwYjYL1lQK2y0yu810Ews5acWiVxJtb0D5Hko/eBEiOE81qB9j+DpTB804ewYFR
 IQwB9pKYhUVuwULHQDyEQTlUUHZba/cgqeHtF+20AmLhdv8NqpL7NfQ4d0XScHti
 AAsxKcLJ5gWAafmLQB2ZVbjlmLib5rSTLdEjVSLAdKIsED9+g12/WwIzZsnK0s9Q
 ozGasr3jPegPBPwpR0pepBBlwInwvdsCFlmzIkL1B8Nn0EWdfLY1qsgS47xJynOf
 Wz/u9ROrHaKClfJuQ08ccqnl3JNUoHBKi06CGpZP+w1S8BmZpH6WAyMEkiLz/l0r
 56US10YAC6hyq4Ectfa5aR3Ch/YnR+bVOsv6GVg4xqoXetr8sEHBM4i+iB05tsfL
 fAoHE7RPB8UwMdbJnnh+OTKsykKZG+IARurWjKBrM0j+zSNcEiRebs+5JI7tgCee
 MDRH+XOUzREbwcI5pggnz9QPavwf6j07clMfbbT6/C6CNjZJ4rAPwTvYEDnQVpos
 bXh2ifonOKVp+pq6LHW2vxBCmE6+Xh3Do+q5GBQ5png4BIM9A+hJsJT/ACqzjAip
 KYN0wdFPJu5eI2w8/X3fxgq2mrN7jUe/9x9HI+SnQ7cIkoQ7ZM7YcBrLtl6ignTN
 wH2uJIhQMQ==
 =KQSS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180825' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few small fixes for this merge window:

   - Locking imbalance fix for bcache (Shan Hai)

   - A few small fixes for wbt. One is a cleanup/prep, one is a fix for
     an existing issue, and the last two are fixes for changes that went
     into this merge window (me)"

* tag 'for-linus-20180825' of git://git.kernel.dk/linux-block:
  blk-wbt: don't maintain inflight counts if disabled
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: move disable check into get_limit()
  bcache: release dc->writeback_lock properly in bch_writeback_thread()
2018-08-25 13:30:23 -07:00
Linus Torvalds db84abf5f8 This pull request contains a single fix for UBIFS:
- Remove an empty file from UBIFS source
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAluBE84WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wc73EACHsV8cp3Atkg18yZYQq3t8uOZG
 lZApUWTDr2SpWW88/OJ+RAFKGflKQqSiS1y1HYd2s8is4yYMA/XBto5pWvGqg03P
 GsFGzjMqRztZtNwg7n7JUD0Sq9WLoB1BimuMAG1b8vjTH2uuCxJZVyVi7ZF5W5Oc
 065ebVblFhziI7jddRh/Gpwa2dvx3SlKZa0VImiJd6Mw9AnjO4grU+rOofswXTjX
 AKVjeRvNp0DuxJkxaqO9mH7mug87Owi4co4DBLCJu7pPEq2kvFRHQQtn6QKCdtRk
 gyYBD/ZGkhNYz3wPoQxlHM0bGVAToKtjcT1/T61lrlXKjGJu8S8IHC/raEyCB8yb
 Cz3XxMCH7LXyS3eaov4sZAiogENsoP7i20E2AKwlGil65gx4DVR814MTAi5m0AwN
 Kpv7GHRG9oMEar378Rrv8xptRJhj6nwCxAYuBwmPFlK7d95qMQ9hLjsvYp3TLuzn
 7ihUzT8m/GZG1qrlkkk24Bjm834dCVRKPBJJ4beh3YG8/mj/h6OlQGs0NPRIF6O1
 JknbQqXOz6UZ1UsvgtDK7p8GuHztKA4K+3Qv4BWwIRFisX+gTWKB8IveGPpouUaa
 K8NMOkbw8RsatxzM4DNZBPl2YNjvHuuTGudNqMn3pytsi4Y8y5YKdkpFVgZH1V25
 QmQZbhicHSlfwWR0FQ==
 =QoL7
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs

Pull UBIFS fix from Richard Weinberger:
 "Remove an empty file from UBIFS source"

* tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs:
  ubifs: Remove empty file.h
2018-08-25 13:27:35 -07:00
Linus Torvalds 04faac10fc three small SMB3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
 T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
 0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
 AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
 dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
 brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
 sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
 wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
 NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
 7fcNZtct
 =jMLB
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small SMB3 fixes, one for stable"

* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko to 2.12
  cifs: check kmalloc before use
  cifs: check if SMB2 PDU size has been padded and suppress the warning
  cifs: create a define for how many iovs we need for an SMB2_open()
2018-08-25 13:17:53 -07:00
Linus Torvalds 1b2de5d039 mm/cow: don't bother write protecting already write-protected pages
This is not normally noticeable, but repeated forks are unnecessarily
expensive because they repeatedly dirty the parent page tables during
the page table copy operation.

It's trivial to just avoid write protecting the page table entry if it
was already not writable.

This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.

That bug was fixed by Eric Biederman's signal handling series
culminating in commit c3ad2c3b02 ("signal: Don't restart fork when
signals come in"), but the unnecessary work for repeated forks is still
work just fixing, particularly since the fix is trivial.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 13:15:03 -07:00
Colin Ian King e0fcfe1f1a hpfs: remove unnecessary checks on the value of r when assigning error code
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2.  Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.

Detected by CoverityScan, CID#1226737 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 12:42:33 -07:00
Jens Axboe 7634ccd2da libata: maintainership update
Tejun Heo wrote:
>
> I asked Jens whether he could take care of the libata tree and he
> thankfully agreed, so, from now on, Jens will be the libata
> maintainer.
>
> Thanks a lot!

Thanks for your work in this area. I still remember the first linux
storage summit we did in Vancouver 2001, Tejun was invited to talk about
his libata error handling work. Before that, it was basically a crap
shoot if we recovered properly or not... A lot of water has flown under
the bridge since then!

Here's an "official" patch. Linus, can you apply it?

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 12:35:45 -07:00
Linus Torvalds 0519359784 Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
 "Nothing too interesting. Mostly ahci and ahci_platform changes, many
  around power management"

* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (22 commits)
  ata: ahci_platform: enable to get and control reset
  ata: libahci_platform: add reset control support
  ata: add an extra argument to ahci_platform_get_resources()
  ata: sata_rcar: Add r8a77965 support
  ata: sata_rcar: exclude setting of PHY registers in Gen3
  ata: sata_rcar: really mask all interrupts on Gen2 and later
  Revert "ata: ahci_platform: allow disabling of hotplug to save power"
  ata: libahci: Allow reconfigure of DEVSLP register
  ata: libahci: Correct setting of DEVSLP register
  ata: ahci: Enable DEVSLP by default on x86 with SLP_S0
  ata: ahci: Support state with min power but Partial low power state
  Revert "ata: ahci_platform: convert kcalloc to devm_kcalloc"
  ata: sata_rcar: Add rudimentary Runtime PM support
  ata: sata_rcar: Provide a short-hand for &pdev->dev
  ata: Only output sg element mapped number in verbose debug
  ata: Guard ata_scsi_dump_cdb() by ATA_VERBOSE_DEBUG
  ata: ahci_platform: convert kcalloc to devm_kcalloc
  ata: ahci_platform: convert kzallloc to kcalloc
  ata: ahci_platform: correct parameter documentation for ahci_platform_shutdown
  libata: remove ata_sff_data_xfer_noirq()
  ...
2018-08-24 13:20:33 -07:00
Linus Torvalds 596766102a Merge branch 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Just one commit from Steven to take out spin lock from trace event
  handlers"

* 'for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/tracing: Move taking of spin lock out of trace event handlers
2018-08-24 13:19:27 -07:00