Now we recalculate the sequence of timeout with 'req->sequence =
ctx->cached_sq_head + count - 1', judge the right place to insert
for timeout_list by compare the number of request we still expected for
completion. But we have not consider about the situation of overflow:
1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for
the new timeout req can have a small req->sequence.
2. cached_sq_head of now may overflow compare with before req. And it
will lead the timeout req with small req->sequence.
This overflow will lead to the misorder of timeout_list, which can lead
to the wrong order of the completion of timeout_list. Fix it by reuse
req->submit.sequence to store the count, and change the logic of
inserting sort in io_timeout.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A BIO based request queue does not have a tag_set, which prevent testing
for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not
require an elevator. This leads to an incorrect initialization of a
default elevator in some cases such as BIO based null_blk
(queue_mode == BIO) with zoned mode enabled as the default elevator in
this case is mq-deadline instead of "none".
Fix this by testing for a NULL queue mq_ops field which indicates that
the queue is BIO based and should not have an elevator.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have two ways a request can be deferred:
1) It's a regular request that depends on another one
2) It's a timeout that tracks completions
We have a shared helper to determine whether to defer, and that
attempts to make the right decision based on the request. But we
only have some of this information in the caller. Un-share the
two timeout/defer helpers so the caller can use the right one.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
1. nbd_put takes the mutex and drops nbd->ref to 0. It then does
idr_remove and drops the mutex.
2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails
to find an existing device, so it does nbd_dev_add.
3. just before the nbd_put could call nbd_dev_remove or not finished
totally, but if nbd_dev_add try to add_disk, we can hit:
debugfs: Directory 'nbd1' with parent 'block' already present!
This patch will make sure all the disk add/remove stuff are done
by holding the nbd_index_mutex lock.
Reported-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The return code from null_handle_zoned() sets the cmd->error value.
Returning OK status when an error occured overwrites the intended
cmd->error. Return the appropriate error code instead of setting the
error in the cmd.
Fixes: fceb5d1b19 ("null_blk: create a helper for zoned devices")
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not remove the workqueue, we just need to ensure that the
workqueues are synced. The workqueues are torn down on ctx removal.
Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Any changes interesting to tasks waiting in io_cqring_wait() are
commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs()
also tries to do that but with no reason, that means spurious wakeups
every io_free_req() and io_uring_enter().
Just use percpu_ref_put() instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
scale_up wakes up waiters after scaling up. But after scaling max, it
should not wake up more waiters as waiters will not have anything to
do. This patch fixes this by making scale_up (and also scale_down)
return when threshold is reached.
This bug causes increased fdatasync latency when fdatasync and dd
conv=sync are performed in parallel on 4.19 compared to 4.14. This
bug was introduced during refactoring of blk-wbt code.
Fixes: a79050434b ("blk-rq-qos: refactor out common elements of blk-wbt")
Cc: stable@vger.kernel.org
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 85fbd722ad.
The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A removable block device, such as NVMe or SSD connected over Thunderbolt
can be hot-removed any time including when the system is suspended. When
device is hot-removed during suspend and the system gets resumed, kernel
first resumes devices and then thaws the userspace including freezable
workqueues. What happens in that case is that the NVMe driver notices
that the device is unplugged and removes it from the system. This ends
up calling bdi_unregister() for the gendisk which then schedules
wb_workfn() to be run one more time.
However, since the bdi_wq is still frozen flush_delayed_work() call in
wb_shutdown() blocks forever halting system resume process. User sees
this as hang as nothing is happening anymore.
Triggering sysrq-w reveals this:
Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
Call Trace:
? __schedule+0x2c5/0x630
? wait_for_completion+0xa4/0x120
schedule+0x3e/0xc0
schedule_timeout+0x1c9/0x320
? resched_curr+0x1f/0xd0
? wait_for_completion+0xa4/0x120
wait_for_completion+0xc3/0x120
? wake_up_q+0x60/0x60
__flush_work+0x131/0x1e0
? flush_workqueue_prep_pwqs+0x130/0x130
bdi_unregister+0xb9/0x130
del_gendisk+0x2d2/0x2e0
nvme_ns_remove+0xed/0x110 [nvme_core]
nvme_remove_namespaces+0x96/0xd0 [nvme_core]
nvme_remove+0x5b/0x160 [nvme]
pci_device_remove+0x36/0x90
device_release_driver_internal+0xdf/0x1c0
nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
process_one_work+0x1c2/0x3f0
worker_thread+0x48/0x3e0
kthread+0x100/0x140
? current_work+0x30/0x30
? kthread_park+0x80/0x80
ret_from_fork+0x35/0x40
This is not limited to NVMes so exactly same issue can be reproduced by
hot-removing SSD (over Thunderbolt) while the system is suspended.
Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.
Reported-by: AceLan Kao <acelan.kao@canonical.com>
Link: https://marc.info/?l=linux-kernel&m=138695698516487
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sparse warns about incorrect type when using __be64 data.
It is not being converted to CPU-endian but it should be.
Fixes these sparse warnings:
../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:375:20: expected unsigned long long [usertype] align
../block/sed-opal.c:375:20: got restricted __be64 const [usertype] alignment_granularity
../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:376:25: expected unsigned long long [usertype] lowest_lba
../block/sed-opal.c:376:25: got restricted __be64 const [usertype] lowest_aligned_lba
Fixes: 455a7b238c ("block: Add Sed-opal library")
Cc: Scott Bauer <scott.bauer@intel.com>
Cc: Rafael Antognolli <rafael.antognolli@intel.com>
Cc: linux-block@vger.kernel.org
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 7e64db1597.
The thin provisioning feature introduces an IOCTL and the discard support
to allow userspace tools and filesystems to release unused and previously
allocated space respectively.
During some internal performance improvements and further tests, the
release of allocated space revealed some issues that may lead to data
corruption in some configurations when filesystems are mounted with
discard support enabled.
While we're working on a fix and trying to clarify the situation,
this commit reverts the discard support for ESE volumes to prevent
potential data corruption.
Cc: <stable@vger.kernel.org> # 5.3
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is possible that the CCW commands for reading volume and extent pool
information are not supported, either by the storage server (for
dedicated DASDs) or by z/VM (for virtual devices, such as MDISKs).
As a command reject will occur in such a case, the current error
handling leads to a failing online processing and thus the DASD can't be
used at all.
Since the data being read is not essential for an fully operational
DASD, the error handling can be removed. Information about the failing
command is sent to the s390dbf debug feature.
Fixes: c729696bcf ("s390/dasd: Recognise data for ESE volumes")
Cc: <stable@vger.kernel.org> # 5.3
Reported-by: Frank Heimes <frank.heimes@canonical.com>
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.
Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.
Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loop driver assumes that if the passed in fd is opened with
O_DIRECT, the caller wants to use direct I/O on the loop device.
However, if the underlying block device has a different block size than
the loop block queue, direct I/O can't be enabled. Instead of requiring
userspace to manually change the blocksize and re-enable direct I/O,
just change the queue block sizes to match, as well as the io_min size.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull NVMe changes from Sagi:
"This set consists of various fixes and cleanups:
- controller removal race fix from Balbir
- quirk additions from Gabriel and Jian-Hong
- nvme-pci power state save fix from Mario
- Add 64bit user commands (for 64bit registers) from Marta
- nvme-rdma/nvme-tcp fixes from Max, Mark and Me
- Minor cleanups and nits from James, Dan and John"
* 'nvme-5.4' of git://git.infradead.org/nvme:
nvme-rdma: fix possible use-after-free in connect timeout
nvme: Move ctrl sqsize to generic space
nvme: Add ctrl attributes for queue_count and sqsize
nvme: allow 64-bit results in passthru commands
nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
nvmet-tcp: remove superflous check on request sgl
Added QUIRKs for ADATA XPG SX8200 Pro 512GB
nvme-rdma: Fix max_hw_sectors calculation
nvme: fix an error code in nvme_init_subsystem()
nvme-pci: Save PCI state before putting drive into deepest state
nvme-tcp: fix wrong stop condition in io_work
nvme-pci: Fix a race in controller removal
nvmet: change ppl to lpp
Some HDD drive may expose multiple hardware queues, such as MegraRaid.
Let's apply the normal plugging for such devices because sequential IO
may benefit a lot from plug merging.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a device is using multiple queues, the IO scheduler may be bypassed.
This may hurt performance for some slow MQ devices, and it also breaks
zoned devices which depend on mq-deadline for respecting the write order
in one zone.
Don't bypass io scheduler if we have one setup.
This patch can double sequential write performance basically on MQ
scsi_debug when mq-deadline is applied.
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the connect times out, we may have already destroyed the
queue in the timeout handler, so test if the queue is still
allocated in the connect error handler.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by the race between timeout handle and completion for
flush request.
When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.
However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.
We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let spinlock to protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have updated limits after calling wbt_set_min_lat(). No need to
update again.
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The default hard disk param sets latency targets at 50ms. As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.
This is excessively low. A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.
Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some IOs may span multiple periods. As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate. nr_lagging tracks these IOs to
avoid those situations. Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.
This has the following two problems.
* When latency target percentiles aren't set, vrate adjustment should
only be governed by queue depth depletion; however, the current code
keeps nr_lagging active which pulls in latency results and can keep
down vrate unexpectedly.
* When lagging condition is detected, it resets the entire negative
busy_level. This turned out to be way too aggressive on some
devices which sometimes experience extended latencies on a small
subset of commands. In addition, a lagging IO will be accounted as
latency target miss on completion anyway and resetting busy_level
amplifies its impact unnecessarily.
This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero. busy_level turning to zero can sometimes
be as interesting an event. This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
cecf5d87ff ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.
Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.
The biggest issue is that commit cecf5d87ff assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.
Fixes the issue by not releasing queue's sysfs lock during switching
elevator.
Fixes: cecf5d87ff ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit c48dac137a ("block: don't hold q->sysfs_lock in elevator_init_mq")
removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with
lockdep_assert_held() called in blk_mq_sched_free_requests() which is
run in failure path of elevator_init_mq().
blk_mq_sched_free_requests() is called in the following 3 functions:
elevator_init_mq()
elevator_exit()
blk_cleanup_queue()
In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly
by 'mutex_lock(&q->sysfs_lock)'.
So moving the lockdep_assert_held() from blk_mq_sched_free_requests()
into elevator_exit() for fixing the report by syzbot.
Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com
Fixed: c48dac137a ("block: don't hold q->sysfs_lock in elevator_init_mq")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current controller interrogation requires a lot of guesswork
on how many io queues were created and what the io sq size is.
The numbers are dependent upon core/fabric defaults, connect
arguments, and target responses.
Add sysfs attributes for queue_count and sqsize.
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
It is not possible to get 64-bit results from the passthru commands,
what prevents from getting for the Capabilities (CAP) property value.
As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).
This issue has been already discussed [2], but without a solution.
This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.
[1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Kingston NVME SSD with firmware version E8FK11.T has no interrupt after
resume with actions related to suspend to idle. This patch applied
NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue.
Fixes: d916b1be94 ("nvme-pci: use host managed power state for suspend")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Booting with default_ps_max_latency_us >6000 makes the device fail.
Also SUBNQN is NULL and gives a warning on each boot/resume.
$ nvme id-ctrl /dev/nvme0 | grep ^subnqn
subnqn : (null)
I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop.
To be sure is not a Laptop issue only, I tested the device on
my server board with the same results.
( with 2x,4x link on the board and 4x link on a PCI-E card ).
Signed-off-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
By default, the NVMe/RDMA driver should support max io_size of 1MiB (or
upto the maximum supported size by the HCA). Currently, one will see that
/sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024.
A non power of 2 value can cause performance degradation due to
unnecessary splitting of IO requests and unoptimized allocation units.
The number of pages per MR has been fixed here, so there is no longer any
need to reduce max_sectors by 1.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
"ret" should be a negative error code here, but it's either success or
possibly uninitialized.
Fixes: 32fd90c407 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
The action of saving the PCI state will cause numerous PCI configuration
space reads which depending upon the vendor implementation may cause
the drive to exit the deepest NVMe state.
In these cases ASPM will typically resolve the PCIe link state and APST
may resolve the NVMe power state. However it has also been observed
that this register access after quiesced will cause PC10 failure
on some device combinations.
To resolve this, move the PCI state saving to before SetFeatures has been
called. This has been proven to resolve the issue across a 5000 sample
test on previously failing disk/system combinations.
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Allow the do/while statement to continue if current time
is not after the proposed time 'deadline'. Intent is to
allow loop to proceed for a specific time period. Currently
the loop, as coded, will exit after first pass.
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
- automatic recovery of a blacklisted filesystem session (Zheng Yan).
This is disabled by default and can be enabled by mounting with the
new "recover_session=clean" option.
- serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
taken to avoid serializing O_DIRECT reads and writes with each other,
this is based on the exclusion scheme from NFS.
- handle large osdmaps better in the face of fragmented memory (myself)
- don't limit what security.* xattrs can be get or set (Jeff Layton).
We were overly restrictive here, unnecessarily preventing things like
file capability sets stored in security.capability from working.
- allow copy_file_range() within the same inode and across different
filesystems within the same cluster (Luis Henriques)
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl2LoD8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzixRYB/9H5S4fif8Pn9eiXkIJiT9UZR/o7S1k
ikfQNPeDxlBLKnoZXpDp2HqCu1/YuCcJ0zpZzPGGrKECZb7r+NaayxhmEXAZ+Vsg
YwsO3eNHBbb58pe9T4oiHp19sflwcTOeNwg8wlvmvrgfBupFz2pU8Xm72EdFyoYm
tP0QNTOCAuQK3pJcgozaptAO1TzBL3LomyVM0YzAKcumgMg47zALpaSLWJLGtDLM
5+5WLvcVfBGLVv60h4B62ldS39eBxqTsFodcRMUaqAsnhLK70HVfKlwR3GgtZggr
PDqbsuIfw/O3b65U2XDKZt1P9dyG3OE/ucueduXUxJPYNGmooEE+PpE+
=DRVP
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- automatic recovery of a blacklisted filesystem session (Zheng Yan).
This is disabled by default and can be enabled by mounting with the
new "recover_session=clean" option.
- serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
taken to avoid serializing O_DIRECT reads and writes with each
other, this is based on the exclusion scheme from NFS.
- handle large osdmaps better in the face of fragmented memory
(myself)
- don't limit what security.* xattrs can be get or set (Jeff Layton).
We were overly restrictive here, unnecessarily preventing things
like file capability sets stored in security.capability from
working.
- allow copy_file_range() within the same inode and across different
filesystems within the same cluster (Luis Henriques)"
* tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
ceph: call ceph_mdsc_destroy from destroy_fs_client
libceph: use ceph_kvmalloc() for osdmap arrays
libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
ceph: allow object copies across different filesystems in the same cluster
ceph: include ceph_debug.h in cache.c
ceph: move static keyword to the front of declarations
rbd: pull rbd_img_request_create() dout out into the callers
ceph: reconnect connection if session hang in opening state
libceph: drop unused con parameter of calc_target()
ceph: use release_pages() directly
rbd: fix response length parameter for encoded strings
ceph: allow arbitrary security.* xattrs
ceph: only set CEPH_I_SEC_INITED if we got a MAC label
ceph: turn ceph_security_invalidate_secctx into static inline
ceph: add buffered/direct exclusionary locking for reads and writes
libceph: handle OSD op ceph_pagelist_append() errors
ceph: don't return a value from void function
ceph: don't freeze during write page faults
ceph: update the mtime when truncating up
ceph: fix indentation in __get_snap_name()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXYod6wAKCRDh3BK/laaZ
PG3fAP9WXuvUeYh3X7ThQPa2D33VCIMJRd6t+1TVSSc/H8P3dAD/ehN5HIWjnmzz
iZFc3zDtO9UCJUe23IZomblxOQbu6Qk=
=I0S2
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Continue separating the transport (user/kernel communication) and the
filesystem layers of fuse. Getting rid of most layering violations
will allow for easier cleanup and optimization later on.
- Prepare for the addition of the virtio-fs filesystem. The actual
filesystem will be introduced by a separate pull request.
- Convert to new mount API.
- Various fixes, optimizations and cleanups.
* tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits)
fuse: Make fuse_args_to_req static
fuse: fix memleak in cuse_channel_open
fuse: fix beyond-end-of-page access in fuse_parse_cache()
fuse: unexport fuse_put_request
fuse: kmemcg account fs data
fuse: on 64-bit store time in d_fsdata directly
fuse: fix missing unlock_page in fuse_writepage()
fuse: reserve byteswapped init opcodes
fuse: allow skipping control interface and forced unmount
fuse: dissociate DESTROY from fuseblk
fuse: delete dentry if timeout is zero
fuse: separate fuse device allocation and installation in fuse_conn
fuse: add fuse_iqueue_ops callbacks
fuse: extract fuse_fill_super_common()
fuse: export fuse_dequeue_forget() function
fuse: export fuse_get_unique()
fuse: export fuse_send_init_request()
fuse: export fuse_len_args()
fuse: export fuse_end_request()
fuse: fix request limit
...
-----BEGIN PGP SIGNATURE-----
iJYEABYIAD4WIQRE6pSOnaBC00OEHEIaerohdGur0gUCXYqr6iAcamFya2tvLnNh
a2tpbmVuQGxpbnV4LmludGVsLmNvbQAKCRAaerohdGur0pspAP9pmSqIde7M2BJD
kRmkiannjbl55Ss5+RX+oqOLwtezewEAzgK/vQkm/AWZPqix274qdOMTOX1jjwUZ
JnvsxZrEuQc=
=l6Vw
-----END PGP SIGNATURE-----
Merge tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen.
* tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd:
tpm: Wrap the buffer from the caller to tpm_buf in tpm_send()
MAINTAINERS: keys: Update path to trusted.h
KEYS: trusted: correctly initialize digests and fix locking issue
selftests/tpm2: Add log and *.pyc to .gitignore
selftests/tpm2: Add the missing TEST_FILES assignment
- Report both io errors and short io results to the directio endio
handler.
- Allow directio callers to pass an ops structure to iomap_dio_rw.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2EQBwACgkQ+H93GTRK
tOvZ0w//R00Dya3pZmO+HQX/Kovo/AOKCct/gPaLvRf4DVvWZ3ZH5lPtDUSKCHBV
Pw+potTuabwtp//mZrQ84ZMoQYOmxBmxchquJ2L49qWT/mZ1BDqBA2sxQOCehffO
Dlv715eOPVhkVHOEAHv+BsatxlFDkuKgGcwUqxE4LNHWGrvmjm6rBWPCnxQMTwX9
BhGo/0WG6D0leWFSMoUIpcD4Oj302/O43RX4lJsyl0Jd5pKJD/RFtaKqwIeE+pj9
36MnoQYtT5e9BTm1/zzHRVpGgLDDWTY+IClgZNlZprU/Em+TtOGMWitWzob3wWcU
QHj5bJ9NiWBT6QJ192i0+I0thSTwW/peOAJrkBOW3YhBkH3UIKSzaPiwCBoSEUKS
fuIOJkRHFW7HTjKSw6wUCJJ6ShD+rNPOoaJ9Ivzz7x2HGEqb1al3EIumu6oDJgyq
BdnLEecN/JSxPnakpSGAnvlCjZiYbJ95E0JfapzMy7HYpMXfbzZZ4JO9Ld0hQflR
sMrGwL82tqE9ish1yRr+2nNEY5PqVJqV0XXU3dZz3HVknNuAvqrqXDXgIZT142zS
s9vGAW+2XphyUKHH9pImQKw142n2xiZK+Pd2AuiO4I06E0EkAvN/n9CN7oJ4Cr4F
z9rQKNXNx8OSjFtryYe6+IzQ0qQtl2OP7o88K91jYArKP17D8ds=
=zkFp
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"After last week's failed pull request attempt, I scuttled everything
in the branch except for the directio endio api changes, which were
trivial. Everything else will simply have to wait for the next cycle.
Summary:
- Report both io errors and short io results to the directio endio
handler.
- Allow directio callers to pass an ops structure to iomap_dio_rw"
* tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: move the iomap_dio_rw ->end_io callback into a structure
iomap: split size and error for iomap_dio_rw ->end_io
Pull i2c updates from Wolfram Sang:
- new driver for ICY, an Amiga Zorro card :)
- axxia driver gained slave mode support, NXP driver gained ACPI
- the slave EEPROM backend gained 16 bit address support
- and lots of regular driver updates and reworks
* 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits)
i2c: tegra: Move suspend handling to NOIRQ phase
i2c: imx: ACPI support for NXP i2c controller
i2c: uniphier(-f): remove all dev_dbg()
i2c: uniphier(-f): use devm_platform_ioremap_resource()
i2c: slave-eeprom: Add comment about address handling
i2c: exynos5: Remove IRQF_ONESHOT
i2c: stm32f7: Make structure stm32f7_i2c_algo constant
i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe
i2c-eeprom_slave: Add support for more eeprom models
i2c: fsi: Add of_put_node() before break
i2c: synquacer: Make synquacer_i2c_ops constant
i2c: hix5hd2: Remove IRQF_ONESHOT
i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond
watchdog: iTCO: Add support for Cannon Lake PCH iTCO
i2c: iproc: Make bcm_iproc_i2c_quirks constant
i2c: iproc: Add full name of devicetree node to adapter name
i2c: piix4: Add ACPI support
i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h
i2c: ocores: use request_any_context_irq() to register IRQ handler
i2c: designware: Fix optional reset error handling
...
A few small remaining wrap-up for this merge window.
Most of patches are device-specific (HD-audio and USB-audio quirks,
FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is
a simple fix (actually two commits) for ASoC core.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAl2KBooOHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE9m6xAArBi8tcosiyP34QANb+KMfm6ZkBlVBCsu1YGU
qr4L0a0PyZYWEmRvc1AETJsIe2DZDNJ5V+GTVxJ7fyQeV+Z/6XyK76ufSYYLWxSX
vjsQVPnWReJqDYQEXeanC9S3VfZuH1q2JoqLdF2F0s5RhTrGJyz0YKm4dt7O9x84
1SIp0itxmMGylNNXeemPFKXyem/LbblpRzFRwexjCFwwN/0QGzYsooBV22wm8i5L
qVyGzaDoorNfS2d3Oyljl55N6uGvXt9KJslV/JSg3iaLUe31rHB2o0bLCqXZOeUm
WofKsolrMv1CK0Bk+/HgpxT20TDlZ/73QRJVgO7/PydIZxxiqHPFxQV54gUano+a
tTvnPBwbLyFwx5Xt1RrycL9KdVv48ZF58jZIKZpW3zKDY4pUcOstdKU5UBkiopDD
ZFITveFPbvviUXXEsOl7EmAbcSaMjI84GDPD3KBFrKCe7J7/vFTPJcJhgoVBGrwb
XVXVzR3M0tvjdqfYf6OU+EEzKvrg+xYITk3SjqD666DQ0u6DMmj0z/KJkH8ZUy/w
ByPhoz9O0zh1WbLOO+753zXFS5xS2VLPOJStEmuhWDEtNoNout+T/nWiuOAH36w7
ziBnqQJVl+TY3m24UaCaswWKEceEFlM7+bLS/WlQtE/ZVZDwdwqbEmFsFNYsh1Ft
xAqPp8M=
=mOJY
-----END PGP SIGNATURE-----
Merge tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A few small remaining wrap-up for this merge window.
Most of patches are device-specific (HD-audio and USB-audio quirks,
FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a
simple fix (actually two commits) for ASoC core"
* tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add DSD support for EVGA NU Audio
ALSA: hda - Add laptop imic fixup for ASUS M9V laptop
ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies
ASoC: pcm3168a: The codec does not support S32_LE
ASoC: core: use list_del_init and move it back to soc_cleanup_component
ALSA: hda/realtek - PCI quirk for Medion E4254
ALSA: hda - Apply AMD controller workaround for Raven platform
ASoC: rsnd: do error check after rsnd_channel_normalization()
ASoC: atmel_ssc_dai: Remove wrong spinlock usage
ASoC: core: delete component->card_list in soc_remove_component only
ASoC: fsl_sai: Fix noise when using EDMA
ALSA: usb-audio: Add Hiby device family to quirks for native DSD support
ALSA: hda/realtek - Fix alienware headset mic
ALSA: dice: fix wrong packet parameter for Alesis iO26
tpm_send() does not give anymore the result back to the caller. This
would require another memcpy(), which kind of tells that the whole
approach is somewhat broken. Instead, as Mimi suggested, this commit
just wraps the data to the tpm_buf, and thus the result will not go to
the garbage.
Obviously this assumes from the caller that it passes large enough
buffer, which makes the whole API somewhat broken because it could be
different size than @buflen but since trusted keys is the only module
using this API right now I think that this fix is sufficient for the
moment.
In the near future the plan is to replace the parameters with a tpm_buf
created by the caller.
Reported-by: Mimi Zohar <zohar@linux.ibm.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 412eb58558 ("use tpm_buf in tpm_transmit_cmd() as the IO parameter")
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Update MAINTAINERS record to reflect that trusted.h
was moved to a different directory in commit 22447981fc
("KEYS: Move trusted.h to include/keys [ver #2]").
Cc: Denis Kenzior <denkenz@gmail.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: linux-integrity@vger.kernel.org
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Commit 0b6cf6b97b ("tpm: pass an array of tpm_extend_digest structures to
tpm_pcr_extend()") modifies tpm_pcr_extend() to accept a digest for each
PCR bank. After modification, tpm_pcr_extend() expects that digests are
passed in the same order as the algorithms set in chip->allocated_banks.
This patch fixes two issues introduced in the last iterations of the patch
set: missing initialization of the TPM algorithm ID in the tpm_digest
structures passed to tpm_pcr_extend() by the trusted key module, and
unreleased locks in the TPM driver due to returning from tpm_pcr_extend()
without calling tpm_put_ops().
Cc: stable@vger.kernel.org
Fixes: 0b6cf6b97b ("tpm: pass an array of tpm_extend_digest structures to tpm_pcr_extend()")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Suggested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
The Python files required by the selftests are not packaged because of
the missing assignment to TEST_FILES. Add the assignment.
Cc: stable@vger.kernel.org
Fixes: 6ea3dfe1e0 ("selftests: add TPM 2.0 tests")
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>