OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Sagi Grimberg	0475a8dcbc	nvme-rdma: fix timeout handler When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:57 -07:00
Sagi Grimberg	5110f40241	nvme-rdma: serialize controller teardown sequences In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:57 -07:00
Sagi Grimberg	e5c01f4f7f	nvme-tcp: fix reset hang if controller died in the middle of a reset If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that times out. So give a timeout and if we cannot wait for queue freeze before unfreezing, fail and have the error handling take care how to proceed (either schedule a reconnect of remove the controller). Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:57 -07:00
Sagi Grimberg	236187c4ed	nvme-tcp: fix timeout handler When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:57 -07:00
Sagi Grimberg	d4d61470ae	nvme-tcp: serialize controller teardown sequences In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:56 -07:00
Sagi Grimberg	7cf0d7c0f3	nvme: have nvme_wait_freeze_timeout return if it timed out Users can detect if the wait has completed or not and take appropriate actions based on this information (e.g. weather to continue initialization or rather fail and schedule another initialization attempt). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:56 -07:00
Sagi Grimberg	d7144f5c4c	nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance NVME_CTRL_NEW should never see any I/O, because in order to start initialization it has to transition to NVME_CTRL_CONNECTING and from there it will never return to this state. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:56 -07:00
Ziye Yang	a6ce7d7b4a	nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu When handling commands without in-capsule data, we assign the ttag assuming we already have the queue commands array allocated (based on the queue size information in the connect data payload). However if the connect itself did not send the connect data in-capsule we have yet to allocate the queue commands,and we will assign a bogus ttag and suffer a NULL dereference when we receive the corresponding h2cdata pdu. Fix this by checking if we already allocated commands before dereferencing it when handling h2cdata, if we didn't, its for sure a connect and we should use the preallocated connect command. Signed-off-by: Ziye Yang <ziye.yang@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2020-08-28 16:43:56 -07:00
Linus Torvalds	c41c3ec4a2	io_uring-5.9-2020-08-23 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9CwtMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsehEAC4ReB53LLbZxqcmoA2RNs9yz1I4DM2PU6z C+NSGGEnAFHQAhLbfCAzxbtQa6x/m64zoLd+8zHZNAeanJXarszcgSuqhXQFlEfX 7Jz/vdXGdu7Q4zgkLuO3FxleDoPoUC5qOSFHWYtMu6KvHLOkmc9DvdSUsFMDSThX 6RsoaQY2gDOD/pwtm8Cqmy89nLZdFoyxadXyk/lzxLodjeRZOwoVc+YM8YWmrXZ0 mKEEuO4uBWxUUmoyAwUABNqWWAkwTDEhrYCiiG81DkAa1Cu0mRXodN0xycr72cLZ Ik2OlnTLCE6B0UXsBu2c0+qXGArWsvDyhEEkwF+O+Ump4IBIr72EmgZb+o2nnkXo Uu4X/r0qeQ6XD+vBTHcE6oPUjJhV6uEXXon5aesE+vh277ILmHgMyjJKaSiJcY/E efM5SuPRq2kuROKWLKiLJnpuJ/9ZTU/4nk4k1pOlWWOVGLHien0sWBBzQ+iWr6mm eRl5EkI3JoahqIrNFz0+qF3DwKPVfu+B02/EzA8OXoYHIRV9KMS5eWX5hK12aZ3i 4AT3xuAanfcNs4qBAScOfHQxQu9U5Z7Mu4JQJ58xdsJd+UWBnbznUmSLob9KKk+c X8AvAcYhb684F87VCmaCzDlIPMb46OYxLBgI6sz7L0xdc7i8TCeeEDbQCN1HixZ3 SNtKzalNXA== =fAwK -----END PGP SIGNATURE----- Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Sagi: - nvme completion rework from Christoph and Chao that mostly came from a bit of divergence of how we classify errors related to pathing/retry etc. - nvmet passthru fixes from Chaitanya - minor nvmet fixes from Amit and I - mpath round-robin path selection fix from Martin - ignore noiob for zoned devices from Keith - minor nvme-fc fix from Tianjia" - BFQ cgroup leak fix (Dmitry) - block layer MAINTAINERS addition (Geert) - fix null_blk FUA checking (Hou) - get_max_io_size() size fix (Keith) - fix block page_is_mergeable() for compound pages (Matthew) - discard granularity fixes (Ming) - IO scheduler ordering fix (Ming) - misc fixes * tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits) null_blk: fix passing of REQ_FUA flag in null_handle_rq nvmet: Disable keep-alive timer when kato is cleared to 0h nvme: redirect commands on dying queue nvme: just check the status code type in nvme_is_path_error nvme: refactor command completion nvme: rename and document nvme_end_request nvme: skip noiob for zoned devices nvme-pci: fix PRP pool size nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth nvme: Use spin_lock_irq() when taking the ctrl->lock nvmet: call blk_mq_free_request() directly nvmet: fix oops in pt cmd execution nvmet: add ns tear down label for pt-cmd handling nvme: multipath: round-robin: eliminate "fallback" variable nvme: multipath: round-robin: fix single non-optimized path case nvme-fc: Fix wrong return value in __nvme_fc_init_request() nvmet-passthru: Reject commands with non-sgl flags set nvmet: fix a memory leak blkcg: fix memleak for iolatency MAINTAINERS: Add missing header files to BLOCK LAYER section ...	2020-08-24 11:53:15 -07:00
Gustavo A. R. Silva	df561f6688	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-08-23 17:36:59 -05:00
Amit Engel	0d3b6a8d21	nvmet: Disable keep-alive timer when kato is cleared to 0h Based on nvme spec, when keep alive timeout is set to zero the keep-alive timer should be disabled. Signed-off-by: Amit Engel <amit.engel@dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:28 -06:00
Chao Leng	5eac5f3342	nvme: redirect commands on dying queue If a command send through nvme-multipath failed on a dying queue, resend it on another path. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: rebased on top of the completion refactoring] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:28 -06:00
Christoph Hellwig	1e41f3bd26	nvme: just check the status code type in nvme_is_path_error Check the SCT sub-field for a path related status instead of enumerating invididual status code. As of NVMe 1.4 this adds "Internal Path Error" and "Controller Pathing Error" to the list, but it also future proofs for additional status codes added to the category. Suggested-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:28 -06:00
Christoph Hellwig	5ddaabe8ed	nvme: refactor command completion Lift all the code to decide the dispostition of a completed command from nvme_complete_rq and nvme_failover_req into a new helper, which returns an emum of the potential actions. nvme_complete_rq then just switches on those and calls the proper helper for the action. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:28 -06:00
Christoph Hellwig	2eb81a3364	nvme: rename and document nvme_end_request nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. It's semantics also are non-trivial, so give it a more descriptive name and add a comment explaining the semantics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:28 -06:00
Keith Busch	c41ad98beb	nvme: skip noiob for zoned devices Zoned block devices reuse the chunk_sectors queue limit to define zone boundaries. If a such a device happens to also report an optimal boundary, do not use that to define the chunk_sectors as that may intermittently interfere with io splitting and zone size queries. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Christoph Hellwig	c61b82c7b7	nvme-pci: fix PRP pool size All operations are based on the controller, not the host page size. Switch the dma pool to use the controller page size as well to avoid massive overallocations on large page size systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
John Garry	7442ddcedc	nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth Recently nvme_dev.q_depth was changed from an int to u16 type. This falls over for the queue depth calculation in nvme_pci_enable(), where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the result: root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000 [0000000000000010] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT SMP Modules linked in: nvme nvme_core CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted 5.8.0-next-20200812 #27 Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 Workqueue: nvme-reset-wq nvme_reset_work [nvme] pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--) pc : __sg_alloc_table_from_pages+0xec/0x238 lr : __sg_alloc_table_from_pages+0xc8/0x238 sp : ffff800013ccbad0 x29: ffff800013ccbad0 x28: ffff0a27b3d380a8 x27: 0000000000000000 x26: 0000000000002dc2 x25: 0000000000000dc0 x24: 0000000000000000 x23: 0000000000000000 x22: ffff800013ccbbe8 x21: 0000000000000010 x20: 0000000000000000 x19: 00000000fffff000 x18: ffffffffffffffff x17: 00000000000000c0 x16: fffffe289eaf6380 x15: ffff800011b59948 x14: ffff002bc8fe98f8 x13: ff00000000000000 x12: ffff8000114ca000 x11: 0000000000000000 x10: ffffffffffffffff x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0a27b5f9b680 x4 : 0000000000000000 x3 : ffff0a27b5f9b680 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000 Call trace: __sg_alloc_table_from_pages+0xec/0x238 sg_alloc_table_from_pages+0x18/0x28 iommu_dma_alloc+0x474/0x678 dma_alloc_attrs+0xd8/0xf0 nvme_alloc_queue+0x114/0x160 [nvme] nvme_reset_work+0xb34/0x14b4 [nvme] process_one_work+0x1e8/0x360 worker_thread+0x44/0x478 kthread+0x150/0x158 ret_from_fork+0x10/0x34 Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5) ---[ end trace 89bb2b72d59bf925 ]--- Fix by making onto a u32. Also use u32 for nvme_dev.q_depth, as we assign this value from nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this avoids the same crash as above. Fixes: `61f3b89630` ("nvme-pci: use unsigned for io queue depth") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Logan Gunthorpe	ecbcdf0c81	nvme: Use spin_lock_irq() when taking the ctrl->lock When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a dead lock. The new spin_lock() calls recently added produce the following lockdep warning when running the blktest nvme/003: ================================ WARNING: inconsistent lock state -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes: ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0 {SOFTIRQ-ON-W} state was registered at: lock_acquire+0x164/0x500 _raw_spin_lock+0x28/0x40 nvme_get_effects_log+0x37/0x1c0 nvme_init_identify+0x9e4/0x14f0 nvme_reset_work+0xadd/0x2360 process_one_work+0x66b/0xb70 worker_thread+0x6e/0x6c0 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 irq event stamp: 1449221 hardirqs last enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140 hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60 softirqs last enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595 softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ctrl->lock); <Interrupt> lock(&ctrl->lock); * DEADLOCK * no locks held by ksoftirqd/2/22. stack backtrace: CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc8/0x11a print_usage_bug.cold.63+0x235/0x23e mark_lock+0xa9c/0xcf0 __lock_acquire+0xd9a/0x2b50 lock_acquire+0x164/0x500 _raw_spin_lock_irqsave+0x40/0x60 nvme_keep_alive_end_io+0x50/0xc0 blk_mq_end_request+0x158/0x210 nvme_complete_rq+0x146/0x500 nvme_loop_complete_rq+0x26/0x30 [nvme_loop] blk_done_softirq+0x187/0x1e0 __do_softirq+0x118/0x595 run_ksoftirqd+0x35/0x50 smpboot_thread_fn+0x1d3/0x310 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 Fixes: `be93e87e78` ("nvme: support for multiple Command Sets Supported and Effects log pages") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Chaitanya Kulkarni	7ee51cf60a	nvmet: call blk_mq_free_request() directly Instead of calling blk_put_request() which calls blk_mq_free_request(), call blk_mq_free_request() directly for NVMeOF passthru. This is to mainly avoid an extra function call in the completion path nvmet_passthru_req_done(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Chaitanya Kulkarni	a2138fd494	nvmet: fix oops in pt cmd execution In the existing NVMeOF Passthru core command handling on failure of nvme_alloc_request() it errors out with rq value set to NULL. In the error handling path it calls blk_put_request() without checking if rq is set to NULL or not which produces following Oops:- [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 1457.347838] #PF: supervisor read access in kernel mode [ 1457.348464] #PF: error_code(0x0000) - not-present page [ 1457.349085] PGD 0 P4D 0 [ 1457.349402] Oops: 0000 [#1] SMP NOPTI [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35 [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214 [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop] [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.363973] Call Trace: [ 1457.364296] nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet] [ 1457.364990] process_one_work+0x24e/0x5a0 [ 1457.365493] ? __schedule+0x353/0x840 [ 1457.365957] worker_thread+0x3c/0x380 [ 1457.366426] ? process_one_work+0x5a0/0x5a0 [ 1457.366948] kthread+0x135/0x150 [ 1457.367362] ? kthread_create_on_node+0x60/0x60 [ 1457.367934] ret_from_fork+0x22/0x30 [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer [ 1457.368414] ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_] [ 1457.380849] CR2: 0000000000000000 [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]--- [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.392525] Kernel panic - not syncing: Fatal exception [ 1457.394138] Kernel Offset: disabled [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]--- We fix this Oops by adding a new goto label out_put_req and reordering the blk_put_request call to avoid calling blk_put_request() with rq value is set to NULL. Here we also update the rest of the code accordingly. Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Chaitanya Kulkarni	4db69a3d7c	nvmet: add ns tear down label for pt-cmd handling In the current implementation before submitting the passthru cmd we may come across error e.g. getting ns from passthru controller, allocating a request from passthru controller, etc. For all the failure cases it only uses single goto label fail_out. In the target code, we follow the pattern to have a separate label for each error out the case when setting up multiple things before the actual action. This patch follows the same pattern and renames generic fail_out label to out_put_ns and updates the error out cases in the nvmet_passthru_execute_cmd() where it is needed. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Martin Wilck	e398863b75	nvme: multipath: round-robin: eliminate "fallback" variable If we find an optimized path, we quit the loop immediately. Thus we can use just one variable for the next path, slighly simplifying the code. Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Martin Wilck	93eb0381e1	nvme: multipath: round-robin: fix single non-optimized path case If there's only one usable, non-optimized path, nvme_round_robin_path() returns NULL, which is wrong. Fix it by falling back to "old", like in the single optimized path case. Also, if the active path isn't changed, there's no need to re-assign the pointer. Fixes: `3f6e3246db` ("nvme-multipath: fix logic for non-optimized paths") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Tianjia Zhang	f34448cd0d	nvme-fc: Fix wrong return value in __nvme_fc_init_request() On an error exit path, a negative error code should be returned instead of a positive return value. Fixes: `e399441de9` ("nvme-fabrics: Add host support for FC transport") Cc: James Smart <jsmart2021@gmail.com> Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Logan Gunthorpe	0ceeab96ba	nvmet-passthru: Reject commands with non-sgl flags set Any command with a non-SGL flag set (like fuse flags) should be rejected. Fixes: `c1fef73f79` ("nvmet: add passthru code to process commands") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Sagi Grimberg	382fee1a8b	nvmet: fix a memory leak We forgot to free new_model_number Fixes: `013b7ebe5a` ("nvmet: make ctrl model configurable") Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-08-21 17:14:27 -06:00
Linus Torvalds	060a72a268	for-5.9/block-merge-20200804 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/ zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf sv4bzDPDBg== =uMth -----END PGP SIGNATURE----- Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block Pull block stacking updates from Jens Axboe: "The stacking related fixes depended on both the core block and drivers branches, so here's a topic branch with that change. Outside of that, a late fix from Johannes for zone revalidation" * tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block: block: don't do revalidate zones on invalid devices block: remove blk_queue_stack_limits block: remove bdev_stack_limits block: inherit the zoned characteristics in blk_stack_limits	2020-08-05 11:12:34 -07:00
Linus Torvalds	e0fc99e21e	for-5.9/drivers-20200803 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3 2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0 jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41 HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT UC2OzunCqA== =oADH -----END PGP SIGNATURE----- Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - NVMe: - ZNS support (Aravind, Keith, Matias, Niklas) - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David, Dongli, Max, Sagi) - null_blk zone capacity support (Aravind) - MD: - raid5/6 fixes (ChangSyun) - Warning fixes (Damien) - raid5 stripe fixes (Guoqing, Song, Yufen) - sysfs deadlock fix (Junxiao) - raid10 deadlock fix (Vitaly) - struct_size conversions (Gustavo) - Set of bcache updates/fixes (Coly) * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits) md/raid5: Allow degraded raid6 to do rmw md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 raid5: don't duplicate code for different paths in handle_stripe raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show md: print errno in super_written md/raid5: remove the redundant setting of STRIPE_HANDLE md: register new md sysfs file 'uuid' read-only md: fix max sectors calculation for super 1.0 nvme-loop: remove extra variable in create ctrl nvme-loop: set ctrl state connecting after init nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths nvme-multipath: fix logic for non-optimized paths nvme-rdma: fix controller reset hang during traffic nvme-tcp: fix controller reset hang during traffic nvmet: introduce the passthru Kconfig option nvmet: introduce the passthru configfs interface nvmet: Add passthru enable/disable helpers nvmet: add passthru code to process commands nvme: export nvme_find_get_ns() and nvme_put_ns() nvme: introduce nvme_ctrl_get_by_path() ...	2020-08-05 10:51:40 -07:00
Linus Torvalds	382625d0d4	for-5.9/block-20200802 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/ FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR RzAUdZFbWw== =abJG -----END PGP SIGNATURE----- Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Good amount of cleanups and tech debt removals in here, and as a result, the diffstat shows a nice net reduction in code. - Softirq completion cleanups (Christoph) - Stop using ->queuedata (Christoph) - Cleanup bd claiming (Christoph) - Use check_events, moving away from the legacy media change (Christoph) - Use inode i_blkbits consistently (Christoph) - Remove old unused writeback congestion bits (Christoph) - Cleanup/unify submission path (Christoph) - Use bio_uninit consistently, instead of bio_disassociate_blkg (Christoph) - sbitmap cleared bits handling (John) - Request merging blktrace event addition (Jan) - sysfs add/remove race fixes (Luis) - blk-mq tag fixes/optimizations (Ming) - Duplicate words in comments (Randy) - Flush deferral cleanup (Yufen) - IO context locking/retry fixes (John) - struct_size() usage (Gustavo) - blk-iocost fixes (Chengming) - blk-cgroup IO stats fixes (Boris) - Various little fixes" * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits) block: blk-timeout: delete duplicated word block: blk-mq-sched: delete duplicated word block: blk-mq: delete duplicated word block: genhd: delete duplicated words block: elevator: delete duplicated word and fix typos block: bio: delete duplicated words block: bfq-iosched: fix duplicated word iocost_monitor: start from the oldest usage index iocost: Fix check condition of iocg abs_vdebt block: Remove callback typedefs for blk_mq_ops block: Use non _rcu version of list functions for tag_set_list blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get ...	2020-08-03 11:57:03 -07:00
Christoph Hellwig	5bedd3afee	nvme: add a Identify Namespace Identification Descriptor list quirk Add a quirk for a device that does not support the Identify Namespace Identification Descriptor list despite claiming 1.3 compliance. Fixes: `ea43d9709f` ("nvme: fix identify error status silent ignore") Reported-by: Ingo Brunberg <ingo_brunberg@web.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ingo Brunberg <ingo_brunberg@web.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2020-07-29 08:05:44 +02:00
Chaitanya Kulkarni	b6cec06d19	nvme-loop: remove extra variable in create ctrl We can call the nvme_change_ctrl_state() directly and have WARN_ON_ONCE(1) call instead of having to use an extra variable which matches the name of the function. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:46:28 +02:00
Chaitanya Kulkarni	64d452b356	nvme-loop: set ctrl state connecting after init When creating a loop controller (ctrl) in nvme_loop_create_ctrl() -> nvme_init_ctrl() we set the ctrl state to NVME_CTRL_NEW. Prior to [1] NVME_CTRL_NEW state was allowed in nvmf_check_ready() for fabrics command type connect. Now, this fails in the following code path for fabrics connect command when creating admin queue :- nvme_loop_create_ctrl() nvme_loo_configure_admin_queue() nvmf_connect_admin_queue() __nvme_submit_sync_cmd() blk_execute_rq() nvme_loop_queue_rq() nvmf_check_ready() # echo "transport=loop,nqn=fs" > /dev/nvme-fabrics [ 6047.741327] nvmet: adding nsid 1 to subsystem fs [ 6048.756430] nvme nvme1: Connect command failed, error wo/DNR bit: 880 We need to set the ctrl state to NVME_CTRL_CONNECTING after :- nvme_loop_create_ctrl() nvme_init_ctrl() so that the above mentioned check for nvmf_check_ready() will return true. This patch sets the ctrl state to connecting after we init the ctrl in nvme_loop_create_ctrl() nvme_init_ctrl() . [1] commit aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Fixes: aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:46:28 +02:00
Hannes Reinecke	fbd6a42d89	nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths When nvme_round_robin_path() finds a valid namespace we should be using it; falling back to __nvme_find_path() for non-optimized paths will cause the result from nvme_round_robin_path() to be ignored for non-optimized paths. Fixes: `75c10e7327` ("nvme-multipath: round-robin I/O policy") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:22 +02:00
Martin Wilck	3f6e3246db	nvme-multipath: fix logic for non-optimized paths Handle the special case where we have exactly one optimized path, which we should keep using in this case. Fixes: `75c10e7327` ("nvme-multipath: round-robin I/O policy") Signed off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:22 +02:00
Sagi Grimberg	9f98772ba3	nvme-rdma: fix controller reset hang during traffic commit `fe35ec58f0` ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: `fe35ec58f0` ("block: update hctx map when use multiple maps") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:22 +02:00
Sagi Grimberg	2875b0aeca	nvme-tcp: fix controller reset hang during traffic commit `fe35ec58f0` ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: `fe35ec58f0` ("block: update hctx map when use multiple maps") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Chaitanya Kulkarni	d9174c1a5d	nvmet: introduce the passthru Kconfig option This patch updates KConfig file for the NVMeOF target where we add new option so that user can selectively enable/disable passthru code. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [logang@deltatee.com: fixed some of the wording in the help message] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	cae5b01a2a	nvmet: introduce the passthru configfs interface When CONFIG_NVME_TARGET_PASSTHRU as 'passthru' directory will be added to each subsystem. The directory is similar to a namespace and has two attributes: device_path and enable. The user must set the path to the nvme controller's char device and write '1' to enable the subsystem to use passthru. Any given subsystem is prevented from enabling both a regular namespace and the passthru device. If one is enabled, enabling the other will produce an error. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	ba76af676c	nvmet: Add passthru enable/disable helpers This patch adds helper functions which are used in the NVMeOF configfs when the user is configuring the passthru subsystem. Here we ensure that only one subsys is assigned to each nvme_ctrl by using an xarray on the cntlid. The subsystem's version number is overridden by the passed through controller's version. However, if that version is less than 1.2.1, then we bump the advertised version to that and print a warning in dmesg. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	c1fef73f79	nvmet: add passthru code to process commands Add passthru command handling capability for the NVMeOF target and export passthru APIs which are used to integrate passthru code with nvmet-core. The new file passthru.c handles passthru cmd parsing and execution. In the passthru mode, we create a block layer request from the nvmet request and map the data on to the block layer request. Admin commands and features are on an allow list as there are a number of each that don't make too much sense with passthrough. We use an allow list such that new commands can be considered before being blindly passed through. In both cases, vendor specific commands are always allowed. We also reject reservation IO commands as the underlying device cannot differentiate between multiple hosts behind a fabric. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	24493b8b85	nvme: export nvme_find_get_ns() and nvme_put_ns() nvme_find_get_ns() and nvme_put_ns() are required by the target passthru code and are exported under the NVME_TARGET_PASSTHRU namespace. Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	f783f444ce	nvme: introduce nvme_ctrl_get_by_path() nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0). It makes use of filp_open() to open the file and uses the private data to obtain a pointer to the struct nvme_ctrl. If the fops of the file do not match, -EINVAL is returned. The purpose of this function is to support NVMe-OF target passthru and is exported under the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	17365ae697	nvme: introduce nvme_execute_passthru_rq to call nvme_passthru_[start\|end]() Introduce a new nvme_execute_passthru_rq() helper which calls nvme_passthru_[start\|end]() around blk_execute_rq(). This ensures all passthru calls (including nvme_submit_io()) will be wrapped appropriately. nvme_execute_passthru_rq() will also be useful for the nvmet passthru code and is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:21 +02:00
Logan Gunthorpe	df21b6b193	nvme: create helper function to obtain command effects Separate the code to obtain command effects from the code to start a passthru request and move the nvme_passthru_start() and nvme_passthru_end() functions up above nvme_submit_user_cmd() in order that they may be used in a new helper a subsequent patch. The new helper function will be necessary for nvmet passthru code to determine if we need to change out of interrupt context to handle the effects. It is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
Logan Gunthorpe	2bf5d3bbff	nvme: clear any SGL flags in passthru commands The host driver should decide whether to use SGLs or PRPs and they currently assume the flags are cleared after the call to nvme_setup_cmd(). However, passed-through commands may erroneously set these bits; so clear them for all cases. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
James Smart	ece0278c1c	nvmet-fc: remove redundant del_work_active flag The transport has a del_work_active flag to avoid duplicate scheduling of the del_work item. This is redundant with the checks that schedule_work() makes. Remove the del_work_active flag. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
James Smart	34efa23234	nvmet-fc: check successful reference in nvmet_fc_find_target_assoc When searching for an association based on an association id, when there is a match, the code takes a reference. However, it is not validating that the reference taking was successful. Check the status of the reference. If unsuccessful, the device is being deleted and should be ignored. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
James Smart	237480760c	nvme-fc: set max_segments to lldd max value Currently the FC transport is set max_hw_sectors based on the lldds max sgl segment count. However, the block queue max segments is set based on the controller's max_segments count, which the transport does not set. As such, the lldd is receiving sgl lists that are exceeding its max segment count. Set the controller max segment count and derive max_hw_sectors from the max segment count. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
Sagi Grimberg	653303f216	nvme-hwmon: log the controller device name Stay consistent with the rest of the driver Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:20 +02:00
Sagi Grimberg	ecca390e80	nvme: fix deadlock in disconnect during scan_work and/or ana_work A deadlock happens in the following scenario with multipath: 1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it, path nvme1 happens to be inaccessible. 2) Before scan_work is complete nvme0 disconnect is initiated nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING 3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path IO is requeued and scan_work hangs. -- Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 -- 4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces(). Similiarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs. [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core] [ 605.552087] Call Trace: [ 605.552683] __schedule+0x2b9/0x6c0 [ 605.553507] schedule+0x42/0xb0 [ 605.554201] io_schedule+0x16/0x40 [ 605.555012] do_read_cache_page+0x438/0x830 [ 605.556925] read_cache_page+0x12/0x20 [ 605.557757] read_dev_sector+0x27/0xc0 [ 605.558587] amiga_partition+0x4d/0x4c5 [ 605.561278] check_partition+0x154/0x244 [ 605.562138] rescan_partitions+0xae/0x280 [ 605.563076] __blkdev_get+0x40f/0x560 [ 605.563830] blkdev_get+0x3d/0x140 [ 605.564500] __device_add_disk+0x388/0x480 [ 605.565316] device_add_disk+0x13/0x20 [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core] [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core] [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core] [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core] [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core] [ 605.573330] process_one_work+0x1db/0x380 [ 605.574144] worker_thread+0x4d/0x400 [ 605.574896] kthread+0x104/0x140 [ 605.577205] ret_from_fork+0x35/0x40 [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds. [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830 [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.582320] nvme D 0 14044 14043 0x00000000 [ 605.583424] Call Trace: [ 605.583935] __schedule+0x2b9/0x6c0 [ 605.584625] schedule+0x42/0xb0 [ 605.585290] schedule_timeout+0x203/0x2f0 [ 605.588493] wait_for_completion+0xb1/0x120 [ 605.590066] __flush_work+0x123/0x1d0 [ 605.591758] __cancel_work_timer+0x10e/0x190 [ 605.593542] cancel_work_sync+0x10/0x20 [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core] [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core] [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core] [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core] [ 605.598320] dev_attr_store+0x17/0x30 Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will indicate the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: `0d0b660f21` ("nvme: add ANA support") Reported-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Sagi Grimberg	4212f4e946	nvme: document nvme controller states We are starting to see some non-trivial states so lets start documenting them. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Chaitanya Kulkarni	7774e77ebe	nvmet: use xarray for ctrl ns storing This patch replaces the ctrl->namespaces tracking from linked list to xarray and improves the performance when accessing one namespce :- XArray vs Default:- IOPS and BW (more the better) increase BW (~1.8%):- --------------------------------------------------- XArray :- read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=162k, BW=631MiB/s (662MB/s)(18.5GiB/30001msec) Default:- read: IOPS=156k, BW=609MiB/s (639MB/s)(17.8GiB/30001msec) read: IOPS=157k, BW=613MiB/s (643MB/s)(17.0GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) Submission latency (less the better) decrease (~8.3%):- ------------------------------------------------------- XArray:- slat (usec): min=7, max=8386, avg=11.19, stdev=5.96 slat (usec): min=7, max=441, avg=11.09, stdev=4.48 slat (usec): min=7, max=1088, avg=11.21, stdev=4.54 Default :- slat (usec): min=8, max=2826.5k, avg=23.96, stdev=3911.50 slat (usec): min=8, max=503, avg=12.52, stdev=5.07 slat (usec): min=8, max=2384, avg=12.50, stdev=5.28 CPU Usage (less the better) decrease (~5.2%):- ---------------------------------------------- XArray:- cpu : usr=1.84%, sys=18.61%, ctx=949471, majf=0, minf=250 cpu : usr=1.83%, sys=18.41%, ctx=950262, majf=0, minf=237 cpu : usr=1.82%, sys=18.82%, ctx=957224, majf=0, minf=234 Default:- cpu : usr=1.70%, sys=19.21%, ctx=858196, majf=0, minf=251 cpu : usr=1.82%, sys=19.98%, ctx=929720, majf=0, minf=227 cpu : usr=1.83%, sys=20.33%, ctx=947208, majf=0, minf=235. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Yamin Friedman	ca0f1a8055	nvmet-rdma: use new shared CQ mechanism Has the driver use shared CQs providing ~10%-20% improvement when multiple disks are used. Instead of opening a CQ for each QP per controller, a CQ for each core will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: Yamin Friedman <yaminf@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Yamin Friedman	287f329e31	nvme-rdma: use new shared CQ mechanism Has the driver use shared CQs providing ~10%-20% improvement as seen in the patch introducing shared CQs. Instead of opening a CQ for each QP per controller connected, a CQ for each QP will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: Yamin Friedman <yaminf@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
David E. Box	df4f9bc4fb	nvme-pci: add support for ACPI StorageD3Enable property This patch implements a solution for a BIOS hack used on some currently shipping Intel systems to change driver power management policy for PCIe NVMe drives. Some newer Intel platforms, like some Comet Lake systems, require that PCIe devices use D3 when doing suspend-to-idle in order to allow the platform to realize maximum power savings. This is particularly needed to support ATX power supply shutdown on desktop systems. In order to ensure this happens for root ports with storage devices, Microsoft apparently created this ACPI _DSD property as a way to influence their driver policy. To my knowledge this property has not been discussed with the NVME specification body. Though the solution is not ideal, it addresses a problem that also affects Linux since the NVMe driver's default policy of using NVMe APST during suspend-to-idle prevents the PCI root port from going to D3 and leads to higher power consumption for these platforms. The power consumption difference may be negligible on laptop systems, but many watts on desktop systems when the ATX power supply is blocked from powering down. The patch creates a new nvme_acpi_storage_d3 function to check for the StorageD3Enable property during probe and enables D3 as a quirk if set. It also provides a 'noacpi' module parameter to allow skipping the quirk if needed. Tested with: - PM961 NVMe SED Samsung 512GB - INTEL SSDPEKKF512G8 Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Signed-off-by: David E. Box <david.e.box@linux.intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Chaitanya Kulkarni	b13c6393be	nvme-pci: use max of PRP or SGL for iod size >From the initial implementation of NVMe SGL kernel support commit `a7a7cbe353` ("nvme-pci: add SGL support") with addition of the commit `943e942e62` ("nvme-pci: limit max IO size and segments to avoid high order allocations") now there is only caller left for nvme_pci_iod_alloc_size() which statically passes true for last parameter that calculates allocation size based on SGL since we need size of biggest command supported for mempool allocation. This patch modifies the helper functions nvme_pci_iod_alloc_size() such that it is now uses maximum of PRP and SGL size for iod allocation size calculation. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:19 +02:00
Chaitanya Kulkarni	6c3c05b087	nvme-core: replace ctrl page size with a macro Saving the nvme controller's page size was from a time when the driver tried to use different sized pages, but this value is always set to a constant, and has been this way for some time. Remove the 'page_size' field and replace its usage with the constant value. This also lets the compiler make some micro-optimizations in the io path, and that's always a good thing. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:18 +02:00
Baolin Wang	5887450b69	nvme: remove redundant validation in nvme_start_ctrl() We've already validated the 'kato' in nvme_start_keep_alive(), thus no need to validate it again in nvme_start_ctrl(). Remove it. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:18 +02:00
Dan Carpenter	eca9e8271e	nvme: remove an unnecessary condition "v" is an unsigned int so it can't be more than UINT_MAX. Removing this check makes it easier to preserve the error code as well. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-29 07:45:18 +02:00
Kai-Heng Feng	5611ec2b98	nvme-pci: prevent SK hynix PC400 from using Write Zeroes command After commit `6e02318eae` ("nvme: add support for the Write Zeroes command"), SK hynix PC400 becomes very slow with the following error message: [ 224.567695] blk_update_request: operation not supported error, dev nvme1n1, sector 499384320 op 0x9:(WRITE_ZEROES) flags 0x1000000 phys_seg 0 prio class 0] SK Hynix PC400 has a buggy firmware that treats NLB as max value instead of a range, so the NLB passed isn't a valid value to the firmware. According to SK hynix there are three commands are affected: - Write Zeroes - Compare - Write Uncorrectable Right now only Write Zeroes is implemented, so disable it completely on SK hynix PC400. BugLink: https://bugs.launchpad.net/bugs/1872383 Cc: kyounghwan sohn <kyounghwan.sohn@sk.com> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-26 17:28:19 +02:00
Sagi Grimberg	adc99fd378	nvme-tcp: fix possible hang waiting for icresp response If the controller died exactly when we are receiving icresp we hang because icresp may never return. Make sure to set a high finite limit. Fixes: `3f2304f8c6` ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-26 17:24:27 +02:00
Christoph Hellwig	b9b1a5d715	block: remove blk_queue_stack_limits This function is just a tiny wrapper around blk_stack_limits. Open code it int the two callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-20 15:38:52 -06:00
Jens Axboe	4f43d64807	Merge branch 'for-5.9/drivers' into for-5.9/block-merge * for-5.9/drivers: (38 commits) block: add max_active_zones to blk-sysfs block: add max_open_zones to blk-sysfs s390/dasd: Use struct_size() helper s390/dasd: fix inability to use DASD with DIAG driver md-cluster: fix wild pointer of unlock_all_bitmaps() md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes md: fix deadlock causing by sysfs_notify md: improve io stats accounting md: raid0/linear: fix dereference before null check on pointer mddev rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()' nvme: remove ns->disk checks nvme-pci: use standard block status symbolic names nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size() nvme-pci: add a blank line after declarations nvme-pci: fix some comments issues nvme-pci: remove redundant segment validation nvme: document quirked Intel models nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs nvme: support for zoned namespaces nvme: support for multiple Command Sets Supported and Effects log pages ...	2020-07-20 15:38:27 -06:00
Jens Axboe	9caaa66c91	Merge branch 'for-5.9/block' into for-5.9/block-merge * for-5.9/block: (124 commits) blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get block: use bd_prepare_to_claim directly in the loop driver block: refactor bd_start_claiming block: simplify the restart case in __blkdev_get Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." block: always remove partitions from blk_drop_partitions() block: relax jiffies rounding for timeouts blk-mq: remove redundant validation in __blk_mq_end_request() blk-mq: Remove unnecessary local variable writeback: remove bdi->congested_fn writeback: remove struct bdi_writeback_congested writeback: remove {set,clear}_wb_congested ...	2020-07-20 15:38:23 -06:00
Anthony Iliopoulos	05b29021fb	nvme: explicitly update mpath disk capacity on revalidation Commit `3b4b19721e` ("nvme: fix possible deadlock when I/O is blocked") reverted multipath head disk revalidation due to deadlocks caused by holding the bd_mutex during revalidate. Updating the multipath disk blockdev size is still required though for userspace to be able to observe any resizing while the device is mounted. Directly update the bdev inode size to avoid unnecessarily holding the bdev->bd_mutex. Fixes: `3b4b19721e` ("nvme: fix possible deadlock when I/O is blocked") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-16 16:40:27 +02:00
Niklas Cassel	659bf827ba	block: add max_active_zones to blk-sysfs Add a new max_active zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max_active_zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_active_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. For SCSI devices, even though max active zones is not part of the ZBC/ZAC spec, export max_active_zones as 0, signifying "no limit". Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 14:26:11 -06:00
Niklas Cassel	e15864f8ea	block: add max_open_zones to blk-sysfs Add a new max_open_zones definition in the sysfs documentation. This definition will be common for all devices utilizing the zoned block device support in the kernel. Export max open zones according to this new definition for NVMe Zoned Namespace devices, ZAC ATA devices (which are treated as SCSI devices by the kernel), and ZBC SCSI devices. Add the new max_open_zones member to struct request_queue, rather than as a queue limit, since this property cannot be split across stacking drivers. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-15 14:26:11 -06:00
Christoph Hellwig	3913f4f3a6	nvme: remove ns->disk checks By the time a namespace is added to ctrl->namespaces list and thus can be looked up ns->disk has been assigned, and it it never cleared. Remove all the checks for ns->disk being NULL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2020-07-08 19:15:20 +02:00
Baolin Wang	359c1f88ab	nvme-pci: use standard block status symbolic names Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:22 +02:00
Baolin Wang	9056fc9fc5	nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size() The nvme_pci_iod_alloc_size() should return 'size_t' type to be consistent with the sizeof return value. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:21 +02:00
Baolin Wang	4e523547e2	nvme-pci: add a blank line after declarations Add a blank line after declarations to make code more readable. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:21 +02:00
Baolin Wang	ee0d96d322	nvme-pci: fix some comments issues Fix comment typos and remove whitespaces before tabs to cleanup checkpatch errors. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:21 +02:00
Baolin Wang	c25c853ef6	nvme-pci: remove redundant segment validation We've validated the segment counts before calling nvme_map_data(), so there is no need to validate again in nvme_pci_use_sgls(, which is only called from nvme_map_data(). Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:21 +02:00
David Fugate	972b13e29d	nvme: document quirked Intel models Documented model names of Intel SSDs requiring quirks. Signed-off-by: David Fugate <david.fugate@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:20 +02:00
Sagi Grimberg	764075fdcb	nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs This is useful information, and moreover its it's useful to be able to alter these parameters per controller after it has been established. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:20 +02:00
Keith Busch	240e6ee272	nvme: support for zoned namespaces Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on their Command Set Identifier reported in the namespaces Namespace Identification Descriptor list. A successfully discovered Zoned Namespace will be registered with the block layer as a host managed zoned block device with Zone Append command support. A namespace that does not support append is not supported by the driver. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Keith Busch <keith.busch@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:20 +02:00
Keith Busch	be93e87e78	nvme: support for multiple Command Sets Supported and Effects log pages The Commands Supported and Effects log page was extended with a CSI field that enables the host to query the log page for each command set supported. Retrieve this log page for each command set that an attached namespace supports, and save a pointer to that log in the namespace head. Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <keith.busch@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:20 +02:00
Niklas Cassel	71010c3094	nvme: implement multiple I/O Command Set support Implements support for multiple I/O Command Sets. NVMe TP 4056 introduces a method to enumerate multiple command sets per namespace. If the command set is exposed, this method for enumeration will be used instead of the traditional method that uses the CC.CSS register command set register for command set identification. For namespaces where the Command Set Identifier is not supported or recognized, the specific namespace will not be created. Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:19 +02:00
Baolin Wang	f5af577d55	nvme: use USEC_PER_SEC instead of magic numbers Use USEC_PER_SEC instead of magic numbers to make code more readable. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:19 +02:00
Sagi Grimberg	b8a12e9357	nvmet-tcp: simplify nvmet_process_resp_list We can make it shorter and simpler without some redundant checks. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:18 +02:00
Sagi Grimberg	122e5b9f3d	nvme-tcp: optimize network stack with setting msg flags according to batch size If we have a long list of request to send, signal the network stack that more is coming (MSG_MORE). If we have nothing else, signal MSG_EOR. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:18 +02:00
Sagi Grimberg	86f0348ace	nvme-tcp: leverage request plugging blk-mq request plugging can improve the execution of our pipeline. When we queue a request we actually trigger our I/O worker thread yielding a context switch by definition. However if we know that there are more requests in the pipe that are coming, we are better off not trigger our I/O worker and only do that for the last request in the batch (bd->last). By having it, we improve efficiency by amortizing context switches over a batch of requests. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:18 +02:00
Sagi Grimberg	15ec928a65	nvme-tcp: have queue prod/cons send list become a llist The queue processing will splice to a queue local list, this should alleviate some contention on the send_list lock, but also prepares us to the next patch where we look on these lists for network stack flag optimization. Remove queue lock as its not used anymore. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> [hch: simplified a loop] Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:18 +02:00
Dongli Zhang	ca8f4beebf	nvme-fcloop: verify wwnn and wwpn format The nvme host and target verify the wwnn and wwpn format via nvme_fc_parse_traddr(). For instance, it is required that the length of wwnn to be either 21 ("nn-0x") or 19 (nn-). Add this verification to nvme-fcloop so that the input should always be in hex and the length of input should always be 18. Otherwise, the user may use e.g. 0x2 to create fcloop local port, while 0x0000000000000002 is required for nvme host and target. This makes the requirement of format confusing. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:17 +02:00
Chaitanya Kulkarni	a0f0dbaa69	nvmet: use unsigned type for u64 In function nvmet_subsys_atte_version_show() which uses the NVME_XXX() macros related to version (of type u64) get rid of the int type cast when printing subsys version and use appropriate format specifier for u64. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:17 +02:00
Max Gurtovoy	6fa350f714	nvmet: introduce flags member in nvmet_fabrics_ops Replace has_keyed_sgls and metadata_support booleans with a flags member that will be used for adding more features in the future. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:17 +02:00
Max Gurtovoy	4e10255972	nvmet-tcp: remove has_keyed_sgls initialization Since the nvmet_tcp_ops is static, there is no need to initialize values to zero. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:17 +02:00
Dongli Zhang	b261b61c9e	nvmet-loop: remove unused 'target_ctrl' in nvme_loop_ctrl This field is never used since the introduction of nvme loopback by commit `3a85a5de29` ("nvme-loop: add a NVMe loopback host driver"). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:16 +02:00
Dongli Zhang	ad50999643	nvme-pci: remove the empty line at the beginning of nvme_should_reset() Just cleanup by removing the empty line. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:16 +02:00
Chaitanya Kulkarni	9dc54a0d15	nvme-pci: code cleanup for nvme_alloc_host_mem() Although use of for loop is preferred it is not a common practice to have 80 char long for loop initialization and comparison section. Use temp variables for calculating values and replace them in the for loop with size of all variables to set to u64 since preferred variable is declared as u64. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:16 +02:00
Chaitanya Kulkarni	61f3b89630	nvme-pci: use unsigned for io queue depth The NVMe PCIe declares module parameter io_queue_depth as int. Change this to u16 as queue depth can never be negative. Now to reflect this update module parameter getter function from param_get_int() -> param_get_uint() and respective setter function with type of n changed from int to u16 with param_set_int() to param_set_ushort(). Finally update struct nvme_dev q_depth member to u16 and use u16 in min_t() when calculating dev->q_depth in the nvme_pci_enable() (since q_depth is now u16) and use unsigned int instead of int when calculating dev->tagset.queue_depth as target variable tagset->queue_depth is of type unsigned int in nvme_dev_add(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:16 +02:00
Chaitanya Kulkarni	d4047cf994	nvme-core: use u16 type for ctrl->sqsize In nvme_init_identify() when calculating submission queue size use u16 instead of int type in the min_t() since target variable ctrl->sqsize is of type u16. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:15 +02:00
Chaitanya Kulkarni	a87835e9ec	nvme-core: use u16 type for directives In nvme_configure_directives() when calculating number of streams use u16 instead of unsigned type in the min_t() since target variable ctrl->nr_streams is of type u16. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-08 16:16:15 +02:00
Jens Axboe	482c6b614a	Linux 5.8-rc4 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8CYDYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcQkH/2vOsPf79yWtsc7x hd2LpCPfrm7T1xlQcYcXbEbyRI8sqPmguixO8pRI1ePl2lBZ7KurfyeYgYZNGpFU t74Ph6A6dSWoCgO68Genm/SQuK8ic6o9n1Vr8tDsGDp5KlHWNaweq4JwHrsPmO1T cI0PR/ClAhLG8cQZ4x988Es5HTNGY17XK27e+M/zKYxSMGY2NRdJBGQIq964i5Q8 2d9G0rtVCaVDzgjrLwaFm6RBu21Il7HV6KsBsacyTFiL1ywx2vnUHzeZQyvuJSOQ 4YpLo9v4tBP10WHC50LRStZyO0qRwPVd/Yl7fL4R/CKsJT9H4uiwasVoEBVSL/k6 CUn3JL0= =P/Vx -----END PGP SIGNATURE----- Merge tag 'v5.8-rc4' into for-5.9/drivers Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide a clean base and making the life for the NVMe changes easier. Signed-off-by: Jens Axboe <axboe@kernel.dk> * tag 'v5.8-rc4': (732 commits) Linux 5.8-rc4 x86/ldt: use "pr_info_once()" instead of open-coding it badly MIPS: Do not use smp_processor_id() in preemptible code MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen .gitignore: Do not track `defconfig` from `make savedefconfig` io_uring: fix regression with always ignoring signals in io_cqring_wait() x86/ldt: Disable 16-bit segments on Xen PV x86/entry/32: Fix #MC and #DB wiring on x86_32 x86/entry/xen: Route #DB correctly on Xen PV x86/entry, selftests: Further improve user entry sanity checks x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER i2c: mlxcpld: check correct size of maximum RECV_LEN packet i2c: add Kconfig help text for slave mode i2c: slave-eeprom: update documentation i2c: eg20t: Load module automatically if ID matches i2c: designware: platdrv: Set class based on DMI i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665 mm/page_alloc: fix documentation error vmalloc: fix the owner argument for the new __vmalloc_node_range callers mm/cma.c: use exact_nid true to fix possible per-numa cma leak ...	2020-07-08 08:02:13 -06:00
Christoph Hellwig	72d447113b	nvme: fix a crash in nvme_mpath_add_disk For private namespaces ns->head_disk is NULL, so add a NULL check before updating the BDI capabilities. Fixes: `b2ce4d9069` ("nvme-multipath: set bdi capabilities once") Reported-by: Avinash M N <Avinash.M.N@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>	2020-07-02 10:38:00 +02:00
Sagi Grimberg	ea43d9709f	nvme: fix identify error status silent ignore Commit `59c7c3caaa` intended to only silently ignore non retry-able errors (DNR bit set) such that we can still identify misbehaving controllers, and in the other hand propagate retry-able errors (DNR bit cleared) so we don't wrongly abandon a namespace just because it happens to be temporarily inaccessible. The goal remains the same as the original commit where this was introduced but unfortunately had the logic backwards. Fixes: `59c7c3caaa` ("nvme: fix possible hang when ns scanning fails during error recovery") Reported-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-02 10:38:00 +02:00
Christoph Hellwig	e556f6ba10	block: remove the bd_queue field from struct block_device Just use bd_disk->queue instead. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 08:08:20 -06:00
Christoph Hellwig	5a6c35f9af	block: remove direct_make_request Now that submit_bio_noacct has a decent blk-mq fast path there is no more need for this bypass. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	ed00aabd5e	block: rename generic_make_request to submit_bio_noacct generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	c62b37d96b	block: move ->make_request_fn to struct block_device_operations The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:24 -06:00
Christoph Hellwig	f695ca3886	block: remove the request_queue argument from blk_queue_split The queue can be trivially derived from the bio, so pass one less argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-07-01 07:27:23 -06:00
Sagi Grimberg	c31244669f	nvme-multipath: fix bogus request queue reference put The mpath disk node takes a reference on the request mpath request queue when adding live path to the mpath gendisk. However if we connected to an inaccessible path device_add_disk is not called, so if we disconnect and remove the mpath gendisk we endup putting an reference on the request queue that was never taken [1]. Fix that to check if we ever added a live path (using NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue reference. [1]: ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0 CPU: 1 PID: 1372 Comm: nvme Tainted: G O 5.7.0-rc2+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xa6/0xf0 RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007 RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980 RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0 FS: 00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: disk_release+0xa2/0xc0 device_release+0x28/0x80 kobject_put+0xa5/0x1b0 nvme_put_ns_head+0x26/0x70 [nvme_core] nvme_put_ns+0x30/0x60 [nvme_core] nvme_remove_namespaces+0x9b/0xe0 [nvme_core] nvme_do_delete_ctrl+0x43/0x5c [nvme_core] nvme_sysfs_delete.cold+0x8/0xd [nvme_core] kernfs_fop_write+0xc1/0x1a0 vfs_write+0xb6/0x1a0 ksys_write+0x5f/0xe0 do_syscall_64+0x52/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: Anton Eidelman <anton@lightbitslabs.com> Tested-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:20 +02:00
Anton Eidelman	d8a22f8560	nvme-multipath: fix deadlock due to head->lock In the following scenario scan_work and ana_work will deadlock: When scan_work calls nvme_mpath_add_disk() this holds ana_lock and invokes nvme_parse_ana_log(), which may issue IO in device_add_disk() and hang waiting for an accessible path. While nvme_mpath_set_live() only called when nvme_state_is_live(), a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO. Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on ANY ctrl will not be able to complete nvme_mpath_set_live() on the same ns->head, which is required in order to update the new accessible path and remove NVME_NS_ANA_PENDING.. Therefore IO never completes: deadlock [1]. Fix: Move device_add_disk out of the head->lock and protect it with an atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit. [1]: kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:2 D 0 160 2 0x80004000 kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: nvme_update_ns_ana_state+0x22/0x60 [nvme_core] kernel: nvme_update_ana_state+0xca/0xe0 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_read_ana_log+0x76/0x100 [nvme_core] kernel: nvme_ana_work+0x15/0x20 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:4 D 0 439 2 0x80004000 kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_mpath_add_disk+0xbe/0x100 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x256/0x390 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 Fixes: `0d0b660f21` ("nvme: add ANA support") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:20 +02:00
Sagi Grimberg	e164471dcf	nvme: don't protect ns mutation with ns->head->lock Right now ns->head->lock is protecting namespace mutation which is wrong and unneeded. Move it to only protect against head mutations. While we're at it, remove unnecessary ns->head reference as we already have head pointer. The problem with this is that the head->lock spans mpath disk node I/O that may block under some conditions (if for example the controller is disconnecting or the path became inaccessible), The locking scheme does not allow any other path to enable itself, preventing blocked I/O to complete and forward-progress from there. This is a preparation patch for the fix in a subsequent patch where the disk I/O will also be done outside the head->lock. Fixes: `0d0b660f21` ("nvme: add ANA support") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:20 +02:00
Anton Eidelman	489dd102a2	nvme-multipath: fix deadlock between ana_work and scan_work When scan_work calls nvme_mpath_add_disk() this holds ana_lock and invokes nvme_parse_ana_log(), which may issue IO in device_add_disk() and hang waiting for an accessible path. While nvme_mpath_set_live() only called when nvme_state_is_live(), a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO. In order to recover and complete the IO ana_work on the same ctrl should be able to update the path state and remove NVME_NS_ANA_PENDING. The deadlock occurs because scan_work keeps holding ana_lock, so ana_work hangs [1]. Fix: Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy of the ANA group desc, and then calls nvme_update_ns_ana_state() without holding ana_lock. [1]: kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: ? __switch_to_asm+0x34/0x70 kernel: ? select_task_rq_fair+0x1aa/0x5c0 kernel: ? kvm_sched_clock_read+0x11/0x20 kernel: ? sched_clock+0x9/0x10 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: nvme_read_ana_log+0x3a/0x100 [nvme_core] kernel: nvme_ana_work+0x15/0x20 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ? process_one_work+0x380/0x380 kernel: ? kthread_park+0x80/0x80 kernel: ret_from_fork+0x35/0x40 Fixes: `0d0b660f21` ("nvme: add ANA support") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:20 +02:00
Sagi Grimberg	3b4b19721e	nvme: fix possible deadlock when I/O is blocked Revert `fab7772bfb` ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns") When adding a new namespace to the head disk (via nvme_mpath_set_live) we will see partition scan which triggers I/O on the mpath device node. This process will usually be triggered from the scan_work which holds the scan_lock. If I/O blocks (if we got ana change currently have only available paths but none are accessible) this can deadlock on the head disk bd_mutex as both partition scan I/O takes it, and head disk revalidation takes it to check for resize (also triggered from scan_work on a different path). See trace [1]. The mpath disk revalidation was originally added to detect online disk size change, but this is no longer needed since commit `cb224c3af4` ("nvme: Convert to use set_capacity_revalidate_and_notify") which already updates resize info without unnecessarily revalidating the disk (the mpath disk doesn't even implement .revalidate_disk fop). [1]: -- kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u65:9 D 0 494 2 0x80004000 kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: revalidate_disk+0x63/0xa0 kernel: __nvme_revalidate_disk+0xfe/0x110 [nvme_core] kernel: nvme_revalidate_disk+0xa4/0x160 [nvme_core] kernel: ? evict+0x14c/0x1b0 kernel: revalidate_disk+0x2b/0xa0 kernel: nvme_validate_ns+0x49/0x940 [nvme_core] kernel: ? blk_mq_free_request+0xd2/0x100 kernel: ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 kernel: ? process_one_work+0x380/0x380 kernel: ? kthread_park+0x80/0x80 kernel: ret_from_fork+0x1f/0x40 ... kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u65:1 D 0 2630 2 0x80004000 kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: ? __switch_to_asm+0x34/0x70 kernel: ? file_fdatawait_range+0x30/0x30 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: ? kmem_cache_alloc_trace+0x19c/0x230 kernel: efi_partition+0x1e6/0x708 kernel: ? vsnprintf+0x39e/0x4e0 kernel: ? snprintf+0x49/0x60 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: ? blk_mq_free_request+0xd2/0x100 kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 kernel: ? process_one_work+0x380/0x380 kernel: ? kthread_park+0x80/0x80 kernel: ret_from_fork+0x1f/0x40 -- Fixes: `fab7772bfb` ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns") Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:20 +02:00
Max Gurtovoy	032a9966a2	nvme-rdma: assign completion vector correctly The completion vector index that is given during CQ creation can't exceed the number of support vectors by the underlying RDMA device. This violation currently can accure, for example, in case one will try to connect with N regular read/write queues and M poll queues and the sum of N + M > num_supported_vectors. This will lead to failure in establish a connection to remote target. Instead, in that case, share a completion vector between queues. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:41:19 +02:00
Max Gurtovoy	1b4ad7a50a	nvme-loop: initialize tagset numa value to the value of the ctrl Both admin's and drive's tagsets should be set according the numa node of the controller. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:37:08 +02:00
Max Gurtovoy	610c823510	nvme-tcp: initialize tagset numa value to the value of the ctrl Both admin's and drive's tagsets should be set according the numa node of the controller. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:37:08 +02:00
Max Gurtovoy	d4ec47f120	nvme-pci: initialize tagset numa value to the value of the ctrl Both admin's and drive's tagsets should be set according the numa node of the controller. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:37:08 +02:00
Max Gurtovoy	635333e400	nvme-pci: override the value of the controller's numa node Set the node value according to the PCI device numa node. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:37:08 +02:00
Max Gurtovoy	4fea243ebc	nvme: set initial value for controller's numa node Initialize the node to NUMA_NO_NODE value. Transports that are aware of numa node affinity can override it (e.g. RDMA transport set the affinity according to the RDMA HCA). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-06-24 18:37:08 +02:00
Christoph Hellwig	7a804c34c2	nvme-rdma: fix a missing completion with remove invalidation Revert and incorret transformation that caused requests using remote invalidation to never complete. Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:58 -06:00
Christoph Hellwig	ff02945149	nvme: use blk_mq_complete_request_remote to avoid an indirect function call Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	8446546cc2	nvme-rdma: factor out a nvme_rdma_end_request helper Factor a small sniplet of duplicated code into a new helper in preparation for making this sniplet a little bit less trivial. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Christoph Hellwig	15f73f5b3e	blk-mq: move failure injection out of blk_mq_complete_request Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-24 09:15:57 -06:00
Linus Torvalds	6adc19fd13	Kbuild updates for v5.8 (2nd) - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+ SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2 zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx 6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L 8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs 1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9 /giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y 6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs E76xsOr3hrAmBu4P =1NIT -----END PGP SIGNATURE----- Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues	2020-06-13 13:29:16 -07:00
Masahiro Yamada	a7f7f6248d	treewide: replace '---help---' in Kconfig files with 'help' Since commit `84af7a6194` ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig' \| xargs sed -i 's/^[[:space:]]---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>	2020-06-14 01:57:21 +09:00
Linus Torvalds	a58dfea297	block-5.8-2020-06-11 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7ioawQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvbJD/wNLN/H4yIQ7tU5XDdvxvpx/u9FC1t2Pep0 w/olj6wnrsHw/WsgJIlw7efTq9QATfszG/dJKJiBGdiJoCKE1TW/CM6RNfDJb4Z3 TUa9ghYYzcfI2NRdV94Ol9qRThjB6OG6Cdw4k3oKbx44EJOzgatBI6xIA3nU+f/L XO+xl2z3+t28guMvcgUkdJsR8GvSrwcXCvw3X/3uqbtAv5hhMbR7jyqxcHDLX72t I+y3/dWfKaienujEmcLKeW+f2RFyjYIvDbQ5b/JDqLah7Fn1A2wYf+mx7iZuQZSi 5nwGcPuj++8GXS6G8JegAl+s5L3AyBNdz5nrxdAlRjDTMgIUstFgueLnCaW64QNF 93kWK5gDwhq+26AFl3mGJ3m+qhh1AhGWaVniBiFA3OUeWcOgVGlRf6jtmWazQaEI v15WTiAXTsQujnV+t5KYKQnm9vJLIcc/njiSss1JXnqrxR6fH+QCHQ96ckTCqx66 0GbN5RkuC2J/RHYEyYnYIJlNZGDsCVoBC3QR10WNlng82cxMyrahS011xUTn9VN+ 0Gnz1ilNFc+bx1jUO+pl6EdIsEBbFkKioyoZsgba5mvM+Nn3nGbvqQPJc+18fSV2 BW1x2yuoc6yjwuol9NMV+cy13Z9u+uA4c0mFIetjuyjE3rZb77iuIiIKVWMRh6Av Ip6GuPEA2A== =TOc1 -----END PGP SIGNATURE----- Merge tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Some followup fixes for this merge window. In particular: - Seqcount write missing preemption disable for stats (Ahmed) - blktrace fixes (Chaitanya) - Redundant initializations (Colin) - Various small NVMe fixes (Chaitanya, Christoph, Daniel, Max, Niklas, Rikard) - loop flag bug regression fix (Martijn) - blk-mq tagging fixes (Christoph, Ming)" * tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block: umem: remove redundant initialization of variable ret pktcdvd: remove redundant initialization of variable ret nvmet: fail outstanding host posted AEN req nvme-pci: use simple suspend when a HMB is enabled nvme-fc: don't call nvme_cleanup_cmd() for AENs nvmet-tcp: constify nvmet_tcp_ops nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops nvme: do not call del_gendisk() on a disk that was never added blk-mq: fix blk_mq_all_tag_iter blk-mq: split out a __blk_mq_get_driver_tag helper blktrace: fix endianness for blk_log_remap() blktrace: fix endianness in get_pdu_int() blktrace: use errno instead of bi_status block: nr_sects_write(): Disable preemption on seqcount write block: remove the error argument to the block_bio_complete tracepoint loop: Fix wrong masking of status flags block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed	2020-06-11 16:07:33 -07:00
Chaitanya Kulkarni	819f7b88b4	nvmet: fail outstanding host posted AEN req In function nvmet_async_event_process() we only process AENs iff there is an open slot on the ctrl->async_event_cmds[] && aen event list posted by the target is not empty. This keeps host posted AEN outstanding if target generated AEN list is empty. We do cleanup the target generated entries from the aen list in nvmet_ctrl_free()-> nvmet_async_events_free() but we don't process AEN posted by the host. This leads to following problem :- When processing admin sq at the time of nvmet_sq_destroy() holds an extra percpu reference(atomic value = 1), so in the following code path after switching to atomic rcu, release function (nvmet_sq_free()) is not getting called which blocks the sq->free_done in nvmet_sq_destroy() :- nvmet_sq_destroy() percpu_ref_kill_and_confirm() - __percpu_ref_switch_mode() -- __percpu_ref_switch_to_atomic() --- call_rcu() -> percpu_ref_switch_to_atomic_rcu() ---- /* calls switch callback / - percpu_ref_put() -- percpu_ref_put_many(ref, 1) --- else if (unlikely(atomic_long_sub_and_test(nr, &ref->count))) ---- ref->release(ref); <---- Not called. This results in indefinite hang:- void nvmet_sq_destroy(struct nvmet_sq sq) ... if (ctrl && ctrl->sqs && ctrl->sqs[0] == sq) { nvmet_async_events_process(ctrl, status); percpu_ref_put(&sq->ref); } percpu_ref_kill_and_confirm(&sq->ref, nvmet_confirm_sq); wait_for_completion(&sq->confirm_done); wait_for_completion(&sq->free_done); <-- Hang here Which breaks the further disconnect sequence. This problem seems to be introduced after commit `64f5e9cdd7` ("nvmet: fix memory leak when removing namespaces and controllers concurrently"). This patch processes ctrl->async_event_cmds[] in the admin sq destroy() context irrespetive of aen_list. Also we get rid of the controller's aen_list processing in the nvmet_sq_destroy() context and just ignore ctrl->aen_list. This results in nvmet_async_events_process() being called from workqueue context so we adjust the code accordingly. Fixes: `64f5e9cdd7` ("nvmet: fix memory leak when removing namespaces and controllers concurrently ") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:06 -06:00
Christoph Hellwig	b97120b15e	nvme-pci: use simple suspend when a HMB is enabled While the NVMe specification allows the device to access the host memory buffer in host DRAM from all power states, hosts will fail access to DRAM during S3 and similar power states. Fixes: `d916b1be94` ("nvme-pci: use host managed power state for suspend") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:06 -06:00
Daniel Wagner	c9c12e51b8	nvme-fc: don't call nvme_cleanup_cmd() for AENs Asynchronous event notifications do not have an associated request. When fcp_io() fails we unconditionally call nvme_cleanup_cmd() which leads to a crash. Fixes: `16686f3a6c` ("nvme: move common call to nvme_cleanup_cmd to core layer") Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:05 -06:00
Max Gurtovoy	a40aae6bbf	nvmet-tcp: constify nvmet_tcp_ops nvmet_tcp_ops is never modified and can be made const to allow the compiler to put it in read-only memory, as done in other transports. Before: text data bss dec hex filename 16164 160 12 16336 3fd0 drivers/nvme/target/tcp.o After: text data bss dec hex filename 16277 64 12 16353 3fe1 drivers/nvme/target/tcp.o Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:05 -06:00
Rikard Falkeborn	6acbd9619b	nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops are never modified and can be made const to allow the compiler to put them in read-only memory. Before: text data bss dec hex filename 53102 6885 576 60563 ec93 drivers/nvme/host/tcp.o After: text data bss dec hex filename 53422 6565 576 60563 ec93 drivers/nvme/host/tcp.o Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:05 -06:00
Niklas Cassel	108a58585b	nvme: do not call del_gendisk() on a disk that was never added device_add_disk() is negated by del_gendisk(). alloc_disk_node() is negated by put_disk(). In nvme_alloc_ns(), device_add_disk() is one of the last things being called in the success case, and only void functions are being called after this. Therefore this call should not be negated in the error path. The superfluous call to del_gendisk() leads to the following prints: [ 7.839975] kobject: '(null)' (000000001ff73734): is not initialized, yet kobject_put() is being called. [ 7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120 Fixes: `33cfdc2aa6` ("nvme: enforce extended LBA format for fabrics metadata") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-11 09:10:05 -06:00
Linus Torvalds	242b233198	RDMA 5.8 merge window pull request A few large, long discussed works this time. The RNBD block driver has been posted for nearly two years now, and the removal of FMR has been a recurring discussion theme for a long time. The usual smattering of features and bug fixes. - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa - Continuing driver cleanups in bnxt_re, hns - Big cleanup of mlx5 QP creation flows - More consistent use of src port and flow label when LAG is used and a mlx5 implementation - Additional set of cleanups for IB CM - 'RNBD' network block driver and target. This is a network block RDMA device specific to ionos's cloud environment. It brings strong multipath and resiliency capabilities. - Accelerated IPoIB for HFI1 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds - Support for exchanging the new IBTA defiend ECE data during RDMA CM exchanges - Removal of the very old and insecure FMR interface from all ULPs and drivers. FRWR should be preferred for at least a decade now. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl7X/IwACgkQOG33FX4g mxp2uw/+MI2S/aXqEBvZfTT8yrkAwqYezS0VeTDnwH/T6UlTMDhHVN/2Ji3tbbX3 FEKT1i2mnAL5RqUAL1lr9g4sG/bVozrpN46Ws5Lu9dTbIPLKTNPWDuLFQDUShKY7 OyMI/bRx6anGnsOy20iiBqnrQbrrZj5TECgnmrkAl62QFdcl7aBWe/yYjy4CT11N ub+aBXBREN1F1pc0HIjd2tI+8gnZc+mNm1LVVDRH9Capun/pI26qDNh7e6QwGyIo n8ItraC8znLwv/nsUoTE7/JRcsTEe6vJI26PQmczZfNJs/4O65G7fZg0eSBseZYi qKf7Uwtb3qW0R7jRUMEgFY4DKXVAA0G2ph40HXBuzOSsqlT6HqYMO2wgG8pJkrTc qAjoSJGzfAHIsjxzxKI8wKuufCddjCm30VWWU7EKeriI6h1J0uPVqKkQMfYBTkik 696eZSBycAVgwayOng3XaehiTxOL7qGMTjUpDjUR6UscbiPG919vP+QsbIUuBXdb YoddBQJdyGJiaCXv32ciJjo9bjPRRi/bII7Q5qzCNI2mi4ZVbudF4ffzyQvdHtNJ nGnpRXoPi7kMvUrKTMPWkFjj0R5/UsPszsA51zbxPydfgBe0Dlc2PrrIG8dlzYAp wbV0Lec+iJucKlt7EZtrjz1xOiOOaQt/5/cW1bWqL+wk2t6gAuY= =9zTe -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "A more active cycle than most of the recent past, with a few large, long discussed works this time. The RNBD block driver has been posted for nearly two years now, and flowing through RDMA due to it also introducing a new ULP. The removal of FMR has been a recurring discussion theme for a long time. And the usual smattering of features and bug fixes. Summary: - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa - Continuing driver cleanups in bnxt_re, hns - Big cleanup of mlx5 QP creation flows - More consistent use of src port and flow label when LAG is used and a mlx5 implementation - Additional set of cleanups for IB CM - 'RNBD' network block driver and target. This is a network block RDMA device specific to ionos's cloud environment. It brings strong multipath and resiliency capabilities. - Accelerated IPoIB for HFI1 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds - Support for exchanging the new IBTA defiend ECE data during RDMA CM exchanges - Removal of the very old and insecure FMR interface from all ULPs and drivers. FRWR should be preferred for at least a decade now" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits) RDMA/cm: Spurious WARNING triggered in cm_destroy_id() RDMA/mlx5: Return ECE DC support RDMA/mlx5: Don't rely on FW to set zeros in ECE response RDMA/mlx5: Return an error if copy_to_user fails IB/hfi1: Use free_netdev() in hfi1_netdev_free() RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr() RDMA/core: Move and rename trace_cm_id_create() IB/hfi1: Fix hfi1_netdev_rx_init() error handling RDMA: Remove 'max_map_per_fmr' RDMA: Remove 'max_fmr' RDMA/core: Remove FMR device ops RDMA/rdmavt: Remove FMR memory registration RDMA/mthca: Remove FMR support for memory registration RDMA/mlx4: Remove FMR support for memory registration RDMA/i40iw: Remove FMR leftovers RDMA/bnxt_re: Remove FMR leftovers RDMA/mlx5: Remove FMR leftovers RDMA/core: Remove FMR pool API RDMA/rds: Remove FMR support for memory registration RDMA/srp: Remove support for FMR memory registration ...	2020-06-05 14:05:57 -07:00
Christoph Hellwig	d24de76af8	block: remove the error argument to the block_bio_complete tracepoint The status can be trivially derived from the bio itself. That also avoid callers like NVMe to incorrectly pass a blk_status_t instead of the errno, and the overhead of translating the blk_status_t to the errno in the I/O completion fast path when no tracing is enabled. Fixes: `35fe0d12c8` ("nvme: trace bio completion") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-06-04 21:16:11 -06:00
Linus Torvalds	cb8e59cc87	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...	2020-06-03 16:27:18 -07:00
Linus Torvalds	bce159d734	for-5.8/drivers-2020-06-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0 QcTxsXXXIw== =H+/Z -----END PGP SIGNATURE----- Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core changes, here are the block driver changes for this merge window: - NVMe changes: - NVMe over Fibre Channel protocol updates, which also reach over to drivers/scsi/lpfc (James Smart) - namespace revalidation support on the target (Anthony Iliopoulos) - gcc zero length array fix (Arnd Bergmann) - nvmet cleanups (Chaitanya Kulkarni) - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg) - use a SRQ per completion vector (Max Gurtovoy) - fix handling of runtime changes to the queue count (Weiping Zhang) - t10 protection information support for nvme-rdma and nvmet-rdma (Israel Rukshin and Max Gurtovoy) - target side AEN improvements (Chaitanya Kulkarni) - various fixes and minor improvements all over, icluding the nvme part of the lpfc driver" - Floppy code cleanup series (Willy, Denis) - Floppy contention fix (Jiri) - Loop CONFIGURE support (Martijn) - bcache fixes/improvements (Coly, Joe, Colin) - q->queuedata cleanups (Christoph) - Get rid of ioctl_by_bdev (Christoph, Stefan) - md/raid5 allocation fixes (Coly) - zero length array fixes (Gustavo) - swim3 task state fix (Xu)" * tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits) bcache: configure the asynchronous registertion to be experimental bcache: asynchronous devices registration bcache: fix refcount underflow in bcache_device_free() bcache: Convert pr_<level> uses to a more typical style bcache: remove redundant variables i and n lpfc: Fix return value in __lpfc_nvme_ls_abort lpfc: fix axchg pointer reference after free and double frees lpfc: Fix pointer checks and comments in LS receive refactoring nvme: set dma alignment to qword nvmet: cleanups the loop in nvmet_async_events_process nvmet: fix memory leak when removing namespaces and controllers concurrently nvmet-rdma: add metadata/T10-PI support nvmet: add metadata support for block devices nvmet: add metadata/T10-PI support nvme: add Metadata Capabilities enumerations nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len nvmet: rename nvmet_rw_len to nvmet_rw_data_len nvmet: add metadata characteristics for a namespace nvme-rdma: add metadata/T10-PI support nvme-rdma: introduce nvme_rdma_sgl structure ...	2020-06-02 15:37:03 -07:00
Linus Torvalds	750a02ab8d	for-5.8/block-2020-06-01 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ rCS8c+T7Ig== =HYf4 -----END PGP SIGNATURE----- Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...	2020-06-02 15:29:19 -07:00
David S. Miller	1806c13dc2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net xdp_umem.c had overlapping changes between the 64-bit math fix for the calculation of npgs and the removal of the zerocopy memory type which got rid of the chunk_size_nohdr member. The mlx5 Kconfig conflict is a case where we just take the net-next copy of the Kconfig entry dependency as it takes on the ESWITCH dependency by one level of indirection which is what the 'net' conflicting change is trying to ensure. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-31 17:48:46 -07:00
Keith Busch	3382a567ef	nvme: force complete cancelled requests Use blk_mq_foce_complete_rq() to bypass fake timeout error injection so that request reclaim may proceed. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-29 10:21:59 -06:00
Christoph Hellwig	6ebf71bab9	ipv4: add ip_sock_set_tos Add a helper to directly set the IP_TOS sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:45 -07:00
Christoph Hellwig	557eadfcc5	tcp: add tcp_sock_set_syncnt Add a helper to directly set the TCP_SYNCNT sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:45 -07:00
Christoph Hellwig	12abc5ee78	tcp: add tcp_sock_set_nodelay Add a helper to directly set the TCP_NODELAY sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:45 -07:00
Christoph Hellwig	6e43496745	net: add sock_set_priority Add a helper to directly set the SO_PRIORITY sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:44 -07:00
Christoph Hellwig	c433594c07	net: add sock_no_linger Add a helper to directly set the SO_LINGER sockopt from kernel space with onoff set to true and a linger time of 0 without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:44 -07:00
Christoph Hellwig	b58f0e8f38	net: add sock_set_reuseaddr Add a helper to directly set the SO_REUSEADDR sockopt from kernel space without going through a fake uaccess. For this the iscsi target now has to formally depend on inet to avoid a mostly theoretical compile failure. For actual operation it already did depend on having ipv4 or ipv6 support. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-28 11:11:44 -07:00
Leon Romanovsky	8094ba0ace	RDMA/cma: Provide ECE reject reason IBTA declares "vendor option not supported" reject reason in REJ messages if passive side doesn't want to accept proposed ECE options. Due to the fact that ECE is managed by userspace, there is a need to let users to provide such rejected reason. Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>	2020-05-27 16:05:05 -03:00
Dongli Zhang	9210c075ce	nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll() There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g., when doing live reset while polling the nvme device. CPU X CPU Y nvme_poll() nvme_dev_disable() -> nvme_stop_queues() -> nvme_suspend_io_queues() -> nvme_suspend_queue() -> spin_lock(&nvmeq->cq_poll_lock); -> nvme_reap_pending_cqes() -> nvme_process_cq() -> nvme_process_cq() In the above scenario, the nvme_process_cq() for the same queue may be running on both CPU X and CPU Y concurrently. It is much more easier to reproduce the issue when CONFIG_PREEMPT is enabled in kernel. When CONFIG_PREEMPT is disabled, it would take longer time for nvme_stop_queues()-->blk_mq_quiesce_queue() to wait for grace period. This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in nvme_reap_pending_cqes(). Fixes: `fa46c6fb5d` ("nvme/pci: move cqe check after device shutdown") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 20:32:56 +02:00
Keith Busch	3b2a1ebceb	nvme: set dma alignment to qword The default dma alignment mask is 511, which is much larger than any nvme controller requires. NVMe controllers accept qword aligned DMA addresses, so set the request_queue constraints to that. This can help avoid bounce buffers on user passthrough commands. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
David Milburn	1cdf9f7670	nvmet: cleanups the loop in nvmet_async_events_process Based-on-a-patch-by: Christoph Hellwig <hch@infradead.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
Sagi Grimberg	64f5e9cdd7	nvmet: fix memory leak when removing namespaces and controllers concurrently When removing a namespace, we add an NS_CHANGE async event, however if the controller admin queue is removed after the event was added but not yet processed, we won't free the aens, resulting in the below memory leak [1]. Fix that by moving nvmet_async_event_free to the final controller release after it is detached from subsys->ctrls ensuring no async events are added, and modify it to simply remove all pending aens. -- $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888c1af2c000 (size 32): comm "nvmetcli", pid 5164, jiffies 4295220864 (age 6829.924s) hex dump (first 32 bytes): 28 01 82 3b 8b 88 ff ff 28 01 82 3b 8b 88 ff ff (..;....(..;.... 02 00 04 65 76 65 6e 74 5f 66 69 6c 65 00 00 00 ...event_file... backtrace: [<00000000217ae580>] nvmet_add_async_event+0x57/0x290 [nvmet] [<0000000012aa2ea9>] nvmet_ns_changed+0x206/0x300 [nvmet] [<00000000bb3fd52e>] nvmet_ns_disable+0x367/0x4f0 [nvmet] [<00000000e91ca9ec>] nvmet_ns_free+0x15/0x180 [nvmet] [<00000000a15deb52>] config_item_release+0xf1/0x1c0 [<000000007e148432>] configfs_rmdir+0x555/0x7c0 [<00000000f4506ea6>] vfs_rmdir+0x142/0x3c0 [<0000000000acaaf0>] do_rmdir+0x2b2/0x340 [<0000000034d1aa52>] do_syscall_64+0xa5/0x4d0 [<00000000211f13bc>] entry_SYSCALL_64_after_hwframe+0x6a/0xdf Fixes: `a07b4970f4` ("nvmet: add a generic NVMe target") Reported-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
Israel Rukshin	b09160c399	nvmet-rdma: add metadata/T10-PI support For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end protection information passthrough and validation for NVMe over RDMA transport. Metadata support was implemented over the new RDMA signature verbs API. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
Israel Rukshin	c6e3f13398	nvmet: add metadata support for block devices Allocate the metadata SGL buffers and set metadata fields for the request. Then create a block IO request for the metadata from the protection SG list. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
Israel Rukshin	ea52ac1c66	nvmet: add metadata/T10-PI support Expose the namespace metadata format when PI is enabled. The user needs to enable the capability per subsystem and per port. The other metadata properties are taken from the namespace/bdev. Usage example: echo 1 > /config/nvmet/subsystems/${NAME}/attr_pi_enable echo 1 > /config/nvmet/ports/${PORT_NUM}/param_pi_enable Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:40 +02:00
Israel Rukshin	136cc1ffcf	nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len The function doesn't check only the data length, because the transfer length includes also the metadata length in some cases. This is preparation for adding metadata (T10-PI) support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Israel Rukshin	26af180c1b	nvmet: rename nvmet_rw_len to nvmet_rw_data_len The function doesn't add the metadata length (only data length is calculated). This is preparation for adding metadata (T10-PI) support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Israel Rukshin	d2d1c454a4	nvmet: add metadata characteristics for a namespace Fill those namespace fields from the block device format for adding metadata (T10-PI) over fabric support with block devices. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Max Gurtovoy	5ec5d3bddc	nvme-rdma: add metadata/T10-PI support For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end protection information passthrough and validation for NVMe over RDMA transport. Metadata offload support was implemented over the new RDMA signature verbs API and it is enabled for capable controllers. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Israel Rukshin	324d9e7814	nvme-rdma: introduce nvme_rdma_sgl structure Remove first_sgl pointer from struct nvme_rdma_request and use pointer arithmetic instead. The inline scatterlist, if exists, will be located right after the nvme_rdma_request. This patch is needed as a preparation for adding PI support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Israel Rukshin	ba7ca2ae02	nvme: introduce NVME_INLINE_METADATA_SG_CNT SGL size of metadata is usually small. Thus, 1 inline sg should cover most cases. The macro will be used for pre-allocate a single SGL entry for metadata. The preallocation of small inline SGLs depends on SG_CHAIN capability so if the ARCH doesn't support SG_CHAIN, use the runtime allocation for the SGL. This patch is a preparation for adding metadata (T10-PI) over fabric support. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Max Gurtovoy	33cfdc2aa6	nvme: enforce extended LBA format for fabrics metadata An extended LBA is a larger LBA that is created when metadata associated with the LBA is transferred contiguously with the LBA data (AKA interleaved). The metadata may be either transferred as part of the LBA (creating an extended LBA) or it may be transferred as a separate contiguous buffer of data. According to the NVMeoF spec, a fabrics ctrl supports only an Extended LBA format. Fail revalidation in case we have a spec violation. Also add a flag that will imply on capable transports and controllers as part of a preparation for allowing end-to-end protection information for fabric controllers. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
Max Gurtovoy	9509335039	nvme: introduce max_integrity_segments ctrl attribute This patch doesn't change any logic, and is needed as a preparation for adding PI support for fabrics drivers that will use an extended LBA format for metadata and will support more than 1 integrity segment. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:39 +02:00
James Smart	4d2ce68835	nvme: make nvme_ns_has_pi accessible to transports Move the nvme_ns_has_pi() inline from core.c to the nvme.h header. This allows use by the transports. Signed-off-by: James Smart <jsmart2021@gmail.com> [maxg: added a comment for nvme_ns_has_pi()] Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Max Gurtovoy	b29f84857a	nvme: introduce NVME_NS_METADATA_SUPPORTED flag This is a preparation for adding support for metadata in fabric controllers. New flag will imply that NVMe namespace supports getting metadata that was originally generated by host's block layer. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Max Gurtovoy	ffc89b1d3c	nvme: introduce namespace features flag Replace the specific ext boolean (that implies on extended LBA format) with a feature in the new namespace features flag. This is a preparation for adding more namespace features (such as metadata specific features). Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Chaitanya Kulkarni	1f357548ec	nvmet: revalidate-ns & generate AEN from configfs Add a new attribute "revalidate_size" for the namespace which allows user to revalidate and generate the AEN if needed. This attribute is needed so that we can install userspace rules with systemd service based on inotify/fsnotify/uevent. The registered callback for such a service will end up writing to this attribute to generate AEN if needed. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Chaitanya Kulkarni	de124f4273	nvmet: generate AEN for ns revalidate size change The newly added function nvmet_ns_revalidate() does update the ns size in the identify namespace in-core target data structure when host issues id-ns command. This can lead to host having inconsistencies between size of the namespace present in the id-ns command result and size of the corresponding block device until host scans the namespaces explicitly. To avoid this scenario generate AEN if old size is not same as the new one in nvmet_ns_revalidate(). This will allow automatic AEN generation when host calls id-ns command and also allows target to install userspace rules so that it can trigger nvmet_ns_revalidate() (using configfs interface with the help of next patch) resulting in appropriate AEN generation when underlying namespace size change is detected. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Chaitanya Kulkarni	463c5fabb8	nvmet: add helper to revalidate bdev and file ns This patch adds a wrapper helper to indicate size change in the bdev & file-backed namespace when revalidating ns. This helper is needed in order to minimize code repetition in the next patch for configfs.c and existing admin-cmd.c. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimbeg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Chaitanya Kulkarni	696ece7513	nvmet: add async event tracing support This adds a new tracepoint for the target to trace async event. This is helpful in debugging and comparing host and target side async events especially when host is connected to different targets on different machines and now that we rely on userspace components to generate AEN. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:38 +02:00
Dan Carpenter	ec0862ac5a	nvme: delete an unnecessary declaration The nvme_put_ctrl() is implemented earlier as an inline function so this declaration isn't required. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Gustavo A. R. Silva	f1e71d75f0	nvme: replace zero-length array with flexible-array The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Damien Le Moal	68ab60ca2d	nvme: fix io_opt limit setting Currently, a namespace io_opt queue limit is set by default to the physical sector size of the namespace and to the the write optimal size (NOWS) when the namespace reports optimal IO sizes. This causes problems with block limits stacking in blk_stack_limits() when a namespace block device is combined with an HDD which generally do not report any optimal transfer size (io_opt limit is 0). The code: /* Optimal I/O a multiple of the physical block size? */ if (t->io_opt & (t->physical_block_size - 1)) { t->io_opt = 0; t->misaligned = 1; ret = -1; } in blk_stack_limits() results in an error return for this function when the combined devices have different but compatible physical sector sizes (e.g. 512B sector SSD with 4KB sector disks). Fix this by not setting the optimal IO size queue limit if the namespace does not report an optimal write size value. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bart van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Wu Bo	84e4c204b6	nvme: disable streams when get stream params failed Disable streams again if getting the stream params fails. Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Martin George	614fc1c0d9	nvme-fc: print proper nvme-fc devloss_tmo value The nvme-fc devloss_tmo is computed as the min of either the ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo value. So correct this by printing the min value instead. Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Weiping Zhang	9c9e76d579	nvme-pci: make sure write/poll_queues less or equal then cpu count Check module parameter write/poll_queues before using it to catch too large values. Reproducer: modprobe -r nvme modprobe nvme write_queues=`nproc` echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues echo 1 > /sys/block/nvme0n1/device/reset_controller [ 657.069000] ------------[ cut here ]------------ [ 657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0 [ 657.069056] dm_region_hash dm_log dm_mod [ 657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G W 5.6.0+ #8 [ 657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019 [ 657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0 [ 657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff [ 657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202 [ 657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000 [ 657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000 [ 657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718 [ 657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064 [ 657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001 [ 657.069071] FS: 0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000 [ 657.069072] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0 [ 657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 657.069073] PKRU: 55555554 [ 657.069074] Call Trace: [ 657.069080] __pci_enable_msix_range+0x233/0x5a0 [ 657.069085] ? kernfs_put+0xec/0x190 [ 657.069086] pci_alloc_irq_vectors_affinity+0xbb/0x130 [ 657.069089] nvme_reset_work+0x6e6/0xeab [nvme] [ 657.069093] ? __switch_to_asm+0x34/0x70 [ 657.069094] ? __switch_to_asm+0x40/0x70 [ 657.069095] ? nvme_irq_check+0x30/0x30 [nvme] [ 657.069098] process_one_work+0x1a7/0x370 [ 657.069101] worker_thread+0x1c9/0x380 [ 657.069102] ? max_active_store+0x80/0x80 [ 657.069103] kthread+0x112/0x130 [ 657.069104] ? __kthread_parkme+0x70/0x70 [ 657.069105] ret_from_fork+0x35/0x40 [ 657.069106] ---[ end trace f4f06b7d24513d06 ]--- [ 657.077110] nvme nvme0: 95/1/0 default/read/poll queues Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Sagi Grimberg	0236d34379	nvmet-tcp: move send/recv error handling in the send/recv methods instead of call-sites Have routines handle errors and just bail out of the poll loop. This simplifies the code and will help as we may enhance the poll loop logic and these are somewhat in the way. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Sagi Grimberg	f381ab1f26	nvmet-tcp: set MSG_EOR if we send last payload in the batch when trying to send the pdu data digest, we should set this flag. Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Sagi Grimberg	4eea804364	nvmet-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:37 +02:00
Sagi Grimberg	5bb052d7aa	nvme-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:36 +02:00
Christoph Hellwig	7425596945	nvmet: mark nvmet_ana_state static Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2020-05-27 07:12:36 +02:00
Chen Zhou	09bb8986c9	nvmet: replace kstrndup() with kmemdup_nul() It is more efficient to use kmemdup_nul() if the size is known exactly. The doc in kernel: "Note: Use kmemdup_nul() instead if the size is known exactly." Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-27 07:12:36 +02:00
Christoph Hellwig	9398554fb3	block: remove the error_sector argument to blkdev_issue_flush The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-22 08:45:46 -06:00
Keith Busch	b69e2ef24b	nvme-pci: dma read memory barrier for completions Control dependencies do not guarantee load order across the condition, allowing a CPU to predict and speculate memory reads. Commit `324b494c28` inlined verifying a new completion with its handling. At least one architecture was observed to access the contents out of order, resulting in the driver using stale data for the completion. Add a dma read barrier before reading the completion queue entry and after the condition its contents depend on to ensure the read order is determinsitic. Reported-by: John Garry <john.garry@huawei.com> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: John Garry <john.garry@huawei.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-05-12 18:02:24 +02:00
Keith Busch	92decf118f	nvme: define constants for identification values Improve code readability by defining the specification's constants that the driver is using when decoding identification payloads. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Bart van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	d02abd1986	nvmet: align addrfam list to spec With reference to the NVMeOF Specification (page 44, Figure 38) discovery log page entry provides address family field. We do set the transport type field but the adrfam field is not set when using loop transport and also it doesn't have support in the nvme-cli. So when reading discovery log page with a loop transport it leads to confusing output. As per the spec for adrfam value 254 is reserved for Intra Host Transport i.e. loopback), we add a required macro in the protocol header file, set default port disc addr entry's adrfam to NVMF_ADDR_FAMILY_MAX, and update nvmet_addr_family configfs array for show/store attribute. Without this patch, setting adrfam to (ipv4/ipv6/ib/fc/loop/" ") we get following output for nvme discover command from nvme-cli which is confusing. trtype: loop adrfam: ipv4 trtype: loop adrfam: ipv6 trtype: loop adrfam: infiniband trtype: loop adrfam: fibre-channel trtype: loop # ${CFGFS_HOME}/nvmet/ports/1/addr_adrfam = loop adrfam: pci # <----- pci for loop trtype: loop # ${CFGFS_HOME}/nvmet/ports/1/addr_adrfam = " " adrfam: pci # <----- pci for unrecognized This patch fixes above output :- trtype: loop adrfam: ipv4 trtype: loop adrfam: ipv6 trtype: loop adrfam: infiniband trtype: loop adrfam: fibre-channel trtype: loop # ${CFGFS_HOME}/nvmet/ports/1/addr_adrfam = loop adrfam: loop # <----- loop for loop trtype: loop # ${CFGFS_HOME}/config/nvmet/ports/adrfam = " " adrfam: unrecognized # <----- unrecognized when invalid value Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	3ecb5faa07	nvmet: centralize port enable access for configfs The configfs attributes which are supposed to set when port is disable such as addr[addrfam\|portid\|traddr\|treq\|trsvcid\|inline_data_size\|trtype] has repetitive check and generic error message printing. This patch creates centralize helper to check and print an error message that also accepts caller as a parameter. This makes error message easy to parse for the user, removes the duplicate code and makes it available for futures such scenarios. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	87628e2851	nvmet: use type-name map for address treq Currently nvmet_addr_treq_[store\|show]() uses switch and if else ladder for address transport requirements to string and reverse mapping. With addtion of the generic nvmet_type_name_map structure we can get rid of the switch and if else ladder with string duplication. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	84b8d0d7aa	nvmet: use type-name map for ana states Now that we have a generic type to name map for configfs, get rid of the nvmet_ana_state_names structure and replace it with newly added nvmet_type_name_map. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	7e764179c8	nvmet: use type-name map for address family Right now nvmet_addr_adrfam_[store\|show]() uses switch and if else ladder for address family to string and reverse mapping which also repeats the strings in show and store function. With addition of generic nvmet_type_name_map structure we can now get rid of the switch and if else ladder and string duplication. Also, we add a newline in before found label in nvmet_addr_trtype_store() which keeps goto label code consistent with nvmet_allowed_hosts_drop_link(), nvmet_port_subsys_drop_link() and nvmet_ana_group_ana_state_store(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Chaitanya Kulkarni	45e2f3c2d2	nvmet: add generic type-name mapping This patch adds a new type to name mapping generic structure. It replaces nvmet_transport_name with new generic mapping structure nvmet_transport. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Christoph Hellwig	7890b9701b	nvme-multipath: stop using ->queuedata nvme-multipath already uses the gendisk private data, not need to also set up the request_queue queuedata and use it in one place only. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Sagi Grimberg	db5ad6b7f8	nvme-tcp: try to send request in queue_rq context Today, nvme-tcp automatically schedules a send request to a workqueue context, which is 1 more than we'd need in case the socket buffer is wide open. However, because we have async send activity (as a result of r2t, or write_space callbacks), we need to synchronize sends from possibly multiple contexts (ideally all running on the same cpu though). Thus, we only try to send directly from queue_rq in cases: 1. the send_list is empty 2. we can send it synchronously (i.e. not from the RX path) 3. we run on the same cpu as the queue->io_cpu to avoid contention on the send operation. Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Sagi Grimberg	72e5d757c6	nvme-tcp: avoid scheduling io_work if we are already polling When the user runs polled I/O, we shouldn't have to trigger the workqueue to generate the receive work upon the .data_ready upcall. This prevents a redundant context switch when the application is already polling for completions. Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Sagi Grimberg	386e5e6e1a	nvme-tcp: use bh_lock in data_ready data_ready may be invoked from send context or from softirq, so need bh locking for that. Fixes: `3f2304f8c6` ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Weiping Zhang	2a5bcfdd41	nvme-pci: align io queue count with allocted nvme_queue in nvme_probe Since commit `147b27e4bd` ("nvme-pci: allocate device queues storage space at probe"), nvme_alloc_queue does not alloc the nvme queues itself anymore. If the write/poll_queues module parameters are changed at runtime to values larger than the number of allocated queues in nvme_probe, nvme_alloc_queue will access unallocated memory. Add a new nr_allocated_queues member to struct nvme_dev to record how many queues were alloctated in nvme_probe to avoid using more than the allocated queues after a reset following a change to the write/poll_queues module parameters. Also add nr_write_queues and nr_poll_queues members to allow refreshing the number of write and poll queues based on a change to the module parameters when resetting the controller. Fixes: `147b27e4bd` ("nvme-pci: allocate device queues storage space at probe") Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> [hch: add nvme_max_io_queues, update the commit message] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Keith Busch	54b2fcee1d	nvme-pci: remove last_sq_tail The nvme driver does not have enough tags to wrap the queue, and blk-mq will no longer call commit_rqs() when there are no new submissions to notify. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Keith Busch	74943d45ee	nvme-pci: remove volatile cqes The completion queue entry is not volatile once the phase is confirmed. Remove the volatile keywords and check the phase using the appropriate READ_ONCE() accessor, allowing the compiler to optimize the remaining completion path. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Keith Busch	b04df85d9a	nvme: flush scan work on passthrough commands If a passthrough command causes the namespace inventory or capabilities to change, flush the scan work that handles these changes so the driver synchronizes with the user command's effects before returning the result to user space. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Christoph Hellwig	6623c5b3df	nvme: clean up error handling in nvme_init_ns_head Use a common label for putting the nshead if needed and only convert nvme status codes for the one case where it actually is needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Arnd Bergmann	3add1d93d9	nvme-fc: avoid gcc-10 zero-length-bounds warning When CONFIG_ARCH_NO_SG_CHAIN is set, op->sgl[0] cannot be dereferenced, as gcc-10 now points out: drivers/nvme/host/fc.c: In function 'nvme_fc_init_request': drivers/nvme/host/fc.c:1774:29: warning: array subscript 0 is outside the bounds of an interior zero-length array 'struct scatterlist[0]' [-Wzero-length-bounds] 1774 \| op->op.fcp_req.first_sgl = &op->sgl[0]; \| ^~~~~~~~~~~ drivers/nvme/host/fc.c:98:21: note: while referencing 'sgl' 98 \| struct scatterlist sgl[NVME_INLINE_SG_CNT]; \| ^~~ I don't know if this is a legitimate warning or a false-positive. If this is just a false alarm, the warning is easily suppressed by interpreting the array as a pointer. Fixes: `b1ae1a2389` ("nvme-fc: Avoid preallocating big SGL for data") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:36 -06:00
Anthony Iliopoulos	e8cd1ff11d	nvmet: add ns revalidation support Add support for detecting capacity changes on nvmet blockdev and file backed namespaces. This allows for emulating and testing online resizing of nvme devices and filesystems on top. Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> [chaitanya: Fix comments posted on V1] Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [hch: reuse code a bit more] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	31fdad7be1	nvme: consolodate io settings The stream parameters indicating optimal io settings were just getting overwritten later. Rearrange the settings so the streams parameters can be preserved if provided. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	bc1af009a8	nvme: revalidate namespace stream parameters The stream parameters are based on the currently formatted logical block size. Recheck these parameters on namespace revalidation so the registered constraints will be accurate. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	38adf94e16	nvme: consolidate chunk_sectors settings Move the quirked chunk_sectors setting to the same location as noiob so one place registers this setting. And since the noiob value is only used locally, remove the member from struct nvme_ns. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	b2b2de7c5a	nvme: revalidate after verifying identifiers If the namespace identifiers have changed, skip updating the disk information, as that will register parameters from a mismatched namespace. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	b2ce4d9069	nvme-multipath: set bdi capabilities once The queues' backing device info capabilities don't change with each namespace revalidation. Set it only when each path's request_queue is initially added to a multipath queue. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	0c284db7f2	nvme: check namespace head shared property Reject a new shared namespace if a duplicate unshared namespace exists. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	9ad1927a3b	nvme: always search for namespace head Even if a namespace reports it is not capable of sharing, search the subsystem for a matching namespace head. If found, the driver should reject that namespace since it's coming from an invalid configuration. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	ac262508da	nvme: release namespace head reference on error If a namespace identification does not match the subsystem's head for that NSID, release the reference that was taken when the matching head was initially found. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	d567572906	nvme: unlink head after removing last namespace The driver had been unlinking the namespace head from the subsystem's list only after the last reference was released, and outside of the list's subsys->lock protection. There is no reason to track an empty head, so unlink the entry from the subsystem's list when the last namespace using that head is removed and with the mutex lock protecting the list update. The next namespace to attach reusing the previous NSID will allocate a new head rather than find the old head with mismatched identifiers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Christoph Hellwig	aec459b484	nvme: remove the magic 1024 constant in nvme_scan_ns_list Replace it with a value derived from the identify data and nsid sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Christoph Hellwig	4005f28d25	nvme: avoid an Identify Controller command for each namespace scan The namespace lists are 0-terminated, so we don't really need the NN value execept for the legacy sequential scan. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Christoph Hellwig	4450ba3bbb	nvme: factor out a nvme_ns_remove_by_nsid helper Factor out a piece of deeply indented and logicaly separate code from nvme_scan_ns_list into a new helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Christoph Hellwig	25dcaa9292	nvme: clean up nvme_scan_work Move the check for the supported CNS value into nvme_scan_ns_list, and limit the life time of the identify controller allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Christoph Hellwig	b9a5c3d4c3	nvme: refine the Qemu Identify CNS quirk Add a helper to check if we can use Identify CNS values > 1, and refine the Qemu quirk to not apply to reported versions larger than 1.1, as the Qemu implementation had been fixed by then. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
James Smart	e4fcc72c1a	nvmet-fc: slight cleanup for kbuild test warnings The kbuild tst robot flagged the following 3 issues: Case 1) >> drivers/nvme/target/fc.c:1201:37: warning: Either the condition >> '!assoc' is redundant or there is possible null pointer dereference: >> assoc. [nullPointerRedundantCheck] >> struct nvmet_fc_tgtport tgtport = assoc->tgtport; ^ >> drivers/nvme/target/fc.c:1853:7: note: Assuming that condition '!assoc' >> is not redundant >> if (!assoc) ^ >> drivers/nvme/target/fc.c:1850:37: note: Assignment >> 'assoc=nvmet_fc_find_target_assoc(tgtport,be64_to_cpu( >> rqst->associd.association_id))', assigned value is 0 >> assoc = nvmet_fc_find_target_assoc(tgtport, ^ >> drivers/nvme/target/fc.c:1896:31: note: Calling function >> 'nvmet_fc_delete_target_assoc', 1st argument 'assoc' value is 0 >> nvmet_fc_delete_target_assoc(assoc); ^ The tool isn't smart enough to see that line 1854 sets a ret value which thereafter causes the routine to exit. This occurs before any of the assoc references, so it is not an issue. There are 2 more reportings of this same failure. To quiet the tool - rework the if test that does the exit to also reference assoc. No change in logic otherwise. Case 2) drivers/nvme/target/fc.c:1202:29: warning: The scope of the variable 'queue' can be reduced. [variableScope] struct nvmet_fc_tgt_queue queue; ^ The tool is requesting the variable be declared within the code block that utilizes it. Ignoring this report as existing code style is fine. Case 3) drivers/nvme/target/fc.c:1137:16: warning: Variable 'needrandom' is assigned a value that is never used. [unreadVariable] needrandom = true; ^ Another parsing issue with the tool. Given that parens were not used with the list_for_each_entry() check, it inadvertantly thinks the break exited the outer while loop not the inner for loop. This is not an error. But, added parens to the inner list_for_each_entry() to quiet the tool and as it is better coding style. -- james Signed-off-by: James Smart <jsmart2021@gmail.com> Reported-by: kbuild test robot <lkp@intel.com> CC: kbuild test robot <lkp@intel.com> CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Max Gurtovoy	b0012dd397	nvmet-rdma: use SRQ per completion vector In order to save resource allocation and utilize the completion locality in a better way (compared to SRQ per device that exist today), allocate Shared Receive Queues (SRQs) per completion vector. Associate each created QP/CQ with an appropriate SRQ according to the queue index. This association will reduce the lock contention in the fast path (compared to SRQ per device solution) and increase the locality in memory buffers. Add new module parameter for SRQ size to adjust it according to the expected load. User should make sure the size is >= 256 to avoid lack of resources. Also reduce the debug level of "last WQE reached" event that is raised when a QP is using SRQ during destruction process to relief the log. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	03f8cebc12	nvme: remove unused parameter nvme_alloc_ns_head() doesn't use the 'struct nvme_id_ns' parameter. Remove it, and update caller accordingly. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:35 -06:00
Keith Busch	71fb90eb71	nvme: provide num dword helper Various nvme commands use a zeroes based number of dwords field. Create a helper function to convert byte lengths to this format. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	437c0b824d	nvme-fcloop: add target to host LS request support Add support for performing LS requests from target to host. Include sending request from targetport, reception into host, host sending ls rsp. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	ea39765843	nvme-fcloop: refactor to enable target to host LS Currently nvmefc-loop only sends LS's from host to target. Slightly rework data structures and routine names to reflect this path. Allows a straight-forward conversion to be used by ls's from target to host. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	47bf324106	nvmet-fc: Add Disconnect Association Xmt support As part of FC-NVME-2 (and ammendment on FC-NVME), the target is to send a Disconnect LS after an association is terminated and any exchanges for the association have been ABTS'd. The target is also not to send the receipt to any Disconnect Association LS, received to initiate the association termination or received while the association is terminating, until the Disconnect LS has been transmit. Add support for sending Disconnect Association LS after all I/O's complete (which is after ABTS'd certainly). Utilizes the new LLDD api to send ls requests. There is no need to track the Disconnect LS response or to retry after timeout. All spec requirements will have been met by waiting for i/o completion to initiate the transmission. Add support for tracking the reception of Disconnect Association and defering the response transmission until after the Disconnect Association LS has been transmit. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	349c694ee7	nvmet-fc: rename ls_list to ls_rcv_list In preparation to add ls request support, rename the current ls_list, which is RCV LS request only, to ls_rcv_list. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	58ab8ff9dc	nvmet-fc: track hostport handle for associations In preparation for sending LS requests for an association that terminates, save and track the hosthandle that is part of the LS's that are received to create associations. Support consists of: - Create a hostport structure that will be 1:1 mapped to a host port handle. The hostport structure is specific to a targetport. - Whenever an association is created, create a host port for the hosthandle the Create Association LS was received from. There will be only 1 hostport structure created, with all associations that have the same hosthandle sharing the hostport structure. - When the association is terminated, the hostport reference will be removed. After the last association for the host port is removed, the hostport will be deleted. - Add support for the new nvmet_fc_invalidate_host() interface. In the past, the LLDD didn't notify loss of connectivity to host ports - the LLD would simply reject new requests and wait for the kato timeout to kill the association. Now, when host port connectivity is lost, the LLDD can notify the transport. The transport will initiate the termination of all associations for that host port. When the last association has been terminated and the hosthandle will no longer be referenced, the new host_release callback will be made to the lldd. - For compatibility with prior behavior which didn't report the hosthandle: the LLDD must set hosthandle to NULL. In these cases, not LS request will be made, and no host_release callbacks will be made either. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	0dfb992e0e	nvmet-fc: perform small cleanups on unneeded checks While code reviewing saw a couple of items that can be cleaned up: - In nvmet_fc_delete_target_queue(), the routine unlocks, then checks and relocks. Reorganize to avoid the unlock/relock. - In nvmet_fc_delete_target_queue(), there's a check on the disconnect state that is unnecessary as the routine validates the state before starting any action. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	a5c2b4f633	nvmet-fc: add LS failure messages Add LS reception failure messages Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:34 -06:00
James Smart	14fd1e98af	nvme-fc: Add Disconnect Association Rcv support The nvme-fc host transport did not support the reception of a FC-NVME LS. Reception is necessary to implement full compliance with FC-NVME-2. Populate the LS receive handler, and specifically the handling of a Disconnect Association LS. The response to the LS, if it matched a controller, must be sent after the aborts for any I/O on any connection have been sent. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	ec3b0e3cc3	nvmet-fc: Update target for common definitions for LS handling Given that both host and target now generate and receive LS's create a single table definition for LS names. Each tranport half will have a local version of the table. Convert the target side transport to use the new common Create Association LS validation routine. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	fd5a5f2213	nvme-fc: Update header and host for common definitions for LS handling Given that both host and target now generate and receive LS's create a single table definition for LS names. Each tranport half will have a local version of the table. As Create Association LS is issued by both sides, and received by both sides, create common routines to format the LS and to validate the LS. Convert the host side transport to use the new common Create Association LS formatting routine. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	eb4ee8f125	nvme-fc: convert assoc_active flag to bit op Convert the assoc_active boolean flag to a bitop on the flags field. The bit ops will provide atomicity. To make this change, the flags field was converted to a long type, which also affects the FCCTRL_TERMIO flag. Both FCCTRL_TERMIO and now ASSOC_ACTIVE flags are set/cleared by bit operations. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	f56bf76f79	nvme-fc: Ensure private pointers are NULL if no data Ensure that when allocations are done, and the lldd options indicate no private data is needed, that private pointers will be set to NULL (catches driver error that forgot to set private data size). Slightly reorg the allocations so that private data follows allocations for LS request/response buffers. Ensures better alignments for the buffers as well as the private pointer. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	3b8281b02b	nvmet-fc: Better size LS buffers Current code uses NVME_FC_MAX_LS_BUFFER_SIZE (2KB) when allocating buffers for LS requests and responses. This is considerable overkill for what is actually defined. Rework code to have unions for all possible requests and responses and size based on the unions. Remove NVME_FC_MAX_LS_BUFFER_SIZE. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	ca19bcd086	nvme-fc nvmet-fc: refactor for common LS definitions Routines in the target will want to be used in the host as well. Error definitions should now shared as both sides will process requests and responses to requests. Moved common declarations to new fc.h header kept in the host subdirectory. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
James Smart	72e6329f86	nvme-fc and nvmet-fc: revise LLDD api for LS reception and LS request The current LLDD api has: nvme-fc: contains api for transport to do LS requests (and aborts of them). However, there is no interface for reception of LS's and sending responses for them. nvmet-fc: contains api for transport to do reception of LS's and sending of responses for them. However, there is no interface for doing LS requests. Revise the api's so that both nvme-fc and nvmet-fc can send LS's, as well as receiving LS's and sending their responses. Change name of the rcv_ls_req struct to better reflect generic use as a context to used to send an ls rsp. Specifically: nvmefc_tgt_ls_req -> nvmefc_ls_rsp nvmefc_tgt_ls_req.nvmet_fc_private -> nvmefc_ls_rsp.nvme_fc_private Change nvmet_fc_rcv_ls_req() calling sequence to provide handle that can be used by transport in later LS request sequences for an association. nvme-fc nvmet_fc nvme_fcloop: Revise to adapt to changed names in api header. Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle. Add stubs for new interfaces: host/fc.c: nvme_fc_rcv_ls_req() target/fc.c: nvmet_fc_invalidate_host() lpfc: Revise to adapt code to changed names in api header. Change calling sequence to nvmet_fc_rcv_ls_req() for hosthandle. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:18:33 -06:00
Sagi Grimberg	59c7c3caaa	nvme: fix possible hang when ns scanning fails during error recovery When the controller is reconnecting, the host fails I/O and admin commands as the host cannot reach the controller. ns scanning may revalidate namespaces during that period and it is wrong to remove namespaces due to these failures as we may hang (see `205da24343`). One command that may fail is nvme_identify_ns_descs. Since we return success due to having ns identify descriptor list optional, we continue to compare ns identifiers in nvme_revalidate_disk, obviously fail and return -ENODEV to nvme_validate_ns, which will remove the namespace. Exactly what we don't want to happen. Fixes: `22802bf742` ("nvme: Namepace identification descriptor list is optional") Tested-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:07:58 -06:00
Alexey Dobriyan	a8de663916	nvme-pci: fix "slimmer CQ head update" Pre-incrementing ->cq_head can't be done in memory because OOB value can be observed by another context. This devalues space savings compared to original code :-\ $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-32 (-32) Function old new delta nvme_poll_irqdisable 464 456 -8 nvme_poll 455 447 -8 nvme_irq 388 380 -8 nvme_dev_disable 955 947 -8 But the code is minimal now: one read for head, one read for q_depth, one increment, one comparison, single instruction phase bit update and one write for new head. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: John Garry <john.garry@huawei.com> Tested-by: John Garry <john.garry@huawei.com> Fixes: `e2a366a4b0` ("nvme-pci: slimmer CQ head update") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-05-09 16:07:58 -06:00
Niklas Cassel	132be62387	nvme: prevent double free in nvme_alloc_ns() error handling When jumping to the out_put_disk label, we will call put_disk(), which will trigger a call to disk_release(), which calls blk_put_queue(). Later in the cleanup code, we do blk_cleanup_queue(), which will also call blk_put_queue(). Putting the queue twice is incorrect, and will generate a KASAN splat. Set the disk->queue pointer to NULL, before calling put_disk(), so that the first call to blk_put_queue() will not free the queue. The second call to blk_put_queue() uses another pointer to the same queue, so this call will still free the queue. Fixes: `85136c0102` ("lightnvm: simplify geometry enumeration") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-27 17:08:06 +02:00
Linus Torvalds	8df2a0a6da	block-5.7-2020-04-10 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6QhDIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsE/EADOQ0xDMOa8EmzRvjuCkiaB9yK2zXiBSAj5 ZBi7ReownfXhCR7nVc8Bv1s2f00PD6CFNURXZmdgyDDrXEd2ojueDoAZNBk59t0e i2CAF2wLAQ5EfuVaxSHVEOrVEmtu+ue+Ix83JNlnGPd7pf9s7uKc/W4iKGpgpxIo 1CpXmWwm5RwjX4z/Qsiaka2lB7QojjImp1n3C+XI5+pp/bJXiftep1lxH5Y3nSWU iR4jO81uxDMxhTEZ9z2cb1HarhctKvnihcb39gQYQ/kYYu7hSZnBPZo5zp5Dyb/t 4tGuDsfXCQCbF0smkusUrcyeT19vh9tOsGkiMzJ/ihm7TMyN4fT23h6DUb/7pAON jnlcB7r5Ezs8jLz9i+mAoq06djd5u54kiuKFog8170sTrtYsncZbyc01wLNAla/V /6KX1sMbPlbXZ+a3l3i7i/gcCBJ7ci6pV3x2elvM9dKHxyqJmwEGMlFVwt4s26ev wS+7+dktLAC73889Zyn8LutA4bWy5FmisSPA4PydSUSOZA+7JjlbILcz15jjwlP2 HzYk+TXsd3yJUQRYX5P0FcDaBUTISr/xeUUB+KT1rLv4Lhtso+S/9cvSc8x5mOa9 989gmqNfFAWoj1nKEIKeRwLjk0b6YA9qMv4jOwwiuobsT55aBxpbP80huNoRVj5L xFIWgBSwzg== =3woC -----END PGP SIGNATURE----- Merge tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Here's a set of fixes that should go into this merge window. This contains: - NVMe pull request from Christoph with various fixes - Better discard support for loop (Evan) - Only call ->commit_rqs() if we have queued IO (Keith) - blkcg offlining fixes (Tejun) - fix (and fix the fix) for busy partitions" * tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block: block: fix busy device checking in blk_drop_partitions again block: fix busy device checking in blk_drop_partitions nvmet-rdma: fix double free of rdma queue blk-mq: don't commit_rqs() if none were queued nvme-fc: Revert "add module to ops template to allow module references" nvme: fix deadlock caused by ANA update wrong locking nvmet-rdma: fix bonding failover possible NULL deref loop: Better discard support for block devices loop: Report EOPNOTSUPP properly nvmet: fix NULL dereference when removing a referral nvme: inherit stable pages constraint in the mpath stack device blkcg: don't offline parent blkcg first blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it nvme-tcp: fix possible crash in recv error flow nvme-tcp: don't poll a non-live queue nvme-tcp: fix possible crash in write_zeroes processing nvmet-fc: fix typo in comment nvme-rdma: Replace comma with a semicolon nvme-fcloop: fix deallocation of working context nvme: fix compat address handling in several ioctls	2020-04-10 10:06:54 -07:00
Israel Rukshin	21f9024355	nvmet-rdma: fix double free of rdma queue In case rdma accept fails at nvmet_rdma_queue_connect(), release work is scheduled. Later on, a new RDMA CM event may arrive since we didn't destroy the cm-id and call nvmet_rdma_queue_connect_fail(), which schedule another release work. This will cause calling nvmet_rdma_free_queue twice. To fix this we implicitly destroy the cm_id with non-zero ret code, which guarantees that new rdma_cm events will not arrive afterwards. Also add a qp pointer to nvmet_rdma_queue structure, so we can use it when the cm_id pointer is NULL or was destroyed. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-07 18:33:45 +02:00
James Smart	8c5c660529	nvme-fc: Revert "add module to ops template to allow module references" The original patch was to resolve the lldd being able to be unloaded while being used to talk to the boot device of the system. However, the end result of the original patch is that any driver unload while a nvme controller is live via the lldd is now being prohibited. Given the module reference, the module teardown routine can't be called, thus there's no way, other than manual actions to terminate the controllers. Fixes: `863fbae929` ("nvme_fc: add module to ops template to allow module references") Cc: <stable@vger.kernel.org> # v5.4+ Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-04 09:09:39 +02:00
Sagi Grimberg	657f1975e9	nvme: fix deadlock caused by ANA update wrong locking The deadlock combines 4 flows in parallel: - ns scanning (triggered from reconnect) - request timeout - ANA update (triggered from reconnect) - I/O coming into the mpath device (1) ns scanning triggers disk revalidation -> update disk info -> freeze queue -> but blocked, due to (2) (2) timeout handler reference the g_usage_counter - > but blocks in the transport .timeout() handler, due to (3) (3) the transport timeout handler (indirectly) calls nvme_stop_queue() -> which takes the (down_read) namespaces_rwsem - > but blocks, due to (4) (4) ANA update takes the (down_write) namespaces_rwsem -> calls nvme_mpath_set_live() -> which synchronize the ns_head srcu (see commit `504db087aa`) -> but blocks, due to (5) (5) I/O came into nvme_mpath_make_request -> took srcu_read_lock -> direct_make_request > blk_queue_enter -> but blocked, due to (1) ==> the request queue is under freeze -> deadlock. The fix is making ANA update take a read lock as the namespaces list is not manipulated, it is just the ns and ns->head that are being updated (which is protected with the ns->head lock). Fixes: `0d0b660f21` ("nvme: add ANA support") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-04 09:07:03 +02:00
Sagi Grimberg	a032e4f6d6	nvmet-rdma: fix bonding failover possible NULL deref RDMA_CM_EVENT_ADDR_CHANGE event occur in the case of bonding failover on normal as well as on listening cm_ids. Hence this event will immediately trigger a NULL dereference trying to disconnect a queue for a cm_id that actually belongs to the port. To fix this we provide a different handler for the listener cm_ids that will defer a work to disable+(re)enable the port which essentially destroys and setups another listener cm_id Reported-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-04 09:06:55 +02:00
Linus Torvalds	79f51b7b9c	SCSI misc on 20200402 update changing all our txt files to rst ones. Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and some other minor updates. The major core update is Hannes moving functions out of the aacraid driver and into the core. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXoYKiyYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSasAP4iGwSB Y8tFaZgWadu76+wj5MdqTBoXdhnIuFF0rZG3pQEAiIKdsfQlbSFdm75+gUtx5hG/ GOilX/pJczTRJDCGNis= =g7Sk -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series has a huge amount of churn because it pulls in Mauro's doc update changing all our txt files to rst ones. Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc, zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and some other minor updates. The major core change is Hannes moving functions out of the aacraid driver and into the core" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits) scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code scsi: ufs: Do not rely on prefetched data scsi: dc395x: remove dc395x_bios_param scsi: libiscsi: Fix error count for active session scsi: hpsa: correct race condition in offload enabled scsi: message: fusion: Replace zero-length array with flexible-array member scsi: qedi: Add PCI shutdown handler support scsi: qedi: Add MFW error recovery process scsi: ufs: Enable block layer runtime PM for well-known logical units scsi: ufs-qcom: Override devfreq parameters scsi: ufshcd: Let vendor override devfreq parameters scsi: ufshcd: Update the set frequency to devfreq scsi: ufs: Resume ufs host before accessing ufs device scsi: ufs-mediatek: customize the delay for enabling host scsi: ufs: make HCE polling more compact to improve initialization latency scsi: ufs: allow custom delay prior to host enabling scsi: ufs-mediatek: use common delay function scsi: ufs: introduce common and flexible delay function scsi: ufs: use an enum for host capabilities scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc() ...	2020-04-02 17:03:53 -07:00
Sagi Grimberg	f0e656e4f2	nvmet: fix NULL dereference when removing a referral When item release is called, the parent is already null. We need the parent to pass to nvmet_referral_disable so hook it up to ->disconnect_notify. Reported-by: Tony Asleson <tasleson@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-02 10:51:56 +02:00
Sagi Grimberg	74e4d20e2f	nvme: inherit stable pages constraint in the mpath stack device If the backing device require stable pages, we need to set it on the stack mpath device as well. This applies to rdma/fc transports when doing data integrity and tcp transport calculating digests. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-02 10:51:46 +02:00
Sagi Grimberg	39d06079a5	nvme-tcp: fix possible crash in recv error flow If the target misbehaves and sends us unexpected payload we need to make sure to fail the controller and stop processing the input stream. We clear the rd_enabled flag and stop the io_work, but we may still requeue it if we still have pending sends and then in the next invocation we will process the input stream as the check is only in the .data_ready upcall. To fix this we need to make sure not to self-requeue io_work upon a recv flow error. This fixes the crash: nvme nvme2: receive failed: -22 BUG: unable to handle page fault for address: ffffbeb5816c3b48 nvme_ns_head_make_request: 29 callbacks suppressed block nvme0n5: no usable path - requeuing I/O block nvme0n5: no usable path - requeuing I/O block nvme0n7: no usable path - requeuing I/O block nvme0n7: no usable path - requeuing I/O block nvme0n3: no usable path - requeuing I/O block nvme0n3: no usable path - requeuing I/O block nvme0n3: no usable path - requeuing I/O block nvme0n7: no usable path - requeuing I/O block nvme0n3: no usable path - requeuing I/O block nvme0n3: no usable path - requeuing I/O #PF: supervisor read access inkernel mode #PF: error_code(0x0000) - not-present page PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0 Oops: 0000 [#1] SMP PTI CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015 Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp] RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp] RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246 RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008 RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40 RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900 R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000 R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798 FS: 0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: tcp_read_sock+0x8c/0x290 ? __release_sock+0x9d/0xe0 ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp] nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp] ? finish_task_switch+0x163/0x270 process_one_work+0x1fd/0x3f0 worker_thread+0x34/0x410 kthread+0x121/0x140 ? process_one_work+0x3f0/0x3f0 ? kthread_park+0xb0/0xb0 ret_from_fork+0x35/0x40 Reported-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-04-01 11:07:13 +02:00
Sagi Grimberg	f86e5bf817	nvme-tcp: don't poll a non-live queue In error recovery we might be removing the queue so check we can actually poll before we do. Reported-by: Mark Wunderlich <mark.wunderlich@intel.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-31 17:57:32 +02:00
Sagi Grimberg	25e5cb780e	nvme-tcp: fix possible crash in write_zeroes processing We cannot look at blk_rq_payload_bytes without first checking that the request has a mappable physical segments first (e.g. blk_rq_nr_phys_segments(rq) != 0) and only then to take the request payload bytes. This caused us to send a wrong sgl to the target or even dereference a non-existing buffer in case we actually got to the data send sequence (if it was in-capsule). Reported-by: Tony Asleson <tasleson@redhat.com> Suggested-by: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-31 17:57:28 +02:00
James Smart	d038dd815f	nvmet-fc: fix typo in comment Fix typo in comment: about should be abort Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chiatanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Himanshu Madhani <hmadhani@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2020-03-31 16:24:59 +02:00
Israel Rukshin	a62315b836	nvme-rdma: Replace comma with a semicolon Use a semicolon at the end of an assignment expression. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-31 16:20:08 +02:00
James Smart	38803fcffb	nvme-fcloop: fix deallocation of working context There's been a longstanding bug of LS completions which freed ls ops, particularly the disconnect LS, while executing on a work context that is in the memory being free. Not a good thing to do. Rework LS handling to make callbacks in the rport context rather than the ls_request context. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-31 16:19:48 +02:00
Nick Bowler	c95b708d5f	nvme: fix compat address handling in several ioctls On a 32-bit kernel, the upper bits of userspace addresses passed via various ioctls are silently ignored by the nvme driver. However on a 64-bit kernel running a compat task, these upper bits are not ignored and are in fact required to be zero for the ioctls to work. Unfortunately, this difference matters. 32-bit smartctl submits the NVME_IOCTL_ADMIN_CMD ioctl with garbage in these upper bits because it seems the pointer value it puts into the nvme_passthru_cmd structure is sign extended. This works fine on 32-bit kernels but fails on a 64-bit one because (at least on my setup) the addresses smartctl uses are consistently above 2G. For example: # smartctl -x /dev/nvme0n1 smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.5.11] (local build) Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Bad address Since changing 32-bit kernels to actually check all of the submitted address bits now would break existing userspace, this patch fixes the compat problem by explicitly zeroing the upper bits in the compat case. This enables 32-bit smartctl to work on a 64-bit kernel. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-03-31 16:19:11 +02:00
Linus Torvalds	1592614838	for-5.7/drivers-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJDYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplhMD/95jd4nlVetHAo54z+Zk2ExE13+yDamRKyh vc7t2tz1reqFOimtVr5aVuTXCTgOx4CpiIox5qcn6qAExN4JtCChOBRGize/0u8S ckxnhHbN2C0rfnGldvrYYeNRonFI+7QKimnurWUSYYGN0xqbo21BxJ7dFaohMseo q4K8sIW0ctE6AOlw28Jerkg614s2NDGZ7q1laheXnYHn5c9f1m0NaKN/jyTGgr0X TLBiLbX2yRrAuvpctBj6Fna6YN7Vdd9jsf2Bt6ipUI1XgHQoVUGMxQNhWPyjsbSv GzRQUNAfVcasLzCP/Mj/47144OkUtDDpn2mjeXDaFljLDGFULD+jp/SsOmLCxkPC gI7G2yfBvF96/SOyT0JXrLyMcBd1R2vRoASbc5tPu82mZhx7YJZH5WYtOB9h2gra RTYo3xcm0EoN6yeMaH+xOuXxTWWInIrgKPONW4H8s7hxEiMt5oFNVBI7vqPr4LVp tpfxiKZDavKOofKXogNV4W7mSMP/Ir5Q9Ha4g5SXHBGp0z/PHmnQ0xDGNq0KDnU4 eNO0UYCFNCNa+0AOhpNxaVuVm9LjrgvyXRjePgOZQ4akhohwHO6DLrHK1f8Hb1vD 8Ih6uR+F5zZlKsouWro8HLGYm5w40Wq9tbCI8QbPYH6nkGoDmzpPv9jbAeWgJU5c KqP/5TBSLA== =Bs4E -----END PGP SIGNATURE----- Merge tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - floppy driver cleanup series from Willy - NVMe updates and fixes (Various) - null_blk trace improvements (Chaitanya) - bcache fixes (Coly) - md fixes (via Song) - loop block size change optimizations (Martijn) - scnprintf() use (Takashi) * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits) null_blk: add trace in null_blk_zoned.c null_blk: add tracepoint helpers for zoned mode block: add a zone condition debug helper nvme: cleanup namespace identifier reporting in nvme_init_ns_head nvme: rename __nvme_find_ns_head to nvme_find_ns_head nvme: refactor nvme_identify_ns_descs error handling nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl nvme: Fix controller creation races with teardown flow nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl nvme: Fix ctrl use-after-free during sysfs deletion nvme-pci: Re-order nvme_pci_free_ctrl nvme: Remove unused return code from nvme_delete_ctrl_sync nvme: Use nvme_state_terminal helper nvme: release ida resources nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO nvmet-tcp: optimize tcp stack TX when data digest is used nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow nvme-multipath: do not reset on unknown status nvmet-rdma: allocate RW ctxs according to mdts ...	2020-03-30 11:43:51 -07:00
Linus Torvalds	10f36b1e80	for-5.7/block-2020-03-29 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJCoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvziEACqQC+QRKiqR6X5yaPWJ9LqjKE7lfI1PUb7 0a1z1mKuf8d6z0qNleUwdSOEaS5zJiswou2K8GLvEtTQH41QYsQkxc9GLjAyTveK szAyzZaa3BNUy9hkczm9i2arv3fI8XoTE3JvRM0e9wL8fBJDYCtKtHFJvF4hisOQ ydaJlU6tcwzd9bdV7K5dLwBxu3AeAJjzS3Tyfw25u9N9O/btUxJ91RTqBb2+Xeoz AVasfRlAqf/CzdjxCCmDgWE2QM4852pAeQ7UJJBGISNWNoiwkezMg+6HD0jEOLee bQ8uDyQdihIWTY+/zQasotX8/71uLV8QgtjWLXR9zrjrubIBWHGzoWSQ4kPg5DfQ bJmKO0VvWN2sshZEpWvzzAFGYxZViNphbK2Pb4hKOcv7jtMcC8mmEogh/7EqbD/n KB3IM9qVoXM8INm5o0dTy5uDRJxiHiHYkqsZaKz55BB/R4Geym5TINT3nXgxhQrn JoSwp4zdm3/NJOySruDi2eETqWJC2bsz3FsQSyCQTPOuP0nLtFKBb1UKHpmYTCXG H4LCyCKFJ6s006qBcdaNPZBw1mrSNwoxEulHnpYA4BFfPeXi72yrnMZQkdwWONpW LIVuD0hBm8X/pulbvEEdjzXBqZVkqK3xFX+uX5+bnwwaUKddXAC/h9SQKpBP2Mbb AeZToMklKw== =6Glq -----END PGP SIGNATURE----- Merge tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Online capacity resizing (Balbir) - Number of hardware queue change fixes (Bart) - null_blk fault injection addition (Bart) - Cleanup of queue allocation, unifying the node/no-node API (Christoph) - Cleanup of genhd, moving code to where it makes sense (Christoph) - Cleanup of the partition handling code (Christoph) - disk stat fixes/improvements (Konstantin) - BFQ improvements (Paolo) - Various fixes and improvements * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits) block: return NULL in blk_alloc_queue() on error block: move bio_map_* to blk-map.c Revert "blkdev: check for valid request queue before issuing flush" block: simplify queue allocation bcache: pass the make_request methods to blk_queue_make_request null_blk: use blk_mq_init_queue_data block: add a blk_mq_init_queue_data helper block: move the ->devnode callback to struct block_device_operations block: move the part_stat* helpers from genhd.h to a new header block: move block layer internals out of include/linux/genhd.h block: move guard_bio_eod to bio.c block: unexport get_gendisk block: unexport disk_map_sector_rcu block: unexport disk_get_part block: mark part_in_flight and part_in_flight_rw static block: mark block_depr static block: factor out requeue handling from dispatch code block/diskstats: replace time_in_queue with sum of request times block/diskstats: accumulate all per-cpu counters in one pass block/diskstats: more accurate approximation of io_ticks for slow disks ...	2020-03-30 11:20:13 -07:00
Christoph Hellwig	3d745ea5b0	block: simplify queue allocation Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-27 10:23:43 -06:00
Christoph Hellwig	43fcd9e1ea	nvme: cleanup namespace identifier reporting in nvme_init_ns_head Lift the common namespace identifier reporting between the shared namespace and new nshead cases into common code. This also means one less lock is held while doing I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Christoph Hellwig	026d2ef752	nvme: rename __nvme_find_ns_head to nvme_find_ns_head There is no non __-prefixed version, so make the name a little more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Christoph Hellwig	fb314eb0cb	nvme: refactor nvme_identify_ns_descs error handling Move the handling of an error into the function from the caller, and only do it for an actual error on the admin command itself, not the command parsing, as that should be enough to deal with devices claiming a bogus version compliance. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	bea54ef53f	nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl The transition to LIVE state should not fail in case of a new controller. Moving to DELETING state before nvme_tcp_create_ctrl() allocates all the resources may leads to NULL dereference at teardown flow (e.g., IO tagset, admin_q, connect_q). Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	96135862df	nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl The transition to LIVE state should not fail in case of a new controller. Moving to DELETING state before nvme_tcp_create_ctrl() allocates all the resources may leads to NULL dereference at teardown flow (e.g., IO tagset, admin_q, connect_q). Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	ce1518139e	nvme: Fix controller creation races with teardown flow Calling nvme_sysfs_delete() when the controller is in the middle of creation may cause several bugs. If the controller is in NEW state we remove delete_controller file and don't delete the controller. The user will not be able to use nvme disconnect command on that controller again, although the controller may be active. Other bugs may happen if the controller is in the middle of create_ctrl callback and nvme_do_delete_ctrl() starts. For example, freeing I/O tagset at nvme_do_delete_ctrl() before it was allocated at create_ctrl callback. To fix all those races don't allow the user to delete the controller before it was fully created. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	726612b6b8	nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl Put the ctrl reference count at nvme_uninit_ctrl as opposed to nvme_init_ctrl which takes it. This decrease the reference count at the core layer instead of decreasing it on each transport separately. Also move the call of nvme_uninit_ctrl at PCI driver after calling to nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference count after using the dev. This is safe because those functions use nvme_dev which is freed only later at nvme_pci_free_ctrl. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	b780d7415a	nvme: Fix ctrl use-after-free during sysfs deletion In case nvme_sysfs_delete() is called by the user before taking the ctrl reference count, the ctrl may be freed during the creation and cause the bug. Take the reference as soon as the controller is externally visible, which is done by cdev_device_add() in nvme_init_ctrl(). Also take the reference count at the core layer instead of taking it on each transport separately. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:56 +09:00
Israel Rukshin	253fd4ac80	nvme-pci: Re-order nvme_pci_free_ctrl Destroy the resources in the same order like in nvme_probe error flow to improve code readability. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Israel Rukshin	6721c18a06	nvme: Remove unused return code from nvme_delete_ctrl_sync The return code of nvme_delete_ctrl_sync is never used, so change it to void. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Israel Rukshin	e7c43feae2	nvme: Use nvme_state_terminal helper Improve code readability. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Max Gurtovoy	f41cfd5d0a	nvme: release ida resources ida instances allocate some internal memory in addition to the base 'struct ida'. Use ida_destroy() to release that memory at module_exit(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
masahiro31.yamada@kioxia.com	c225b61031	nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO Currently 32 bit application gets ENOTTY when it calls compat_ioctl with NVME_IOCTL_SUBMIT_IO in 64 bit kernel. The cause is that the results of sizeof(struct nvme_user_io), which is used to define NVME_IOCTL_SUBMIT_IO, are not same between 32 bit compiler and 64 bit compiler. * 32 bit: the result of sizeof nvme_user_io is 44. * 64 bit: the result of sizeof nvme_user_io is 48. 64 bit compiler seems to add 32 bit padding for multiple of 8 bytes. This patch adds a compat_ioctl handler. The handler replaces NVME_IOCTL_SUBMIT_IO32 with NVME_IOCTL_SUBMIT_IO in case 32 bit application calls compat_ioctl for submit in 64 bit kernel. Then, it calls nvme_ioctl as usual. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Masahiro Yamada (KIOXIA) <masahiro31.yamada@kioxia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Sagi Grimberg	e90d172b11	nvmet-tcp: optimize tcp stack TX when data digest is used If we have a 4-byte data digest to send to the wire, but we have more data to send, set MSG_MORE to tell the stack that more is coming. Reviewed-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Takashi Iwai	8d8a50e20d	nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow Since snprintf() returns the would-be-output size instead of the actual output size, the succeeding calls may go beyond the given buffer limit. Fix it by replacing with scnprintf(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
John Meneghini	764e933209	nvme-multipath: do not reset on unknown status The nvme multipath error handling defaults to controller reset if the error is unknown. There are, however, no existing nvme status codes that indicate a reset should be used, and resetting causes unnecessary disruption to the rest of IO. Change nvme's error handling to first check if failover should happen. If not, let the normal error handling take over rather than reset the controller. Based-on-a-patch-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Meneghini <johnm@netapp.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:55 +09:00
Max Gurtovoy	c363f249e7	nvmet-rdma: allocate RW ctxs according to mdts Current nvmet-rdma code allocates MR pool budget based on queue size, assuming both host and target use the same "max_pages_per_mr" count. After limiting the mdts value for RDMA controllers, we know the factor of maximum MR's per IO operation. Thus, make sure MR pool will be sufficient for the required IO depth and IO size. That is, say host's SQ size is 100, then the MR pool budget allocated currently at target will also be 100 MRs. But 100 IO WRITE Requests with 256 sg_count(IO size above 1MB) require 200 MRs when target's "max_pages_per_mr" is 128. Reported-by: Krishnamraju Eraparaju <krishna2@chelsio.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>	2020-03-26 04:51:55 +09:00
Max Gurtovoy	ec6d20e16c	nvmet-rdma: Implement get_mdts controller op Set the maximal data transfer size to be 1MB (currently mdts is unlimited). This will allow calculating the amount of MR's that one ctrl should allocate to fulfill it's capabilities. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>	2020-03-26 04:51:55 +09:00
Max Gurtovoy	02cb00e233	nvmet: Add get_mdts op for controllers Some transports, such as RDMA, would like to set the Maximum Data Transfer Size (MDTS) according to device/port/ctrl characteristics. This will enable the transport to set the optimal MDTS according to controller needs and device capabilities. Add a new nvmet transport op that is called during ctrl identification. This will not effect transports that don't implement this option. The return value of the new op is according to the NVMe spec definition for MDTS. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com>	2020-03-26 04:51:54 +09:00
Max Gurtovoy	2db24e4a22	nvme-pci: properly print controller address Align PCI address print with fabrics address that is printed with newline character. Before: [root@server40 linux]# cat /sys/class/nvme/nvme2/address 0000:0b:00.0[root@server40 linux]# After: [root@server40 linux]# cat /sys/class/nvme/nvme2/address 0000:0b:00.0 [root@server40 linux]# Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com>	2020-03-26 04:51:54 +09:00
Sagi Grimberg	761ad26c45	nvme-tcp: break from io_work loop if recv failed If we failed to receive data from the socket, don't try to further process it, we will for sure be handling a queue error at this point. While no issue was seen with the current behavior thus far, its safer to cease socket processing if we detected an error. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:51:45 +09:00
Sagi Grimberg	5ff4e11264	nvme-tcp: move send failure to nvme_tcp_try_send Consolidate the request failure handling code to where it is being fetched (nvme_tcp_try_send). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:50:21 +09:00
Sagi Grimberg	9cda34e374	nvmet-tcp: fix maxh2cdata icresp parameter MAXH2CDATA is not zero based. Also no reason to limit ourselves to 1M transfers as we can do more easily. Make this an arbitrary limit of 16M. Reported-by: Wenhua Liu <liuw@vmware.com> Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:49:35 +09:00
Sagi Grimberg	40510a639e	nvme-tcp: optimize queue io_cpu assignment for multiple queue maps Currently, queue io_cpu assignment is done sequentially for default, read and poll queues based on queue id. This causes miss-alignment between context of CPU initiating I/O and the I/O worker thread processing queued requests or completions. Change to modify queue io_cpu assignment to take into account queue maps offset. Each queue io_cpu will start at zero for each queue map. This essentially aligns read/poll queues to start over the same range as default queues. Testing performed by Mark with: - ram device (nvmet) - single CPU core (pinned) - 100% 4k reads - engine io_uring (not using sq_thread option) - hipri flag set Micro-benchmark results show a net gain of: - increase of 18%-29% in IOPs - reduction of 16%-22% in average latency - reduction of 7%-23% in 99.99% latency Baseline: ======== QDepth/Batch \| IOPs [k] \| Avg. Lat [us] \| 99.99% Lat [us] ----------------------------------------------------------------- 1/1 \| 32.4 \| 30.11 \| 50.94 32/8 \| 179 \| 168.20 \| 371 CPU alignment: ============= QDepth/Batch \| IOPs [k] \| Avg. Lat [us] \| 99.99% Lat [us] ----------------------------------------------------------------- 1/1 \| 38.5 \| 25.18 \| 39.16 32/8 \| 231 \| 130.75 \| 343 Reported-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:48:06 +09:00
Keith Busch	fa059b856a	nvme-pci: Simplify nvme_poll_irqdisable The timeout handler can use the existing nvme_poll() if it needs to check a polled queue, allowing nvme_poll_irqdisable() to handle only irq driven queues for the remaining callers. Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:48:06 +09:00
Keith Busch	324b494c28	nvme-pci: Remove two-pass completions Completion handling had been done in two steps: find all new completions under a lock, then handle those completions outside the lock. This was done to make the locked section as short as possible so that other threads using the same lock wait less time. The driver no longer shares locks during completion, and is in fact lockless for interrupt driven queues, so the optimization no longer serves its original purpose. Replace the two-pass completion queue handler with a single pass that completes entries immediately. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:48:06 +09:00
Keith Busch	bf392a5dc0	nvme-pci: Remove tag from process cq The only user for tagged completion was for timeout handling. That user, though, really only cares if the timed out command is completed, which we can safely check within the timeout handler. Remove the tag check to simplify completion handling. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:48:06 +09:00
Alexey Dobriyan	e2a366a4b0	nvme-pci: slimmer CQ head update Update CQ head with pre-increment operator. This saves subtraction of 1 and a few registers. Also update phase with "^= 1". This generates only one RMW instruction. ffffffff815ba150 <nvme_update_cq_head>: ffffffff815ba150: 0f b7 47 70 movzx eax,WORD PTR [rdi+0x70] ffffffff815ba154: 83 c0 01 add eax,0x1 ffffffff815ba157: 66 89 47 70 mov WORD PTR [rdi+0x70],ax ffffffff815ba15b: 66 3b 47 68 cmp ax,WORD PTR [rdi+0x68] ffffffff815ba15f: 74 01 je ffffffff815ba162 <nvme_update_cq_head+0x12> ffffffff815ba161: c3 ret ffffffff815ba162: 31 c0 xor eax,eax ffffffff815ba164: 80 77 74 01 ===> xor BYTE PTR [rdi+0x74],0x1 ffffffff815ba168: 66 89 47 70 mov WORD PTR [rdi+0x70],ax ffffffff815ba16c: c3 ret add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-119 (-119) Function old new delta nvme_poll 690 678 -12 nvme_dev_disable 1230 1177 -53 nvme_irq 613 559 -54 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2020-03-26 04:48:06 +09:00
Amit Engel	6d525f9755	nvmet: check ncqr & nsqr for set-features cmd For set feature command when setting up NVME_FEAT_NUM_QUEUES, check Number of I/O Completion Queues Requested (NCQR) and Number of I/O Submission Queues Requested (NSQR) before we proceed, for invalid values (i.e. 65535) return an appropriate NVMe invalid field status. Signed-off-by: Amit Engel <Amit.Engel@dell.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:47:03 +09:00
Josh Triplett	3e98c2443f	nvme: Check for readiness more quickly, to speed up boot time After initialization, nvme_wait_ready checks for readiness every 100ms, even though the drive may be ready far sooner than that. This delays system boot by hundreds of milliseconds. Reduce the delay, checking for readiness every millisecond instead. Boot-time tests on an AWS c5.12xlarge: Before: [ 0.546936] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs ... [ 0.764178] nvme nvme0: 2/0/0 default/read/poll queues [ 0.768424] nvme0n1: p1 [ 0.774132] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) [ 0.774146] VFS: Mounted root (ext4 filesystem) on device 259:1. ... [ 0.788141] Run /sbin/init as init process After: [ 0.537088] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs ... [ 0.543457] nvme nvme0: 2/0/0 default/read/poll queues [ 0.548473] nvme0n1: p1 [ 0.554339] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null) [ 0.554344] VFS: Mounted root (ext4 filesystem) on device 259:1. ... [ 0.567931] Run /sbin/init as init process Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:47:03 +09:00
Rupesh Girase	94d2e705b6	nvme: log additional message for controller status Log the controller status to know more about issue if it lies within kernel nvme subsytem or controller is unhealthy. Signed-off-by: Rupesh Girase <rgirase@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulakrni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:47:03 +09:00
Chaitanya Kulkarni	ad95a613ea	nvme: code cleanup nvme_identify_ns_desc() The function nvme_identify_ns_desc() has 3 levels of nesting which make error message to exceeded > 80 char per line which is not aligned with the kernel code standards and rest of the NVMe subsystem code. Add a helper function to move the processing of the log when the command is successful by reducing the nesting and keeping the code < 80 char per line. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:45:25 +09:00
Jean Delvare	228914504c	nvme: Don't deter users from enabling hwmon support I see no good reason for the "If unsure, say N" advice in the description of the NVME_HWMON configuration option. It is not dangerous, it does not select any other option, and has a fairly low overhead. As the option is already not enabled by default, further suggesting hesitant users to not enable it is not useful anyway. Unlike some other options where the description alone may not be sufficient for users to make a decision, NVME_HWMON is pretty simple to grasp in my opinion, so just let the user do what they want. Signed-off-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:45:25 +09:00
Sagi Grimberg	45fb19f766	nvme: expose hostid via sysfs for fabrics controllers We allow userspace to connect with a custom hostid which is useful for certain use-cases. However there is is no way to tell what is the hostid used to connect to a given controller. Expose this so userspace can correlate controllers based on hostid. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:45:12 +09:00
Sagi Grimberg	76171c6cdf	nvme: expose hostnqn via sysfs for fabrics controllers We allow userspace to connect with a custom hostnqn which is useful for certain use-cases. However there is no way to tell what is the hostnqn used to connect to a given controller. Expose this so userspace can correlate controllers based on hostnqn. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-26 04:44:35 +09:00
Christoph Hellwig	c6a564ffad	block: move the part_stat* helpers from genhd.h to a new header These macros are just used by a few files. Move them out of genhd.h, which is included everywhere into a new standalone header. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-25 09:50:09 -06:00
Sagi Grimberg	98fd5c7237	nvmet-tcp: set MSG_MORE only if we actually have more to send When we send PDU data, we want to optimize the tcp stack operation if we have more data to send. So when we set MSG_MORE when: - We have more fragments coming in the batch, or - We have a more data to send in this PDU - We don't have a data digest trailer - We optimize with the SUCCESS flag and omit the NVMe completion (used if sq_head pointer update is disabled) This addresses a regression in QD=1 with SUCCESS flag optimization as we unconditionally set MSG_MORE when we didn't actually have more data to send. Fixes: `7058329538` ("nvmet-tcp: implement C2HData SUCCESS optimization") Reported-by: Mark Wunderlich <mark.wunderlich@intel.com> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-21 04:37:53 +09:00
Balbir Singh	cb224c3af4	nvme: Convert to use set_capacity_revalidate_and_notify block/genhd provides set_capacity_revalidate_and_notify() for sending RESIZE notifications via uevents. This notification is newly added to NVME devices Signed-off-by: Balbir Singh <sblbir@amazon.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-03-18 15:13:21 -06:00
Bart Van Assche	a7afff31d5	scsi: treewide: Consolidate {get,put}_unaligned_[bl]e24() definitions Move the get_unaligned_be24(), get_unaligned_le24() and put_unaligned_le24() definitions from various drivers into include/linux/unaligned/generic.h. Add a put_unaligned_be24() implementation. Link: https://lore.kernel.org/r/20200313203102.16613-4-bvanassche@acm.org Cc: Keith Busch <kbusch@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jens Axboe <axboe@fb.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> # For drivers/usb Reviewed-by: Felipe Balbi <balbi@kernel.org> # For drivers/usb/gadget Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2020-03-16 22:08:34 -04:00
Prabhath Sajeepa	9134ae2a25	nvme-rdma: Avoid double freeing of async event data The timeout of identify cmd, which is invoked as part of admin queue creation, can result in freeing of async event data both in nvme_rdma_timeout handler and error handling path of nvme_rdma_configure_admin queue thus causing NULL pointer reference. Call Trace: ? nvme_rdma_setup_ctrl+0x223/0x800 [nvme_rdma] nvme_rdma_create_ctrl+0x2ba/0x3f7 [nvme_rdma] nvmf_dev_write+0xa54/0xcc6 [nvme_fabrics] __vfs_write+0x1b/0x40 vfs_write+0xb2/0x1b0 ksys_write+0x61/0xd0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-10 13:57:39 -07:00
Wunderlich, Mark	43cc66892e	nvmet-tcp: set SO_PRIORITY for accepted sockets Enable ability to associate all sockets related to NVMf TCP traffic to a priority group that will perform optimized network processing for this traffic class. Maintain initial default behavior of using priority of zero. Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:09 -08:00
Wunderlich, Mark	9912ade355	nvme-tcp: Set SO_PRIORITY for all host sockets Enable ability to associate all sockets related to NVMf TCP traffic to a priority group that will perform optimized network processing for this traffic class. Maintain initial default behavior of using priority of zero. Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Chaitanya Kulkarni	d3a9b0cadf	nvmet: check sscanf value for subsys serial attr For nvmet in configfs.c we check return values for all the sscanf() calls. Add similar check into the nvmet_subsys_attr_serial_store(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Mark Ruijter	013b7ebe5a	nvmet: make ctrl model configurable This patch adds a new target subsys attribute which allows user to optionally specify model name which then used in the nvmet_execute_identify_ctrl() to fill up the nvme_id_ctrl structure. The default value for the model is set to "Linux" for backward compatibility. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Mark Ruijter <MRuijter@onestopsystems.com> [chaitanya.kulkarni@wdc.com Use macro for default model, coding style fixes. Use RCU for accessing model in for configfs and in nvmet_execute_identify_ctrl(). ] Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Chaitanya Kulkarni	94a39d61f8	nvmet: make ctrl-id configurable This patch adds a new target subsys attribute which allows user to optionally specify target controller IDs which then used in the nvmet_execute_identify_ctrl() to fill up the nvme_id_ctrl structure. For example, when using a cluster setup with two nodes, with a dual ported NVMe drive and exporting the drive from both the nodes, The connection to the host fails due to the same controller ID and results in the following error message:- "nvme nvmeX: Duplicate cntlid XXX with nvmeX, rejecting" With this patch now user can partition the controller IDs for each subsystem by setting up the cntlid_min and cntlid_max. These values will be used at the time of the controller ID creation. By partitioning the ctrl-ids for each subsystem results in the unique ctrl-id space which avoids the collision. When new attribute is not specified target will fall back to original cntlid calculation method. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Chaitanya Kulkarni	527123c7de	nvmet: configfs code cleanup This is a pure code cleanup patch which does not change any functionality. This patch removes the extra lines, get rid of else which is duplicate for return. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Edmund Nadolski	adce7e9856	nvme: remove unused return code from nvme_alloc_ns The return code of nvme_alloc_ns is never used, so change it to void. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-03-04 09:09:08 -08:00
Bijan Mottahedeh	9515743bfb	nvme-pci: Hold cq_poll_lock while completing CQEs Completions need to consumed in the same order the controller submitted them, otherwise future completion entries may overwrite ones we haven't handled yet. Hold the nvme queue's poll lock while completing new CQEs to prevent another thread from freeing command tags for reuse out-of-order. Fixes: `dabcefab45` ("nvme: provide optimized poll function for separate poll queues") Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-28 01:32:14 +09:00
Linus Torvalds	f6c69b7f51	block-5.6-2020-02-22 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5RXlAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpt6LD/9+QcLDwom/y76ZKM+sqFUd1cIozbuVbP0E 35vrMuANLryLhKXxo5rZ2fqIyuhD8IzvvcztLP53aGdi8NEfiDK6KdUz4rmioA8I htVxiAXf5jNFM/6eVP3APKn5H7QG4Q0CiLiXMrSajbxgZh70CtsBsZODxmHSO9L5 +H3ew6rCcE29U+bOsSYDTwPRMGWOX8FBKfgX8KWPql5uqzNAklA8TUizv7H97NIj wuFbeIiT/JfypFh7ahu6k2JNE+gB6Fchbbt3I52/8tjWWMsTkrgz2bj0JjrFkrsj wlhyv5aK8p7QSpPDtqFZN2W4pT5IaqWrgjbxR4KEpCu0G3f80LiHD3Ck8taJ311V 4dSd7oQ+XpeEa7S/iWGDc90pQbbqCJN9NiniFBs/aTxu368leKDmP7YTnNfaNqcR 9mpgzgK+Jcz/w2ub0L8LYBmZLhoDdqZFAjrAzFDq+pJWZrTbtzw1HGBMXf8KZqhs bdNiITD1lVW8h2v7FoalrCEmn7vfsteV7eFPJ96DPBVAUp3yBJsZUk1q4iPcX3xG gGIU0DdDOPTZoCfsLlHR4mfG3mW7QkAI50MNECj9Qe0h53TutO5GVbLOgHzwe1nu p6irFNyCiFJ9HkFfUAUG1D8xUafb5i0Xvvnf8TLbkgImlZKTK+RuSfGsMlHLlZnm 80OCm8s8Lg== =s3i8 -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Just a set of NVMe fixes via Keith" * tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block: nvme-multipath: Fix memory leak with ana_log_buf nvme: Fix uninitialized-variable warning nvme-pci: Use single IRQ vector for old Apple models nvme/pci: Add sleep quirk for Samsung and Toshiba drives	2020-02-22 11:09:06 -08:00
Logan Gunthorpe	3b7830904e	nvme-multipath: Fix memory leak with ana_log_buf kmemleak reports a memory leak with the ana_log_buf allocated by nvme_mpath_init(): unreferenced object 0xffff888120e94000 (size 8208): comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000e2360188>] kmalloc_order+0x97/0xc0 [<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100 [<00000000f50c0406>] __kmalloc+0x24c/0x2d0 [<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0 [<000000005802589e>] nvme_init_identify+0x75f/0x1600 [<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280 [<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710 [<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9 [<000000004199f8d0>] __vfs_write+0x50/0xa0 [<0000000065466fef>] vfs_write+0xf3/0x280 [<00000000b0db9a8b>] ksys_write+0xc6/0x160 [<0000000082156b91>] __x64_sys_write+0x43/0x50 [<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0 [<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe nvme_mpath_init() is called by nvme_init_identify() which is called in multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This means nvme_mpath_init() may be called multiple times before nvme_mpath_uninit() (which is only called on nvme_free_ctrl()). When nvme_mpath_init() is called multiple times, it overwrites the ana_log_buf pointer with a new allocation, thus leaking the previous allocation. To fix this, free ana_log_buf before allocating a new one. Fixes: `0d0b660f21` ("nvme: add ANA support") Cc: <stable@vger.kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-21 23:52:25 +09:00
Keith Busch	15755854d5	nvme: Fix uninitialized-variable warning gcc may detect a false positive on nvme using an unintialized variable if setting features fails. Since this is not a fast path, explicitly initialize this variable to suppress the warning. Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-20 01:40:57 +09:00
Andy Shevchenko	98f7b86a0b	nvme-pci: Use single IRQ vector for old Apple models People reported that old Apple machines are not working properly if the non-first IRQ vector is in use. Set quirk for that models to limit IRQ to use first vector only. Based on original patch by GitHub user npx001. Link: https://github.com/Dunedan/mbp-2016-linux/issues/9 Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Leif Liddy <leif.liddy@gmail.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-20 00:30:58 +09:00
Shyjumon N	1fae37accf	nvme/pci: Add sleep quirk for Samsung and Toshiba drives The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo C640 platform experience runtime resume issues when the SSDs are kept in sleep/suspend mode for long time. This patch applies the 'Simple Suspend' quirk to these configurations. With this patch, the issue had not been observed in a 1+ day test. Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shyjumon N <shyjumon.n@intel.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-20 00:29:39 +09:00
Linus Torvalds	e29c6a13dd	block-5.6-2020-02-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5JdgMQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpg6eEAC7/f/ATcrAbQUjx8vsNS0UgtYB1LPdgga6 7MwYKnrzdWWZICOxOgRR4W//wXFRf3vUtmzF5MgGpzJzgeKuyv09Vz1PLcxkAcny hu9NTQWu6xtxEGiYlfaSW7EQBRdb1876kzOyV+hTrkvQZ9CXcIAQHwjQPKqjqdlb /j3l5GSjyO0npvsCqIWrsNoeSwfDOhFi+I3hHZutt3T0fPnjo6DGtXN7m4jhWzW/ U3S4yHxmLVDKzorDDMoTV2D8UGrpjQmRXD78QOpOJKO7ngr9OT69dcMH6TnDkUDb a7O5l5k5DB+0QOQWk5IHJpMRRp3NdVHgnZhTJZaho8kuKoTLwteJFAprJTzHuuvC BkTBYf1Ref9KT5YfaUcWI+rmq32LgRq8rRaJBF+vqGcOwJw6WRIuygGyZ8eJMItb oOhsFEOKKDAILncxg5C61v2Sh4g+/YpQojyVCzJSb7UNjOMtDPgjEp2dDSa+/p84 detTqmlt5ZPYHFvsp/EHuGB6ohscsvKzZk+wlQvj4Wq3T7agSp9uljfiTmUdsmWn 69fs66HBvd3CNoOauSVXvI+rhdsgTMX0ptnjIiPYwE0RQtxptp+rPjOfuT4JdboM AU8f0VKeer1kRPxVbka9YwgVrUw5JqgFkPu2aC9B+ao+4IAM96n6lJje/rC59zkf eBclcnOlsQ== =jy10 -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Not a lot here, which is great, basically just three small bcache fixes from Coly, and four NVMe fixes via Keith" * tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block: nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info nvme/pci: move cqe check after device shutdown nvme: prevent warning triggered by nvme_stop_keep_alive nvme/tcp: fix bug on double requeue when send fails bcache: remove macro nr_to_fifo_front() bcache: Revert "bcache: shrink btree node cache after bch_btree_check()" bcache: ignore pending signals when creating gc and allocator thread	2020-02-16 12:35:52 -08:00
Yi Zhang	f25372ffc3	nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info nvme fw-activate operation will get bellow warning log, fix it by update the parameter order [ 113.231513] nvme nvme0: Get FW SLOT INFO log error Fixes: `0e98719b0e` ("nvme: simplify the API for getting log pages") Reported-by: Sujith Pandel <sujith_pandel@dell.com> Reviewed-by: David Milburn <dmilburn@redhat.com> Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-14 10:12:04 -07:00
Keith Busch	fa46c6fb5d	nvme/pci: move cqe check after device shutdown Many users have reported nvme triggered irq_startup() warnings during shutdown. The driver uses the nvme queue's irq to synchronize scanning for completions, and enabling an interrupt affined to only offline CPUs triggers the alarming warning. Move the final CQE check to after disabling the device and all registered interrupts have been torn down so that we do not have any IRQ to synchronize. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509 Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-14 10:12:04 -07:00
Nigel Kirkland	97b2512ad0	nvme: prevent warning triggered by nvme_stop_keep_alive Delayed keep alive work is queued on system workqueue and may be cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq. Check_flush_dependency detects mismatched attributes between the work-queue context used to cancel the keep alive work and system-wq. Specifically system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used to cancel keep alive work have WQ_MEM_RECLAIM flag. Example warning: workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc] is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core] To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq. However this creates a secondary concern where work and a request to cancel that work may be in the same work queue - namely err_work in the rdma and tcp transports, which will want to flush/cancel the keep alive work which will now be on nvme_wq. After reviewing the transports, it looks like err_work can be moved to nvme_reset_wq. In fact that aligns them better with transition into RESETTING and performing related reset work in nvme_reset_wq. Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq. Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-14 10:12:04 -07:00
Anton Eidelman	2d570a7c02	nvme/tcp: fix bug on double requeue when send fails When nvme_tcp_io_work() fails to send to socket due to connection close/reset, error_recovery work is triggered from nvme_tcp_state_change() socket callback. This cancels all the active requests in the tagset, which requeues them. The failed request, however, was ended and thus requeued individually as well unless send returned -EPIPE. Another return code to be treated the same way is -ECONNRESET. Double requeue caused BUG_ON(blk_queued_rq(rq)) in blk_mq_requeue_request() from either the individual requeue of the failed request or the bulk requeue from blk_mq_tagset_busy_iter(, nvme_cancel_request, ); Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-02-14 10:12:04 -07:00
Linus Torvalds	ed535f2c9e	block-5.6-2020-02-05 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl47ML4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvm2EACGaxAxP7pLniNV30cRotF8lPpQ5nUrpiem H1r5WqeI5osCGkRKHaJQ4O0Sw8IV2pWzHTWz+9bv56zLM40yIMaEHLRU00AM047n KFdA2x4xH+HhbR9lF+flYz1oInlIEXxPiERKm/p1pvQEbquzi4X5cQqv6q2pdzJ9 sf8OBJhKs4rp/ooqzWwjVOeP/n1sT2r+XDg9C9WC5aXaVZbbLw50r1WRYFt1zf7N oa+91fq2lasxK1c79OtbbGJlBXWTurAtUaKBM0KKPguiH2h9j47pAs0HsV02kZ2M 1ZltwKTyfDNMzBEgvkdB3R0G9nU422nIF+w319i6on8P8xfz8Px13d1KCQGAmfD6 K1YuaCgOjWuVhOKpMwBq9ql6QVP+1LIMKIl2OGJkrBgl9ZzfE8KMZa2QZTGrGO/U xE/hirYdj5T1O8umUQ4cmZHTROASOJZ8/eU9XHA1vf/eJYXiS31/4ewgRzP3oGX2 5Jvz3o144nBeBTOiFlzs3Fe+wX63QABNG22bijzEGoNTxjXJFroBDYzeiOELjECZ /xGRZG1bLOGMj8Gg4ZADSILQDkqISsQHofl1I9mWTbwB1j7g69ZjV8Ie2dyMaX6b 5z5Smqzd9gcok9hr8NGWkV3c3NypPxIWxrOcyzYbGLUPDGqa+QjGtlLrGgeinhLM SitalHw0KA== =05d8 -----END PGP SIGNATURE----- Merge tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later arrivals, but all fixes at this point: - bcache fix series (Coly) - Series of BFQ fixes (Paolo) - NVMe pull request from Keith with a few minor NVMe fixes - Various little tweaks" * tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits) nvmet: update AEN list and array at one place nvmet: Fix controller use after free nvmet: Fix error print message at nvmet_install_queue function brd: check and limit max_part par nvme-pci: remove nvmeq->tags nvmet: fix dsm failure when payload does not match sgl descriptor nvmet: Pass lockdep expression to RCU lists block, bfq: clarify the goal of bfq_split_bfqq() block, bfq: get a ref to a group when adding it to a service tree block, bfq: remove ifdefs from around gets/puts of bfq groups block, bfq: extend incomplete name of field on_st block, bfq: get extra ref to prevent a queue from being freed during a group move block, bfq: do not insert oom queue into position tree block, bfq: do not plug I/O for bfq_queues with no proc refs bcache: check return value of prio_read() bcache: fix incorrect data type usage in btree_flush_write() bcache: add readahead cache policy options via sysfs interface bcache: explicity type cast in bset_bkey_last() bcache: fix memory corruption in bch_cache_accounting_clear() xen/blkfront: limit allocated memory size to actual use case ...	2020-02-06 06:15:23 +00:00
Daniel Wagner	0f5be6a4ff	nvmet: update AEN list and array at one place All async events are enqueued via nvmet_add_async_event() which updates the ctrl->async_event_cmds[] array and additionally an struct nvmet_async_event is added to the ctrl->async_events list. Under normal operations the nvmet_async_event_work() updates again the ctrl->async_event_cmds and removes the corresponding struct nvmet_async_event from the list again. Though nvmet_sq_destroy() could be called which calls nvmet_async_events_free() which only updates the ctrl->async_event_cmds[] array. Add new functions nvmet_async_events_process() and nvmet_async_events_free() to process async events, update an array and the list. When we destroy submission queue after clearing the aen present on the ctrl->async list we also loop over ctrl->async_event_cmds[] for any requests posted by the host for which we don't have the AEN in the ctrl->async_events list by calling nvmet_async_event_process() and nvmet_async_events_free(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> [chaitanya.kulkarni@wdc.com * Loop over and clear out outstanding requests * Update changelog ] Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-05 01:56:10 +09:00
Israel Rukshin	1a3f540d63	nvmet: Fix controller use after free After nvmet_install_queue() sets sq->ctrl calling to nvmet_sq_destroy() reduces the controller refcount. In case nvmet_install_queue() fails, calling to nvmet_ctrl_put() is done twice (at nvmet_sq_destroy and nvmet_execute_io_connect/nvmet_execute_admin_connect) instead of once for the queue which leads to use after free of the controller. Fix this by set NULL at sq->ctrl in case of a failure at nvmet_install_queue(). The bug leads to the following Call Trace: [65857.994862] refcount_t: underflow; use-after-free. [65858.108304] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma] [65858.115557] RIP: 0010:refcount_warn_saturate+0xe5/0xf0 [65858.208141] Call Trace: [65858.211203] nvmet_sq_destroy+0xe1/0xf0 [nvmet] [65858.216383] nvmet_rdma_release_queue_work+0x37/0xf0 [nvmet_rdma] [65858.223117] process_one_work+0x167/0x370 [65858.227776] worker_thread+0x49/0x3e0 [65858.232089] kthread+0xf5/0x130 [65858.235895] ? max_active_store+0x80/0x80 [65858.240504] ? kthread_bind+0x10/0x10 [65858.244832] ret_from_fork+0x1f/0x30 [65858.249074] ---[ end trace f82d59250b54beb7 ]--- Fixes: `bb1cc74790` ("nvmet: implement valid sqhd values in completions") Fixes: `1672ddb8d6` ("nvmet: Add install_queue callout") Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-05 01:13:09 +09:00
Israel Rukshin	0b87a2b795	nvmet: Fix error print message at nvmet_install_queue function Place the arguments in the correct order. Fixes: `1672ddb8d6` ("nvmet: Add install_queue callout") Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-05 01:13:06 +09:00
Christoph Hellwig	cfa27356f8	nvme-pci: remove nvmeq->tags There is no real need to have a pointer to the tagset in struct nvme_queue, as we only need it in a single place, and that place can derive the used tagset from the device and qid trivially. This fixes a problem with stale pointer exposure when tagsets are reset, and also shrinks the nvme_queue structure. It also matches what most other transports have done since day 1. Reported-by: Edmund Nadolski <edmund.nadolski@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-04 03:00:25 +09:00
Sagi Grimberg	b716e6889c	nvmet: fix dsm failure when payload does not match sgl descriptor The host is allowed to pass the controller an sgl describing a buffer that is larger than the dsm payload itself, allow it when executing dsm. Reported-by: Dakshaja Uppalapati <dakshaja@chelsio.com> Reviewed-by: Christoph Hellwig <hch@lst.de>, Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-04 03:00:24 +09:00
Amol Grover	4ac76436a6	nvmet: Pass lockdep expression to RCU lists ctrl->subsys->namespaces and subsys->namespaces are traversed with list_for_each_entry_rcu outside an RCU read-side critical section but under the protection of ctrl->subsys->lock and subsys->lock respectively. Hence, add the corresponding lockdep expression to the list traversal primitive to silence false-positive lockdep warnings, and harden RCU lists. Reported-by: kbuild test robot <lkp@intel.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2020-02-04 03:00:24 +09:00
Akinobu Mita	7724cd2bff	nvme: hwmon: switch to use <linux/units.h> helpers This switches the nvme driver to use kelvin_to_millicelsius() and millicelsius_to_kelvin() in <linux/units.h>. Link: http://lkml.kernel.org/r/1576386975-7941-8-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Sujith Thomas <sujith.thomas@intel.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Andy Shevchenko <andy@infradead.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Amit Kucheria <amit.kucheria@verdurent.com> Cc: Jean Delvare <jdelvare@suse.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Cc: Hartmut Knaack <knaack.h@gmx.de> Cc: Johannes Berg <johannes.berg@intel.com> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Luca Coelho <luciano.coelho@intel.com> Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-01-31 10:30:40 -08:00
Linus Torvalds	48b4b4ff1e	for-5.6/block-2020-01-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl4vOqAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppYBD/wLczY7hyjF2loc71MC9HloUq3BVbATktM3 OF6wRbyxbeiOj/7Px0lE0M67tQbnEoIP26gS03fd6e7HE19//gmzGuB3Z2R2CJ5q XKkTamqz0pcPcX5FdDO5JFQZf27/1Qs3g7Nkr7FjVcR2XQ8PFv5B/FLMhse4frJI k92Sj0V1OwdNtMXozKqno/7xPwL/kQKWoF6aFDgO27xLfsFmi8Wbgf/CslOOTHIN vAUaz3Cue6V17M5y98wD4nwpjG7Ve+aY1i6oFPBE7Az9TA0xoiBA/tNPKW7iS10C GEP1aoI6lpgkxAzvyR29K1ayjzV11hEIig3rNIWxNfmCSGaawttWXAPEi7jU5u2D ZXbzUJxKnfeg8yrAj0CTcKLA9i4v1cZXPCUXqMO2+wHEWgmxq2IWuWjSl/V4fn3Y zgTPBngDM4Gx3fAqvD8SVfCW7xwI4VRP+da58WCFOjwnOgYSouxS7RnCtm+yPUbk Es6m2XBb+3ycaJPT58LcXPrnTJWZeRincs3MfFJeTXRn5T7IzlBjKdIvQiQSHQXo caZzWHEJW827+wfQFNreXpk5KPi+D6boeziYe96UcII8L5qVw3N0X5hOpr6IRhkX hn2CUb/CmY6bl8PJJPVc4ygqgiavyvynJu+A0uJvFSjvXX6jjXNEsSJ6bz8aBxdm 4rmgPFTlqA== =yJZi -----END PGP SIGNATURE----- Merge tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "This may be the most quiet round we've had in years. I'm not complaining. Really not a lot to detail here, outside of spelling and documentation improvements/fixes, we have: - Allow t10-pi to be modular (Herbert) - Remove dead code in bfq (Alex) - Mark zone management requests with REQ_SYNC (Chaitanya) - BFQ division improvement (Wen) - Small series improving plugging (Pavel)" * tag 'for-5.6/block-2020-01-27' of git://git.kernel.dk/linux-block: partitions/ldm: fix spelling mistake "to" -> "too" block, bfq: improve arithmetic division in bfq_delta() block/bfq: remove unused bfq_class_rt which never used block: mark zone-mgmt bios with REQ_SYNC blk-mq: Document functions for sending request block: Allow t10-pi to be modular blk-mq: optimise blk_mq_flush_plug_list() list: introduce list_for_each_continue() blk-mq: optimise rq sort function	2020-01-27 12:38:25 -08:00
Amit Engel	e17016f6dc	nvmet: fix per feat data len for get_feature The existing implementation for the get_feature admin-cmd does not use per-feature data len. This patch introduces a new helper function nvmet_feat_data_len(), which is used to calculate per feature data len. Right now we only set data len for fid 0x81 (NVME_FEAT_HOST_ID). Fixes: commit `e9061c3978` ("nvmet: Remove the data_len field from the nvmet_req struct") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amit Engel <amit.engel@dell.com> [endiness, naming, and kernel style fixes] Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-10 08:55:50 -07:00
Keith Busch	35038bffa8	nvme: Translate more status codes to blk_status_t Decode interrupted command and not ready namespace nvme status codes to BLK_STS_TARGET. These are not generic IO errors and should use a non-path specific error so that it can use the non-failover retry path. Reported-by: John Meneghini <John.Meneghini@netapp.com> Cc: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-10 08:55:50 -07:00
Herbert Xu	a754bd5f18	block: Allow t10-pi to be modular Currently t10-pi can only be built into the block layer which via crc-t10dif pulls in a whole chunk of the Crypto API. In fact all users of t10-pi work as modules and there is no reason for it to always be built-in. This patch adds a new hidden option for t10-pi that is selected automatically based on BLK_DEV_INTEGRITY and whether the users of t10-pi are built-in or not. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-01-06 20:59:04 -07:00
Linus Torvalds	f1fcd7786e	for-linus-20191212 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y54EQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqJuD/93LZmzS5UEWrNLkRaAaCyAy40MPxuXRZEp 42yk7cvAT4OcCr+W6nkAgG6IHGRXOz8QvOzt0P5/HfugpNlB2oz5a/6+TiTtcZTt YNt0Z4yuBMU5SXIIxc3lUMcJGxslzOr+L+9ZXD4u5UqIdG1fSrECAexSCrlmmTwu Fx02TakDc/bbUYDfLAQD1+/Z066rp1ZWDkjXqA4kUvbFzt8F7qEOc1Evq47SuR7d Iw0bM3LVASXwTq2lRc1bFFL2glku6wwkccjwdyjSrQmK4+8LhF396fQGtXuj0Mrs OzuWhaOoGhan7dpj1D8e4tqugflQy9rv9bcy6Z9PjBY+VauuFdgPr3iFcwPaPbXm 17ir4y7xJJxXlhZl/Bn06KIB2h+nLWDIaundFys5JnMmTiZvWIgSJ6Q3gWtMxgfH zWZLMw/UtRAmjHhLqvGsMaBTfgKX5ATpMbfGeZeXheVtVaOgGTunXunT56o7oRHB q4XWZqbydsYyHBUhgSzhBr03i67wbotxtebqg9VZ0UD8XM4iM8Kor/DleK03oUqD DsltKF66NAGNeOcV3TNzJuXHyF6S/vZdO7JdFHY29+pdljoTj5GB88+W9CbhwQRe WiKVpq7sAe/bh0wtqrD+QCByjSNSVU62kVgRhfqms47804j/vNqNvOKaC5UWTd0I 2LG4jfSbeg== =hmxJ -----END PGP SIGNATURE----- Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - stable fix for the bi_size overflow. Not a corruption issue, but a case wher we could merge but disallowed (Andreas) - NVMe pull request via Keith, with various fixes. - MD pull request from Song. - Merge window regression fix for the rq passthrough stats (Logan) - Remove unused blkcg_drain_queue() function (Guoqing) * tag 'for-linus-20191212' of git://git.kernel.dk/linux-block: blk-cgroup: remove blkcg_drain_queue block: fix NULL pointer dereference in account statistics with IDE md: make sure desc_nr less than MD_SB_DISKS md: raid1: check rdev before reference in raid1_sync_request func raid5: need to set STRIPE_HANDLE for batch head block: fix "check bi_size overflow before merge" nvme/pci: Fix read queue count nvme/pci Limit write queue sizes to possible cpus nvme/pci: Fix write and poll queue types nvme/pci: Remove last_cq_head nvme: Namepace identification descriptor list is optional nvme-fc: fix double-free scenarios on hw queues nvme: else following return is not needed nvme: add error message on mismatching controller ids nvme_fc: add module to ops template to allow module references nvmet-loop: Avoid preallocating big SGL for data nvme-fc: Avoid preallocating big SGL for data nvme-rdma: Avoid preallocating big SGL for data	2019-12-13 14:27:19 -08:00
Jens Axboe	dc3ecfc981	Merge branch 'nvme/for-5.5' of git://git.infradead.org/nvme into for-linus Pull NVMe fixes from Keith * 'nvme/for-5.5' of git://git.infradead.org/nvme: nvme/pci: Fix read queue count nvme/pci Limit write queue sizes to possible cpus nvme/pci: Fix write and poll queue types nvme/pci: Remove last_cq_head nvme: Namepace identification descriptor list is optional nvme-fc: fix double-free scenarios on hw queues nvme: else following return is not needed nvme: add error message on mismatching controller ids nvme_fc: add module to ops template to allow module references nvmet-loop: Avoid preallocating big SGL for data nvme-fc: Avoid preallocating big SGL for data nvme-rdma: Avoid preallocating big SGL for data	2019-12-06 17:27:56 -07:00
Keith Busch	7e4c6b9a5d	nvme/pci: Fix read queue count If nvme.write_queues equals the number of CPUs, the driver had decreased the number of interrupts available such that there could only be one read queue even if the controller could support more. Remove the interrupt count reduction in this case. The driver wouldn't request more IRQs than it wants queues anyway. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-12-07 02:52:47 +09:00
Keith Busch	17c3316734	nvme/pci Limit write queue sizes to possible cpus The driver can never use more queues of any type than the number of possible CPUs, so a higher value causes the driver to allocate more memory for IO queues than it could ever use. Limit the parameter at module load time to the number of possible cpus. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-12-07 02:52:42 +09:00
Keith Busch	3f68baf706	nvme/pci: Fix write and poll queue types The number of poll or write queues should never be negative. Use unsigned types so that it's not possible to break have the driver not allocate any queues. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-12-07 02:52:24 +09:00
Linus Torvalds	c3bed3b20e	pci-v5.5-changes -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl3leXUUHGJoZWxnYWFz QGdvb2dsZS5jb20ACgkQWYigwDrT+vyY3g/9FAVVdPEaadNtAhQ/zIxcjozDovKq 0q7yOA3aTBTUoNEinm88an6p0dcC4gNKtGukXmzVH2Hhxm9kLRdtpZGYY00tpLUB 9rI7XsgwwHa+hLwsHbIs507sKGFGy5FLr0ChTTGLDEMppnEvjA2hZooYmcB/OgrC LlFcwbNKGOk/Si9u2bF2nLO0JDoVHnwzpF99saew/nqc7Lfj9e9IPZFom+VjPBUh AOvRp2H7uBN+WQlpLeFeMDDoeXh34lX0kYqIV/cVkXVnknDGYKV2CBTg2aeX7jd0 QiPHZh6zlW8zNQgaCZRiBAbatVEOnRMRJ++yiqB8hBYp1LMXm6kJ01YSQpXkugoY Vp9dtzzTARWV/XkKwD4brw9ZEmIDnO+Ed2x2VbUkPJVcXAvzSQWAx82IU0Iuqmcb 9qr6U2Zf/Xk5aFlGPYVH8QOG+QqzIbZNRQ7NlhDlITyW4P6QPu0mw374yYP2wDGL sP5YSS3YGa0sQcEgDtVnd4z+WTZI4AwXLPaeaLkDhdfHp2FsERUY4TrPs33J99xw og4EyokVFzjYzlnBPU6WWn7LL+jj5ccXkL3MA4DR4FJOnNGHh7NXfQUH56rrgsq7 F9/8shL5DuTbQkde1uSyUG9Iq/RigVLlV5DQavFm3dSXvZi0E16t5alC5URNTzk7 at8Bogn53QhlmYc= =uUXw -----END PGP SIGNATURE----- Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Warn if a host bridge has no NUMA info (Yunsheng Lin) - Add PCI_STD_NUM_BARS for the number of standard BARs (Denis Efremov) Resource management: - Fix boot-time Embedded Controller GPE storm caused by incorrect resource assignment after ACPI Bus Check Notification (Mika Westerberg) - Protect pci_reassign_bridge_resources() against concurrent addition/removal (Benjamin Herrenschmidt) - Fix bridge dma_ranges resource list cleanup (Rob Herring) - Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control the MMIO and prefetchable MMIO window sizes of hotplug bridges independently (Nicholas Johnson) - Fix MMIO/MMIO_PREF window assignment that assigned more space than desired (Nicholas Johnson) - Only enforce bus numbers from bridge EA if the bridge has EA devices downstream (Subbaraya Sundeep) - Consolidate DT "dma-ranges" parsing and convert all host drivers to use shared parsing (Rob Herring) Error reporting: - Restore AER capability after resume (Mayurkumar Patel) - Add PoisonTLPBlocked AER counter (Rajat Jain) - Use for_each_set_bit() to simplify AER code (Andy Shevchenko) - Fix AER kernel-doc (Andy Shevchenko) - Add "pcie_ports=dpc-native" parameter to allow native use of DPC even if platform didn't grant control over AER (Olof Johansson) Hotplug: - Avoid returning prematurely from sysfs requests to enable or disable a PCIe hotplug slot (Lukas Wunner) - Don't disable interrupts twice when suspending hotplug ports (Mika Westerberg) - Fix deadlocks when PCIe ports are hot-removed while suspended (Mika Westerberg) Power management: - Remove unnecessary ASPM locking (Bjorn Helgaas) - Add support for disabling L1 PM Substates (Heiner Kallweit) - Allow re-enabling Clock PM after it has been disabled (Heiner Kallweit) - Add sysfs attributes for controlling ASPM link states (Heiner Kallweit) - Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl" sysfs files (Heiner Kallweit) - Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on USB 2.0 or 1.1 connect events (Kai-Heng Feng) - Move power state check out of pci_msi_supported() (Bjorn Helgaas) - Fix incorrect MSI-X masking on resume and revert related nvme quirk for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan) - Always return devices to D0 when thawing to fix hibernation with drivers like mlx4 that used legacy power management (previously we only did it for drivers with new power management ops) (Dexuan Cui) - Clear PCIe PME Status even for legacy power management (Bjorn Helgaas) - Fix PCI PM documentation errors (Bjorn Helgaas) - Use dev_printk() for more power management messages (Bjorn Helgaas) - Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas) - Convert xen-platform from legacy to generic power management (Bjorn Helgaas) - Removed unused .resume_early() and .suspend_late() legacy power management hooks (Bjorn Helgaas) - Rearrange power management code for clarity (Rafael J. Wysocki) - Decode power states more clearly ("4" or "D4" really refers to "D3cold") (Bjorn Helgaas) - Notice when reading PM Control register returns an error (~0) instead of interpreting it as being in D3hot (Bjorn Helgaas) - Add missing link delays required by the PCIe spec (Mika Westerberg) Virtualization: - Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn Helgaas) - Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code previously didn't recognize that) (Kuppuswamy Sathyanarayanan) - Allow VFs to use PASID (the PF PASID capability is shared by the VFs, but the code previously didn't recognize that) (Kuppuswamy Sathyanarayanan) - Disconnect PF and VF ATS enablement, since ATS in PFs and associated VFs can be enabled independently (Kuppuswamy Sathyanarayanan) - Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan) - Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas) - Consolidate ATS declarations in linux/pci-ats.h (Krzysztof Wilczynski) - Remove unused PRI and PASID stubs (Bjorn Helgaas) - Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID interfaces that are only used by built-in IOMMU drivers (Bjorn Helgaas) - Hide PRI and PASID state restoration functions used only inside the PCI core (Bjorn Helgaas) - Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski) - Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut) - Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George Cherian) - Fix the UPDCR register address in the Intel ACS quirk (Steffen Liebergeld) - Unify ACS quirk implementations (Bjorn Helgaas) Amlogic Meson host bridge driver: - Fix meson PERST# GPIO polarity problem (Remi Pommarel) - Add DT bindings for Amlogic Meson G12A (Neil Armstrong) - Fix meson clock names to match DT bindings (Neil Armstrong) - Add meson support for Amlogic G12A SoC with separate shared PHY (Neil Armstrong) - Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe combo PHY (Neil Armstrong) - Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong) - Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT (Neil Armstrong) Broadcom iProc host bridge driver: - Invalidate iProc PAXB address mapping before programming it (Abhishek Shah) - Fix iproc-msi and mvebu __iomem annotations (Ben Dooks) Cadence host bridge driver: - Refactor Cadence PCIe host controller to use as a library for both host and endpoint (Tom Joseph) Freescale Layerscape host bridge driver: - Add layerscape LS1028a support (Xiaowei Bao) Intel VMD host bridge driver: - Add VMD bus 224-255 restriction decode (Jon Derrick) - Add VMD 8086:9A0B device ID (Jon Derrick) - Remove Keith from VMD maintainer list (Keith Busch) Marvell ARMADA 3700 / Aardvark host bridge driver: - Use LTSSM state to build link training flag since Aardvark doesn't implement the Link Training bit (Remi Pommarel) - Delay before training Aardvark link in case PERST# was asserted before the driver probe (Remi Pommarel) - Fix Aardvark issues with Root Control reads and writes (Remi Pommarel) - Don't rely on jiffies in Aardvark config access path since interrupts may be disabled (Remi Pommarel) - Fix Aardvark big-endian support (Grzegorz Jaszczyk) Marvell ARMADA 370 / XP host bridge driver: - Make mvebu_pci_bridge_emul_ops static (Ben Dooks) Microsoft Hyper-V host bridge driver: - Add hibernation support for Hyper-V virtual PCI devices (Dexuan Cui) - Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan Cui) - Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui) Mobiveil host bridge driver: - Change mobiveil csr_read()/write() function names that conflict with riscv arch functions (Kefeng Wang) NVIDIA Tegra host bridge driver: - Fix Tegra CLKREQ dependency programming (Vidya Sagar) Renesas R-Car host bridge driver: - Remove unnecessary header include from rcar (Andrew Murray) - Tighten register index checking for rcar inbound range programming (Marek Vasut) - Fix rcar inbound range alignment calculation to improve packing of multiple entries (Marek Vasut) - Update rcar MACCTLR setting to match documentation (Yoshihiro Shimoda) - Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual (Yoshihiro Shimoda) - Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon Horman) Rockchip host bridge driver: - Make rockchip 0V9 and 1V8 power regulators non-optional (Robin Murphy) Socionext UniPhier host bridge driver: - Set uniphier to host (RC) mode always (Kunihiko Hayashi) Endpoint drivers: - Fix endpoint driver sign extension problem when shifting page number to phys_addr_t (Alan Mikhak) Misc: - Add NumaChip SPDX header (Krzysztof Wilczynski) - Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski) - Remove unused includes (Krzysztof Wilczynski) - Removed unused sysfs attribute groups (Ben Dooks) - Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas) - Add PCIe Link Control 2 register field definitions to replace magic numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas) - Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas) - Use pcie_capability_read_word() instead of pci_read_config_word() in AMDGPU and Radeon CIK/SI (Frederick Lawler) - Remove unused pci_irq_get_node() Greg Kroah-Hartman) - Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig (Palmer Dabbelt, Michal Simek) - Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe) - Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn Helgaas) - Fix bridge emulation big-endian support (Grzegorz Jaszczyk) - Fix dwc find_next_bit() usage (Niklas Cassel) - Fix pcitest.c fd leak (Hewenliang) - Fix typos and comments (Bjorn Helgaas) - Fix Kconfig whitespace errors (Krzysztof Kozlowski)" * tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits) PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist asm-generic: Make msi.h a mandatory include/asm header Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T" PCI/MSI: Fix incorrect MSI-X masking on resume PCI/MSI: Move power state check out of pci_msi_supported() PCI/MSI: Remove unused pci_irq_get_node() PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer PCI: hv: Change pci_protocol_version to per-hbus PCI: hv: Add hibernation support PCI: hv: Reorganize the code in preparation of hibernation MAINTAINERS: Remove Keith from VMD maintainer PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code PCI/ASPM: Add sysfs attributes for controlling ASPM link states PCI: Fix indentation drm/radeon: Prefer pcie_capability_read_word() drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions drm/radeon: Correct Transmit Margin masks drm/amdgpu: Prefer pcie_capability_read_word() PCI: uniphier: Set mode register to host mode drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions ...	2019-12-03 13:58:22 -08:00
Keith Busch	f6c4d97b0d	nvme/pci: Remove last_cq_head We had been saving the last_cq_head seen from an interrupt so that a polled queue wouldn't mistakenly trigger spruious interrupt detection. We don't poll interrupt driven queues any more, so saving this value is pointless. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-12-04 00:38:06 +09:00
Keith Busch	22802bf742	nvme: Namepace identification descriptor list is optional Despite NVM Express specification 1.3 requires a controller claiming to be 1.3 or higher implement Identify CNS 03h (Namespace Identification Descriptor list), the driver doesn't really need this identification in order to use a namespace. The code had already documented in comments that we're not to consider an error to this command. Return success if the controller provided any response to an namespace identification descriptors command. Fixes: `538af88ea7` ("nvme: make nvme_report_ns_ids propagate error back") Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679 Reported-by: Ingo Brunberg <ingo_brunberg@web.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: stable@vger.kernel.org # 5.4+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-12-03 06:10:00 +09:00
Linus Torvalds	0da522107e	compat_ioctl: remove most of fs/compat_ioctl.c As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite. Signed-off-by: Arnd Bergmann <arnd@arndb.de> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl bN65armTm7bFFR32Avnu =lgCl -----END PGP SIGNATURE----- Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann: "As part of the cleanup of some remaining y2038 issues, I came to fs/compat_ioctl.c, which still has a couple of commands that need support for time64_t. In completely unrelated work, I spent time on cleaning up parts of this file in the past, moving things out into drivers instead. After Al Viro reviewed an earlier version of this series and did a lot more of that cleanup, I decided to try to completely eliminate the rest of it and move it all into drivers. This series incorporates some of Al's work and many patches of my own, but in the end stops short of actually removing the last part, which is the scsi ioctl handlers. I have patches for those as well, but they need more testing or possibly a rewrite" * tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits) scsi: sd: enable compat ioctls for sed-opal pktcdvd: add compat_ioctl handler compat_ioctl: move SG_GET_REQUEST_TABLE handling compat_ioctl: ppp: move simple commands into ppp_generic.c compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic compat_ioctl: unify copy-in of ppp filters tty: handle compat PPP ioctls compat_ioctl: move SIOCOUTQ out of compat_ioctl.c compat_ioctl: handle SIOCOUTQNSD af_unix: add compat_ioctl support compat_ioctl: reimplement SG_IO handling compat_ioctl: move WDIOC handling into wdt drivers fs: compat_ioctl: move FITRIM emulation into file systems gfs2: add compat_ioctl support compat_ioctl: remove unused convert_in_user macro compat_ioctl: remove last RAID handling code compat_ioctl: remove /dev/raw ioctl translation compat_ioctl: remove PCI ioctl translation compat_ioctl: remove joystick ioctl translation ...	2019-12-01 13:46:15 -08:00
Jian-Hong Pan	655e7aee1f	Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T" Since `e045fa29e8` ("PCI/MSI: Fix incorrect MSI-X masking on resume") is merged, we can revert the previous quirk now. This reverts commit `19ea025e1d`. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887 Fixes: `19ea025e1d` ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T") Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org	2019-11-26 13:13:14 -06:00
James Smart	c869e494ef	nvme-fc: fix double-free scenarios on hw queues If an error occurs on one of the ios used for creating an association, the creating routine has error paths that are invoked by the command failure and the error paths will free up the controller resources created to that point. But... the io was ultimately determined by an asynchronous completion routine that detected the error and which unconditionally invokes the error_recovery path which calls delete_association. Delete association deletes all outstanding io then tears down the controller resources. So the create_association thread can be running in parallel with the error_recovery thread. What was seen was the LLDD received a call to delete a queue, causing the LLDD to do a free of a resource, then the transport called the delete queue again causing the driver to repeat the free call. The second free routine corrupted the allocator. The transport shouldn't be making the duplicate call, and the delete queue is just one of the resources being freed. To fix, it is realized that the create_association path is completely serialized with one command at a time. So the failed io completion will always be seen by the create_association path and as of the failure, there are no ios to terminate and there is no reason to be manipulating queue freeze states, etc. The serialized condition stays true until the controller is transitioned to the LIVE state. Thus the fix is to change the error recovery path to check the controller state and only invoke the teardown path if not already in the CONNECTING state. Reviewed-by: Himanshu Madhani <hmadhani@marvell.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 03:00:13 +09:00
Edmund Nadolski	c80b36cd95	nvme: else following return is not needed Remove unnecessary keyword in nvme_create_queue(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:48:33 +09:00
James Smart	a8157ff360	nvme: add error message on mismatching controller ids We've seen a few devices that return different controller id's to the Fabric Connect command vs the Identify(controller) command. It's currently hard to identify this failure by existing error messages. It comes across as a (re)connect attempt in the transport that fails with a -22 (-EINVAL) status. The issue is compounded by older kernels not having the controller id check or had the identify command overwrite the fabrics controller id value before it checked. Both resulted in cases where the devices appeared fine until more recent kernels. Clarify the reject by adding an error message on controller id mismatches. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:48:33 +09:00
James Smart	863fbae929	nvme_fc: add module to ops template to allow module references In nvme-fc: it's possible to have connected active controllers and as no references are taken on the LLDD, the LLDD can be unloaded. The controller would enter a reconnect state and as long as the LLDD resumed within the reconnect timeout, the controller would resume. But if a namespace on the controller is the root device, allowing the driver to unload can be problematic. To reload the driver, it may require new io to the boot device, and as it's no longer connected we get into a catch-22 that eventually fails, and the system locks up. Fix this issue by taking a module reference for every connected controller (which is what the core layer did to the transport module). Reference is cleared when the controller is removed. Acked-by: Himanshu Madhani <hmadhani@marvell.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:48:27 +09:00
Israel Rukshin	52e6d8ed16	nvmet-loop: Avoid preallocating big SGL for data nvme_loop_create_io_queues() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvmet-loop, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 1281284K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:14:19 +09:00
Israel Rukshin	b1ae1a2389	nvme-fc: Avoid preallocating big SGL for data nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 1281284K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:14:01 +09:00
Israel Rukshin	38e1800275	nvme-rdma: Avoid preallocating big SGL for data nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 1281284K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't support SG_CHAIN, use only runtime allocation for the SGL. We didn't notice of a performance degradation, since for small IOs we'll use the inline SG and for the bigger IOs the allocation of a bigger SGL from slab is fast enough. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-27 02:13:45 +09:00
Linus Torvalds	323264eefb	for-5.5/drivers-post-20191122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3X/7gQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgponxD/9mb/H9LD6/flEqoPv7n1dv7Y8Oe+AKpGLb a2Jh8ycpU6WtzZdlMYbQkxAqgJCupLTlih3WY3NuI1fwSsxwyMziQEnVnKgPNf7s PLWt+Qo5ryooyVkPi4KCHKjx2CFDUL4B1BqtSLm9n3eN72FRa9HCsCiMugjCbV+K aF7snzZ0ss+m7SnKIpdtXJjcdIFC2hXwCAGWAOv1vOPwhqQZFBsxjHKnEJtumTp2 +wzLBPItLBbzHtAyopbiNlfsuHL9CF9L1QFaCTZE7N6eVWnlYpIhMCMrfEJ/e3jK 27ct3PWAa4Qr2S/85//AMOyL9mxK96g9FjuGjxnKQZ+qWh89RPLq+oGs1zWSv2O0 08BynWvxYHkoOR2+baPx89SHrlwyN0HCsvFBxVKrMsVHpy81sIYLFlytYaQUMx2/ RkgkIxiAo5R5vYHPZYX8cU4c1rASjG0tYD9OA6e78MJFIBfkTl70XHHVX2Tgf5by fuwon/g+iVvN94QHb81ulZcA0hXRz8jM2RNIAPhqfoJXX6wNzD5MooNxTs9m88IP 6/HaM1l6AJUtOMNA1aZbKAq+ARYuIA0/qHoSS0UVHoG8D4YkYaFpvYsAaA+TBzeO J8IcwSu6eVR1NvgJ9b4cwXFWSf75a/o4UeDP/1fYcQU5Gn/KQEdQVFSMvND1nnHW hUTP5AKFcQ== =oyE3 -----END PGP SIGNATURE----- Merge tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block Pull additional block driver updates from Jens Axboe: "Here's another block driver update, done to avoid conflicts with the zoned changes coming next. This contains: - Prepare SCSI sd for zone open/close/finish support - Small NVMe pull request - hwmon support (Akinobu) - add new co-maintainer (Christoph) - work-around for a discard issue on non-conformant drives (Eduard) - Small nbd leak fix" * tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block: nbd: prevent memory leak nvme: hwmon: add quirk to avoid changing temperature threshold nvme: hwmon: provide temperature min and max values for each sensor nvmet: add another maintainer nvme: Discard workaround for non-conformant devices nvme: Add hardware monitoring support scsi: sd_zbc: add zone open, close, and finish support	2019-11-25 11:18:03 -08:00
Linus Torvalds	2d53943090	for-5.5/drivers-20191121 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WyEAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplgbD/4jNeqT0q2IkNcUUEWkZWsBOlfi0SiclS5v X8JY1IxTlL0kaBWm83mw06JewucQ97Fh7xblPE8/iDHJqpgEX4vvSQY1b8hcDulZ YOKUnLkFU22nICeT04/8x/+f8gqD5KOlGxkgEvUKViQW15oc0oNu4St/yFM1QEN0 qNMzpcfFXV9lYsOPl0y3pKdP+qbfcpeSmaFD9Z65gxN6rJy1WR8rtUGXy2luoiEc dh15IL9AGN/r8VTo8yRpD9PStiuJqpALIR8OHJSHPj+s0pQ6twk4aehcnYseAMbH zSDpa9AJrfqlnh8tUfKYLWi/PM7pMH0F01rAiQv47j/C0+QhbiOU/uTFTzUW5hQ1 eK6XzJ0slxwnDsHLKf+xJmCj0Oyk0jDimNQr/2MNsuhmr29V5lfvBNflub8eOLyZ ie2Eulv+z6pYBSJx6kqm0X3vhXOy4wgU+X8LzvfcP9iAjgU1rfzxUWxLEj+KfJS2 Nl+ERV9nafoPpoKpNR7zWRBUulp1qZJzo/U9JaUKiI5cWkIH1hhHmU2++xMeyJpb XHoDFNTGv6z/eef65eSveFD7F274TSi16K56Obk+4KWaSrIR0d6VwUA7FDmJbSI+ Jqk1OFdaRGsQ5OcVxF1Qo4WChn0FvhcD0c+yL0N19WZ01QeYsb3hlA+MUPDtGQ04 U79MPfu7iA== =i0jf -----END PGP SIGNATURE----- Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the main block driver updates for 5.5. Nothing major in here, mostly just fixes. This contains: - a set of bcache changes via Coly - MD changes from Song - loop unmap write-zeroes fix (Darrick) - spelling fixes (Geert) - zoned additions cleanups to null_blk/dm (Ajay) - allow null_blk online submit queue changes (Bart) - NVMe changes via Keith, nothing major here either" * tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits) Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()" drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET bcache: don't export symbols bcache: remove the extra cflags for request.o bcache: at least try to shrink 1 node in bch_mca_scan() bcache: add idle_max_writeback_rate sysfs interface bcache: add code comments in bch_btree_leaf_dirty() bcache: fix deadlock in bcache_allocator bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front() bcache: deleted code comments for dead code in bch_data_insert_keys() bcache: add more accurate error messages in read_super() bcache: fix static checker warning in bcache_device_free() bcache: fix a lost wake-up problem caused by mca_cannibalize_lock bcache: fix fifo index swapping condition in journal_pin_cmp() md/raid10: prevent access of uninitialized resync_pages offset md: avoid invalid memory access for array sb->dev_roles md/raid1: avoid soft lockup under high load null_blk: add zone open, close, and finish support dm: add zone open, close and finish support ...	2019-11-25 11:15:41 -08:00
Akinobu Mita	6c6aa2f26c	nvme: hwmon: add quirk to avoid changing temperature threshold This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing the value of the temperature threshold feature for specific devices that show undesirable behavior. Guenter reported: "On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the Composite temperature sensor results in a temperature warning, and that warning is sticky until I reset the controller. It doesn't seem to matter which temperature I write; writing -273000 has the same result." The Intel NVMe has the latest firmware version installed, so this isn't a problem that was ever fixed. Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jean Delvare <jdelvare@suse.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-22 02:21:08 +09:00
Akinobu Mita	52deba0f02	nvme: hwmon: provide temperature min and max values for each sensor According to the NVMe specification, the over temperature threshold and under temperature threshold features shall be implemented for Composite Temperature if a non-zero WCTEMP field value is reported in the Identify Controller data structure. The features are also implemented for all implemented temperature sensors (i.e., all Temperature Sensor fields that report a non-zero value). This provides the over temperature threshold and under temperature threshold for each sensor as temperature min and max values of hwmon sysfs attributes. The WCTEMP is already provided as a temperature max value for Composite Temperature, but this change isn't incompatible. Because the default value of the over temperature threshold for Composite Temperature is the WCTEMP. Now the alarm attribute for Composite Temperature indicates one of the temperature is outside of a temperature threshold. Because there is only a single bit in Critical Warning field that indicates a temperature is outside of a threshold. Example output from the "sensors" command: nvme-pci-0100 Adapter: PCI adapter Composite: +33.9°C (low = -273.1°C, high = +69.8°C) (crit = +79.8°C) Sensor 1: +34.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +31.9°C (low = -273.1°C, high = +65261.8°C) Sensor 5: +47.9°C (low = -273.1°C, high = +65261.8°C) This also adds helper macros for kelvin from/to milli Celsius conversion, and replaces the repeated code in hwmon.c. Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jean Delvare <jdelvare@suse.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-22 02:21:08 +09:00
Eduard Hasenleithner	530436c45e	nvme: Discard workaround for non-conformant devices Users observe IOMMU related errors when performing discard on nvme from non-compliant nvme devices reading beyond the end of the DMA mapped ranges to discard. Two different variants of this behavior have been observed: SM22XX controllers round up the read size to a multiple of 512 bytes, and Phison E12 unconditionally reads the maximum discard size allowed by the spec (256 segments or 4kB). Make nvme_setup_discard unconditionally allocate the maximum DSM buffer so the driver DMA maps a memory range that will always succeed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at> [changelog, use existing define, kernel coding style] Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-13 06:37:38 +09:00
Guenter Roeck	400b6a7b13	nvme: Add hardware monitoring support nvme devices report temperature information in the controller information (for limits) and in the smart log. Currently, the only means to retrieve this information is the nvme command line interface, which requires super-user privileges. At the same time, it would be desirable to be able to use NVMe temperature information for thermal control. This patch adds support to read NVMe temperatures from the kernel using the hwmon API and adds temperature zones for NVMe drives. The thermal subsystem can use this information to set thermal policies, and userspace can access it using libsensors and/or the "sensors" command. Example output from the "sensors" command: nvme0-pci-0100 Adapter: PCI adapter Composite: +39.0°C (high = +85.0°C, crit = +85.0°C) Sensor 1: +39.0°C Sensor 2: +41.0°C Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-12 01:57:35 +09:00
Linus Torvalds	5cb8418cb5	for-linus-2019-11-08 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3F+DIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpkVmD/9FIa092Q6obga0RqA16GlbI85tgtNyRFZU neIO/g9R3G/uBTGbiUeXHXHDz9CqUXIRYX7pmI2u0b07iGLRz8oUsOsgyVEfCMen VitqwkJJAZ9j9OifyKpLYCZX9ulVDWX5hEz/vm2cNWDkjCbOpXvRuQmkXEzp7RNM F7K25PpGLvJHfqC90q9FXXxNDlB2i1M/rh5I7eUqhb6rHmzfJGKCd+H80t+REoB1 iXAygPj86agQKLOUKZtJjXUjq9Ol/0FD+OKY+eP3EfVv/FJvIeWWYe78WplxJpRD BYb9dhLMCSo619WVVy4hNYCPjSfPKVT2cO5QJmVRpgOI1urFuTNgNoIiw06AgvkZ 09vrlrJZ5A7eFEppuAFQC4WRYKWCQCQfq8wxt2iGUivXgHfskjJ7qJz1Mh5Nlxsr JGm1hSVw9UCBzjqC75K2CR+vVt4T8ovEaizPFvzVj6lQ0lRTmSchisxbXzTdevOn kFvBOntBdpeSq/CwZ0x5PDP4AbRsgH3ny47LCJoEpFZlxlOLuEB/dAx564dn6ZgZ rRz6mKU7rlO7brrkW+DbRZ0XiOn5qzN6FrcSGX1DBzc4+hMus/5PPjv8Gk+Wo1PU 388mu6N6DObSjw1ij1AqkwyzudmbrziKM4isJY4I72I0YGSq9cuG5VXgM2GqbGlU XXpzsu8pSw== =8B/z -----END PGP SIGNATURE----- Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - Two NVMe device removal crash fixes, and a compat fixup for for an ioctl that was introduced in this release (Anton, Charles, Max - via Keith) - Missing error path mutex unlock for drbd (Dan) - cgroup writeback fixup on dead memcg (Tejun) - blkcg online stats print fix (Tejun) * tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block: cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead block: drbd: remove a stray unlock in __drbd_send_protocol() blkcg: make blkcg_print_stat() print stats only for online blkgs nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload	2019-11-08 18:15:55 -08:00
Anton Eidelman	763303a83a	nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme_mpath_clear_ctrl_paths() iterates through the ctrl->namespaces list while holding ctrl->scan_lock. This does not seem to be the correct way of protecting from concurrent list modification. Specifically, nvme_scan_work() sorts ctrl->namespaces AFTER unlocking scan_lock. This may result in the following (rare) crash in ctrl disconnect during scan_work: BUG: kernel NULL pointer dereference, address: 0000000000000050 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core] ... Call Trace: nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core] nvme_remove_namespaces+0x35/0xe0 [nvme_core] nvme_do_delete_ctrl+0x47/0x90 [nvme_core] nvme_sysfs_delete+0x49/0x60 [nvme_core] dev_attr_store+0x17/0x30 sysfs_kf_write+0x3e/0x50 kernfs_fop_write+0x11e/0x1a0 __vfs_write+0x1b/0x40 vfs_write+0xb9/0x1a0 ksys_write+0x67/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f8d02bfb154 Fix: After taking scan_lock in nvme_mpath_clear_ctrl_paths() down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe. This will not cause deadlocks because taking scan_lock never happens while holding the namespaces_rwsem. Moreover, scan work downs namespaces_rwsem in the same order. Alternative: sort ctrl->namespaces in nvme_scan_work() while still holding the scan_lock. This would leave nvme_mpath_clear_ctrl_paths() without correct protection against ctrl->namespaces modification by anyone other than scan_work. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-06 00:30:37 +09:00
Max Gurtovoy	9ad9e8d6ca	nvme-rdma: fix a segmentation fault during module unload In case there are controllers that are not associated with any RDMA device (e.g. during unsuccessful reconnection) and the user will unload the module, these controllers will not be freed and will access already freed memory. The same logic appears in other fabric drivers as well. Fixes: `87fd125344` ("nvme-rdma: remove redundant reference between ib_device and tagset") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-11-06 00:29:23 +09:00
Prabhath Sajeepa	64fab7290d	nvme: Fix parsing of ANA log page Check validity of offset into ANA log buffer before accessing nvme_ana_group_desc. This check ensures the size of ANA log buffer >= offset + sizeof(nvme_ana_group_desc) Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	716fd9c119	nvmet: stop using bio_set_op_attrs bio_set_op_attrs has been long deprecated, replace it with a direct assignment of the flags to bio->bi_opf. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	9dea0c81ee	nvmet: add plugging for read/write when ns is bdev With reference to the following issue reported on the mailing list :- http://lists.infradead.org/pipermail/linux-nvme/2019-October/027604.html this patch adds plugging for the bdev-ns under nvmet_bdev_execute_rw(). We can see the following performance improvement in random write workload I/Os with the setup described in the link when device_path configured as /dev/md0. Without this patch :- write: IOPS=40.8k, BW=159MiB/s (167MB/s)(4777MiB/30002msec) write: IOPS=41.2k, BW=161MiB/s (169MB/s)(4831MiB/30011msec) slat (usec): min=8, max=10823, avg=15.64, stdev=16.85 slat (usec): min=8, max=401, avg=15.40, stdev= 9.56 clat (usec): min=54, max=2492, avg=759.07, stdev=172.62 clat (usec): min=56, max=1997, avg=768.06, stdev=178.72 With this patch :- write: IOPS=123k, BW=480MiB/s (504MB/s)(14.1GiB/30011msec) write: IOPS=123k, BW=481MiB/s (504MB/s)(14.1GiB/30002msec) slat (usec): min=8, max=9941, avg=13.31, stdev= 8.04 slat (usec): min=8, max=289, avg=13.31, stdev= 3.37 clat (usec): min=43, max=17635, avg=245.46, stdev=171.23 clat (usec): min=44, max=17751, avg=245.25, stdev=183.14 Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	d84dd8cde6	nvmet: clean up command parsing a bit Move the special cases for fabrics commands and the discovery controller to nvmet_parse_admin_cmd in preparation for adding passthrough support. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Geert Uytterhoeven	05d3046ff7	nvme-pci: Spelling s/resdicovered/rediscovered/ Fix misspelling of "rediscovered". Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Sagi Grimberg	d4b3a17411	nvmet: fill discovery controller sn, fr and mn correctly Discovery controllers need this information as well. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	be3f3114dd	nvmet: Open code nvmet_req_execute() Now that nvmet_req_execute does nothing, open code it. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, update changelog] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	e9061c3978	nvmet: Remove the data_len field from the nvmet_req struct Instead of storing the expected length and checking it when it's executed, just check the length inside the command themselves. A new helper, nvmet_check_data_len() is created to help with this check. Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, udpate changelog] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	59ef0eaa77	nvmet: Introduce nvmet_dsm_len() helper Similar to the nvmet_rw_len helper. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, update changelog] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:42 -07:00
Christoph Hellwig	6f86f2c9d9	nvmet: Cleanup discovery execute handlers Push the lid and cns check into their respective handlers and, while we're at it, rename the functions to be consistent with other discovery handlers. Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, update changelog] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Christoph Hellwig	2cb6963a16	nvmet: Introduce common execute function for get_log_page and identify Instead of picking the sub-command handler to execute in a nested switch statement introduce a landing functions that calls out to the appropriate sub-command handler. This will allow us to have a common place in the handler to check the transfer length in a future patch. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [split patch, update change log] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Logan Gunthorpe	c73eebc07a	nvmet-tcp: Don't set the request's data_len It's not apprporiate for the transports to set the data_len field of the request which is only used by the core. In this case, just use a variable on the stack to store the length of the sgl for comparison. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Logan Gunthorpe	e0bace7177	nvmet-tcp: Don't check data_len in nvmet_tcp_map_data() None of the other transports check data_len which is verified in core code. The function should instead check that the sgl length is non-zero. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Damien Le Moal	e08f2ae850	nvme: Introduce nvme_lba_to_sect() Introduce the new helper function nvme_lba_to_sect() to convert a device logical block number to a 512B sector number. Use this new helper in obvious places, cleaning up the code. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Damien Le Moal	314d48dd22	nvme: Cleanup and rename nvme_block_nr() Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT instead of its hard coded value 9. Also add a comment to decribe this helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Max Gurtovoy	16686f3a6c	nvme: move common call to nvme_cleanup_cmd to core layer nvme_cleanup_cmd should be called for each call to nvme_setup_cmd (symmetrical functions). Move the call for nvme_cleanup_cmd to the common core layer and call it during nvme_complete_rq for the good flow. For error flow, each transport will call nvme_cleanup_cmd independently. Also take care of a special case of path failure, where we call nvme_complete_rq without doing nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Max Gurtovoy	2dc3947b53	nvme: introduce "Command Aborted By host" status code Fix the status code of canceled requests initiated by the host according to TP4028 (Status Code 0x371): "Command Aborted By host: The command was aborted as a result of host action (e.g., the host disconnected the Fabric connection)." Also in a multipath environment, unless otherwise specified, errors of this type (path related) should be retried using a different path, if one is available. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Israel Rukshin	59534b9d60	nvmet-rdma: add unlikely check at nvmet_rdma_map_sgl_keyed The calls to nvmet_req_alloc_sgl and rdma_rw_ctx_init should usually succeed, so add this simple optimization to the fast path. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:41 -07:00
Israel Rukshin	e522f44602	nvmet: add unlikely check at nvmet_req_alloc_sgl The call to sgl_alloc shouldn't fail so add this simple optimization to the fast path. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
Israel Rukshin	4d764bb9a9	nvmet: use bio_io_error instead of duplicating it This commit doesn't change any logic. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
Israel Rukshin	58a8df67e0	nvme: introduce nvme_is_aen_req function This function improves code readability and reduces code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
James Smart	bcde5f0fc7	nvme-fc: ensure association_id is cleared regardless of a Disconnect LS Code today only clears the association_id if a Disconnect LS is transmit. Remove ambiguity and unconditionally clear the association_id if the association has been terminated. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
James Smart	7db394848e	nvme-fc: clarify error messages Change wording on a couple of messages to clarify what happened. Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
James Smart	44fbf3bb1a	nvme-fc: Set new cmd set indicator in nvme-fc cmnd iu Set the new category field in the FC-NVME CMND_IU based on queue number. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
James Smart	53b2b2f599	nvme-fc and nvmet-fc: sync with FC-NVME-2 header changes Sync sources with revised structure and field names to correspond with FC-NVME-2 header sync-up. Tested interoperability with success: - prior initiator with new target - prior target with new initiator - new on new Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-04 10:56:40 -07:00
Linus Torvalds	1204c70d9d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Fix free/alloc races in batmanadv, from Sven Eckelmann. 2) Several leaks and other fixes in kTLS support of mlx5 driver, from Tariq Toukan. 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke Høiland-Jørgensen. 4) Add an r8152 device ID, from Kazutoshi Noguchi. 5) Missing include in ipv6's addrconf.c, from Ben Dooks. 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can easily infer the 32-bit secret otherwise etc. 7) Several netdevice nesting depth fixes from Taehee Yoo. 8) Fix several KCSAN reported errors, from Eric Dumazet. For example, when doing lockless skb_queue_empty() checks, and accessing sk_napi_id/sk_incoming_cpu lockless as well. 9) Fix jumbo packet handling in RXRPC, from David Howells. 10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet. 11) Fix DMA synchronization in gve driver, from Yangchun Fu. 12) Several bpf offload fixes, from Jakub Kicinski. 13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo. 14) Fix ping latency during high traffic rates in hisilicon driver, from Jiangfent Xiao. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) net: fix installing orphaned programs net: cls_bpf: fix NULL deref on offload filter removal selftests: bpf: Skip write only files in debugfs selftests: net: reuseport_dualstack: fix uninitalized parameter r8169: fix wrong PHY ID issue with RTL8168dp net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: phylink: Fix phylink_dbg() macro gve: Fixes DMA synchronization. inet: stop leaking jiffies on the wire ixgbe: Remove duplicate clear_bit() call Documentation: networking: device drivers: Remove stray asterisks e1000: fix memory leaks i40e: Fix receive buffer starvation for AF_XDP igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare igb: Enable media autosense for the i350. igb/igc: Don't warn on fatal read failures when the device is removed tcp: increase tcp_max_syn_backlog max value net: increase SOMAXCONN to 4096 netdevsim: Fix use-after-free during device dismantle ...	2019-11-01 17:48:11 -07:00
Anton Eidelman	86cccfbf77	nvme-multipath: remove unused groups_only mode in ana log groups_only mode in nvme_read_ana_log() is no longer used: remove it. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 08:55:00 -06:00
Anton Eidelman	af8fd04247	nvme-multipath: fix possible io hang after ctrl reconnect The following scenario results in an IO hang: 1) ctrl completes a request with NVME_SC_ANA_TRANSITION. NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered. 2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl. This fails because ctrl disconnects. Therefore nvme_update_ns_ana_state() is not called and NVME_NS_ANA_PENDING bit in ns->flags is not cleared. 3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls nvme_read_ana_log(ctrl, groups_only=true). However, nvme_update_ana_state() does not update namespaces because nr_nsids = 0 (due to groups_only mode). 4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK. Result: The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set. Consequently ctrl will never be considered a viable path by __nvme_find_path(). IO will hang if ctrl is the only or the last path to the namespace. More generally, while ctrl is reconnecting, its ANA state may change. And because nvme_mpath_init() requests ANA log in groups_only mode, these changes are not propagated to the existing ctrl namespaces. This may result in a mal-function or an IO hang. Solution: nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false. This will not harm the new ctrl case (no namespaces present), and will make sure the ANA state of namespaces gets updated after reconnect. Note: Another option would be for nvme_mpath_init() to invoke nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-10-29 08:55:00 -06:00
Eric Dumazet	3f926af3f4	net: use skb_queue_empty_lockless() in busy poll contexts Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-28 13:33:41 -07:00
Arnd Bergmann	1832f2d8ff	compat_ioctl: move more drivers to compat_ptr_ioctl The .ioctl and .compat_ioctl file operations have the same prototype so they can both point to the same function, which works great almost all the time when all the commands are compatible. One exception is the s390 architecture, where a compat pointer is only 31 bit wide, and converting it into a 64-bit pointer requires calling compat_ptr(). Most drivers here will never run in s390, but since we now have a generic helper for it, it's easy enough to use it consistently. I double-checked all these drivers to ensure that all ioctl arguments are used as pointers or are ignored, but are not interpreted as integer values. Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Darren Hart (VMware) <dvhart@infradead.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2019-10-23 17:23:44 +02:00
Kevin Hao	a4f40484e7	nvme-pci: Set the prp2 correctly when using more than 4k page In the current code, the nvme is using a fixed 4k PRP entry size, but if the kernel use a page size which is more than 4k, we should consider the situation that the bv_offset may be larger than the dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then cause the command can't be executed correctly. Fixes: `dff824b2aa` ("nvme-pci: optimize mapping of small single segment requests") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-18 23:09:41 +09:00
Max Gurtovoy	28a4cac48c	nvme-tcp: fix possible leakage during error flow During nvme_tcp_setup_cmd_pdu error flow, one must call nvme_cleanup_cmd since it's symmetric to nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-15 22:47:29 +09:00
Max Gurtovoy	5812d04c4c	nvmet-loop: fix possible leakage during error flow During nvme_loop_queue_rq error flow, one must call nvme_cleanup_cmd since it's symmetric to nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-15 22:47:28 +09:00
Sebastian Andrzej Siewior	ac1c4e1885	nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL The access to sk->sk_ll_usec should be hidden behind CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec. Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL. Fixes: `1a9460cef5` ("nvme-tcp: support simple polling") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:27:01 +09:00
Keith Busch	c1ac9a4b07	nvme: Wait for reset state when required Prevent simultaneous controller disabling/enabling tasks from interfering with each other through a function to wait until the task successfully transitioned the controller to the RESETTING state. This ensures disabling the controller will not be interrupted by another reset path, otherwise a concurrent reset may leave the controller in the wrong state. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:22:00 +09:00
Keith Busch	4c75f87785	nvme: Prevent resets during paused controller state A paused controller is doing critical internal activation work in the background. Prevent subsequent controller resets from occurring during this period by setting the controller state to RESETTING first. A helper function, nvme_try_sched_reset_work(), is introduced for these paths so they may continue with scheduling the reset_work after they've completed their uninterruptible critical section. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:21:54 +09:00
Keith Busch	92b98e88d5	nvme: Restart request timers in resetting state A controller in the resetting state has not yet completed its recovery actions. The pci and fc transports were already handling this, so update the remaining transports to not attempt additional recovery in this state. Instead, just restart the request timer. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:21:49 +09:00
Keith Busch	5d02a5c1d6	nvme: Remove ADMIN_ONLY state The admin only state was intended to fence off actions that don't apply to a non-IO capable controller. The only actual user of this is the scan_work, and pci was the only transport to ever set this state. The consequence of having this state is placing an additional burden on every other action that applies to both live and admin only controllers. Remove the admin only state and place the admin only burden on the only place that actually cares: scan_work. This also prepares to make it easier to temporarily pause a LIVE state so that we don't need to remember which state the controller had been in prior to the pause. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:21:44 +09:00
Keith Busch	770597ecb2	nvme-pci: Free tagset if no IO queues If a controller becomes degraded after a reset, we will not be able to perform any IO. We currently teardown previously created request queues and namespaces, but we had kept the unusable tagset. Free it after all queues using it have been released. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2019-10-14 23:21:38 +09:00
Ard Biesheuvel	3a8ecc935e	nvme: retain split access workaround for capability reads Commit `7fd8930f26` "nvme: add a common helper to read Identify Controller data" has re-introduced an issue that we have attempted to work around in the past, in commit `a310acd7a7` ("NVMe: use split lo_hi_{read,write}q"). The problem is that some PCIe NVMe controllers do not implement 64-bit outbound accesses correctly, which is why the commit above switched to using lo_hi_[read\|write]q for all 64-bit BAR accesses occuring in the code. In the mean time, the NVMe subsystem has been refactored, and now calls into the PCIe support layer for NVMe via a .reg_read64() method, which fails to use lo_hi_readq(), and thus reintroduces the problem that the workaround above aimed to address. Given that, at the moment, .reg_read64() is only used to read the capability register [which is known to tolerate split reads], let's switch .reg_read64() to lo_hi_readq() as well. This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys DesignWare PCIe host controller. Fixes: `7fd8930f26` ("nvme: add a common helper to read Identify Controller data") Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-10-04 17:10:12 -07:00
Sagi Grimberg	6abff1b9f7	nvme: fix possible deadlock when nvme_update_formats fails nvme_update_formats may fail to revalidate the namespace and attempt to remove the namespace. This may lead to a deadlock as nvme_ns_remove will attempt to acquire the subsystem lock which is already acquired by the passthru command with effects. Move the invalid namepsace removal to after the passthru command releases the subsystem lock. Reported-by: Judy Brock <judy.brock@samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-10-04 17:10:12 -07:00
Jens Axboe	2d5ba0c712	Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linus Pull NVMe changes from Sagi: "This set consists of various fixes and cleanups: - controller removal race fix from Balbir - quirk additions from Gabriel and Jian-Hong - nvme-pci power state save fix from Mario - Add 64bit user commands (for 64bit registers) from Marta - nvme-rdma/nvme-tcp fixes from Max, Mark and Me - Minor cleanups and nits from James, Dan and John" * 'nvme-5.4' of git://git.infradead.org/nvme: nvme-rdma: fix possible use-after-free in connect timeout nvme: Move ctrl sqsize to generic space nvme: Add ctrl attributes for queue_count and sqsize nvme: allow 64-bit results in passthru commands nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T nvmet-tcp: remove superflous check on request sgl Added QUIRKs for ADATA XPG SX8200 Pro 512GB nvme-rdma: Fix max_hw_sectors calculation nvme: fix an error code in nvme_init_subsystem() nvme-pci: Save PCI state before putting drive into deepest state nvme-tcp: fix wrong stop condition in io_work nvme-pci: Fix a race in controller removal nvmet: change ppl to lpp	2019-09-27 13:17:37 -06:00
Sagi Grimberg	67b483dd03	nvme-rdma: fix possible use-after-free in connect timeout If the connect times out, we may have already destroyed the queue in the timeout handler, so test if the queue is still allocated in the connect error handler. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-27 10:24:53 -07:00
Keith Busch	f968688f44	nvme: Move ctrl sqsize to generic space This isn't specific to fabrics. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-26 13:00:47 -07:00
James Smart	2b1ff255d2	nvme: Add ctrl attributes for queue_count and sqsize Current controller interrogation requires a lot of guesswork on how many io queues were created and what the io sq size is. The numbers are dependent upon core/fabric defaults, connect arguments, and target responses. Add sysfs attributes for queue_count and sqsize. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 13:01:44 -07:00
Marta Rybczynska	65e68edce0	nvme: allow 64-bit results in passthru commands It is not possible to get 64-bit results from the passthru commands, what prevents from getting for the Capabilities (CAP) property value. As a result, it is not possible to implement IOL's NVMe Conformance test 4.3 Case 1 for Fabrics targets [1] (page 123). This issue has been already discussed [2], but without a solution. This patch solves the problem by adding new ioctls with a new passthru structure, including 64-bit results. The older ioctls stay unchanged. [1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf [2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:27 -07:00
Jian-Hong Pan	19ea025e1d	nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T Kingston NVME SSD with firmware version E8FK11.T has no interrupt after resume with actions related to suspend to idle. This patch applied NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue. Fixes: `d916b1be94` ("nvme-pci: use host managed power state for suspend") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887 Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Sagi Grimberg	30f27d57c0	nvmet-tcp: remove superflous check on request sgl Now that sgl_free is null safe, drop the superflous check. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Gabriel Craciunescu	f03e42c6af	Added QUIRKs for ADATA XPG SX8200 Pro 512GB Booting with default_ps_max_latency_us >6000 makes the device fail. Also SUBNQN is NULL and gives a warning on each boot/resume. $ nvme id-ctrl /dev/nvme0 \| grep ^subnqn subnqn : (null) I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop. To be sure is not a Laptop issue only, I tested the device on my server board with the same results. ( with 2x,4x link on the board and 4x link on a PCI-E card ). Signed-off-by: Gabriel Craciunescu <nix.or.die@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Max Gurtovoy	ff13c1b87c	nvme-rdma: Fix max_hw_sectors calculation By default, the NVMe/RDMA driver should support max io_size of 1MiB (or upto the maximum supported size by the HCA). Currently, one will see that /sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024. A non power of 2 value can cause performance degradation due to unnecessary splitting of IO requests and unoptimized allocation units. The number of pages per MR has been fixed here, so there is no longer any need to reduce max_sectors by 1. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Dan Carpenter	bc4f6e06a9	nvme: fix an error code in nvme_init_subsystem() "ret" should be a negative error code here, but it's either success or possibly uninitialized. Fixes: `32fd90c407` ("nvme: change locking for the per-subsystem controller list") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Mario Limonciello	7cbb5c6f9a	nvme-pci: Save PCI state before putting drive into deepest state The action of saving the PCI state will cause numerous PCI configuration space reads which depending upon the vendor implementation may cause the drive to exit the deepest NVMe state. In these cases ASPM will typically resolve the PCIe link state and APST may resolve the NVMe power state. However it has also been observed that this register access after quiesced will cause PC10 failure on some device combinations. To resolve this, move the PCI state saving to before SetFeatures has been called. This has been proven to resolve the issue across a 5000 sample test on previously failing disk/system combinations. Signed-off-by: Mario Limonciello <mario.limonciello@dell.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Wunderlich, Mark	ddef29578a	nvme-tcp: fix wrong stop condition in io_work Allow the do/while statement to continue if current time is not after the proposed time 'deadline'. Intent is to allow loop to proceed for a specific time period. Currently the loop, as coded, will exit after first pass. Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-25 12:53:14 -07:00
Linus Torvalds	2e959dd87a	for-5.4/post-2019-09-24 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J8xQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpujgD/94s9GGKN8JShxCpT0YNuWyyFF5gNlaimQU RSGAwnv2YUgEGNSUOPpcaj5FAYhTfYzbqoHlE+jytA2U5KXTOhc5Z85QV+TY4HPs I03xczYuYD/uX0QuF00zU2+6eV3lETELPiBARbfEQdHfm72iwurweHzlh4dfhbxW P7UA/cKixXWF2CH9wg5347Ll93nD24f2pi8BUyLJi/xpdlaRrN11Ii8AzNlRmq52 VRxURuogl98W89F6EV2VhPGFgUEYHY2Ot7II2OqqV+jmjHDQW9y5hximzINOqkxs bQwo5J+WrDSPoqwl8+db2k7QQjAl1XKDAHmCwz+7J/BoOgZj8/M1FMBwzita+5x+ UqxEYe7k+2G3w2zuhBrq03BypU8pwqFep/QI0cCCPaHs4J5QnkVOScEqd6iV/C3T FPvMvqDf7MrElghj4Qa2IZlh/CgqmLG5NUEz8E40cXkdiP+E+eK9ZY2Uwx2XhBrm 7Gl+SpG5DxWqqJeRNVWjFwM4p5L+01NtwDbTjZ1rsf+mCW5cNsy/L9B4UpPz4HxW coAs0y/Ce+ZhCopIXZ4jLDBoTG9yoVg8EcyfaHKD2Zz0mUFxa2xm+LvXKeT49qqx xuodpKD3fiuM7h9Xgv+cDsmn8Rr8gSeXEGV7qzpudmkxbp6IVg/yG5hC/dM921GR EVrRtUIwdw== =aAPP -----END PGP SIGNATURE----- Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type	2019-09-24 16:31:50 -07:00
Linus Torvalds	299d14d4c3	pci-v5.4-changes -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl2JNVAUHGJoZWxnYWFz QGdvb2dsZS5jb20ACgkQWYigwDrT+vyTOA/9EZeyS7J+ZcOwihWz5vNijf0kfpKp /jZ9VF9nHjsL9Pw3/Fzha605Ssrtwcqge8g/sze9f0g/pxZk99lLHokE6dEOurEA GyKpNNMdiBol4YZMCsSoYji0MpwW0uMCuASPMiEwv2LxZ72A2Tu1RbgYLU+n4m1T fQldDTxsUMXc/OH/8SL8QDEh6o8qyDRhmSXFAOv8RGqN8N3iUwVwhQobKpwpmEvx ddzqWMS8f91qkhIKO7fgc9P4NI/7yI7kkF+wcdwtfiMO8Qkr4IdcdF7qwNVAtpKA A+sMRi59i2XxDTqRFx+wXXMa+rt+Pf1pucv77SO74xXWwpuXSxLVDYjULP1YQugK FTBo4SNmico/ts+n5cgm+CGMq2P2E29VYeqkI1Un6eDDvQnQlBgQdpdcBoadJ0rW y31OInjhRJC1ZK5bATKfCMbmB+VQxFsbyeUA7PBlrALyAmXZfw30iNxX9iHBhWqc myPNVEJJGp0cWTxGxMAU9MhelzeQxDAd+Eb44J5gv51bx0w9yqmZHECSDrOVdtYi HpOyI7E3Cb8m23BOHvCdB/v8igaYMZl08LUUJqu1S9mFclYyYVuOOIB04Yc2Qrx1 3PHtT8TC47FbWuzKwo12RflzoAiNShJGw+tNKo6T1jC+r5jdbKWWtTnsoRqbSfaG rG5RJpB7EuQSP1Y= =/xB3 -----END PGP SIGNATURE----- Merge tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it (Krzysztof Wilczynski) - Fix incorrect PCIe device types and remove dev->has_secondary_link to simplify code that deals with upstream/downstream ports (Mika Westerberg) - After suspend, restore Resizable BAR size bits correctly for 1MB BARs (Sumit Saxena) - Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra) Virtualization: - Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna Labs (Ali Saidi) - Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg) - Remove group write permissions from sysfs sriov_numvfs, sriov_drivers_autoprobe (Kelsey Skunberg) Hotplug: - Simplify pciehp indicator control (Denis Efremov) Peer-to-peer DMA: - Allow P2P DMA between root ports for whitelisted bridges (Logan Gunthorpe) - Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe) - DMA map P2P DMA requests that traverse host bridge (Logan Gunthorpe) Amazon Annapurna Labs host bridge driver: - Add DT binding and controller driver (Jonathan Chocron) Hyper-V host bridge driver: - Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui) - Fix PCI domain number collisions (Haiyang Zhang) - Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang) - Fix build errors on non-SYSFS config (Randy Dunlap) i.MX6 host bridge driver: - Limit DBI register length (Stefan Agner) Intel VMD host bridge driver: - Fix config addressing issues (Jon Derrick) Layerscape host bridge driver: - Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao) - Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately (Xiaowei Bao) Mediatek host bridge driver: - Add MT7629 controller support (Jianjun Wang) Mobiveil host bridge driver: - Fix CPU base address setup (Hou Zhiqiang) - Make "num-lanes" property optional (Hou Zhiqiang) Tegra host bridge driver: - Fix OF node reference leak (Nishka Dasgupta) - Disable MSI for root ports to work around design problem (Vidya Sagar) - Add Tegra194 DT binding and controller support (Vidya Sagar) - Add support for sideband pins and slot regulators (Vidya Sagar) - Add PIPE2UPHY support (Vidya Sagar) Misc: - Remove unused pci_block_cfg_access() et al (Kelsey Skunberg) - Unexport pci_bus_get(), etc (Kelsey Skunberg) - Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in the PCI core (Kelsey Skunberg) - Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg) - Mark expected switch fall-through (Gustavo A. R. Silva) - Propagate errors for optional regulators and PHYs (Thierry Reding) - Fix kernel command line resource_alignment parameter issues (Logan Gunthorpe)" * tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits) PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI arm64: tegra: Add PCIe slot supply information in p2972-0000 platform arm64: tegra: Add configuration for PCIe C5 sideband signals PCI: tegra: Add support to enable slot regulators PCI: tegra: Add support to configure sideband pins PCI: vmd: Fix shadow offsets to reflect spec changes PCI: vmd: Fix config addressing when using bus offsets PCI: dwc: Add validation that PCIe core is set to correct mode PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port PCI: Add ACS quirk for Amazon Annapurna Labs root ports PCI: Add Amazon's Annapurna Labs vendor ID MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries dt-bindings: PCI: tegra: Add sideband pins configuration entries PCI: tegra: Add Tegra194 PCIe support PCI: Get rid of dev->has_secondary_link flag ...	2019-09-23 19:16:01 -07:00
Balbir Singh	b224726de5	nvme-pci: Fix a race in controller removal User space programs like udevd may try to read to partitions at the same time the driver detects a namespace is unusable, and may deadlock if revalidate_disk() is called while such a process is waiting to enter the frozen queue. On detecting a dead namespace, move the disk revalidate after unblocking dispatchers that may be holding bd_butex. changelog Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Balbir Singh <sblbir@amzn.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-23 14:00:11 -07:00
John Pittman	0ec64895b0	nvmet: change ppl to lpp In nvmet_bdev_set_limits() the number of logical blocks per physical block is calculated, but the opposite is mentioned in the associated comment and reflected in the variable name. Correct the comment and adjust the variable name to reflect the calculation done. Signed-off-by: John Pittman <jpittman@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-23 13:59:49 -07:00
Max Gurtovoy	54d4e6ab91	block: centralize PI remapping logic to the block layer Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-09-17 20:03:49 -06:00
Linus Torvalds	7ad67ca553	for-5.4/block-2019-09-16 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/ EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV xsKgmTrQQQ== =Qt+h -----END PGP SIGNATURE----- Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...	2019-09-17 16:57:47 -07:00
Amit	5f8badbcbe	nvmet: fix a wrong error status returned in error log page When the command data_len cannot hold all the controller errors, we should simply return as much errors as we can fit instead of failing the command. Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Sagi Grimberg	85f8a4351d	nvme: send discovery log page change events to userspace If the controller supports discovery log page change events, we want to enable it. When we see a discovery log change event we will send it up to userspace and expect it to handle it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Sagi Grimberg	a42f42e5bb	nvme: add uevent variables for controller devices When we send uevents to userspace, add controller specific environment variables to uniquly identify the controller beyond its device name. This will be useful to address discovery log change events by actually verifying that the discovery controller is indeed the same as the device that generated the event. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Sagi Grimberg	93da40239b	nvme: enable aen regardless of the presence of I/O queues AENs in general are not related to the presence of I/O queues, so enable them regardless. Note that the only exception is that discovery controller will not support any of the requested AENs and nvme_enable_aen will respect that and return, so it is still safe to enable regardless. Note it is safe to enable AENs even before the initial namespace scanning as we have the scan operation in a workqueue context. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Sagi Grimberg	2d352df57b	nvme-fabrics: allow discovery subsystems accept a kato This modifies the behavior of discovery subsystems to accept a kato as a preparation to support discovery log change events. This also means that now every discovery controller will have a default kato value, and for non-persistent connections the host needs to pass in a zero kato value (keep_alive_tmo=0). Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Markus Elfring	1179d337be	nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() Simplify this function implementation by using a known function. Generated by: scripts/coccinelle/api/ptr_ret.cocci Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Israel Rukshin	97b3807e93	nvme: Remove redundant assignment of cq vector The cq vector is already assigned with the correct value. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Keith Busch	733e4b69d5	nvme: Assign subsys instance from first ctrl The namespace disk names must be unique for the lifetime of the subsystem. This was accomplished by using their parent subsystems' instances which were allocated independently from the controllers connected to that subsystem. This allowed name prefixes assigned to namespaces to match a controller from an unrelated subsystem, and has created confusion among users examining device nodes. Ensure a namespace's subsystem instance never clashes with a controller instance of another subsystem by transferring the instance ownership to the parent subsystem from the first controller discovered in that subsystem. Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Colin Ian King	312910f4d2	nvme: tcp: remove redundant assignment to variable ret The variable ret is being initialized with a value that is never read and is being re-assigned immediately afterwards. The assignment is redundant and hence can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:46 -07:00
Edmund Nadolski	03894b7a89	nvme: include admin_q sync with nvme_sync_queues nvme_sync_queues currently syncs all namespace queues, but should also sync the admin queue, if present. Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
James Smart	c26aa57202	nvme: Treat discovery subsystems as unique subsystems Current code matches subnqn and collapses all controllers to the same subnqn to a single subsystem structure. This is good for recognizing multiple controllers for the same subsystem. But with the well-known discovery subnqn, the subsystems aren't truly the same subsystem. As such, subsystem specific rules, such as no overlap of controller id, do not apply. With today's behavior, the check for overlap of controller id can fail, preventing the new discovery controller from being created. When searching for like subsystem nqn, exclude the discovery nqn from matching. This will result in each discovery controller being attached to a unique subsystem structure. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	205da24343	nvme: fix ns removal hang when failing to revalidate due to a transient error If a controller reset is racing with a namespace revalidation, the revalidation (admin) I/O will surely fail, but we should not remove the namespace as we will execute the I/O when the controller is back up. Same for spurious allocation errors (return -ENOMEM). Fix this by checking the specific error code in nvme_revalidate_disk and if it is a transient error (for example non DNR nvme statuses or a negative ENOMEM as allocation failure), do not remove the namespace as it will either recover when the controller is back up and schedule a subsequent scan, or the controller is going away and the namespaces will be removed anyways. This fixes a hang namespace scanning racing with a controller reset and also sporious I/O errors in path failover coditions where the controller reset is racing with the namespace scan work with multipath enabled. Reported-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	538af88ea7	nvme: make nvme_report_ns_ids propagate error back Make the callers check the return status and propagate back accordingly (casting to errno from a positive nvme status). Also print the return status in nvme_report_ns_ids. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	331813f687	nvme: make nvme_identify_ns propagate errors back right now callers of nvme_identify_ns only know that it failed, but don't know why. Make nvme_identify_ns propagate the error back. Because nvme_submit_sync_cmd may return a positive status code, we make nvme_identify_ns receive the id by reference and return that status up the call chain, but make sure not to leak positive nvme status codes to the upper layers. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	2f9c173647	nvme: pass status to nvme_error_status No need for the full blown request structure. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
James Smart	74bd8cbe7d	nvme-fc: Fail transport errors with NVME_SC_HOST_PATH NVME_SC_INTERNAL should indicate an internal controller errors and not host transport errors. These errors will propagate to upper layers (essentially nvme core) and be interpereted as transport errors which should not be taken into account for namespace state or condition. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	1668601008	nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failed This is a more appropriate error status for a transport error detected by us (the host). Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Sagi Grimberg	1c0d12c0b1	nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERROR NVME_SC_ABORT_REQ means that the request was aborted due to an abort command received. In our case, this is a transport cancellation, so host pathing error is much more appropriate. Also, convert NVME_SC_HOST_PATH_ERROR to BLK_STS_TRANSPORT for such that callers can understand that the status is a transport related error. This will be used by the ns scanning code to understand if it got an error from the controller or that the controller happens to be unreachable by the transport. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-09-12 08:50:45 -07:00
Israel Rukshin	bc31c1eea9	nvme-rdma: Use rq_dma_dir macro Remove code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:03 -07:00
Israel Rukshin	f15872c5dc	nvme-fc: Use rq_dma_dir macro Remove code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:03 -07:00
Israel Rukshin	f2fa006f81	nvme-pci: Tidy up nvme_unmap_data Remove pointless local variable and use rq_dma_dir macro. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:03 -07:00
Sagi Grimberg	e7832cb48a	nvme: make fabrics command run on a separate request queue We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:03 -07:00
Benjamin Herrenschmidt	d38e9f04eb	nvme-pci: Support shared tags across queues for Apple 2018 controllers Another issue with the Apple T2 based 2018 controllers seem to be that they blow up (and shut the machine down) if there's a tag collision between the IO queue and the Admin queue. My suspicion is that they use our tags for their internal tracking and don't mix them with the queue id. They also seem to not like when tags go beyond the IO queue depth, ie 128 tags. This adds a quirk that marks tags 0..31 of the IO queue reserved Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Benjamin Herrenschmidt	66341331ba	nvme-pci: Add support for Apple 2018+ models Based on reverse engineering and original patch by Paul Pawlowski <paul@mrarm.io> This adds support for Apple weird implementation of NVME in their 2018 or later machines. It accounts for the twice-as-big SQ entries for the IO queues, and the fact that only interrupt vector 0 appears to function properly. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Benjamin Herrenschmidt	c1e0cc7e1d	nvme-pci: Add support for variable IO SQ element size The size of a submission queue element should always be 6 (64 bytes) by spec. However some controllers such as Apple's are not properly implementing the standard and require a different size. This provides the ground work for the subsequent quirks for these controllers. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Benjamin Herrenschmidt	8a1d09a668	nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros This will make it easier to handle variable queue entry sizes later. No functional change. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Hannes Reinecke	35fe0d12c8	nvme: trace bio completion When native multipathing is enabled we cannot enable blktrace for the underlying paths, so any completion is never traced. Signed-off-by: Hannes Reinecke <hare@suse.com> [fixed-up by Mikhail for non-multipath-build] Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Anton Eidelman	e01f91dff9	nvme-multipath: fix ana log nsid lookup when nsid is not found ANA log parsing invokes nvme_update_ana_state() per ANA group desc. This updates the state of namespaces with nsids in desc->nsids[]. Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid. Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces: - if current namespace matches the current desc->nsids[n], this namespace is updated, and n is incremented. - the process stops when it encounters the end of either ctrl->namespaces end or desc->nsids[] In case desc->nsids[n] does not match any of ctrl->namespaces, the remaining nsids following desc->nsids[n] will not be updated. Such situation was considered abnormal and generated WARN_ON_ONCE. However ANA log MAY contain nsids not (yet) found in ctrl->namespaces. For example, lets consider the following scenario: - nvme0 exposes namespaces with nsids = [2, 3] to the host - a new namespace nsid = 1 is added dynamically - also, a ANA topology change is triggered - NS_CHANGED aen is generated and triggers scan_work - before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA aen was issues and ana_work receives ANA log with nsids=[1, 2, 3] Result: ana_work fails to update ANA state on existing namespaces [2, 3] Solution: Change the way nvme_update_ana_state() namespace list walk checks the current namespace against desc->nsids[n] as follows: a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces. b) ns->head->ns_id == desc->nsids[n]: match, update the namespace c) ns->head->ns_id >= desc->nsids[n]: skip to desc->nsids[n+1] This enables correct operation in the scenario described above. This also allows ANA log to contain nsids currently invisible to the host, i.e. inactive nsids. Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Israel Rukshin	89275a9659	nvmet-tcp: Add TOS for tcp transport Set the outgoing packets type of service (TOS) according to the receiving TOS. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Israel Rukshin	bb13985d5a	nvme-tcp: Add TOS for tcp transport TOS provide clients the ability to segregate traffic flows for different type of data. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. usage examples: nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Israel Rukshin	9924b0304a	nvme-tcp: Use struct nvme_ctrl directly This patch doesn't change any functionality. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Israel Rukshin	e63440d6a3	nvme-rdma: Add TOS for rdma transport For RDMA transports, TOS is an extension of IB QoS to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. Note: In addition to the TOS configuration, QOS must be configured on the relevant HCA on the target (send RDMA commands) and initiator to effect the traffic. usage examples: nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:02 -07:00
Israel Rukshin	52b4451a9e	nvme-fabrics: Add type of service (TOS) configuration TOS is user-defined and needs to be configured via nvme-cli. It must be set before initiating any traffic and once set the TOS cannot be changed. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Sagi Grimberg	35d1a938dc	nvmet-tcp: fix possible memory leak when we uninit a command in error flow we also need to free an iovec if it was allocated. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Sagi Grimberg	b627200762	nvmet-tcp: fix possible NULL deref We must only call sgl_free for sgl that we actually allocated. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Minwoo Im	42df26d4df	nvmet: trace: parse Get LBA Status command in detail Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing in target side also. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Minwoo Im	177b06ed09	nvme: trace: parse Get LBA Status command in detail Four different fields are in CDWs of Get LBA Status command which means it would be great if we can see in detail when tracing. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Tom Wu	3bec2e3754	nvmet: fix data units read and written counters in SMART log In nvme spec 1.3 there is a definition for data write/read counters from SMART log, (See section 5.14.1.2): This value is reported in thousands (i.e., a value of 1 corresponds to 1000 units of 512 bytes read) and is rounded up. However, in nvme target where value is reported with actual units, but not thousands of units as the spec requires. Signed-off-by: Tom Wu <tomwu@mellanox.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Sagi Grimberg	1a9460cef5	nvme-tcp: support simple polling Simple polling support via socket busy_poll interface. Although we do not shutdown interrupts but simply hammer the socket poll, we can sometimes find completions faster than the normal interrupt driven RX path. We add per queue nr_cqe counter that resets every time RX path is invoked such that .poll callback can return it to stay consistent with the semantics. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Minwoo Im	79fd751d61	nvme: tcp: selects CRYPTO_CRC32C for nvme-tcp The tcp host module is now taking those APIs from crypto ahash: (1) crypto_ahash_final() (2) crypto_ahash_digest() (3) crypto_alloc_ahash() nvme-tcp should depends on CRYPTO_CRC32C. Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:01 -07:00
Sagi Grimberg	b5b0504878	nvme: don't pass cap to nvme_disable_ctrl All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Sagi Grimberg	c0f2f45be2	nvme: move sqsize setting to the core nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Sagi Grimberg	aa22c8e665	nvme-pci: set ctrl sqsize to the device q_depth Align with what the rest of the transports are doing. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Sagi Grimberg	4fba445828	nvme: have nvme_init_identify set ctrl->cap No need to use a stack cap variable. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Potnuri Bharat Teja	10407ec9b4	nvme-tcp: Use protocol specific operations while reading socket Using socket specific read_sock() calls instead of directly calling tcp_read_sock() helps lld module registered handlers if any, to be called from nvme-tcp host. This patch therefore replaces the tcp_read_sock() with socket specific prot_ops. Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Sagi Grimberg	6be182607d	nvme-tcp: cleanup nvme_tcp_recv_pdu Can return directly in the switch statement Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-08-29 12:55:00 -07:00
Mario Limonciello	cb32de1b7e	nvme: Add quirk for LiteON CL1 devices running FW 22301111 One of the components in LiteON CL1 device has limitations that can be encountered based upon boundary race conditions using the nvme bus specific suspend to idle flow. When this situation occurs the drive doesn't resume properly from suspend-to-idle. LiteON has confirmed this problem and fixed in the next firmware version. As this firmware is already in the field, avoid running nvme specific suspend to idle flow. Fixes: `d916b1be94` ("nvme-pci: use host managed power state for suspend") Link: http://lists.infradead.org/pipermail/linux-nvme/2019-July/thread.html Signed-off-by: Mario Limonciello <mario.limonciello@dell.com> Signed-off-by: Charles Hyde <charles.hyde@dellteam.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 11:02:10 -06:00
Guilherme G. Piccoli	a89fcca818	nvme: Fix cntlid validation when not using NVMEoF Commit `1b1031ca63` ("nvme: validate cntlid during controller initialisation") introduced a validation for controllers with duplicate cntlid that runs on nvme_init_subsystem(). The problem is that the validation relies on ctrl->cntlid, and this value is assigned (from id_ctrl value) after the call for nvme_init_subsystem() in nvme_init_identify() for non-fabrics scenario. That leads to ctrl->cntlid always being 0 in case we have a physical set of controllers in the same subsystem. This patch fixes that by loading the discovered cntlid id_ctrl value into ctrl->cntlid before the subsystem initialization, only for the non-fabrics case. The patch was tested with emulated nvme devices (qemu) having two controllers in a single subsystem. Without the patch, we couldn't make it work failing in the duplicate check; when running with the patch, we could see the subsystem holding both controllers. For the fabrics case we see ctrl->cntlid has a more intricate relation with the admin connect, so we didn't change that. Fixes: `1b1031ca63` ("nvme: validate cntlid during controller initialisation") Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 11:02:10 -06:00
Anton Eidelman	504db087aa	nvme-multipath: fix possible I/O hang when paths are updated nvme_state_set_live() making a path available triggers requeue_work in order to resubmit requests that ended up on requeue_list when no paths were available. This requeue_work may race with concurrent nvme_ns_head_make_request() that do not observe the live path yet. Such concurrent requests may by made by either: - New IO submission. - Requeue_work triggered by nvme_failover_req() or another ana_work. A race may cause requeue_work capture the state of requeue_list before more requests get onto the list. These requests will stay on the list forever unless requeue_work is triggered again. In order to prevent such race, nvme_state_set_live() should synchronize_srcu(&head->srcu) before triggering the requeue_work and prevent nvme_ns_head_make_request referencing an old snapshot of the path list. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-20 11:02:10 -06:00
Linus Torvalds	8fde2832bd	for-linus-2019-08-17 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1Ygq4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjUnD/47XM1xHO2L/66+XiRnXwNc0Zmc0cNrtIsp 09EqZFeSwaqJKfoodcDphuHSYf6jkLk9xRKiFEr+KrUfzwvmgN3fFlbk6Azz0A4f ACv4DtTPML40QUGzo5c+UfnJTwK8z4An/lG4aNDB3UucFwoWhfvLkzQAN+FNilAQ sMTWPvhjiV/E7BMa7p108/NL9sv0UzUzgIOI7cZCq/Z5/jJBiohl4DzPPyfCKzOK shKP3Q7zId2q7d9zJZZuwy/iYLTCEHzy+37hxS7343vXeTlxKbrxY6xKZ8Rhsccx E3ep+SYuRxl7BHXPTwFh5HSMRXq5JEVBy3dnoMzFEu6pqzAPyLjQS0tG/lR9T1uS E8t0QnsdjVvedF/c3a3b2pVcYCbRjHsB/5DLBPxRfgAPkvVm180XKQM95HkLQYle A74mgU1WODnXsdyJeAacf1b+3t8p+5x5vpRq6Ls6dgmGn74qhfvs4tQnqypFFnyR Le0O7ICWKotQFo0CzbPB029xORepq9HLkFEeh73K6uG+Bn5iGMC1oUmxB0uSuHpz h9lPDjjj6RSLpzp1jCqswSHsQi00sLSKMtytjEt83/EWPcyN2oiPP1BTq+9jITYR SvoKF6R+CLwQscb/DEXb/vRSehg5aivXHg1AOo0l6/OiPQmNdmSZEU4/McPoB2xW 1MC5Y4wV1g== =f6DS -----END PGP SIGNATURE----- Merge tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A collection of fixes that should go into this series. This contains: - Revert of the REQ_NOWAIT_INLINE and associated dio changes. There were still corner cases there, and even though I had a solution for it, it's too involved for this stage. (me) - Set of NVMe fixes (via Sagi) - io_uring fix for fixed buffers (Anthony) - io_uring defer issue fix (Jackie) - Regression fix for queue sync at exit time (zhengbin) - xen blk-back memory leak fix (Wenwen)" * tag 'for-linus-2019-08-17' of git://git.kernel.dk/linux-block: io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list block: remove REQ_NOWAIT_INLINE io_uring: fix manual setup of iov_iter for fixed buffers xen/blkback: fix memory leaks blk-mq: move cancel of requeue_work to the front of blk_exit_queue nvme-pci: Fix async probe remove race nvme: fix controller removal race with scan work nvme-rdma: fix possible use-after-free in connect error flow nvme: fix a possible deadlock when passthru commands sent to a multipath device nvme-core: Fix extra device_put() call on error path nvmet-file: fix nvmet_file_flush() always returning an error nvmet-loop: Flush nvme_delete_wq when removing the port nvmet: Fix use-after-free bug when a port is removed nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns	2019-08-17 19:39:54 -07:00
Logan Gunthorpe	7f73eac3a7	PCI/P2PDMA: Introduce pci_p2pdma_unmap_sg() Add pci_p2pdma_unmap_sg() to the two places that call pci_p2pdma_map_sg(). This is a prep patch to introduce correct mappings for p2pdma transactions that go through the root complex. Link: https://lore.kernel.org/r/20190730163545.4915-10-logang@deltatee.com Link: https://lore.kernel.org/r/20190812173048.9186-10-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2019-08-16 08:41:26 -05:00
Logan Gunthorpe	2b9f4bb2a4	PCI/P2PDMA: Add attrs argument to pci_p2pdma_map_sg() This is to match the dma_map_sg() API which this function will have to call in an future patch. Add a pci_p2pdma_map_sg_attrs() function and helper to call it with no attributes just like the dma_map_sg() function. Link: https://lore.kernel.org/r/20190730163545.4915-9-logang@deltatee.com Link: https://lore.kernel.org/r/20190812173048.9186-9-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2019-08-16 08:41:15 -05:00
Rafael J. Wysocki	4eaefe8c62	nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled One of the modifications made by commit `d916b1be94` ("nvme-pci: use host managed power state for suspend") was adding a pci_save_state() call to nvme_suspend() so as to instruct the PCI bus type to leave devices handled by the nvme driver in D0 during suspend-to-idle. That was done with the assumption that ASPM would transition the device's PCIe link into a low-power state when the device became inactive. However, if ASPM is disabled for the device, its PCIe link will stay in L0 and in that case commit `d916b1be94` is likely to cause the energy used by the system while suspended to increase. Namely, if the device in question works in accordance with the PCIe specification, putting it into D3hot causes its PCIe link to go to L1 or L2/L3 Ready, which is lower-power than L0. Since the energy used by the system while suspended depends on the state of its PCIe link (as a general rule, the lower-power the state of the link, the less energy the system will use), putting the device into D3hot during suspend-to-idle should be more energy-efficient that leaving it in D0 with disabled ASPM. For this reason, avoid leaving NVMe devices with disabled ASPM in D0 during suspend-to-idle. Instead, shut them down entirely and let the PCI bus type put them into D3. Fixes: `d916b1be94` ("nvme-pci: use host managed power state for suspend") Link: https://lore.kernel.org/linux-pm/2763495.NmdaWeg79L@kreacher/T/#t Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com>	2019-08-12 10:47:55 +02:00
Hans Holmberg	48e5da7255	lightnvm: move metadata mapping to lower level driver Now that blk_rq_map_kern can map both kmem and vmem, move internal metadata mapping down to the lower level driver. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hans Holmberg <hans@owltronix.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-06 08:20:10 -06:00
Hans Holmberg	98d87f70f4	lightnvm: remove nvm_submit_io_sync_fn Move the redundant sync handling interface and wait for a completion in the lightnvm core instead. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hans Holmberg <hans@owltronix.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-06 08:20:09 -06:00
Ming Lei	a87ccce0b5	blk-mq: remove blk_mq_complete_request_sync blk_mq_tagset_wait_completed_request() has been applied for waiting for completed request's fn, so not necessary to use blk_mq_complete_request_sync() any more. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Ming Lei	622b8b6893	nvme: wait until all completed request's complete fn is called When aborting in-flight request for recovering controller, we have to make sure that queue's complete function is called on completed request before moving on. Otherwise, for example, the warning of WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be triggered on nvme-rdma. Fix this issue by using blk_mq_tagset_wait_completed_request. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Ming Lei	78ca407247	nvme: don't abort completed request in nvme_cancel_request Before aborting in-flight requests, all IO queues and their interrupts have been shutdown. However, request's completion function may not be done yet because it can be scheduled to run via IPI. So don't abort one request if it is marked as completed, otherwise we may abort one normal completed request. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-08-04 21:41:29 -06:00
Keith Busch	bd46a90634	nvme-pci: Fix async probe remove race Ensure the controller is not in the NEW state when nvme_probe() exits. This will always allow a subsequent nvme_remove() to set the state to DELETING, fixing a potential race between the initial asynchronous probe and device removal. Reported-by: Li Zhong <lizhongfs@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 18:03:36 -07:00
Sagi Grimberg	0157ec8dad	nvme: fix controller removal race with scan work With multipath enabled, nvme_scan_work() can read from the device (through nvme_mpath_add_disk()) and hang [1]. However, with fabrics, once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang (see nvmf_check_ready()) and the mpath stack device make_request will block if head->list is not empty. However, when the head->list consistst of only DELETING/DEAD controllers, we should actually not block, but rather fail immediately. In addition, before we go ahead and remove the namespaces, make sure to clear the current path and kick the requeue list so that the request will fast fail upon requeuing. [1]: -- INFO: task kworker/u4:3:166 blocked for more than 120 seconds. Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u4:3 D 0 166 2 0x80004000 Workqueue: nvme-wq nvme_scan_work Call Trace: __schedule+0x851/0x1400 schedule+0x99/0x210 io_schedule+0x21/0x70 do_read_cache_page+0xa57/0x1330 read_cache_page+0x4a/0x70 read_dev_sector+0xbf/0x380 amiga_partition+0xc4/0x1230 check_partition+0x30f/0x630 rescan_partitions+0x19a/0x980 __blkdev_get+0x85a/0x12f0 blkdev_get+0x2a5/0x790 __device_add_disk+0xe25/0x1250 device_add_disk+0x13/0x20 nvme_mpath_set_live+0x172/0x2b0 nvme_update_ns_ana_state+0x130/0x180 nvme_set_ns_ana_state+0x9a/0xb0 nvme_parse_ana_log+0x1c3/0x4a0 nvme_mpath_add_disk+0x157/0x290 nvme_validate_ns+0x1017/0x1bd0 nvme_scan_work+0x44d/0x6a0 process_one_work+0x7d7/0x1240 worker_thread+0x8e/0xff0 kthread+0x2c3/0x3b0 ret_from_fork+0x35/0x40 INFO: task kworker/u4:1:1034 blocked for more than 120 seconds. Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u4:1 D 0 1034 2 0x80004000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work Call Trace: __schedule+0x851/0x1400 schedule+0x99/0x210 schedule_timeout+0x390/0x830 wait_for_completion+0x1a7/0x310 __flush_work+0x241/0x5d0 flush_work+0x10/0x20 nvme_remove_namespaces+0x85/0x3d0 nvme_do_delete_ctrl+0xb4/0x1e0 nvme_delete_ctrl_work+0x15/0x20 process_one_work+0x7d7/0x1240 worker_thread+0x8e/0xff0 kthread+0x2c3/0x3b0 ret_from_fork+0x35/0x40 -- Reported-by: Logan Gunthorpe <logang@deltatee.com> Tested-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 18:01:56 -07:00
Sagi Grimberg	d94211b8ba	nvme-rdma: fix possible use-after-free in connect error flow When start_queue fails, we need to make sure to drain the queue cq before freeing the rdma resources because we might still race with the completion path. Have start_queue() error path safely stop the queue. -- [30371.808111] nvme nvme1: Failed reconnect attempt 11 [30371.808113] nvme nvme1: Reconnecting in 10 seconds... [...] [30382.069315] nvme nvme1: creating 4 I/O queues. [30382.257058] nvme nvme1: Connect Invalid SQE Parameter, qid 4 [30382.257061] nvme nvme1: failed to connect queue: 4 ret=386 [30382.305001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [30382.305022] IP: qedr_poll_cq+0x8a3/0x1170 [qedr] [30382.305028] PGD 0 P4D 0 [30382.305037] Oops: 0000 [#1] SMP PTI [...] [30382.305153] Call Trace: [30382.305166] ? __switch_to_asm+0x34/0x70 [30382.305187] __ib_process_cq+0x56/0xd0 [ib_core] [30382.305201] ib_poll_handler+0x26/0x70 [ib_core] [30382.305213] irq_poll_softirq+0x88/0x110 [30382.305223] ? sort_range+0x20/0x20 [30382.305232] __do_softirq+0xde/0x2c6 [30382.305241] ? sort_range+0x20/0x20 [30382.305249] run_ksoftirqd+0x1c/0x60 [30382.305258] smpboot_thread_fn+0xef/0x160 [30382.305265] kthread+0x113/0x130 [30382.305273] ? kthread_create_worker_on_cpu+0x50/0x50 [30382.305281] ret_from_fork+0x35/0x40 -- Reported-by: Nicolas Morey-Chaisemartin <NMoreyChaisemartin@suse.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 18:00:07 -07:00
Sagi Grimberg	b9156daeb1	nvme: fix a possible deadlock when passthru commands sent to a multipath device When the user issues a command with side effects, we will end up freezing the namespace request queue when updating disk info (and the same for the corresponding mpath disk node). However, we are not freezing the mpath node request queue, which means that mpath I/O can still come in and block on blk_queue_enter (called from nvme_ns_head_make_request -> direct_make_request). This is a deadlock, because blk_queue_enter will block until the inner namespace request queue is unfroze, but that process is blocked because the namespace revalidation is trying to update the mpath disk info and freeze its request queue (which will never complete because of the I/O that is blocked on blk_queue_enter). Fix this by freezing all the subsystem nsheads request queues before executing the passthru command. Given that these commands are infrequent we should not worry about this temporary I/O freeze to keep things sane. Here is the matching hang traces: -- [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds. [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42 [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 374.487274] systemd-udevd D 0 17994 1 0x00000000 [ 374.493407] Call Trace: [ 374.496145] __schedule+0x2ef/0x620 [ 374.500047] schedule+0x38/0xa0 [ 374.503569] blk_queue_enter+0x139/0x220 [ 374.507959] ? remove_wait_queue+0x60/0x60 [ 374.512540] direct_make_request+0x60/0x130 [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core] [ 374.523740] ? generic_make_request_checks+0x307/0x6f0 [ 374.529484] generic_make_request+0x10d/0x2e0 [ 374.534356] submit_bio+0x75/0x140 [ 374.538163] ? guard_bio_eod+0x32/0xe0 [ 374.542361] submit_bh_wbc+0x171/0x1b0 [ 374.546553] block_read_full_page+0x1ed/0x330 [ 374.551426] ? check_disk_change+0x70/0x70 [ 374.556008] ? scan_shadow_nodes+0x30/0x30 [ 374.560588] blkdev_readpage+0x18/0x20 [ 374.564783] do_read_cache_page+0x301/0x860 [ 374.569463] ? blkdev_writepages+0x10/0x10 [ 374.574037] ? prep_new_page+0x88/0x130 [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280 [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320 [ 374.588947] read_cache_page+0x12/0x20 [ 374.593142] read_dev_sector+0x2d/0xd0 [ 374.597337] read_lba+0x104/0x1f0 [ 374.601046] find_valid_gpt+0xfa/0x720 [ 374.605243] ? string_nocheck+0x58/0x70 [ 374.609534] ? find_valid_gpt+0x720/0x720 [ 374.614016] efi_partition+0x89/0x430 [ 374.618113] ? string+0x48/0x60 [ 374.621632] ? snprintf+0x49/0x70 [ 374.625339] ? find_valid_gpt+0x720/0x720 [ 374.629828] check_partition+0x116/0x210 [ 374.634214] rescan_partitions+0xb6/0x360 [ 374.638699] __blkdev_reread_part+0x64/0x70 [ 374.643377] blkdev_reread_part+0x23/0x40 [ 374.647860] blkdev_ioctl+0x48c/0x990 [ 374.651956] block_ioctl+0x41/0x50 [ 374.655766] do_vfs_ioctl+0xa7/0x600 [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150 [ 374.664832] ksys_ioctl+0x67/0x90 [ 374.668539] __x64_sys_ioctl+0x1a/0x20 [ 374.672732] do_syscall_64+0x5a/0x1c0 [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds. [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42 [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 374.760170] nvmeadm D 0 49141 36333 0x00004080 [ 374.766301] Call Trace: [ 374.769038] __schedule+0x2ef/0x620 [ 374.772939] schedule+0x38/0xa0 [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100 [ 374.781614] ? remove_wait_queue+0x60/0x60 [ 374.786192] blk_mq_freeze_queue+0x1a/0x20 [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core] [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core] [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core] [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core] [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core] [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core] [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core] [ 374.833114] do_vfs_ioctl+0xa7/0x600 [ 374.837117] ? __audit_syscall_entry+0xdd/0x130 [ 374.842184] ksys_ioctl+0x67/0x90 [ 374.845891] __x64_sys_ioctl+0x1a/0x20 [ 374.850082] do_syscall_64+0x5a/0x1c0 [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9 -- Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com> Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 17:59:01 -07:00
Logan Gunthorpe	8c36e66fb4	nvme-core: Fix extra device_put() call on error path In the error path for nvme_init_subsystem(), nvme_put_subsystem() will call device_put(), but it will get called again after the mutex_unlock(). The device_put() only needs to be called if device_add() fails. This bug caused a KASAN use-after-free error when adding and removing subsytems in a loop: BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0 Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329 CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x7b/0xb5 print_address_description+0x6f/0x280 ? device_del+0x8d9/0x9a0 __kasan_report+0x148/0x199 ? device_del+0x8d9/0x9a0 ? class_release+0x100/0x130 ? device_del+0x8d9/0x9a0 kasan_report+0x12/0x20 __asan_report_load8_noabort+0x14/0x20 device_del+0x8d9/0x9a0 ? device_platform_notify+0x70/0x70 nvme_destroy_subsystem+0xf9/0x150 nvme_free_ctrl+0x280/0x3a0 device_release+0x72/0x1d0 kobject_put+0x144/0x410 put_device+0x13/0x20 nvme_free_ns+0xc4/0x100 nvme_release+0xb3/0xe0 __blkdev_put+0x549/0x6e0 ? kasan_check_write+0x14/0x20 ? bd_set_size+0xb0/0xb0 ? kasan_check_write+0x14/0x20 ? mutex_lock+0x8f/0xe0 ? __mutex_lock_slowpath+0x20/0x20 ? locks_remove_file+0x239/0x370 blkdev_put+0x72/0x2c0 blkdev_close+0x8d/0xd0 __fput+0x256/0x770 ? _raw_read_lock_irq+0x40/0x40 ____fput+0xe/0x10 task_work_run+0x10c/0x180 ? filp_close+0xf7/0x140 exit_to_usermode_loop+0x151/0x170 do_syscall_64+0x240/0x2e0 ? prepare_exit_to_usermode+0xd5/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5a79af05d7 Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24 RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008 RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80 R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240 R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0 Allocated by task 1236: save_stack+0x21/0x80 __kasan_kmalloc.constprop.6+0xab/0xe0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x102/0x210 nvme_init_identify+0x13c3/0x3820 nvme_loop_configure_admin_queue+0x4fa/0x5e0 nvme_loop_create_ctrl+0x469/0xf40 nvmf_dev_write+0x19a3/0x21ab __vfs_write+0x66/0x120 vfs_write+0x154/0x490 ksys_write+0x104/0x240 __x64_sys_write+0x73/0xb0 do_syscall_64+0xa5/0x2e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 329: save_stack+0x21/0x80 __kasan_slab_free+0x129/0x190 kasan_slab_free+0xe/0x10 kfree+0xa7/0x200 nvme_release_subsystem+0x49/0x60 device_release+0x72/0x1d0 kobject_put+0x144/0x410 put_device+0x13/0x20 klist_class_dev_put+0x31/0x40 klist_put+0x8f/0xf0 klist_del+0xe/0x10 device_del+0x3a7/0x9a0 nvme_destroy_subsystem+0xf9/0x150 nvme_free_ctrl+0x280/0x3a0 device_release+0x72/0x1d0 kobject_put+0x144/0x410 put_device+0x13/0x20 nvme_free_ns+0xc4/0x100 nvme_release+0xb3/0xe0 __blkdev_put+0x549/0x6e0 blkdev_put+0x72/0x2c0 blkdev_close+0x8d/0xd0 __fput+0x256/0x770 ____fput+0xe/0x10 task_work_run+0x10c/0x180 exit_to_usermode_loop+0x151/0x170 do_syscall_64+0x240/0x2e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: `32fd90c407` ("nvme: change locking for the per-subsystem controller list") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 17:57:24 -07:00
Logan Gunthorpe	cfc1a1af56	nvmet-file: fix nvmet_file_flush() always returning an error Presently, nvmet_file_flush() always returns a call to errno_to_nvme_status() but that helper doesn't take into account the case when errno=0. So nvmet_file_flush() always returns an error code. All other callers of errno_to_nvme_status() check for success before calling it. To fix this, ensure errno_to_nvme_status() returns success if the errno is zero. This should prevent future mistakes like this from happening. Fixes: `c6aa3542e0` ("nvmet: add error log support for file backend") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 17:57:21 -07:00
Logan Gunthorpe	86b9a63e59	nvmet-loop: Flush nvme_delete_wq when removing the port After calling nvme_loop_delete_ctrl(), the controllers will not yet be deleted because nvme_delete_ctrl() only schedules work to do the delete. This means a race can occur if a port is removed but there are still active controllers trying to access that memory. To fix this, flush the nvme_delete_wq before returning from nvme_loop_remove_port() so that any controllers that might be in the process of being deleted won't access a freed port. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 17:57:17 -07:00
Logan Gunthorpe	3aed86731e	nvmet: Fix use-after-free bug when a port is removed When a port is removed through configfs, any connected controllers are still active and can still send commands. This causes a use-after-free bug which is detected by KASAN for any admin command that dereferences req->port (like in nvmet_execute_identify_ctrl). To fix this, disconnect all active controllers when a subsystem is removed from a port. This ensures there are no active controllers when the port is eventually removed. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-31 17:57:06 -07:00
Anthony Iliopoulos	fab7772bfb	nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns When CONFIG_NVME_MULTIPATH is set, only the hidden gendisk associated with the per-controller ns is run through revalidate_disk when a rescan is triggered, while the visible blockdev never gets its size (bdev->bd_inode->i_size) updated to reflect any capacity changes that may have occurred. This prevents online resizing of nvme block devices and in extension of any filesystems atop that will are unable to expand while mounted, as userspace relies on the blockdev size for obtaining the disk capacity (via BLKGETSIZE/64 ioctls). Fix this by explicitly revalidating the actual namespace gendisk in addition to the per-controller gendisk, when multipath is enabled. Signed-off-by: Anthony Iliopoulos <ailiopoulos@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-30 13:29:20 +03:00
yangerkun	8fe34be14e	Revert "nvme-pci: don't create a read hctx mapping without read queues" This reverts commit `0298d54352`. With this patch, set 'poll_queues > hard queues' will lead to 'nr_read_queues = 0' in nvme_calc_irq_sets. Then poll_queues setting can fail since dev->tagset.nr_maps equals to 2 and nvme_pci_map_queues will not do map for poll queues. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-23 17:47:02 +02:00
Marta Rybczynska	66b20ac0a1	nvme: fix multipath crash when ANA is deactivated Fix a crash with multipath activated. It happends when ANA log page is larger than MDTS and because of that ANA is disabled. The driver then tries to access unallocated buffer when connecting to a nvme target. The signature is as follows: [ 300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192). [ 300.435387] nvme nvme0: disabling ANA support. [ 300.437835] nvme nvme0: creating 4 I/O queues. [ 300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009 [ 300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 300.466342] #PF error: [normal kernel read fault] [ 300.467385] PGD 0 P4D 0 [ 300.467987] Oops: 0000 [#1] SMP PTI [ 300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4 [ 300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296 [ 300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000 [ 300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258 [ 300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044 [ 300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0 [ 300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0 [ 300.486626] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000 [ 300.488538] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0 [ 300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 300.494991] Call Trace: [ 300.495645] nvme_mpath_add_disk+0x5c/0xb0 [nvme_core] [ 300.496880] nvme_validate_ns+0x2ef/0x550 [nvme_core] [ 300.498105] ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core] [ 300.499539] nvme_scan_work+0x2b4/0x370 [nvme_core] [ 300.500717] ? __switch_to_asm+0x35/0x70 [ 300.501663] process_one_work+0x171/0x380 [ 300.502340] worker_thread+0x49/0x3f0 [ 300.503079] kthread+0xf8/0x130 [ 300.503795] ? max_active_store+0x80/0x80 [ 300.504690] ? kthread_bind+0x10/0x10 [ 300.505502] ret_from_fork+0x35/0x40 [ 300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio [ 300.514984] CR2: 0000000000000008 [ 300.515569] ---[ end trace faa2eefad7e7f218 ]--- [ 300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296 [ 300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000 [ 300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258 [ 300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044 [ 300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0 [ 300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0 [ 300.527084] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000 [ 300.528396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0 [ 300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 300.533264] Kernel panic - not syncing: Fatal exception [ 300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]--- Condition check refactoring from Christoph Hellwig. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Tested-by: Jean-Baptiste Riaux <jbriaux@kalray.eu> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-23 17:46:57 +02:00
Logan Gunthorpe	e654dfd38c	nvme: fix memory leak caused by incorrect subsystem free When freeing the subsystem after finding another match with __nvme_find_get_subsystem(), use put_device() instead of __nvme_release_subsystem() which calls kfree() directly. Per the documentation, put_device() should always be used after device_initialization() is called. Otherwise, leaks like the one below which was detected by kmemleak may occur. Once the call of __nvme_release_subsystem() is removed it no longer makes sense to keep the helper, so fold it back into nvme_release_subsystem(). unreferenced object 0xffff8883d12bfbc0 (size 16): comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s) hex dump (first 16 bytes): 6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff nvme-subsys2.... backtrace: [<000000007d8fc208>] __kmalloc_track_caller+0x16d/0x2a0 [<0000000081169e5f>] kvasprintf+0xad/0x130 [<0000000025626f25>] kvasprintf_const+0x47/0x120 [<00000000fa66ad36>] kobject_set_name_vargs+0x44/0x120 [<000000004881f8b3>] dev_set_name+0x98/0xc0 [<000000007124dae3>] nvme_init_identify+0x1995/0x38e0 [<000000009315020a>] nvme_loop_configure_admin_queue+0x4fa/0x5e0 [<000000001a63e766>] nvme_loop_create_ctrl+0x489/0xf80 [<00000000a46ecc23>] nvmf_dev_write+0x1a12/0x2220 [<000000002259b3d5>] __vfs_write+0x66/0x120 [<000000002f6df81e>] vfs_write+0x154/0x490 [<000000007e8cfc19>] ksys_write+0x10a/0x240 [<00000000ff5c7b85>] __x64_sys_write+0x73/0xb0 [<00000000fee6d692>] do_syscall_64+0xaa/0x470 [<00000000997e1ede>] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: `ab9e00cc72` ("nvme: track subsystems") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-23 17:46:43 +02:00
Misha Nasledov	08b903b5fd	nvme: ignore subnqn for ADATA SX6000LNP The ADATA SX6000LNP NVMe SSDs have the same subnqn and, due to this, a system with more than one of these SSDs will only have one usable. [ 0.942706] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2018-05.com.example:nvme:nvm-subsystem-OUI00E04C). [ 0.943017] nvme nvme1: Removing after probe failure status: -22 02:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01) 71:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01) There are no firmware updates available from the vendor, unfortunately. Applying the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk for these SSDs resolves the issue, and they all work after this patch: /dev/nvme0n1 2J1120050420 ADATA SX6000LNP [...] /dev/nvme1n1 2J1120050540 ADATA SX6000LNP [...] Signed-off-by: Misha Nasledov <misha@nasledov.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-23 17:46:43 +02:00
Linus Torvalds	9637d51734	for-linus-20190715 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6 pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7 I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y 1mtSNhm5TQ== =g8xI -----END PGP SIGNATURE----- Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "A later pull request with some followup items. I had some vacation coming up to the merge window, so certain things items were delayed a bit. This pull request also contains fixes that came in within the last few days of the merge window, which I didn't want to push right before sending you a pull request. This contains: - NVMe pull request, mostly fixes, but also a few minor items on the feature side that were timing constrained (Christoph et al) - Report zones fixes (Damien) - Removal of dead code (Damien) - Turn on cgroup psi memstall (Josef) - block cgroup MAINTAINERS entry (Konstantin) - Flush init fix (Josef) - blk-throttle low iops timing fix (Konstantin) - nbd resize fixes (Mike) - nbd 0 blocksize crash fix (Xiubo) - block integrity error leak fix (Wenwen) - blk-cgroup writeback and priority inheritance fixes (Tejun)" * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits) MAINTAINERS: add entry for block io cgroup null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED block: Limit zone array allocation size sd_zbc: Fix report zones buffer allocation block: Kill gfp_t argument of blkdev_report_zones() block: Allow mapping of vmalloc-ed buffers block/bio-integrity: fix a memory leak bug nvme: fix NULL deref for fabrics options nbd: add netlink reconfigure resize support nbd: fix crash when the blksize is zero block: Disable write plugging for zoned block devices block: Fix elevator name declaration block: Remove unused definitions nvme: fix regression upon hot device removal and insertion blk-throttle: fix zero wait time for iops throttled group block: Fix potential overflow in blk_report_zones() blkcg: implement REQ_CGROUP_PUNT blkcg, writeback: Implement wbc_blkcg_css() blkcg, writeback: Add wbc->no_cgroup_owner blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() ...	2019-07-15 21:20:52 -07:00
Linus Torvalds	2a3c389a0f	5.3 Merge window RDMA pull request A smaller cycle this time. Notably we see another new driver, 'Soft iWarp', and the deletion of an ancient unused driver for nes. - Revise and simplify the signature offload RDMA MR APIs - More progress on hoisting object allocation boiler plate code out of the drivers - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib, i40iw - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc conversion - Removal of obsolete ib_ucm chardev and nes driver - netlink based discovery of chardevs and autoloading of the modules providing them - Move more of the rdamvt/hfi1 uapi to include/uapi/rdma - New driver 'siw' for software based iWarp running on top of netdev, much like rxe's software RoCE. - mlx5 feature to report events in their raw devx format to userspace - Expose per-object counters through rdma tool - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core from netdev -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0ozSwACgkQOG33FX4g mxqncg//Qe2zSnlbd6r3hofsc1WiHSx/CiXtT52BUGipO+cWQUwO7hGFuUHIFCuZ JBg7mc998xkyLIH85a/txd+RwAIApKgHVdd+VlrmybZeYCiERAMFpWg8cHpzrbnw l3Ln9fTtJf/NAhO0ZCGV9DCd01fs9yVQgAv21UnLJMUhp9Pzk/iMhu7C7IiSLKvz t7iFhEqPXNJdoqZ+wtWyc/463YxKUd9XNg9Z1neQdaeZrX4UjgDbY9x/ub3zOvQV jc/IL4GysJ3z8mfx5mAd6sE/jAjhcnJuaGYYATqkxiLZEP+muYwU50CNs951XhJC b/EfRQIcLg9kq/u6CP+CuWlMrRWy3U7yj3/mrbbGhlGq88Yt6FGqUf0aFy6TYMaO RzTG5ZR+0AmsOrR1QU+DbH9CKX5PGZko6E7UCdjROqUlAUOjNwRr99O5mYrZoM9E PdN2vtdWY9COR3Q+7APdhWIA/MdN2vjr3LDsR3H94tru1yi6dB/BPDRcJieozaxn 2T+YrZbV+9/YgrccpPQCilaQdanXKpkmbYkbEzVLPcOEV/lT9odFDt3eK+6duVDL ufu8fs1xapMDHKkcwo5jeNZcoSJymAvHmGfZlo2PPOmh802Ul60bvYKwfheVkhHF Eee5/ovCMs1NLqFiq7Zq5mXO0fR0BHyg9VVjJBZm2JtazyuhoHQ= =iWcG -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "A smaller cycle this time. Notably we see another new driver, 'Soft iWarp', and the deletion of an ancient unused driver for nes. - Revise and simplify the signature offload RDMA MR APIs - More progress on hoisting object allocation boiler plate code out of the drivers - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib, i40iw - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc conversion - Removal of obsolete ib_ucm chardev and nes driver - netlink based discovery of chardevs and autoloading of the modules providing them - Move more of the rdamvt/hfi1 uapi to include/uapi/rdma - New driver 'siw' for software based iWarp running on top of netdev, much like rxe's software RoCE. - mlx5 feature to report events in their raw devx format to userspace - Expose per-object counters through rdma tool - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core from netdev" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits) RMDA/siw: Require a 64 bit arch RDMA/siw: Mark expected switch fall-throughs RDMA/core: Fix -Wunused-const-variable warnings rdma/siw: Remove set but not used variable 's' rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS RDMA/siw: Add missing rtnl_lock around access to ifa rdma/siw: Use proper enumerated type in map_cqe_status RDMA/siw: Remove unnecessary kthread create/destroy printouts IB/rdmavt: Fix variable shadowing issue in rvt_create_cq RDMA/core: Fix race when resolving IP address RDMA/core: Make rdma_counter.h compile stand alone IB/core: Work on the caller socket net namespace in nldev_newlink() RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM RDMA/mlx5: Set RDMA DIM to be enabled by default RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink RDMA/core: Provide RDMA DIM support for ULPs linux/dim: Implement RDMA adaptive moderation (DIM) IB/mlx5: Report correctly tag matching rendezvous capability docs: infiniband: add it to the driver-api bookset IB/mlx5: Implement VHCA tunnel mechanism in DEVX ...	2019-07-15 20:38:15 -07:00
Linus Torvalds	1f7563f743	SCSI sg on 20190709 This topic branch covers a fundamental change in how our sg lists are allocated to make mq more efficient by reducing the size of the preallocated sg list. This necessitates a large number of driver changes because the previous guarantee that if a driver specified SG_ALL as the size of its scatter list, it would get a non-chained list and didn't need to bother with scatterlist iterators is now broken and every driver must use scatterlist iterators. This was broken out as a separate topic because we need to convert all the drivers before pulling the trigger and unconverted drivers kept being found, necessitating a rebase. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXSTzzCYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishZB+AP9I8j/s wWfg0Z3WNuf4D5I3rH4x1J3cQTqPJed+RjwgcQEA1gZvtOTg1ZEn/CYMVnaB92x0 t6MZSchIaFXeqfD+E7U= =cv8o -----END PGP SIGNATURE----- Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI scatter-gather list updates from James Bottomley: "This topic branch covers a fundamental change in how our sg lists are allocated to make mq more efficient by reducing the size of the preallocated sg list. This necessitates a large number of driver changes because the previous guarantee that if a driver specified SG_ALL as the size of its scatter list, it would get a non-chained list and didn't need to bother with scatterlist iterators is now broken and every driver must use scatterlist iterators. This was broken out as a separate topic because we need to convert all the drivers before pulling the trigger and unconverted drivers kept being found, necessitating a rebase" * tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits) scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation scsi: core: avoid preallocating big SGL for data scsi: core: avoid preallocating big SGL for protection information scsi: lib/sg_pool.c: improve APIs for allocating sg pool scsi: esp: use sg helper to iterate over scatterlist scsi: NCR5380: use sg helper to iterate over scatterlist scsi: wd33c93: use sg helper to iterate over scatterlist scsi: ppa: use sg helper to iterate over scatterlist scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist scsi: imm: use sg helper to iterate over scatterlist scsi: aha152x: use sg helper to iterate over scatterlist scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist scsi: usb: image: microtek: use sg helper to iterate over scatterlist scsi: pmcraid: use sg helper to iterate over scatterlist scsi: ipr: use sg helper to iterate over scatterlist scsi: mvumi: use sg helper to iterate over scatterlist scsi: lpfc: use sg helper to iterate over scatterlist scsi: advansys: use sg helper to iterate over scatterlist ...	2019-07-11 15:17:41 -07:00
Minwoo Im	7d30c81b80	nvme: fix NULL deref for fabrics options git://git.infradead.org/nvme.git nvme-5.3 branch now causes the following NULL deref oops. Check the ctrl->opts first before the deref. [ 16.337581] BUG: kernel NULL pointer dereference, address: 0000000000000056 [ 16.338551] #PF: supervisor read access in kernel mode [ 16.338551] #PF: error_code(0x0000) - not-present page [ 16.338551] PGD 0 P4D 0 [ 16.338551] Oops: 0000 [#1] SMP PTI [ 16.338551] CPU: 2 PID: 1035 Comm: kworker/u16:5 Not tainted 5.2.0-rc6+ #1 [ 16.338551] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 16.338551] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core] [ 16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf [ 16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283 [ 16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007 [ 16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720 [ 16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840 [ 16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8 [ 16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001 [ 16.338551] FS: 0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000 [ 16.338551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0 [ 16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 16.338551] Call Trace: [ 16.338551] nvme_scan_work+0x2c0/0x340 [nvme_core] [ 16.338551] ? __switch_to_asm+0x40/0x70 [ 16.338551] ? _raw_spin_unlock_irqrestore+0x18/0x30 [ 16.338551] ? try_to_wake_up+0x408/0x450 [ 16.338551] process_one_work+0x20b/0x3e0 [ 16.338551] worker_thread+0x1f9/0x3d0 [ 16.338551] ? cancel_delayed_work+0xa0/0xa0 [ 16.338551] kthread+0x117/0x120 [ 16.338551] ? kthread_stop+0xf0/0xf0 [ 16.338551] ret_from_fork+0x3a/0x50 [ 16.338551] Modules linked in: nvme nvme_core [ 16.338551] CR2: 0000000000000056 [ 16.338551] ---[ end trace b9bf761a93e62d84 ]--- [ 16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core] [ 16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf [ 16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283 [ 16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007 [ 16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720 [ 16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840 [ 16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8 [ 16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001 [ 16.338551] FS: 0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000 [ 16.338551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0 [ 16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: `958f2a0f81` ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled") Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-07-11 16:02:26 -06:00
Sagi Grimberg	420dc733f9	nvme: fix regression upon hot device removal and insertion When we validate the new controller id, we want to skip controllers that are either deleting or dead. Fix the check to do that and not on the newly added controller. Fixes: `1b1031ca63` ("nvme: validate cntlid during controller initialisation") Reported-by: Jon Derrick <jonathan.derrick@intel.com> Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-10 09:36:16 -07:00
James Smart	4c73cbdff1	nvme-fc: fix module unloads while lports still pending Current code allows the module to be unloaded even if there are pending data structures, such as localports and controllers on the localports, that have yet to hit their reference counting to remove them. Fix by having exit entrypoint explicitly delete every controller, which in turn will remove references on the remoteports and localports causing them to be deleted as well. The exit entrypoint, after initiating the deletes, will wait for the last localport to be deleted before continuing. Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:37:05 -07:00
Mikhail Skorzhinskii	37c1521959	nvme-tcp: don't use sendpage for SLAB pages According to commit `a10674bf24` ("tcp: detecting the misuse of .sendpage for Slab objects") and previous discussion, tcp_sendpage should not be used for pages that is managed by SLAB, as SLAB is not taking page reference counters into consideration. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:18:09 -07:00
Mikhail Skorzhinskii	958f2a0f81	nvme-tcp: set the STABLE_WRITES flag when data digests are enabled There was a few false alarms sighted on target side about wrong data digest while performing high throughput load to XFS filesystem shared through NVMoF TCP. This flag tells the rest of the kernel to ensure that the data buffer does not change while the write is in flight. It incurs a performance penalty, so only enable it when it is actually needed, i.e. when we are calculating data digests. Although even with this change in place, ext2 users can steel experience false positives, as ext2 is not respecting this flag. This may be apply to vfat as well. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Signed-off-by: Mike Playle <mplayle@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:18:09 -07:00
Mikhail Skorzhinskii	5ba895033b	nvmet: print a hint while rejecting NSID 0 or 0xffffffff Adding this hint for the sake of convenience. It was spotted that a few times people spent some time before understanding what is exactly wrong in configuration process. This should save a few time in such situations, especially for people who is not very confident with NVMe requirements. Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:18:09 -07:00
Hannes Reinecke	04e70bd4a0	nvme-multipath: do not select namespaces which are about to be removed nvme_ns_remove() will first set the NVME_NS_REMOVING flag before removing it from the list at the very last step. So to avoid selecting a namespace in nvme_find_path() which is about to be removed check the NVME_NS_REMOVING flag, too, when selecting a new path. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:18:03 -07:00
Hannes Reinecke	2032d07471	nvme-multipath: also check for a disabled path if there is a single sibling When we have a singular list in nvme_round_robin_path() we still need to check its validity. Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:17:21 -07:00
Hannes Reinecke	ca7ae5c966	nvme-multipath: factor out a nvme_path_is_disabled helper Factor our a common helper to check if a path has been disabled by something other than the per-namespace ANA state. Signed-off-by: Hannes Reinecke <hare@suse.com> [hch: split from a bigger patch] Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:16:38 -07:00
Bart Van Assche	81adb86334	nvme: set physical block size and optimal I/O size >From the NVMe 1.4 spec: NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA, and NOWS are defined for this namespace and should be used by the host for I/O optimization; [ ... ] Namespace Preferred Write Granularity (NPWG): This field indicates the smallest recommended write granularity in logical blocks for this namespace. This is a 0's based value. The size indicated should be less than or equal to Maximum Data Transfer Size (MDTS) that is specified in units of minimum memory page size. The value of this field may change if the namespace is reformatted. The size should be a multiple of Namespace Preferred Write Alignment (NPWA). Refer to section 8.25 for how this field is utilized to improve performance and endurance. [ ... ] Each Write, Write Uncorrectable, or Write Zeroes commands should address a multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure 245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as expressed in the NLB field), and the SLBA field of the command should be aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245) for best performance. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:15:37 -07:00
Bart Van Assche	9d05a96e29	nvmet: export I/O characteristics attributes in Identify Make the NVMe NAWUN, NAWUPF, NACWU, NPWG, NPWA, NPDG and NOWS attributes available to initator systems for the block backend. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:15:37 -07:00
Tom Wu	4c0181bf6c	nvme-trace: add delete completion and submission queue to admin cmds tracer The trace log for 'delete I/O submission queue' and 'delete I/O completion queue' command will look like as below: kworker/u49:1-3438 [003] .... 6693.070865: nvme_setup_cmd: nvme0: qid=0, cmdid=11, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_sq sqid=1) kworker/u49:1-3438 [003] .... 6693.071171: nvme_setup_cmd: nvme0: qid=0, cmdid=8, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_delete_cq cqid=24) Signed-off-by: Tom Wu <tomwu@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 14:15:37 -07:00
Colin Ian King	91f6d79853	nvme-trace: fix spelling mistake "spcecific" -> "specific" There are two spelling mistakes in trace_seq_printf messages, fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:44:45 -07:00
Christoph Hellwig	7637de311b	nvme-pci: limit max_hw_sectors based on the DMA max mapping size When running a NVMe device that is attached to a addressing challenged PCIe root port that requires bounce buffering, our request sizes can easily overflow the swiotlb bounce buffer size. Limit the maximum I/O size to the limit exposed by the DMA mapping subsystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Atish Patra <Atish.Patra@wdc.com> Tested-by: Atish Patra <Atish.Patra@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-09 13:44:44 -07:00
Alan Mikhak	bfac8e9f55	nvme-pci: check for NULL return from pci_alloc_p2pmem() Modify nvme_alloc_sq_cmds() to call pci_free_p2pmem() to free the memory it allocated using pci_alloc_p2pmem() in case pci_p2pmem_virt_to_bus() returns null. Makes sure not to call pci_free_p2pmem() if pci_alloc_p2pmem() returned NULL, which can happen if CONFIG_PCI_P2PDMA is not configured. The current implementation is not expected to leak since pci_p2pmem_virt_to_bus() is expected to fail only if pci_alloc_p2pmem() returns null. However, checking the return value of pci_alloc_p2pmem() is more explicit. Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:44:44 -07:00
Alan Mikhak	0298d54352	nvme-pci: don't create a read hctx mapping without read queues Only request an IRQ mapping for read queues if at least one read queue is being allocted, as nvme_pci_map_queues() will later on ignore the unnecessary mapping request should nvme_dev_add() request such an IRQ mapping even though no read queues are being allocated. However, nvme_dev_add() can avoid making the request by checking the number of read queues without assuming. This would bring it more in line with nvme_setup_irqs() and nvme_calc_irq_sets(). Signed-off-by: Alan Mikhak <alan.mikhak@sifive.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:44:44 -07:00
Christoph Hellwig	4fe06923f5	nvme-pci: don't fall back to a 32-bit DMA mask Since Linux 5.0 drivers can safely set the largest DMA mask supported by the device, and don't need fallbacks to work around the dma mapping implementations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>	2019-07-09 13:44:44 -07:00
YueHaibing	2177422232	nvme-pci: make nvme_dev_pm_ops static Fix sparse warning: drivers/nvme/host/pci.c:2926:25: warning: symbol 'nvme_dev_pm_ops' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:16:10 -07:00
James Smart	e0620bf858	nvme-fcloop: resolve warnings on RCU usage and sleep warnings With additional debugging enabled, seeing warnings for suspicious RCU usage or Sleeping function called from invalid context. These both map to allocation of a work structure which is currently GFP_KERNEL, meaning it can sleep. For the RCU warning, the sequence was sleeping while holding the RCU lock. Convert the allocation to GFP_ATOMIC. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:16:09 -07:00
James Smart	c38dbbfab1	nvme-fcloop: fix inconsistent lock state warnings With extra debug on, inconsistent lock state warnings are being called out as the tfcp_req->reqlock is being taken out without irq, while some calling sequences have the sequence in a softirq state. Change the lock taking/release to raise/drop irq. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-07-09 13:16:09 -07:00
Jason Gunthorpe	371bb62158	Linux 5.2-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH VTS5zps= =huBY -----END PGP SIGNATURE----- Merge tag 'v5.2-rc6' into rdma.git for-next For dependencies in next patches. Resolve conflicts: - Use uverbs_get_cleared_udata() with new cq allocation flow - Continue to delete nes despite SPDX conflict - Resolve list appends in mlx5_command_str() - Use u16 for vport_rule stuff - Resolve list appends in struct ib_client Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>	2019-06-28 21:18:23 -03:00
Israel Rukshin	5a6781a558	RDMA/core: Add an integrity MR pool support This is a preparation for adding new signature API to the rw-API. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>	2019-06-24 11:49:27 -03:00
Akinobu Mita	f79d5fda4e	nvme: enable to inject errors into admin commands This enables to inject errors into the commands submitted to the admin queue. It is useful to test error handling in the controller initialization. # echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability # echo 1 > /sys/kernel/debug/nvme0/fault_inject/times # echo 10 > /sys/kernel/debug/nvme0/fault_inject/space # nvme reset /dev/nvme0 # dmesg ... nvme nvme0: Could not set queue count (16385) nvme nvme0: IO queues not created Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2019-06-21 11:15:50 +02:00

... 8 9 10 11 12 ...

2478 Commits