OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Keith Busch	32f0c4afb4	nvme: Remove RCU namespace protection We can't sleep with RCU read lock held, but we need to do potentially blocking stuff to namespace queues when iterating the list. This patch removes the RCU locking and holds a mutex instead. To prevent deadlocks, this patch removes holding the mutex during namespace scanning and removal. The unlocked namespace scanning is made safe by holding a reference to the namespace being scanned. List iteration that does IO has to be unlocked to allow error recovery. The caller must ensure the list can not be manipulated during such an event, so this patch adds a comment explaining this requirement to the only function that iterates an unlocked list. All callers currently meet this requirement, so no further changes required. List iterations that do not do IO can safely use the lock since it couldn't block recovery from missing forced IO completions. Reported-by: Ming Lin <mlin at kernel.org> [fixes `0bf77e9` nvme: switch to RCU freeing the namespace] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-14 08:48:08 -07:00
Masayoshi Mizuma	2fa843512b	nvme: avoid crashes when node 0 is memoryless node. When CONFIG_NUMA is enabled and node 0 is memoryless, the system crashes because nvme_probe() sets the device->numa_node to 0 by set_dev_node(&pdev->dev, 0), so it tries to allocate memory from node 0. To avoid the crash, we should change the 0 to first_memory_node. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-13 09:17:55 -07:00
Keith Busch	f80ec966c1	nvme: Limit command retries Many controller implementations will return errors to commands that will not succeed, but without the DNR bit set. The driver previously retried these commands an unlimited number of times until the command timeout has exceeded, which takes an unnecessarilly long period of time. This patch limits the number of retries a command can have, defaulting to 5, but is user tunable at load or runtime. The struct request's 'retries' field is used to track the number of retries attempted. This is in contrast with scsi's use of this field, which indicates how many retries are allowed. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 16:20:31 -07:00
Arnd Bergmann	6eae8c4520	nvme-loop: fix nvme-loop Kconfig dependencies I ran into the same problem on NVME_TARGET_RDMA now, which otherwise needs dependencies on both CONFIG_BLOCK and CONFIGFS_FS: warning: (NVME_TARGET_LOOP && NVME_TARGET_RDMA) selects NVME_TARGET which has unmet direct dependencies (BLOCK && CONFIGFS_FS) 0xA002B368 Mon Jul 11 18:00:45 CEST 2016 failed In file included from ../drivers/nvme/target/core.c:16:0: drivers/nvme/target/nvmet.h:222:14: error: field 'inline_bio' has incomplete type struct bio inline_bio; ^~~~~~~~~~ drivers/nvme/target/core.c: In function 'nvmet_async_event_work': drivers/nvme/target/core.c:98:3: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration] kfree(aen); ^~~~~ ../drivers/nvme/target/core.c: In function 'nvmet_ns_enable': ../drivers/nvme/target/core.c:269:13: error: implicit declaration of function 'blkdev_get_by_path' [-Werror=implicit-function-declaration] ns->bdev = blkdev_get_by_path(ns->device_path, FMODE_READ \| FMODE_WRITE, Folding in my patch below should address that too. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:36:40 -07:00
Wei Yongjun	69555af2ce	nvmet: fix return value check in nvmet_subsys_alloc() In case of error, the function kstrndup() returns NULL pointer not ERR_PTR(). The IS_ERR() test in the return value check should be replaced with NULL test. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:33:43 -07:00
Ming Lin	e76debd996	nvme-fabrics: add-remove ctrl repeat fix Repeatedly adding then removing the same NVMe-over-Fabrics controller over and over again (shown below) can cause a kernel crash (also shown below). This patch fixes that. [nvmf]# ./setup_nvme_connections.sh traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside -nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=lightside -nqn,hostnqn=good-wins-nqn > /dev/nvme-fabrics [nvmf]# ./remove_nvme_connections.sh 2 echo 1 > /sys/class/nvme/nvme0/delete_controller echo 1 > /sys/class/nvme/nvme1/delete_controller [nvmf]# ./setup_nvme_connections.sh traddr=192.168.1.100,transport=rdma,trsvcid=4420,nqn=darkside -nqn,hostnqn=evil-wins-nqn,nr_io_queues=16 > /dev/nvme-fabrics Killed [nvmf]# dmesg [ 313.416908] nvme nvme0: creating 16 I/O queues. [ 313.523908] nvme nvme0: new ctrl: NQN "darkside-nqn", addr 192.168.1.100:4420 [ 313.524857] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 313.525262] IP: [<ffffffff8136c60e>] strcmp+0xe/0x30 [ 313.525490] PGD 0 [ 313.525726] Oops: 0000 [#1] SMP [ 313.525900] Modules linked in: nvme_rdma nvme_fabrics nvme_core ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_en mlx4_ib ib_core mlx4_core [ 313.527085] CPU: 15 PID: 5856 Comm: setup_nvme_conn Not tainted 4.7.0-rc2+ #2 [ 313.527259] Hardware name: Supermicro X9DRT-F/IBQF/IBFF/X9DRT -F/IBQF/IBFF, BIOS 1.0a 10/09/2012 [ 313.527551] task: ffff88027646cd40 ti: ffff88025b980000 task.ti: ffff88025b980000 [ 313.527879] RIP: 0010:[<ffffffff8136c60e>] [<ffffffff8136c60e>] strcmp+0xe/0x30 [ 313.528232] RSP: 0018:ffff88025b983db0 EFLAGS: 00010206 [ 313.528403] RAX: 0000000000000000 RBX: ffff880471879880 RCX: fffffffffffffff1 [ 313.528594] RDX: 0000000000000000 RSI: ffff880474afa860 RDI: 0000000000000011 [ 313.528778] RBP: ffff88025b983db0 R08: ffff880474afa860 R09: ffff880471879058 [ 313.528956] R10: 000000000000002c R11: ffff88047f415000 R12: ffff880471879800 [ 313.529129] R13: ffff880471879000 R14: ffff880474afa860 R15: fffffffffffffff8 [ 313.529303] FS: 00007f778f510700(0000) GS:ffff88047fbc0000(0000) knlGS:0000000000000000 [ 313.529629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 313.529817] CR2: 0000000000000010 CR3: 0000000274174000 CR4: 00000000000406e0 [ 313.529989] Stack: [ 313.530154] ffff88025b983e48 ffffffffa0171c74 0000000000000001 0000000000000059 [ 313.530621] ffff880476f32400 ffff88047e8add80 0000010074b33aa0 ffff880471879059 [ 313.531162] ffff88047187904b ffff880471879058 0000000000000000 ffff88047736e000 [ 313.531629] Call Trace: [ 313.531797] [<ffffffffa0171c74>] nvmf_dev_write+0x674/0x840 [nvme_fabrics] [ 313.531974] [<ffffffff81180b53>] __vfs_write+0x23/0x120 [ 313.532146] [<ffffffff8119daff>] ? __fd_install+0x1f/0xc0 [ 313.532316] [<ffffffff8119d97a>] ? __alloc_fd+0x3a/0x170 [ 313.532487] [<ffffffff811811f3>] vfs_write+0xb3/0x1b0 [ 313.532658] [<ffffffff8117e321>] ? filp_close+0x51/0x70 [ 313.532845] [<ffffffff811824e1>] SyS_write+0x41/0xa0 [ 313.533016] [<ffffffff8183055b>] entry_SYSCALL_64_fastpath+0x13/0x8f [ 313.533188] Code: 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 84 c0 74 18 48 83 c7 01 <0f> b6 47 ff 48 83 c6 01 3a 46 ff 74 eb 19 c0 83 c8 01 5d c3 31 [ 313.536563] RIP [<ffffffff8136c60e>] strcmp+0xe/0x30 [ 313.536815] RSP <ffff88025b983db0> [ 313.536981] CR2: 0000000000000010 [ 313.537151] ---[ end trace 3d952e590e7bc2d5 ]--- Reported-and-tested-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <mlin@kernel.org> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:32:19 -07:00
Sagi Grimberg	6a92967ccb	nvme-fabrics: Remove tl_retry_count The timeout before error recovery logic kicks in is dictated by the nvme keep-alive, so we don't really need a transport layer retry count. transports can retry for as much as they like. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:31:11 -07:00
Sagi Grimberg	2ac17c283a	nvme-rdma: Don't use tl_retry_count Always use the maximum qp retry count as the error recovery timeout is dictated from the nvme keep-alive. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:31:10 -07:00
Wei Yongjun	458a9632ad	nvme-rdma: fix the return value of nvme_rdma_reinit_request() PTR_ERR should be applied before its argument is reassigned, otherwise the return value will be set to 0, not error code. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:27:03 -07:00
Guilherme G. Piccoli	54adc01055	nvme/quirk: Add a delay before checking for adapter readiness When disabling the controller, the specification says the register NVME_REG_CC should be written and then driver needs to wait the adapter to be ready, which is checked by reading another register bit (NVME_CSTS_RDY). There's a timeout validation in this checking, so in case this timeout is reached the driver gives up and removes the adapter from the system. After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003) (HGST adapter) end up being removed if we issue a reset_controller, because driver keeps verifying the NVME_REG_CSTS until the timeout is reached. This patch adds a necessary quirk for this adapter, by introducing a delay before nvme_wait_ready(), so the reset procedure is able to be completed. This quirk is needed because just increasing the timeout is not enough in case of this adapter - the driver must wait before start reading NVME_REG_CSTS register on this specific device. Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-12 08:23:00 -07:00
Jens Axboe	41d512e51b	Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers Dan writes: "The removal of ->driverfs_dev in favor of just passing the parent device in as a parameter to add_disk(). See below, it has received a "Reviewed-by" from Christoph, Bart, and Johannes. It is also a pre-requisite for Fam Zheng's work to cleanup gendisk uevents vs attribute visibility [1]. We would extend device_add_disk() to take an attribute_group list. This is based off a branch of block.git/for-4.8/drivers and has received a positive build success notification from the kbuild robot across several configs. [1]: "gendisk: Generate uevent after attribute available" http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"	2016-07-08 16:04:11 -06:00
Christoph Hellwig	7110230719	nvme-rdma: add a NVMe over Fabrics RDMA host driver This patch implements the RDMA host (initiator in SCSI speak) driver. It can be used to connect to remote NVMe over Fabrics controllers over Infiniband, RoCE or iWarp, and uses the existing NVMe core driver as well a the new fabrics library. To connect to all NVMe over Fabrics controller reachable on a given taget port using RDMA/CM use the following command: nvme connect-all -t rdma -a $IPADDR This requires the latest version of nvme-cli with Fabrics support. Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Christoph Hellwig	8f000cac6e	nvmet-rdma: add a NVMe over Fabrics RDMA target driver This patch implements the RDMA transport for the NVMe over Fabrics target, which allows exporting NVMe over Fabrics functionality over RDMA fabrics (Infiniband, RoCE, iWARP). All NVMe logic is in the generic target and this module just provides a small glue between it and the generic code in the RDMA subsystem. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>, Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Christoph Hellwig	def61eca96	nvme: add new reconnecting controller state The nvme fabric (RDMA, FC, etc...) can introduce port, link or node failures that may require a reconnect to re-establish the connection. Add a new reconnecting state that will initially be used by the RDMA driver. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-08 08:38:49 -06:00
Johannes Thumshirn	6f92970219	nvme: lightnvm: make MLC num_pairs little endian According to the OpenChannel SSD interface specification the NAND flash MLC page pairing information's number of page page pairings field is the first two bytes in the MLC Page Pairing data structure. The hardware's data structure itself is little endian so annotate it as such, like the rest of lighnvm's data structures. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:51:52 -06:00
Dan Carpenter	f98d9ca17f	nvmet: fix an error code We accidentally return zero here when ERR_PTR(-ENOMEM) is intended. Fixes: `a07b4970f4` ('nvmet: add a generic NVMe target') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:37:36 -06:00
Arnd Bergmann	1fb4704084	nvme-loop: add configfs dependency CONFIG_NVME_TARGET has a correct CONFIG_CONFIGFS_FS dependency, but the newly added NVME_TARGET_LOOP is missing this, resulting in a link failure: drivers/nvme/built-in.o: In function `nvmet_init_configfs': loop.c:(.init.text+0x2a0): undefined reference to `config_group_init' loop.c:(.init.text+0x2c0): undefined reference to `config_group_init_type_name' loop.c:(.init.text+0x318): undefined reference to `configfs_register_subsystem' drivers/nvme/built-in.o: In function `nvmet_exit_configfs': loop.c:(.exit.text+0x9c): undefined reference to `configfs_unregister_subsystem' This adds the same dependency here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `3a85a5de29` ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-07 08:34:45 -06:00
Christoph Hellwig	3a85a5de29	nvme-loop: add a NVMe loopback host driver This patch implements adds nvme-loop which allows to access local devices exported as NVMe over Fabrics namespaces. This module can be useful for easy evaluation, testing and also feature experimentation. To createa nvme-loop device you need to configure the NVMe target to export a loop port (see the nvmetcli documentaton for that) and then connect to it using nvme connect-all -t loop which requires the very latest nvme-cli version with Fabrics support. Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:36 -06:00
Christoph Hellwig	a07b4970f4	nvmet: add a generic NVMe target This patch introduces a implementation of NVMe subsystems, controllers and discovery service which allows to export NVMe namespaces across fabrics such as Ethernet, FC etc. The implementation conforms to the NVMe 1.2.1 specification and interoperates with NVMe over fabrics host implementations. Configuration works using configfs, and is best performed using the nvmetcli tool from http://git.infradead.org/users/hch/nvmetcli.git, which also has a detailed explanation of the required steps in the README file. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: Anthony Knapp <anthony.j.knapp@intel.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:30:33 -06:00
Sagi Grimberg	038bd4cb67	nvme: add keep-alive support Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and optional in NVMe 1.2.1 for PCIe. This patch adds periodic keep-alive sent from the host to verify that the controller is still responsive and vice-versa. The keep-alive timeout is user-defined (with keep_alive_tmo connection parameter) and defaults to 5 seconds. In order to avoid a race condition where the host sends a keep-alive competing with the target side keep-alive timeout expiration, the host adds a grace period of 10 seconds when publishing the keep-alive timeout to the target. In case a keep-alive failed (or timed out), a transport specific error recovery kicks in. For now only NVMe over Fabrics is wired up to support keep alive, but we can add PCIe support easily once controllers actually supporting it become available. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Steve Wise <swise@chelsio.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:20 -06:00
Christoph Hellwig	07bfcd09a2	nvme-fabrics: add a generic NVMe over Fabrics library The NVMe over Fabrics library provides an interface for both transports and the nvme core to handle fabrics specific commands and attributes independent of the underlying transport. In addition, the fabrics library adds a misc device interface that allow actually creating a fabrics controller, as we can't just autodiscover it like in the PCI case. The nvme-cli utility has been enhanced to use this interface to support fabric connect and discovery. Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>, Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>, Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:16 -06:00
Christoph Hellwig	eb793e2c92	nvme.h: add NVMe over Fabrics definitions The NVMe over Fabrics specification defines a protocol interface and related extensions to NVMe that enable operation over network protocols. The NVMe over Fabrics specification has an NVMe Transport binding for each NVMe Transport. This patch adds the fabrics related definitions: - fabric specific command set and error codes - transport addressing and binding definitions - fabrics sgl extensions - controller identification fabrics enhancements - discovery log page definition Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:14 -06:00
Ming Lin	1a353d85b0	nvme: add fabrics sysfs attributes - delete_controller: This attribute allows to delete a controller. A driver is not obligated to support it (pci doesn't) so it is created only if the driver supports it. The new fabrics drivers will support it (essentialy a disconnect operation). Usage: echo > /sys/class/nvme/nvme0/delete_controller - subsysnqn: This attribute shows the subsystem nqn of the configured device. If a driver does not implement the get_subsysnqn method, the file will not appear in sysfs. - transport: This attribute shows the transport name. Added a "name" field to struct nvme_ctrl_ops. For loop, cat /sys/class/nvme/nvme0/transport loop For RDMA, cat /sys/class/nvme/nvme0/transport rdma For PCIe, cat /sys/class/nvme/nvme0/transport pcie - address: This attributes shows the controller address. The fabrics drivers that will implement get_address can show the address of the connected controller. example: cat /sys/class/nvme/nvme0/address traddr=192.168.2.2,trsvcid=1023 Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:12 -06:00
Christoph Hellwig	eb71f43557	nvme: Modify and export sync command submission for fabrics NVMe over fabrics will use __nvme_submit_sync_cmd in the the transport and require a few tweaks to it. For that we export it and add a few more paramters: 1. allow passing a queue ID to the block layer For the NVMe over Fabrics connect command we need to able to specify a queue ID that we want to send the command on. Add a qid parameter to the relevant functions to enable this behavior. 2. allow submitting at_head commands In cases where we want to (re)connect to a controller where we have inflight queued commands we want to first connect and only then allow the other queued commands to be kicked. This will prevents failures in controller resets and reconnects. 3. allow passing flags to blk_mq_allocate_request Both for Fabrics connect the the keep-alive feature in NVMe 1.2.1 we want to be able to use reserved requests. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:11 -06:00
Christoph Hellwig	7d2e80080d	nvme: allow transitioning from NEW to LIVE state For Fabrics we're not going through an intermediate reset state (at least for now). Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-05 11:28:09 -06:00
Dan Williams	0d52c756a6	block: convert to device_add_disk() For block drivers that specify a parent device, convert them to use device_add_disk(). This conversion was done with the following semantic patch: @@ struct gendisk disk; expression E; @@ - disk->driverfs_dev = E; ... - add_disk(disk); + device_add_disk(E, disk); @@ struct gendisk disk; expression E1, E2; @@ - disk->driverfs_dev = E1; ... E2 = disk; ... - add_disk(E2); + device_add_disk(E1, E2); ...plus some manual fixups for a few missed conversions. Cc: Jens Axboe <axboe@fb.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-27 12:26:08 -07:00
Johannes Thumshirn	a1f447b35b	NVMe: Use pci_(request\|release)_mem_regions Now that we do have pci_request_mem_regions() and pci_release_mem_regions() at hand, use it in the NVMe driver. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: Keith Busch <keith.busch@intel.com> CC: Jens Axboe <axboe@fb.com>	2016-06-21 17:09:32 -05:00
Christoph Hellwig	f5fa90dc0a	nvme: move the workaround for I/O queue-less controllers from PCIe to core We want to apply this to Fabrics drivers as well, so move it to common code. Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	7a5abb4b48	nvme: factor out a add nvme_is_write helper Centralize the check if a given NVMe command reads or writes data. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Christoph Hellwig	a229dbf61e	nvme: allow for size limitations from transport drivers Some transport drivers may have a lower transfer size than the controller. So allow the transport to set it in the controller max_hw_sectors. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-12 07:29:43 -06:00
Johannes Thumshirn	edb50a5403	NVMe: Only release requested regions The NVMe driver only requests the PCIe device's memory regions but releases all possible regions (including eventual I/O regions). This leads to a stale warning entry in dmesg about freeing non existent resources. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-09 14:28:28 -06:00
Sunad Bhandary	47b0e50ac7	NVMe: Fix removal in case of active namespace list scanning method In case of the active namespace list scanning method, a namespace that is detached is not removed from the host if it was the last entry in the list. Fix this by adding a scan to validate namespaces greater than the value of prev. This also handles the case of removing namespaces whose value exceed the device's reported number of namespaces. Signed-off-by: Sunad Bhandary S <sunad.s@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-08 08:48:47 -06:00
Minfei Huang	bd0fc2884c	nvme: use UINT_MAX for max discard sectors It's more elegant to use UINT_MAX to represent the max value of type unsigned int. So replace the actual value by using this define. Signed-off-by: Minfei Huang <mnghuan@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 22:23:12 -06:00
Ming Lin	c55a2fd4bb	nvme: move nvme_cancel_request() to common code So it can be used by fabrics driver also. Signed-off-by: Ming Lin <ming.l@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Keith Busch <keith.bsuch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:43:02 -06:00
Ming Lin	e1958e6534	nvme: update and rename nvme_cancel_io to nvme_cancel_request nvme_cancel_io is a bit confusing (given the distinction of io/admin), so rename it to nvme_cancel_request. And update it a bit to pass in struct nvme_ctrl, so it can be used by Fabrics driver also. Signed-off-by: Ming Lin <ming.l@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Keith Busch <keith.bsuch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:43:02 -06:00
Mike Christie	3a5e02ced1	block, drivers: add REQ_OP_FLUSH operation This adds a REQ_OP_FLUSH operation that is sent to request_fn based drivers by the block layer's flush code, instead of sending requests with the request->cmd_flags REQ_FLUSH bit set. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Mike Christie	c2df40dfb8	drivers: use req op accessor The req operation REQ_OP is separated from the rq_flag_bits definition. This converts the block layer drivers to use req_op to get the op from the request struct. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-06-07 13:41:38 -06:00
Nicholas Bellinger	ba36c21b0c	nvme/host: Add missing blk_integrity tag_size + flags assignments While doing recent bring-up of nvme/host with target-core T10-PI, I noticed /sys/block/nvme/integrity/device_is_integrity_capable was false, and /sys/block/nvme/integrity/tag_size contained a bogus value. AFAICT outside of blk_integrity_compare() for DM + MD these are informational values, but go ahead and add the missing assignments for nvme/host to match what SCSI does within sd_dif_config_host() for consistency's sake. Cc: Keith Busch <keith.busch@intel.com> Cc: Jay Freyensee <james.p.freyensee@intel.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Sagi Grimberg <sagig@grimberg.me> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Reviewed-by: Sagi Grimberg <sagi at grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	99466e708d	NVMe: Add device ID's with stripe quirk Adds two Intel controllers that have the "stripe" quirk. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	0ff9d4e1a2	NVMe: Short-cut removal on surprise hot-unplug This patch adds a new state that when set has the core automatically kill request queues prior to removing namespaces. If PCI device is not present at the time the nvme driver's remove is called, we can kill all IO queues immediately instead of waiting for the watchdog thread to do that at its polling interval. This improves scenarios where multiple hot plug events occur at the same time since it doesn't block the pci enumeration for as long. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	9ec3bb2f99	NVMe: Allow user initiated rescan This exposes ioctl and sysfs methods a user can invoke to request the driver rescan a controller and its namespaces. This is less harsh than doing a controller reset, which temporarilly halts all IO, just to surface a newly attached namespace. This is mainly useful for controllers that implement the namespace management command, but do not support the namespace notify change asynchronous event notification. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	d011fb3164	NVMe: Reduce driver log spamming Reduce error logging when no corrective action is required. Suggessted-by: Chris Petersen <cpetersen@fb.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	921920ab32	NVMe: Unbind driver on failure Instead of removing the PCI device from the kernel's topology on controller failure, this patch simply requests unbinding the device from the driver. This avoids concurrently running pci removal with the hot plug event, which has been reported to be problematic when multiple surprise events occur near simultaneously. The other benefit is that we will have PCI config and memory space available to poke around for debugging a failed controller, assuming the device was not physically removed. The down side occurs if the platform and/or kernel do not support any type of surprise hot removal. The device will remain visible through sysfs (and therefore lspci), and some manual work is necessary to get the logical topology corrected. But if your platform and/or kernel don't support surprise removal, you probably shouldn't be doing that anyway. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	014a0d609e	NVMe: Delete only created queues Use the online queue count instead of the number of allocated queues. The controller should just return an invalid queue identifier error to the commands if a queue wasn't created. While it's not harmful, it's still not correct. Reported-by: Saar Gross <saar@annapurnalabs.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	2800b8e7d9	NVMe: Allocate queues only for online cpus The driver previously requested allocating queues for the total possible number of CPUs so that blk-mq could rebalance these if CPUs were added after initialization. The number of hardware contexts can now be changed at runtime, so we only need to allocate the number of online queues since we can add more later. Suggested-by: Jeff Lien <jeff.lien@hgst.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Linus Torvalds	24b9f0cf00	Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...	2016-05-17 16:03:32 -07:00
Javier González	6d5be9590b	lightnvm: rename nr_pages to nr_ppas on nvm_rq The number of ppas contained on a request is not necessarily the number of pages that it maps to neither on the target nor on the device side. In order to avoid confusion, rename nr_pages to nr_ppas since it is what the variable actually contains. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Arnd Bergmann	45bbd0529e	lightnvm: pass dma address to hardware rather than pointer A recent change to lightnvm added code to pass a kernel pointer to the hardware, which gcc complained about: drivers/nvme/host/lightnvm.c: In function 'nvme_nvm_rqtocmd': drivers/nvme/host/lightnvm.c:472:32: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] c->ph_rw.metadata = cpu_to_le64(rqd->meta_list); It looks like this has no way of working anyway, so this changes the code to pass the dma_address instead. This was most likely what was intended here. Neither of the two are currently ever written to, so the effect is the same for now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: a34b1eb78e21 ("lightnvm: enable metadata to be sent to device") Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Sagi Grimberg	b86d8d363e	nvme/lightnvm: Log using the ctrl named device Align with the rest of the nvme subsystem. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Javier González	75b8564932	lightnvm: rename dma helper functions Until now, the dma pool have been exclusively used to allocate the ppa list being sent to the device. In pblk (upcoming), we use these pools to allocate metadata too. Thus, we generalize the names of some variables on the dma helper functions to make the code more readable. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Javier González	003fad376b	lightnvm: enable metadata to be sent to device Enable metadata buffer to be sent to the device through the metadata field on the physical rw nvme command. The size of the metadata buffer must follow dev->oob_size * # of PPAs. Signed-off-by: Javier González <javier@cnexlabs.com> Updated description. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Matias Bjørling	00ee6cc3b7	lightnvm: refactor set_bb_tbl for accepting ppa list The set_bb_tbl takes struct nvm_rq and only uses its ppa_list and nr_pages internally. Instead, make these two variables explicit. This allows a user to call it without initializing a struct nvm_rq first. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Matias Bjørling	e11903f5df	lightnvm: refactor device ops->get_bb_tbl() The device ops->get_bb_tbl() takes a callback, that allows the caller to use its own callback function to update its data structures in the returning function. This makes it difficult to send parameters to the callback, and usually is circumvented by small private structures, that both carry the callers state and any flags needed to fulfill the update. Refactor ops->get_bb_tbl() to fill a data buffer with the status of the blocks returned, and let the user call the callback function manually. That will provide the necessary flags and data structures and simplify the logic around ops->get_bb_tbl(). Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Matias Bjørling	22e8c9766a	lightnvm: move block fold outside of get_bb_tbl() The get block table command returns a list of blocks and planes with their associated state. Users, such as gennvm and sysblk, manages all planes as a single virtual block. It was therefore natural to fold the bad block list before it is returned. However, to allow users, which manages on a per-plane block level, to also use the interface, the get_bb_tbl interface is changed to not fold by default and instead let the caller fold if necessary. Reviewed by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-06 12:51:10 -06:00
Keith Busch	87c3207781	NVMe: Fix reset/remove race This fixes a scenario where device is present and being reset, but a request to unbind the driver occurs. A previous patch series addressing a device failure removal scenario flushed reset_work after controller disable to unblock reset_work waiting on a completion that wouldn't occur. This isn't safe as-is. The broken scenario can potentially be induced with: modprobe nvme && modprobe -r nvme To fix, the reset work is flushed immediately after setting the controller removing flag, and any subsequent reset will not proceed with controller initialization if the flag is set. The controller status must be polled while active, so the watchdog timer is also left active until the controller is disabled to cleanup requests that may be stuck during namespace removal. [Fixes: `ff23a2a15a`] Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-03 14:00:29 -06:00
Ming Lin	b7b9c22787	nvme: fix nvme_ns_remove() deadlock On receipt of a namespace attribute changed AER, we acquire the namespace mutex lock before proceeding to scan and validate the namespace list. In case of namespace detach/delete command, nvme_ns_remove function deadlocks trying to acquire the already held lock. All callers, except nvme_remove_namespaces(), of nvme_ns_remove() already held namespaces_mutex. So we can simply fix the deadlock by not acquiring the mutex in nvme_ns_remove() and acquiring it in nvme_remove_namespaces(). Reported-by: Sunad Bhandary S <sunad.s@samsung.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimerg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:16:13 -06:00
Ming Lin	0bf77e9dbb	nvme: switch to RCU freeing the namespace Switch to RCU freeing the namespace structure so that nvme_start_queues, nvme_stop_queues and nvme_kill_queues would be able to get away with only a RCU read side critical section. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimerg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:16:11 -06:00
Ming Lin	6904242db1	nvme: add helper nvme_cleanup_cmd() This hides command cleanup into nvme.h and fabrics drivers will also use it. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:11:58 -06:00
Christoph Hellwig	f866fc4282	nvme: move AER handling to common code The transport driver still needs to do the actual submission, but all the higher level code can be shared. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:26 -06:00
Christoph Hellwig	5955be2144	nvme: move namespace scanning to core Move the scan work item and surrounding code to the common code. For now we need a new finish_scan method to allow the PCI driver to set the irq affinity hints, but I have plans in the works to obsolete this as well. Note that this moves the namespace scanning from nvme_wq to the system workqueue, but as we don't rely on namespace scanning to finish from reset or I/O this should be fine. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:24 -06:00
Christoph Hellwig	92911a55d4	nvme: tighten up state check for namespace scanning We only should be scanning namespaces if the controller is live. Currently we call the function just before setting it live, so fix the code up to move the call to nvme_queue_scan to just below the state change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:23 -06:00
Christoph Hellwig	bb8d261e08	nvme: introduce a controller state machine Replace the adhoc flags in the PCI driver with a state machine in the core code. Based on code from Sagi Grimberg for the Fabrics driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:21 -06:00
Christoph Hellwig	04a934d4c7	nvme: remove the io_incapable method It's unused since "NVMe: Move error handling to failed reset handler". Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:20 -06:00
Wang Sheng-Hui	23bd63ceea	NVMe: nvme_core_exit() should do cleanup in the reverse order as nvme_core_init does nvme_core_init does: 1) register_blkdev 2) __register_chrdev 3) class_create nvme_core_exit should do cleanup in the reverse order. Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:05:03 -06:00
Keith Busch	3b24774e1f	NVMe: Fix check_flush_dependency warning If the controller fails and is degraded after a reset, we need to kill off all requests queues before removing the inaccessble namespaces. This will prevent del_gendisk from syncing dirty data, which we can't due from a WQ_MEM_RECLAIM work queue. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:03:06 -06:00
Wang Sheng-Hui	b31356dfde	NVMe: small typo in section BLK_DEV_NVME_SCSI of host/Kconfig "as well as " is miss typed "as well a " in section "config BLK_DEV_NVME_SCSI" Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-26 08:31:50 -06:00
Christoph Hellwig	76e3914ae5	nvme: fix cntlid type Controller IDs in NVMe are unsigned 16-bit types. In the Fabrics driver we actually pass ctrl->id by reference, so we need it to have the correct type. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-26 08:31:22 -06:00
Keith Busch	a5229050b6	NVMe: Always use MSI/MSI-x interrupts Multiple users have reported device initialization failure due the driver not receiving legacy PCI interrupts. This is not unique to any particular controller, but has been observed on multiple platforms. There have been no issues reported or observed when with message signaled interrupts, so this patch attempts to use MSI-x during initialization, falling back to MSI. If that fails, legacy would become the default. The setup_io_queues error handling had to change as a result: the admin queue's msix_entry used to be initialized to the legacy IRQ. The case where nr_io_queues is 0 would fail request_irq when setting up the admin queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving the admin queue's msix_entry invalid. Instead, return success immediately. Reported-by: Tim Muhlemmer <muhlemmer@gmail.com> Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-14 14:04:50 -06:00
Guilherme G. Piccoli	c875a7093f	nvme: Avoid reset work on watchdog timer function during error recovery This patch adds a check on nvme_watchdog_timer() function to avoid the call to reset_work() when an error recovery process is ongoing on controller. The check is made by looking at pci_channel_offline() result. If we don't check for this on nvme_watchdog_timer(), error recovery mechanism can't recover well, because reset_work() won't be able to do its job (since we're in the middle of an error) and so the controller is removed from the system before error recovery mechanism can perform slot reset (which would allow the adapter to recover). In this patch we also have split the huge condition expression on nvme_watchdog_timer() by introducing an auxiliary function to help make the code more readable. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-13 08:15:39 -06:00
Jens Axboe	7e19793096	NVMe: silence warning about unused 'dev' Depending on options, we might not be using dev in nvme_cancel_io(): drivers/nvme/host/pci.c: In function ‘nvme_cancel_io’: drivers/nvme/host/pci.c:970:19: warning: unused variable ‘dev’ [-Wunused-variable] struct nvme_dev *dev = data; ^ So get rid of it, and just cast for the dev_dbg_ratelimited() call. Fixes: `82b4552b91` ("nvme: Use blk-mq helper for IO termination") Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 16:11:11 -06:00
Jens Axboe	7c88cb00f2	NVMe: switch to using blk_queue_write_cache() Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 16:00:39 -06:00
Sagi Grimberg	82b4552b91	nvme: Use blk-mq helper for IO termination blk-mq offers a tagset iterator so let's use that instead of using nvme_clear_queues. Note, we changed nvme_queue_cancel_ios name to nvme_cancel_io as there is no concept of a queue now in this function (we also lost the print). Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 15:07:15 -06:00
Keith Busch	21f033f7c7	NVMe: Skip async events for degraded controllers If the controller is degraded, the driver should stay out of the way so the user can recover the drive. This patch skips driver initiated async event requests when the drive is in this state. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	8093f7ca73	nvme: add helper nvme_setup_cmd() This moves nvme_setup_{flush,discard,rw} calls into a common nvme_setup_cmd() helper. So we can eventually hide all the command setup in the core module and don't even need to update the fabrics drivers for any specific command type. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	03b5929ebb	nvme: rewrite discard support This rewrites nvme_setup_discard() with blk_add_request_payload(). It allocates only the necessary amount(16 bytes) for the payload. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	58b4560275	nvme: add helper nvme_map_len() The helper returns the number of bytes that need to be mapped using PRPs/SGL entries. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	2e39e0f608	nvme: add missing lock nesting notation When unloading driver, nvme_disable_io_queues() calls nvme_delete_queue() that sends nvme_admin_delete_cq command to admin sq. So when the command completed, the lock acquired by nvme_irq() actually belongs to admin queue. While the lock that nvme_del_cq_end() trying to acquire belongs to io queue. So it will not deadlock. This patch adds lock nesting notation to fix following report. [ 109.840952] ============================================= [ 109.846379] [ INFO: possible recursive locking detected ] [ 109.851806] 4.5.0+ #180 Tainted: G E [ 109.856533] --------------------------------------------- [ 109.861958] swapper/0/0 is trying to acquire lock: [ 109.866771] (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme] [ 109.876535] [ 109.876535] but task is already holding lock: [ 109.882398] (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme] [ 109.891547] [ 109.891547] other info that might help us debug this: [ 109.898107] Possible unsafe locking scenario: [ 109.898107] [ 109.904056] CPU0 [ 109.906515] ---- [ 109.908974] lock(&(&nvmeq->q_lock)->rlock); [ 109.913381] lock(&(&nvmeq->q_lock)->rlock); [ 109.917787] [ 109.917787] * DEADLOCK * [ 109.917787] [ 109.923738] May be due to missing lock nesting notation [ 109.923738] [ 109.930558] 1 lock held by swapper/0/0: [ 109.934413] #0: (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme] [ 109.944010] [ 109.944010] stack backtrace: [ 109.948389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 4.5.0+ #180 [ 109.955734] Hardware name: Dell Inc. OptiPlex 7010/0YXT71, BIOS A15 08/12/2013 [ 109.962989] 0000000000000000 ffff88011e203c38 ffffffff81383d9c ffffffff81c13540 [ 109.970478] ffffffff826711d0 ffff88011e203ce8 ffffffff810bb429 0000000000000046 [ 109.977964] 0000000000000046 0000000000000000 0000000000b2e597 ffffffff81f4cb00 [ 109.985453] Call Trace: [ 109.987911] <IRQ> [<ffffffff81383d9c>] dump_stack+0x85/0xc9 [ 109.993711] [<ffffffff810bb429>] __lock_acquire+0x19b9/0x1c60 [ 109.999575] [<ffffffff810b6d1d>] ? trace_hardirqs_off+0xd/0x10 [ 110.005524] [<ffffffff810b386d>] ? complete+0x3d/0x50 [ 110.010688] [<ffffffff810bb760>] lock_acquire+0x90/0xf0 [ 110.016029] [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme] [ 110.022418] [<ffffffff81772afb>] _raw_spin_lock_irqsave+0x4b/0x60 [ 110.028632] [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme] [ 110.035019] [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme] [ 110.041232] [<ffffffff8135b485>] blk_mq_end_request+0x35/0x60 [ 110.047095] [<ffffffffc0821ad8>] nvme_complete_rq+0x68/0x190 [nvme] [ 110.053481] [<ffffffff8135b53f>] __blk_mq_complete_request+0x8f/0x130 [ 110.060043] [<ffffffff8135b611>] blk_mq_complete_request+0x31/0x40 [ 110.066343] [<ffffffffc08209e3>] __nvme_process_cq+0x83/0x240 [nvme] [ 110.072818] [<ffffffffc0820c35>] nvme_irq+0x25/0x50 [nvme] [ 110.078419] [<ffffffff810cdb66>] handle_irq_event_percpu+0x36/0x110 [ 110.084804] [<ffffffff810cdc77>] handle_irq_event+0x37/0x60 [ 110.090491] [<ffffffff810d0ea3>] handle_edge_irq+0x93/0x150 [ 110.096180] [<ffffffff81012306>] handle_irq+0xa6/0x130 [ 110.101431] [<ffffffff81011abe>] do_IRQ+0x5e/0x120 [ 110.106333] [<ffffffff8177384c>] common_interrupt+0x8c/0x8c Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Keith Busch	788e15abbb	NVMe: Always use MSI/MSI-x interrupts Multiple users have reported device initialization failure due the driver not receiving legacy PCI interrupts. This is not unique to any particular controller, but has been observed on multiple platforms. There have been no issues reported or observed when with message signaled interrupts, so this patch attempts to use MSI-x during initialization, falling back to MSI. If that fails, legacy would become the default. The setup_io_queues error handling had to change as a result: the admin queue's msix_entry used to be initialized to the legacy IRQ. The case where nr_io_queues is 0 would fail request_irq when setting up the admin queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving the admin queue's msix_entry invalid. Instead, return success immediately. Reported-by: Tim Muhlemmer <muhlemmer@gmail.com> Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Keith Busch	9bf2b972af	NVMe: Fix reset/remove race This fixes a scenario where device is present and being reset, but a request to unbind the driver occurs. A previous patch series addressing a device failure removal scenario flushed reset_work after controller disable to unblock reset_work waiting on a completion that wouldn't occur. This isn't safe as-is. The broken scenario can potentially be induced with: modprobe nvme && modprobe -r nvme To fix, the reset work is flushed immediately after setting the controller removing flag, and any subsequent reset will not proceed with controller initialization if the flag is set. The controller status must be polled while active, so the watchdog timer is also left active until the controller is disabled to cleanup requests that may be stuck during namespace removal. [Fixes: `ff23a2a15a`] Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-11 10:00:04 -06:00
Marta Rybczynska	d783e0bd02	nvme: avoid cqe corruption when update at the same time as read Make sure the CQE phase (validity) is read before the rest of the structure. The phase bit is the highest address and the CQE read will happen on most platforms from lower to upper addresses and will be done by multiple non-atomic loads. If the structure is updated by PCI during the reads from the processor, the processor may get a corrupted copy. The addition of the new nvme_cqe_valid function that verifies the validity bit also allows refactoring of the other CQE read sequences. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-22 10:27:29 -06:00
Matias Bjorling	9f86726843	nvme: lightnvm: return ppa completion status PPAs sent to device is separately acknowledge in a 64bit status variable. The status is stored in DW0 and DW1 of the completion queue entry. Store this status inside the nvm_rq for further processing. This can later be used to implement retry techniques for failed writes and reads. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-18 18:10:38 -07:00
Linus Torvalds	237045fc3c	Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This is the block driver pull request for this merge window. It sits on top of for-4.6/core, that was just sent out. This contains: - A set of fixes for lightnvm. One from Alan, fixing an overflow, and the rest from the usual suspects, Javier and Matias. - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd for correct usage of the signed 64-bit divider. - A set of bug fixes for the Micron mtip32xx, from Asai. - A fix for the brd discard handling from Bart. - Update the maintainers entry for cciss, since that hardware has transferred ownership. - Three bug fixes for bcache from Eric Wheeler. - Set of fixes for xen-blk{back,front} from Jan and Konrad. - Removal of the cpqarray driver. It has been disabled in Kconfig since 2013, and we were initially scheduled to remove it in 3.15. - Various updates and fixes for NVMe, with the most important being: - Removal of the per-device NVMe thread, replacing that with a watchdog timer instead. From Christoph. - Exposing the namespace WWID through sysfs, from Keith. - Set of cleanups from Ming Lin. - Logging the controller device name instead of the underlying PCI device name, from Sagi. - And a bunch of fixes and optimizations from the usual suspects in this area" * 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits) NVMe: Expose ns wwid through single sysfs entry drivers:block: cpqarray clean up brd: Fix discard request processing cpqarray: remove it from the kernel cciss: update MAINTAINERS NVMe: Remove unused sq_head read in completion path bcache: fix cache_set_flush() NULL pointer dereference on OOM bcache: cleaned up error handling around register_cache() bcache: fix race of writeback thread starting before complete initialization NVMe: Create discard zero quirk white list nbd: use correct div_s64 helper mtip32xx: remove unneeded variable in mtip_cmd_timeout() lightnvm: generalize rrpc ppa calculations lightnvm: remove struct nvm_dev->total_blocks lightnvm: rename ->nr_pages to ->nr_sects lightnvm: update closed list outside of intr context xen/blback: Fit the important information of the thread in 17 characters lightnvm: fold get bb tbl when using dual/quad plane mode lightnvm: fix up nonsensical configure overrun checking xen-blkback: advertise indirect segment support earlier ...	2016-03-18 17:13:31 -07:00
Keith Busch	118472ab85	NVMe: Expose ns wwid through single sysfs entry The method to uniquely identify a namespace depends on the controller's specification revision level and implemented capabilities. This patch has the driver figure this out and exports the unique string through a single 'wwid' attribute so the user doesn't have this burden. The longest namespace unique identifier is used if available. If not available, the driver will concat the controller's vendor, serial, and model with the namespace ID. The specification provides this as a unique indentifier. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-16 07:46:25 -07:00
Jon Derrick	48c7823f42	NVMe: Remove unused sq_head read in completion path Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-08 15:01:06 -07:00
Keith Busch	08095e7078	NVMe: Create discard zero quirk white list The NVMe specification does not require discarded blocks return zeroes on read, but provides that behavior as a possibility. Some applications more efficiently use an SSD if reads on discarded blocks were deterministically zero, based on the "discard_zeroes_data" queue attribute. There is no specification defined way to determine device behavior on discarded blocks, so the driver always left the queue setting disabled. We can only know behavior based on individual device models, so this patch adds a flag to the NVMe "quirk" list that vendors may set if they know their controller works that way. The patch also sets the new flag for one such known device. Signed-off-by: Keith Busch <keith.busch@intel.com> Suggested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-08 08:32:40 -07:00
Matias Bjørling	d5bdec8ddb	lightnvm: fold get bb tbl when using dual/quad plane mode When the media manager runs in dual or quad plane mode, lightnvm abstracts away plane specific commands. This poses a problem for get bad block table, as it reports bad blocks per plane, making the table either two or four times bigger than expected. Fold the bad block list before returning. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:45:53 -07:00
Christoph Hellwig	45686b6198	nvme: fix max_segments integer truncation The block layer uses an unsigned short for max_segments. The way we calculate the value for NVMe tends to generate very large 32-bit values, which after integer truncation may lead to a zero value instead of the desired outcome. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:43:10 -07:00
Christoph Hellwig	da35825d9a	nvme: set queue limits for the admin queue Factor out a helper to set all the device specific queue limits and apply them to the admin queue in addition to the I/O queues. Without this the command size on the admin queue is arbitrarily low, and the missing other limitations are just minefields waiting for victims. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:50 -07:00
Keith Busch	e9fc63d682	NVMe: Fix 0-length integrity payload A user could send a passthrough IO command with a metadata pointer to a namespace without metadata. With metadata length of 0, kmalloc returns ZERO_SIZE_PTR. Since that is not NULL, the driver would have set this as the bio's integrity payload, which causes an access fault on completion. This patch ignores the users metadata buffer if the namespace format does not support separate metadata. Reported-by: Stephen Bates <stephen.bates@microsemi.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:50 -07:00
Keith Busch	63088ec7c8	NVMe: Don't allow unsupported flags The command flags can change the meaning of other fields in the command that the driver is not prepared to handle. Specifically, the user could passthrough an SGL flag, causing the controller to misinterpret the PRP list the driver created, potentially corrupting memory or data. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:50 -07:00
Keith Busch	69d9a99c25	NVMe: Move error handling to failed reset handler This moves failed queue handling out of the namespace removal path and into the reset failure path, fixing a hanging condition if the controller fails or link down during del_gendisk. Previously the driver had to see the controller as degraded prior to calling del_gendisk to setup the queues to fail. But, if the controller happened to fail after this, there was no task to end outstanding requests. On failure, all namespace states are set to dead. This has capacity revalidate to 0, and ends all new requests with error status. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:50 -07:00
Keith Busch	f58944e265	NVMe: Simplify device reset failure A reset failure schedules the device to unbind from the driver through the pci driver's remove. This cleans up all intialization, so there is no need to duplicate the potentially racy cleanup. To help understand why a reset failed, the status is logged with the existing warning message. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Keith Busch	646017a612	NVMe: Fix namespace removal deadlock This patch makes nvme namespace removal lockless. It is up to the caller to ensure no active namespace scanning is occuring. To ensure no scan work occurs, the nvme pci driver adds a removing state to the controller device to avoid queueing scan work during removal. The work is flushed after setting the state, so no new scan work can be queued. The lockless removal allows the driver to cleanup a namespace request_queue if the controller fails during removal. Previously this could deadlock trying to acquire the namespace mutex in order to handle such events. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Keith Busch	075790ebba	NVMe: Use IDA for namespace disk naming A namespace may be detached from a controller, but a user may be holding a reference to it. Attaching a new namespace with the same NSID will create duplicate names when using the NSID to name the disk. This patch uses an IDA that is released only when the last reference is released instead of using the namespace ID. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Keith Busch	b00a726a9f	NVMe: Don't unmap controller registers on reset Unmapping the registers on reset or shutdown is not necessary. Keeping the mapping simplifies reset handling. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Ming Lin	931e1c2204	nvme: expose cntlid in sysfs For NVMe over Fabrics, the cntlid will be used by systemd/udev to create link to the device, for example, /dev/disk/by-path/<fabrics-info>-<cntlid>-<namespace> -> /dev/nvme0n1 Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 12:31:27 -07:00
Christoph Hellwig	1cb3cce5eb	nvme: return the whole CQE through the request passthrough interface Both LighNVM and NVMe over Fabrics need to look at more than just the status and result field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bj?rling <m@bjorling.me> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:17 -07:00
Christoph Hellwig	2d55cd5f51	nvme: replace the kthread with a per-device watchdog timer The only work left in the kthread is the periodic health check for each controller. There is no need to run this from process context or keep a thread context around for it, so replace it with a simpler timer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:16 -07:00
Christoph Hellwig	79f2b358c9	nvme: don't poll the CQ from the kthread There is no reason to do unconditional polling of CQs per the NVMe spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:14 -07:00
Christoph Hellwig	9396dec916	nvme: use a work item to submit async event requests Use a dedicated work item to submit async event requests instead of the global kthread. This simplifies the code and reduces the latencies to resubmit a request once an even notification happened. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:13 -07:00
Keith Busch	f8e68a7c9a	NVMe: Rate limit nvme IO warnings We don't need to spam the kernel logs with thousands of IO cancelling messages. We can infer all IO's are being cancelled with fewer, or even none at all. This patch rate limits the message and uses the debug log level as it is mainly used for testing purposes. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:31 -07:00
Keith Busch	ff23a2a15a	NVMe: Poll device while still active during remove A device failure or link down wouldn't have been detected during namespace removal. This patch keeps the device in the list for polling so that the thread may see such failure and initiate a reset. The device is removed from the list after disable, so we can safely flush the reset work as it can't be requeued when disable completes. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:16 -07:00
Keith Busch	ae1fba2001	NVMe: Requeue requests on suspended queues It's possible a request may get to the driver after the nvme queue was disabled. This has the request requeue if that happens. Note the request is still "started" by the driver, but requeuing will clear the start state for timeout handling. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:08 -07:00
Keith Busch	ef2d4615c5	NVMe: Allow request merges It is generally more efficient to submit larger IO. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-11 13:14:03 -07:00
Keith Busch	4f76d0e498	NVMe: Fix io incapable return values The function returns true when the controller can't handle IO. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-11 13:14:02 -07:00
Ming Lin	576d55d625	nvme: split pci module out of core module NVMe over Fabrics drivers are going to reuse the core, so splits nvme.ko into 2 modules: nvme-core.ko: the core part nvme.ko: the PCI driver Export symbols from nvme-core.ko. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:38 -07:00
Ming Lin	9f2482b91b	nvme: split dev_list_lock Split dev_list_lock into one in the core and one in the PCI driver. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:36 -07:00
Ming Lin	ba0ba7d3e5	nvme: move timeout variables to core.c These variables are used by PCI driver and will also be used in the forthcoming NVMe over Fabrics drivers. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:34 -07:00
Sagi Grimberg	e439bb12e7	nvme/host: reference the fabric module for each bdev open callout We don't want to be able to unload the fabric driver when we have openened referenced to our namespaces. Thus, for each nvme_open we take a reference on the fabric driver and put it in nvme_release. This behavior is consistent with the scsi model. This resolves the panic when unloading a fabric module with mpath holders. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ian Bakshan <ianb@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:32 -07:00
Sagi Grimberg	1b3c47c182	nvme: Log the ctrl device name instead of the underlying pci device name Having the ctrl name "nvmeX" seems much more friendly than the underlying device name. Also, with other nvme transports such as the soon to come nvme-loop we don't have an underlying device so it doesn't makes sense to make up one. In order to help matching an instance name to a pci function, we add a info print in nvme_probe. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Manually fixed up the hunk in nvme_cancel_queue_ios(). Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 08:51:15 -07:00
Christoph Hellwig	f4f0f63e6f	nvme: fix drvdata setup for the nvme device Pass the right private data to device_create_with_groups from the beginning, and remove the superflous call to dev_set_drvdata. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-09 12:44:03 -07:00
Keith Busch	949928c1c7	NVMe: Fix possible queue use after freed This notifies blk-mq when the tag set contains a different number of queues prior to freeing unused ones that the request queue points to. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-09 12:42:35 -07:00
Christoph Hellwig	21d147880e	nvme: fix Kconfig description for BLK_DEV_NVME_SCSI Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-09 10:21:22 -07:00
Matias Bjørling	6dde1d6c90	lightnvm: check overflow and correct mlc pairs The specification currently limits the number of MLC pairs to 886. Make sure that a device is unable to be instantiate if more is configured. Also, previously the patch had the wrong math for copying MLC pairs, as it only copied half of the actual entries. Fixes: `ca5927e7ab` "lightnvm: introduce mlc lower page table mappings" Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-04 09:19:45 -07:00
Linus Torvalds	3e1e21c7bf	Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-block Pull NVMe updates from Jens Axboe: "Last branch for this series is the nvme changes. It's in a separate branch to avoid splitting too much between core and NVMe changes, since NVMe is still helping drive some blk-mq changes. That said, not a huge amount of core changes in here. The grunt of the work is the continued split of the code" * 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits) uapi: update install list after nvme.h rename NVMe: Export NVMe attributes to sysfs group NVMe: Shutdown controller only for power-off NVMe: IO queue deletion re-write NVMe: Remove queue freezing on resets NVMe: Use a retryable error code on reset NVMe: Fix admin queue ring wrap nvme: make SG_IO support optional nvme: fixes for NVME_IOCTL_IO_CMD on the char device nvme: synchronize access to ctrl->namespaces nvme: Move nvme_freeze/unfreeze_queues to nvme core PCI/AER: include header file NVMe: Export namespace attributes to sysfs NVMe: Add pci error handlers block: remove REQ_NO_TIMEOUT flag nvme: merge iod and cmd_info nvme: meta_sg doesn't have to be an array nvme: properly free resources for cancelled command nvme: simplify completion handling nvme: special case AEN requests ...	2016-01-21 19:58:02 -08:00
Linus Torvalds	0a13daedf7	Merge branch 'for-4.5/lightnvm' of git://git.kernel.dk/linux-block Pull lightnvm fixes and updates from Jens Axboe: "This should have been part of the drivers branch, but it arrived a bit late and wasn't based on the official core block driver branch. So they got a small scolding, but got a pass since it's still new. Hence it's in a separate branch. This is mostly pure fixes, contained to lightnvm/, and minor feature additions" * 'for-4.5/lightnvm' of git://git.kernel.dk/linux-block: (26 commits) lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM lightnvm: introduce factory reset lightnvm: use system block for mm initialization lightnvm: introduce ioctl to initialize device lightnvm: core on-disk initialization lightnvm: introduce mlc lower page table mappings lightnvm: add mccap support lightnvm: manage open and closed blocks separately lightnvm: fix missing grown bad block type lightnvm: reference rrpc lun in rrpc block lightnvm: introduce nvm_submit_ppa lightnvm: move rq->error to nvm_rq->error lightnvm: support multiple ppas in nvm_erase_ppa lightnvm: move the pages per block check out of the loop lightnvm: sectors first in ppa list lightnvm: fix locking and mempool in rrpc_lun_gc lightnvm: put block back to gc list on its reclaim fail lightnvm: check bi_error in gc lightnvm: return the get_bb_tbl return value lightnvm: refactor end_io functions for sync ...	2016-01-21 19:01:55 -08:00
Linus Torvalds	7c24d9f3b2	Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "We don't have a lot of core changes this time around, it's mostly in drivers, which will come in a subsequent pull. The cores changes include: - blk-mq - Prep patch from Christoph, changing blk_mq_alloc_request() to take flags instead of just using gfp_t for sleep/nosleep. - Doc patch from me, clarifying the difference between legacy and blk-mq for timer usage. - Fixes from Raghavendra for memory-less numa nodes, and a reuse of CPU masks. - Cleanup from Geliang Tang, using offset_in_page() instead of open coding it. - From Ilya, rename request_queue slab to it reflects what it holds, and a fix for proper use of bdgrab/put. - A real fix for the split across stripe boundaries from Keith. We yanked a broken version of this from 4.4-rc final, this one works. - From Mike Krinkin, emit a trace message when we split. - From Wei Tang, two small cleanups, not explicitly clearing memory that is already cleared" * 'for-4.5/core' of git://git.kernel.dk/linux-block: block: use bd{grab,put}() instead of open-coding block: split bios to max possible length block: add call to split trace point blk-mq: Avoid memoryless numa node encoded in hctx numa_node blk-mq: Reuse hardware context cpumask for tags blk-mq: add a flags parameter to blk_mq_alloc_request Revert "blk-flush: Queue through IO scheduler when flush not required" block: clarify blk_add_timer() use case for blk-mq bio: use offset_in_page macro block: do not initialise statics to 0 or NULL block: do not initialise globals to 0 or NULL block: rename request_queue slab cache	2016-01-19 15:03:34 -08:00
Keith Busch	779ff75617	NVMe: Export NVMe attributes to sysfs group Adds all controller information to attribute list exposed to sysfs, and appends the reset_controller attribute to it. The nvme device is created with this attribute list, so driver no long manages its attributes. Reported-by: Sujith Pandel <sujithpshankar@gmail.com> Cc: Sujith Pandel <sujithpshankar@ gmail.com> Cc: David Milburn <dmilburn@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 15:11:54 -07:00
Keith Busch	a5cdb68c2c	NVMe: Shutdown controller only for power-off We don't need to shutdown a controller for a reset. A controller in a shutdown state may take longer to become ready than one that was simply disabled. This patch has the driver shut down a controller only if the device is about to be powered off or being removed. When taking the controller down for a reset reason, the controller will be disabled instead. Function names have been updated in this patch to reflect their changed semantics. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 14:47:59 -07:00
Keith Busch	db3cbfff5b	NVMe: IO queue deletion re-write The nvme driver deletes IO queues asynchronously since this operation may potentially take an undesirable amount of time with a large number of queues if done serially. The driver used to manage coordinating asynchronous deletions. This patch simplifies that by leveraging the block layer rather than using kthread workers and chaining more complicated callbacks. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 14:47:56 -07:00
Keith Busch	25646264e1	NVMe: Remove queue freezing on resets NVMe submits all commands through the block layer now. This means we can let requests queue at the blk-mq hardware context since there is no path that bypasses this anymore so we don't need to freeze the queues anymore. The driver can simply stop the h/w queues from running during a reset instead. This also fixes a WARN in percpu_ref_reinit when the queue was unfrozen with requeued requests. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:33:36 -07:00
Keith Busch	1d49c38c48	NVMe: Use a retryable error code on reset A negative status has the "do not retry" bit set, which makes it not retryable. Use a fake status that can potentially be retried on reset. An aborted command's status is overridden by the timeout handler so that it won't be retried, which is necessary to keep initialization from getting into a reset loop. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:33:35 -07:00
Keith Busch	e3e9d50cd6	NVMe: Fix admin queue ring wrap The tag set queue depth needs to be one less than the h/w queue depth so we don't wrap the circular buffer. This conforms to the specification defined "Full Queue" condition. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:33:33 -07:00
Christoph Hellwig	4490733250	nvme: make SG_IO support optional Translation SCSI commands to NVMe commands is rather pointless in general as applications must not expext to be able to use SCSI commands on a generic block device. Make the huge translation layer optional and hope no one will ever enable it in the future. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:30:16 -07:00
Christoph Hellwig	bfd8947194	nvme: fixes for NVME_IOCTL_IO_CMD on the char device Make sure we synchronize access to the namespaces list and grab a reference to the namespace before doing I/O. Make sure to reject the ioctl if multiple namespaces are present as it's entirely unsafe, and warn when using it even with a single namespace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:30:15 -07:00
Christoph Hellwig	69d3b8ac15	nvme: synchronize access to ctrl->namespaces Currently traversal and modification of ctrl->namespaces happens completely unsynchronized, which can be fixed by the addition of a simple mutex. Note: nvme_dev_ioctl will be handled in the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:30:13 -07:00
Sagi Grimberg	363c9aacb6	nvme: Move nvme_freeze/unfreeze_queues to nvme core Nothing pci specific about them and We'll need them exported in other transports too. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:30:11 -07:00
Matias Bjørling	ca5927e7ab	lightnvm: introduce mlc lower page table mappings NAND MLC memories have both lower and upper pages. When programming, both of these must be written, before data can be read. However, these lower and upper pages might not placed at even and odd flash pages, but can be skipped. Therefore each flash memory has its lower pages defined, which can then be used when programming and to know when padding are necessary. This patch implements the lower page definition in the specification, and exposes it through a simple lookup table at dev->lptbl. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 08:21:17 -07:00
Matias Bjørling	22513215b8	lightnvm: return the get_bb_tbl return value During get_bb_tbl, a callback is used to allow an user-specific scan function to be called. The callback may return an error, and in that case, the return value is overridden. However, the callback error is needed when the fault is a user error and not a kernel error. For example, when a user tries to initialize the same device twice. The get_bb_tbl callback should be able to communicate this. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 08:21:16 -07:00
Matias Bjørling	91276162de	lightnvm: refactor end_io functions for sync To implement sync I/O support within the LightNVM core, the end_io functions are refactored to take an end_io function pointer instead of testing for initialized media manager, followed by calling its end_io function. Sync I/O can then be implemented using a callback that signal I/O completion. This is similar to the logic found in blk_to_execute_io(). By implementing it this way, the underlying device I/Os submission logic is abstracted away from core, targets, and media managers. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 08:21:16 -07:00
Keith Busch	b5875222de	NVMe: IO ending fixes on surprise removal This patch fixes a lost request discovered during IO + hot removal. The driver's pci removal deletes gendisks prior to shutting down the controller to allow dirty data to sync. Dirty data can not be synced on a surprise removal, though, and would potentially block indefinitely. The driver previously had marked the queue as dying in this scenario to prevent new requests from attempting, however it will still block for requests that already entered the queue. This patch fixes this by quiescing IO first, then aborting the requeued requests before deleting disks. Reported-by: Sujith Pandel <sujith_pandel@dell.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Tested-by: Sujith Pandel <sujith_pandel@dell.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 10:12:04 -07:00
Keith Busch	2b9b6e86bc	NVMe: Export namespace attributes to sysfs Exposes the NGUID, EUI-64, and NSID to sysfs entries under the disk's kobject. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 10:10:45 -07:00
Keith Busch	a0a3408ee6	NVMe: Add pci error handlers Requests enabling pcie aer support. Shuts down the controller on error detected with io frozen state prior to requesting slot reset; resumes controller after reset completes. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 10:09:06 -07:00
Christoph Hellwig	f4800d6d15	nvme: merge iod and cmd_info Merge the two per-request structures in the nvme driver into a single one. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	bf68405705	nvme: meta_sg doesn't have to be an array Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	eee417b069	nvme: properly free resources for cancelled command We need to move freeing of resources to the ->complete handler to ensure they are also freed when we cancel the command. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	aae239e191	nvme: simplify completion handling Now that all commands are executed as block layer requests we can remove the internal completion in the NVMe driver. Note that we can simply call blk_mq_complete_request to abort commands as the block layer will protect against double copletions internally. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	adf68f21c1	nvme: special case AEN requests AEN requests are different from other requests in that they don't time out or can easily be cancelled. Because of that we should not use the blk-mq infrastructure but just special case them in the completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	e7a2a87d59	nvme: switch abort to blk_execute_rq_nowait And remove the now unused nvme_submit_cmd helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	d8f32166a9	nvme: switch delete SQ/CQ to blk_execute_rq_nowait Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	7688faa6dd	nvme: factor out a few helpers from req_completion We'll need them in other places later. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Christoph Hellwig	4680072003	nvme: fix admin queue depth The number in tag_set->queue depth includes the reserved tags. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Keith Busch	4b9d5b1510	NVMe: Simplify metadata setup We no longer require the two-pass setup for block integrity. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:34 -07:00
Keith Busch	53029b0441	NVMe: Remove device management handles on remove We don't want to allow new references to open on a device that is removed. This ties the lifetime of these handles to the physical device's presence rather than to the open reference count. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Keith Busch	92f7a1624b	NVMe: Use unbounded work queue for all work Removes all usage of the global work queue so work can't be scheduled on two different work queues, and removes nvme's work queue singlethreadedness so controllers can be driven in parallel. Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: keep the dead controller removal on the system workqueue to avoid deadlocks] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Keith Busch	540c801c65	NVMe: Implement namespace list scanning The NVMe 1.1 specification provides an identify mode to return a list of active namespaces. This is more efficient to discover which namespace identifiers are active on a controller, providing potentially significant improvement in scan time for controllers with sparesly populated namespaces. Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: add quirk for the broken Qemu Identify implementation. To be relaxed later] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	6bf25d1641	nvme: switch abort_limit to an atomic_t There is no lock to sychronize access to the abort_limit field of struct nvme_ctrl, so switch it to an atomic_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	5c8809e650	nvme: remove dead controllers from a work item Compared to the kthread this gives us multiple call prevention for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	fd634f4142	nvme: merge probe_work and reset_work If we're using two work queues we're always going to run into races where one item is tearing down what the other one is initializing. So insted merge the two work queues, and let the old probe_work also tear the controller down first if it was alive. Together with the better detection of the probe path using a flag this gives us a properly serialized reset/probe path that also doesn't accidentally trigger when two commands time out and the second one tries to reset the controller while the first reset is still in progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Keith Busch	e1569a1618	nvme: do not restart the request timeout if we're resetting the controller Otherwise we're never going to complete a command when it is restarted just after we completed all other outstanding commands in nvme_clear_queue. The controller must be disabled prior to completing a presumed lost command, do this by directly shutting down the controller before queueing the reset work, and return EH_HANDLED from the timeout handler after we shut the controller down. Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: split and rebase] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	846cc05f95	nvme: simplify resets Don't delete the controller from dev_list before queuing a reset, instead just check for it being reset in the polling kthread. This allows to remove the dev_list_lock in various places, and in addition we can simply rely on checking the queue_work return value to see if we could reset a controller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	297465c873	nvme: add NVME_SC_CANCELLED To properly document how we are using a negative Linux error value to communicate request cancellations inside the driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	31c7c7d2c9	nvme: merge nvme_abort_req and nvme_timeout We want to be able to return bettern error values frmo nvme_timeout, which is significantly easier if the two functions are merged. Also clean up and reduce the printk spew so that we only get one message per abort. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:33 -07:00
Christoph Hellwig	4c9f748f0e	nvme: don't take the I/O queue q_lock in nvme_timeout There is nothing it protects, but it makes lockdep unhappy in many different ways. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:32 -07:00
Keith Busch	77bf25ea70	nvme: protect against simultaneous shutdown invocations Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:32 -07:00
Christoph Hellwig	7385014c07	nvme: only add a controller to dev_list after it's been fully initialized Without this we can easily get bad derferences on nvmeq->d_db when the nvme kthread tries to poll the CQs for controllers that are in half initialized state. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:23 -07:00
Christoph Hellwig	749941f236	nvme: only ignore hardware errors in nvme_create_io_queues Half initialized queues due to kernel error returns or timeout are still a good reason to give up on initializing a controller. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-22 09:38:20 -07:00
Dan Carpenter	8c0b391550	nvme: precedence bug in nvme_pr_clear() The "\|" operator has higher precedence than "?:" so this didn't work as intended. I had previously fixed this bug, but it we copied the older unfixed version when we moved the function between files. Fixes: `1673f1f08c` ('nvme: move block_device_operations and ns/ctrl freeing to common code') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-09 10:56:52 -07:00
Arnd Bergmann	d1ea7be5f7	nvme: fix another 32-bit build warning The nvme_user_cmd function was recently moved around from one file to another, which made a warning reappear that I had fixed before at some point: drivers/nvme/host/core.c: In function 'nvme_user_cmd': drivers/nvme/host/core.c:424:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] This applies the same workaround that we have elsewhere in the driver with an extra type cast to uintptr_t. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `1673f1f08c` ("nvme: move block_device_operations and ns/ctrl freeing to common code") Link: https://lkml.org/lkml/2015/10/9/611 Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-08 12:14:25 -07:00
Matias Bjørling	16f26c3aa9	lightnvm: replace req queue with nvmdev for lld In the case where a request queue is passed to the low lever lightnvm device drive integration, the device driver might pass its admin commands through another queue. Instead pass nvm_dev, and let the low level drive the appropriate queue. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-07 09:14:19 -07:00
Matias Bjørling	437f8f9a1e	lightnvm: check mm before use The core can may issue I/Os before a media manager is registered with the lightnvm subsystem. Make sure that we don't call the media manager ->end_io prematurely with a null pointer. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-07 09:14:19 -07:00
Christoph Hellwig	ac02dddec6	NVMe: fix build with CONFIG_NVM enabled Looks like I didn't test with CONFIG_NVM enabled, and neither did the build bot. Most of this is really weird crazy shit in the lighnvm support, though. Struct nvme_ns is a structure for the NVM I/O command set, and it has no business poking into it. Second this commit: commit `47b3115ae7` Author: Wenwei Tao <ww.tao0320@gmail.com> Date: Fri Nov 20 13:47:55 2015 +0100 nvme: lightnvm: use admin queues for admin cmds Does even more crazy stuff. If a function gets a request_queue parameter passed it'd better use that and not look for another one. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-03 09:52:05 -07:00
Keith Busch	06c1e3902a	blk-integrity: empty implementation when disabled This patch moves the blk_integrity_payload definition outside the CONFIG_BLK_DEV_INTERITY dependency and provides empty function implementations when the kernel configuration disables integrity extensions. This simplifies drivers that make use of these to map user data so they don't need to repeat the same configuration checks. Signed-off-by: Keith Busch <keith.busch@intel.com> Updated by Jens to pass an error pointer return from bio_integrity_alloc(), otherwise if CONFIG_BLK_DEV_INTEGRITY isn't set, we return a weird ENOMEM from __nvme_submit_user_cmd() if a meta buffer is set. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-03 09:32:21 -07:00
Stephan Günther	1f390c1fde	nvme: temporary fix for Apple controller reset Recent patches added basic support for the Apple NVMe controller but still cause resets and data corruption on that particular controller when a specific pattern of read/flush commands occurs. Limiting the queue depth to 2 works around that issue. This patch enforces that limit only for the Apple controller and is considered a temporary fix until we find the root source of that problem. Signed-off-by: Stephan Günther <guenther@tum.de> Signed-off-by: Maurice Leclaire <leclaire@in.tum.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 13:23:22 -07:00
Christoph Hellwig	9a0be7abb6	nvme: refactor set_queue_count Split out a helper that just issues the Set Features and interprets the result which can go to common code, and document why we are ignoring non-timeout error returns in the PCIe driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:40 -07:00
Christoph Hellwig	f3ca80fc11	nvme: move chardev and sysfs interface to common code For this we need to add a proper controller init routine and a list of all controllers that is in addition to the list of PCIe controllers, which stays in pci.c. Note that we remove the sysfs device when the last reference to a controller is dropped now - the old code would have kept it around longer, which doesn't make much sense. This requires a new ->reset_ctrl operation to implement controleller resets, and a new ->write_reg32 operation that is required to implement subsystem resets. We also now store caches copied of the NVMe compliance version and the flag if a controller is attached to a subsystem or not in the generic controller structure now. Signed-off-by: Christoph Hellwig <hch@lst.de> [Fixes for pr merge] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:40 -07:00
Christoph Hellwig	5bae7f73d3	nvme: move namespace scanning to common code The namespace scanning code has been mostly generic already, we just need to store a pointer to the tagset in the nvme_ctrl structure, and add a method to check if a controller is I/O incapable. The latter will hopefully be replaced by a proper controller state machine soon. Signed-off-by: Christoph Hellwig <hch@lst.de> [Fixed pr conflicts] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:40 -07:00
Christoph Hellwig	ce4541f40a	nvme: move the call to nvme_init_identify earlier We want to record the identify and CAP values even if no I/O queue is available. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:40 -07:00
Christoph Hellwig	7fd8930f26	nvme: add a common helper to read Identify Controller data And add the 64-bit register read operation for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	5fd4ce1b00	nvme: move nvme_{enable,disable,shutdown}_ctrl to common code Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	1b2eb37465	nvme: move remaining CC setup into nvme_enable_ctrl Remove the calculation of all the bits written into the CC register into nvme_enable_ctrl, so that they can be moved into the core NVMe driver in the future. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	106198edb7	nvme: add explicit quirk handling Add an enum for all workarounds not in the spec and identify the affected controllers at probe time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	1673f1f08c	nvme: move block_device_operations and ns/ctrl freeing to common code This moves the block_device_operations over to common code mostly as-is. The only change is that the ns and ctrl refcounting got some small refcounting to have wrappers around the kref_put operations. A new free_ctrl operation is added to allow the PCI driver to free it's ressources on the final drop. Signed-off-by: Christoph Hellwig <hch@lst.de> [Moved the integrity and pr changes due to merge conflict] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Keith Busch	0b7f1f26f9	nvme: use the block layer for userspace passthrough metadata Use the integrity API to pass through metadata from userspace. For PI enabled devices this means that we now validate the reftag, which seems like an unintentional ommission in the old code. Thanks to Keith Busch for testing and fixes. Signed-off-by: Christoph Hellwig <hch@lst.de> [Skip metadata setup on admin commands] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	4160982e75	nvme: split __nvme_submit_sync_cmd Add a separate nvme_submit_user_cmd for commands that directly DMA to or from userspace. We'll add metadata support to that soon and the common version would become too messy. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	22944e9981	nvme: move nvme_setup_flush and nvme_setup_rw to common code And mark them inline so that we don't slow down the I/O submission path by having to turn it into a forced out of line call. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	15a190f7f5	nvme: move nvme_error_status to common code And mark it inline so that we don't slow down the completion path by having to turn it into a forced out of line call. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	d4f6c3aba5	nvme: factor out a nvme_unmap_data helper This is the counter part to nvme_map_data. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	ba1ca37ea4	nvme: refactor nvme_queue_rq This "backports" the structure I've used for the fabrics driver. It mostly started out as a cleanup so that I could actually understand the code, but I think it also qualifies as a micro-optimization due to the reduced time we hold q_lock and disable interrupts. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	69d2b57174	nvme: simplify nvme_setup_prps calling convention Pass back a true/false value instead of the length which needs a compare with the bytes in the request and drop the pointless gfp_t argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:39 -07:00
Christoph Hellwig	1c63dc6658	nvme: split a new struct nvme_ctrl out of struct nvme_dev The new struct nvme_ctrl will be used by the common NVMe code that sits on top of struct request_queue and the new nvme_ctrl_ops abstraction. It only contains the bare minimum required, which consists of values sampled during controller probe, the admin queue pointer and a second struct device pointer at the moment, but more will follow later. Only values that are not used in the I/O fast path should be moved to struct nvme_ctrl so that drivers can optimize their cache line usage easily. That's also the reason why we have two device pointers as the struct device is used for DMA mapping purposes. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	01fec28a6f	nvme: use vendor it from identify Use the vendor ID from the identify data instead of the PCI device to make the SCSI translation layer independent from the PCI driver. The NVMe spec defines them as having the same value for current PCIe devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	bf7d3ebbd2	nvme: split nvme_trans_device_id_page Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	7a67cbea65	nvme: use offset instead of a struct for registers This makes life easier for future non-PCI drivers where access to the registers might be more complicated. Note that Linux drivers are pretty evenly split between the two versions, and in fact the NVMe driver already uses offsets for the doorbells. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> [Fixed CMBSZ offset] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	21d34711e1	nvme: split command submission helpers out of pci.c Create a new core.c and start by adding the command submission helpers to it, which are already abstracted away from the actual hardware queues by the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	71bd150c71	nvme: move struct nvme_iod to pci.c This structure is specific to the PCIe driver internals and should be moved to pci.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:59:38 -07:00
Christoph Hellwig	6f3b0e8bcf	blk-mq: add a flags parameter to blk_mq_alloc_request We already have the reserved flag, and a nowait flag awkwardly encoded as a gfp_t. Add a real flags argument to make the scheme more extensible and allow for a nicer calling convention. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-12-01 10:53:59 -07:00
Matias Bjørling	08236c6bb2	lightnvm: unconverted ppa returned in get_bb_tbl The get_bb_tbl function takes ppa as a generic address, which is converted to the ppa device address within the device driver. When the update_bbtbl callback is called from get_bb_tbl, the device specific ppa is used, instead of the generic ppa. Make sure to pass the generic ppa. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-29 14:34:58 -07:00
Matias Bjørling	09f2e71609	lightnvm: refactor and change vendor id for qemu The QEMU NVMe implementation uses Intel vendor, Intel device id, and the first vendor specific byte to identify a LightNVM compatible nvme instance. Instead of using the Intel specific, use a preallocated from CNEX Labs instead. This lets us uniquely identify a QEMU lightnvm device without breaking other vendor specific work in the qemu device driver. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-29 14:34:58 -07:00
Keith Busch	c4699e70d1	lightnvm: Simplify config when disabled We shouldn't compile an object file to get empty implementations; conforms to linux coding style on conditional compilation. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-29 14:34:57 -07:00
Christoph Hellwig	bf508e910b	nvme: add missing unmaps in nvme_queue_rq When we fail various metadata related operations in nvme_queue_rq we need to unmap the data SGL. Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-24 15:24:05 -07:00
Nishanth Aravamudan	c5c9f25b98	NVMe: default to 4k device page size We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA aligment matches the kernel's page aligment. On Power, the the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size. There is not currently an API to obtain the IOMMU's page size across all architectures and in the interest of a stop-gap fix to this functional issue, default the NVMe device page size to 4K, with the intent of adding such an API and implementation across all architectures in the next merge window. With the functionally equivalent v3 of this patch, our hardware test exerciser survives when using 32-bit DMA; without the patch, the kernel will BUG within a few minutes. Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-24 15:05:51 -07:00
Keith Busch	604e8c8da8	NVMe: reap completion entries when deleting queue Make sure that there are no unprocesssed entries on a completion queue before deleting it, and check for validity of the CQ door bell before writing completions to it. This fixes problems with doing a sysfs reset of the device while it's handling IO. Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-20 08:38:13 -07:00
Wenwei Tao	47b3115ae7	nvme: lightnvm: use admin queues for admin cmds According to the Open-Channel SSD Specification, the NVMe-NVM admin commands use vendor specific opcodes of NVMe, so use the NVMe admin queue to dispatch these commands. Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com> Updated by me to include set bad block table as well and also use the admin queue for l2p len calculation. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-20 08:33:18 -07:00
Keith Busch	6824c5ef5e	NVMe: Fix possible arithmetic overflow for max segments Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-19 13:02:24 -07:00
Matias Bjørling	dad1b00977	nvme: remove reserved double word The specification was updated the remove the double word just after number of configuration groups and capabilities. Update the identify structure to reflect it. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:38 -07:00
Matias Bjørling	2393bd39c7	nvme: missing ppaf copy The ppa format was not copied from the NVMe specific ppa format to the lightnvm specific ppa format. This led to the ppa format not being communicated to the layers above. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:37 -07:00
Matias Bjørling	7386af270c	lightnvm: remove linear and device addr modes The linear and device specific address modes can be replaced with a simple offset and bit length conversion that is generic across all devices. This both simplifies the specification and removes the special case for qemu nvme, that previously relied on the linear address mapping. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:34 -07:00
Matias Bjørling	12be5edf68	lightnvm: expose mccap in identify command The mccap field is required for I/O command option support. It defines the following flash access modes: * SLC mode * Erase/Program Suspension * Scramble On/Off * Encryption It is slotted in between mpos and cpar, changing the offset for cpar as well. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:27 -07:00
Matias Bjørling	36d5dbc694	lightnvm: update alignments for identify command A single 8 bit and 16 bit reserve field were inserted in the specification to align fields appropriately. Reflect this in the identify group structure. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:26 -07:00
Matias Bjørling	1145046983	lightnvm: update bad block table format The specification was changed to reflect a multi-value bad block table. Instead of bit-based bad block table, the bad block table now allows eight bad block categories. Currently four are defined: * Factory bad blocks * Grown bad blocks * Device-side reserved blocks * Host-side reserved blocks The factory and grown bad blocks are the regular bad blocks. The reserved blocks are either for internal use or external use. In particular, the device-side reserved blocks allows the host to bootstrap from a limited number of flash blocks. Reducing the flash blocks to scan upon super block initialization. Support for both get bad block table and set bad block table is added. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-16 15:20:25 -07:00
Stephan Günther	c74dc7801d	NVMe: add support for Apple NVMe controller Add PCI ID of Apple's NVMe controller. Signed-off-by: Stephan Guenther <guenther@tum.de> Signed-off-by: Maurice Leclaire <leclaire@in.tum.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-11 09:36:57 -07:00
Stephan Günther	a310acd7a7	NVMe: use split lo_hi_{read,write}q Some controllers may require ordered split transfers even on 64bit machines, e.g. Apple's NVMe controller as found in the MacBook8,1 and MacBookAir7,1 (256/512GB models). This patch enforces ordered split transfers on 64bit platforms, which works around that issue for all controllers. As pointed out by Christoph [1] there should be no performance impact due to that modification. [1] http://lists.infradead.org/pipermail/linux-nvme/2015-November/002965.html Signed-off-by: Stephan Guenther <guenther@tum.de> Signed-off-by: Maurice Leclaire <leclaire@in.tum.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Updated by me to explicitly use lo_hi_read/writeq instead of playing define tricks. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-11 09:36:56 -07:00
Sathyavathi M	b12363d0a5	NVMe: Increase the max transfer size when mdts is 0 This patch address the issue when IO with 128KB from FIO is split into two parts, 124KB and 4KB, due to max transfer size(127KB). This degrades the device performance. Signed-off-by: Sathyavathi M <sathya.m@samsung.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-11 09:36:56 -07:00
Linus Torvalds	3419b45039	Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. Fow now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie	2015-11-10 17:23:49 -08:00
Linus Torvalds	ad804a0b2a	Merge branch 'akpm' (patches from Andrew) Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...	2015-11-07 14:32:45 -08:00
Jens Axboe	a0fa9647a5	NVMe: add blk polling support Add nvme_poll(), which will check a specific completion queue for command completions. Wire that up to the new block layer poll mechanism. Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>	2015-11-07 10:40:47 -07:00
Mel Gorman	71baba4b92	mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM __GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-11-06 17:50:42 -08:00
Linus Torvalds	9cf5c095b6	asm-generic cleanups The asm-generic changes for 4.4 are mostly a series from Christoph Hellwig to clean up various abuses of headers in there. The patch to rename the io-64-nonatomic-.h headers caused some conflicts with new users, so I added a workaround that we can remove in the next merge window. The only other patch is a warning fix from Marek Vasut -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIVAwUAVjzaf2CrR//JCVInAQImmhAA20fZ91sUlnA5skKNPT1phhF6Z7UF2Sx5 nPKcHQD3HA3lT1OKfPBYvCo+loYflvXFLaQThVylVcnE/8ecAEMtft4nnGW2nXvh sZqHIZ8fszTB53cynAZKTjdobD1wu33Rq7XRzg0ugn1mdxFkOzCHW/xDRvWRR5TL rdQjzzgvn2PNlqFfHlh6cZ5ykShM36AIKs3WGA0H0Y/aYsE9GmDOAUp41q1mLXnA 4lKQaIxoeOa+kmlsUB0wEHUecWWWJH4GAP+CtdKzTX9v12bGNhmiKUMCETG78BT3 uL8irSqaViNwSAS9tBxSpqvmVUsa5aCA5M3MYiO+fH9ifd7wbR65g/wq39D3Pc01 KnZ3BNVRW5XSA3c86pr8vbg/HOynUXK8TN0lzt6rEk8bjoPBNEDy5YWzy0t6reVe wX65F+ver8upjOKe9yl2Jsg+5Kcmy79GyYjLUY3TU2mZ+dIdScy/jIWatXe/OTKZ iB4Ctc4MDe9GDECmlPOWf98AXqsBUuKQiWKCN/OPxLtFOeWBvi4IzvFuO8QvnL9p jZcRDmIlIWAcDX/2wMnLjV+Hqi3EeReIrYznxTGnO7HHVInF555GP51vFaG5k+SN smJQAB0/sostmC1OCCqBKq5b6/li95/No7+0v0SUhJJ5o76AR5CcNsnolXesw1fu vTUkB/I66Hk= =dQKG -----END PGP SIGNATURE----- Merge tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic cleanups from Arnd Bergmann: "The asm-generic changes for 4.4 are mostly a series from Christoph Hellwig to clean up various abuses of headers in there. The patch to rename the io-64-nonatomic-.h headers caused some conflicts with new users, so I added a workaround that we can remove in the next merge window. The only other patch is a warning fix from Marek Vasut" * tag 'asm-generic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: asm-generic: temporarily add back asm-generic/io-64-nonatomic.h asm-generic: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations gpio-mxc: stop including <asm-generic/bug> n_tracesink: stop including <asm-generic/bug> n_tracerouter: stop including <asm-generic/bug> mlx5: stop including <asm-generic/kmap_types.h> hifn_795x: stop including <asm-generic/kmap_types.h> drbd: stop including <asm-generic/kmap_types.h> move count_zeroes.h out of asm-generic move io-64-nonatomic.h out of asm-generic	2015-11-06 14:22:15 -08:00
Linus Torvalds	ccf21b69a8	Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-block Pull block reservation support from Jens Axboe: "This adds support for persistent reservations, both at the core level, as well as for sd and NVMe" [ Background from the docs: "Persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup. All implementations are expected to ensure the reservations survive a power loss and cover all connections in a multi path environment" ] * 'for-4.4/reservations' of git://git.kernel.dk/linux-block: NVMe: Precedence error in nvme_pr_clear() nvme: add missing endianess annotations in nvme_pr_command NVMe: Add persistent reservation ops sd: implement the Persistent Reservation API block: add an API for Persistent Reservations block: cleanup blkdev_ioctl	2015-11-04 21:01:27 -08:00
Linus Torvalds	527d1529e3	Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk	2015-11-04 20:51:48 -08:00
Linus Torvalds	effa04cc5a	Merge branch 'for-4.4/lightnvm' of git://git.kernel.dk/linux-block Pull lightnvm support from Jens Axboe: "This adds support for lightnvm, and adds support to NVMe as well. This is pretty exciting, in that it enables new and interesting use cases for compatible flash devices. There's a LWN writeup about an earlier posting here: https://lwn.net/Articles/641247/ This has been underway for a while, and should be ready for merging at this point" * 'for-4.4/lightnvm' of git://git.kernel.dk/linux-block: nvme: lightnvm: clean up a data type lightnvm: refactor phys addrs type to u64 nvme: LightNVM support rrpc: Round-robin sector target with cost-based gc gennvm: Generic NVM manager lightnvm: Support for Open-Channel SSDs	2015-11-04 20:46:08 -08:00
Linus Torvalds	a9aa31cdc2	Merge branch 'for-4.4/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "Here are the block driver changes for 4.4. This pull request contains: - NVMe: - Refactor and moving of code to prepare for proper target support. From Christoph and Jay. - 32-bit nvme warning fix from Arnd. - Error initialization fix from me. - Proper namespace removal and reference counting support from Keith. - Device resume fix on IO failure, also from Keith. - Dependency fix from Keith, now that nvme isn't under the umbrella of the block anymore. - Target location and maintainers update from Jay. - From Ming Lei, the long awaited DIO/AIO support for loop. - Enable BD-RE writeable opens, from Georgios" * 'for-4.4/drivers' of git://git.kernel.dk/linux-block: (24 commits) Update target repo for nvme patch contributions NVMe: initialize error to '0' nvme: use an integer value to Linux errno values nvme: fix 32-bit build warning NVMe: Add explicit block config dependency nvme: include <linux/types.ĥ> in <linux/nvme.h> nvme: move to a new drivers/nvme/host directory nvme.h: add missing nvme_id_ctrl endianess annotations nvme: move hardware structures out of the uapi version of nvme.h nvme: add a local nvme.h header nvme: properly handle partially initialized queues in nvme_create_io_queues nvme: merge nvme_dev_start, nvme_dev_resume and nvme_async_probe nvme: factor reset code into a common helper nvme: merge nvme_dev_reset into nvme_reset_failed_dev nvme: delete dev from dev_list in nvme_reset NVMe: Simplify device resume on io queue failure NVMe: Namespace removal simplifications NVMe: Reference count open namespaces cdrom: Random writing support for BD-RE media block: loop: support DIO & AIO ...	2015-11-04 20:37:27 -08:00
Dan Carpenter	5f436e5ef1	nvme: lightnvm: clean up a data type "nlb_pr_rq" can't be more than u32 because "len" is a u32. Later we truncate it to u32 anyway when we calculate min_t(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-03 16:01:38 -07:00
Dan Carpenter	73fcf4e20e	NVMe: Precedence error in nvme_pr_clear() The original code is equivalent to: u32 cdw10 = (1 \| key) ? 1 << 3 : 0; But we want: u32 cdw10 = 1 \| (key ? 1 << 3 : 0); Fixes: 1d277a637a71: ('NVMe: Add persistent reservation ops') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-11-03 12:59:03 -07:00
Matias Bjørling	ca0640850e	nvme: LightNVM support The first generation of Open-Channel SSDs is based on NVMe. The NVMe driver is extended with support for the LightNVM command set. Detection is made through PCI IDs. Current supported devices are the qemu nvme simulator and CNEX Labs Westlake SSD. The qemu nvme enables support through vendor specific bits in the namespace identification and the CNEX Labs Westlake SSD implements a LightNVM compatible firmware and is detected using the same method as qemu. After detection, vendor specific codes are used to identify the device and enumerate supported features. Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Javier González <jg@lightnvm.io> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-29 17:57:29 +09:00
Christoph Hellwig	a6dd1020d8	nvme: add missing endianess annotations in nvme_pr_command Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Fixes: ad4fd3610c27 ("NVMe: Add persistent reservation ops") Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-22 09:15:48 -06:00
Keith Busch	1d277a637a	NVMe: Add persistent reservation ops Signed-off-by: Keith Busch <keith.busch@intel.com> [hch: rebased, set PTPL=1] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-22 09:15:48 -06:00
Dan Williams	4125a09b0a	block, libnvdimm, nvme: provide a built-in blk_integrity nop profile The libnvidmm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> [hch: kill the local nvme nop profile as well] Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:45 -06:00
Dan Williams	ac6fc48c9f	block: move blk_integrity to request_queue A trace like the following proceeds a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:42 -06:00
Dan Williams	4cfc766e07	nvme: suspend i/o during runtime blk_integrity_unregister Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Cc: Matthew Wilcox <willy@linux.intel.com> Acked-by: Keith Busch <keith.busch@intel.com> [keith: also protect dynamic integrity registration] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:39 -06:00
Dan Williams	9609b9942b	md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:43:37 -06:00
Martin K. Petersen	25520d55cd	block: Inline blk_integrity in struct gendisk Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:42 -06:00
Martin K. Petersen	0f8087ecde	block: Consolidate static integrity profile properties We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-21 14:42:38 -06:00
Jens Axboe	ef658fc2a6	NVMe: initialize error to '0' Reported-by: Keith Busch <keith.busch@intel.com> Fixes: `1951feae88` ("nvme: use an integer value to Linux errno values") Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-15 09:49:57 -06:00
Christoph Hellwig	1951feae88	nvme: use an integer value to Linux errno values Use a separate integer variable to hold the signed Linux errno values we pass back to the block layer. Note that for pass through commands those might still be NVMe values, but those fit into the int as well. Fixes: f4829a9b7a61: ("blk-mq: fix racy updates of rq->errors") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-15 09:04:36 -06:00
Arnd Bergmann	3d42e67fe5	nvme: fix 32-bit build warning Compiling the nvme driver on 32-bit warns about a cast from a __u64 variable to a pointer: drivers/block/nvme-core.c: In function 'nvme_submit_io': drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] (void __user *)io.addr, length, NULL, 0); The cast here is intentional and safe, so we can shut up the gcc warning by adding an intermediate cast to 'uintptr_t'. I had previously submitted a patch to fix this problem in the nvme driver, but it was accepted on the same day that two new warnings got added. For clarification, I also change the third instance of this cast to use uintptr_t instead of unsigned long now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: `d29ec8241c` ("nvme: submit internal commands through the block layer") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-12 13:09:15 -06:00
Keith Busch	11feb18f4e	NVMe: Add explicit block config dependency The nvme driver was moved from drivers/block, losing our implicit dependency on CONFIG_BLOCK. This makes it an explicit driver dependency. Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-12 11:43:22 -06:00
Jay Sternberg	57dacad5f2	nvme: move to a new drivers/nvme/host directory This patch moves the NVMe driver from drivers/block/ to its own new drivers/nvme/host/ directory. This is in preparation of splitting the current monolithic driver up and add support for the upcoming NVMe over Fabrics standard. The drivers/nvme/host/ is chose to leave space for a NVMe target implementation in addition to this host side driver. Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com> [hch: rebased, renamed core.c to pci.c, slight tweaks] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2015-10-09 10:40:37 -06:00

... 27 28 29 30 31 ...

1629 Commits