The six callers of nvmet_find_namespace() duplicate the error log page
update and status setting code for each call on failure.
All callers are nvmet requests based functions, so we can pass req
to the nvmet_find_namesapce() & derive ctrl from req, that'll allow us
to update the error log page in nvmet_find_namespace(). Now that we
pass the request we can also get rid of the local variable in
nvmet_find_namespace() and use the req->ns and return the error code.
Replace the ctrl parameter with nvmet_req for nvmet_find_namespace(),
centralize the error log page update for non allocated namesapces, and
return uniform error for non-allocated namespace.
The nvmet_find_namespace() takes nsid parameter which is from NVMe
commands structures such as get_log_page, identify, rw and common. All
these commands have same offset for the nsid field.
Derive nsid from req->cmd->common.nsid) & remove the extra parameter
from the nvmet_find_namespace().
Lastly now we associate the ns to the req parameter that we pass to the
nvmet_find_namespace(), rename nvmet_find_namespace() to
nvmet_req_find_ns().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
For nvmet_find_namespace() error case we have inconsistent error code
mapping in the function nvmet_get_smart_log_nsid() and
nvmet_set_feat_write_protect().
There is no point in retrying for the invalid namesapce from the host
side. Set the error code to the NVME_SC_INVALID_NS | NVME_SC_DNR which
matches what we have in nvmet_execute_identify_desclist().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
For unallocated namespace in nvmet_execute_identify_ns() don't set the
status to NVME_SC_INVALID_NS, set it to zero.
Fixes: bffcd50778 ("nvmet: set right status on error in id-ns handler")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Make sparse happy after the recent conversion to RCU lookups.
Fixes: 4e2f02bf77 ("nvmet-fc: use RCU proctection for assoc_list")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
When we accept a TCP connection and allocate an nvmet-tcp queue we should
make sure not to fully establish it or reference it as the connection may
be already closing, which triggers queue release work, which does not
fence against queue establishment.
In order to address such a race, we make sure to check the sk_state and
contain the queue reference to be done underneath the sk_callback_lock
such that the queue release work correctly fences against it.
Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Reported-by: Elad Grupi <elad.grupi@dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When a host sends multiple h2cdata PDUs for a single command, we
should verify the data digest calculation per PDU and not
per command.
Fixes: 872d26a391 ("nvmet-tcp: add NVMe over TCP target driver")
Reported-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>
Tested-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In this preparation patch, we add helpers to convert lbas to sectors &
sectors to lba. This is needed to eliminate code duplication in the ZBD
backend.
Use these helpers in the block device backend.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We remove the extra local variable struct nvmet_ns in
nvmet_execute_identify_ns() since req already has ns member that can be
reused, this also eliminates the explicit call to nvmet_put_namespace()
which is already present in the request completion path.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We remove the extra local variable struct nvmet_ns in
nvmet_execute_identify_desclist() since req already has ns member that
can be reused, this also eliminates the explicit call to
nvmet_put_namespace() which is already present in the request
completion path.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
We remove the extra local variable struct nvmet_ns in
nvmet_get_smart_log_nsid() since req already has ns member that can be
reused, this also eliminates the explicit call to nvmet_put_namespace()
which is already present in the request completion path.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The only usage of these is to put their addresses in arrays of pointers
to const attribute_groups. Make them const to allow the compiler to put
them in read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
searching assoc_list protected by rcu_read_lock if list not changed inline.
and according to the rcu list rules.
queue array embedded into nvmet_fc_tgt_assoc protected by rcu_read_lock
according to rcu dereference/assign rules.
queue and assoc object freed after grace period by call_rcu.
tgtport lock taken for changing assoc_list.
Reviewed-by: Eldad Zinger <Eldad.Zinger@dell.com>
Reviewed-by: Elad Grupi <Elad.Grupi@dell.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Leonid Ravich <Leonid.Ravich@emc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove extra tab.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove code duplication.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The function nvmet_execute_identify_ns() doesn't set the status if call
to nvmet_find_namespace() fails. In that case we set the status of the
request to the value return by the nvmet_copy_sgl().
Set the status to NVME_SC_INVALID_NS and adjust the code such that
request will have the right status on nvmet_find_namespace() failure.
Without this patch :-
NVME Identify Namespace 3:
nsze : 0
ncap : 0
nuse : 0
nsfeat : 0
nlbaf : 0
flbas : 0
mc : 0
dpc : 0
dps : 0
nmic : 0
rescap : 0
fpi : 0
dlfeat : 0
nawun : 0
nawupf : 0
nacwu : 0
nabsn : 0
nabo : 0
nabspf : 0
noiob : 0
nvmcap : 0
mssrl : 0
mcl : 0
msrc : 0
nsattr : 0
nvmsetid: 0
anagrpid: 0
endgid : 0
nguid : 00000000000000000000000000000000
eui64 : 0000000000000000
lbaf 0 : ms:0 lbads:0 rp:0 (in use)
With this patch-series :-
feb3b88b501e (HEAD -> nvme-5.11) nvmet: remove extra variable in identify ns
6302aa67210a nvmet: remove extra variable in id-desclist
ed57951da453 nvmet: remove extra variable in smart log nsid
be384b8c24dc nvmet: set right status on error in id-ns handler
NVMe status: INVALID_NS: The namespace or the format of that namespace is invalid(0xb)
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When setting port traddr to INADDR_ANY, the listening cm_id->device
is NULL. The associate IB device is known only when a connect request
event arrives, so checking T10-PI device capability should be done
at this stage.
Fixes: b09160c399 ("nvmet-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When a queue is in NVMET_RDMA_Q_CONNECTING state, it may has some
requests at rsp_wait_list. In case a disconnect occurs at this
state, no one will empty this list and will return the requests to
free_rsps list. Normally nvmet_rdma_queue_established() free those
requests after moving the queue to NVMET_RDMA_Q_LIVE state, but in
this case __nvmet_rdma_queue_disconnect() is called before. The
crash happens at nvmet_rdma_free_rsps() when calling
list_del(&rsp->free_list), because the request exists only at
the wait list. To fix the issue, simply clear rsp_wait_list when
destroying the queue.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Kernel robot had the following warnings:
>> fcloop.c:1506:6: warning: %x in format string (no. 1) requires
>> 'unsigned int *' but the argument type is 'signed int *'.
>> [invalidScanfArgType_int]
>> if (sscanf(buf, "%x:%d:%d", &opcode, &starting, &amount) != 3)
>> ^
Resolve by changing opcode from and int to an unsigned int
and
>> fcloop.c:1632:32: warning: Uninitialized variable: lport [uninitvar]
>> ret = __wait_localport_unreg(lport);
>> ^
>> fcloop.c:1615:28: warning: Uninitialized variable: nport [uninitvar]
>> ret = __remoteport_unreg(nport, rport);
>> ^
These aren't actual issues as the values are assigned prior to use.
It appears the tool doesn't understand list_first_entry_or_null().
Regardless, quiet the tool by initializing the pointers to NULL at
declaration.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A smaller set of patches, nothing stands out as being particularly major
this cycle:
- Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw, cxgb4,
mlx4 and mlx5
- Bug fixes and polishing for the new rts ULP
- Cleanup of uverbs checking for allowed driver operations
- Use sysfs_emit all over the place
- Lots of bug fixes and clarity improvements for hns
- hip09 support for hns
- NDR and 50/100Gb signaling rates
- Remove dma_virt_ops and go back to using the IB DMA wrappers
- mlx5 optimizations for contiguous DMA regions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl/aNXUACgkQOG33FX4g
mxqlMQ/+O6UhxKnDAnMB+HzDGvOm+KXNHOQBuzxz4ZWXqtUrW8WU5ca3PhXovc4z
/QX0HhMhQmVsva5mjp1OGVATxQ2E+yasqFLg4QXAFWFR3N7s0u/sikE9i1DoPvOC
lsmLTeRauCFaE4mJD5nvYwm+riECX0GmyVVW7v6V05xwAp0hwdhyU7Kb6Yh3lxsE
umTz+onPNJcD6Tc4snziyC5QEp5ebEjAaj4dVI1YPR5X0c2RwC5E1CIDI6u4OQ2k
j7/+Kvo8LNdYNERGiR169x6c1L7WS6dYnGMMeXRgyy0BVbVdRGDnvCV9VRmF66w5
99fHfDjNMNmqbGNt/4/gwNdVrR9aI4jMZWCh7SmsguX6XwNOlhYldy3x3WnlkfkQ
e4O0huJceJqcB2Uya70GqufnAetRXsbjzcvWxpR5YAwRmcRkm1f6aGK3BxPjWEbr
BbYRpiKMxxT4yTe65BuuThzx6g4pNQHe0z3BM/dzMJQAX+PZcs1CPQR8F8PbCrZR
Ad7qw4HJ587PoSxPi3toVMpYZRP6cISh1zx9q/JCj8cxH9Ri4MovUCS3cF63Ny3B
1LJ2q0x8FuLLjgZJogKUyEkS8OO6q7NL8WumjvrYWWx19+jcYsV81jTRGSkH3bfY
F7Esv5K2T1F2gVsCe1ZFFplQg6ja1afIcc+LEl8cMJSyTdoSub4=
=9t8b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A smaller set of patches, nothing stands out as being particularly
major this cycle. The biggest item would be the new HIP09 HW support
from HNS, otherwise it was pretty quiet for new work here:
- Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw,
cxgb4, mlx4 and mlx5
- Bug fixes and polishing for the new rts ULP
- Cleanup of uverbs checking for allowed driver operations
- Use sysfs_emit all over the place
- Lots of bug fixes and clarity improvements for hns
- hip09 support for hns
- NDR and 50/100Gb signaling rates
- Remove dma_virt_ops and go back to using the IB DMA wrappers
- mlx5 optimizations for contiguous DMA regions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (147 commits)
RDMA/cma: Don't overwrite sgid_attr after device is released
RDMA/mlx5: Fix MR cache memory leak
RDMA/rxe: Use acquire/release for memory ordering
RDMA/hns: Simplify AEQE process for different types of queue
RDMA/hns: Fix inaccurate prints
RDMA/hns: Fix incorrect symbol types
RDMA/hns: Clear redundant variable initialization
RDMA/hns: Fix coding style issues
RDMA/hns: Remove unnecessary access right set during INIT2INIT
RDMA/hns: WARN_ON if get a reserved sl from users
RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
RDMA/hns: Do shift on traffic class when using RoCEv2
RDMA/hns: Normalization the judgment of some features
RDMA/hns: Limit the length of data copied between kernel and userspace
RDMA/mlx4: Remove bogus dev_base_lock usage
RDMA/uverbs: Fix incorrect variable type
RDMA/core: Do not indicate device ready when device enablement fails
RDMA/core: Clean up cq pool mechanism
RDMA/core: Update kernel documentation for ib_create_named_qp()
MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
URPeaBaQJw==
=ayp2
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Nothing major in here:
- NVMe pull request from Christoph:
- nvmet passthrough improvements (Chaitanya Kulkarni)
- fcloop error injection support (James Smart)
- read-only support for zoned namespaces without Zone Append
(Javier González)
- improve some error message (Minwoo Im)
- reject I/O to offline fabrics namespaces (Victor Gladkov)
- PCI queue allocation cleanups (Niklas Schnelle)
- remove an unused allocation in nvmet (Amit Engel)
- a Kconfig spelling fix (Colin Ian King)
- nvme_req_qid simplication (Baolin Wang)
- MD pull request from Song:
- Fix race condition in md_ioctl() (Dae R. Jeong)
- Initialize read_slot properly for raid10 (Kevin Vigor)
- Code cleanup (Pankaj Gupta)
- md-cluster resync/reshape fix (Zhao Heming)
- Move null_blk into its own directory (Damien Le Moal)
- null_blk zone and discard improvements (Damien Le Moal)
- bcache race fix (Dongsheng Yang)
- Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
Lutz Pogrell, Md Haris Iqbal)
- lightnvm NULL pointer deref fix (tangzhenhao)
- sr in_interrupt() removal (Sebastian Andrzej Siewior)
- FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
as it made it easier for them to funnel the feature through the
block driver tree.
- Follow up fixes (Colin Ian King)"
* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
block: drop dead assignments in loop_init()
sr: Remove in_interrupt() usage in sr_init_command().
sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
cdrom: Reset sector_size back it is not 2048.
drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
null_blk: Move driver into its own directory
null_blk: Allow controlling max_hw_sectors limit
null_blk: discard zones on reset
null_blk: cleanup discard handling
null_blk: Improve implicit zone close
null_blk: improve zone locking
block: Align max_hw_sectors to logical blocksize
null_blk: Fail zone append to conventional zones
null_blk: Fix zone size initialization
bcache: fix race between setting bdev state to none and new write request direct to backing
block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
block/rnbd: call kobject_put in the failure path
Documentation/ABI/rnbd-srv: add document for force_close
block/rnbd-srv: close a mapped device from server side.
...
Set nvme-loop's lock class via blk_mq_hctx_set_fq_lock_class for avoiding
lockdep possible recursive locking, then we can remove the dynamically
allocated lock class for each flush queue, finally we can avoid horrible
SCSI probe delay.
This way may not address situation in which one nvme-loop is backed on
another nvme-loop. However, in reality, people seldom uses this way
for test. Even though someone played in this way, it is just one
recursive locking false positive, no real deadlock issue.
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a spelling mistake in the Kconfig help text. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Generation counter is protected by nvmet_config_sem. Make sure the
callers that call functions that might change it, are calling it
properly.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
remove unused cqs from nvmet_ctrl struct
this will reduce the allocated memory.
Signed-off-by: Amit <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In nvmet_passthru_execute_cmd() which is a high frequency function
it uses bio_alloc() which leads to memory allocation from the fs pool
for each I/O.
For NVMeoF nvmet_req we already have inline_bvec allocated as a part of
request allocation that can be used with preallocated bio when we
already know the size of request before bio allocation with bio_alloc(),
which we already do.
Introduce a bio member for the nvmet_req passthru anon union. In the
fast path, check if we can get away with inline bvec and bio from
nvmet_req with bio_init() call before actually allocating from the
bio_alloc().
This will be useful to avoid any new memory allocation under high
memory pressure situation and get rid of any extra work of
allocation (bio_alloc()) vs initialization (bio_init()) when
transfer len is < NVMET_MAX_INLINE_DATA_LEN that user can configure at
compile time.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The function blk_rq_append_bio() is a genereric API written for all
types driver (having bounce buffers) and different context (where
request is already having a bio i.e. rq->bio != NULL).
It does mainly three things: calculating the segments, bounce queue and
if req->bio == NULL call blk_rq_bio_prep() or handle low level merge()
case.
The NVMe PCIe and fabrics transports currently does not use queue
bounce mechanism. In order to find this for each request processing
in the passthru blk_rq_append_bio() does extra work in the fast path
for each request.
When I ran I/Os with different block sizes on the passthru controller
I found that we can reuse the req->sg_cnt instead of iterating over the
bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
blk_rq_append_bio() is a duplication of work given that we have the
value in req->sg_cnt. (correct me here if I'm wrong).
With NVMe passthru request based driver we allocate fresh request each
time, so every call to blk_rq_append_bio() rq->bio will be NULL i.e.
we don't really need the second condition in the blk_rq_append_bio()
and the resulting error condition in the caller of blk_rq_append_bio().
So for NVMeOF passthru driver recalculating the segments, bounce check
and ll_back_merge code is not needed such that we can get away with the
minimal version of the blk_rq_append_bio() which removes the error check
in the fast path along with extra variable in nvmet_passthru_map_sg().
This patch updates the nvmet_passthru_map_sg() such that it does only
appending the bio to the request in the context of the NVMeOF Passthru
driver. Following are perf numbers :-
With current implementation (blk_rq_append_bio()) :-
----------------------------------------------------
+ 5.80% 0.02% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.44% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.88% 0.00% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.44% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.86% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.17% 0.00% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
With this patch using blk_rq_bio_prep() :-
----------------------------------------------------
+ 3.14% 0.02% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 3.26% 0.01% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.37% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 5.18% 0.02% kworker/0:2-eve [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.84% 0.02% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
+ 4.87% 0.01% kworker/0:2-mm_ [nvmet] [k] nvmet_passthru_execute_cmd
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
For passthru commands setting op_flags has no meaning. Remove the code
that sets the op flags in nvmet_passthru_map_sg().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Right now nvme_alloc_request() allocates a request from block layer
based on the value of the qid. When qid set to NVME_QID_ANY it used
blk_mq_alloc_request() else blk_mq_alloc_request_hctx().
The function nvme_alloc_request() is called from different context, The
only place where it uses non NVME_QID_ANY value is for fabrics connect
commands :-
nvme_submit_sync_cmd() NVME_QID_ANY
nvme_features() NVME_QID_ANY
nvme_sec_submit() NVME_QID_ANY
nvmf_reg_read32() NVME_QID_ANY
nvmf_reg_read64() NVME_QID_ANY
nvmf_reg_write32() NVME_QID_ANY
nvmf_connect_admin_queue() NVME_QID_ANY
nvme_submit_user_cmd() NVME_QID_ANY
nvme_alloc_request()
nvme_keep_alive() NVME_QID_ANY
nvme_alloc_request()
nvme_timeout() NVME_QID_ANY
nvme_alloc_request()
nvme_delete_queue() NVME_QID_ANY
nvme_alloc_request()
nvmet_passthru_execute_cmd() NVME_QID_ANY
nvme_alloc_request()
nvmf_connect_io_queue() QID
__nvme_submit_sync_cmd()
nvme_alloc_request()
With passthru nvme_alloc_request() now falls into the I/O fast path such
that blk_mq_alloc_request_hctx() is never gets called and that adds
additional branch check in fast path.
Split the nvme_alloc_request() into nvme_alloc_request() and
nvme_alloc_request_qid().
Replace each call of the nvme_alloc_request() with NVME_QID_ANY param
with a call to newly added nvme_alloc_request() without NVME_QID_ANY.
Replace a call to nvme_alloc_request() with QID param with a call to
newly added nvme_alloc_request() and nvme_alloc_request_qid()
based on the qid value set in the __nvme_submit_sync_cmd().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
NVMeOF controller in the passsthru mode is capable of handling wide set
of I/O commands including vender specific passhtru io comands.
The vendor specific I/O commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:io_timeout).
Add a configfs attribute so that user can set the io timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the NVME_IO_TIMEOUT value when request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
NVMeOF controller in the passsthru mode is capable of handling wide set
of admin commands including vender specific passhtru admin comands.
The vendor specific admin commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:admin_timeout).
Add a configfs attribute so that user can set the admin timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the ADMIN_TIMEOUT value when request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add sysfs attribute to specify parameters for dropping a command. The
attribute takes a string of:
<opcode>:<starting a what instance>:<number of times>
Opcode is formatted as lower 8 bits are opcode. If a fabrics opcode, a
bit above bits 7:0 will be set.
Once set, each sqe is looked at. If the opcode matches the running
instance count is updated. If the instance count is in the range of where
to drop (based on starting and # of times), then drop the command by not
passing it to the target layer.
Signed-off-by: James Smart <james.smart@broadcom.com>
Use the ib_dma_* helpers to skip the DMA translation instead. This
removes the last user if dma_virt_ops and keeps the weird layering
violation inside the RDMA core instead of burderning the DMA mapping
subsystems with it. This also means the software RDMA drivers now don't
have to mess with DMA parameters that are not relevant to them at all, and
that in the future we can use PCI P2P transfers even for software RDMA, as
there is no first fake layer of DMA mapping that the P2P DMA support.
Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
By default, we set the passthru request allocation flag such that it
returns the error in the following code path and we fail the I/O when
BLK_MQ_REQ_NOWAIT is used for request allocation :-
nvme_alloc_request()
blk_mq_alloc_request()
blk_mq_queue_enter()
if (flag & BLK_MQ_REQ_NOWAIT)
return -EBUSY; <-- return if busy.
On some controllers using BLK_MQ_REQ_NOWAIT ends up in I/O error where
the controller is perfectly healthy and not in a degraded state.
Block layer request allocation does allow us to wait instead of
immediately returning the error when we BLK_MQ_REQ_NOWAIT flag is not
used. This has shown to fix the I/O error problem reported under
heavy random write workload.
Remove the BLK_MQ_REQ_NOWAIT parameter for passthru request allocation
which resolves this issue.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Clean up some confusing elements of nvmet_passthru_map_sg() by returning
early if the request is greater than the maximum bio size. This allows
us to drop the sg_cnt variable.
This should not result in any functional change but makes the code
clearer and more understandable. The original code allocated a truncated
bio then would return EINVAL when bio_add_pc_page() filled that bio. The
new code just returns EINVAL early if this would happen.
Fixes: c1fef73f79 ("nvmet: add passthru code to process commands")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Suggested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet_passthru_map_sg() only supports mapping a single BIO, not a chain
so the effective maximum transfer should also be limitted by
BIO_MAX_PAGES (presently this works out to 1MB).
For PCI passthru devices the max_sectors would typically be more
limitting than BIO_MAX_PAGES, but this may not be true for all passthru
devices.
Fixes: c1fef73f79 ("nvmet: add passthru code to process commands")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
When connecting a controller with a zero kato value using the following
command line
nvme connect -t tcp -n NQN -a ADDR -s PORT --keep-alive-tmo=0
the warning below can be reproduced:
WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627 __queue_delayed_work+0x6d/0x90
with trace:
mod_delayed_work_on+0x59/0x90
nvmet_update_cc+0xee/0x100 [nvmet]
nvmet_execute_prop_set+0x72/0x80 [nvmet]
nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp]
nvmet_tcp_io_work+0x63f/0xb2d [nvmet_tcp]
...
This is caused by queuing up an uninitialized work. Althrough the
keep-alive timer is disabled during allocating the controller (fixed in
0d3b6a8d21), ka_work still has a chance to run (called by
nvmet_start_ctrl).
Fixes: 0d3b6a8d21 ("nvmet: Disable keep-alive timer when kato is cleared to 0h")
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The function nvme_init_ctrl() gets the ctrl reference & when it fails it
does put the ctrl reference in the error unwind code.
When creating loop ctrl in nvme_loop_create_ctrl() if nvme_init_ctrl()
returns non zero (i.e. error) value it jumps to the "out_put_ctrl" label
which calls nvme_put_ctrl(), that will lead to douple ctrl put in error
unwind path.
Update nvme_loop_create_ctrl() such that this patch removes the
"out_put_ctrl" label, add a new "out" label after nvme_put_ctrl() in
error unwind path and jump to newly added label when nvme_init_ctrl()
call retuns an error.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A hostport port pointer is allowed to be NULL as it is not allocated if
the lldd does not support the new interfaces for NVME LS request support.
The hostport free routine validates the handle but forgot to validate the
hostport pointer.
Validate the hostport pointer before using it to validate the handle.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In the default passthru implementation NVMeOF target passthru ctrl is
not capable of handling Zoned Namespaces (ZNS).
Update the nvmet_parse_pasthru_admin_cmd() to allow
NVME_ID_CNS_CS_CTRL/NVME_CIS_ZNS and NVME_ID_CNS_CS_NS/NVME_CIS_ZNS.
With this addition NVMeOF Passthru allows Zoned Namespaces.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
A user may modify the kato by a set features cmd. To properly deal
with races or a kato value of 0 (no keep alive enabled) change
nvmet_set_feat_kato to first disable the timer, then set the value
and then re-enable the timer.
Signed-off-by: Amit Engel <amit.engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
No real good need to spread queues artificially. Usually the
target will serve multiple hosts, and it's better to run on the socket
incoming cpu for better affinitization rather than spread queues on all
online cpus.
We rely on RSS to spread the work around sufficiently.
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Lift opening the file open/close code from nvme_ctrl_get_by_path into
the caller, just keeping a simple nvme_ctrl_from_file() helper.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[hch: refactored a bit, split the bug fixes into a separate prep patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9qLq8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqrCD/4hjvDuZIDMZILnOhHtZPB+q1bnf48dcB2v
WQlDJAZVDKIdBy49gN2mraLwbeKvnzTv25kfZkXZPL4F32TuXK8E57tHPCYEq5li
qb6y3o0Y1y0p78PtYittfWeUWaYkT2v91QQjLc6Vyh8swL15XDQ37lBw7qqtNUhw
WMTS1Q2bw0wjltRAC5XSTD73PwcjFMYhE/1YBWE4vckaB1K+kLN6RRaVZ03unC/U
zhSd2WqZyHwaEsl66Vtl7ty+SMoqahfXNvBcvvLVY6mD2U0hLWCBlnWY5SjYmzLZ
3lVqmiL+diaAoDRroQnpFkAJSRnnWv3g3gWbygbSKJScFGKh15k7Px4ztWr233Xb
KZtoZN826PhSkujIB8wKaFrrx3Zz00a9flqvum2ejOQAP5FiS2QRLlaZuf2U2xqm
5AhZ7ul1qm9kik0ZULPBY1myK1Y8sSoKSnu3WAVUPo974fAPTvhWUpO+i6SssYWu
oI2VUK9BvcgP1MjMms1EYEpY8rg8G+TGzN+P6jJcBirAAecdpXhLaLwOdEQC9De/
P/OfyHg6lIgs9PP0NOp8BeUDQe9bDW8gNxv57R57+N7ZIKP0443LwSaHpmFInBC+
lxAGcghsl++TZZ+sCDM8Lkw5IZWcc/czewHzVFVMjpnivt0kuQrndFyOxRdHjggy
4Vo1Sa1Ahg==
=zT03
-----END PGP SIGNATURE-----
Merge tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few NVMe fixes, and a dasd write zero fix"
* tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
nvmet: get transport reference for passthru ctrl
nvme-core: get/put ctrl and transport module in nvme_dev_open/release()
nvme-tcp: fix kconfig dependency warning when !CRYPTO
nvme-pci: disable the write zeros command for Intel 600P/P3100
s390/dasd: Fix zero write for FBA devices
Grab a reference to the transport driver to ensure it can't be unloaded
while a passthrough controller is active.
Fixes: c1fef73f79 ("nvmet: add passthru code to process commands")
Reported-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9SWMMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphIcD/488Q7rXb2eABp1fGs4gu+VFOCLogeHL8xh
5xHNiOPnZG2SGr8DQJY/7EX2kE65rbZi8/g+2N6anovI2nduRu0tzSra7fRgzbys
ZQC1CUel0MbCd7e8OaEfg108PSHNxBf1PqDcE7zCeyZ0DIs3s4vK/bQtmzzxZHgU
wNw4OIP9gOdqgjowb6GGHo9SLN4GT8rZ0jZVPLa7GwFsvxCTwv/7lHO8rqeSeuCu
5H6i3M/rSbtTXPLHf4Fy97x9WmBmdgu4epTXiwbOxaagpx3lm/7n1P3CpavR+Gcq
O5VGIIzazxPwnZl9y/6rZFLGYqcj38RxUvC8KtK6tDXxEu/BDJa1d6hXI03SyXAO
ZAiEpQTKOkJE3R8ewUDrXLvl3p6FvwZVZ5SIFwUb+0JFrVQYwrgfoRJtzb5SIUan
T9/bSYge7lFRI92FZRIqhvk8rsEBRdu7N/rQCyGf6GuZ0vRXWRAqN7T02iDn3czX
pdGAepU5ymw8CwyUiNNnkY0DUaQLBIO9tCA9epxLwdroQ95vJtMPRBX1STQ65GVk
XvMFAJqDAehQ/nP5xO60cWGZHyL7L/ccpofZlA/ytgAIZRa85GvhrdVy7yc6DKto
wu6h2tkX9+ldoUjVbn/60T+Ft3QUTlfAuDfherkNoFNB/G5i1pzOHbwvL7B3czr3
ZMjoNiOIqA==
=8fvz
-----END PGP SIGNATURE-----
Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A bit larger than usual this week, mostly due to the NVMe fixes
arriving late for -rc3 and hence didn't make last weeks pull request.
- NVMe:
- instance leak and io boundary fixes from Keith
- fc locking fix from Christophe
- various tcp/rdma reset during traffic fixes from Sagi
- pci use-after-free fix from Tong
- tcp target null deref fix from Ziye
- Locking fix for partition removal (Christoph)
- Ensure bdi->io_pages is always set (me)
- Fixup for hd struct reference (Ming)
- Fix for zero length bvecs (Ming)
- Two small blk-iocost fixes (Tejun)"
* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
block: allow for_each_bvec to support zero len bvec
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
in this function.
Use 'spin_lock_irqsave()' also here, as there is no guarantee that
interruptions are disabled at that point, according to surrounding code.
Fixes: a97ec51b37 ("nvmet_fc: Rework target side abort handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>