Replace (some) calls to pr_xxx() in rxe_net.c with rxe_dbg_xxx().
Calls with a rxe device not yet in scope are left as is.
Link: https://lore.kernel.org/r/20221103171013.20659-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The function check_length() is supposed to check the length of inbound
packets on responder, but it actually has been a stub since the driver was
born. Let it check the payload length and the DMA length.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20221107055338.357184-1-matsuda-daisuke@fujitsu.com
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The callbacks in struct class namespace() and get_ownership() do not
modify the struct device passed to them, so mark the pointer as constant
and fix up all callbacks in the kernel to have the correct function
signature.
This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221001165426.2690912-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Correctly set send queue element opcode during immediate work request
flushing in post sendqueue operation, if the QP is in ERROR state.
An undefined ocode value results in out-of-bounds access to an array
for mapping the opcode between siw internal and RDMA core representation
in work completion generation. It resulted in a KASAN BUG report
of type 'global-out-of-bounds' during NFSoRDMA testing.
This patch further fixes a potential case of a malicious user which may
write undefined values for completion queue elements status or opcode,
if the CQ is memory mapped to user land. It avoids the same out-of-bounds
access to arrays for status and opcode mapping as described above.
Fixes: 303ae1cdfd ("rdma/siw: application interface")
Fixes: b0fff7317b ("rdma/siw: completion queue methods")
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20221107145057.895747-1-bmt@zurich.ibm.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add atomic operations support in post_send and poll_cq implementation.
Also, rename 'laddr' and 'lkey' in struct erdma_sge to 'addr' and 'key',
because this structure is used for both local and remote SGEs.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107021845.44598-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce "capacity flags" field at where hardware put all zeros originally
in "query device" response. Using this field, hardware can report atomic
feature if supports.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107021845.44598-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
To support atomic operations, IB_ACCESS_REMOTE_ATOMIC right should be
passed to hardware for permission check. Since "access mode" field in FRMR
SQE and RegMr command is never used by hw, we remove the "access mode"
field, so that we can then have enough space to extend access fields.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20221107021845.44598-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The active link speed is currently hard-coded in irdma_query_port due
to which the port rate in ibstatus does reflect the active link speed.
Call ib_get_eth_speed in irdma_query_port to get the active link speed.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Reported-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20221104234957.1135-1-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The mlx5 driver dumps the entire CQE buffer by default for few syndromes.
Some syndromes are expected due to the application behavior [ex:
MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR, MLX5_CQE_SYNDROME_REMOTE_OP_ERR and
MLX5_CQE_SYNDROME_LOCAL_PROT_ERR]. Hence, for these syndromes, the patch
converts the log level from KERN_WARNING to KERN_DEBUG. This enables the
application to get the CQE buffer dump by changing to KERN_DEBUG level
as and when needed.
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Arumugam Kolappan <aru.kolappan@oracle.com>
Link: https://lore.kernel.org/r/1667287664-19377-1-git-send-email-aru.kolappan@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Instead of 'goto and return', just return directly to
simplify the error handling, and avoid some unnecessary
return value check.
Link: https://lore.kernel.org/r/20221028075053.3990467-1-xuhaoyue1@hisilicon.com
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
mr->map_shift is set to ilog2(RXE_BUF_PER_MAP) in both rxe_mr_init() and
rxe_mr_alloc() so remove the duplicate one in rxe_mr_init().
Link: https://lore.kernel.org/r/1666855893-145-1-git-send-email-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
We should reject the requests with access flags that is not registered by
MR/MW. For example, lookup_mr() should return NULL when requested access
is 0x03 and mr->access is 0x01.
Link: https://lore.kernel.org/r/20220927055337.22630-2-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The subroutine rxe_do_task() is only called in rxe_task.c. This patch
makes it static and renames it do_task().
Link: https://lore.kernel.org/r/20221021200118.2163-6-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Split rxe_run_task(task, sched) into rxe_run_task(task) and
rxe_sched_task(task).
Link: https://lore.kernel.org/r/20221021200118.2163-5-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The name field in struct rxe_task is never used. This patch removes it.
Link: https://lore.kernel.org/r/20221021200118.2163-4-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The calls to spin_lock_init() for the tasklet spinlocks in
rxe_qp_init_misc() are redundant since they are intiialized in
rxe_init_task(). This patch removes them.
Link: https://lore.kernel.org/r/20221021200118.2163-3-rpearsonhpe@gmail.com
Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a check for if create_singlethread_workqueue() fails and also destroy
the work queue on failure paths.
Fixes: e411e0587e ("RDMA/qedr: Add iWARP connection management functions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/Y1gBkDucQhhWj5YM@kili
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
KASAN reported a null-ptr-deref error:
KASAN: null-ptr-deref in range [0x0000000000000118-0x000000000000011f]
CPU: 1 PID: 379
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:destroy_workqueue+0x2f/0x740
RSP: 0018:ffff888016137df8 EFLAGS: 00000202
...
Call Trace:
ib_core_cleanup+0xa/0xa1 [ib_core]
__do_sys_delete_module.constprop.0+0x34f/0x5b0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fa1a0d221b7
...
It is because the fail of roce_gid_mgmt_init() is ignored:
ib_core_init()
roce_gid_mgmt_init()
gid_cache_wq = alloc_ordered_workqueue # fail
...
ib_core_cleanup()
roce_gid_mgmt_cleanup()
destroy_workqueue(gid_cache_wq)
# destroy an unallocated wq
Fix this by catching the fail of roce_gid_mgmt_init() in ib_core_init().
Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Link: https://lore.kernel.org/r/20221025024146.109137-1-chenzhongjin@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Before the testing, we already passed it to rxe_mr_copy() where mr could
be dereferenced. so this checking is not needed.
The only way that mr is NULL is when it reaches below line 780 with
'qp->resp.mr = NULL', which is not possible in Bob's explanation[1].
778 if (res->state == rdatm_res_state_new) {
779 if (!res->replay) {
780 mr = qp->resp.mr;
781 qp->resp.mr = NULL;
782 } else {
[1] https://lore.kernel.org/lkml/30ff25c4-ce66-eac4-eaa2-64c0db203a19@gmail.com/
Link: https://lore.kernel.org/r/1666582315-2-1-git-send-email-lizhijian@fujitsu.com
CC: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Requesting nodes do not handle a reported error correctly if it is
generated in the middle of multi-packet Read responses, and the node tries
to resend the request endlessly. Let completer terminate the connection in
that case.
Link: https://lore.kernel.org/r/20221013014724.3786212-2-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, responder can reply packets with invalid payloads if it fails
to copy messages to the packets. Add an error handling in read_reply() to
inform a requesting node of the failure.
Link: https://lore.kernel.org/r/20221013014724.3786212-1-matsuda-daisuke@fujitsu.com
Suggested-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Lock grab occurs in a concurrent scenario, resulting in stepping on a NULL
pointer. It should be init mutex_init() first before use the lock.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Call trace:
__mutex_lock.constprop.0+0xd0/0x5c0
__mutex_lock_slowpath+0x1c/0x2c
mutex_lock+0x44/0x50
free_mr_send_cmd_to_hw+0x7c/0x1c0 [hns_roce_hw_v2]
hns_roce_v2_dereg_mr+0x30/0x40 [hns_roce_hw_v2]
hns_roce_dereg_mr+0x4c/0x130 [hns_roce_hw_v2]
ib_dereg_mr_user+0x54/0x124
uverbs_free_mr+0x24/0x30
destroy_hw_idr_uobject+0x38/0x74
uverbs_destroy_uobject+0x48/0x1c4
uobj_destroy+0x74/0xcc
ib_uverbs_cmd_verbs+0x368/0xbb0
ib_uverbs_ioctl+0xec/0x1a4
__arm64_sys_ioctl+0xb4/0x100
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0x58/0x190
do_el0_svc+0x30/0x90
el0_svc+0x2c/0xb4
el0t_64_sync_handler+0x1a4/0x1b0
el0t_64_sync+0x19c/0x1a0
Fixes: 70f9252158 ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Link: https://lore.kernel.org/r/20221024083814.1089722-3-xuhaoyue1@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When function reset and local invalidate are mixed, HNS RoCEE may hang.
Before introducing the cause of the problem, two hardware internal
concepts need to be introduced:
1. Execution queue: The queue of hardware execution instructions,
function reset and local invalidate are queued for execution in this
queue.
2.Local queue: A queue that stores local operation instructions. The
instructions in the local queue will be sent to the execution queue
for execution. The instructions in the local queue will not be removed
until the execution is completed.
The reason for the problem is as follows:
1. There is a function reset instruction in the execution queue, which
is currently being executed. A necessary condition for the successful
execution of function reset is: the hardware pipeline needs to empty
the instructions that were not completed before;
2. A local invalidate instruction at the head of the local queue is
sent to the execution queue. Now there are two instructions in the
execution queue, the first is the function reset instruction, and the
second is the local invalidate instruction, which will be executed in
se quence;
3. The user has issued many local invalidate operations, causing the
local queue to be filled up.
4. The user still has a new local operation command and is queuing to
enter the local queue. But the local queue is full and cannot receive
new instructions, this instruction is temporarily stored at the
hardware pipeline.
5. The function reset has been waiting for the instruction before the
hardware pipeline stage is drained. The hardware pipeline stage also
caches a local invalidate instruction, so the function reset cannot be
completed, and the instructions after it cannot be executed.
These factors together cause the execution logic deadlock of the hardware,
and the consequence is that RoCEE will not have any response. Considering
that the local operation command may potentially cause RoCEE to hang, this
feature is no longer supported.
Fixes: e93df01085 ("RDMA/hns: Support local invalidate for hip08 in kernel space")
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Link: https://lore.kernel.org/r/20221024083814.1089722-2-xuhaoyue1@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The member 'type' is included in both struct rxe_mr and struct ib_mr
so remove the duplicate one of struct rxe_mr.
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Link: https://lore.kernel.org/r/20221021134513.17730-1-yangx.jy@fujitsu.com
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The variable freeze_cnt being incremented but it is never referenced,
it is redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20221021173504.27546-1-colin.i.king@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
There is a single caller to iser_disconnected_handler. Open code its
logic and remove it.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20221016093833.12537-4-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In some cases, we need to make sure that state_mutex is taken. Use
lockdep_assert_held to warn us in case it doesn't while it should.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20221016093833.12537-3-mgurtovoy@nvidia.com
Reviewed-by: Sergey Gorenko <sergeygo@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
There is a single caller to iser_conn_state_comp_exch. Open code its
logic and remove it.
Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Sergey Gorenko <sergeygo@nvidia.com>
Link: https://lore.kernel.org/r/20221016093833.12537-2-mgurtovoy@nvidia.com
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit 13bac86195 ("IB/hfi1: Fix abba locking issue with sc_disable()")
incorrectly tries to move a list from one list head to another. The
result is a kernel crash.
The crash is triggered when a link goes down and there are waiters for a
send to complete. The following signature is seen:
BUG: kernel NULL pointer dereference, address: 0000000000000030
[...]
Call Trace:
sc_disable+0x1ba/0x240 [hfi1]
pio_freeze+0x3d/0x60 [hfi1]
handle_freeze+0x27/0x1b0 [hfi1]
process_one_work+0x1b0/0x380
? process_one_work+0x380/0x380
worker_thread+0x30/0x360
? process_one_work+0x380/0x380
kthread+0xd7/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
The fix is to use the correct call to move the list.
Fixes: 13bac86195 ("IB/hfi1: Fix abba locking issue with sc_disable()")
Signed-off-by: Dean Luick <dean.luick@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Link: https://lore.kernel.org/r/166610327042.674422.6146908799669288976.stgit@awfm-02.cornelisnetworks.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit 27cfde795a ("RDMA/cma: Fix arguments order in net device
validation") swapped the src and dst addresses in the call to
validate_net_dev().
As a consequence, the test in validate_ipv4_net_dev() to see if the
net_dev is the right one, is incorrect for port 1 <-> 2 communication when
the ports are on the same sub-net. This is fixed by denoting the
flowi4_oif as the device instead of the incoming one.
The bug has not been observed using IPv6 addresses.
Fixes: 27cfde795a ("RDMA/cma: Fix arguments order in net device validation")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://lore.kernel.org/r/20221012141542.16925-1-haakon.bugge@oracle.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Prepare InfiniBand drivers to the common dynamic dma-buf locking
convention by starting to use the unlocked versions of dma-buf API
functions.
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221017172229.42269-11-dmitry.osipenko@collabora.com
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
- Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw
- rts tracing improvements
- Code improvements: strlscpy conversion, unused parameter, spelling
mistakes, unused variables, flex arrays
- restrack device details report for hns
- Simplify struct device initialization in SRP
- Eliminate the never-used service_mask support in IB CM
- Make rxe not print to the console for some kinds of network packets
- Asymetric paths and router support in the CM through netlink messages
- DMABUF importer support for mlx5devx umem's
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYz9bgAAKCRCFwuHvBreF
YevoAP47J/svlOFlFtBhTVF79Ddtf+MMeqeVvLoHHQbCU5rUpAD+KUpTXAvwNcM9
dHwNXz9ctanP5397qusH0rxOKPo/EA4=
=lgSv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Not a big list of changes this cycle, mostly small things. The new
MANA rdma driver should come next cycle along with a bunch of work on
rxe.
Summary:
- Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw
- rts tracing improvements
- Code improvements: strlscpy conversion, unused parameter, spelling
mistakes, unused variables, flex arrays
- restrack device details report for hns
- Simplify struct device initialization in SRP
- Eliminate the never-used service_mask support in IB CM
- Make rxe not print to the console for some kinds of network packets
- Asymetric paths and router support in the CM through netlink
messages
- DMABUF importer support for mlx5devx umem's"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (84 commits)
RDMA/rxe: Remove error/warning messages from packet receiver path
RDMA/usnic: fix set-but-not-unused variable 'flags' warning
IB/hfi1: Use skb_put_data() instead of skb_put/memcpy pair
RDMA/hns: Unified Log Printing Style
RDMA/hns: Replacing magic number with macros in apply_func_caps()
RDMA/hns: Repacing 'dseg_len' by macros in fill_ext_sge_inl_data()
RDMA/hns: Remove redundant 'max_srq_desc_sz' in caps
RDMA/hns: Remove redundant 'num_mtt_segs' and 'max_extend_sg'
RDMA/hns: Remove redundant 'phy_addr' in hns_roce_hem_list_find_mtt()
RDMA/hns: Remove redundant 'use_lowmem' argument from hns_roce_init_hem_table()
RDMA/hns: Remove redundant 'bt_level' for hem_list_alloc_item()
RDMA/hns: Remove redundant 'attr_mask' in modify_qp_init_to_init()
RDMA/hns: Remove unnecessary brackets when getting point
RDMA/hns: Remove unnecessary braces for single statement blocks
RDMA/hns: Cleanup for a spelling error of Asynchronous
IB/rdmavt: Add __init/__exit annotations to module init/exit funcs
RDMA/rxe: Remove redundant num_sge fields
RDMA/mlx5: Enable ATS support for MRs and umems
RDMA/mlx5: Add support for dmabuf to devx umem
RDMA/core: Add UVERBS_ATTR_RAW_FD
...
Incoming packets to rxe are passed from UDP layer using an encapsulation
socket. If there are any clients reachable to a node, they can invoke the
encapsulation handler arbitrarily by sending malicious or irrelevant
packets. This can potentially cause a message overflow and a subsequent
slowdown on the node.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220929080023.304242-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Saeed Mahameed says:
====================
updates from mlx5-next 2022-09-24
Updates form mlx5-next including[1]:
1) HW definitions and support for NPPS clock settings.
2) various cleanups
3) Enable hash mode by default for all NICs
4) page tracker and advanced virtualization HW definitions for vfio
[1] https://lore.kernel.org/netdev/20220907233636.388475-1-saeed@kernel.org/
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Remove from FPGA IFC file not-needed definitions
net/mlx5: Remove unused structs
net/mlx5: Remove unused functions
net/mlx5: detect and enable bypass port select flow table
net/mlx5: Lag, enable hash mode by default for all NICs
net/mlx5: Lag, set active ports if support bypass port select flow table
RDMA/mlx5: Don't set tx affinity when lag is in hash mode
net/mlx5: add IFC bits for bypassing port select flow table
net/mlx5: Add support for NPPS with real time mode
net/mlx5: Expose NPPS related registers
net/mlx5: Query ADV_VIRTUALIZATION capabilities
net/mlx5: Introduce ifc bits for page tracker
RDMA/mlx5: Move function mlx5_core_query_ib_ppcnt() to mlx5_ib
====================
Link: https://lore.kernel.org/all/20220927201906.234015-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In hash mode, without setting tx affinity explicitly, the port select
flow table decides which port is used for the traffic.
If port_select_flow_table_bypass capability is supported and tx affinity
is set explicitly for QP/TIS, they will be added into the explicit affinity
table in FW to check which port is used for the traffic.
1. The overloaded explicit affinity table may affect performance.
To avoid this, do not set tx affinity explicitly by default.
2. The packets of the same flow need to be transmitted on the same port.
Because the packets of the same flow use different QPs in slow & fast
path, it shouldn't set tx affinity explicitly for these QPs.
Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Use skb_put_data() instead of skb_put() and memcpy(), which is shorter
and clear. Drop the tmp variable that is not needed any more.
Link: https://lore.kernel.org/r/20220927022919.16902-1-shangxiaojing@huawei.com
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The first letter of the log information is changed to lowercase
to keep the same style.
Link: https://lore.kernel.org/r/20220922123315.3732205-13-xuhaoyue1@hisilicon.com
Signed-off-by: Guofeng Yue <yueguofeng@hisilicon.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The sge size is known to be constant, so it's unnecessary to use sizeof to
calculate.
Link: https://lore.kernel.org/r/20220922123315.3732205-11-xuhaoyue1@hisilicon.com
Signed-off-by: Luoyouming <luoyouming@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The max_srq_desc_sz is defined in the code, but never used,
so delete this redundant variable.
Link: https://lore.kernel.org/r/20220922123315.3732205-10-xuhaoyue1@hisilicon.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The num_mtt_segs and max_extend_sg used to be used for HIP06,
remove them since the HIP06 code has been removed.
Link: https://lore.kernel.org/r/20220922123315.3732205-9-xuhaoyue1@hisilicon.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This parameter has never been used. Remove it to simplify the function.
Link: https://lore.kernel.org/r/20220922123315.3732205-8-xuhaoyue1@hisilicon.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
As hns_roce_init_hem_table() is always called with use_lowmem
being '1', and table->lowmem is set according to that argument,
so remove table->lowmem too.
Also, as the table->lowmem is used to indicate a dma buffer
is allocated with GFP_HIGHUSER or GFP_KERNEL, and calling
dma_alloc_coherent() with GFP_KERNEL seems like a common
pattern.
Link: https://lore.kernel.org/r/20220922123315.3732205-7-xuhaoyue1@hisilicon.com
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The 'bt_level' parameter is not used in hem_list_alloc_item(),
so remove it.
Link: https://lore.kernel.org/r/20220922123315.3732205-6-xuhaoyue1@hisilicon.com
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In include/uapi/rdma/rdma_user_rxe.h there are redundant copies of num_sge
in the rxe_send_wr, rxe_recv_wqe, and rxe_dma_info. Only the ones in
rxe_dma_info are actually used by the rxe kernel driver.
The userspace would set these values, but the kernel never read them.
This change has no affect on the current ABI and new or old versions of
rdma-core operate correctly with new or old versions of the kernel rxe
driver.
Link: https://lore.kernel.org/r/20220913222716.18335-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For mlx5 if ATS is enabled in the PCI config then the device will use ATS
requests for only certain DMA operations. This has to be opted in by the
SW side based on the mkey or umem settings.
ATS slows down the PCI performance, so it should only be set in cases when
it is needed. All of these cases revolve around optimizing PCI P2P
transfers and avoiding bad cases where the bus just doesn't work.
Link: https://lore.kernel.org/r/4-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is modeled after the similar EFA enablement in commit
66f4817b57 ("RDMA/efa: Add support for dmabuf memory regions").
Like EFA there is no support for revocation so we simply call the
ib_umem_dmabuf_get_pinned() to obtain a umem instead of the normal
ib_umem_get(). Everything else stays the same.
Link: https://lore.kernel.org/r/3-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This uses the same passing protocol as UVERBS_ATTR_FD (eg len = 0 data_s64
= fd), except that the FD is not required to be a uverbs object and the
core code does not covert the FD to an object handle automatically.
Access to the int fd is provided by uverbs_get_raw_fd().
Link: https://lore.kernel.org/r/2-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently in resize_finish() in rxe_queue.c there is a loop which copies
the entries in the original queue into a newly allocated queue. The
termination logic for this loop is incorrect. The call to
queue_next_index() updates cons but has no effect on whether the queue is
empty. So if the queue starts out empty nothing is copied but if it is not
then the loop will run forever. This patch changes the loop to compare the
value of cons to the original producer index.
Fixes: ae6e843fe0 ("RDMA/rxe: Add memory barriers to kernel queues")
Link: https://lore.kernel.org/r/20220825221446.6512-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move setting of pd in mr objects ahead of any possible errors so that it
will always be set in rxe_mr_cleanup() to avoid seg faults when
rxe_put(mr_pd(mr)) is called.
Fixes: cf40367961 ("RDMA/rxe: Move mr cleanup code to rxe_mr_cleanup()")
Link: https://lore.kernel.org/r/20220805183153.32007-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Most code in send_ack() and send_atomic_ack() are duplicate, move them to
a new helper send_common_ack().
In newer IBA spec, some opcodes require acknowledge with a zero-length
read response, with this new helper, we can easily implement it later.
Link: https://lore.kernel.org/r/1659335010-2-1-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
"&srq->pd->usecnt" and "&pd->usecnt" are different names for the same
reference count. Use "&pd->usecnt" consistently for both the increment
and decrement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YyxFe3Pm0uzRuBkQ@kili
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently ib_srp module does not support devices with more than 256
ports. Switch from u8 to u32 to fix the problem.
Fixes: 1fb7f8973f ("RDMA: Support more than 255 rdma ports")
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mikhael Goikhman <migo@nvidia.com>
Link: https://lore.kernel.org/r/7d80d8844f1abb3a54170b7259f0a02be38080a6.1663747327.git.leonro@nvidia.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
rxe_mr and ib_mr have interchangeable members. Remove device specific
members and use ones in the generic struct. Both 'iova' and 'length' are
filled in ib_uverbs or ib_core layer after MR registration.
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220921080844.1616883-2-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Set 'iova' and 'length' on ib_mr in ib_uverbs and ib_core layers to let all
drivers have the members filled. Also, this commit removes redundancy in
the respective drivers.
Previously, commit 04c0a5fcfc ("IB/uverbs: Set IOVA on IB MR in uverbs
layer") changed to set 'iova', but seems to have missed 'length' and the
ib_core layer at that time.
Fixes: 04c0a5fcfc ("IB/uverbs: Set IOVA on IB MR in uverbs layer")
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220921080844.1616883-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In inter-subnet cases, when inbound/outbound PRs are available,
outbound_PR.dlid is used as the requestor's datapath DLID and
inbound_PR.dlid is used as the responder's DLID. The inbound_PR.dlid
is passed to responder side with the "ConnectReq.Primary_Local_Port_LID"
field. With this solution the PERMISSIVE_LID is no longer used in
Primary Local LID field.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/b3f6cac685bce9dde37c610be82e2c19d9e51d9e.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The responder should always use WC's SLID as the dlid, to follow the
IB SPEC section "13.5.4.2 COMMON RESPONSE ACTIONS":
A responder always takes the following actions in constructing a
response packet:
- The SLID of the received packet is used as the DLID in the response
packet.
Fixes: ac3a949fb2 ("IB/CM: Set appropriate slid and dlid when handling CM request")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/cd17c240231e059d2fc07c17dfe555d548b917eb.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Support receiving inbound and outbound IB path records (along with GMP
PathRecord) from user-space service through the RDMA netlink channel.
The LIDs in these 3 PRs can be used in this way:
1. GMP PR: used as the standard local/remote LIDs;
2. DLID of outbound PR: Used as the "dlid" field for outbound traffic;
3. DLID of inbound PR: Used as the "dlid" field for outbound traffic in
responder side.
This is aimed to support adaptive routing. With current IB routing
solution when a packet goes out it's assigned with a fixed DLID per
target, meaning a fixed router will be used.
The LIDs in inbound/outbound path records can be used to identify group
of routers that allow communication with another subnet's entity. With
them packets from an inter-subnet connection may travel through any
router in the set to reach the target.
As confirmed with Jason, when sending a netlink request, kernel uses
LS_RESOLVE_PATH_USE_ALL so that the service knows kernel supports
multiple PRs.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/2fa2b6c93c4c16c8915bac3cfc4f27be1d60519d.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This fields means the total number of primary and alternative paths,
i.e.,:
0 - No primary nor alternate path is available;
1 - Only primary path is available;
2 - Both primary and alternate path are available.
Rename it to avoid confusion as with follow patches primary path will
support multiple path records.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/cbe424de63a56207870d70c5edce7c68e45f429e.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Delay QP destroy completion until all siw references to QP are
dropped. The calling RDMA core will free QP structure after
successful return from siw_qp_destroy() call, so siw must not
hold any remaining reference to the QP upon return.
A use-after-free was encountered in xfstest generic/460, while
testing NFSoRDMA. Here, after a TCP connection drop by peer,
the triggered siw_cm_work_handler got delayed until after
QP destroy call, referencing a QP which has already freed.
Fixes: 303ae1cdfd ("rdma/siw: application interface")
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20220920082503.224189-1-bmt@zurich.ibm.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
For header and trailer/padding processing, siw did not consume new
skb data until minimum amount present to fill current header or trailer
structure, including potential payload padding. Not consuming any
data during upcall may cause a receive stall, since tcp_read_sock()
is not upcalling again if no new data arrive.
A NFSoRDMA client got stuck at RDMA Write reception of unaligned
payload, if the current skb did contain only the expected 3 padding
bytes, but not the 4 bytes CRC trailer. Expecting 4 more bytes already
arrived in another skb, and not consuming those 3 bytes in the current
upcall left the Write incomplete, waiting for the CRC forever.
Fixes: 8b6a361b8c ("rdma/siw: receive path")
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20220920081202.223629-1-bmt@zurich.ibm.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
rc_only_opcode and uc_only_opcode have been removed since
commit b374e060cc ("IB/hfi1: Consolidate pio control masks
into single definition"), so remove them.
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://lore.kernel.org/r/20220911092325.3216513-1-cuigaosheng1@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In preparation for FORTIFY_SOURCE performing run-time destination buffer
bounds checking for memcpy(), specify the destination output buffer
explicitly, instead of asking memcpy() to write past the end of what looked
like a fixed-size object.
Notice that srp_rsp[] is a pointer to a structure that contains
flexible-array member data[]:
struct srp_rsp {
...
__be32 sense_data_len;
__be32 resp_data_len;
u8 data[];
};
link: https://github.com/KSPP/linux/issues/201
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Link: https://lore.kernel.org/r/20220909022943.8896-1-hbh25y@gmail.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Some opcodes are used in hardware internally, and driver does not care
about them. So, we change them to reserved opcodes in driver.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220909093822.33868-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Many of erdma's includes are redundant, because they are already included
indirectly by kernel headers or custom headers. So we remove all the
unnecessary direct-includes. Besides, add linux/pci.h to erdma.h because
it's also used in the file.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220909093822.33868-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
erdma_post_cmd_wait does not use the 'u64 *req' input parameter directly.
So it is better to define it to 'void *req', and by this we can eliminate
the casting when calling erdma_post_cmd_wait.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220909093822.33868-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix the code for converting a SCSI command pointer into an SRP request
pointer.
Cc: Xiao Yang <yangx.jy@fujitsu.com>
Fixes: ad215aaea4 ("RDMA/srp: Make struct scsi_cmnd and struct srp_request adjacent")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220908233139.3042628-1-bvanassche@acm.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently ib_copy_from_udata and ib_copy_to_udata could underfill
the request and response buffer if the user-space passes an undersized
value for udata->inlen or udata->outlen respectively [1]
This could lead to undesirable behavior.
Zero initing the buffer only goes as far as preventing using the buffer
uninitialized.
Validate udata->inlen and udata->outlen passed from user-space to ensure
they are at least the required minimum size.
[1] https://lore.kernel.org/linux-rdma/MWHPR11MB0029F37D40D9D4A993F8F549E9D79@MWHPR11MB0029.namprd11.prod.outlook.com/
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220907191324.1173-3-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
A number of asynchronous event (AE) ids were not aligned to the
correct flush_code and event_type. Fix these up so that the
correct IBV error and event codes are returned to application.
Also, add handling for new AE ids like IRDMA_AE_INVALID_REQUEST to
return the correct WC error code.
Fixes: 44d9e52977 ("RDMA/irdma: Implement device initialization definitions")
Signed-off-by: Sindhu-Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220907191324.1173-2-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Send with invalidate verb call can pass in an
uninitialized s/g array with 0 sge's which is
filled into irdma WQE and causes a HW asynchronous
event.
Fix this by using the s/g array in irdma post send
only when its valid.
Fixes: 551c46e ("RDMA/irdma: Add user/kernel shared libraries")
Signed-off-by: Sindhu-Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220906223244.1119-5-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When a QP and a MR on a local host are in different PDs, the HW generates
an asynchronous event (AE). The same AE is generated when a QP and a MW
are in different PDs during a bind operation. Return the more appropriate
IBV_WC_MW_BIND_ERR for the latter case by checking the OP type from the
CQE in error.
Fixes: 551c46edc7 ("RDMA/irdma: Add user/kernel shared libraries")
Signed-off-by: Sindhu-Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220906223244.1119-4-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The MR deregister CQP can fail if an MW is bound to it.
Return an appropriate error for this case.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Sindhu-Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220906223244.1119-3-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Report the correct max cqes available to an application taking
into account a reserved entry to detect overflow.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Sindhu-Devale <sindhu.devale@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20220906223244.1119-2-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Return the value set_link_state() directly instead of storing it in
another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Link: https://lore.kernel.org/r/20220901074209.313004-1-ye.xingchen@zte.com.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Query eswitch functions returns information of the external host
PF(if it exists). It can be used to check if DEVX is running on ECPF.
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Link: https://lore.kernel.org/r/4265925178ab3224dc1d3e3784bb312d808edca5.1661763785.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The cited commit removed from the cleanup flow of umr the checks
if the resources were created. This could lead to null-ptr-deref
in case that we had failure in mlx5_ib_stage_ib_reg_init stage.
Fix it by adding new state to the umr that can say if the resources
were created or not and check it in the umr cleanup flow before
destroying the resources.
Fixes: 04876c12c1 ("RDMA/mlx5: Move init and cleanup of UMR to umr.c")
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/4cfa61386cf202e9ce330e8d228ce3b25a36326e.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When accessing Ports Performance Counters Register (PPCNT),
local port must be one if it is Function-Per-Port HCA that
HCA_CAP.num_ports is 1.
The offending patch can change the local port to other values
when accessing PPCNT after enabling switchdev mode. The following
syndrome will be printed:
# cat /sys/class/infiniband/rdmap4s0f0/ports/2/counters/*
# dmesg
mlx5_core 0000:04:00.0: mlx5_cmd_check:756:(pid 12450): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x1e5585)
Fix it by setting local port to one for Function-Per-Port HCA.
Fixes: 210b1f7807 ("IB/mlx5: When not in dual port RoCE mode, use provided port as native")
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Link: https://lore.kernel.org/r/6c5086c295c76211169e58dbd610fb0402360bab.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When the RDMA auxiliary driver probes, it sets its profile based on
devlink driverinit value. The latter might not be in sync with FW yet
(In case devlink reload is not performed), thus causing a mismatch
between RDMA driver and FW. This results in the following FW syndrome
when the RDMA driver tries to adjust RoCE state, which fails the probe:
"0xC1F678 | modify_nic_vport_context: roce_en set on a vport that
doesn't support roce"
To prevent this, select the PF profile based on FW RoCE capability
instead of relying on devlink driverinit value.
To provide backward compatibility of the RoCE disable feature, on older
FW's where roce_rw is not set (FW RoCE capability is read-only), keep
the current behavior e.g., rely on devlink driverinit value.
Fixes: fbfa97b4d7 ("net/mlx5: Disable roce at HCA level")
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://lore.kernel.org/r/cb34ce9a1df4a24c135cb804db87f7d2418bd6cc.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix a nested dead lock as part of ODP flow by using mmput_async().
From the below call trace [1] can see that calling mmput() once we have
the umem_odp->umem_mutex locked as required by
ib_umem_odp_map_dma_and_lock() might trigger in the same task the
exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() which
may dead lock when trying to lock the same mutex.
Moving to use mmput_async() will solve the problem as the above
exit_mmap() flow will be called in other task and will be executed once
the lock will be available.
[1]
[64843.077665] task:kworker/u133:2 state:D stack: 0 pid:80906 ppid:
2 flags:0x00004000
[64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
[64843.077719] Call Trace:
[64843.077722] <TASK>
[64843.077724] __schedule+0x23d/0x590
[64843.077729] schedule+0x4e/0xb0
[64843.077735] schedule_preempt_disabled+0xe/0x10
[64843.077740] __mutex_lock.constprop.0+0x263/0x490
[64843.077747] __mutex_lock_slowpath+0x13/0x20
[64843.077752] mutex_lock+0x34/0x40
[64843.077758] mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib]
[64843.077808] __mmu_notifier_release+0x1a4/0x200
[64843.077816] exit_mmap+0x1bc/0x200
[64843.077822] ? walk_page_range+0x9c/0x120
[64843.077828] ? __cond_resched+0x1a/0x50
[64843.077833] ? mutex_lock+0x13/0x40
[64843.077839] ? uprobe_clear_state+0xac/0x120
[64843.077860] mmput+0x5f/0x140
[64843.077867] ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core]
[64843.077931] pagefault_real_mr+0x9a/0x140 [mlx5_ib]
[64843.077962] pagefault_mr+0xb4/0x550 [mlx5_ib]
[64843.077992] pagefault_single_data_segment.constprop.0+0x2ac/0x560
[mlx5_ib]
[64843.078022] mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib]
[64843.078051] process_one_work+0x22b/0x3d0
[64843.078059] worker_thread+0x53/0x410
[64843.078065] ? process_one_work+0x3d0/0x3d0
[64843.078073] kthread+0x12a/0x150
[64843.078079] ? set_kthread_struct+0x50/0x50
[64843.078085] ret_from_fork+0x22/0x30
[64843.078093] </TASK>
Fixes: 36f30e486d ("IB/core: Improve ODP to use hmm_range_fault()")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Functions that work on a pointer to virtual memory such as
virt_to_pfn() and users of that function such as
virt_to_page() are supposed to pass a pointer to virtual
memory, ideally a (void *) or other pointer. However since
many architectures implement virt_to_pfn() as a macro,
this function becomes polymorphic and accepts both a
(unsigned long) and a (void *).
If we instead implement a proper virt_to_pfn(void *addr)
function the following happens (occurred on arch/arm):
drivers/infiniband/sw/siw/siw_qp_tx.c:32:23: warning: incompatible
integer to pointer conversion passing 'dma_addr_t' (aka 'unsigned int')
to parameter of type 'const void *' [-Wint-conversion]
drivers/infiniband/sw/siw/siw_qp_tx.c:32:37: warning: passing argument
1 of 'virt_to_pfn' makes pointer from integer without a cast
[-Wint-conversion]
drivers/infiniband/sw/siw/siw_qp_tx.c:538:36: warning: incompatible
integer to pointer conversion passing 'unsigned long long'
to parameter of type 'const void *' [-Wint-conversion]
Fix this with an explicit cast. In one case where the SIW
SGE uses an unaligned u64 we need a double cast modifying the
virtual address (va) to a platform-specific uintptr_t before
casting to a (void *).
Fixes: b9be6f18cf ("rdma/siw: transmit path")
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20220902215918.603761-1-linus.walleij@linaro.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The SoftiWARP Kconfig is missing "select" for CRYPTO and CRYPTO_CRC32C.
In addition, it improperly "depends on" LIBCRC32C, this should be a
"select", similar to net/sctp and others. As a dependency, SIW fails
to appear in generic configurations.
Link: https://lore.kernel.org/r/d366bf02-3271-754f-fc68-1a84016d0e19@talpey.com
Signed-off-by: Tom Talpey <tom@talpey.com>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
An incoming Read request causes multiple Read responses. If a user MR to
copy data from is unavailable or responder cannot send a reply, then the
error messages can be printed for each response attempt, resulting in
message overflow.
Link: https://lore.kernel.org/r/20220829071218.1639065-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The member variable obj in struct rxe_task is not needed.
So remove it to save memory.
Link: https://lore.kernel.org/r/20220822011615.805603-4-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When sock_create_kern in the function rxe_qp_init_req fails,
qp->sk is set to NULL.
Then the function rxe_create_qp will call rxe_qp_do_cleanup
to handle allocated resource.
Before handling qp->sk, this variable should be checked.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20220822011615.805603-3-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When rxe_queue_init in the function rxe_qp_init_req fails,
both qp->req.task.func and qp->req.task.arg are not initialized.
Because of creation of qp fails, the function rxe_create_qp will
call rxe_qp_do_cleanup to handle allocated resource.
Before calling __rxe_do_task, both qp->req.task.func and
qp->req.task.arg should be checked.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20220822011615.805603-2-yanjun.zhu@linux.dev
Reported-by: syzbot+ab99dc4c6e961eed8b8e@syzkaller.appspotmail.com
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Move the device and service_id match code at the top of
cm_insert_listen() and cm_find_listen() into the final else branch.
Link: https://lore.kernel.org/r/20220819090859.957943-4-markzhang@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The service_mask is always ~cpu_to_be64(0), so the result is always
a NOP when it is &'d with a service_id. Remove it for simplicity.
Link: https://lore.kernel.org/r/20220819090859.957943-3-markzhang@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Since process_{read,write} already prints direction info if ctx->ops.rdma_ev
fails, no need to pass 'dir'.
Link: https://lore.kernel.org/r/20220826081117.21687-1-guoqing.jiang@linux.dev
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The bt number of qpc_timer of HIP09 increases compared with that of HIP08.
Therefore, qpc_timer_bt_num and num_qpc_timer do not match. As a result,
the driver may fail to allocate qpc_timer. So the driver needs to uniquely
uses qpc_timer_bt_num to represent the bt number of qpc_timer.
Fixes: 0e40dc2f70 ("RDMA/hns: Add timer allocation support for hip08")
Link: https://lore.kernel.org/r/20220829105021.1427804-4-liangwenpeng@huawei.com
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The value of qp->rq.wqe_shift of HIP08 is always determined by the number
of sge. So delete the wrong branch.
Fixes: cfc85f3e4b ("RDMA/hns: Add profile support for hip08 driver")
Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Link: https://lore.kernel.org/r/20220829105021.1427804-3-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The supported page size for hns is (4K, 128M), not (4K, 2G).
Fixes: cfc85f3e4b ("RDMA/hns: Add profile support for hip08 driver")
Link: https://lore.kernel.org/r/20220829105021.1427804-2-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix the order of source and destination addresses when resolving the
route between server and client to validate use of correct net device.
The reverse order we had so far didn't actually validate the net device
as the server would try to resolve the route to itself, thus always
getting the server's net device.
The issue was discovered when running cm applications on a single host
between 2 interfaces with same subnet and source based routing rules.
When resolving the reverse route the source based route rules were
ignored.
Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
Link: https://lore.kernel.org/r/1c1ec2277a131d277ebcceec987fd338d35b775f.1661251872.git.leonro@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit 1e75550648 ("Revert "RDMA/rxe: Create duplicate mapping tables for
FMRs"") brought back the member 'va' to struct rxe_mr. However, it is
actually used by nobody and thus can be removed.
Fixes: 1e75550648 ("Revert "RDMA/rxe: Create duplicate mapping tables for FMRs"")
Link: https://lore.kernel.org/r/20220829012335.1212697-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Simplify the SRP driver by using the attribute group mechanism instead
of calling device_create_file() explicitly.
Link: https://lore.kernel.org/r/20220825213900.864587-5-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Instead of ignoring dev_set_name() failure, handle dev_set_name()
failure. Convert a device_register() call into device_initialize() and
device_add() calls.
Link: https://lore.kernel.org/r/20220825213900.864587-4-bvanassche@acm.org
Reported-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Move the kfree(host) calls into srp_release_dev(). Convert a
device_unregister() call into a device_del() and a device_put() call.
Remove the host->released completion object. This patch prepares for
handling dev_set_name() failure in srp_add_port().
Link: https://lore.kernel.org/r/20220825213900.864587-3-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
device_register() always calls device_initialize() so calling device_del()
is safe even if device_register() fails. Implement the following advice
from the comment block above device_register(): "NOTE: _Never_ directly free
@dev after calling this function, even if it returned an error! Always use
put_device() to give up the reference initialized in this function instead."
Keep the kfree() call in the error path since srp_release_dev() does not
free the host.
Link: https://lore.kernel.org/r/20220825213900.864587-2-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The MR raw restrack attributes come from the queue context maintained by
the ROCEE.
For example:
$ rdma res show mr dev hns_0 mrn 6 -dd -jp -r
[ {
"ifindex": 4,
"ifname": "hns_0",
"data": [ 1,0,0,0,2,0,0,0,0,3,0,0,0,0,2,0,0,0,0,0,32,0,0,0,2,0,0,0,
2,0,0,0,0,0,0,0 ]
} ]
Link: https://lore.kernel.org/r/20220822104455.2311053-8-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The QP raw restrack attributes come from the queue context maintained by
the ROCEE.
For example:
$ rdma res show qp link hns_0 -jp -dd -r
[ {
"ifindex": 4,
"ifname": "hns_0",
"data": [ 2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,
5,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,255,156,0,0,63,156,0,0,
7,0,0,0,1,0,0,0,9,0,0,0,0,0,0,0,2,0,0,0,2,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,63,156,0,
0,0,0,0,0 ]
} ]
Link: https://lore.kernel.org/r/20220822104455.2311053-6-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The CQ raw restrack attributes come from the queue context maintained by
the ROCEE.
For example:
$ rdma res show cq dev hns_0 cqn 14 -dd -jp -r
[ {
"ifindex": 4,
"ifname": "hns_0",
"data": [ 1,0,0,0,7,0,0,0,0,0,0,0,0,82,6,0,0,82,6,0,0,82,6,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
6,0,0,0,0,0,0,0 ]
} ]
Link: https://lore.kernel.org/r/20220822104455.2311053-4-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove the resttrack attributes from the queue context held by ROCEE, and
add the resttrack attributes from the queue information maintained by the
driver.
For example:
$ rdma res show cq dev hns_0 cqn 14 -dd -jp
[ {
"ifindex": 4,
"ifname": "hns_0",
"cqn": 14,
"cqe": 127,
"users": 1,
"adaptive-moderation": false,
"ctxn": 8,
"pid": 1524,
"comm": "ib_send_bw"
},
"drv_cq_depth": 128,
"drv_cons_index": 0,
"drv_cqe_size": 32,
"drv_arm_sn": 1
}
Link: https://lore.kernel.org/r/20220822104455.2311053-3-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
There is no need to use a dedicated DXF file and DFX structure to manage
the interface of the query queue context.
Link: https://lore.kernel.org/r/20220822104455.2311053-2-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add a parameter for create CQ admin command to set source address on
receive completion descriptors. Report capability for this feature
through query device verb.
Link: https://lore.kernel.org/r/20220818140449.414-1-mrgolin@amazon.com
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
ib_dma_map_sg() augments the SGL into a 'dma mapped SGL'. This process
may change the number of entries and the lengths of each entry.
Code that touches dma_address is iterating over the 'dma mapped SGL'
and must use dma_nents which returned from ib_dma_map_sg().
We should use the return count from ib_dma_map_sg for futher usage.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20220818105355.110344-4-haris.iqbal@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Aleksei Marov <aleksei.marov@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When iommu is enabled, we hit warnings like this:
WARNING: at rtrs/rtrs.c:178 rtrs_iu_post_rdma_write_imm+0x9b/0x110
rtrs warn on one sge entry length is 0, which is unexpected.
The problem is ib_dma_map_sg augments the SGL into a 'dma mapped SGL'.
This process may change the number of entries and the lengths of each
entry.
Code that touches dma_address is iterating over the 'dma mapped SGL'
and must use dma_nents which returned from ib_dma_map_sg().
So pass the count return from ib_dma_map_sg.
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20220818105355.110344-3-haris.iqbal@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Aleksei Marov <aleksei.marov@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add event tracing mechanism for following routines:
- send_io_resp_imm()
How to use:
1. Load the rtrs_server module
2. cd /sys/kernel/debug/tracing
3. If all the events need to be enabled:
echo 1 > events/rtrs_srv/enable
4. OR only speific routine/event needs to be enabled e.g.
echo 1 > events/rtrs_srv/send_io_resp_imm/enable
5. cat trace
6. Run some I/O workload which can trigger send_io_resp_imm()
Link: https://lore.kernel.org/r/20220818105240.110234-3-haris.iqbal@ionos.com
Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add event tracing mechanism for following routines:
- rtrs_clt_reconnect_work()
- rtrs_clt_close_conns()
- rtrs_rdma_error_recovery()
How to use:
1. Load the rtrs_client module
2. cd /sys/kernel/debug/tracing
3. If all the events need to be enabled:
echo 1 > events/rtrs_clt/enable
4. OR only speific routine/event needs to be enabled e.g.
echo 1 > events/rtrs_clt/rtrs_clt_close_conns/enable
5. cat trace
6. Run some workload which can trigger rtrs_clt_close_conns()
Link: https://lore.kernel.org/r/20220818105240.110234-2-haris.iqbal@ionos.com
Signed-off-by: Santosh Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
ib_umem_dmabuf_map_pages() returns 0 on success and -ERRNO on failure.
dma_resv_wait_timeout() uses a different scheme:
* Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or
* greater than zero on success.
This results in ib_umem_dmabuf_map_pages() being non-functional as a
positive return will be understood to be an error by drivers.
Fixes: f30bceab16 ("RDMA: use dma_resv_wait() instead of extracting the fence")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/0-v1-d8f4e1fa84c8+17-rdma_dmabuf_fix_jgg@nvidia.com
Tested-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
QP0 in HW is used for CMDQ, and the rest is for RDMA QPs. So the actual
max_qp capacity reported to core should be max_qp (reported by HW) - 1.
So does max_cq.
Fixes: 1550557717 ("RDMA/erdma: Add verbs implementation")
Link: https://lore.kernel.org/all/20220810014320.88026-1-chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Commit 'c2ed5611afd7' has increased the cpl_t5_pass_accept_rpl{} structure
size by 8B to avoid roundup. cpl_t5_pass_accept_rpl{} is a HW specific
structure and increasing its size will lead to unwanted adapter errors.
Current commit reverts the cpl_t5_pass_accept_rpl{} back to its original
and allocates zeroed skb buffer there by avoiding the memset for iss field.
Reorder code to minimize chip type checks.
Fixes: c2ed5611af ("iw_cxgb4: Use memset_startat() for cpl_t5_pass_accept_rpl")
Link: https://lore.kernel.org/r/20220809184118.2029-1-rahul.lakkireddy@chelsio.com
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The cited commit allowed the driver to operate over HCAs that have
4 physical ports. Use the number of ports of the RDMA device in the for
loop instead of using the struct size.
Fixes: 4cd14d44b1 ("net/mlx5: Support devices with more than 2 ports")
Link: https://lore.kernel.org/r/a54a56c2ede16044a29d119209b35189c662ac72.1659944855.git.leonro@nvidia.com
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The iSER Initiator uses two types of receive buffers:
- one big login buffer posted by iser_post_recvl();
- several small message buffers posted by iser_post_recvm().
The login buffer is used at the login phase and full feature phase in
the discovery session. It may take a few requests and responses to
complete the login phase. The message buffers are only used in the
normal operational session at the full feature phase.
After the commit referred in the fixes line, the login operation fails
if the authentication is enabled. That happens because the Initiator
posts a small receive buffer after the first response from Target. So,
the next send operation fails because Target's second response does not
fit into the small receive buffer.
This commit adds additional checks to prevent posting small receive
buffers until the full feature phase.
Fixes: 39b169ea0d ("IB/iser: Fix RNR errors")
Link: https://lore.kernel.org/r/20220805060135.18493-1-sergeygo@nvidia.com
Signed-off-by: Sergey Gorenko <sergeygo@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
According to the ib spec:
If the CI supports the Base Memory Management Extensions defined in this
specification, the L_Key format must consist of:
24 bit index in the most significant bits of the R_Key, and
8 bit key in the least significant bits of the R_Key
Through a successful Allocate L_Key verb invocation, the CI must let the
consumer own the key portion of the returned R_Key
Therefore, when creating a mkey using DEVX, the consumer is allowed to
change the key part. The kernel should compare only the index part of a
R_Key to determine equality with another R_Key.
Adding capability in order not to break backward compatibility.
Fixes: 534fd7aac5 ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Link: https://lore.kernel.org/r/3d669aacea85a3a15c3b3b953b3eaba3f80ef9be.1659255945.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This new function is defined only on ARM and serves to guarantee a barrier
in the WC operation. The barrier means that another run of this loop will
not combine with the stores this loop created.
On x86 this is happening implicitly because of the spin_unlock().
Link: https://lore.kernel.org/r/0-v1-c5dade92f363+11-mlx5_io_stop_wc_jgg@nvidia.com
Suggested-by: Pavel Shamis <Pavel.Shamis@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy,
Christoph Hellwig)
- restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
- allow the IOMMU code to communicate an optional DMA mapping length
and use that in scsi and libata (John Garry)
- split the global swiotlb lock (Tianyu Lan)
- various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
Lukas Bulwahn, Robin Murphy)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y
Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV
k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw
lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY
3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj
BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS
THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV
VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT
23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e
8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX
VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1
uek=
=Bv19
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
Murphy, Christoph Hellwig)
- restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
- allow the IOMMU code to communicate an optional DMA mapping length
and use that in scsi and libata (John Garry)
- split the global swiotlb lock (Tianyu Lan)
- various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
Lukas Bulwahn, Robin Murphy)
* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
swiotlb: fix passing local variable to debugfs_create_ulong()
dma-mapping: reformat comment to suppress htmldoc warning
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
RDMA/rw: drop pci_p2pdma_[un]map_sg()
RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
nvme-pci: convert to using dma_map_sgtable()
nvme-pci: check DMA ops when indicating support for PCI P2PDMA
iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
dma-direct: support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
PCI/P2PDMA: Attempt to set map_type if it has not been set
lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
swiotlb: clean up some coding style and minor issues
dma-mapping: update comment after dmabounce removal
scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
...
- Runtime verification infrastructure
This is the biggest change for this pull request. It introduces the
runtime verification that is necessary for running Linux on safety
critical systems. It allows for deterministic automata models to be
inserted into the kernel that will attach to tracepoints, where the
information on these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will then
activate a given reactor, that could just inform the user or even panic
the kernel (for which safety critical systems will detect and can recover
from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to be
confused with "work in progress"), and Wakeup While Not Running (WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace several
vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is left
off.
- The rest is various cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ
+y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA=
=kEtM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Runtime verification infrastructure
This is the biggest change here. It introduces the runtime
verification that is necessary for running Linux on safety critical
systems.
It allows for deterministic automata models to be inserted into the
kernel that will attach to tracepoints, where the information on
these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will
then activate a given reactor, that could just inform the user or
even panic the kernel (for which safety critical systems will detect
and can recover from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to
be confused with "work in progress"), and Wakeup While Not Running
(WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace
several vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is
left off.
- The rest is various cleanups and fixes.
* tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
rv: Unlock on error path in rv_unregister_reactor()
tracing: Use alignof__(struct {type b;}) instead of offsetof()
tracing/eprobe: Show syntax error logs in error_log file
scripts/tracing: Fix typo 'the the' in comment
tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT
tracing: Use free_trace_buffer() in allocate_trace_buffers()
tracing: Use a struct alignof to determine trace event field alignment
rv/reactor: Add the panic reactor
rv/reactor: Add the printk reactor
rv/monitor: Add the wwnr monitor
rv/monitor: Add the wip monitor
rv/monitor: Add the wip monitor skeleton created by dot2k
Documentation/rv: Add deterministic automata instrumentation documentation
Documentation/rv: Add deterministic automata monitor synthesis documentation
tools/rv: Add dot2k
Documentation/rv: Add deterministic automaton documentation
tools/rv: Add dot2c
Documentation/rv: Add a basic documentation
rv/include: Add instrumentation helper functions
rv/include: Add deterministic automata monitor definition via C macros
...
This PR includes a new RDMA driver for Alibaba Cloud hardware
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYuwAuAAKCRCFwuHvBreF
YcRDAQC41YJNs7xve7r62/E6M+o/AXiwXa+m8rGRvcP3mdilNAEAhdom6HskenMZ
/sopeBWF78M9plLvNzWkwukaqIwrXgM=
=abuq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This cycle we got a new RDMA driver "ERDMA" for the Alibaba cloud
environment. Otherwise the changes are dominated by rxe fixes.
There is another RDMA driver on the list that might get merged next
cycle, 'MANA' for the Azure cloud environment.
Summary:
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA/ib_srpt: Unify checking rdma_cm_id condition in srpt_cm_req_recv()
RDMA/rxe: Fix error unwind in rxe_create_qp()
RDMA/mlx5: Add missing check for return value in get namespace flow
RDMA/rxe: Split qp state for requester and completer
RDMA/rxe: Generate error completion for error requester QP state
RDMA/rxe: Update wqe_index for each wqe error completion
RDMA/srpt: Fix a use-after-free
RDMA/srpt: Introduce a reference count in struct srpt_device
RDMA/srpt: Duplicate port name members
IB/qib: Fix repeated "in" within comments
RDMA/erdma: Add driver to kernel build environment
RDMA/erdma: Add the ABI definitions
RDMA/erdma: Add the erdma module
RDMA/erdma: Add connection management (CM) support
RDMA/erdma: Add verbs implementation
RDMA/erdma: Add verbs header file
RDMA/erdma: Add event queue implementation
RDMA/erdma: Add cmdq implementation
RDMA/erdma: Add main include file
RDMA/erdma: Add the hardware related definitions
...
Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi,
mpi3mr). The main driver change that might cause issues on down the
road is the conversion of some of our oldest surviving drivers to the
DMA API (should only affect m68k). The only major core change is the
rework of async resume; the rest are either completely trivial or for
updating deprecated APIs.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYuvakyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfvOAP4m0N6b
e3JwoBtB1c0JMKv6G4gka8suEG8p5f4khDu8wwD+LfGUCzG49Y5Ts7rByXfEiGgO
krSdwsAZiV6yKg/HuPw=
=Ak9L
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, qla2xx, target, lpfc, smartpqi,
mpi3mr).
The main driver change that might cause issues on down the road is the
conversion of some of our oldest surviving drivers to the DMA API
(should only affect m68k).
The only major core change is the rework of async resume; the rest are
either completely trivial or for updating deprecated APIs"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (195 commits)
scsi: target: Remove XDWRITEREAD emulated support
scsi: megaraid: Remove the static variable initialisation
scsi: ch: Do not initialise statics to 0
scsi: ufs: core: Fix spelling mistake "Cannnot" -> "Cannot"
scsi: target: iscsi: Do not require target authentication
scsi: target: iscsi: Allow AuthMethod=None
scsi: target: iscsi: Support base64 in CHAP
scsi: target: iscsi: Add support for extended CDB AHS
scsi: ufs: dt-bindings: Add SC8280XP binding
scsi: target: iscsi: Fix clang -Wformat warnings
scsi: ufs: core: Read device property for ref clock
scsi: libsas: Resume SAS host for phy reset or enable via sysfs
scsi: hisi_sas: Modify v3 HW SATA completion error processing
scsi: hisi_sas: Relocate DMA unmap of SMP task
scsi: hisi_sas: Remove unnecessary variable to hold DMA map elements
scsi: hisi_sas: Call hisi_sas_slave_configure() from slave_configure_v3_hw()
scsi: mpi3mr: Delete a stray tab
scsi: mpi3mr: Unlock on error path
scsi: mpi3mr: Reduce VD queue depth on detecting throttling
scsi: mpi3mr: Resource Based Metering
...
Core
----
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source file
with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent
netdev_* schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF
---
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performances, inlining indirect calls when
possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allow accurate queries to the
eBPF used types.
- A few TCP congetsion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols
---------
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forciby close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MTPCP subflow priority/backup
status
- Introduce virtually contingus buffers for sockets over RDMA,
to cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API
----------
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers
----------------------
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers
-------
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP framented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model coversion improving scalability
(parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share
the probe code (parts 1, 2), add support for phylink
mac configuration
- Other WiFi:
- Microchip wilc1000: diable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: this is untouched since more than
10 years.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmLqN+oSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkB9kQAI9VqW0c3SfiTJnkVBEIovZ6Tnh5stD2
UYFkh1BdchLsYxi7W4XMpVPSzRztiTP87mIx5c/KvIzj+QNeWL1XWRJSPdI9HhTD
pTAA/tM2OG7bqrbyQiKDNfpQdNl7+kk1RwnYd+f9RFl1QVuIJaYhmjVwrsN5xF/+
jUsotpROarM2dGFWiFwJbKhP2zMDT+6qEEahM8pEPggKhv8wRLYjany2cZVEe4e0
WGUpbINAS8gEKm0Ob922WaDfDrcK/N1Z0jNz/kMaENkK18Vvc7F6bCO0DzAawKX9
QZMMwm6mHp3EThflJAMAzCGIYiIcwLhykgdyj8rrjPhFrWbMD2Sdsbo21HOXU/8j
u4aAhVl+d+h7emmbgBoJ8sycVJ7BQlXz7lX20sTgADv9xI4/dPhQ17CMRuwX6fXX
JSrn6P6e1LTV5CEg6vrlSPnKPY6uhFn/cPw47FxCjRwJ9phVnp+8uZWQmf9Pz3yf
Ok/tcj+juFbsmuOshHy2cbRkuNZNS0oRWlSTBo5795ZwOLSakMonR3L+ev2aOvzz
DVrFp2Y/iIVwMSFdCbouYdYnhArPRhOAtCmZc2afY8aBN7aaMgrdTy3+mzUoHy3I
FG3K+VuKpfi0vY4zn6ZoLZDIpyXIoJJ93RcSGltD32t3Dp1RaQMVEI4s45k05PVm
1nYpXKHA8qML
=hxEG
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Paolo Abeni:
"Core:
- Refactor the forward memory allocation to better cope with memory
pressure with many open sockets, moving from a per socket cache to
a per-CPU one
- Replace rwlocks with RCU for better fairness in ping, raw sockets
and IP multicast router.
- Network-side support for IO uring zero-copy send.
- A few skb drop reason improvements, including codegen the source
file with string mapping instead of using macro magic.
- Rename reference tracking helpers to a more consistent netdev_*
schema.
- Adapt u64_stats_t type to address load/store tearing issues.
- Refine debug helper usage to reduce the log noise caused by bots.
BPF:
- Improve socket map performance, avoiding skb cloning on read
operation.
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performances, inlining indirect calls when possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allow accurate queries to the eBPF
used types.
- A few TCP congetsion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols:
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forciby close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MTPCP subflow priority/backup
status
- Introduce virtually contingus buffers for sockets over RDMA, to
cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API:
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers:
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers:
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP framented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model coversion improving scalability (parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share the
probe code (parts 1, 2), add support for phylink mac
configuration
- Other WiFi:
- Microchip wilc1000: diable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: this is untouched since more than 10 years"
* tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits)
doc: sfp-phylink: Fix a broken reference
wireguard: selftests: support UML
wireguard: allowedips: don't corrupt stack when detecting overflow
wireguard: selftests: update config fragments
wireguard: ratelimiter: use hrtimer in selftest
net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ
net: usb: ax88179_178a: Bind only to vendor-specific interface
selftests: net: fix IOAM test skip return code
net: usb: make USB_RTL8153_ECM non user configurable
net: marvell: prestera: remove reduntant code
octeontx2-pf: Reduce minimum mtu size to 60
net: devlink: Fix missing mutex_unlock() call
net/tls: Remove redundant workqueue flush before destroy
net: txgbe: Fix an error handling path in txgbe_probe()
net: dsa: Fix spelling mistakes and cleanup code
Documentation: devlink: add add devlink-selftests to the table of contents
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
net: ionic: fix error check for vlan flags in ionic_set_nic_features()
net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr()
nfp: flower: add support for tunnel offload without key ID
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
4N1HEozvIw==
=celk
-----END PGP SIGNATURE-----
Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
Although rdma_cm_id and ib_cm_id passing to srpt_cm_req_recv() are
exclusive currently, all other checking condition are using rdma_cm_id.
So unify the 'if' condition to make the code more clear.
Link: https://lore.kernel.org/r/1659336226-2-1-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the function rxe_create_qp(), rxe_qp_from_init() is called to
initialize qp, internally things like the spin locks are not setup until
rxe_qp_init_req().
If an error occures before this point then the unwind will call
rxe_cleanup() and eventually to rxe_qp_do_cleanup()/rxe_cleanup_task()
which will oops when trying to access the uninitialized spinlock.
Move the spinlock initializations earlier before any failures.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20220731063621.298405-1-yanjun.zhu@linux.dev
Reported-by: syzbot+833061116fa28df97f3b@syzkaller.appspotmail.com
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add missing check for return value when calling to
mlx5_ib_ft_type_to_namespace, even though it can't really fail in this
specific call.
Fixes: 52438be441 ("RDMA/mlx5: Allow inserting a steering rule to the FDB")
Link: https://lore.kernel.org/r/7b9ceda217d9368a51dc47a46b769bad4af9ac92.1659256069.git.leonro@nvidia.com
Reviewed-by: Itay Aveksis <itayav@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently the requester can continue to process send wqes after an local
qp operation error is detected because the setting of the qp state to the
error state is deferred until later. This patch splits the qp state for
the completer and requester into two separate states and sets
qp->req.state = QP_STATE_ERROR as soon as the error is detected before
another wqe can be executed.
Link: https://lore.kernel.org/r/1658307368-1851-4-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
As per IBTA specification, all subsequent WQEs while QP is in error state
should be completed with a flush error.
Here we check QP_STATE_ERROR after req_next_wqe() so that rxe_completer()
has chance to be called where it will set CQ state to FLUSH ERROR and the
completion can associate with its WQE.
Link: https://lore.kernel.org/r/1658307368-1851-3-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Previously, if user space keeps sending abnormal wqe, queue.index will
keep increasing while qp->req.wqe_index doesn't. Once
qp->req.wqe_index==queue.index in next round, req_next_wqe() will treat
queue as empty. In such case, no new completion would be generated.
Update wqe_index for each wqe completion so that req_next_wqe() can get
next wqe properly.
Link: https://lore.kernel.org/r/1658307368-1851-2-git-send-email-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change the LIO port members inside struct srpt_port from regular members
into pointers. Allocate the LIO port data structures from inside
srpt_make_tport() and free these from inside srpt_make_tport(). Keep
struct srpt_device as long as either an RDMA port or a LIO target port is
associated with it. This patch decouples the lifetime of struct srpt_port
(controlled by the RDMA core) and struct srpt_port_id (controlled by LIO).
This patch fixes the following KASAN complaint:
BUG: KASAN: use-after-free in srpt_enable_tpg+0x31/0x70 [ib_srpt]
Read of size 8 at addr ffff888141cc34b8 by task check/5093
Call Trace:
<TASK>
show_stack+0x4e/0x53
dump_stack_lvl+0x51/0x66
print_address_description.constprop.0.cold+0xea/0x41e
print_report.cold+0x90/0x205
kasan_report+0xb9/0xf0
__asan_load8+0x69/0x90
srpt_enable_tpg+0x31/0x70 [ib_srpt]
target_fabric_tpg_base_enable_store+0xe2/0x140 [target_core_mod]
configfs_write_iter+0x18b/0x210
new_sync_write+0x1f2/0x2f0
vfs_write+0x3e3/0x540
ksys_write+0xbb/0x140
__x64_sys_write+0x42/0x50
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Link: https://lore.kernel.org/r/20220727193415.1583860-4-bvanassche@acm.org
Reported-by: Li Zhijian <lizhijian@fujitsu.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Fixes: a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This will be used to keep struct srpt_device around as long as either the
RDMA port exists or a LIO target port is associated with the struct
srpt_device.
Link: https://lore.kernel.org/r/20220727193415.1583860-3-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Prepare for decoupling the lifetimes of struct srpt_port and struct
srpt_port_id by duplicating the port name into struct srpt_port.
Link: https://lore.kernel.org/r/20220727193415.1583860-2-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Cheng Xu says
====================
This v14 patch set introduces the Elastic RDMA Adapter (ERDMA) driver,
which released in Apsara Conference 2021 by Alibaba. The PR of ERDMA
userspace provider has already been created [1].
ERDMA enables large-scale RDMA acceleration capability in Alibaba ECS
environment, initially offered in g7re instance. It can improve the
efficiency of large-scale distributed computing and communication
significantly and expand dynamically with the cluster scale of Alibaba
Cloud.
ERDMA is a RDMA networking adapter based on the Alibaba MOC hardware. It
works in the VPC network environment (overlay network), and uses iWarp
transport protocol. ERDMA supports reliable connection (RC). ERDMA also
supports both kernel space and user space verbs. Now we have already
supported HPC/AI applications with libfabric, NoF and some other internal
verbs libraries, such as xrdma, epsl, etc,.
For the ECS instance with RDMA enabled, our MOC hardware generates two
kinds of PCI devices: one for ERDMA, and one for the original net device
(virtio-net). They are separated PCI devices.
====================
* branch 'erdma':
RDMA/erdma: Add driver to kernel build environment
RDMA/erdma: Add the ABI definitions
RDMA/erdma: Add the erdma module
RDMA/erdma: Add connection management (CM) support
RDMA/erdma: Add verbs implementation
RDMA/erdma: Add verbs header file
RDMA/erdma: Add event queue implementation
RDMA/erdma: Add cmdq implementation
RDMA/erdma: Add main include file
RDMA/erdma: Add the hardware related definitions
RDMA: Add ERDMA to rdma_driver_id definition
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add the main erdma module, which provides interface to infiniband
subsystem.
This commit includes a modification from Christophe, that using the bitmap
API to allocate bitmaps instead of hand-writing. And the commit also fixes
warnings reported by static checkers.
Link: https://lore.kernel.org/r/20220727014927.76564-10-chengyou@linux.alibaba.com
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ERDMA's transport protocol is iWarp, so the driver must support CM
interface. In CM part, we use the same way as SoftiWarp: using kernel
socket to set up the connection, then performing MPA negotiation in
kernel. So, this part of code mainly comes from SoftiWarp, base on it,
we add some more features, such as non-blocking iw_connect implementation.
This commit also fixes a duplicated include issue reported by Abaci Robot.
Link: https://lore.kernel.org/r/20220727014927.76564-9-chengyou@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The RDMA verbs implementation of erdma is divided into three files:
erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internal used functions and
datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c, the rest
is in erdma_verbs.c.
This commit also fixes some static check warnings.
Link: https://lore.kernel.org/r/20220727014927.76564-8-chengyou@linux.alibaba.com
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This header file defines the main structures and functions used for RDMA
Verbs, including qp, cq, mr, ucontext, etc,.
Link: https://lore.kernel.org/r/20220727014927.76564-7-chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Event queue (EQ) is the main notification way from erdma hardware to its
driver. Each erdma device contains 2 kinds EQs: asynchronous EQ (AEQ) and
completion EQ (CEQ). Per device has 1 AEQ, which used for RDMA async event
report, and max to 32 CEQs (numbered for CEQ0 to CEQ31). CEQ0 is used for
cmdq completion event report, and the rest CEQs are used for RDMA
completion event report.
Link: https://lore.kernel.org/r/20220727014927.76564-6-chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Cmdq is the main control plane channel between erdma driver and hardware.
After erdma device is initialized, the cmdq channel will be active in the
whole lifecycle of this driver.
This commit also includes two modifications from Christophe, one is using
the bitmap API to allocate bitmaps instead of hand-writing, and another
is using the non-atomic bitmap API when applicable.
Link: https://lore.kernel.org/r/20220727014927.76564-5-chengyou@linux.alibaba.com
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add ERDMA driver main header file, defining internal used data structures
and operations. The defined data structures includes *cmdq*, which is used
as the communication channel between ERDMA driver and hardware.
Link: https://lore.kernel.org/r/20220727014927.76564-4-chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ERDMA is a PCIe device, and this file provides ERDMA hardware related
definitions, mainly including PCIe device capabilities and restrictions,
device registers definitions, doorbell space, doorbell structure
definitions and WQE definitions.
Link: https://lore.kernel.org/r/20220727014927.76564-3-chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
After replacing the MR cache with an Mkey cache, rename the variables and
functions to fit the new meaning.
Link: https://lore.kernel.org/r/20220726071911.122765-6-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, the driver stores mlx5_ib_mr struct in the cache entries,
although the only use of the cached MR is the mkey. Store only the mkey in
the cache.
Link: https://lore.kernel.org/r/20220726071911.122765-5-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
total_mrs is used only to calculate the number of mkeys currently in
use. To simplify things, replace it with a new member called "in_use" and
directly store the number of mkeys currently in use.
Link: https://lore.kernel.org/r/20220726071911.122765-4-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The Xarray allows us to store the cached mkeys in memory efficient way.
Entries are reserved in the Xarray using xa_cmpxchg before calling to the
upcoming callbacks to avoid allocations in interrupt context. The
xa_cmpxchg can sleep when using GFP_KERNEL, so we call it in a loop to
ensure one reserved entry for each process trying to reserve.
Link: https://lore.kernel.org/r/20220726071911.122765-3-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the next patch, ent->list will be replaced with an xarray. The xarray
uses an internal lock to protect the indexes. Use it to protect all the
entry fields, and get rid of ent->lock.
Link: https://lore.kernel.org/r/20220726071911.122765-2-michaelgur@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In rxe_req.c replace calls to __rxe_do_task() by calls to rxe_run_task(..,
0). Using __rxe_do_task is an error because the completer tasklet is not
designed to be re-entrant and __rxe_do_task() should only be called when
it is clear that no one else could be calling the completer tasklet as is
the case in rxe_qp.c where this call is used in safe environments.
Link: https://lore.kernel.org/r/20220630190425.2251-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Limit the maximum number of calls to each tasklet from rxe_do_task()
before yielding the cpu. When the limit is reached reschedule the tasklet
and exit the calling loop. This patch prevents one tasklet from consuming
100% of a cpu core and causing a deadlock or soft lockup.
Link: https://lore.kernel.org/r/20220630190425.2251-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Make changes to the three tasklets so that the exit logic from each is the
same. This makes the code easier to understand.
Link: https://lore.kernel.org/r/20220630190425.2251-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently the completer tasklet when retransmit timer or the rnr timer
fires the same flag (qp->req.need_retry) is set so that if either timer
fires it will attempt to perform a retry flow on the send queue. This has
the effect of responding to an RNR NAK at the first retransmit timer event
which might not allow the requested rnr timeout.
This patch adds a new flag (qp->req.wait_for_rnr_timer) which, if set,
prevents a retry flow until the rnr nak timer fires.
This patch fixes rnr retry errors which can be observed by running the
pyverbs test_rdmacm_async_traffic_external_qp multiple times. With this
patch applied they do not occur.
Link: https://lore.kernel.org/linux-rdma/a8287823-1408-4273-bc22-99a0678db640@gmail.com/
Link: https://lore.kernel.org/linux-rdma/2bafda9e-2bb6-186d-12a1-179e8f6a2678@talpey.com/
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20220630190425.2251-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The code thc that decides whether to defer execution of a wqe in
rxe_requester.c is isolated into a subroutine rxe_is_fenced() and removed
from the call to req_next_wqe(). The condition whether a wqe should be
fenced is changed to comply with the IBA. Currently an operation is fenced
if the fence bit is set in the wqe flags and the last wqe has not
completed. For normal operations the IBA actually only requires that the
last read or atomic operation is complete.
Link: https://lore.kernel.org/r/20220630190425.2251-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The 'rkey' input can be an lkey or rkey, and in rxe the lkey or rkey have
the same value, including the variant bits.
So, if mr->rkey is set, compare the invalidate key with it, otherwise
compare with the mr->lkey.
Since we already did a lookup on the non-varient bits to get this far, the
check's only purpose is to confirm that the wqe has the correct variant
bits.
Fixes: 001345339f ("RDMA/rxe: Separate HW and SW l/rkeys")
Link: https://lore.kernel.org/r/20220707073006.328737-1-haris.phnx@gmail.com
Signed-off-by: Md Haris Iqbal <haris.phnx@gmail.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a spelling mistake (writeable) in function rxe_check_bind_mw.
Fix it.
Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Expose a steering anchor per priority to allow users to re-inject
packets back into default NIC pipeline for additional processing.
MLX5_IB_METHOD_STEERING_ANCHOR_CREATE returns a flow table ID which
a user can use to re-inject packets at a specific priority.
A FTE (flow table entry) can be created and the flow table ID
used as a destination.
When a packet is taken into a RDMA-controlled steering domain (like
software steering) there may be a need to insert the packet back into
the default NIC pipeline. This exposes a flow table ID to the user that can
be used as a destination in a flow table entry.
With this new method priorities that are exposed to users via
MLX5_IB_METHOD_FLOW_MATCHER_CREATE can be reached from a non-zero UID.
As user-created flow tables (via RDMA DEVX) are created with a non-zero UID
thus it's impossible to point to a NIC core flow table (core driver flow tables
are created with UID value of zero) from userspace.
Create flow tables that are exposed to users with the shared UID, this
allows users to point to default NIC flow tables.
Steering loops are prevented at FW level as FW enforces that no flow
table at level X can point to a table at level lower than X.
Link: https://lore.kernel.org/all/20220703205407.110890-6-saeed@kernel.org/
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
_get_flow_table() requires the entire matcher being passed
while all it needs is the priority and namespace type.
Pass the priority and namespace type directly instead.
Link: https://lore.kernel.org/all/20220703205407.110890-5-saeed@kernel.org/
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The qp parameter in free_rd_atomic_resource() has become
unused so remove it directly.
Fixes: 15ae1375ea ("RDMA/rxe: Fix qp reference counting for atomic ops")
Link: https://lore.kernel.org/all/20220708035547.6592-1-yangx.jy@fujitsu.com/
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
setup_base_ctxt() allocates a memory chunk for uctxt->groups with
hfi1_alloc_ctxt_rcv_groups(). When init_user_ctxt() fails, uctxt->groups
is not released, which will lead to a memory leak.
We should release the uctxt->groups with hfi1_free_ctxt_rcv_groups()
when init_user_ctxt() fails.
Fixes: e87473bc1b ("IB/hfi1: Only set fd pointer when base context is completely initialized")
Link: https://lore.kernel.org/r/20220711070718.2318320-1-niejianglei2021@163.com
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This parameter had been deprecated since below commit:
1a7085b342 ("RDMA/rxe: Skip adjusting remote addr for write in retry operation")
Link: https://lore.kernel.org/r/20220715035340.1900168-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
It's redundant to prepare resources for Read and Atomic
requests by different functions. Replace them by a common
rxe_prepare_res() with different parameters. In addition,
the common rxe_prepare_res() can also be used by new Flush
and Atomic Write requests in the future.
Link: https://lore.kernel.org/r/20220705145212.12014-1-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The function rxe_create_qp calls rxe_qp_from_init. If some error
occurs, the error handler of function rxe_qp_from_init will set
both scq and rcq to NULL.
Then rxe_create_qp calls rxe_put to handle qp. In the end,
rxe_qp_do_cleanup is called by rxe_put. rxe_qp_do_cleanup directly
accesses scq and rcq before checking them. This will cause
null-ptr-deref error.
The call graph is as below:
rxe_create_qp {
...
rxe_qp_from_init {
...
err1:
...
qp->rcq = NULL; <---rcq is set to NULL
qp->scq = NULL; <---scq is set to NULL
...
}
qp_init:
rxe_put{
...
rxe_qp_do_cleanup {
...
atomic_dec(&qp->scq->num_wq); <--- scq is accessed
...
atomic_dec(&qp->rcq->num_wq); <--- rcq is accessed
}
}
Fixes: 4703b4f0d9 ("RDMA/rxe: Enforce IBA C11-17")
Link: https://lore.kernel.org/r/20220705225414.315478-1-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Bob Pearson <rpearsonhpe@gmail.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
If siw_recv_mpa_rr returns -EAGAIN, it means that the MPA reply hasn't
been received completely, and should not report IW_CM_EVENT_CONNECT_REPLY
in this case. This may trigger a call trace in iw_cm. A simple way to
trigger this:
server: ib_send_lat
client: ib_send_lat -R <server_ip>
The call trace looks like this:
kernel BUG at drivers/infiniband/core/iwcm.c:894!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
<...>
Workqueue: iw_cm_wq cm_work_handler [iw_cm]
Call Trace:
<TASK>
cm_work_handler+0x1dd/0x370 [iw_cm]
process_one_work+0x1e2/0x3b0
worker_thread+0x49/0x2e0
? rescuer_thread+0x370/0x370
kthread+0xe5/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 6c52fdc244 ("rdma/siw: connection management")
Link: https://lore.kernel.org/r/dae34b5fd5c2ea2bd9744812c1d2653a34a94c67.1657706960.git.chengyou@linux.alibaba.com
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Since ECC memory maintains a memory system immune to single-bit errors,
add support for correcting the 1bit-ECC error, which prevents a 1bit-ECC
error become an uncorrected type error. When a 1bit-ECC error happens in
the internal ram of the ROCE engine, such as the QPC table, as a 1bit-ECC
error caused by reading, the ROCE engine only corrects those 1bit ECC
errors by writing.
Link: https://lore.kernel.org/r/20220714134353.16700-6-liangwenpeng@huawei.com
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Use a single function to handle the same kind of abnormal interrupts.
Link: https://lore.kernel.org/r/20220714134353.16700-5-liangwenpeng@huawei.com
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The driver will clear all the interrupts in the same area
when the driver handles the interrupt of type AEQ overflow.
It should only set the interrupt status bit of type AEQ overflow.
Fixes: a5073d6054 ("RDMA/hns: Add eq support of hip08")
Link: https://lore.kernel.org/r/20220714134353.16700-4-liangwenpeng@huawei.com
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The type of return value of the interrupt handler should be irqreturn_t.
Link: https://lore.kernel.org/r/20220714134353.16700-3-liangwenpeng@huawei.com
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The HNS NIC driver receives and handles the abnormal interrupt of the RAS
type generated by ROCEE, and the HNS RDMA driver does not need to handle
this type of interrupt. Therefore, delete unused codes in the HNS RDMA
driver.
Link: https://lore.kernel.org/r/20220714134353.16700-2-liangwenpeng@huawei.com
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
__qedr_alloc_mr() allocates a memory chunk for "mr->info.pbl_table" with
init_mr_info(). When rdma_alloc_tid() and rdma_register_tid() fail, "mr"
is released while "mr->info.pbl_table" is not released, which will lead
to a memory leak.
We should release the "mr->info.pbl_table" with qedr_free_pbl() when error
occurs to fix the memory leak.
Fixes: e0290cce6a ("qedr: Add support for memory registeration verbs")
Link: https://lore.kernel.org/r/20220714061505.2342759-1-niejianglei2021@163.com
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Both hfi1 and UML depend on x86_64, this can trigger build errors.
This driver must depends on !UML because it accesses x86_64
features that are not supported by UML.
Link: https://lore.kernel.org/r/165755127879.2996325.5668395672492732376.stgit@awfm-02.cornelisnetworks.com
Signed-off-by: Ehab Ababneh <ehab.ababneh@cornelisnetworks.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The mempool is for guaranteed memory allocation during
extreme VM load (see the header of mempool.c of the kernel).
But rtrs-srv allocates pages only when creating new session.
There is no need to use the mempool.
With the removal of mempool, rtrs-server no longer need to reserve
huge mount of memory, this will avoid error like this:
https://lore.kernel.org/lkml/20220620020727.GA3669@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/r/20220712103113.617754-6-haris.iqbal@ionos.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
removes list_next_or_null_rr_rcu macro to fix below warnings.
That macro is used only twice.
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'head' - possible side-effects?
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'ptr' - possible side-effects?
CHECK:MACRO_ARG_REUSE: Macro argument reuse 'memb' - possible side-effects?
Replaces that macro with an inline function.
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Cc: jinpu.wang@ionos.com
Link: https://lore.kernel.org/r/20220712103113.617754-5-haris.iqbal@ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Convert server stat counters from atomic to per-cpu variables.
Link: https://lore.kernel.org/r/20220712103113.617754-4-haris.iqbal@ionos.com
Signed-off-by: Santosh Kumar Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Use this_cpu_x() for increasing/adding a percpu counter through a
percpu pointer without the need to disable/enable preemption.
Link: https://lore.kernel.org/r/20220712103113.617754-3-haris.iqbal@ionos.com
Suggested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Santosh Kumar Pradhan <santosh.pradhan@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
stringify works with define, not enum.
Fixes: 91fddedd43 ("RDMA/rtrs: private headers with rtrs protocol structs and helpers")
Cc: jinpu.wang@ionos.com
Link: https://lore.kernel.org/r/20220712103113.617754-2-haris.iqbal@ionos.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Aleksei Marov <aleksei.marov@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Comparison of 'cq' with NULL is useless since
'cq' is a result of container_of and cannot be NULL
in any reasonable scenario.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 303ae1cdfd ("rdma/siw: application interface")
Link: https://lore.kernel.org/r/20220711151251.17089-1-strochuk@ispras.ru
Signed-off-by: Andrey Strachuk <strochuk@ispras.ru>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Setting err_rq_idx_valid field in QP context when the AE source of the
AEQE is not associated with an RQ causes the firmware flush to fail.
Set err_rq_idx_valid field in QP context only if it is associated with an
RQ. Additionally, cleanup the redundant setting of this field in
irdma_process_aeq.
Fixes: 44d9e52977 ("RDMA/irdma: Implement device initialization definitions")
Link: https://lore.kernel.org/r/20220705230815.265-8-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When an application listens on a wildcard address, and there are VLAN and
non-VLAN IP addresses, iWARP connection establishemnt can fail if the listen
node VLAN ID does not match.
Fix this by checking the vlan_id only if not a wildcard listen node.
Fixes: 146b9756f1 ("RDMA/irdma: Add connection manager")
Link: https://lore.kernel.org/r/20220705230815.265-7-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
During a destroy CQ an interrupt may cause processing of a CQE after CQ
resources are freed by irdma_cq_free_rsrc(). Fix this by moving the call
to irdma_cq_free_rsrc() after the irdma_sc_cleanup_ceqes(), which is
called under the cq_lock.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Link: https://lore.kernel.org/r/20220705230815.265-6-shiraz.saleem@intel.com
Signed-off-by: Bartosz Sobczak <bartosz.sobczak@intel.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Adapt the resource distribution algorithm in irdma_cfg_fpm_val to be more
QP oriented. If the configuration is too big for the available memory,
trim the MR and PBLE's first before trimming the QPs. This also avoids
having to double QPs requested as input to algorithm for GEN1 devices.
Link: https://lore.kernel.org/r/20220705230815.265-5-shiraz.saleem@intel.com
Signed-off-by: Nayan Kumar <nayan.kumar@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The invalid state error returned by the Control Queue-Pair (CQP) is not a
critical error.
Add it to the irdma_noncrit_err_list and drop reporting it as device error
message.
Link: https://lore.kernel.org/r/20220705230815.265-4-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
To assist with debugging add the Asynchronous Event (AE) source when
logging the abnormal AE error log message.
Link: https://lore.kernel.org/r/20220705230815.265-3-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Level 2 Physical Buffer List Entry (PBLE) is currently not supported for
Fast MRs which limits memory registrations to 256K pages.
Adapt irdma_set_page and irdma_alloc_mr to allow for 2 level PBLEs.
Link: https://lore.kernel.org/r/20220705230815.265-2-shiraz.saleem@intel.com
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
We want to remove the weight argument from the basic
netif_napi_add() API and just default to 64.
Switch ipoib to the new API for explicitly specifying
the weight.
Link: https://lore.kernel.org/r/20220705230208.924408-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Since we'll remove the last argument from netif_napi_add()
soon switch this RDMA driver to netif_napi_add_weight()
for now to avoid cross-tree patches.
Link: https://lore.kernel.org/r/20220705230208.924408-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>