As qedr driver supports both RoCE and iWarp, make sure to set the
max_pkeys only when running in RoCE mode.
Link: https://lore.kernel.org/r/20200827141655.406185-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This function has a lot of gotos which could be replaced by simple
returns, making the function tidier and less bug prone.
Link: https://lore.kernel.org/r/20200825171242.448447-2-alex.dewar90@gmail.com
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Commit 36a8f01cd2 ("IB/qib: Add congestion control agent
implementation") erroneously marked a couple of switch cases as /*
FALLTHROUGH */, which were later converted to fallthrough statements by
commit df561f6688 ("treewide: Use fallthrough pseudo-keyword"). This
triggered a Coverity warning about unreachable code.
Remove the fallthrough statements.
Link: https://lore.kernel.org/r/20200825171242.448447-1-alex.dewar90@gmail.com
Addresses-Coverity: ("Unreachable code")
Fixes: 36a8f01cd2 ("IB/qib: Add congestion control agent implementation")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl9ML+IeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGA8EIAIy/kTbFS0yrE9yV
hb98oX0z9+EU9YQg9vhaRWwPd+rJF/JMQZLqYcwbhjG9abaUL3T3fEcSAefMHw8E
LAt+hYzA38dHt7tqhsFQX3vV1VorvDVICBVN0yRPRWKKikq4OPIHzaAR9tleGAF5
8btQisl1PjN+obwYmLuNb6aX16OCwAF+uXOwehcoJs9dvMNhwtXRzfOflWzOvOo6
tE0bHErlylLDfLv4ZzEfczTdks4QJZ7C0xLSf3oN9AAynW42Xnhct4hi8qZY/hCf
CMaqeN4hdpub6TvQIqBdDqMMjEXGFgeNSnAEBQY9VpvUqz8NTu6sQxwgJEKDF5tg
d81lv2c=
=uW/F
-----END PGP SIGNATURE-----
Merge tag 'v5.9-rc3' into rdma.git for-next
Required due to dependencies in following patches.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change rxe pools to use kzalloc instead of kmem_cache to allocate memory
for rxe objects. The pools are not really necessary and they trigger
hardened user copy warnings as the ioctl framework copies the QP number
directly to userspace.
Also the general project to move object alloation to the core code will
eventually clean these out anyhow.
Link: https://lore.kernel.org/r/20200827163535.2632-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Drivers that fail destroy can cause uverbs to leak uobjects. Drivers are
required to always eventually destroy their ubojects, so trigger a WARN_ON
to detect this driver bug.
Link: https://lore.kernel.org/r/0-v1-b1e0ed400ba9+f7-warn_destroy_ufile_hw_jgg@nvidia.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The UDP source port number in RoCE v2 is used to create entropy for
network routers (ECMP), load balancers and 802.3ad link aggregation
switching that are not aware of RoCE IB headers. Considering that the IB
core has achieved a new interface to get a hashed value of it, the fixed
value of it in QPC and UD WQE in hns driver could be fixed and the port
number is to be set dynamically now.
For QPC of RC, the value could be hashed from flow_lable if the user pass
it in or from remote qpn and local qpn. For WQE of UD, it is set according
to fl or as a random value.
Link: https://lore.kernel.org/r/1598002289-8611-1-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It should be considered an illegal operation if the ULP attempts to modify
a QP from another state to the current hardware state. Otherwise, the ULP
can modify some fields of QPC at any time. For example, for a QP in state
of RTS, modify it from RTR to RTS can change the PSN, which is always not
as expected.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Link: https://lore.kernel.org/r/1598353674-24270-1-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Driver crashes when destroy_qp is re-tried because of an error
returned. This is because the qp entry was removed from the qp list during
the first call.
Remove qp from the list only if destroy_qp returns success.
The driver will still trigger a WARN_ON due to the memory leaking, but at
least it isn't corrupting memory too.
Fixes: 8dae419f9e ("RDMA/bnxt_re: Refactor queue pair creation code")
Link: https://lore.kernel.org/r/1598292876-26529-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When computing the first psn entry, driver checks for page alignment. If
this address is not page aligned,it attempts to compute the offset in that
page for later use by using ALIGN macro. ALIGN macro does not return
offset bytes but the requested aligned address and hence cannot be used
directly to store as offset. Since driver was using the address itself
instead of offset, it resulted in invalid address when filling the psn
buffer.
Fixed driver to use PAGE_MASK macro to calculate the offset.
Fixes: fddcbbb02a ("RDMA/bnxt_re: Simplify obtaining queue entry from hw ring")
Link: https://lore.kernel.org/r/1598292876-26529-7-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Some adapters report more than 256 gid entries. Restrict it to 256 for
now.
Fixes: 1ac5a4047975("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Link: https://lore.kernel.org/r/1598292876-26529-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
At first, driver allocates memory for NQ based on qplib_ctx->cq_count and
qplib_ctx->srqc_count. Later when creating ring, it uses a static value
of 128K -1.
Fixing this with a static value for now.
Fixes: b08fe048a6 ("RDMA/bnxt_re: Refactor net ring allocation function")
Link: https://lore.kernel.org/r/1598292876-26529-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
qp->id can be a value outside the max number of qp. Indexing the qp table
with the id can cause out of bounds crash. So changing the qp table
indexing by (qp->id % max_qp -1).
Allocating one extra entry for QP1. Some adapters create one more than the
max_qp requested to accommodate QP1. If the qp->id is 1, store the
inforamtion in the last entry of the qp table.
Fixes: f218d67ef0 ("RDMA/bnxt_re: Allow posting when QPs are in error")
Link: https://lore.kernel.org/r/1598292876-26529-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
QP1 Rx CQE reports transparent VLAN ID in the completion and this is used
while reporting the completion for received MAD packet. Check if the vlan
id is configured before reporting it in the work completion.
Fixes: 84511455ac ("RDMA/bnxt_re: report vlan_id and sl in qp1 recv completion")
Link: https://lore.kernel.org/r/1598292876-26529-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the pkey_table is not available (which is the case when RoCE is not
supported), the cited commit caused a regression where mlx4_devices
without RoCE are not created.
Fix this by returning a pkey table length of zero in procedure
eth_link_query_port() if the pkey-table length reported by the device is
zero.
Link: https://lore.kernel.org/r/20200824110229.1094376-1-leon@kernel.org
Cc: <stable@vger.kernel.org>
Fixes: 1901b91f99 ("IB/core: Fix potential NULL pointer dereference in pkey cache")
Fixes: fa417f7b52 ("IB/mlx4: Add support for IBoE")
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core, this callback
can be removed from the usnic provider. The libfabric userspace never
touches the pkey.
Link: https://lore.kernel.org/r/20200820125346.111902-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use cancel_work_sync() to ensure that the wq is not running and simply
assign NULL to ctx->cm_id to indicate if the work ran or not. Delete the
close_wq since flush_workqueue() is no longer needed.
Link: https://lore.kernel.org/r/20200818120526.702120-15-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When a new connection is established the RDMA CM creates a new cm_id and
passes it through to the event handler. However inside the UCMA the new ID
is not assigned a ucma_context until the user retrieves the event from a
syscall.
This creates a weird edge condition where a cm_id's context can continue
to point at the listening_id that created it, and a number of additional
edge conditions on event list clean up related to destroying half created
IDs.
There is also a race condition in ucma_get_events() where the
cm_id->context is being assigned without holding the handler_mutex.
Simplify all of this by creating the ucma_context inside the event handler
itself and eliminating the edge case of a half created cm_id. All cm_id's
can be uniformly destroyed via __destroy_id() or via the close_work.
Link: https://lore.kernel.org/r/20200818120526.702120-14-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since the backlog is now an atomic the file->mut is now only protecting
the event_list and ctx_list. Narrow its scope to make it clear
Link: https://lore.kernel.org/r/20200818120526.702120-13-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
All entry points to the rdma_cm from a ULP must be single threaded,
even this error unwinds. Add the missing locking.
Fixes: 7c11910783 ("RDMA/ucma: Put a lock around every call to the rdma_cm layer")
Link: https://lore.kernel.org/r/20200818120526.702120-11-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This value is locked under the file->mut, ensure it is held whenever
touching it.
The case in ucma_migrate_id() is a race, while in ucma_free_uctx() it is
already not possible for the write side to run, the movement is just for
clarity.
Fixes: 88314e4dda ("RDMA/cma: add support for rdma_migrate_id()")
Link: https://lore.kernel.org/r/20200818120526.702120-10-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ctx->file is changed under the file->mut lock by ucma_migrate_id(), which
is impossible to lock correctly. Instead change ctx->file under the
handler_lock and ctx_table lock and revise all places touching ctx->file
to use this locking when reading ctx->file.
Link: https://lore.kernel.org/r/20200818120526.702120-9-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The only reader of destroying is inside a handler under the handler_mutex,
so directly use the handler_mutex when setting it instead of the larger
file->mut.
As the refcount could be zero here, and the cm_id already freed, and
additional refcount grab around the locking is required to touch the
cm_id.
Link: https://lore.kernel.org/r/20200818120526.702120-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In almost all cases rdma_accept() is called under the handler_mutex by
ULPs from their handler callbacks. The one exception was ucma which did
not get the handler_mutex.
To improve the understand-ability of the locking scheme obtain the mutex
for ucma as well.
This improves how ucma works by allowing it to directly use handler_mutex
for some of its internal locking against the handler callbacks intead of
the global file->mut lock.
There does not seem to be a serious bug here, other than a DISCONNECT event
can be delivered concurrently with accept succeeding.
Link: https://lore.kernel.org/r/20200818120526.702120-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is not really necessary to keep a linked list of mcs associated with
each context when we can just scan the xarray to find the right things.
The removes another overloading of file->mut by relying on the xarray
locking for mc instead.
Link: https://lore.kernel.org/r/20200818120526.702120-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The store to ctx->cm_id was based on the idea that _ucma_find_context()
would not return the ctx until it was fully setup.
Without locking this doesn't work properly.
Split things so that the xarray is allocated with NULL to reserve the ID
and once everything is final set the cm_id and store.
Along the way this shows that the error unwind in ucma_get_event() if a
new ctx is created is wrong, fix it up.
Link: https://lore.kernel.org/r/20200818120526.702120-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ucma_close() is open coding the tail end of ucma_destroy_id(), consolidate
this duplicated code into a function.
Link: https://lore.kernel.org/r/20200818120526.702120-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
During the file_operations release function it is already not possible
that write() can be running concurrently, remove the extra locking
around the ctx_list.
Link: https://lore.kernel.org/r/20200818120526.702120-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Both ucma_destroy_id() and ucma_close_id() (triggered from an event via a
wq) can drive the refcount to zero. ucma_get_ctx() was wrongly assuming
that the refcount can only go to zero from ucma_destroy_id() which also
removes it from the xarray.
Use refcount_inc_not_zero() instead.
Link: https://lore.kernel.org/r/20200818120526.702120-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When DCT QPs work in RoCE LAG mode:
1. DCT creation is allowed only when it is supported
2. The "port" of a DCT QP is assigned in a round-robin way
Link: https://lore.kernel.org/r/20200818115245.700581-3-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
DCI QP supports tx_affinity as well.
Link: https://lore.kernel.org/r/20200818115245.700581-2-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the interest of converging on a common instrumentation infrastructure,
modernize the pr_debug() call sites added by commit 119bf81793 ("IB/cm:
Add debug prints to ib_cm"). The new tracepoints appear in a new "ib_cma"
subsystem.
The conversion is somewhat mechanical. Someone more familiar with the
semantics of the recorded information might suggest additional data
capture.
Some benefits include:
- Tracepoints enable "always on" reporting of these errors
- The error records are structured and compact
- Tracepoints provide hooks for eBPF scripts
Sample output:
nfsd-1954 [003] 62.017901: icm_dreq_skipped: local_id=1998890974 remote_id=1129750393 state=DREQ_RCVD lap_state=LAP_UNINIT
Link: https://lore.kernel.org/r/159767239665.2968.10613294222688696646.stgit@klimt.1015granger.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The alloc ucontext flow is always called with a valid udata, there's no
need to test whether it's NULL.
While at it, the 'udata->outlen' check is removed as well as we copy the
minimum between the size of the response and outlen, so in case of zero
outlen, zero bytes will be copied.
Link: https://lore.kernel.org/r/20200818110835.54299-1-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Fix the kernel-doc documentation by matching between the functions
definitions and documentation.
Link: https://lore.kernel.org/r/20200820123512.105193-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Creating rxe device on top of vlan interface will create a non-functional
device that has an empty gids table and can't be used for rdma cm
communication.
This is caused by the logic in
enum_all_gids_of_dev_cb()/is_eth_port_of_netdev(), which only considers
networks connected to "upper devices" of the configured network device,
resulting in an empty set of gids for a vlan interface, and attempts to
connect via this rdma device fail in cm_init_av_for_response because no
gids can be resolved.
Apparently, this behavior was implemented to fit the HW-RoCE devices that
create RoCE device per port, therefore RXE must behave the same like
HW-RoCE devices and create rxe device per real device only.
In order to communicate via a vlan interface, the user must use the gid
index of the vlan address instead of creating rxe over vlan.
Link: https://lore.kernel.org/r/20200811150415.3693-1-goody698@gmail.com
Signed-off-by: Mohammad Heib <goody698@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
'parent' sysfs reads will yield '\0' bytes when the interface name has 15
chars, and there will no "\n" output.
To reproduce, create one interface with 15 chars:
[root@test ~]# ip a s enp0s29u1u7u3c2
2: enp0s29u1u7u3c2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether 02:21:28:57:47:17 brd ff:ff:ff:ff:ff:ff
inet6 fe80::ac41:338f:5bcd:c222/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@test ~]# modprobe rdma_rxe
[root@test ~]# echo enp0s29u1u7u3c2 > /sys/module/rdma_rxe/parameters/add
[root@test ~]# cat /sys/class/infiniband/rxe0/parent
enp0s29u1u7u3c2[root@test ~]#
[root@test ~]# f="/sys/class/infiniband/rxe0/parent"
[root@test ~]# echo "$(<"$f")"
-bash: warning: command substitution: ignored null byte in input
enp0s29u1u7u3c2
Use scnprintf and PAGE_SIZE to fill the sysfs output buffer.
Cc: stable@vger.kernel.org
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200820153646.31316-1-yi.zhang@redhat.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are error cases when we will call free_srv before device kobject is
initialized; in such cases calling put_device generates the following
warning:
kobject: '(null)' (000000009f5445ed): is not initialized, yet
kobject_put() is being called.
So call device_initialize() only once when the server is allocated. If we
end up calling put_srv() and subsequently free_srv(), our call to
put_device() would result in deletion of the obj. Call device_add() later
when we actually have a connection. Correspondingly, call device_del()
instead of device_unregister() when srv->dev_ref falls to 0.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Link: https://lore.kernel.org/r/20200811092722.2450-1-haris.iqbal@cloud.ionos.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When scheduling delayed work to clean up the cache, if the entry already
has been scheduled for deletion, we adjust the delay.
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-7-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A CM REJ packet with its reason equal to timeout is a special beast in the
sense that it doesn't have a Remote Communication ID nor does it have a
Remote Port GID.
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.
Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the book keeping
in a cache.
This proxying doesn't not handle said REJ packets.
If the active side abandons its connection attempt after having sent a
REQ, it will send a REJ with the reason being timeout. This example can be
provoked by a simple user-verbs program, which ends up doing:
rdma_connect(cm_id, &conn_param);
rdma_destroy_id(cm_id);
using the async librdmacm API.
Having dynamic debug prints enabled in the mlx4_ib driver, we will then
see:
mlx4_ib_demux_cm_handler: Couldn't find an entry for pv_cm_id 0x0, attr_id 0x12
The solution is to introduce a radix-tree. When a REQ packet is received
and handled in mlx4_ib_demux_cm_handler(), we know the connecting peer's
para-virtual cm_id and the destination slave. We then insert an entry into
the tree with said information. We also schedule work to remove this entry
from the tree and free it, in order to avoid memory leak.
When a REJ packet with reason timeout is received, we can look up the
slave in the tree, and deliver the packet to the correct slave.
When a duplicate REQ packet is received, the entry is in the tree. In this
case, we adjust the delayed work in order to avoid a too premature
eviction of the entry.
When cleaning up, we simply traverse the tree and modify any delayed work
to use a zero delay. A subsequent flush of the system_wq will ensure all
entries being wiped out.
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-6-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The mlx4 driver will proxy MAD packets through the PF driver. A VM or an
instantiated VF will send its MAD packets to the PF driver using
loop-back. The PF driver will be informed by an interrupt, but defer the
handling and polling of CQEs to a worker thread running on an ordered
work-queue.
Consider the following scenario: the VMs will in short proximity in time,
for example due to a network event, send many MAD packets to the PF
driver. Lets say there are K VMs, each sending N packets.
The interrupt from the first VM will start the worker thread, which will
poll N CQEs. A common case here is where the PF driver will multiplex the
packets received from the VMs out on the wire QP.
But before the wire QP has returned a send CQE and associated interrupt,
the other K - 1 VMs have sent their N packets as well.
The PF driver has to multiplex K * N packets out on the wire QP. But the
send-queue on the wire QP has a finite capacity.
So, in this scenario, if K * N is larger than the send-queue capacity of
the wire QP, we will get MAD packets dropped on the floor with this
dynamic debug message:
mlx4_ib_multiplex_mad: failed sending GSI to wire on behalf of slave 2 (-11)
and this despite the fact that the wire send-queue could have capacity,
but the PF driver isn't aware, because the wire send CQEs have not yet
been polled.
We can also have a similar scenario inbound, with a wire recv-queue larger
than the tunnel QP's send-queue. If many remote peers send MAD packets to
the very same VM, the tunnel send-queue destined to the VM could allegedly
be construed to be full by the PF driver.
This starvation is fixed by introducing separate work queues for the wire
QPs vs. the tunnel QPs.
With this fix, using a dual ported HCA, 8 VFs instantiated, we could run
cmtime on each of the 18 interfaces towards a similar configured peer,
each cmtime instance with 800 QPs (all in all 14400 QPs) without a single
CM packet getting lost.
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-5-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Using CX-3 in virtualized mode, MAD packets are proxied through the PF
driver. The feed is N tunnel QPs, and what is received from the VFs is
multiplexed out on the wire QP. Since this is a many-to-one scenario, it
is better to have separate initialization parameters for the two usages.
The number of wire and tunnel bufs are yanked up to 2K and 512
respectively. With this set of parameters, a system consisting of eight
physical servers, each with eight VMs and 14 I/O servers (BM), can run
switch fail-over without seeing:
mlx4_ib_demux_mad: failed sending GSI to slave 3 via tunnel qp (-11)
or
mlx4_ib_multiplex_mad: failed sending GSI to wire on behalf of slave 2 (-11)
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-4-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Using CX-3 in virtualized mode, MAD packets are proxied through the PF
driver. However, the handling lacks support of the MRA (Message Receipt
Acknowledgment) packet. When having dynamic debug enabled, we see tons of:
mlx4_ib_multiplex_cm_handler: id{slave: 7, sl_cm_id: 0x8fcb45a0} is NULL! attr_id: 0x11
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-3-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add missing check for success after call to mlx4_ib_send_to_wire() in
mlx4_ib_multiplex_mad().
Amended the existing pr_debug() in mlx4_ib_multiplex_cm_handler() and
mlx4_ib_demux_cm_handler() with attr_id during a lookup failure.
Removed two noisy pr_debug() in mad.c
Fixes: 3cf69cc8db ("IB/mlx4: Add CM paravirtualization")
Link: https://lore.kernel.org/r/20200803061941.1139994-2-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch caused some issues on SEND operation, and it should be reverted
to make the drivers work correctly. There will be a better solution that
has been tested carefully to solve the original problem.
This reverts commit 711195e57d.
Fixes: 711195e57d ("RDMA/hns: Reserve one sge in order to avoid local length error")
Link: https://lore.kernel.org/r/1597829984-20223-1-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The following message occurs when running an AI application with TID RDMA
enabled:
hfi1 0000:7f:00.0: hfi1_0: [QP74] hfi1_tid_timeout 4084
hfi1 0000:7f:00.0: hfi1_0: [QP70] hfi1_tid_timeout 4084
The issue happens when TID RDMA WRITE request is followed by an
IB_WR_RDMA_WRITE_WITH_IMM request, the latter could be completed first on
the responder side. As a result, no ACK packet for the latter could be
sent because the TID RDMA WRITE request is still being processed on the
responder side.
When the TID RDMA WRITE request is eventually completed, the requester
will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.
If the next request is another TID RDMA WRITE request, no TID RDMA WRITE
DATA packet could be sent because the preceding IB_WR_RDMA_WRITE_WITH_IMM
request is not completed yet.
Consequently the IB_WR_RDMA_WRITE_WITH_IMM will be retried but it will be
ignored on the responder side because the responder thinks it has already
been completed. Eventually the retry will be exhausted and the qp will be
put into error state on the requester side. On the responder side, the TID
resource timer will eventually expire because no TID RDMA WRITE DATA
packets will be received for the second TID RDMA WRITE request. There is
also risk of a write-after-write memory corruption due to the issue.
Fix by adding a requester side interlock to prevent any potential data
corruption and TID RDMA protocol error.
Fixes: a0b34f75ec ("IB/hfi1: Add interlock between a TID RDMA request and other requests")
Link: https://lore.kernel.org/r/20200811174931.191210.84093.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org> # 5.4.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Driver shall add only the kernel qps to the flush list for clean up.
During async error events from the HW, driver is adding qps to this list
without checking if the qp is kernel qp or not.
Add a check to avoid user qp addition to the flush list.
Fixes: 942c9b6ca8 ("RDMA/bnxt_re: Avoid Hard lockup during error CQE processing")
Fixes: c50866e285 ("bnxt_re: fix the regression due to changes in alloc_pbl")
Link: https://lore.kernel.org/r/1596689148-4023-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The isert_post_recv function prints an error in case of failures, so no
need for the callers to add another print.
Link: https://lore.kernel.org/r/20200805121231.166162-2-maxg@mellanox.com
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In order to allow sniffer when the RDMA device is in switchdev mode, we
don't need to set the source port when creating the sniffer rule.
Link: https://lore.kernel.org/r/20200803060214.15328-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Support 56, 25, 100, 200 and 50Gbps IB rates in mlx5 driver.
Link: https://lore.kernel.org/r/20200802081712.1993490-1-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch introduces the ability to configure SRD QPs with the RNR retry
parameter when issuing a modify QP command.
In addition, a capability bit was added to report support to the userspace
library.
Link: https://lore.kernel.org/r/20200731060420.17053-5-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This precursory patch adds the SRD QP type state machine, which is
currently identical to the one of UD QP type. A following patch is going
to change the SRD QP state machine to support RNR retry modifications.
Link: https://lore.kernel.org/r/20200731060420.17053-4-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The modify QP bitmask was not consistent with other bitmasks used in the
device interface. Remove the bitmask enum and allow usage with
EFA_GET/SET.
Link: https://lore.kernel.org/r/20200731060420.17053-3-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Instead of adding a new function for each capability added, introduce a
generic helper to query device capabilities.
Link: https://lore.kernel.org/r/20200731060420.17053-2-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The "domain" argument is constant and modern device (mlx5) doesn't support
anything except IB_FLOW_DOMAIN_USER, so delete this extra parameter and
simplify code.
Link: https://lore.kernel.org/r/20200730081235.1581127-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Improve readability of fs.c by converting multiple else-if constructions
to be implemented with switch keyword.
Link: https://lore.kernel.org/r/20200730081235.1581127-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Smaller set of RDMA updates. A smaller number of 'big topics' with the
majority of changes being driver updates.
- Driver updates for hfi1, rxe, mlx5, hns, qedr, usnic, bnxt_re
- Removal of dead or redundant code across the drivers
- RAW resource tracker dumps to include a device specific data blob for
device objects to aide device debugging
- Further advance the IOCTL interface, remove the ability to turn it off.
Add QUERY_CONTEXT, QUERY_MR, and QUERY_PD commands
- Remove stubs related to devices with no pkey table
- A shared CQ scheme to allow multiple ULPs to share the CQ rings of a
device to give higher performance
- Several more static checker, syzkaller and rare crashers fixed
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl8sSA0ACgkQOG33FX4g
mxpp1w/8Df/KIB38PVHpKraIW10bX03KsXwoskMYCA+ITYWM5ce+P7YF+yXXGs69
Vh2vUYHlr1RvqXQkq3Y3LjzCPKTYFuNFVQRZF1LrfbfOpSS9aoQqoxwgKs08dibm
YDeRwueWneksWhXeEZLA0QoKd4kEWrScA/n7VGYQ4YcWw8FLKa9t6OMSGivCrFLu
QA+sA9nytrvMWC5uJUCdeVwlRnoaICPYHmM5yafOykPyEciRw2jU1kzTRVy5Z0Hu
iCsXm2lJPcVoMgSjW6SgktY3oBkQeSu3ZZesT3eTM6FJsoDYkuSiKjNmWSZjW1zv
x6CFGjVVin41rN4FMTeqqnwYoML9Q/obbyHvBHs5MTd5J8tLDhesQj3Ev7CUaUed
b0s38v+oEL1w22nkOChfeyfh7eLcy3yiszqvkIU9ABk8mF0p1guGQYsfguzbsq0K
3ZRw/361SxCUBvU6P8CdQbIJlhkH+Un7d81qyt+rhLgaZYm/N+d8auIKUxP1jCxh
q9hss2Cj2U9eZsA/wGNqV1LNazfEAAj/5qjItMirbRd90FL8h+AP2LfJfC7p+id3
3BfOui0JbZqNTTl4ftTxPuxtWDEdTPgwi7JvQd/be9HRlSV8DYCSMUzYFn8A+Zya
cbxjxFuBJWmF+y9csDIVBTdFi+j9hO6notw+G89NznuB3QlPl50=
=0z2L
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A quiet cycle after the larger 5.8 effort. Substantially cleanup and
driver work with a few smaller features this time.
- Driver updates for hfi1, rxe, mlx5, hns, qedr, usnic, bnxt_re
- Removal of dead or redundant code across the drivers
- RAW resource tracker dumps to include a device specific data blob
for device objects to aide device debugging
- Further advance the IOCTL interface, remove the ability to turn it
off. Add QUERY_CONTEXT, QUERY_MR, and QUERY_PD commands
- Remove stubs related to devices with no pkey table
- A shared CQ scheme to allow multiple ULPs to share the CQ rings of
a device to give higher performance
- Several more static checker, syzkaller and rare crashers fixed"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (121 commits)
RDMA/mlx5: Fix flow destination setting for RDMA TX flow table
RDMA/rxe: Remove pkey table
RDMA/umem: Add a schedule point in ib_umem_get()
RDMA/hns: Fix the unneeded process when getting a general type of CQE error
RDMA/hns: Fix error during modify qp RTS2RTS
RDMA/hns: Delete unnecessary memset when allocating VF resource
RDMA/hns: Remove redundant parameters in set_rc_wqe()
RDMA/hns: Remove support for HIP08_A
RDMA/hns: Refactor hns_roce_v2_set_hem()
RDMA/hns: Remove redundant hardware opcode definitions
RDMA/netlink: Remove CAP_NET_RAW check when dump a raw QP
RDMA/include: Replace license text with SPDX tags
RDMA/rtrs: remove WQ_MEM_RECLAIM for rtrs_wq
RDMA/rtrs-clt: add an additional random 8 seconds before reconnecting
RDMA/cma: Execute rdma_cm destruction from a handler properly
RDMA/cma: Remove unneeded locking for req paths
RDMA/cma: Using the standard locking pattern when delivering the removal event
RDMA/cma: Simplify DEVICE_REMOVAL for internal_id
RDMA/efa: Add EFA 0xefa1 PCI ID
RDMA/efa: User/kernel compatibility handshake mechanism
...
Pull networking updates from David Miller:
1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.
2) Support UDP segmentation in code TSO code, from Eric Dumazet.
3) Allow flashing different flash images in cxgb4 driver, from Vishal
Kulkarni.
4) Add drop frames counter and flow status to tc flower offloading,
from Po Liu.
5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.
6) Various new indirect call avoidance, from Eric Dumazet and Brian
Vazquez.
7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
Yonghong Song.
8) Support querying and setting hardware address of a port function via
devlink, use this in mlx5, from Parav Pandit.
9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.
10) Switch qca8k driver over to phylink, from Jonathan McDowell.
11) In bpftool, show list of processes holding BPF FD references to
maps, programs, links, and btf objects. From Andrii Nakryiko.
12) Several conversions over to generic power management, from Vaibhav
Gupta.
13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
Yakunin.
14) Various https url conversions, from Alexander A. Klimov.
15) Timestamping and PHC support for mscc PHY driver, from Antoine
Tenart.
16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.
17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.
18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.
19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
drivers. From Luc Van Oostenryck.
20) XDP support for xen-netfront, from Denis Kirjanov.
21) Support receive buffer autotuning in MPTCP, from Florian Westphal.
22) Support EF100 chip in sfc driver, from Edward Cree.
23) Add XDP support to mvpp2 driver, from Matteo Croce.
24) Support MPTCP in sock_diag, from Paolo Abeni.
25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
infrastructure, from Jakub Kicinski.
26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.
27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.
28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.
29) Refactor a lot of networking socket option handling code in order to
avoid set_fs() calls, from Christoph Hellwig.
30) Add rfc4884 support to icmp code, from Willem de Bruijn.
31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.
32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.
33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.
34) Support TCP syncookies in MPTCP, from Flowian Westphal.
35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
Brivio.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
net: thunderx: initialize VF's mailbox mutex before first usage
usb: hso: remove bogus check for EINPROGRESS
usb: hso: no complaint about kmalloc failure
hso: fix bailout in error case of probe
ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
selftests/net: relax cpu affinity requirement in msg_zerocopy test
mptcp: be careful on subflow creation
selftests: rtnetlink: make kci_test_encap() return sub-test result
selftests: rtnetlink: correct the final return value for the test
net: dsa: sja1105: use detected device id instead of DT one on mismatch
tipc: set ub->ifindex for local ipv6 address
ipv6: add ipv6_dev_find()
net: openvswitch: silence suspicious RCU usage warning
Revert "vxlan: fix tos value before xmit"
ptp: only allow phase values lower than 1 period
farsync: switch from 'pci_' to 'dma_' API
wan: wanxl: switch from 'pci_' to 'dma_' API
hv_netvsc: do not use VF device if link is down
dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
net: macb: Properly handle phylink on at91sam9x
...
For RDMA TX flow table, set destination type to be 'port' and prevent
creation of flows with TIR destination.
As RDMA TX is an egress flow table the rules on this flow table should
not forward traffic back to the NIC and should set the destination to be
the port.
Without the setting of this destination type flow rules on the RDMA TX
flow tables are not created as FW invokes a syndrome for undefined
destination for the rule.
Fixes: 24670b1a31 ("net/mlx5: Add support for RDMA TX steering")
Link: https://lore.kernel.org/r/20200803055849.14947-1-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
- make support for dma_ops optional
- move more code out of line
- add generic support for a dma_ops bypass mode
- misc cleanups
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8oGscLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNfEhAAmFwd6BBHGwAhXUchoIue5vdNnuY3GiBFRzUdz67W
zRYYgZYiPjl+MwflRmwPcoWEnGzmweRa2s6OnyDostiCRauioa8BuQfGqJasf1yZ
D36dFNVHGW0o6pRDUQkd688k/4A6szwuwpq83qi4e8X2I9QzAITHtW8izjfPM923
FlJzxEFggbB2TvwfUXOZhmpuG4Dog8S7VZ1Uz4QAg0Z/5FDqIKAAG2aZMqCXBbiX
01E8tr0AqU/jn2xpc8O+DJGFiYIRhqhyNxQbH6qz1Q3xGFSokcLYm3YqkqVOgpn1
DLs2UFDxWkly/F+wGnYtju7OD9VGPywzOcW125/LIsApYN5R/rYrtQzK41eq7Mp5
HY3tqgNTIMdnl4so7QXeU4Vxj+lUdPlI26NZGszcM5AVftdTX8KjGdS+0+PBza6i
i7trwG7J5/DnwiBCvEKoul7Ul1psUMTSvYwINTXRqsU4mZXhhx/mwyXbtruELnkj
3agM98u6hoalLNjd2aueh+NjMZi1r+MchTrfRvTcxJ+yQ5BoR5kF+iz7eT/LtZ72
AqWwimsPGNkLHUa0TrqWql5tv90cdDkBZzWXVbixwxRfgynWYLE6jugeIy8hwjFf
GjO5XKbBwnWPjdSzFsVMPeuNpmr7ZjVHHewy2Q/jWQAIOyeof0VztEl23LN5yUkx
pc8=
=90UK
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- make support for dma_ops optional
- move more code out of line
- add generic support for a dma_ops bypass mode
- misc cleanups
* tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping:
dma-contiguous: cleanup dma_alloc_contiguous
dma-debug: use named initializers for dir2name
powerpc: use the generic dma_ops_bypass mode
dma-mapping: add a dma_ops_bypass flag to struct device
dma-mapping: make support for dma ops optional
dma-mapping: inline the fast path dma-direct calls
dma-mapping: move the remaining DMA API calls out of line
The RoCE spec requires RoCE devices to support only the default pkey.
However the rxe driver maintains a 64 enties pkey table and uses only the
first entry. Remove the pkey table and hard code a table of length one
hard wired with the default pkey. Replace all checks of the pkey_table
with a comparison to the default_pkey instead.
Link: https://lore.kernel.org/r/20200721101618.686110-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Mapping as little as 64GB can take more than 10 seconds, triggering issues
on kernels with CONFIG_PREEMPT_NONE=y.
ib_umem_get() already splits the work in 2MB units on x86_64, adding a
cond_resched() in the long-lasting loop is enough to solve the issue.
Note that sg_alloc_table() can still use more than 100 ms, which is also
problematic. This might be addressed later in ib_umem_add_sg_table(),
adding new blocks in sgl on demand.
Link: https://lore.kernel.org/r/20200730015755.1827498-1-edumazet@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the hns ROCEE reports a general error CQE (types not specified by the IB
General Specifications), it's no need to change the QP state to error, and
the driver should just skip it.
Fixes: 7c044adca2 ("RDMA/hns: Simplify the cqe code of poll cq")
Link: https://lore.kernel.org/r/1595932941-40613-8-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
One qp state migrations legal configuration was deleted mistakenly.
Fixes: 357f342946 ("RDMA/hns: Simplify the state judgment code of qp")
Link: https://lore.kernel.org/r/1595932941-40613-7-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The hns_roce_cmq_setup_basic_desc() can clear the whole desc, so removes
these redundant memset operations.
Link: https://lore.kernel.org/r/1595932941-40613-6-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are some functions called by set_rc_wqe() use two parameters:
"void *wqe" and "struct hns_roce_v2_rc_send_wqe *rc_sq_wqe", but the first
one can be got from the second one. So remove the redundant wqe from
related functions.
Link: https://lore.kernel.org/r/1595932941-40613-5-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
HIP08_A is an temporary version and all features of it are supported by
HIP08_B. So remove the relevant code.
Link: https://lore.kernel.org/r/1595932941-40613-4-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The parts about preparing and sending mailbox to hardware is not strongly
related to other codes in hns_roce_v2_set_hem(), and can be encapsulated
into a separate function.
Link: https://lore.kernel.org/r/1595932941-40613-3-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
HNS_ROCE_SQ_OPCODE_XXXs and HNS_ROCE_V2_WQE_OP_XXXs have same values, so
remove a set of redundant definitions. In addition, remove the suffix of
HNS_ROCE_V2_WQE_OP_BIND_MW_TYPE.
Link: https://lore.kernel.org/r/1595932941-40613-2-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The memory allocated for the DIM wasn't freed in in error unwind path, fix
it by calling to rdma_dim_destroy().
Fixes: da6629793a ("RDMA/core: Provide RDMA DIM support for ULPs")
Link: https://lore.kernel.org/r/20200730082719.1582397-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com <mailto:maxg@mellanox.com>>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
HW destroy operation should be last operation after all possible CQ users
completed their work, so move DIM work cancellation before such destroy
call.
Fixes: da6629793a ("RDMA/core: Provide RDMA DIM support for ULPs")
Link: https://lore.kernel.org/r/20200730082719.1582397-3-leon@kernel.org
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The lookaside count is improperly initialized to the size of the
Receive Queue with the additional +1. In the traces below, the
RQ size is 384, so the count was set to 385.
The lookaside count is then rarely refreshed. Note the high and
incorrect count in the trace below:
rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9008 wr_id 55c7206d75a0 qpn c
qpt 2 pid 3018 num_sge 1 head 1 tail 0, count 385
rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1
The head,tail indicate there is only one RWQE posted although the count
says 385 and we correctly return the element 0.
The next call to rvt_get_rwqe with the decremented count:
rvt_get_rwqe: [hfi1_0] wqe ffffc900078e9058 wr_id 0 qpn c
qpt 2 pid 3018 num_sge 0 head 1 tail 1, count 384
rvt_get_rwqe: (hfi1_rc_rcv+0x4eb/0x1480 [hfi1] <- rvt_get_rwqe) ret=0x1
Note that the RQ is empty (head == tail) yet we return the RWQE at tail 1,
which is not valid because of the bogus high count.
Best case, the RWQE has never been posted and the rc logic sees an RWQE
that is too small (all zeros) and puts the QP into an error state.
In the worst case, a server slow at posting receive buffers might fool
rvt_get_rwqe() into fetching an old RWQE and corrupt memory.
Fix by deleting the faulty initialization code and creating an
inline to fetch the posted count and convert all callers to use
new inline.
Fixes: f592ae3c99 ("IB/rdmavt: Fracture single lock used for posting and processing RWQEs")
Link: https://lore.kernel.org/r/20200728183848.22226.29132.stgit@awfm-01.aw.intel.com
Reported-by: Zhaojuan Guo <zguo@redhat.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Honggang Li <honli@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When dumping QPs bound to a counter, raw QPs should be allowed to dump
without the CAP_NET_RAW privilege. This is consistent with what "rdma res
show qp" does.
Fixes: c4ffee7c9b ("RDMA/netlink: Implement counter dumpit calback")
Link: https://lore.kernel.org/r/20200727095828.496195-1-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
lockdep triggers a warning from time to time when running a regression
test:
rnbd_client L685: </dev/nullb0@bla> Device disconnected.
rnbd_client L1756: Unloading module
workqueue: WQ_MEM_RECLAIM rtrs_client_wq:rtrs_clt_reconnect_work [rtrs_client] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
WARNING: CPU: 2 PID: 18824 at kernel/workqueue.c:2517 check_flush_dependency+0xad/0x130
The root cause is workqueue core expect flushing should not be done for a
!WQ_MEM_RECLAIM wq from a WQ_MEM_RECLAIM workqueue.
In above case ib_addr workqueue without WQ_MEM_RECLAIM, but rtrs_wq
WQ_MEM_RECLAIM.
To avoid the warning, remove the WQ_MEM_RECLAIM flag.
Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20200724111508.15734-4-haris.iqbal@cloud.ionos.com
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In order to avoid all the clients to start reconnecting at the same time
schedule the reconnect dwork with a random jitter of +[0,8] seconds.
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Link: https://lore.kernel.org/r/20200724111508.15734-2-haris.iqbal@cloud.ionos.com
Signed-off-by: Danil Kipnis <danil.kipnis@cloud.ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Scatter CQE feature relies on two flags MLX5_QP_FLAG_SCATTER_CQE and
MLX5_QP_FLAG_ALLOW_SCATTER_CQE, both of them can be provided without
relation to device capability.
Relax global validity check to allow MLX5_QP_FLAG_ALLOW_SCATTER_CQE QP
flag.
Existing user applications are failing on this new validity check.
Fixes: 90ecb37a75 ("RDMA/mlx5: Change scatter CQE flag to be set like other vendor flags")
Fixes: 37518fa49f ("RDMA/mlx5: Process all vendor flags in one place")
Link: https://lore.kernel.org/r/20200728120255.805733-1-leon@kernel.org
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When a rdma_cm_id needs to be destroyed after a handler callback fails,
part of the destruction pattern is open coded into each call site.
Unfortunately the blind assignment to state discards important information
needed to do cma_cancel_operation(). This results in active operations
being left running after rdma_destroy_id() completes, and the
use-after-free bugs from KASAN.
Consolidate this entire pattern into destroy_id_handler_unlock() and
manage the locking correctly. The state should be set to
RDMA_CM_DESTROYING under the handler_lock to atomically ensure no futher
handlers are called.
Link: https://lore.kernel.org/r/20200723070707.1771101-5-leon@kernel.org
Reported-by: syzbot+08092148130652a6faae@syzkaller.appspotmail.com
Reported-by: syzbot+a929647172775e335941@syzkaller.appspotmail.com
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The REQ flows are concerned that once the handler is called on the new
cm_id the ULP can choose to trigger a rdma_destroy_id() concurrently at
any time.
However, this is not true, while the ULP can call rdma_destroy_id(), it
immediately blocks on the handler_mutex which prevents anything harmful
from running concurrently.
Remove the confusing extra locking and refcounts and make the
handler_mutex protecting state during destroy more clear.
Link: https://lore.kernel.org/r/20200723070707.1771101-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Whenever an event is delivered to the handler it should be done under the
handler_mutex and upon any non-zero return from the handler it should
trigger destruction of the cm_id.
cma_process_remove() skips some steps here, it is not necessarily wrong
since the state change should prevent any races, but it is confusing and
unnecessary.
Follow the standard pattern here, with the slight twist that the
transition to RDMA_CM_DEVICE_REMOVAL includes a cma_cancel_operation().
Link: https://lore.kernel.org/r/20200723070707.1771101-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
cma_process_remove() triggers an unconditional rdma_destroy_id() for
internal_id's and skips the event deliver and transition through
RDMA_CM_DEVICE_REMOVAL.
This is confusing and unnecessary. internal_id always has
cma_listen_handler() as the handler, have it catch the
RDMA_CM_DEVICE_REMOVAL event and directly consume it and signal removal.
This way the FSM sequence never skips the DEVICE_REMOVAL case and the
logic in this hard to test area is simplified.
Link: https://lore.kernel.org/r/20200723070707.1771101-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce a mechanism that performs an handshake between the userspace
provider and kernel driver which verifies that the user supports all
required features in order to operate correctly.
The handshake verifies the needed functionality by comparing the reported
device caps and the provider caps. If the device reports a non-zero
capability the appropriate comp mask is required from the userspace
provider in order to allocate the context.
Link: https://lore.kernel.org/r/20200722140312.3651-4-galpress@amazon.com
Reviewed-by: Shadi Ammouri <sammouri@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The device reports the minimum SQ size required for creation.
This patch queries the min SQ size and reports it back to the userspace
library.
Link: https://lore.kernel.org/r/20200722140312.3651-3-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Shadi Ammouri <sammouri@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The device reports the maximum number of bytes to be written before
ringing the doorbell (zero means unlimited).
This patch queries the max batch size and reports it back to the userspace
library.
Link: https://lore.kernel.org/r/20200722140312.3651-2-galpress@amazon.com
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas JahJah <firasj@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Have the driver use shared CQs provided by the rdma core driver. This
provides the advantage of improved efficiency handling interrupts.
Link: https://lore.kernel.org/r/20200722135629.49467-3-maxg@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Have the driver use shared CQs provided by the rdma core driver. Since
this provides similar functionality to iser_comp it has been removed.
Now there is no reason to allocate very large CQs when the driver is
loaded while gaining the advantage of shared CQs. Previously when a single
connection was opened a CQ was opened for every core with enough space for
eight connections, this is a very large overhead that in most cases will
not be utilized.
Link: https://lore.kernel.org/r/20200722135629.49467-2-maxg@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Have the driver use shared CQs provided by the rdma core driver. Since
this provides similar functionality to iser_comp it has been removed. Now
there is no reason to allocate very large CQs when the driver is loaded
while gaining the advantage of shared CQs.
Link: https://lore.kernel.org/r/20200722135629.49467-1-maxg@mellanox.com
Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
Acked-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently n_rq_elems is being assigned to params.elem_size instead of the
field params.num_elems. Coverity is detecting this as a double assingment
to params.elem_size and reporting this as an usused value on the first
assignment. Fix this.
Addresses-Coverity: ("Unused value")
Fixes: b6db3f71c9 ("qed: simplify chain allocation with init params struct")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Alexander Lobakin <alobakin@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Main changes are 200G support and fixing the definitions of discard and
error counters to match the hardware definitions.
Because the HWRM_PORT_PHY_QCFG message size has now exceeded the max.
encapsulated response message size of 96 bytes from the PF to the VF,
we now need to cap this message to 96 bytes for forwarding. The forwarded
response only needs to contain the basic link status and speed information
and can be capped without adding the new information.
v2: Fix bnxt_re compile error.
Cc: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error codes in _ib_modify_qp() are supposed to be negative errno.
Fixes: 7a5c938b9e ("IB/core: Check for rdma_protocol_ib only after validating port_num")
Link: https://lore.kernel.org/r/1595645787-20375-1-git-send-email-liheng40@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Li Heng <liheng40@huawei.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
destroy_prefetch_work() must always be called if the work is not going
to be queued. The num_sge also should have been set to i, not i-1
which avoids the condition where it shouldn't have been called in the
first place.
Cc: stable@vger.kernel.org
Fixes: fb985e278a ("RDMA/mlx5: Use SRCU properly in ODP prefetch")
Link: https://lore.kernel.org/r/20200727095712.495652-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
These are missing throughout ucma, it harmlessly copies garbage from
userspace, but in this new code which uses min to compute the copy length
it can result in uninitialized stack memory. Check for minimum length at
the very start.
BUG: KMSAN: uninit-value in ucma_connect+0x2aa/0xab0 drivers/infiniband/core/ucma.c:1091
CPU: 0 PID: 8457 Comm: syz-executor069 Not tainted 5.8.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1df/0x240 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
__msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
ucma_connect+0x2aa/0xab0 drivers/infiniband/core/ucma.c:1091
ucma_write+0x5c5/0x630 drivers/infiniband/core/ucma.c:1764
do_loop_readv_writev fs/read_write.c:737 [inline]
do_iter_write+0x710/0xdc0 fs/read_write.c:1020
vfs_writev fs/read_write.c:1091 [inline]
do_writev+0x42d/0x8f0 fs/read_write.c:1134
__do_sys_writev fs/read_write.c:1207 [inline]
__se_sys_writev+0x9b/0xb0 fs/read_write.c:1204
__x64_sys_writev+0x4a/0x70 fs/read_write.c:1204
do_syscall_64+0xb0/0x150 arch/x86/entry/common.c:386
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 34e2ab57a9 ("RDMA/ucma: Extend ucma_connect to receive ECE parameters")
Fixes: 0cb15372a6 ("RDMA/cma: Connect ECE to rdma_accept")
Link: https://lore.kernel.org/r/0-v1-d5b86dab17dc+28c25-ucma_syz_min_jgg@nvidia.com
Reported-by: syzbot+086ab5ca9eafd2379aa6@syzkaller.appspotmail.com
Reported-by: syzbot+7446526858b83c8828b2@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Meir Lichtinger says:
====================
ConnectX-7 supports setting relaxed ordering read/write mkey attribute by
UMR, indicated by new HCA capabilities, so extend mlx5_ib driver to
configure UMR control segment
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_uar':
RDMA/mlx5: Set mkey relaxed ordering by UMR with ConnectX-7
RDMA/mlx5: Use MLX5_SET macro instead of local structure
RDMA/mlx5: ConnectX-7 new capabilities to set relaxed ordering by UMR
Up to ConnectX-7 UMR is not used when user passes relaxed ordering access
flag. ConnectX-7 supports setting relaxed ordering read/write mkey
attribute by UMR, indicated by new HCA capabilities.
With ConnectX-7 driver uses UMR when user set relaxed ordering access
flag, in contrast to previous silicon models. Specifically it includes
setting relvant flags of mkey context mask in UMR control segment, and
relaxed ordering write and read flags in UMR mkey context segment.
Link: https://lore.kernel.org/r/20200716105248.1423452-4-leon@kernel.org
Signed-off-by: Meir Lichtinger <meirl@mellanox.com>
Reviewed-by: Michael Guralnik <michaelgur@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use generic mlx5 structure defined in mlx5_ifc.h to represent ConnectX
device data structures instead of using structure defined specifically for
mlx5_ib module.
Link: https://lore.kernel.org/r/20200716105248.1423452-3-leon@kernel.org
Signed-off-by: Meir Lichtinger <meirl@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix reported by kbuild warning.
drivers/infiniband/core/uverbs_cmd.c:1897:47: warning: Shifting signed 32-bit value by 31 bits is undefined behaviour [shiftTooManyBitsSigned]
BUILD_BUG_ON(IB_USER_LAST_QP_ATTR_MASK == (1 << 31));
^
Link: https://lore.kernel.org/r/20200720175627.1273096-3-leon@kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The kbuild reported the following warning, so clean whole uverbs_cmd.c
file.
drivers/infiniband/core/uverbs_cmd.c:1066:6: warning: Variable 'ret' is reassigned a value before the old one has been used. [redundantAssignment]
ret = uverbs_request(attrs, &cmd, sizeof(cmd));
^
drivers/infiniband/core/uverbs_cmd.c:1064:0: note: Variable 'ret' is reassigned a value before the old one has been used.
int ret = -EINVAL;
^
Link: https://lore.kernel.org/r/20200720175627.1273096-2-leon@kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The automatic object lifetime model allows us to change the write()
interface to have the same logic as the ioctl() path. Update the
create/alloc functions to be in the following format, so the code flow
will be the same:
* Allocate objects
* Initialize them
* Call to the drivers, this is last step that is allowed to fail
* Finalize object
* Return response and allow to core code to handle abort/commit
respectively.
Link: https://lore.kernel.org/r/20200719052223.75245-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Create the same logic flow for the write() interface as we have for the
ioctl() path by making sure that the object is committed or aborted
automatically after HW object creation.
Link: https://lore.kernel.org/r/20200719052223.75245-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently the SQ is set to a ready state when the RAW QP is modified to
INIT. When the TIS is modified, e.g. to change the lag_tx_affinity, then
SQs which are already in the ready state will not be affected.
Open a window to modify the SQ behavior by setting the SQ as ready only
when QP was modified to RTS.
Link: https://lore.kernel.org/r/20200716105416.1423826-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend current infrastructure to store chain page size in a struct
and use it in all functions instead of fixed QED_CHAIN_PAGE_SIZE.
Its value remains the default one, but can be overridden in
qed_chain_init_params before chain allocation.
Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To simplify qed_chain_alloc() prototype and call sites, introduce struct
qed_chain_init_params to specify chain params, and pass a pointer to
filled struct to the actual qed_chain_alloc() instead of a long list
of separate arguments.
Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefetch work in mlx5_ib_prefetch_mr_work can be queued and able to run
concurrently with destruction of the implicit MR. The num_deferred_work
was intended to serialize this, but there is a race:
CPU0 CPU1
mlx5_ib_free_implicit_mr()
xa_erase(odp_mkeys)
synchronize_srcu()
__xa_erase(implicit_children)
mlx5_ib_prefetch_mr_work()
pagefault_mr()
pagefault_implicit_mr()
implicit_get_child_mr()
xa_cmpxchg()
atomic_dec_and_test(num_deferred_mr)
wait_event(imr->q_deferred_work)
ib_umem_odp_release(odp_imr)
kfree(odp_imr)
At this point in mlx5_ib_free_implicit_mr() the implicit_children list is
supposed to be empty forever so that destroy_unused_implicit_child_mr()
and related are not and will not be running.
Since it is not empty the destroy_unused_implicit_child_mr() flow ends up
touching deallocated memory as mlx5_ib_free_implicit_mr() already tore down the
imr parent.
The solution is to flush out the prefetch wq by driving num_deferred_work
to zero after creation of new prefetch work is blocked.
Fixes: 5256edcb98 ("RDMA/mlx5: Rework implicit ODP destroy")
Link: https://lore.kernel.org/r/20200719065435.130722-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Modifying the post-send and post-recv to initialize the wqes slot by slot
dynamically depending on the number of max sges requested by consumer at
the time of QP creation.
Changed the QP creation logic to determine the size of SQ and RQ in 16B
slots based on the number of wqe and number of SGEs requested by consumer
Link: https://lore.kernel.org/r/1594822619-4098-6-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Adding few helper data structure which are useful to initialize hardware
send wqe in variable wqe mode.
Adding a qp flag in HSI to indicate variable wqe is enabled for this qp.
Link: https://lore.kernel.org/r/1594822619-4098-5-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Changing the PSN management memory buffers from statically initialized to
dynamic pull scheme.
During create qp only the start pointers are initialized and during
post-send the psn buffer is pulled based on current producer index.
Adjusting post_send code to accommodate dynamic psn-pull and changing
post_recv code to match post-send code wrt pseudo flush wqe generation.
Link: https://lore.kernel.org/r/1594822619-4098-4-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The bnxt_re driver now allocates shadow sq and rq to maintain per wqe
wr_id and few other flags required to support variable wqe. Segregated the
allocation of shadow queue in a separate function and adjust the cqe
polling logic. The new polling logic is based on shadow queue indices.
Link: https://lore.kernel.org/r/1594822619-4098-3-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The bnxt_re driver need to decide on how much SQ and RQ memory should to
be allocated and which wqe posting/polling algorithm to use.
Making changes to set the wqe-mode to a default value during device
registration sequence. The wqe-mode is passed to the lower layer driver as
well. Going forward in the lower layer driver wqe-mode will be used to
decide execution path. Initializing the wqe-mode to static wqe type for
now.
Link: https://lore.kernel.org/r/1594822619-4098-2-git-send-email-devesh.sharma@broadcom.com
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core for iwarp
providers, this callback can be removed from the common ops and moved to
the RoCE only ops within the qedr driver.
Link: https://lore.kernel.org/r/20200714183414.61069-8-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core for iwarp
providers, this callback can be removed.
Link: https://lore.kernel.org/r/20200714183414.61069-7-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core for iwarp
providers, this callback can be removed.
Link: https://lore.kernel.org/r/20200714183414.61069-6-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the query_pkey() isn't mandatory by the RDMA core for iwarp
providers, this callback can be removed.
Link: https://lore.kernel.org/r/20200714183414.61069-5-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The query_pkey() isn't mandatory for the iwarp providers, so remove this
requirement from the RDMA core.
Link: https://lore.kernel.org/r/20200714183414.61069-4-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allocate the pkey cache only if the pkey_tbl_len is set by the provider,
also add checks to avoid accessing the pkey cache when it not initialized.
Link: https://lore.kernel.org/r/20200714183414.61069-3-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Avoid the overhead of the dma ops support for tiny builds that only
use the direct mapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
rxe_post_send_kernel() iterates over linked list of wr's, until the
wr->next ptr is NULL. However if we've got an interrupt after last wr is
posted, control may be returned to the code after send completion callback
is executed and wr memory is freed.
As a result, wr->next pointer may contain incorrect value leading to
panic. Store the wr->next on the stack before posting it.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200716190340.23453-1-m.malygin@yadro.com
Signed-off-by: Mikhail Malygin <m.malygin@yadro.com>
Signed-off-by: Sergey Kojushev <s.kojushev@yadro.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
User space should receive the maximum edpm size from kernel driver,
similar to other edpm/ldpm related limits. Add an additional parameter to
the alloc_ucontext_resp structure for the edpm maximum size.
In addition, pass an indication from user-space to kernel
(and not just kernel to user) that the DPM sizes are supported.
This is for supporting backward-forward compatibility between driver and
lib for everything related to DPM transaction and limit sizes.
This should have been part of commit mentioned in Fixes tag.
Link: https://lore.kernel.org/r/20200707063100.3811-3-michal.kalderon@marvell.com
Fixes: 93a3d05f9d ("RDMA/qedr: Add kernel capability flags for dpm enabled mode")
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In older FW versions the completion flag was treated as the ack flag in
edpm messages. commit ff937b916e ("qed: Add EDPM mode type for user-fw
compatibility") exposed the FW option of setting which mode the QP is in
by adding a flag to the qedr <-> qed API.
This patch adds the qedr <-> libqedr interface so that the libqedr can set
the flag appropriately and qedr can pass it down to FW. Flag is added for
backward compatibility with libqedr.
For older libs, this flag didn't exist and therefore set to zero.
Fixes: ac1b36e55a ("qedr: Add support for user context verbs")
Link: https://lore.kernel.org/r/20200707063100.3811-2-michal.kalderon@marvell.com
Signed-off-by: Yuval Bason <yuval.bason@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
QP's with the same SRQ, working on different CQs and running in parallel
on different CPUs could lead to a race when maintaining the SRQ consumer
count, and leads to FW running out of SRQs. Update the consumer
atomically. Make sure the wqe_prod is updated after the sge_prod due to
FW requirements.
Fixes: 3491c9e799 ("qedr: Add support for kernel mode SRQ's")
Link: https://lore.kernel.org/r/20200708195526.31040-1-ybason@marvell.com
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: Yuval Basson <ybason@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Current iSER target code allocates MR pool budget based on queue size.
Since there is no handshake between iSER initiator and target on max IO
size, we'll set the iSER target to support upto 16MiB IO operations and
allocate the correct number of RDMA ctxs according to the factor of MR's
per IO operation. This would guarantee sufficient size of the MR pool for
the required IO queue depth and IO size.
Link: https://lore.kernel.org/r/20200708091908.162263-1-maxg@mellanox.com
Reported-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Tested-by: Krishnamraju Eraparaju <krishna2@chelsio.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When using action drop dest_type was never assigned to any value. Add
initialization of dest_type to -1 since 0 is valid.
Fixes: f29de9eee7 ("RDMA/mlx5: Add support for drop action in DV steering")
Link: https://lore.kernel.org/r/20200707110259.882276-1-leon@kernel.org
Signed-off-by: Daria Velikovsky <daria@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Instead of returning IB_LINK_LAYER_ETHERNET from rxe_link_layer, return it
directly from get_link_layer callback and remove rxe_link_layer().
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200705104313.283034-5-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The return value from rxe_mem_init_dma() is always 0 - change it to be
void and fix the callers accordingly.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200705104313.283034-4-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The return value from rxe_init_port_param() is always 0 - change it to be
void.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200705104313.283034-3-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Both pkey_tbl_len and gid_tbl_len are set in rxe_init_port_param() - so no
need to check if they aren't set.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200705104313.283034-2-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ROCE uses "VA % buf_page_size" to caclulate the offset in the PBL's first
page, the actual PA corresponding to the MR's VA is equal to MR's PA plus
this offset. The first PA in PBL has already been aligned to PAGE_SIZE
after calling ib_umem_get(), but the MR's VA may not. If the buf_page_size
is smaller than the PAGE_SIZE, this will lead the HW to access the wrong
memory because the offset is smaller than expected.
Fixes: 9b2cf76c9f ("RDMA/hns: Optimize PBL buffer allocation process")
Link: https://lore.kernel.org/r/1594726935-45666-1-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The RoCE Engine will schedule to another QP after one has sent
(2 ^ lp_pktn_ini) packets. lp_pktn_ini is set in QPC and should be
calculated from 2 factors:
1. current MTU as a integer
2. the RoCE Engine's maximum slice length 64KB
But the driver use MTU as a enum ib_mtu and the max inline capability, the
lp_pktn_ini will be much bigger than expected which may cause traffic of
some QPs to never get scheduled.
Fixes: b713128de7 ("RDMA/hns: Adjust lp_pktn_ini dynamically")
Link: https://lore.kernel.org/r/1594726138-49294-1-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In manual mode allow bind user QPs with different pids to same counter,
since this is allowed in auto mode.
Bind kernel QPs and user QPs to the same counter are not allowed.
Fixes: 1bd8e0a9d0 ("RDMA/counter: Allow manual mode configuration support")
Link: https://lore.kernel.org/r/20200702082933.424537-4-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In auto mode only bind user QPs to a dynamic counter, since this feature
is mainly used for system statistic and diagnostic purpose, while there's
no need to counter kernel QPs so far.
Fixes: 99fa331dc8 ("RDMA/counter: Add "auto" configuration mode support")
Link: https://lore.kernel.org/r/20200702082933.424537-3-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
With the "PID" category QPs have same PID will be bound to same counter;
If this category is not set then QPs have different PIDs will be bound
to same counter.
This is implemented for 2 reasons:
1. The counter is a limited resource, while there may be dozens of
applications, each of which creates several types of QPs, which means
it may doesn't have enough counter.
2. The system administrator needs all QPs created by all applications
with same type bound to one counter.
The counter name and PID is only make sense when "PID" category are
configured.
This category can also be used in combine with others, e.g. QP type.
Link: https://lore.kernel.org/r/20200702082933.424537-2-leon@kernel.org
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Some released FW versions mistakenly don't set the capability that 50G per
lane link-modes are supported for VFs (ptys_extended_ethernet capability
bit).
Use PTYS.ext_eth_proto_capability instead, as this indication is always
accurate. If PTYS.ext_eth_proto_capability is valid
(has a non-zero value) conclude that the HCA supports 50G per lane.
Otherwise, conclude that the HCA doesn't support 50G per lane.
Fixes: 08e8676f16 ("IB/mlx5: Add support for 50Gbps per lane link modes")
Link: https://lore.kernel.org/r/20200707110612.882962-3-leon@kernel.org
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the initialization of the vendor_part_id to be before calling
ib_register_device(), this is needed because the query_device() callback
is called from the context of ib_register_device() before initializing the
vendor_part_id, so the reported value is wrong.
Fixes: bdcf26bf9b ("rdma/siw: network and RDMA core interface")
Link: https://lore.kernel.org/r/20200707130931.444724-1-kamalheib1@gmail.com
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Merge them into their callers, usually the only thing the caller did was
to call the one function, so this is clearer.
Link: https://lore.kernel.org/r/20200702081809.423482-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move DEVX initialization and cleanup flows to the devx.c instead of having
almost empty functions in main.c
Link: https://lore.kernel.org/r/20200702081809.423482-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move flow steering logic to be in separate file and rename flow.c to be
fs.c because it is better describe the content.
Link: https://lore.kernel.org/r/20200702081809.423482-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are number of counters types supported in mlx5_ib: HW counters,
congestion counters, Q-counters and flow counters. Almost all supporting
code was placed in main.c that made almost impossible to maintain the code
anymore. Let's create separate code namespace for the counters to easy
future generalization effort.
Link: https://lore.kernel.org/r/20200702081809.423482-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The restrack code has separate .c, so move callbacks initialization to
that file to improve code locality.
Link: https://lore.kernel.org/r/20200702081809.423482-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The mlx5_ib_enable_driver() is local function and doesn't need to be
shared in mlx5_ib, so change it's signature to have static keyword in it.
Link: https://lore.kernel.org/r/20200702081809.423482-2-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If hns ROCEE is set to level-0 addressing, the length of the entire buffer
can be used as the page size. The driver needn't to split the buffer into
small units because all pages are continuous.
Link: https://lore.kernel.org/r/1593525696-12570-1-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In the loopback tests, the following call trace occurs.
Call Trace:
__rxe_do_task+0x1a/0x30 [rdma_rxe]
rxe_qp_destroy+0x61/0xa0 [rdma_rxe]
rxe_destroy_qp+0x20/0x60 [rdma_rxe]
ib_destroy_qp_user+0xcc/0x220 [ib_core]
uverbs_free_qp+0x3c/0xc0 [ib_uverbs]
destroy_hw_idr_uobject+0x24/0x70 [ib_uverbs]
uverbs_destroy_uobject+0x43/0x1b0 [ib_uverbs]
uobj_destroy+0x41/0x70 [ib_uverbs]
__uobj_get_destroy+0x39/0x70 [ib_uverbs]
ib_uverbs_destroy_qp+0x88/0xc0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb9/0xf0 [ib_uverbs]
ib_uverbs_cmd_verbs+0xb16/0xc30 [ib_uverbs]
The root cause is that the actual RDMA connection is not created in the
loopback tests and the rxe_match_dgid will fail randomly.
To fix this call trace which appear in the loopback tests, skip check of
the dgid.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Link: https://lore.kernel.org/r/20200630123605.446959-1-leon@kernel.org
Signed-off-by: Zhu Yanjun <yanjunz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Update the code to allocate and free ib_xrcd structure in the
ib_core instead of inside drivers.
Link: https://lore.kernel.org/r/20200630101855.368895-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Expose UAPI to query MR, this will let user space application that
didn't allocate the MR but has access to by owning the matching command
FD to retrieve its information.
Link: https://lore.kernel.org/r/20200630093916.332097-8-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce UAPI to query PD attributes, this can be used to retrieve PD
attributes by having the PD handle of the created one and owning the
command FD for the ucontxet.
Link: https://lore.kernel.org/r/20200630093916.332097-7-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Implement the query ucontext functionality by returning the original
ucontext data as part of an extra mlx5 attribute that holds the driver
UAPI response.
Link: https://lore.kernel.org/r/20200630093916.332097-6-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Refactor mlx5_ib_alloc_ucontext() to set its response fields in a
cleaner way.
It includes,
- Move the relevant code to a self contained function.
- Calculate the response length once and drop redundant code all around.
- Reuse previously set ucontext fields once preparing the response.
The self contained function will be used in next patch as part of
implementing the query ucontext functionality.
Link: https://lore.kernel.org/r/20200630093916.332097-5-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Expose UAPI to query ucontext, this will let user space application that
didn't allocate the ucontext but has access to by owning the matching
command FD to retrieve the ucontext information.
Link: https://lore.kernel.org/r/20200630093916.332097-4-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Set IOVA on IB MR in uverbs layer to let all drivers have it, this
includes both reg/rereg MR flows.
As part of this change cleaned-up this setting from the drivers that
already did it by themselves in their user flows.
Fixes: e6f0330106 ("mlx4_ib: set user mr attributes in struct ib_mr")
Link: https://lore.kernel.org/r/20200630093916.332097-3-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Enable CQ ioctl commands by default, this functionality is fully mature
to be used over ioctl, no reason to maintain any more the EXP KCONFIG
entry to enable it.
Link: https://lore.kernel.org/r/20200630093916.332097-2-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Replace the mutex with read write semaphore and use xarray instead of
linked list for XRC target QPs. This will give faster XRC target
lookup. In addition, when QP is closed, don't insert it back to the xarray
if the destroy command failed.
Link: https://lore.kernel.org/r/20200706122716.647338-4-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_alloc_xrcd() already does the required initialization, so move the
uverbs to call it and save code duplication, while cleaning the function
argument lists of that function.
Link: https://lore.kernel.org/r/20200706122716.647338-3-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The mlx5_ib creates XRC domain and uses for creating internal SRQ.
However all that is needed is XRCD number and not full blown ib_xrcd
objects.
Update the code to get and store the number only.
Link: https://lore.kernel.org/r/20200706122716.647338-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allocating an MR flow can only be initiated by kernel users, and not from
userspace so a udata parameter is redundant.
Link: https://lore.kernel.org/r/20200706120343.10816-4-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allocating an MR flow can only be initiated by kernel users, and not from
userspace. As a result, the udata parameter is always being passed as
NULL. Rename ib_alloc_mr_user function to ib_alloc_mr and remove the udata
parameter.
Link: https://lore.kernel.org/r/20200706120343.10816-3-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The common kernel pattern is to check for error, not success. Flip the if
statement accordingly and keep the main flow unindented.
Link: https://lore.kernel.org/r/20200706120343.10816-2-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Michael Guralnik says:
====================
This series handles IPoIB child interface creation with setting
interface's HW address.
In current implementation, lladdr requested by user is ignored and
overwritten. Child interface gets the same GID as the parent interface and
a QP number which is assigned by the underlying drivers.
In this series we fix this behavior so that user's requested address is
assigned to the newly created interface.
As specific QP number request is not supported for all vendors, QP number
requested by user will still be overwritten when this is not supported.
Behavior of creation of child interfaces through the sysfs mechanism or
without specifying a requested address, stays the same.
====================
Based on the mlx5-next branch at
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
due to dependencies.
* branch 'mlx5_ipoib_qpn':
RDMA/ipoib: Handle user-supplied address when creating child
net/mlx5: Enable QP number request when creating IPoIB underlay QP
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use the address supplied by user when creating a child interface.
Previously, the address requested by the user was ignored and overridden
with parent's GID and the random QP number assigned to the child.
Link: https://lore.kernel.org/r/20200623110105.1225750-3-leon@kernel.org
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For debugging purpose it will be easier to understand if prefetch works
okay if it has its own counter. Introduce ODP prefetch counter and count
per MR the total number of prefetched pages.
In addition remove comment which is not relevant anymore and anyway not in
the correct place.
Link: https://lore.kernel.org/r/20200621104147.53795-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a race condition where ib_nl_make_request() inserts the request
data into the linked list but the timer in ib_nl_request_timeout() can see
it and destroy it before ib_nl_send_msg() is done touching it. This could
happen, for instance, if there is a long delay allocating memory during
nlmsg_new()
This causes a use-after-free in the send_mad() thread:
[<ffffffffa02f43cb>] ? ib_pack+0x17b/0x240 [ib_core]
[ <ffffffffa032aef1>] ib_sa_path_rec_get+0x181/0x200 [ib_sa]
[<ffffffffa0379db0>] rdma_resolve_route+0x3c0/0x8d0 [rdma_cm]
[<ffffffffa0374450>] ? cma_bind_port+0xa0/0xa0 [rdma_cm]
[<ffffffffa040f850>] ? rds_rdma_cm_event_handler_cmn+0x850/0x850 [rds_rdma]
[<ffffffffa040f22c>] rds_rdma_cm_event_handler_cmn+0x22c/0x850 [rds_rdma]
[<ffffffffa040f860>] rds_rdma_cm_event_handler+0x10/0x20 [rds_rdma]
[<ffffffffa037778e>] addr_handler+0x9e/0x140 [rdma_cm]
[<ffffffffa026cdb4>] process_req+0x134/0x190 [ib_addr]
[<ffffffff810a02f9>] process_one_work+0x169/0x4a0
[<ffffffff810a0b2b>] worker_thread+0x5b/0x560
[<ffffffff810a0ad0>] ? flush_delayed_work+0x50/0x50
[<ffffffff810a68fb>] kthread+0xcb/0xf0
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff816ec49a>] ? __schedule+0x24a/0x810
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
[<ffffffff816f25a7>] ret_from_fork+0x47/0x90
[<ffffffff810a6830>] ? kthread_create_on_node+0x180/0x180
The ownership rule is once the request is on the list, ownership transfers
to the list and the local thread can't touch it any more, just like for
the normal MAD case in send_mad().
Thus, instead of adding before send and then trying to delete after on
errors, move the entire thing under the spinlock so that the send and
update of the lists are atomic to the conurrent threads. Lightly reoganize
things so spinlock safe memory allocations are done in the final NL send
path and the rest of the setup work is done before and outside the lock.
Fixes: 3ebd2fd0d0 ("IB/sa: Put netlink request into the request list before sending")
Link: https://lore.kernel.org/r/1592964789-14533-1-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_unregister_device_queued() can only be used by drivers using the new
dealloc_device callback flow, and it has a safety WARN_ON to ensure
drivers are using it properly.
However, if unregister and register are raced there is a special
destruction path that maintains the uniform error handling semantic of
'caller does ib_dealloc_device() on failure'. This requires disabling the
dealloc_device callback which triggers the WARN_ON.
Instead of using NULL to disable the callback use a special function
pointer so the WARN_ON does not trigger.
Fixes: d0899892ed ("RDMA/device: Provide APIs from the core code to help unregistration")
Link: https://lore.kernel.org/r/0-v1-a36d512e0a99+762-syz_dealloc_driver_jgg@nvidia.com
Reported-by: syzbot+4088ed905e4ae2b0e13b@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The workqueue link_wq should only be destroyed when the hfi1 driver is
unloaded, not when the device is shut down.
Fixes: 71d47008ca ("IB/hfi1: Create workqueue for link events")
Link: https://lore.kernel.org/r/20200623204053.107638.70315.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Legacy IPoIB sets IB_QP_CREATE_NETIF_QP QP create flag and because mlx5
doesn't use this flag, the process_create_flags() failed to create IPoIB
QPs.
Fixes: 2978975ce7 ("RDMA/mlx5: Process create QP flags in one place")
Link: https://lore.kernel.org/r/20200630122147.445847-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Clang warns:
drivers/infiniband/hw/hfi1/qp.c:198:9: warning: implicit conversion from enumeration type 'enum opa_mtu' to different enumeration type 'enum ib_mtu' [-Wenum-conversion]
mtu = OPA_MTU_8192;
~ ^~~~~~~~~~~~
enum opa_mtu extends enum ib_mtu. There are typically two ways to deal
with this:
* Remove the expected types and just use 'int' for all parameters and
types.
* Explicitly cast the enums between each other.
This driver chooses to do the later so do the same thing here.
Fixes: 6d72344cf6 ("IB/ipoib: Increase ipoib Datagram mode MTU's upper limit")
Link: https://lore.kernel.org/r/20200623005224.492239-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/1062
Link: https://lore.kernel.org/linux-rdma/20200527040350.GA3118979@ubuntu-s3-xlarge-x86/
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ipoib_mcast_carrier_on_task() insanely open codes a rtnl_lock() such that
the only time flush_workqueue() can be called is if it also clears
IPOIB_FLAG_OPER_UP.
Thus the flush inside ipoib_flush_ah() will deadlock if it gets unlucky
enough, and lockdep doesn't help us to find it early:
CPU0 CPU1 CPU2
__ipoib_ib_dev_flush()
down_read(vlan_rwsem)
ipoib_vlan_add()
rtnl_trylock()
down_write(vlan_rwsem)
ipoib_mcast_carrier_on_task()
while (!rtnl_trylock())
msleep(20);
ipoib_flush_ah()
flush_workqueue(priv->wq)
Clean up the ah_reaper related functions and lifecycle to make sense:
- Start/Stop of the reaper should only be done in open/stop NDOs, not in
any other places
- cancel and flush of the reaper should only happen in the stop NDO.
cancel is only functional when combined with IPOIB_STOP_REAPER.
- Non-stop places were flushing the AH's just need to flush out dead AH's
synchronously and ignore the background task completely. It is fully
locked and harmless to leave running.
Which ultimately fixes the ABBA deadlock by removing the unnecessary
flush_workqueue() from the problematic place under the vlan_rwsem.
Fixes: efc82eeeae ("IB/ipoib: No longer use flush as a parameter")
Link: https://lore.kernel.org/r/20200625174219.290842-1-kamalheib1@gmail.com
Reported-by: Kamal Heib <kheib@redhat.com>
Tested-by: Kamal Heib <kheib@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
pcie_speeds() and restore_pci_variables() returns PCIBIOS_ error codes
from PCIe capability accessors.
PCIBIOS_ error codes have positive values. Passing on these values is
inconsistent with functions which return only a negative value on failure.
Before passing on the return value of PCIe capability accessors, call
pcibios_err_to_errno() to convert any positive PCIBIOS_ error codes to
negative generic error values.
Link: https://lore.kernel.org/r/20200615073225.24061-3-refactormyself@gmail.com
Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Bolarinwa Olayemi Saheed <refactormyself@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2020-06-25
This series contains updates to i40e driver and removes the individual
driver versions from all of the Intel wired LAN drivers.
Shiraz moves the client header so that it can easily be shared between
the i40e LAN driver and i40iw RDMA driver.
Jesse cleans up the unused defines, since they are just dead weight.
Alek reduces the unreasonably long wait time for a PF reset after reboot
by using jiffies to limit the maximum wait time for the PF reset to
succeed. Added additional logging to let the user know when the driver
transitions into recovery mode. Adds new device support for our 5 Gbps
NICs.
Todd adds a check to see if MFS is set after warm reboot and notifies
the user when MFS is set to anything lower than the default value.
Arkadiusz fixes a possible race condition, where were holding a
spin-lock while in atomic context.
v2: removed code comments that were no longer applicable in patch 2 of
the series. Also removed 'inline' from patch 4 and patch 8 of the
series. Also re-arranged code to be able to remove the forward
function declarations. Dropped patch 9 of the series, while the
author works on cleaning up the commit message.
v3: Updated patch 8 description to answer Jakub's questions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Move i40e_client.h to include/linux/net/intel/*
since its shared between i40iw and i40e.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The RWQ table is used for RSS uverbs and not in used for the kernel
consumers, delete ib_create_rwq_ind_table() routine that is not
called at all.
Link: https://lore.kernel.org/r/20200624105422.1452290-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>