Convert rdma_is_upper_dev_rcu, handle_netdev_upper and
ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a code conversion; no functional change is intended.
v2
- removed typecast of data
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYATXdAAoJELgmozMOVy/dn48P/2lBCAR7pJMU5AC4s1VZsYHr
A7ep5qpkmD5qGnnHNjLA2TIK/8lCy80ACt/HbV7588TxyZYpa+wIaQAdIyuUfUyS
HVdMTLMqdfYOdnPHNDiKKhdvw8Ty8gGlHsnxay32+m3WJtCPxsRObrciJO984lIk
DXKBsYuOQST5Df/1eHWSCPVUn5jHW4bKh7jPM1cs7CtFZ2bJHJQrKECm0SoKvj+3
3BNCg2gVRXeGwfX4KoSYf87nMJCCXBlNzBsqyVPjsB5teJjjk9mXV5y6qsHps9Hu
JrMjMPlvRzkUil8ZP5RiPHx29IlZypwudpswqM9cw6mxfsvvORYtYBD3BVC6Vt4A
WPVXGkx/sEO9XgbasuUJEL0ui4I3UR+lLP8MwefMiPteJ/lGdM/vydS9t57hvk9s
JeL/ep0Us70VX0VSEkc62RvYbKPcRk4qonF8liRq7nit3l45vL5YLvbTQeqe7pbI
CN0lBn83K9Z4GGwPqDzbD3pwiZ2wFV4VvrWXqOeyexT/kNi1iJlQcfNHJcUiI9vg
mkzxWvvWY+KieunrJQGWEQPkuD7fpFF77KFkIYSFVfkHBrSjc+n5a3lAY/xT8k6D
rixIl9ZhA8dMjkCzh0xqGHgEoldh4rO1ctpaTDLg3HsNkedctDEpyx4HFMhiXE2w
INAqVa/uOUC0a/uPlcWr
=Oifo
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma qedr RoCE driver from Doug Ledford:
"Early on in the merge window I mentioned I had a backlog of new
drivers waiting to be reviewed and that, in addition to the hns-roce
driver, I wanted to get possible a couple more reviewed. I ended up
only having the time to complete one of the additional drivers.
During Dave Miller's pull request this go around, there were a series
of 9 patches to the QLogic qed net driver that add basic support for a
paired RoCE driver. That support is currently not functional because
it is missing the matching RoCE driver in the RDMA subsystem. I
managed to finish that review. However, because it goes against part
of Dave's net pull, and a part that was accepted a day or two after
the merge window opened, to apply cleanly it has to be applied to
either the tip of Dave's net branch, or as I did in this case, I just
applied it to your master after you had taken Dave's pull request."
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
qedr: Add events support and register IB device
qedr: Add GSI support
qedr: Add LL2 RoCE interface
qedr: Add support for data path
qedr: Add support for memory registeration verbs
qedr: Add support for QP verbs
qedr: Add support for PD,PKEY and CQ verbs
qedr: Add support for user context verbs
qedr: Add support for RoCE HW init
qedr: Add RoCE driver framework
- Small patch set for hns net driver that the roce patches depend on
- Various fixes to the hns-roce driver
- Add connection manager support to the hns-roce driver
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYATZdAAoJELgmozMOVy/d1+4P/2UhXiXx7strrr5vYtFAdbdX
9j4jPbmnXgc4hFV1EET7UScdUwYW6iuoYCYa5sJUj6dcux2Ph/pYfPbE4Civld67
xMEISaI86GcEbFy3yqZ0vhDegyReb6wUDguzht1IHKqFwl5uvXBPJhZ0vmY4ZKXd
mVKNLH4FTMbqf4rGO64AmUyN7QIlLE17zO3Nolha6mytRj7RoYHEjP8RbZPTeN5J
58QpZjomO0uz1dvxRWwRBw2eEYgXMxKa3s4W8vYYcGimoKinzbqAHhrWOm0+klHA
Nd3AFqEVDTxYxqZYSBLvhvCT4d9/vgb/Tsf+IB07qVDoM6iv2W2WM17xq9w7vitv
4w7tClX9cvAWX35k3TAhQBkN2QJhaWY9bK9JwTB/AFxQXM2gG1/2f77hi72jdsR4
kcptopV/vZSMqjobfoVe5/ac1qUxv7HM+tAN/+9j7qU3TNvn5+R7d+UBDKrbiP1c
EW5kdffRY3evemdRh/zHfUyuQzr5l/GR4vQ9gLJIBu+ZK3o1d1JNUjKNwwlzOl0r
BbvYvWJ23Na6FTjpNFOTgc3y7K4zSXlGVeHObtqg0ejlWsCU9xu+MMay9tRLy2LI
CQxr81WQbMvcEnfad2yqSUuFAAhut85Q3qYERPGDy78aiF+gNNDZLitwmjU3Q9q8
F7apPH39H41lEzOLfsMr
=PmmI
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma updates from Doug Ledford:
"This merge window was the first where Huawei had to try and coordinate
their patches between their net driver and their new roce driver
(similar to mlx4 and mlx5).
They didn't do horribly, but there were some issues (and we knew that
because they simply didn't know what to do in the beginning). As a
result, I had a set of patches that depended on some patches that
normally would have come to you via Dave's tree. Those patches have
been on netdev@ for a while, so I got Dave to give me his approval to
send them to you. As such, the other 29 patches I had behind them are
also now ready to go.
This catches the hns and hns-roce drivers up to current, and for
future patches we are working with them to get them up to speed on how
to do joint driver development so that they don't have these sorts of
cross tree dependency issues again. BTW, Dave gave me permission to
add his Acked-by: to the patches against the net tree, but I've had
this branch through 0day (but not linux-next since it was off by
itself) and I didn't want to rebase the series just to add Dave's ack
for the 8 patches in the net area.
Updates to the hns drivers:
- Small patch set for hns net driver that the roce patches depend on
- Various fixes to the hns-roce driver
- Add connection manager support to the hns-roce driver"
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (36 commits)
IB/hns: Fix for removal of redundant code
IB/hns: Delete the redundant lines in hns_roce_v1_m_qp()
IB/hns: Fix the bug when platform_get_resource() exec fail
IB/hns: Update the rq head when modify qp state
IB/hns: Cq has not been freed
IB/hns: Validate mtu when modified qp
IB/hns: Some items of qpc need to take user param
IB/hns: The Ack timeout need a lower limit value
IB/hns: Return bad wr while post send failed
IB/hns: Fix bug of memory leakage for registering user mr
IB/hns: Modify the init of iboe lock
IB/hns: Optimize code of aeq and ceq interrupt handle and fix the bug of qpn
IB/hns: Delete the sqp_start from the structure hns_roce_caps
IB/hns: Fix bug of clear hem
IB/hns: Remove unused parameter named qp_type
IB/hns: Simplify function of pd alloc and qp alloc
IB/hns: Fix bug of using uninit refcount and free
IB/hns: Remove parameters of resize cq
IB/hns: Remove unused parameters in some functions
IB/hns: Add node_guid definition to the bindings document
...
Adds a skeletal implementation of the qed* RoCE driver -
basically the ability to communicate with the qede driver and
receive notifications from it regarding various init/exit events.
Signed-off-by: Rajesh Borundia <rajesh.borundia@cavium.com>
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
After the commit 9207f9d45b ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroy the IPoIB address information cached there,
causing a severe performance regression, as better described here:
http://marc.info/?l=linux-kernel&m=146787279825501&w=2
This change moves the data cached by the IPoIB driver from the
skb control lock into the IPoIB hard header, as done before
the commit 936d7de3d7 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issue, on packet reception, the IPoIB driver
stash into the skb a dummy pseudo header, so that the received
packets have actually a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.
After this commit, IPoIB performances are back to pre-regression
value.
v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size
Fixes: 9207f9d45b ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
- Updates to mlx5
- Updates to mlx4 (two conflicts, both minor and easily resolved)
- Updates to iw_cxgb4 (one conflict, not so obvious to resolve, proper
resolution is to keep the code in cxgb4_main.c as it is in Linus'
tree as attach_uld was refactored and moved into cxgb4_uld.c)
- Improvements to uAPI (moved vendor specific API elements to uAPI area)
- Add hns-roce driver and hns and hns-roce ACPI reset support
- Conversion of all rdma code away from deprecated
create_singlethread_workqueue
- Security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
staging)
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJX+AwSAAoJELgmozMOVy/d0WkQAKxPzVccMWwHv28iZI4ey13u
JwE+VoCNpCAZAVuEgzK5zzFdNHPvAk2jU93H4apA7dfXJBXPatVuj9Lnk+ieEEnW
tbFwJjBpbQ3Zol3+SPfAHnsVMbtax+xmd6WDKExPXXEDl1L6rutwL3KKfmgWEitg
ysX7XOJCiSdyM0hcg4T6UPB9a3jGPff9NLu0oGamV+yoUk5Y0WGoVFxHZ4MKcw8t
OkFBYIxGz4SGwq2tulStuH03HteURX594KngtrA8dyq6l1R2GlGRv+bkJAUEIWUv
aA0ow3VWusOM6fT+jLXPCv8iUwIXM8tR/U6F7X+cmORUUtWvCl+uCUVid113j/aN
BK+Af2nJnfoJ5cDBPsD+bC76l5gQycNZO/Qh8op2kmgJtD+6OpGM3cBXsHx53+kk
0wloJ2lKCGShWxNj+ig8n8rR/rhhs/x3vV3ouCVWNMbOUgOSN3eYHxmK3wGFW4nd
Qx+WYCjj9Yi/J6nmUDcfEQ4NWPR22Q2+0ENAabfhLhV6mDloAO5ILHd4GDqC3IA9
UtxlVjf4ZonaiLnTQQzCnDMGVVk6tT8FJ9D42s0ScwjbdYwjyCW9/rs/g2EhcprR
Cc+AmjqLviCWGtzBSFO0SijqQon8lcQOwdLw61CdFFvPa/mlLdf1rbx9ArIyNVKn
JSrbr3CGyoqyYj6qaEO5
=LC+S
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull main rdma updates from Doug Ledford:
"This is the main pull request for the rdma stack this release. The
code has been through 0day and I had it tagged for linux-next testing
for a couple days.
Summary:
- updates to mlx5
- updates to mlx4 (two conflicts, both minor and easily resolved)
- updates to iw_cxgb4 (one conflict, not so obvious to resolve,
proper resolution is to keep the code in cxgb4_main.c as it is in
Linus' tree as attach_uld was refactored and moved into
cxgb4_uld.c)
- improvements to uAPI (moved vendor specific API elements to uAPI
area)
- add hns-roce driver and hns and hns-roce ACPI reset support
- conversion of all rdma code away from deprecated
create_singlethread_workqueue
- security improvement: remove unsafe ib_get_dma_mr (breaks lustre in
staging)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (75 commits)
staging/lustre: Disable InfiniBand support
iw_cxgb4: add fast-path for small REG_MR operations
cxgb4: advertise support for FR_NSMR_TPTE_WR
IB/core: correctly handle rdma_rw_init_mrs() failure
IB/srp: Fix infinite loop when FMR sg[0].offset != 0
IB/srp: Remove an unused argument
IB/core: Improve ib_map_mr_sg() documentation
IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
IB/mthca: Move user vendor structures
IB/nes: Move user vendor structures
IB/ocrdma: Move user vendor structures
IB/mlx4: Move user vendor structures
IB/cxgb4: Move user vendor structures
IB/cxgb3: Move user vendor structures
IB/mlx5: Move and decouple user vendor structures
IB/{core,hw}: Add constant for node_desc
ipoib: Make ipoib_warn ratelimited
IB/mlx4/alias_GUID: Remove deprecated create_singlethread_workqueue
IB/ipoib_verbs: Remove deprecated create_singlethread_workqueue
IB/ipoib: Remove deprecated create_singlethread_workqueue
...
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJX+BuzAAoJELgmozMOVy/dUvwP/juvXUJg7Xh3N0PXo0+Qb8Xc
TG3a8EdumYZtDF8jualKiuJboGUdkMN4EDWhfdPxxW3lGBc8EQP7vac2H3NCcE3L
ntRYrEeaDF8x5eUSyIQf2hs9k7HVQ+gkR8KqbeVUFh8iCwNnQ+wjvBXB+WUcqioW
mp+S0//v8t1r1zPyUE7NgBo8EBVWzapiiGrDObNMke2oZxa4CsKsPCUMzSr9USW+
/oSQBqcAZQ6LL8ykvx2oov2jt+bcWKco5fxRb8R7+buZknC4jX98f70WkLUne1+q
oG3fKf1cUmORAjWelNDSxfwz8uWzkZifHqa+5UQHKVR9PcFJIQUKuW/5InMxkU3+
HG+t7YiwCKOX7+99zUC5GHWFilmAOJma567VzfEgzcJjNeRtoCVG0a7smACWKA/y
JWUoLUQd5c9dvg46CDG+qqk79tINYmRdtgvJmht2AhCAczRYI8iMIaw9Fh7yC5Dg
JvUVjLLVtdNjJSP0ixY6RJB4aQVbdVSzo4UUJKXQkaJM6KWRd5HYGcr8rUXn8HEf
PjHNcPEByOJBfH2Fg759nKDPaVAN4r/8nDHZpLfEfnzzEcXWIUk4vDcCYHxN69RF
8EZZ681c4YE/WW49Jr/JulKKRsrZn1xB4FJF5gJmAOvb0WHoJPf82YEr3kcUUOQK
HaECbnzwYqGRQe5tNFiD
=/P6a
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma updates from Doug Ledford:
"Minor updates for rxe driver"
[ Starting to do merge window pulls again - the current -git tree does
appear to have some netfilter use-after-free issues, but I've sent
off the report to the proper channels, and I don't want to delay merge
window activity any more ]
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/rxe: improved debug prints & code cleanup
rdma_rxe: Ensure rdma_rxe init occurs at correct time
IB/rxe: Properly honor max IRD value for rd/atomic.
IB/{rxe,core,rdmavt}: Fix kernel crash for reg MR
IB/rxe: Fix sending out loopback packet on netdev interface.
IB/rxe: Avoid scheduling tasklet for userspace QP
When processing a REG_MR work request, if fw supports the
FW_RI_NSMR_TPTE_WR work request, and if the page list for this
registration is <= 2 pages, and the current state of the mr is INVALID,
then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
to write. This avoids FW having to do an async read of the TPTE blocking
the SQ until the read completes.
To know if the current MR state is INVALID or not, iw_cxgb4 must track the
state of each fastreg MR. The c4iw_mr struct state is updated as REG_MR
and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
and when RECV completions are processed that include a local invalidation.
This optimization increases small IO IOPS for both iSER and NVMF.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Avoid that mapping an sg-list in which the first element has a
non-zero offset triggers an infinite loop when using FMR. This
patch makes the FMR mapping code similar to that of ib_sg_to_pages().
Note: older Mellanox HCAs do not support non-zero offsets for FMR.
See also commit 8c4037b501 ("IB/srp: always avoid non-zero offsets
into an FMR").
Reported-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Document that ib_map_mr_sg() is able to map physically discontiguous
sg-lists as a single MR. Change IB_MR_TYPE_SG_GAPS_REG into
IB_MR_TYPE_SG_GAPS.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@rimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In MLX qp packets, the LRH (built by the driver) has both a VL field
and an SL field. When building a QP1 packet, the VL field should
reflect the SLtoVL mapping and not arbitrarily contain zero (as is
done now). This bug causes credit problems in IB switches at
high rates of QP1 packets.
The fix is to cache the SL to VL mapping in the driver, and look up
the VL mapped to the SL provided in the send request when sending
QP1 packets.
For FW versions which support generating a port_management_config_change
event with subtype sl-to-vl-table-change, the driver uses that event
to update its sl-to-vl mapping cache. Otherwise, the driver snoops
incoming SMP mads to update the cache.
There remains the case where the FW is running in secure-host mode
(so no QP0 packets are delivered to the driver), and the FW does not
generate the sl2vl mapping change event. To support this case, the
driver updates (via querying the FW) its sl2vl mapping cache when
running in secure-host mode when it receives either a Port Up event
or a client-reregister event (where the port is still up, but there
may have been an opensm failover).
OpenSM modifies the sl2vl mapping before Port Up and Client-reregister
events occur, so if there is a mapping change the driver's cache will
be properly updated.
Fixes: 225c7b1fee ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves mthca vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libmthca) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves nes vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libmlx4) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves ocrdma vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libmlx4) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
In addition, it changes types to be __uXX instead of uXX.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves mlx4 vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libmlx4) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves cxgb4 vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libcxgb4) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch moves cxgb3 vendor's specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libcxgb3) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch decouples and moves vendors specific structures to
common UAPI folder which will be visible to all consumers.
These structures are used by user-space library driver
(libmlx5) and currently manually copied to that library.
This move will allow cross-compile against these files and
simplify introduction of vendor specific data.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In certain cases it's possible to be flooded by warning messages. To
cope with such situations make the ipoib_warn macro be ratelimited.
To prevent accidental limiting of legitimate, bursty messages make
the limit fairly liberal by allowing up to 100 messages in 10 seconds.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues work item that maps to alias_guid_work.
It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues mulitple work items viz &priv->restart_task,
&priv->cm.rx_reap_task, &priv->cm.skb_task, &priv->neigh_reap_task,
&priv->ah_reap_task, &priv->mcast_task and &priv->carrier_on_task.
The work items require strict execution ordering.
Hence, an ordered dedicated workqueue has been used.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() replaces deprecated
create_singlethread_workqueue().
The workqueue "ipoib_workqueue" that is used for all flush operations
for the device.
WQ_MEM_RECLAIM has been set since the flush operations may need to
complete in order for other network functions to continue, and
the memory reclaim operation might need the network functioning in
order to make progress.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() replaces deprecated
create_singlethread_workqueue().
The workqueue "event_wq" queues work item &event->event_work and the
workqueue "disconn_wq" queues work item work (maps to
g_cm_core->disconn_wq).
WQ_MEM_RECLAIM has not been set since the workqueues are not being used
on a memory reclaim path.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "mcg_wq" queues work items &group->work
and &group->timeout_work.
The workqueue "clean_wq" queues work item mcg_clean_task.
Both have been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues work item &ctx->work and the workqueue "ud_wq"
queues work item &dm[i]->work.
Both the workqueues have been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "wq" queues work items &dm[i]->work, &ew->work.
It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "mlx5_ib_page_fault_wq" queues work item &qp_pfault->work.
It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "cache->wq" queues work items &ent->work (maps to
cache_work_func) and &ent->dwork(maps to delayed_cache_work_func).
It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "event_wq" is involved in event handling and queues
i40iw_cm_event_handler.
The workqueue "disconn_wq" is involved in closing connection and queues
i40iw_disconnect_worker.
Both workqueues have been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "virtchnl_wq" queues work items i40iw_cqp_generic_worker
and i40iw_cqp_manage_hmc_fcn_worker. It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "catas_wq" in triggering a device remove and causing a
device reset when a catastrophic error occurs. It has been identity
converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "workq" queues work item &skb_work. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "qib" queues work item &priv->s_work. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "workq" queues work item &skb_work. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "iwcm_wq" queues work item &work(maps to cm_work_handler).
It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The workqueue "addr_wq" queues a single work item &work and hence
doesn't require ordering. Also, it is being used on a memory reclaim
path. Hence, it has been converted to use alloc_workqueue with
WQ_MEM_RECLAIM set.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "cma_wq" queues work item cma_work_handler. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "close_wq" queues work items &ctx->close_work (maps to
ucma_close_id) and &con_req_eve->close_work (maps to
ucma_close_event_id). It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "mcast_wq" queues work item &group->work. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The workqueue "ib_nl" queues work items &ib_nl_timed_work and
&mad_agent_priv->local_work. It has been identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under
memory pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
deprecated create_singlethread_workqueue(). This is the identity
conversion.
The workqueue "ib_nl" queues work item &ib_nl_timed_work. It has been
identity converted.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When LAG is active, QP tx affinity (the physical port
to which a QP is affined, or the TIS in case of raw-eth)
is set in a round robin fashion during state transition
from RESET to INIT.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB bond device name is now 'mlx5_bond_X', instead of
'mlx5_X'.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When LAG is active, port up/down events should be triggered
by tracking the LAG master, and not one of the two slave
netdevs.
In the same manner, ib_query_port() should return the details
of the LAG master.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This is done in two steps:
1) Issuing CREATE_VPORT_LAG in order to have Ethernet traffic from
both ports arriving on PF0 root flowtable, so we will be able to catch
all raw-eth traffic on PF0.
2) Creation of LAG demux flowtable in order to direct all non-raw-eth
traffic back to its source port, assuring that normal Ethernet
traffic "jumps" to the root flowtable of its RX port (non-LAG behavior).
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since ib_query_port() in RoCE returns the state of its netdev as the port
state, it makes sense to propagate the port up/down events to ib_core
when the netdev port state changes, instead of relying on traditional
core events.
This also keeps both the event and ib_query_port() synchronized.
Signed-off-by: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Track asynchronous events on a receive work queue by using the
mlx5_core_create_rq_tracked API.
In case a fatal error has occurred letting the IB layer know about by
using the ib_wq event handler.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to receive Traffic Class, specific IPv6 protocol
or IPv6 flow label.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to receive TOS or specific IPv4 protocol.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add the following fields to IPv6 flow filter specification:
1. Traffic Class
2. Flow Label
3. Next Header
4. Hop Limit
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Flow steering specifications structures were implemented as in an
extensible way that allows one to add new filters and new fields
to existing filters.
These specifications have never been extended, therefore the
kernel flow specifications size and the user flow specifications size
were must to be equal.
In downstream patch, the IPv4 flow specifications type is extended to
support TOS and TTL fields.
To support an extension we change the flow specifications size
condition test to be as following:
* If the user flow specifications is bigger than the kernel
specifications, we verify that all the bits which not in the kernel
specifications are zeros and the flow is added only with the kernel
specifications fields.
* Otherwise, we add flow rule only with the user specifications fields.
User space filters must be aligned with 32bits.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add validation check that all set fields in flow specification
are supported by vendor.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add validation check that all set fields in flow specification
are supported by vendor.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add support to create sniffer rule. This rule receive all
incoming and outgoing packets from the port.
A user could create such rule by using IB_FLOW_ATTR_SNIFFER type.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move the reference count increasing of flow table to be in
create_flow_rule, it will increase the reference count for each rule
creation and not for each flow.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix covertiy warning of passing "&flow_attr" to function
"create_flow_rule" which uses it as an array.
In addition pass flow attributes argument as const.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Saving the flow table priority object's pointer in the flow handle
is necessary for downstream patches since the sniffer flow table isn't
placed at the standard flow_db structure but in a different database.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Counters weren't updated due to raw packet QPs' traffic since the
counter-id was not associated with the QP. Added support for
associating the q-counter-id with the raw packet QP. The attachment
is done only when changing RQ raw packet QP state from RST to INIT
in modify-RQ command. FW support is required for the above, without
this support raw packet QP counters will not count.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Added a struct for modifying raw QP, this will allow modifying
multiple parameters in raw packet QP RQ and can also be used for
SQ in the future.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose RSS related capabilities on both IB and vendor channels.
In addition to the IB capabilities the driver reports some extra
capabilities on its vendor channel:
- Bit mask of the supported types of hash functions.
- Bit mask of the supported RX fields that can participate
in the RX hashing.
Those capabilities are applicable only when the link layer
is Ethernet.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Query RSS related attributes and return them to user-space via the
extended query device uverbs command.
It includes both direct ones (i.e. struct ib_uverbs_rss_caps) and
max_wq_type_rq which may be used in both RSS and non RSS flows.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
1. Debugging qp state transitions and qp errors in loopback and
multiple QP tests is difficult without qp numbers in debug logs.
This patch adds qp number to important debug logs.
2. Instead of having rxe: prefix in few logs and not having in
few logs, using uniform module name prefix using pr_fmt macro.
3. Code cleanup for various warnings reported by checkpatch for
incomplete unsigned data type, line over 80 characters, return
statements.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a problem when CONFIG_RDMA_RXE=y and CONFIG_IPV6=y. This
results in the rdma_rxe initialization occurring before the IPv6
services are ready. This patch delays the initialization of rdma_rxe
until after the IPv6 services are ready. This fix is based on one
proposed by Logan Gunthorpe on a much older code base.
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch fixes below kernel crash on memory registration for rxe
and other transport drivers which has dma_ops extension.
IB/core invokes ib_map_sg_attrs() in generic manner with dma attributes
which is used by mlx5 and mthca adapters. However in doing so it
ignored honoring dma_ops extension of software based transports for
sg map/unmap operation. This results in calling dma_map_sg_attrs of
hardware virtual device resulting in crash for null reference.
We extend the core to support sg_map/unmap_attrs and transport drivers
to implement those dma_ops callback functions.
Verified usign perftest applications.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81032a75>] check_addr+0x35/0x60
...
Call Trace:
[<ffffffff81032b39>] ? nommu_map_sg+0x99/0xd0
[<ffffffffa02b31c6>] ib_umem_get+0x3d6/0x470 [ib_core]
[<ffffffffa01cc329>] rxe_mem_init_user+0x49/0x270 [rdma_rxe]
[<ffffffffa01c793a>] ? rxe_add_index+0xca/0x100 [rdma_rxe]
[<ffffffffa01c995f>] rxe_reg_user_mr+0x9f/0x130 [rdma_rxe]
[<ffffffffa00419fe>] ib_uverbs_reg_mr+0x14e/0x2c0 [ib_uverbs]
[<ffffffffa003d3ab>] ib_uverbs_write+0x15b/0x3b0 [ib_uverbs]
[<ffffffff811e92a6>] ? mem_cgroup_commit_charge+0x76/0xe0
[<ffffffff811af0a9>] ? page_add_new_anon_rmap+0x89/0xc0
[<ffffffff8117e6c9>] ? lru_cache_add_active_or_unevictable+0x39/0xc0
[<ffffffff811f0da8>] __vfs_write+0x28/0x120
[<ffffffff811f1239>] ? rw_verify_area+0x49/0xb0
[<ffffffff811f1492>] vfs_write+0xb2/0x1b0
[<ffffffff811f27d6>] SyS_write+0x46/0xa0
[<ffffffff814f7d32>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Both prepare4 and prepare6 sets loopback mask in pkt_info structure
instance of skb. The xmit_packet and other requester side functions
use a pkt_info struct from the stack without the proper mask. This
results in sending out the packet to the actual netdev device and
loopback functionality is broken.
Modify prepare() to pass its correctly marked pkt_info struct to
prepare4() and prepare6() instead of them using SKB_TO_PKT(skb) and
getting an incorrectly set mask.
Verified with perftest applications.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch avoids scheduing tasklet for WQE and protocol processing
for user space QP. It performs the task in calling process context.
To improve code readability kernel specific post_send handling moved to
post_send_kernel() function.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull networking updates from David Miller:
1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
co. at Google. https://lwn.net/Articles/701165/
2) Do TCP Small Queues for retransmits, from Eric Dumazet.
3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
Starovoitov.
4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.
5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.
6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.
7) Support ndo_poll_controller in mlx5, from Calvin Owens.
8) Move VRF processing to an output hook and allow l3mdev to be
loopback, from David Ahern.
9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.
10) Congestion control in RXRPC, from David Howells.
11) Support geneve RX offload in ixgbe, from Emil Tantilov.
12) When hitting pressure for new incoming TCP data SKBs, perform a
partial rathern than a full purge of the OFO queue (which could be
huge). From Eric Dumazet.
13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.
14) Support RX network flow classification to igb, from Gangfeng Huang.
15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.
16) New skbmod packet action, from Jamal Hadi Salim.
17) Remove some inefficiencies in snmp proc output, from Jia He.
18) Add FIB notifications to properly propagate route changes to
hardware which is doing forwarding offloading. From Jiri Pirko.
19) New dsa driver for qca8xxx chips, from John Crispin.
20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
Żenczykowski.
21) Add L3 mode to ipvlan, from Mahesh Bandewar.
22) Support 802.1ad in mlx4, from Moshe Shemesh.
23) Support hardware LRO in mediatek driver, from Nelson Chang.
24) Add TC offloading to mlx5, from Or Gerlitz.
25) Convert various drivers to ethtool ksettings interfaces, from
Philippe Reynes.
26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.
27) NAPI support for ath10k, from Rajkumar Manoharan.
28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.
29) UDP replicast support in TIPC, from Richard Alpe.
30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.
31) Support BQL in thunderx driver, from Sunil Goutham.
32) TSO support in alx driver, from Tobias Regnery.
33) Add stream parser engine and use it in kcm.
34) Support async DHCP replies in ipconfig module, from Uwe
Kleine-König.
35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
mlxsw: switchx2: Fix misuse of hard_header_len
mlxsw: spectrum: Fix misuse of hard_header_len
net/faraday: Stop NCSI device on shutdown
net/ncsi: Introduce ncsi_stop_dev()
net/ncsi: Rework the channel monitoring
net/ncsi: Allow to extend NCSI request properties
net/ncsi: Rework request index allocation
net/ncsi: Don't probe on the reserved channel ID (0x1f)
net/ncsi: Introduce NCSI_RESERVED_CHANNEL
net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
net: Add netdev all_adj_list refcnt propagation to fix panic
net: phy: Add Edge-rate driver for Microsemi PHYs.
vmxnet3: Wake queue from reset work
i40e: avoid NULL pointer dereference and recursive errors on early PCI error
qed: Add RoCE ll2 & GSI support
qed: Add support for memory registeration verbs
qed: Add support for QP verbs
qed: PD,PKEY and CQ verb support
qed: Add support for RoCE hw init
qede: Add qedr framework
...
- Updates to hfi1 driver
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJX87HPAAoJELgmozMOVy/dBToP/jb9mSa7SzrCWaBvAovw7oK2
mEnETqHkV8fYa97SiuOFnPOQsK+fWSOgC6oL0I7JiK5BC5hpovTF8gDupN4x1q2v
4akTaAMvHwwjuXitA+EFNyCJWnt3jQDRVHE0WDRWeNMICXs1JD+xS5KzbRbZgWqQ
7fZjzUcT5uChL7i62GwjqvMPkp/s6w3PthtbxQerbikYVRvRkbU4LOAARXVfgjFM
EfslY8hiQFKRDZ20eWgkzPGKXEdCgacjv0Ev1NMzpdeHZFtHn+zw4xJ70VGm9ukc
IMKVNjbYN2Xa1hSihpxDD5ZauPxChCG6t/IKs4Bxtiodb/vnmKX5vswwdUsgpGP/
oOVixvQO8TPdoKXIB7wotfGDKLWvwd0dIhRLgmLtPj7jdLTejDAPran7/5GOSF6o
ecsj0rTsQ343yWPjIgVg8ShtSW+rVgXQcOFnoOwJUqiptsNFUZJpk6OA3tx0crOM
7lLUOezb6BI99XiBBF3jN27Zd/QEGGaCKIkkfo+laSM5LSzn9VReVFvwTlaXLXx+
AwLhyaEVgYnCsfy1DiIQIgKIkXnYiLfKEd65tVo7bGDOnMaD4e2zDux2tcd+/NK+
lz0NaJ5Xuk+zOvrSG7Jw5bFnVhzghviDUJ9EI38YXhtRTYnYLPA5lpoCj+/BMgCo
hPxlualfI+vd69dY/C5H
=o9FO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull hdi1 rdma driver updates from Doug Ledford:
"This is the first pull request of the 4.9 merge window for the RDMA
subsystem. It is only the hfi1 driver. It had dependencies on code
that only landed late in the 4.7-rc cycle (around 4.7-rc7), so putting
this with my other for-next code would have create an ugly merge of
lot of 4.7-rc stuff. For that reason, it's being submitted
individually. It's been through 0day and linux-next"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (37 commits)
IB/rdmavt: Trivial function comment corrected.
IB/hfi1: Fix trace of atomic ack
IB/hfi1: Update SMA ingress checks for response packets
IB/hfi1: Use EPROM platform configuration read
IB/hfi1: Add ability to read platform config from the EPROM
IB/hfi1: Restore EPROM read ability
IB/hfi1: Document new sysfs entries for hfi1 driver
IB/hfi1: Add new debugfs sdma_cpu_list file
IB/hfi1: Add irq affinity notification handler
IB/hfi1: Add a new VL sysfs attribute for sdma engines
IB/hfi1: Add sysfs interface for affinity setup
IB/hfi1: Fix resource release in context allocation
IB/hfi1: Remove unused variable from devdata
IB/hfi1: Cleanup tasklet refs in comments
IB/hfi1: Adjust hardware buffering parameter
IB/hfi1: Act on external device timeout
IB/hfi1: Fix defered ack race with qp destroy
IB/hfi1: Combine shift copy and byte copy for SGE reads
IB/hfi1: Do not read more than a SGE length
IB/hfi1: Extend i2c timeout
...
This patch removes the redundant code lines present in the
functions get_send_wqe() and get_recv_wqe(). This also fixes
the error in calculating the SQ WQE.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It doesn't need to assign for the filed of qp state in qpc separately
when qp happen to migrate state which supported in RoCE engine v1.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch mainly fixes the bug with platform_get_resource().
It should return NULL when platform_get_resource() exec fail.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The rq head in qpc was zero will miss the rq wqes which
have be sent, so here we should take the real value.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cq has not been freed when fail to ib_copy_to_udata, so need to
free it.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Peter Chen <luck.chen@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mtu should be validated when modify qp,so we check it.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Peter Chen <luck.chen@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some items of qpc need to take user param when modified qp
state.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The Ack timeout of qpc need a lower limit value,otherwise
the read performance will be very lower.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While post failed, hns roce should return the wr failed to user.
We omitted this while qp type is wrong and fixed it.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While the page size attribute of umem is illegal, we should release
umem that get by ib_umem_get interface.
Also, we should return a non-zero value while pbl number is wrong.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This lock will be used in query port interface, and will be called
while IB device was registered to OFED framework/IB Core. So, the
lock of iboe must be initiated before IB device was registered.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch optimized the codes of aeq and ceq interrupt handle
and fixed the bug in the calculation of qpn. For the special
qp(GSI or SMI), calculated the qp number according to physical
port and the qpn reported in the event of async event queue.
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch deleted the sqp_start from the structure hns_roce_caps, and
modified the calculation of the qp number.
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In hip06, there's no interface to release hem memory. So, hardware can't
identify whether hem memory released or not.
If all context in a hem memory released, the related hem memory will be
released by driver and reused by others. But, hardware don't know that
this memory can't be used already.
In order to fix this bug, hns roce driver reserved 128K memory for each
type of hem(QPC/CQC/MTPT). While unmap hem memory, hns roce driver will
write base address of reserved memory according to hem type.
Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Hns_roce_pd_alloc and hns_roce_reserve_range_qp use unnecessary
transformation of parameters. This patch simplify these two
functions.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In current version, it uses uninitialized parameters named
refcount and free in hns_roce_cq_event.
This patch initializes these parameter in cq alloc and add
correspond process in cq free.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In old version of RoCE, it doesn't support to resize cq.
So, we remove parameters related to resize cq.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Dongdong Huang(Donald) <hdd.huang@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The parameter named collapsed unused in hns_roce_cq_alloc.
Also, parameter named doorbell_lock unsed in
hns_roce_v1_cq_set_ci. This patch optimize these parameters.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch mainly modify the value of HNS_ROCE_SL_SHIFT
and delete the lines for assigning for the field of
local_enable_e2e_credit in QP1C.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix bug of modify qp from init to init on user mode. Otherwise,
it will oops when rmda cm established.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch mainly modifies the logic for allocating uar registers.
In HiP06 SoC, HW has 8 group of uar registers for kernel and
user space application. The uar index is assigned as follows:
0 ------ for kernel
1~7 ------ for user space application
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch mainly adds phy_port to HNS RoCE QP. This shall be
used in calculating the GSI QPN for the port.
Initally when RDMA is being established, all IB ports share a
QPN which later needs to be re-assigned to a particular GSI/QPN
and which is per-port.
This also fixes a bug in base driver where iboe port was being
used instead of phy_port at some places. This values might not
be same always.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix the length of wqe that maybe lead to an error and
write the end bytes of QP1C into the register.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the latest IB core version, it has some known issues
with memory registration using the local_dma_lkey.
Thus RoCE don't expose support for it, and remove
device->local_dma_lkey which is introduced to working systems.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
According to the Infiniband spec, NodeGUID uniquely identifies a
node. This must be initialized to some unique value. This patch
adds the support to the HNS RoCE driver to fetch the NodeGUID
value from DT or ACPI and then use this value to initialize the
node_guid parameter of IB device. This value shall be used by
RDMA CM.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds get_netdev() function to the IB device. This shall be
used to fetch netdev corresponding to the port number. This function
would be called by IB core(Generic CM Agent) for example, when the
RDMA connection is being established.
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Reviewed-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Corrected function name in comment from qib_ to rvt_.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The length is incorrect, causing the trace data to
be truncated.
Add the additional 8 bytes that should have been there.
Also trace out the atomic ack in hex to aid debugging.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver will now try to read directly from the EPROM as its
first choice for the platform configuration file.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a function to read the platform configuration file from
the EPROM.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Partially revert commit d079031742 ("IB/hfi1: Remove
EPROM functionality from data device"), bringing back
the ability to read from the EPROM.
This code will be used for driver-only acccess to the EPROM, hence
change EPROM read to save to a buffer instead of copy touser. Also
allow any offset and remove missed includes and leftover declarations.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a debugfs sdma_cpu_list file that can be used to examine the CPU to
sdma engine assignments for the whole device.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds an irq affinity notification handler.
When a user changes interrupt affinity settings for an sdma engine,
the driver needs to make changes to its internal sde structures and
also update the affinity_hint.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds a read-only "VL" attribute for the sysfs entry of each
sdma engine. It will allow the user to check VL to sdma engine mappings.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some users want more control over which cpu cores are being used by the
driver. For example, users might want to restrict the driver to some
specified subset of the cores so that they can appropriately partition
processes, irq handlers, and work threads.
To allow the user to fine tune system affinity settings new sysfs
attributes are introduced per sdma engine. This patch adds a new
attribute type for sdma engine and a new cpu_list attribute.
When the user writes a cpu range to the cpu_list attribute the driver
will create an internal cpu->sdma map, which will be used later as a
look-up table to choose an optimal engine for a user requests.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Correct resource free in allocate_ctxt() function.
When context creation fails allocated resources are properly
released and pointer in receive context data table is set back
to NULL.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We no longer use an error tasklet. Remove it from the hfi1_devdata
structure.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The code no longer uses tasklets for the send engine. However it does
use a tasklet for sdma but the send routines use a workqueue now days.
Update the comments to reflect that. Make things more generic with
saying "send engine" because that is what is being referred to.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It was determined that 0x880 is a better value for hardware buffering,
use it.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add missing external device timeout notification. Recognize
it as a failed LNI signal from the 8051 firmware.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a a bug in defered ack stuff that causes a race with the
destroy of a QP.
A packet causes a defered ack to be pended by putting the QP
into an rcd queue.
A return from the driver interrupt processing will process that rcd
queue of QPs and attempt to do a direct send of the ack. At this
point no locks are held and the above QP could now be put in the reset
state in the qp destroy logic. A refcount protects the QP while it
is in the rcd queue so it isn't going anywhere yet.
If the direct send fails to allocate a pio buffer,
hfi1_schedule_send() is called to trigger sending an ack from the
send engine. There is no state test in that code path.
The refcount is then dropped from the driver.c caller
potentially allowing the qp destroy to continue from its
refcount wait in parallel with the workqueue scheduling of the qp.
Cc: stable@vger.kernel.org
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Prevent over-reading the SGE length by using byte
reads for non quad-word reads.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In certain cases, if the tail of an SGE is not
8-byte aligned, bytes beyond the end to an 8-byte
alignment can be read. Change the copy routine
to avoid the over-read. Instead, stop on the final
whole quad-word, then read the remaining bytes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allow a longer timeout for i2c due to clock stretching and
inaccurate jiffy timing when under a spin lock. This timeout
is consistent with other i2c-algo-bit users.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ib_write_bw test allows using up to 16384 QPs. When a relatively
large number of QPs (within that range) is used, the test can fail
because the number of CQ entries needed exceeds the limit set by the
driver.
This patch increases the default setting of max_cqes from 0x2FFFF
(196607) to 0x2FFFFF(3145727), which is sufficient to cover the
maximum number needed by the ib_write_bw test (2097152). The default
setting of max_qps is also increased from 16384 to 32768 to allow
the test to run successfully with 16383 or 16384 QPs.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The FM should have full control to set the pkeys in the
driver pkey table. Remove filtering done by the driver.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is no need to have a global qpt_mask as that does not support the
multiple chip model which qib has. Instead rely on the value which
exists already in the device data (dd).
Fixes: 898fa52b4a "IB/qib: Remove qpn, qp tables and related variables from qib"
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This allows for adding additional pages of adaptive pio
opcode control including manufacturer specific ones.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch adds lockdep asserts in key code paths for
insuring lock correctness.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add an rvt_qp_init() to initialize specific
common fields as the qp is created or reset.
The routine is shared by the rvt_reset_qp() and
the rvt_create_qp().
The intent is that lock dep assertions will only
appear in the rvt_reset_qp().
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The reset calldown is misplaced.
It should only be called in the code that actually
transitions the QP to reset.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The call is misplaced in the reset calldown function
and causes issues with lockdep assertions that are to
be added.
Fixes: Commit a2c2d60895 ("staging/rdma/hfi1: Remove create_qp functionality")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The __must_hold() is sufficent to correct the sparse
context imbalance inside a function.
Per Documentation/sparse.txt:
__must_hold - The specified lock is held on function entry and exit.
Fixes: Commit c0a67f6ba3 ("IB/rdmavt: Annotate rvt_reset_qp()")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Existing locking scheme in affinity.c file using the
&node_affinity.lock spinlock is not very elegant.
We acquire the lock to get hfi1_affinity_node entry,
unlock, and then use the entry without the lock held.
With more functions being added, which access and
modify the entries, this can lead to race conditions.
This patch makes this locking scheme more consistent.
It changes the spinlock to mutex. Since all the code
is executed in a user process context there is no need
for a spinlock. This also allows to keep the lock
not only while we look up for the node affinity entry,
but over the whole section where the entry is being used.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The dma_XXX API functions return bus addresses which are
physical addresses when IOMMU is disabled. Buffer
mapping to user-space is done via remap_pfn_range() with PFN
based on bus address instead of physical. This results in
wrong pages being mapped to user-space when IOMMU is enabled.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tymoteusz Kielan <tymoteusz.kielan@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Each user SDMA request coming into the driver may contain multiple packets.
Each user packet may use multiple SDMA descriptors to fill the send buffer.
The field seqsubmitted in struct user_sdma_request counts the number of
user packets submitted to an SDMA engine. Sometimes, the intermediate count
may not be updated properly. However, once all the packets' descriptors
are successfully submitted to the SDMA engine, the final count is updated
correctly. But, if only some of the packets are submitted to the engine due
to an error, the intermediate count doesn't reflect the partial number of
packets submitted to the SDMA engine. This can cause a hang later in the
code as the count of packets submitted to the SDMA engine doesn't match the
the count of packets processed by the SDMA engine.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
All calls to tune_serdes and start_link are paired. Move
tune_serdes inside start_link.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use common header file structs, defines, and accessors
in the drivers. The old declarations are removed.
The repositioning of the includes allows for the removal
of hfi1_message_header and replaces its use with ib_header.
Also corrected are two issues with set_armed_to_active():
- The "packet" parameter is now a pointer as it should have been
- The etype is validated to insure that the header is correct
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We now only use it from ib_alloc_pd to create a local DMA lkey if the
device doesn't provide one, or a global rkey if the ULP requests it.
This patch removes ib_get_dma_mr and open codes the functionality in
ib_alloc_pd so that we can simplify the code and prevent abuse of the
functionality. As a side effect we can also simplify things by removing
the valid access bit check, and the PD refcounting.
In the future I hope to also remove the per-PD global MR entirely by
shifting this work into the HW drivers, as one step towards avoiding
the struct ib_mr overload for various different use cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited. It also prints a warning
everytime this feature is used as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This has two reasons: a) to clearly mark that drivers don't have any
business using it, and b) because we're going to use it for the
(dangerous) global rkey soon, so that drivers don't create on themselves.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allocate resources dynamically to cxgb4's Upper layer driver's(ULD) like
cxgbit, iw_cxgb4 and cxgb4i. Allocate resources when they register with
cxgb4 driver and free them while unregistering. All the queues and the
interrupts for them will be allocated during ULD probe only and freed
during remove.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Various fixes to rdmavt, ipoib, mlx5, mlx4, rxe
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJX3DdRAAoJELgmozMOVy/d2k4QAJz0HbJvS8uN/ny6zaIsIa74
08pvzHWpPkbJ6JGyxySToxHx7gs+MMvsvovUM+QPQS4jt6ZdHY1vOUDztG7GVXZC
SsC8kYX8o0P2zhiDeMi/9LoBjH5bLgS3L5lfwke0jgWXCU6Cdgm5InnZ8XuoBZr6
zNQ/Zcg8epe92IhqJ9abqMveni4FstXzlj9PlhaeCkUadFarpypG2yTdcvmq7m6i
aXvGDVWgaVTB0CyaJtXIK3g/lmW4Ay3z5RpIjPbdZTd2j46c8Z4yKrhpHuK2fChb
4xPSEoMdBTO/FI/M0Mf6FKEtGv4bxFcfwpjw5fuWL3sk+hWVWA0yqil3fCJ4vAy5
klUdRLE187hd0MBj2Eq8xLeblfuqAmiuWjPJ59npcspPFUaXgvw8jolxjzxE2HVM
whAVnb5fEVu1nQ8ePfkDPNJ1osFmFwYObYiYqql258U5jBU/QXwohMQihSeIhR04
yRyzY1ob+WCGqp/MKkkAAZvjSUGdPzuky5YHCymmoinJKXvf9eJ7LdQ1l2jFMoa6
ZpZNppSya9v2pLhGf8MhO2DXyejhCnPgn1JS7OhoTCHbSPR5zDD8oucgW0+gmUpd
1QAzEL2HSojvLrDJJpJOTHhL8TQPLEVnnILMq5jRGy3y+lUJ1iGgw3qBLxAhZZRZ
r7omK4iOuut0P+Rzvgqg
=HC3X
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"Round three of 4.8 rc fixes.
This is likely the last rdma pull request this cycle. The new rxe
driver had a few issues (you probably saw the boot bot bug report) and
they should be addressed now. There are a couple other fixes here,
mainly mlx4. There are still two outstanding issues that need
resolved but I don't think their fix will make this kernel cycle.
Summary:
- Various fixes to rdmavt, ipoib, mlx5, mlx4, rxe"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/rdmavt: Don't vfree a kzalloc'ed memory region
IB/rxe: Fix kmem_cache leak
IB/rxe: Fix race condition between requester and completer
IB/rxe: Fix duplicate atomic request handling
IB/rxe: Fix kernel panic in udp_setup_tunnel
IB/mlx5: Set source mac address in FTE
IB/mlx5: Enable MAD_IFC commands for IB ports only
IB/mlx4: Diagnostic HW counters are not supported in slave mode
IB/mlx4: Use correct subnet-prefix in QP1 mads under SR-IOV
IB/mlx4: Fix code indentation in QP1 MAD flow
IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV
IB/ipoib: Don't allow MC joins during light MC flush
IB/rxe: fix GFP_KERNEL in spinlock context
This improves readability and hides the reference count
mechanism from the client drivers.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The userspace memory region 'mr' is allocated with kzalloc in
__rvt_alloc_mr however it is incorrectly being freed with vfree in
__rvt_free_mr. Fix this by using kfree to free it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
rxe_requester() is sending a pkt with rxe_xmit_packet() and
then calls rxe_update() to update the wqe and qp's psn values.
But sometimes the response is received before the requester
had time to update the wqe in which case the completer
acts on errornous wqe values.
This fix updates the wqe and qp before actually sending
the request and rolls back when xmit fails.
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When handling ack for atomic opcodes like "fetch&add"
or "cmp&swp", the method send_atomic_ack() saves the ack
before sending it, in case it gets lost and never reach the
requester. In which case the method duplicate_request()
will need to find it using the duplicated request.psn.
But send_atomic_ack() used a wrong psn value and thus
the above ack was never found.
This fix uses the ack.psn to locate the ack in case
its needed.
This fix also copies the ack packet to the skb's control buffer
since duplicate_request() will need it when calling rxe_xmit_packet()
Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Set the source mac address in the FTE when L2 specification
is provided.
Fixes: 038d2ef875 ('IB/mlx5: Add flow steering support')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
MAD_IFC command is supported only for physical functions (PF)
and when physical port is IB. The proposed fix enforces it.
Fixes: d603c809ef ("IB/mlx5: Fix decision on using MAD_IFC")
Reported-by: David Chang <dchang@suse.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Modify the mlx4_ib_diag_counters() to avoid the following error in the
hypervisor when the slave tries to query the hardware counters in SR-IOV
mode.
mlx4_core 0000:81:00.0: Unknown command:0x30 accepted from slave:1
Fixes: 3f85f2aaab ("IB/mlx4: Add diagnostic hardware counters")
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When sending QP1 MAD packets which use a GRH, the source GID
(which consists of the 64-bit subnet prefix, and the 64 bit port GUID)
must be included in the packet GRH.
For SR-IOV, a GID cache is used, since the source GID needs to be the
slave's source GID, and not the Hypervisor's GID. This cache also
included a subnet_prefix. Unfortunately, the subnet_prefix field in
the cache was never initialized (to the default subnet prefix 0xfe80::0).
As a result, this field remained all zeroes. Therefore, when SR-IOV
was active, all QP1 packets which included a GRH had a source GID
subnet prefix of all-zeroes.
However, the subnet-prefix should initially be 0xfe80::0 (the default
subnet prefix). In addition, if OpenSM modifies a port's subnet prefix,
the new subnet prefix must be used in the GRH when sending QP1 packets.
To fix this we now initialize the subnet prefix in the SR-IOV GID cache
to the default subnet prefix. We update the cached value if/when OpenSM
modifies the port's subnet prefix. We take this cached value when sending
QP1 packets when SR-IOV is active.
Note that the value is stored as an atomic64. This eliminates any need
for locking when the subnet prefix is being updated.
Note also that we depend on the FW generating the "port management change"
event for tracking subnet-prefix changes performed by OpenSM. If running
early FW (before 2.9.4630), subnet prefix changes will not be tracked (but
the default subnet prefix still will be stored in the cache; therefore
users who do not modify the subnet prefix will not have a problem).
IF there is a need for such tracking also for early FW, we will add that
capability in a subsequent patch.
Fixes: 1ffeb2eb8b ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The indentation in the QP1 GRH flow in procedure build_mlx_header is
really confusing. Fix it, in preparation for a commit which touches
this code.
Fixes: 1ffeb2eb8b ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Because of an incorrect bit-masking done on the join state bits, when
handling a join request we failed to detect a difference between the
group join state and the request join state when joining as send only
full member (0x8). This caused the MC join request not to be sent.
This issue is relevant only when SRIOV is enabled and SM supports
send only full member.
This fix separates scope bits and join states bits a nibble each.
Fixes: b9c5d6a643 ('IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV')
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fix solves a race between light flush and on the fly joins.
Light flush doesn't set the device to down and unset IPOIB_OPER_UP
flag, this means that if while flushing we have a MC join in progress
and the QP was attached to BC MGID we can have a mismatches when
re-attaching a QP to the BC MGID.
The light flush would set the broadcast group to NULL causing an on
the fly join to rejoin and reattach to the BC MCG as well as adding
the BC MGID to the multicast list. The flush process would later on
remove the BC MGID and detach it from the QP. On the next flush
the BC MGID is present in the multicast list but not found when trying
to detach it because of the previous double attach and single detach.
[18332.714265] ------------[ cut here ]------------
[18332.717775] WARNING: CPU: 6 PID: 3767 at drivers/infiniband/core/verbs.c:280 ib_dealloc_pd+0xff/0x120 [ib_core]
...
[18332.775198] Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
[18332.779411] 0000000000000000 ffff8800b50dfbb0 ffffffff813fed47 0000000000000000
[18332.784960] 0000000000000000 ffff8800b50dfbf0 ffffffff8109add1 0000011832f58300
[18332.790547] ffff880226a596c0 ffff880032482000 ffff880032482830 ffff880226a59280
[18332.796199] Call Trace:
[18332.798015] [<ffffffff813fed47>] dump_stack+0x63/0x8c
[18332.801831] [<ffffffff8109add1>] __warn+0xd1/0xf0
[18332.805403] [<ffffffff8109aebd>] warn_slowpath_null+0x1d/0x20
[18332.809706] [<ffffffffa025d90f>] ib_dealloc_pd+0xff/0x120 [ib_core]
[18332.814384] [<ffffffffa04f3d7c>] ipoib_transport_dev_cleanup+0xfc/0x1d0 [ib_ipoib]
[18332.820031] [<ffffffffa04ed648>] ipoib_ib_dev_cleanup+0x98/0x110 [ib_ipoib]
[18332.825220] [<ffffffffa04e62c8>] ipoib_dev_cleanup+0x2d8/0x550 [ib_ipoib]
[18332.830290] [<ffffffffa04e656f>] ipoib_uninit+0x2f/0x40 [ib_ipoib]
[18332.834911] [<ffffffff81772a8a>] rollback_registered_many+0x1aa/0x2c0
[18332.839741] [<ffffffff81772bd1>] rollback_registered+0x31/0x40
[18332.844091] [<ffffffff81773b18>] unregister_netdevice_queue+0x48/0x80
[18332.848880] [<ffffffffa04f489b>] ipoib_vlan_delete+0x1fb/0x290 [ib_ipoib]
[18332.853848] [<ffffffffa04df1cd>] delete_child+0x7d/0xf0 [ib_ipoib]
[18332.858474] [<ffffffff81520c08>] dev_attr_store+0x18/0x30
[18332.862510] [<ffffffff8127fe4a>] sysfs_kf_write+0x3a/0x50
[18332.866349] [<ffffffff8127f4e0>] kernfs_fop_write+0x120/0x170
[18332.870471] [<ffffffff81207198>] __vfs_write+0x28/0xe0
[18332.874152] [<ffffffff810e09bf>] ? percpu_down_read+0x1f/0x50
[18332.878274] [<ffffffff81208062>] vfs_write+0xa2/0x1a0
[18332.881896] [<ffffffff812093a6>] SyS_write+0x46/0xa0
[18332.885632] [<ffffffff810039b7>] do_syscall_64+0x57/0xb0
[18332.889709] [<ffffffff81883321>] entry_SYSCALL64_slow_path+0x25/0x25
[18332.894727] ---[ end trace 09ebbe31f831ef17 ]---
Fixes: ee1e2c82c2 ("IPoIB: Refresh paths instead of flushing them on SM change events")
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is skb_clone(skb, GFP_KERNEL) in spinlock context
in rxe_rcv_mcast_pkt().
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add cxgb_mk_rx_data_ack() to remove duplicate
code to form CPL_RX_DATA_ACK hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_abort_rpl() to remove duplicate
code to form CPL_ABORT_RPL hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_abort_req() to remove duplicate code
to form CPL_ABORT_REQ hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_close_con_req() to remove duplicate
code to form CPL_CLOSE_CON_REQ hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_mk_tid_release() to remove duplicate code
to form CPL_TID_RELEASE hardware command.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_compute_wscale() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_best_mtu() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_is_neg_adv() in libcxgb_cm.h to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_find_route6() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_find_route() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add cxgb_get_4tuple() in libcxgb_cm.c to remove
it's duplicate definitions from cxgb4/cm.c and
cxgbit/cxgbit_cm.c.
Signed-off-by: Varun Prakash <varun@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block fixes from Jens Axboe:
"A set of fixes for the current series in the realm of block.
Like the previous pull request, the meat of it are fixes for the nvme
fabrics/target code. Outside of that, just one fix from Gabriel for
not doing a queue suspend if we didn't get the admin queue setup in
the first place"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme-rdma: add back dependency on CONFIG_BLOCK
nvme-rdma: fix null pointer dereference on req->mr
nvme-rdma: use ib_client API to detect device removal
nvme-rdma: add DELETING queue flag
nvme/quirk: Add a delay before checking device ready for memblaze device
nvme: Don't suspend admin queue that wasn't created
nvme-rdma: destroy nvme queue rdma resources on connect failure
nvme_rdma: keep a ref on the ctrl during delete/flush
iw_cxgb4: block module unload until all ep resources are released
iw_cxgb4: call dev_put() on l2t allocation failure
Conflicts:
drivers/net/ethernet/mediatek/mtk_eth_soc.c
drivers/net/ethernet/qlogic/qed/qed_dcbx.c
drivers/net/phy/Kconfig
All conflicts were cases of overlapping commits.
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise an endpoint can be still closing down causing a touch
after free crash. Also WARN_ON if ulps have failed to destroy
various resources during device removal.
Fixes: ad61a4c7a9 ("iw_cxgb4: don't block in destroy_qp awaiting the last deref")
Reviewed-by: Sagi Grimberg <sagi@grimbrg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
The debugfs RCU trips many debug kernel warnings because of potential
sleeps with an RCU read lock held. This includes both user copy calls
and slab allocations throughout the file.
This patch switches the RCU to use SRCU for file remove/access
race protection.
In one case, the SRCU is implicit in the use of the raw debugfs file
object and just works.
In the seq_file case, a wrapper around seq_read() and seq_lseek() is
used to enforce the SRCU using the debugfs supplied functions
debugfs_use_file_start() and debugfs_use_file_stop().
The sychronize_rcu() is deleted since the SRCU prevents the remove
access race.
The RCU locking is kept for qp_stats since the QP hash list is
protected using the non-sleepable RCU.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The global variable n_krcvqs stores the sum of the number of kernel
receive queues of VLs 0-7 which the user can pass to the driver through
the module parameter array krcvqs which is of type unsigned integer. If
the user passes large value(s) into krcvqs parameter array, it can cause
an arithmetic overflow while calculating n_krcvqs which is also of type
unsigned int. The overflow results in an incorrect value of n_krcvqs
which can lead to kernel crash while loading the driver.
Fix by changing the data type of n_krcvqs to unsigned long. This patch
also changes the data type of other variables that get their values from
n_krcvqs.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sometimes a QSFP device does not respond in the expected time
after a power-on. Add a read pre-check/retry when starting
the link on driver load.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In the set_txreq_header_ahg(), The KDETH Intr bit is obtained from the
header in the user sdma request using a KDETH_GET shift and mask macro.
This value is then futher right shifted by 16 causing us to lose the
value i.e it is shifted to zero, leading to the following
smatch warning:
drivers/infiniband/hw/hfi1/user_sdma.c:1482 set_txreq_header_ahg()
warn: mask and shift to zero
The Intr bit should be left shifted into its correct position in the
KDETH header before the AHG update.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When trying to align the source pointer and there's a byte carry
in an SGE copy, bytes are borrowed from the next quad-word X to
complete the required quad-word copy. Then, the SGE length is
reduced by the number of borrowed bytes. After this, if the
remaining number of bytes from quad-word X (extra bytes) is
greater than the new SGE length, the number of extra bytes needs
to be updated to the new SGE length. Otherwise, when the
SGE length gets updated again after the extra bytes are read to
create the new byte carry, it goes negative, which then becomes
a very large number as the SGE length is an unsigned integer.
This causes SGE buffer to be over-read.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove returning errors from mlx5 poll_cq function. Polling CQ
operation in kernel never fails by Mellanox HCA architecture and
respective driver design.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use TIR number based on selector, it should be done to differentiate
between RSS QP to RAW one.
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Return variable was set in a line before the
actual return was called in begin_wqe function.
This patch removes such variable and simplifies the code.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The returned value should be EINVAL, because it is caused by wrong
caller and not by internal overflow event.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove returning errors from mlx4 poll_cq function. Polling CQ
operation in kernel never fails by Mellanox HCA architecture and
respective driver design.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
By Mellanox HW design and SW implementation, poll_cq never
fails and returns errors, so all these printks are to catch ULP bugs.
In case of such bug, the reverted patch will cause reentry of the
function, resulting in a printk storm.
This reverts commit 5412352fcd ("IB/mlx4: Return EAGAIN for any error in mlx4_ib_poll_one")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When a new CM connection is being requested, ipoib driver copies data
from the path pointer in the CM/tx object, the path object might be
invalid at the point and memory corruption will happened later when now
the CM driver will try using that data.
The next scenario demonstrates it:
neigh_add_path --> ipoib_cm_create_tx -->
queue_work (pointer to path is in the cm/tx struct)
#while the work is still in the queue,
#the port goes down and causes the ipoib_flush_paths:
ipoib_flush_paths --> path_free --> kfree(path)
#at this point the work scheduled starts.
ipoib_cm_tx_start --> copy from the (invalid)path pointer:
(memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
-> memory corruption.
To fix that the driver now starts the CM/tx connection only if that
specific path exists in the general paths database.
This check is protected with the relevant locks, and uses the gid from
the neigh member in the CM/tx object which is valid according to the ref
count that was taken by the CM/tx.
Fixes: 839fcaba35 ('IPoIB: Connected mode experimental support')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The function send_leave sets the member: group->query_id
(group->query_id = ret) after calling the sa_query, but leave_handler
can be executed before the setting and it might delete the group object,
and will get a memory corruption.
Additionally, this patch gets rid of group->query_id variable which is
not used.
Fixes: faec2f7b96 ('IB/sa: Track multicast join/leave requests')
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We get 1 warning when build kernel with W=1:
drivers/infiniband/hw/cxgb4/qp.c:686:6: warning: no previous prototype for '_free_qp' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is declared
and don't need a declaration, but can be made static.
so this patch marks it 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When the low level driver exercises the hot unplug they would call
rdma_cm cma_remove_one which would fire DEVICE_REMOVAL event to all cma
consumers. Now, if consumer doesn't make sure they destroy all IB
objects created on that IB device instance prior to finalizing all
processing of DEVICE_REMOVAL callback, rdma_cm will let the lld to
de-register with IB core and destroy the IB device instance. And if the
consumer calls (say) ib_dereg_mr(), it will crash since that dev object
is NULL.
In the current implementation, iser-target just initiates the cleanup
and returns from DEVICE_REMOVAL callback. This deferred work creates a
race between iser-target cleaning IB objects(say MR) and lld destroying
IB device instance.
This patch includes the following fixes
-> make sure that consumer frees all IB objects associated with device
instance
-> return non-zero from the callback to destroy the rdma_cm id
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The 2nd parameter of 'find_first_bit' is the number of bits to search.
In this case, we are passing 'sizeof(u64)' which is 8.
It is likely that the number of bits of 'port_mask' was expected here.
Use sizeof() * 8 to get the correct number.
It has been spotted by the following coccinelle script:
@@
expression ret, x;
@@
* ret = \(find_first_bit \| find_first_zero_bit\) (x, sizeof(...));
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The 2nd parameter of 'find_first_bit' is the number of bits to search.
In this case, we are passing 'sizeof(tmp)' which is likely to be 4 or 8
because 'tmp' is an 'unsigned long'.
It is likely that the number of bits of 'tmp' was expected here. So use
BITS_PER_LONG instead.
It has been spotted by the following coccinelle script:
@@
expression ret, x;
@@
* ret = \(find_first_bit \| find_first_zero_bit\) (x, sizeof(...));
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Majd Dibbiny <majd@mellanox.com>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In all other places in this file where 'find_first_bit' is called,
port_num is defined as a 'u8' and no casting is done.
Do the same here in order to be more consistent.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Device notifications are not received after the first interface is
closed; since there is an unregister for notifications on every
interface close. Correct this by unregistering for device
notifications only when the last interface is closed. Also, make
all operations on the i40iw_notifiers_registered atomic as it
can be read/modified concurrently.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Update iwqp->hw_iwarp_state to reflect the new state of the CQP
modify QP operation. This avoids reissuing a CQP operation to
modify a QP to a state that it is already in.
Fixes: 4e9042e647 ("i40iw: add hw and utils files")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Send a zero length last streaming mode message for loopback
connections to synchronize between accepting QP and connecting QP.
This avoids data transfer to start on the accepting QP before
the connecting QP is in RTS. Also remove function i40iw_loopback_nop()
as it is no longer used.
Fixes: f27b4746f3 ("i40iw: add connection management code")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch is meant to add support of ACPI to the Hisilicon RoCE
driver.
Changes done are primarily meant to detect the type and then either
use DT specific or ACPI spcific functions. Where ever possible,
this patch tries to make use of Unified Device Property Interface
APIs to support both DT and ACPI through single interface.
This patch depends upon HNS ethernet driver to Reset RoCE. This
function within HNS ethernet driver has also been enhanced to
support ACPI and is part of other accompanying patch with this
patch-set.
NOTE: The changes in this patch are done over below branch,
https://github.com/dledford/linux/tree/hns-roce
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If port_guid is set with the default subnet_prefix, then we get a change
event and run a port refresh, we don't update the port_guid. As a
result, attempts to create a target device that uses the new
subnet_prefix in the wwn will fail to find a match and be rejected by
the ib_srpt driver. This makes it impossible to configure a port if it
was initialized with a default subnet_prefix and later changed to any
non-default subnet-prefix. Updating the port refresh task to always
update the wwn based upon the current subnext_prefix solves this
problem.
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: nab@linux-iscsi.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current driver is reporting wrong values for max_sge and
max_sge_rd in query_device. This breaks the nfs rdma and iser
in some device profiles. Fixing the driver to report
correct values from FW.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
iwpbl->iwmr points to the structure that contains iwpbl,
which is iwmr. Setting this to NULL would result in
writing to freed memory. So just free iwmr, and return.
Fixes: d374984179 ("i40iw: add files for iwarp interface")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Memory allocated for iwqp; iwqp->allocated_buffer is freed twice in
the create_qp error path. Correct this by having it freed only once in
i40iw_free_qp_resources().
Fixes: d374984179 ("i40iw: add files for iwarp interface")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This file does not use any structs or functions defined by io-mapping.h
(nor does it directly use iomap, ioremap, iounamp or friends). Remove it
to simplify verification of changes to io-mapping.h
The include existed since its inception in
commit e126ba97db
Author: Eli Cohen <eli@mellanox.com>
Date: Sun Jul 7 17:25:49 2013 +0300
mlx5: Add driver for Mellanox Connect-IB adapters
which looks like a copy across from the Mellanox ethernet driver.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Eli Cohen <eli@mellanox.com>
Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Matan Barak <matanb@mellanox.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_free_virt_mem(), do not set mem->va to NULL
after freeing it as mem->va is a self-referencing pointer
to mem.
Fixes: 4e9042e647 ("i40iw: add hw and utils files")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add NULL check for pdata and pdata->addr before the memcpy in
i40iw_form_cm_frame(). This fixes a NULL pointer de-reference
which occurs when the MPA private data pointer is NULL. Also
only copy pdata->size bytes in the memcpy to prevent reading
past the length of the private data buffer provided by upper layer.
Fixes: f27b4746f3 ("i40iw: add connection management code")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current cxgb4 arm CQ logic ignores IB_CQ_REPORT_MISSED_EVENTS for
request completion notification on a CQ. Due to this ib_poll_handler()
assumes all events polled and avoids further iopoll scheduling.
This patch adds logic to cxgb4 ib_req_notify_cq() handler to check if
CQ is not empty and return accordingly. Based on the return value of
ib_req_notify_cq() handler, ib_poll_handler() will schedule a run of
iopoll handler.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_open(), check if interface is already open
and return success if it is.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_alloc_resource(), ensure that the update to
req_resource_num is protected by the lock.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
iwdev->mem_resources is incorrectly defined as an unsigned
long instead of u8. As a result, the offset into the dynamic
allocated structures in i40iw_initialize_hw_resources() is
incorrectly calculated and would lead to writing of memory
regions outside of the allocated buffer.
Fixes: 8e06af711b ("i40iw: add main, hdr, status")
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reuse existing functionality from memdup_user() instead of keeping
duplicate source code.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The i40iw initiator sends an MPA-request with ird=16 and ord=16. The cxgb4
responder sends an MPA-reply with ord = 32 causing i40iw to terminate
due to insufficient resources.
The logic to reduce the ORD to <= peer's IRD was wrong.
Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The i40iw initiator sends an MPA-request with ird = 63, ord = 63. The
cxgb4 responder sends a RST. Since the inbound ord=63 and it exceeds
the max_ird/c4iw_max_read_depth (=32 default), chelsio decides to abort.
Instead, cxgb4 should adjust the ord/ird down before presenting it to
the ULP.
Reported-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The unwind logic for creating a user QP has a double vfree
of the non-shared receive queue when handling a "too many qps"
failure.
The code unwinds the mmmap info by decrementing a reference
count which will call rvt_release_mmap_info() which in turn
does the vfree() of the r_rq.wq. The unwind code then does
the same free.
Fix by guarding the vfree() with the same test that is done
in close and only do the vfree() if qp->ip is NULL.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, J_KEY generation was based on the lower 16 bits
of the user's UID. While this works, it was not good enough
as a non-root user could collide with a root user given a
sufficiently large UID.
This patch attempt to improve the J_KEY generation by using
the following algorithm:
The 16 bit J_KEY space is partitioned into 3 separate spaces
reserved for different user classes:
* all users with administtor privileges (including 'root')
will use J_KEYs in the range of 0 to 31,
* all kernel protocols, which use KDETH packets will use
J_KEYs in the range of 32 to 63, and
* all other users will use J_KEYs in the range of 64 to
65535.
The above separation is aimed at preventing different user levels
from sending packets to each other and, additionally, separate
kernel protocols from all other types of users. The later is meant
to prevent the potential corruption of kernel memory by any other
type of user.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver does not check if the CableInfo query is supported for the
port type. Return early if CableInfo is not supported for the port type,
making compliance with the specification explicit and preventing lower
level code from potentially doing the wrong thing if the query is not
supported for the hardware implementation.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If 'pci_register_driver' fails, we return 'err' which is known to be 0.
Return the error instead.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Doug Ledford <dledford@redhat.com>
It is likely that checking the result of 'setup_ctxt' is expected here.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Validate the etype to insure that the header is correct.
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The "packet" parameter was being passed on the stack,
change it to a pointer.
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The monitor values from bytes 22 through 81 of the QSFP memory space
(SFF 8636) are dynamic and serving them out of the QSFP memory cache
maintained by the driver provides stale data to the CableInfo SMA query.
This patch refreshes the dynamic values from the QSFP memory on request
and overwrites the stale data from the cache for the overlap between the
requested range and the monitor range.
Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The qp init function does a kzalloc() while holding the RCU
lock that encounters the following warning with a debug kernel
when a cat of the qp_stats is done:
[ 231.723948] rcu_scheduler_active = 1, debug_locks = 0
[ 231.731939] 3 locks held by cat/11355:
[ 231.736492] #0: (debugfs_srcu){......}, at: [<ffffffff813001a5>] debugfs_use_file_start+0x5/0x90
[ 231.746955] #1: (&p->lock){+.+.+.}, at: [<ffffffff81289a6c>] seq_read+0x4c/0x3c0
[ 231.755873] #2: (rcu_read_lock){......}, at: [<ffffffffa0a0c535>] _qp_stats_seq_start+0x5/0xd0 [hfi1]
[ 231.766862]
The init functions do an implicit next which requires the rcu read lock
before the kzalloc().
Fix for both drivers is to change the scope of the init function to only
do the allocation and the initialization of the just allocated iter.
The implict next is moved back into the respective start functions to fix
the issue.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
CC: <stable@vger.kernel.org> # 4.6.x-
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fix to return error code -ENOMEM from the alloc error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
'work' and 'route->path_rec' are malloced in cma_resolve_iboe_route()
and should be freed before leaving from the error handling cases,
otherwise it will cause memory leak.
Fixes: 200298326b ('IB/core: Validate route when we init ah')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If CONFIG_FRAME_WARN is small (1K) and CONFIG_NR_CPUS big
then a frame size warning is triggered during build.
Allocate the cpu mask dynamically to silence the warning.
Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Error code EAGAIN should be used when errors are temporary and next call
might succeeds.
When error code other than EAGAIN is returned, the caller (mlx4_ib_poll)
will assume all CQE in the same bunch are error too and will drop them all.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
No need to return int if function always returns 0
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In case of error, the function devm_ioremap_resource() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch added Kconfig and Makefile for building RoCE module.
Signed-off-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Nenglong Zhao <zhaonenglong@hisilicon.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These are the various new source code files for the Hisilicon
RoCE driver for ARM architecture.
Signed-off-by: Wei Hu <xavier.huwei@huawei.com>
Signed-off-by: Nenglong Zhao <zhaonenglong@hisilicon.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Replaced mlx5_query_port_proto_oper with separate functions per link
type. The functions should take different arguments so no point in
trying to unite them.
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Now as all commands use mlx5 ifc interface, instead of doing two calls
for executing a command we embed command status checking into
mlx5_cmd_exec to simplify the interface.
Also we do here some cleanup for redundant software structures
(inbox/outbox) and functions and improved command failure output.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Prior to this patch we assumed that modify QP commands have the
same layout.
In ConnectX-4 for each QP transition there is a specific command
and their layout can vary.
e.g: 2err/2rst commands don't have QP context in their layout and before
this patch we posted the QP context in those commands.
Fortunately the FW only checks the suffix of the commands and executes
them, while ignoring all invalid data sent after the valid command
layout.
This patch removes mlx5_modify_qp_mbox_in and changes
mlx5_core_qp_modify to receive the required transition and QP context
with opt_param_mask if needed. This way the caller is not required to
provide the command inbox layout and it will be generated automatically.
mlx5_core_qp_modify will generate the command inbox/outbox layouts
according to the requested transition and will fill the requested
parameters.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created QP/XRCD commands layout
amd use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created MKey/PSV commands layout,
and use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove old representation of manually created CQ commands layout,
and use mlx5_ifc canonical structures and defines.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
- hfi1 driver updates
- Fix for max SGEs allowed via RDMA R/W API
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXoqUzAAoJELgmozMOVy/dNKAP/1/Rzn/k97eda1qFqzWpqsPl
lMaxDiZZnRIAFJEqEF9Iwo1JLiFIzjpDJnqHB++CKuXZQT0NY6sHW0yrcyUwzsx7
5gui92ldkVg4vY7PTco171vyzG+79KKRZ1dFS14z7oC8XAg48zQ7yJmfb1op3dEw
mgxyoLaaMwMF5aLwPoWG4+aPkBMtKUGB/ARb4ehq6M2p71c43lb18GaarJuWLdAz
1HxakXL/uzttyvGDyJGKDrT6ktXXSyvdCTRO60OrrPFJ67P2xRYXce85TLRr8srp
Q5RNjyR5fP8uN0qtrQz+hl09mtBeBQHKomyFIOVwkB2r53OKqsR5g5roz3BlpA1X
7PF/MO0pKy4t8XQnLfohEwtNWgszupvxkyAAISI8MwzLOPra/V8smQ9CpTltx1UB
hTu3tpAMy1auAjh8TWzzzII1ZoRZz6YCTziWnTaC3bqAljufjt1mnvjrtNmQ1sNi
MCLeA3yr8HjlKWdwYr+gVfhSR1wEoOxwHZdLsvBsxmC32hFLlh6rbg2x8wceqTlR
4T8l0AERV1YPjsoSe3/pWVImKUA97qppIfeFcCZiBCBHBPlhpw3ebVt6B1mLVUCV
hTMuZeFVcV75D+qr0kR5ZuVn4jgEn9zB1VH3tCV9LJnhBfySZFcP4yhATqiELaHG
RVoVAiTBxq5RgNVOH4Zo
=cQcp
-----END PGP SIGNATURE-----
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull second round of rdma updates from Doug Ledford:
"This can be split out into just two categories:
- fixes to the RDMA R/W API in regards to SG list length limits
(about 5 patches)
- fixes/features for the Intel hfi1 driver (everything else)
The hfi1 driver is still being brought to full feature support by
Intel, and they have a lot of people working on it, so that amounts to
almost the entirety of this pull request"
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (84 commits)
IB/hfi1: Add cache evict LRU list
IB/hfi1: Fix memory leak during unexpected shutdown
IB/hfi1: Remove unneeded mm argument in remove function
IB/hfi1: Consistently call ops->remove outside spinlock
IB/hfi1: Use evict mmu rb operation
IB/hfi1: Add evict operation to the mmu rb handler
IB/hfi1: Fix TID caching actions
IB/hfi1: Make the cache handler own its rb tree root
IB/hfi1: Make use of mm consistent
IB/hfi1: Fix user SDMA racy user request claim
IB/hfi1: Fix error condition that needs to clean up
IB/hfi1: Release node on insert failure
IB/hfi1: Validate SDMA user iovector count
IB/hfi1: Validate SDMA user request index
IB/hfi1: Use the same capability state for all shared contexts
IB/hfi1: Prevent null pointer dereference
IB/hfi1: Rename TID mmu_rb_* functions
IB/hfi1: Remove unneeded empty check in hfi1_mmu_rb_unregister()
IB/hfi1: Restructure hfi1_file_open
IB/hfi1: Make iovec loop index easy to understand
...
- Updates/fixes for iw_cxgb4 driver
- Updates/fixes for mlx5 driver
- Add flow steering and RSS API
- Add hardware stats to mlx4 and mlx5 drivers
- Add firmware version API for RDMA driver use
- Add the rxe driver (this is a software RoCE driver that makes any
Ethernet device a RoCE device)
- Fixes for i40iw driver
- Support for send only multicast joins in the cma layer
- Other minor fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXo1vCAAoJELgmozMOVy/d0HcQAJqMi7siD9cSaMViYbu812pq
3kNkHZbLNB/947uShDPhhFAWFXU0nRxEnTNSvYxRo+nxnDE/9hEEXpx8OzzKLNU+
GXyDeHsEEriSFcaSne5Tak/QuiFm3PJv73ttXQROCtHG7KxLG9ieVbfusz42Xwiu
5R21qfp6PZEOC+j7L/fTZh/kEN3cfaDYrGnCgmU3z0ka9xG5Qe2/+uWGNkuioRA5
phFUR4MS+1n/VrnxPHrLXTrqv3sw8YfCfRImaXSBrxFVMqhno+cDDtEJQCRnmNrq
7KcJO2KqDMl/QqsjxdwqojNpUTh2t7SeOeQuzUsfXl15yyyetq2Zu7ZurkCGjNtQ
NtTt6hv5eXq3mNuBmOPKYDDgakSYyYjS0zueoi8wFFqIeSYxRJv4wx4xoeJ/Bsz8
2LplpaPMQaTM65FhzYXGhYNBKaRkqjL9ihbIl1OcLNvfXAqLElfONM17/Yc/hgVw
xfDtvNFrZcl7/exIpBBNOnxwbs4h78vvXsXoBiVoN7V/hBnMzDhkiBHNxNCfZXA0
REGs/cnyy6cpiJOnVCWs77NqL75oK/qb1mEwe1M+A2kaxe/tLixUdYXo/zclDPm8
3DLTL9lCgJIBIEiZT4q/alxLK+yUKD+SHtQT3lmF2Bfsmv/I38Uy55SXAiFO4yOq
kwy96TvYtT43SkyNmmBf
=oZOO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull base rdma updates from Doug Ledford:
"Round one of 4.8 code: while this is mostly normal, there is a new
driver in here (the driver was hosted outside the kernel for several
years and is actually a fairly mature and well coded driver). It
amounts to 13,000 of the 16,000 lines of added code in here.
Summary:
- Updates/fixes for iw_cxgb4 driver
- Updates/fixes for mlx5 driver
- Add flow steering and RSS API
- Add hardware stats to mlx4 and mlx5 drivers
- Add firmware version API for RDMA driver use
- Add the rxe driver (this is a software RoCE driver that makes any
Ethernet device a RoCE device)
- Fixes for i40iw driver
- Support for send only multicast joins in the cma layer
- Other minor fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (72 commits)
Soft RoCE driver
IB/core: Support for CMA multicast join flags
IB/sa: Add cached attribute containing SM information to SA port
IB/uverbs: Fix race between uverbs_close and remove_one
IB/mthca: Clean up error unwind flow in mthca_reset()
IB/mthca: NULL arg to pci_dev_put is OK
IB/hfi1: NULL arg to sc_return_credits is OK
IB/mlx4: Add diagnostic hardware counters
net/mlx4: Query performance and diagnostics counters
net/mlx4: Add diagnostic counters capability bit
Use smaller 512 byte messages for portmapper messages
IB/ipoib: Report SG feature regardless of HW UD CSUM capability
IB/mlx4: Don't use GFP_ATOMIC for CQ resize struct
IB/hfi1: Disable by default
IB/rdmavt: Disable by default
IB/mlx5: Fix port counter ID association to QP offset
IB/mlx5: Fix iteration overrun in GSI qps
i40iw: Add NULL check for puda buffer
i40iw: Change dup_ack_thresh to u8
i40iw: Remove unnecessary check for moving CQ head
...
Soft RoCE (RXE) - The software RoCE driver
ib_rxe implements the RDMA transport and registers to the RDMA core
device as a kernel verbs provider. It also implements the packet IO
layer. On the other hand ib_rxe registers to the Linux netdev stack
as a udp encapsulating protocol, in that case RDMA, for sending and
receiving packets over any Ethernet device. This yields a RDMA
transport over the UDP/Ethernet network layer forming a RoCEv2
compatible device.
The configuration procedure of the Soft RoCE drivers requires
binding to any existing Ethernet network device. This is done with
/sys interface.
A userspace Soft RoCE library (librxe) provides user applications
the ability to run with Soft RoCE devices. The use of rxe verbs ins
user space requires the inclusion of librxe as a device specifics
plug-in to libibverbs. librxe is packaged separately.
Architecture:
+-----------------------------------------------------------+
| Application |
+-----------------------------------------------------------+
+-----------------------------------+
| libibverbs |
User +-----------------------------------+
+----------------+ +----------------+
| librxe | | HW RoCE lib |
+----------------+ +----------------+
+---------------------------------------------------------------+
+--------------+ +------------+
| Sockets | | RDMA ULP |
+--------------+ +------------+
+--------------+ +---------------------+
| TCP/IP | | ib_core |
+--------------+ +---------------------+
+------------+ +----------------+
Kernel | ib_rxe | | HW RoCE driver |
+------------+ +----------------+
+------------------------------------+
| NIC driver |
+------------------------------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+-----------------------------------------------------------+
| Application |
+-----------------------------------------------------------+
+-----------------------------------+
| libibverbs |
User +-----------------------------------+
+----------------+ +----------------+
| librxe | | HW RoCE lib |
+----------------+ +----------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------+ +------------+
| Sockets | | RDMA ULP |
+--------------+ +------------+
+--------------+ +---------------------+
| TCP/IP | | ib_core |
+--------------+ +---------------------+
+------------+ +----------------+
Kernel | ib_rxe | | HW RoCE driver |
+------------+ +----------------+
+------------------------------------+
| NIC driver |
+------------------------------------+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Soft RoCE resources:
[1[ https://github.com/SoftRoCE/librxe-dev librxe - source code in
Github
[2] https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home - Soft RoCE
Wiki page
[3] https://github.com/SoftRoCE/librxe-dev - Soft RoCE userspace library
Signed-off-by: Kamal Heib <kamalh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The dma-mapping core and the implementations do not change the DMA
attributes passed by pointer. Thus the pointer can point to const data.
However the attributes do not have to be a bitfield. Instead unsigned
long will do fine:
1. This is just simpler. Both in terms of reading the code and setting
attributes. Instead of initializing local attributes on the stack
and passing pointer to it to dma_set_attr(), just set the bits.
2. It brings safeness and checking for const correctness because the
attributes are passed by value.
Semantic patches for this change (at least most of them):
virtual patch
virtual context
@r@
identifier f, attrs;
@@
f(...,
- struct dma_attrs *attrs
+ unsigned long attrs
, ...)
{
...
}
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
and
// Options: --all-includes
virtual patch
virtual context
@r@
identifier f, attrs;
type t;
@@
t f(..., struct dma_attrs *attrs);
@@
identifier r.f;
@@
f(...,
- NULL
+ 0
)
Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
Acked-by: Mark Salter <msalter@redhat.com> [c6x]
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris]
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm]
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp]
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core]
Acked-by: David Vrabel <david.vrabel@citrix.com> [xen]
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb]
Acked-by: Joerg Roedel <jroedel@suse.de> [iommu]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390]
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32]
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc]
Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added UCMA and CMA support for multicast join flags. Flags are
passed using UCMA CM join command previously reserved fields.
Currently supporting two join flags indicating two different
multicast JoinStates:
1. Full Member:
The initiator creates the Multicast group(MCG) if it wasn't
previously created, can send Multicast messages to the group
and receive messages from the MCG.
2. Send Only Full Member:
The initiator creates the Multicast group(MCG) if it wasn't
previously created, can send Multicast messages to the group
but doesn't receive any messages from the MCG.
IB: Send Only Full Member requires a query of ClassPortInfo
to determine if SM/SA supports this option. If SM/SA
doesn't support Send-Only there will be no join request
sent and an error will be returned.
ETH: When Send Only Full Member is requested no IGMP join
will be sent.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Added a new SA port attribute containing SM ClassPortInfo fields,
(ClassPortInfo fields: Table 126 IB Spec 1.3.). This is useful for
checking SM support for specific features. The attribute is cached
to avoid resending queries, caching is done when a successful
ClassPortInfo reply is received on the port. Invalidation of the
attribute is done on SM change events, SM re-registration events,
and SM LID change events. The fields in ClassPortInfo should not
change during SM runtime without an event.
Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed by: Hal Rosenstock <hal@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Fixes an oops that might happen if uverbs_close races with
remove_one.
Both contexts may run ib_uverbs_cleanup_ucontext, it depends
on the flow.
Currently, there is no protection for a case that remove_one
didn't make the cleanup it runs to its end, the underlying
ib_device was freed then uverbs_close will call
ib_uverbs_cleanup_ucontext and OOPs.
Above might happen if uverbs_close deleted the file from the list
then remove_one didn't find it and runs to its end.
Fixes to protect against that case by a new cleanup lock so that
ib_uverbs_cleanup_ucontext will be called always before that
remove_one is ended.
Fixes: 35d4a0b63d ("IB/uverbs: Fix race between ib_uverbs_open and remove_one")
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The kfree() function was called in a few cases by the mthca_reset()
function during error handling even if the passed variables "bridge_header"
and "hca_header" contained a null pointer.
Adjust jump targets according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The pci_dev_put() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The sc_return_credits() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expose IB diagnostic hardware counters.
The counters count IB events and are applicable for IB and RoCE.
The counters can be divided into two groups, per device and per port.
Device counters are always exposed.
Port counters are exposed only if the firmware supports per port counters.
rq_num_dup and sq_num_to are only exposed if we have firmware support
for them, if we do, we expose them per device and per port.
rq_num_udsdprd and num_cqovf are device only counters.
rq - denotes responder.
sq - denotes requester.
|-----------------------|---------------------------------------|
| Name | Description |
|-----------------------|---------------------------------------|
|rq_num_lle | Number of local length errors |
|-----------------------|---------------------------------------|
|sq_num_lle | number of local length errors |
|-----------------------|---------------------------------------|
|rq_num_lqpoe | Number of local QP operation errors |
|-----------------------|---------------------------------------|
|sq_num_lqpoe | Number of local QP operation errors |
|-----------------------|---------------------------------------|
|rq_num_lpe | Number of local protection errors |
|-----------------------|---------------------------------------|
|sq_num_lpe | Number of local protection errors |
|-----------------------|---------------------------------------|
|rq_num_wrfe | Number of CQEs with error |
|-----------------------|---------------------------------------|
|sq_num_wrfe | Number of CQEs with error |
|-----------------------|---------------------------------------|
|sq_num_mwbe | Number of Memory Window bind errors |
|-----------------------|---------------------------------------|
|sq_num_bre | Number of bad response errors |
|-----------------------|---------------------------------------|
|sq_num_rire | Number of Remote Invalid request |
| | errors |
|-----------------------|---------------------------------------|
|rq_num_rire | Number of Remote Invalid request |
| | errors |
|-----------------------|---------------------------------------|
|sq_num_rae | Number of remote access errors |
|-----------------------|---------------------------------------|
|rq_num_rae | Number of remote access errors |
|-----------------------|---------------------------------------|
|sq_num_roe | Number of remote operation errors |
|-----------------------|---------------------------------------|
|sq_num_tree | Number of transport retries exceeded |
| | errors |
|-----------------------|---------------------------------------|
|sq_num_rree | Number of RNR NAK retries exceeded |
| | errors |
|-----------------------|---------------------------------------|
|rq_num_rnr | Number of RNR NAKs sent |
|-----------------------|---------------------------------------|
|sq_num_rnr | Number of RNR NAKs received |
|-----------------------|---------------------------------------|
|rq_num_oos | Number of Out of Sequence requests |
| | received |
|-----------------------|---------------------------------------|
|sq_num_oos | Number of Out of Sequence NAKs |
| | received |
|-----------------------|---------------------------------------|
|rq_num_udsdprd | Number of UD packets silently |
| | discarded on the Receive Queue due to |
| | lack of receive descriptor |
|-----------------------|---------------------------------------|
|rq_num_dup | Number of duplicate requests received |
|-----------------------|---------------------------------------|
|sq_num_to | Number of time out received |
|-----------------------|---------------------------------------|
|num_cqovf | Number of CQ overflows |
|-----------------------|---------------------------------------|
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Portmapper messages are short and do not occupy more than 512 bytes.
Lower portmapper message size to 512 bytes. This change significantly
reduces the amount of memory needed when trying to establish a large
number of connections simultaneously. The old value is based on page
size.
Signed-off-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Decouple SG support from HW ability to do UD checksum.
This coupling is for historical reasons and removed with 'commit
ec5f061564 ("net: Kill link between CSUM and SG features.")'
During driver load it is assumed that device does not supports SG. The
final decision is taken after creating UD QP based on device capability.
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We allocate a small tracking structure as part of mlx4_ib_resize_cq().
However, we don't need to use GFP_ATOMIC -- immediately after the
allocation, we call mlx4_cq_resize(), which allocates a command
mailbox with GFP_KERNEL and then sleeps on a firmware command, so we
better not be in an atomic context.
This actually has a real impact, because when this GFP_ATOMIC
allocation fails (and GFP_ATOMIC does fail in practice) then a
userspace consumer resizing a CQ will get a spurious failure that we
can easily avoid.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a strict policy in the Linux kernel that new drivers must be
disabled by default. Hence leave out the "default m" line from Kconfig.
Fixes: f48ad614c1 ("IB/hfi1: Move driver out of staging")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Jubin John <jubin.john@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <stable@vger.kernel.org> # v4.7+
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There is a strict policy in the Linux kernel that new drivers must be
disabled by default. Hence leave out the "default m" line from Kconfig.
Fixes: 0194621b22 ("IB/rdmavt: Create module framework and handle driver registration")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Jubin John <jubin.john@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: <stable@vger.kernel.org> # v4.6+
Signed-off-by: Doug Ledford <dledford@redhat.com>
The original code used a LRU list to evict nodes which were least
recently used. For correctness the evict code was moved under the
handler->lock, now add back the LRU list.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
During an unexpected shutdown, references to tid_rb_node were NULL'ed out
without properly being released.
Fix this by calling clear_tid_node in the mmu notifier remove callback
rather than after these callbacks are called.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The reworked mmu_rb interface allows the unused mm argument to be removed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The ops->remove() callback was called by hfi1_mmu_unregister() with a
NULL mm argument while holding a spinlock. In the case of sdma_rb_remove()
this caused it to pass current->mm to hfi1_release_user_pages()
This had 2 problems. First this would attempt to acquire the mmap_sem
under a spin lock. Second the use of current->mm is not always guaranteed
to be the proper mm when the fd is being closed.
Rather than depend on this implicit behavior we move all calls to
ops->remove outside of the spinlock. This also allows the correct
mm to be used in the remove callback without fear of deadlock.
Because the MMU notifier is not guaranteed to hold mm->mmap_sem, but
usually does, we must delay all remove callbacks until out of the notifier,
when the callbacks can take the mmap_sem if they need to.
Code comments were added to clarify what the expectations are for the
users of the mmu rb tree.
Suggested-by: Jim Foraker <foraker1@llnl.gov>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use the new cache evict operation in the SDMA code. This allows the cache
to properly coordinate evicts and removes, preventing any race. With this
change, the separate list, lock, and race flag are not needed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Allow users to clear nodes from the rb tree based on their evict callback.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Per file descriptor TID caching actions depend on a global that can
change midway through the lifetime of that file descriptor.
Make the use of caching consistent for the life of the file descriptor
by using the presence of the cache handler to decide when to use the cache
functions.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The objects which use cache handling should reference their own handler
object not the internal data structure it uses to track the nodes.
Have the "users" of the mmu notifier code pass opaque objects which can
then be properly used in the mmu callbacks depending on the owners needs.
This patch has the additional benefit that operations no longer require a
look up in a list to find the handlers.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The hfi1 driver registers a mmu_notifier callback when /dev/hfi1_* is
opened, and unregisters it when the device is closed. The driver
incorrectly assumes that the close will always happen from the same
context as the open. In particular, closes due to SIGKILL or OOM killer
activity may happen from a different context. In these cases, the wrong
mm is passed to mmu_notifier_unregister(), which causes improper reference
counting for the victim mm, and eventual memory corruption.
Preserve the mm for all open file descriptors and use this mm rather than
current->mm for memory operations for the lifetime of that fd. Note: this
patch leaves 1 use of current->mm in place. This use is removed in a
follow on patch because other functional changes were required prior to
that use being removed.
If registration fails, there is no reason to keep the handler object
around. Free the handler object rather than add it to the list to
prevent any mmu_notifier operations, including unregister, when
registration fails.
Suggested-by: Jim Foraker <foraker1@llnl.gov>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The user SDMA in-use claim bit is in the structure that gets zeroed out
once the claim is made. Move the request in-use flag into its own bit
array and use that for atomic claims. This cleans up the claim code and
removes any race possibility.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If input validation fails, properly free the request before returning.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If unable to insert node into the RB tree cache, node will be freed
before returning from the function. Null out iovec's pointer to node
so iovec does not try to free it later.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Save the current capability state at user context creation
time. Report this saved value for all shared contexts.
Also get rid of unnecessary hfi1_get_base_kinfo function.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If a context has not been assigned or assignment failed, pq may be NULL.
Move the unregister within the protection of the null check.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Clarify the names of the TID mmu functions.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Checking if the rb tree is empty is redundant with the while loop which is
emptying the rb tree.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Rearrange the file open call in prep for new changes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For bool parameters "false" should be used
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
subctxt is not used, just remove it.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
__mmu_rb_remove was called in only 1 place which was a very simple
call site. Combine this function into its caller.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove, insert, and invalidate are always provided. No
need to test.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This makes it more clear what these functions are
operating on.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Parameter names to function declarations make it more clear
what those parameters do.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These are no longer needed.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Brackets should be on the next line of a function
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Expand the serial number space by using more bits
from the GUID.
Reviewed-by: Jubin John <jubin.john@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The driver pads non-double word multiple message sizes but it doesn't
account for this padding when the packet length is calculated. Also, the
data length is miscalculated for message sizes less than 4 bytes due to
the bit representation in LRH. And there's a check for non-double word
multiple message sizes that prevents these messages from being sent.
This patch fixes length miscalculations and enables the functionality to
send non-double word multiple message sizes.
Reviewed-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The use of the specific opcode test is redundant since
all ack entry users correctly manipulate the mr pointer
to selectively trigger the reference clearing.
The overly specific test hinders the use of implementation
specific operations.
The change needs to get rid of the union to insure that
an atomic value is not seen as an MR pointer.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Checking the return value of the memory allocation call in
init_pervl_scs() was missed. Recently the kmalloc() was changed to
kzalloc() which identified the problem.
While fixing this issue 2 other bugs were noticed. First, the array
being allocated is accessed in the nomem path which can be reached before
it is allocated. Second, kernel_send_context was not released on error.
Fix both of these by creating a more common memory unwind label structure.
Fixes: 35f6befc84 ("staging/rdma/hfi1: Add qp to send context mapping for PIO")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Instead of copying the actual GRH of type struct ib_grh, existing code
copies the struct ib_global_route into the sge. This patch fixes that
and constructs the actual GRH from ib_global_route and copies the GRH
into the sge.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The interface is used to compute the 5-bit SC field from the
LRH and the RHF bits. Modify code to use the interface instead.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Cleanup hfi1_ud_rcv to not have to look at the packet
header fields multiple times. The fields are looked up
once and used throughout the function. Also fix sc
computation when validating MAD packets.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
hfi1_pio_header should really be called hfi1_sdma_header
as it is only used for sdma transmits.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
struct ahg_ib_header has no header specific information.
Rename it to struct hfi1_ahg_info
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
sde and hfi1_ib_header are not used anymore.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Active QSFP cables were reset only every alternate iteration of the
channel tuning algorithm instead of every iteration due to incorrect
reset of the flag that controlled QSFP reset, resulting in using stale
QSFP status in the channel tuning algorithm.
Fixes: 8ebd4cf185 ("Add active and optical cable support")
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some QSFP cables assert the interrupt line as a side effect of module
plug-in and power up. This causes the SerDes and QSFP tuning algorithm
to begin cable initialization by reading the QSFP memory map over I2C,
which fails. This patch ignores any interrupt line assertion until
the module has completed power up and voltage rails have stabilized,
which can take a maximum of 500 ms per the SFF-8679 specification.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
QSFP CDR enablement is now controlled by determining power class
and the configuration file. We disable the DC 8051 from requesting
enablement or disabling of TX and RX CDRs by removing the code
that allowed the DC 8051 to request changes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Hanging has been observed while writing a file over NFSoRDMA. Dmesg on
the server contains messages like these:
[ 931.992501] svcrdma: Error -22 posting RDMA_READ
[ 952.076879] svcrdma: Error -22 posting RDMA_READ
[ 982.154127] svcrdma: Error -22 posting RDMA_READ
[ 1012.235884] svcrdma: Error -22 posting RDMA_READ
[ 1042.319194] svcrdma: Error -22 posting RDMA_READ
Here is why:
With the base memory management extension enabled, FRMR is used instead
of FMR. The xprtrdma server issues each RDMA read request as the following
bundle:
(1)IB_WR_REG_MR, signaled;
(2)IB_WR_RDMA_READ, signaled;
(3)IB_WR_LOCAL_INV, signaled & fencing.
These requests are signaled. In order to generate completion, the fast
register work request is processed by the hfi1 send engine after being
posted to the work queue, and the corresponding lkey is not valid until
the request is processed. However, the rdmavt driver validates lkey when
the RDMA read request is posted and thus it fails immediately with error
-EINVAL (-22).
This patch changes the work flow of local operations (fast register and
local invalidate) so that fast register work requests are always
processed immediately to ensure that the corresponding lkey is valid
when subsequent work requests are posted. Local invalidate requests are
processed immediately if fencing is not required and no previous local
invalidate request is pending.
To allow completion generation for signaled local operations that have
been processed before posting to the work queue, an internal send flag
RVT_SEND_COMPLETION_ONLY is added. The hfi1 send engine checks this flag
and only generates completion for such requests.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fix allows for support of in-kernel reserved operations
without impacting the ULP user.
The low level driver can register a non-zero value which
will be transparently added to the send queue size and hidden
from the ULP in every respect.
ULP post sends will never see a full queue due to a reserved
post send and reserved operations will never exceed that
registered value.
The s_avail will continue to track the ULP swqe availability
and the difference between the reserved value and the reserved
in use will track reserved availabity.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Trace shows incorrect amount of allocated memory.
Fix trace to display memory in KB.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Grzegorz Heldt <grzegorz.heldt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add sysfs entry to allow user to override affinity for SDMA
engine interrupts.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Enhance the PCIe Gen3 recipe to support static CTLE tuning,
and add a switch to choose between static and dynamic
approaches. Make discrete chips default to static CTLE
tuning.
Reviewed-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This fixes the following warnings with PROVE_LOCKING and PROVE_RCU
enabled in the kernel:
case (1):
[ INFO: suspicious RCU usage. ]
drivers/infiniband/hw/hfi1/init.c:532
suspicious rcu_dereference_check() usage!
case (2):
[ INFO: suspicious RCU usage. ]
drivers/infiniband/hw/hfi1/hfi.h:1624
suspicious rcu_dereference_check() usage!
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Read the version of the SBus, PCIe SerDes, and Fabric Serdes
firmwares at driver load time.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When link up fails in LNI, the local and peer state complete
frames are reported as numbers. Explain what the values mean
so the operator can better diagnose the problem.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the default number of kernel receive contexts is set to the
number of NUMA nodes on the system plus one for control context. However,
the systems that have a single socket and/or have NUMA disabled in the BIOS
will have only one receive context by default. This patch would ensure that
by default there will be at least two kernel receive contexts plus one for
control context regardless of the number of NUMA nodes on the system. The
user can override the default number of kernel receive contexts with the
krcvqs module parameter.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Advertise and add the capability of handing all aspects of IBTA extended
memory management support in post send.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support extended memory management support, add send side
processing of work requests of type IB_WR_REG_MR, IB_WR_LOCAL_INV, and
IB_WR_SEND_WITH_INV. The first two are local operations and are supported
for both RC and UC. Send with invalidate is only supported for RC because
the corresponding IB opcodes are not defined for UC.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
As part of enabling extended memory management support, add the processing
of the RC send with invalidate.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Some work requests are local operations, such as IB_WR_REG_MR and
IB_WR_LOCAL_INV. They differ from non-local operations in that:
(1) Local operations can be processed immediately without being posted
to the send queue if neither fencing nor completion generation is needed.
However, to ensure correct ordering, once a local operation is posted to
the work queue due to fencing or completion requiement, all subsequent
local operations must also be posted to the work queue until all the
local operations on the work queue have completed.
(2) Local operations don't send packets over the wire and thus don't
need (and shouldn't update) the packet sequence numbers.
Define a new a flag bit for the post send table to identify local
operations.
Add a new field to the QP structure to track the number of local
operations on the send queue to determine if direct processing of new
local operations should be enabled/disabled.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support extended memory management, add the mechanism to
invalidate MR keys. This includes a flag "lkey_invalid" in the MR data
structure that is to be checked when validating access to the MR via
the associated key, and two utility functions to perform fast memory
registration and memory key invalidate operations.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This implements the device specific function needed by the verbs
API function ib_map_mr_sg().
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
There were multiple places where FECN/BECN processing was
being done for the different types of QPs. All of that code
was very similar, which meant that it could be pulled into
a single function used by the different QP types.
To retain the performance in the fastpath, the common code
starts with an inline function, which only calls the slow
path if the packet has any of the [FB]ECN bits set.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
While handling buffer control MAD, partially initialized
dd->kernel_send_context area may cause potential dereference
of uninitialized pointers. Fix by using kzalloc_node()
instead of kmalloc_node().
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Tymoteusz Kielan <tymoteusz.kielan@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
PMA should not sum TX and RX replay counts when reporting
local link integrity errors. Fixed by removing C_DC_TX_REPLAY
counter from calculation of the link integrity errors counter
value.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change rvt_post_one_wr to use the new table mechanism for
post send.
Validate that each low level driver specifies the table.
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add flexibility for driver dependent operations in post send
because different drivers will have differing post send
operation support.
This includes data structure definitions to support a table
driven scheme along with the necessary validation routine
using the new table.
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Prevent processing receive packet in case when opcode is
accepted by QP but handler for this type of packet is not
defined.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jakub Pawlak <jakub.pawlak@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently each user context is assigned a single SDMA engine
based on the VL, context id, and subcontext id. That means for
MPI applications, each rank can only use one SDMA engine for
all messages. This may create unwanted backup for independent
messages going to different destinations upon congestion at one
destination.
This patch adds the packet "dlid" to the formula of SDMA engine
selection for user SDMA requests. A simple hash table is used
to maintain even distribution among the available SDMA engines
regardless how the "dlid" values are distributed.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Reviewed-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Remove the TWSI code. The driver now uses the kernel's built-in
i2c bit bus module.
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Use built-in i2c bit-shift bus adapter to control the
i2c busses on the chip.
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When performing process affinity recommendations for MPI ranks, the current
algorithm doesn't take into account multiple HFI units. Also, real
cores and HT cores are not distinguished from one another. Therefore,
all HT cores are recommended to be assigned first within the local NUMA
node before recommending the assignments of cores in other NUMA nodes.
It's ideal to assign all real cores across all NUMA nodes first, then all
HT 1 cores, then all HT 2 cores, and so on to balance CPU workload. CPU
cores in other NUMA nodes could be running interrupt handlers, and this is
not taken into account.
To balance the CPU workload for user processes, the following
recommendation algorithm is used:
For each user process that is opening a context on HFI Y:
a) If all cores are assigned to user processes, start assignments all
over from the first core
b) Assign real cores first, then HT cores (First set of HT cores on
all physical cores, then second set of HT cores, and, so on) in the
following order:
1. Same NUMA node as HFI Y and not running an IRQ handler
2. Same NUMA node as HFI Y and running an IRQ handler
3. Different NUMA node to HFI Y and not running an IRQ handler
4. Different NUMA node to HFI Y and running an IRQ handler
c) Mark core as assigned in the global affinity structure. As user
processes are done, remove core assignments from global affinity
structure.
This implementation allows an arbitrary number of HT cores and provides
support for multiple HFIs.
This is being included in the kernel rather than user space due to the
fact that user space has no way of knowing the CPU recommendations for
contexts running as part of other jobs.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Kernel receive queues oversubscribe CPU cores on multi-HFI systems.
To prevent this, the kernel receive queues are separated onto
different cores, and the SDMA engine interrupts are constrained to
a lesser number of cores.
hfi1s_on_numa_node*krcvqs is the number of CPU cores that are
reserved for kernel receive queues for all HFIs. Each HFI initializes
its kernel receive queues to one of the reserved CPU cores. If there
ends up being 0 CPU cores leftover for SDMA engines, use the same
CPU cores as receive contexts.
In addition, general and control contexts are assigned to their own
CPU core, however, both types of contexts tend to have low traffic.
To save CPU cores, collapse general and control contexts to one CPU
core for all HFI units. This change prevents SDMA engine interrupts
from wrapping around general contexts.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When HFI units get initialized, they each use their own mask copy for
affinity assignments. On a multi-HFI system, affinity assignments
overbook CPU cores as each HFI doesn't have knowledge of affinity
assignments for other HFI units. Therefore, some CPU cores are never
used for interrupt handlers in systems with high number of CPU cores
per NUMA node.
For multi-HFI systems, SDMA engine interrupt assignments start all over
from the first CPU in the local NUMA node after the first HFI
initialization. This change allows assignments to continue where the
last HFI unit left off.
Add global structure for affinity assignments for multiple HFIs to share
affinity mask.
Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com>
Reviewed-by: Jubin John <jubin.john@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The q-counter-id is given in modify-QP command associates
the QP with the counter. The offset to which the counter
ID was set is incorrect, causing IB port counters not to
count on QP.
Fixes: 0837e86a7a ('IB/mlx5: Add per port counters')
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Number of outstanding_pi may overflow and as a result may indicate that
there are no elements in the queue. The effect of doing this is that the
MAD layer will get stuck waiting for completions. The MAD layer will
think that the QP is full - because it didn't receive these completions.
This fix changes it so the outstanding_pi number is increased
with 32-bit wraparound and is not limited to max_send_wr so
that the difference between outstanding_pi and outstanding_ci will
really indicate the number of outstanding completions.
Cc: Stable <stable@vger.kernel.org>
Fixes: ea6dc20362 ('IB/mlx5: Reorder GSI completions')
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
i40iw_puda_get_listbuf may return NULL if the list is empty.
Add NULL check prior to accessing the pointer.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change dup_ack_thressh to u8 since it is a 3 bit field.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In i40iw_cq_poll_completion, we always move the tail. So there is
no reason to check for overflow everytime we move the head.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Replace a subtract and multiply with an add; while populating fragments
in SQ wqe.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Child_listen_node pointer is used in a debug print after kfree.
Move the print before kfree.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>