Commit Graph

162 Commits

Author SHA1 Message Date
Yonatan Cohen 2d4b21e0a2 IB/rxe: Prevent from completer to operate on non valid QP
On UD QP completer tasklet is scheduled for each packet sent.

If it is followed by a destroy_qp(), the kernel panic will
happen as the completer tries to operate on a destroyed QP.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 16:17:32 -05:00
Maor Gottlieb f39f775218 IB/rxe: Fix rxe dev insertion to rxe_dev_list
The first argument of list_add_tail is the new item and the second
is the head of the list. Fix the code to pass arguments in the
right order, otherwise not all the rxe devices will be removed
during teardown.

Fixes: 8700e3e7c4 ('Soft RoCE driver')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24 16:17:25 -05:00
Linus Torvalds 296915912d First round of -rc fixes for 4.10 kernel
- Series of qedr fixes
 - Series of rxe fixes
 - One isolated i40iw fix
 - One isolated cma fix
 - One isolated cxgb4 fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYXAGvAAoJELgmozMOVy/dDukQAMMNarWp0U8KfNYRU5tyCBwd
 aIQC1gFT6GUCFys40Z6L84m1D3NpGR+vzVv3grVBeuge73b79zAOHXvVDwJCA+Jl
 QQLG3vZ13C3158sLDiK8zL+4Ob5OfOQ5nQ2spvDfJWpye9SD+pWFcrpqvK02ANRN
 kFHILk1gROBTNi46yBR5hjWOkw7Bua6XLsPxh6xoaDZ43NL0r0xgm43FTnj/19x3
 0zpZYYKP+3C6U7678rqaog9zfXHvadghW5/WBJ/VgfKqEmH89ESx4J2MvbB8DxFD
 1tWAOpr5TNY5jnh8mtUsceDjCzQivc/RWqAu05BspEwcavjSLFyRYr1epR0/4oAd
 PqLSmfORmhpJ8+5Kmn+chtXo3TT4SYGHIzSUbgbEV/ClwX/7UW+w8mfQZ3buUBq/
 cQp/oRnJcsrQIEDFO3AH7P+6Sxy6t3zbSl5oKBUOI1u4RFmC7YBPqo9fQu2Z2mGk
 3+AWQaPr7qgEcFzXBgLzvd4LhTYKsvmiNwrcXi9KjjwQjNEVg15qqF2YtmxEUgi9
 kh3IOcGan3iSblhV/WLrxcOjlPQrPpBOVnTPhUskFtlsrD+032OxeOBpVoU3nCUt
 MjTYWoNTYdw4wHz0w373o0uR4+4nl4a5OmO4Fh6Drmg5hm4Bl9BWy0Kziu93Z1Ay
 Z2utZVWLWhBzn8yJujUz
 =NW9g
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 "First round of -rc fixes for 4.10 kernel:

   - a series of qedr fixes
   - a series of rxe fixes
   - one i40iw fix
   - one cma fix
   - one cxgb4 fix"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/rxe: Don't check for null ptr in send()
  IB/rxe: Drop future atomic/read packets rather than retrying
  IB/rxe: Use BTH_PSN_MASK when ACKing duplicate sends
  qedr: Always notify the verb consumer of flushed CQEs
  qedr: clear the vendor error field in the work completion
  qedr: post_send/recv according to QP state
  qedr: ignore inline flag in read verbs
  qedr: modify QP state to error when destroying it
  qedr: return correct value on modify qp
  qedr: return error if destroy CQ failed
  qedr: configure the number of CQEs on CQ creation
  i40iw: Set 128B as the only supported RQ WQE size
  IB/cma: Fix a race condition in iboe_addr_get_sgid()
  IB/rxe: Fix a memory leak in rxe_qp_cleanup()
  iw_cxgb4: set correct FetchBurstMax for QPs
2016-12-23 10:38:48 -08:00
Andrew Boyer 5cc8fabc5e IB/rxe: Don't check for null ptr in send()
pkt->qp was already dereferenced earlier in the function.

Fixes Smatch complaint:
drivers/infiniband/sw/rxe/rxe_net.c:458 send()
	 warn: variable dereferenced before check 'pkt->qp' (see line 441)

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-22 11:36:12 -05:00
Andrew Boyer cbf1f9a46c IB/rxe: Drop future atomic/read packets rather than retrying
If the completer is in the middle of a large read operation, one
lost packet can cause havoc. Going to COMPST_ERROR_RETRY will
cause the requester to resend the request. After that, any packet
from the first attempt still in the receive queue will be
interpreted as an error, restarting the error/retry sequence.
The transfer will quickly exhaust its retries.

This behavior is very noticeable when doing 512KB reads on a
QEMU system configured with 1500B MTU.

Also, a resent request here will prompt the responder on the
other side to immediately start resending, but the resent
packets will get stuck in the already-loaded receive queue and
will never be processed.

Rather than erroring out every time an unexpected future packet
arrives, just drop it. Eventually the retry timer will send a
duplicate request; the completer will be able to make progress since
the queue will start relatively empty.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-22 11:36:12 -05:00
Andrew Boyer 37b3619394 IB/rxe: Use BTH_PSN_MASK when ACKing duplicate sends
Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-22 11:36:12 -05:00
Bart Van Assche e259934d4d IB/rxe: Fix a memory leak in rxe_qp_cleanup()
A socket is associated with every QP by the rxe driver but sock_release()
is never called. Add a call to sock_release() in rxe_qp_cleanup().

Fixes: commit 8700e3e7c48A5 ("Add Soft RoCE driver")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Moni Shoua <monis@mellanox.com>
Cc: Kamal Heib <kamalh@mellanox.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-18 13:35:19 -05:00
Linus Torvalds d3ea547853 rdma: fix buggy code that the compiler warns about
Get rid of this warning:

  drivers/infiniband/sw/rdmavt/cq.c: In function ‘rvt_cq_exit’:
  drivers/infiniband/sw/rdmavt/cq.c:542:2: warning: ‘worker’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    kthread_destroy_worker(worker);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

by fixing the function to actually work.

Fixes: 6efaf10f16 ("IB/rdmavt: Avoid queuing work into a destroyed cq kthread worker")
Cc: Petr Mladek <pmladek@suse.com>
Cc: Doug Ledford <dledford@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-15 12:18:42 -08:00
Linus Torvalds 4d5b57e05a Updates for 4.10 kernel merge window
- Shared mlx5 updates with net stack (will drop out on merge if Dave's
   tree has already been merged)
 - Driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe
 - Debug cleanups
 - New connection rejection helpers
 - SRP updates
 - Various misc fixes
 - New paravirt driver from vmware
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYUbAPAAoJELgmozMOVy/dMXcP/iuG5MNzfN8Ny1JftyBQGWg3
 cqoQ2OLj9CsXjwVB+5EqbcZHRZY852lKONaLoDKkIOx4YAXO2YuIKOp944vN7EQx
 96wfqzT1F5jzAcy5mYZXgLaStGFDAwejKMqeHd0LfJj3OEtemGnVPWYzyqSQmSKo
 dzJraS1Z9GIRppzU5WaRpB9PtRBkqIqGJ5vZ0EKLGhed5hYY5r0iMJB0GfriMRDO
 lJ4UUVfpsAoLPnqDBFH6IMn2V2UeAw9IR5zNa1mrM1RBfvt/uYTxrw1w3p9WoaNs
 GRodhk4DCeAfeyqzVPNBLyXZ4Zq4FzGe3UWM4qysJ1RR4oFNw9Cuw0Fqk8mrfznr
 7hv5TpGIckRZiKf8l6e+qLirF0qGtXJg29j2vPVQI9i5nSj95g1agA81PnLQlLLb
 flWyxeMj81my7lfMHN1xcV6pqPEKMCOysZmfcvVfJd2XxpjuVD7ekl/YXWp8o8kU
 YPdQMqPD626XsD8VpPdMszb9FPmx0JD0HEv+Y1rIFX8JegEI+c3H2X0dqC27T/Ou
 FEPWOy025EgHm0Fh/7eIzkG6tjZ4JHoCugJAcxNZGj2XW4eB6r5vY8UwJ8iQRv+n
 PVYHiy0UoIRePh0mrdOSSphGZMi/GO/DsqKwCtAMEK43WqZQju6wR7QSIGkh66mp
 4uSHJqpf3YEYylxGMhk3
 =QeGy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is the complete update for the rdma stack for this release cycle.

  Most of it is typical driver and core updates, but there is the
  entirely new VMWare pvrdma driver. You may have noticed that there
  were changes in DaveM's pull request to the bnxt Ethernet driver to
  support a RoCE RDMA driver. The bnxt_re driver was tentatively set to
  be pulled in this release cycle, but it simply wasn't ready in time
  and was dropped (a few review comments still to address, and some
  multi-arch build issues like prefetch() not working across all
  arches).

  Summary:

   - shared mlx5 updates with net stack (will drop out on merge if
     Dave's tree has already been merged)

   - driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe

   - debug cleanups

   - new connection rejection helpers

   - SRP updates

   - various misc fixes

   - new paravirt driver from vmware"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits)
  IB: Add vmw_pvrdma driver
  IB/mlx4: fix improper return value
  IB/ocrdma: fix bad initialization
  infiniband: nes: return value of skb_linearize should be handled
  MAINTAINERS: Update Intel RDMA RNIC driver maintainers
  MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers
  IB/core: fix unmap_sg argument
  qede: fix general protection fault may occur on probe
  IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc
  mlx5, calc_sq_size(): Make a debug message more informative
  mlx5: Remove a set-but-not-used variable
  mlx5: Use { } instead of { 0 } to init struct
  IB/srp: Make writing the add_target sysfs attr interruptible
  IB/srp: Make mapping failures easier to debug
  IB/srp: Make login failures easier to debug
  IB/srp: Introduce a local variable in srp_add_one()
  IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build
  IB/multicast: Check ib_find_pkey() return value
  IPoIB: Avoid reading an uninitialized member variable
  IB/mad: Fix an array index check
  ...
2016-12-15 12:03:32 -08:00
Doug Ledford 9032ad78bb Merge branches 'misc', 'qedr', 'reject-helpers', 'rxe' and 'srp' into merge-test 2016-12-14 14:44:47 -05:00
Doug Ledford 86ef0beaa0 Merge branch 'mlx' into merge-test 2016-12-14 14:44:25 -05:00
Doug Ledford 253f8b22e0 Merge branch 'hfi1' into merge-test 2016-12-14 14:44:08 -05:00
Jim Foraker 22dccc5454 IB/rdmavt: Only put mmap_info ref if it exists
rvt_create_qp() creates qp->ip only when a qp creation request comes from
userspace (udata is not NULL).  If we exceed the number of available
queue pairs however, the error path always attempts to put a kref to this
structure.  If the requestor is inside the kernel, this leads to a crash.

We fix this by checking that qp->ip is not NULL before caling kref_put().

Signed-off-by: Jim Foraker <foraker1@llnl.gov>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Alex Estrin <alex.estrin@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 12:16:11 -05:00
Petr Mladek f5eabf5e51 IB/rdmavt: Handle the kthread worker using the new API
Use the new API to create and destroy the cq kthread worker.
The API hides some implementation details.

In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way and stores
task_struct into the worker structure. In addition, the *on_cpu()
variant binds the kthread to the given cpu and the related memory
node.

kthread_destroy_worker() flushes all pending works, stops
the kthread and frees the structure.

This patch does not change the existing behavior. Note that we must
use the on_cpu() variant because the function starts the kthread
and it must bind it to the right CPU before waking. The numa node
is associated for given CPU as well.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 12:16:11 -05:00
Petr Mladek 6efaf10f16 IB/rdmavt: Avoid queuing work into a destroyed cq kthread worker
The memory barrier is not enough to protect queuing works into
a destroyed cq kthread. Just imagine the following situation:

CPU1				CPU2

rvt_cq_enter()
  worker =  cq->rdi->worker;

				rvt_cq_exit()
				  rdi->worker = NULL;
				  smp_wmb();
				  kthread_flush_worker(worker);
				  kthread_stop(worker->task);
				  kfree(worker);

				  // nothing queued yet =>
				  // nothing flushed and
				  // happily stopped and freed

  if (likely(worker)) {
     // true => read before CPU2 acted
     cq->notify = RVT_CQ_NONE;
     cq->triggered++;
     kthread_queue_work(worker, &cq->comptask);

  BANG: worker has been flushed/stopped/freed in the meantime.

This patch solves this by protecting the critical sections by
rdi->n_cqs_lock. It seems that this lock is not much contended
and looks reasonable for this purpose.

One catch is that rvt_cq_enter() might be called from IRQ context.
Therefore we must always take the lock with IRQs disabled to avoid
a possible deadlock.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 12:16:11 -05:00
Moni Shoua 477864c8fc IB/core: Let create_ah return extended response to user
Add struct ib_udata to the signature of create_ah callback that is
implemented by IB device drivers. This allows HW drivers to return extra
data to the userspace library.
This patch prepares the ground for mlx5 driver to resolve destination
mac address for a given GID and return it to userspace.
This patch was previously submitted by Knut Omang as a part of the
patch set to support Oracle's Infiniband HCA (SIF).

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-13 13:38:27 -05:00
Yonatan Cohen d680ebed91 IB/rxe: Increase max number of completions to 32k
Increase limit of max CQE from 8K to 32K to allow demanding
applications to work over SoftRoCE with same configuration
as most RoCEv2 HW vendors have.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-13 13:33:24 -05:00
Andrew Boyer 37f69f43fb IB/rxe: Hold refs when running tasklets
It might be possible for all of a QP's references to be dropped
while one of that QP's tasklets is running.

For example, the completer might run during QP destroy.
If qp->valid is false, it will drop all of the packets on
the resp_pkts list, potentially removing the last reference.
Then it tries to advance the SQ consumer pointer. If the
SQ's buffer has already been destroyed, the system will
panic.

To be safe, hold a reference on the QP for the duration
of each tasklet.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:34:22 -05:00
Andrew Boyer 07bf9627d5 IB/rxe: Wait for tasklets to finish before tearing down QP
The system may crash when a malformed request is received and
the error is detected by the responder.

NodeA: $ ibv_rc_pingpong -g 0 -d rxe0 -i 1 -n 1 -s 50000
NodeB: $ ibv_rc_pingpong -g 0 -d rxe0 -i 1 -n 1 -s 1024 <NodeA_ip>

The responder generates a receive error on node B since the incoming
SEND is oversized. If the client tears down the QP before the responder
or the completer finish running, a page fault may occur.

The fix makes the destroy operation spin until the tasks complete, which
appears to be original intent of the design.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer 5407f53012 IB/rxe: Fix ref leak in duplicate_request()
A ref was added after the call to skb_clone().

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer 5b9ea16c54 IB/rxe: Fix ref leak in rxe_create_qp()
The udata->inlen error path needs to clean up the ref
added by rxe_alloc().

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer accacb8f51 IB/rxe: Add support for IB_CQ_REPORT_MISSED_EVENTS
Peek at the CQ after arming it so that we can return a hint.
This avoids missed completions due to a race between posting
CQEs and arming the CQ.

For example, CM teardown waits on MAD requests to complete with
ib_cq_poll_work(). Without this fix, the last completion might be
left on the CQ, hanging the kthread doing the teardown.

The console backtraces look like this:

[ 4199.911284] Call Trace:
[ 4199.911401]  [<ffffffff9657fe95>] schedule+0x35/0x80
[ 4199.911556]  [<ffffffff965830df>] schedule_timeout+0x22f/0x2c0
[ 4199.911727]  [<ffffffff9657f7a8>] ? __schedule+0x368/0xa20
[ 4199.911891]  [<ffffffff96580903>] wait_for_completion+0xb3/0x130
[ 4199.912067]  [<ffffffff960a17e0>] ? wake_up_q+0x70/0x70
[ 4199.912243]  [<ffffffffc074a06d>] cm_destroy_id+0x13d/0x450 [ib_cm]
[ 4199.912422]  [<ffffffff961615d5>] ? printk+0x57/0x73
[ 4199.912578]  [<ffffffffc074a390>] ib_destroy_cm_id+0x10/0x20 [ib_cm]
[ 4199.912759]  [<ffffffffc076098c>] rdma_destroy_id+0xac/0x340 [rdma_cm]
[ 4199.912941]  [<ffffffffc076f2cc>] 0xffffffffc076f2cc

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer d4fb59256a IB/rxe: Add support for zero-byte operations
The last_psn algorithm fails in the zero-byte case: it calculates
first_psn = N, last_psn = N-1. This makes the operation unretryable since
the res structure will fail the (first_psn <= psn <= last_psn) test in
find_resource().

While here, use BTH_PSN_MASK to mask the calculated last_psn.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer d38eb801aa IB/rxe: Unblock loopback by moving skb_out increment
skb_out is decremented in rxe_skb_tx_dtor(), which is not called in the
loopback() path. Move the increment to the send() path rather than
rxe_xmit_packet().

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer 2a7a85487e IB/rxe: Don't update the response PSN unless it's going forwards
A client might post a read followed by a send. The partner receives
and acknowledges both transactions, posting an RCQ entry for the
send, but something goes wrong with the read ACK. When the client
retries the read, the partner's responder processes the duplicate
read but incorrectly resets the PSN to the value preceding the
original send. When the duplicate send arrives, the responder cannot
tell that it is a duplicate, so the responder generates a duplicate
RCQ entry, confusing the client.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer dd753d8743 IB/rxe: Advance the consumer pointer before posting the CQE
A simple userspace application might poll the CQ, find a completion,
and then attempt to post a new WQE to the SQ. A spurious error can
occur if the userspace application detects a full SQ in the instant
before the kernel is able to advance the SQ consumer pointer.

This is noticeable when using single-entry SQs with ibv_rc_pingpong
if lots of kernel and userspace library debugging is enabled.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Andrew Boyer 6e9bb530ff IB/rxe: Remove buffer used for printing IP address
Avoid smashing the stack when an ICRC error occurs on an IPv6 network.

Signed-off-by: Andrew Boyer <andrew.boyer@dell.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Dan Carpenter 95db9d05b7 IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
It makes me nervous when we cast pointer parameters.  I would estimate
that around 50% of the time, it indicates a bug.  Here the cast is not
needed becaue u32 and and unsigned int are the same thing.  Removing the
cast makes the code more robust and future proof in case any of the
types change.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Wei Yongjun 4ac4707102 IB/rxe: Use DEFINE_SPINLOCK() for spinlock
spinlock can be initialized automatically with DEFINE_SPINLOCK()
rather than explicitly calling spin_lock_init().

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Leon Romanosky <leonro@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Arnd Bergmann a0fa72683e IB/rxe: avoid putting a large struct rxe_qp on stack
A race condition fix added an rxe_qp structure to the stack in order
to be able to perform rollback in rxe_requester(), but the structure
is large enough to trigger the warning for possible stack overflow:

drivers/infiniband/sw/rxe/rxe_req.c: In function 'rxe_requester':
drivers/infiniband/sw/rxe/rxe_req.c:757:1: error: the frame size of 2064 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

This changes the rollback function to only save the psn inside
the qp, which is the only field we access in the rollback_qp
anyway.

Fixes: 3050b99850 ("IB/rxe: Fix race condition between requester and completer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-12 16:31:45 -05:00
Mike Marciniszyn f2dc9cdce8 IB/rdmavt: Add a send completion helper
This is for use by client drivers to drive
send completions into a CQ.

A new exported table allows for the mapping
of ib_wr_opcode into a ib_wc_opcode.

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-11 15:29:42 -05:00
Sebastian Sanchez f84dfa26e6 IB/hfi1: Use reference count wrapper for MRs
Some parts of the code don't use the standard driver
wrapper for memory region reference counters. Use the
standard driver wrapper throughout the code.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-11 15:29:42 -05:00
Sebastian Sanchez b44980f879 IB/hfi1: Replace qp->refcount release code with standard driver wrapper
Some parts of the code don't use the standard release
wrapper rvt_put_qp() for decrementing and testing
the refcount to then try to use a resource.
Replace this code with the standard driver wrapper.

Fixes: Commit 4d6f85c3fa ("IB/rdmavt, IB/qib, IB/hfi1: Use new QP put get routines")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-11 15:29:42 -05:00
Mike Marciniszyn fcb29a6668 IB/rdmavt: Add trace of MR segs
Add tracing of MR segment information.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-11 15:29:42 -05:00
Dennis Dalessandro cf4c2f8c9d IB/rdmavt: Fix trace hierarchy
Split rdmavt traces into separate files to preserve the original
hierarchy since only one trace sub system may now be defined per header
file.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-11 15:25:13 -05:00
Leon Romanovsky 907610bfdf IB/rxe: Remove and fix debug prints after allocation failure
The prints after [k|v][m|z|c]alloc() functions are not needed,
because in case of failure, allocator will print their internal
error prints anyway.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-03 13:12:52 -05:00
Doug Ledford 6fa1f2f0aa Merge branches 'hfi1' and 'mlx' into k.o/for-4.9-rc 2016-11-16 20:05:10 -05:00
Yonatan Cohen 6d931308f5 IB/rxe: Update qp state for user query
The method rxe_qp_error() transitions QP to error state
and make sure the QP is drained. It did not though update
the QP state for user's query.

This patch fixes this.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-16 20:03:44 -05:00
Yonatan Cohen aa75b07b47 IB/rxe: Clear queue buffer when modifying QP to reset
RXE resets the send-q only once in rxe_qp_init_req() when
QP is created, but when the QP is reused after QP reset, the send-q
holds previous garbage data.

This garbage data wrongly fails CQEs that otherwise
should have completed successfully.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-16 20:03:44 -05:00
Yonatan Cohen 002e062e13 IB/rxe: Fix handling of erroneous WR
To correctly handle a erroneous WR this fix does the following
1. Make sure the bad WQE causes a user completion event.
2. Call rxe_completer to handle the erred WQE.

Before the fix, when rxe_requester found a bad WQE, it changed its
status to IB_WC_LOC_PROT_ERR and exit with 0 for non RC QPs.

If this was the 1st WQE then there would be no ACK to invoke the
completer and this bad WQE would be stuck in the QP's send-q.

On top of that the requester exiting with 0 caused rxe_do_task to
endlessly invoke rxe_requester, resulting in a soft-lockup attached
below.

In case the WQE was not the 1st and rxe_completer did get a chance to
handle the bad WQE, it did not cause a complete event since the WQE's
IB_SEND_SIGNALED flag was not set.

Setting WQE status to IB_SEND_SIGNALED is subject to IBA spec
version 1.2.1, section 10.7.3.1 Signaled Completions.

NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s!
[<ffffffffa0590145>] ? rxe_pool_get_index+0x35/0xb0 [rdma_rxe]
[<ffffffffa05952ec>] lookup_mem+0x3c/0xc0 [rdma_rxe]
[<ffffffffa0595534>] copy_data+0x1c4/0x230 [rdma_rxe]
[<ffffffffa058c180>] rxe_requester+0x9d0/0x1100 [rdma_rxe]
[<ffffffff8158e98a>] ? kfree_skbmem+0x5a/0x60
[<ffffffffa05962c9>] rxe_do_task+0x89/0xf0 [rdma_rxe]
[<ffffffffa05963e2>] rxe_run_task+0x12/0x30 [rdma_rxe]
[<ffffffffa059110a>] rxe_post_send+0x41a/0x550 [rdma_rxe]
[<ffffffff811ef922>] ? __kmalloc+0x182/0x200
[<ffffffff816ba512>] ? down_read+0x12/0x40
[<ffffffffa054bd32>] ib_uverbs_post_send+0x532/0x540 [ib_uverbs]
[<ffffffff815f8722>] ? tcp_sendmsg+0x402/0xb80
[<ffffffffa05453dc>] ib_uverbs_write+0x18c/0x3f0 [ib_uverbs]
[<ffffffff81623c2e>] ? inet_recvmsg+0x7e/0xb0
[<ffffffff8158764d>] ? sock_recvmsg+0x3d/0x50
[<ffffffff81215b87>] __vfs_write+0x37/0x140
[<ffffffff81216892>] vfs_write+0xb2/0x1b0
[<ffffffff81217ce5>] SyS_write+0x55/0xc0
[<ffffffff816bc672>] entry_SYSCALL_64_fastpath+0x1a/0xa

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-16 20:03:44 -05:00
Yonatan Cohen 1454ca3a97 IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
Missing initialization of udp_tunnel_sock_cfg causes to following
kernel panic, while kernel tries to execute gro_receive().

While being there, we converted udp_port_cfg to use the same
initialization scheme as udp_tunnel_sock_cfg.

------------[ cut here ]------------
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffffffffa0588c50
IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163
Oops: 0011 [#1] SMP
Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000
RIP: 0010:[<ffffffffa0588c50>]
[<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP: 0018:ffff880237343c80  EFLAGS: 00010282
RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712
RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00
RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262
R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578
FS:  0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0
Stack:
ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014
000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb
0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000
Call Trace:
<IRQ>
[<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130
[<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0
[<ffffffff81611373>] inet_gro_receive+0x1d3/0x270
[<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0
[<ffffffff81595188>] napi_gro_receive+0x38/0x120
[<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core]
[<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core]
[<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core]
[<ffffffff815949a0>] net_rx_action+0x160/0x380
[<ffffffff816a9197>] __do_softirq+0xd7/0x2c5
[<ffffffff81085c35>] irq_exit+0xf5/0x100
[<ffffffff816a8f16>] do_IRQ+0x56/0xd0
[<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c
<EOI>
[<ffffffff81061f96>] ? native_safe_halt+0x6/0x10
[<ffffffff81037ade>] default_idle+0x1e/0xd0
[<ffffffff8103828f>] arch_cpu_idle+0xf/0x20
[<ffffffff810c37dc>] default_idle_call+0x3c/0x50
[<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0
[<ffffffff81050d8c>] start_secondary+0x15c/0x1a0
RIP  [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP <ffff880237343c80>
CR2: ffffffffa0588c50
---[ end trace 489ee31fa7614ac5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-16 20:03:44 -05:00
Mike Marciniszyn 99f80d2f5f IB/hfi1: Optimize lkey validation structures
Profiling shows that the key validation is susceptible
to cache line trading when accessing the lkey table.

Fix by separating out the read mostly fields from the write
fields.   In addition the shift amount, which is function
of the lkey table size, is precomputed and stored with the
table pointer.   Since both the shift and table pointer
are in the same read mostly cacheline, this saves a cache
line in this hot path.

Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-15 16:25:59 -05:00
Dennis Dalessandro e1fafdcbe0 IB/rdmavt: rdmavt can handle non aligned page maps
The initial code for rdmavt carried with it a restriction that was a
vestige from the qib driver, that to dma map a page it had to be less
than a page size. This is not the case on modern hardware, both qib and
hfi1 will be just fine with unaligned map requests.

This fixes a 4.8 regression where by an IPoIB transfer of > PAGE_SIZE
will hang because the dma map page call always fails. This was
introduced after commit 5faba54695 ("IB/ipoib: Report SG feature
regardless of HW UD CSUM capability") added the capability to use SG by
default. Rather than override this, the HW supports it, so allow SG.

Cc: Stable <stable@vger.kernel.org> # 4.8
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-11-15 16:16:40 -05:00
Petr Mladek 3989144f86 kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker()		-> __kthread_init_worker()
init_kthread_worker()		-> kthread_init_worker()
init_kthread_work()		-> kthread_init_work()
insert_kthread_work()		-> kthread_insert_work()
queue_kthread_work()		-> kthread_queue_work()
flush_kthread_work()		-> kthread_flush_work()
flush_kthread_worker()		-> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[arnd@arndb.de: fix linux-next merge conflict]
 Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Parav Pandit e404f945a6 IB/rxe: improved debug prints & code cleanup
1. Debugging qp state transitions and qp errors in loopback and
multiple QP tests is difficult without qp numbers in debug logs.
This patch adds qp number to important debug logs.

2. Instead of having rxe: prefix in few logs and not having in
few logs, using uniform module name prefix using pr_fmt macro.

3. Code cleanup for various warnings reported by checkpatch for
incomplete unsigned data type, line over 80 characters, return
statements.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00
Stephen Bates b9fe856e54 rdma_rxe: Ensure rdma_rxe init occurs at correct time
There is a problem when CONFIG_RDMA_RXE=y and CONFIG_IPV6=y. This
results in the rdma_rxe initialization occurring before the IPv6
services are ready. This patch delays the initialization of rdma_rxe
until after the IPv6 services are ready. This fix is based on one
proposed by Logan Gunthorpe on a much older code base.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00
Parav Pandit b6bbee0d24 IB/rxe: Properly honor max IRD value for rd/atomic.
This patch honoris the max incoming read request count instead of
outgoing read req count
(a) during modify qp by allocating response queue metadata
(b) during incoming read request processing

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00
Parav Pandit d9703650f4 IB/{rxe,core,rdmavt}: Fix kernel crash for reg MR
This patch fixes below kernel crash on memory registration for rxe
and other transport drivers which has dma_ops extension.

IB/core invokes ib_map_sg_attrs() in generic manner with dma attributes
which is used by mlx5 and mthca adapters.  However in doing so it
ignored honoring dma_ops extension of software based transports for
sg map/unmap operation.  This results in calling dma_map_sg_attrs of
hardware virtual device resulting in crash for null reference.

We extend the core to support sg_map/unmap_attrs and transport drivers
to implement those dma_ops callback functions.

Verified usign perftest applications.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81032a75>] check_addr+0x35/0x60
...
Call Trace:
 [<ffffffff81032b39>] ? nommu_map_sg+0x99/0xd0
 [<ffffffffa02b31c6>] ib_umem_get+0x3d6/0x470 [ib_core]
 [<ffffffffa01cc329>] rxe_mem_init_user+0x49/0x270 [rdma_rxe]
 [<ffffffffa01c793a>] ? rxe_add_index+0xca/0x100 [rdma_rxe]
 [<ffffffffa01c995f>] rxe_reg_user_mr+0x9f/0x130 [rdma_rxe]
 [<ffffffffa00419fe>] ib_uverbs_reg_mr+0x14e/0x2c0 [ib_uverbs]
 [<ffffffffa003d3ab>] ib_uverbs_write+0x15b/0x3b0 [ib_uverbs]
 [<ffffffff811e92a6>] ? mem_cgroup_commit_charge+0x76/0xe0
 [<ffffffff811af0a9>] ? page_add_new_anon_rmap+0x89/0xc0
 [<ffffffff8117e6c9>] ? lru_cache_add_active_or_unevictable+0x39/0xc0
 [<ffffffff811f0da8>] __vfs_write+0x28/0x120
 [<ffffffff811f1239>] ? rw_verify_area+0x49/0xb0
 [<ffffffff811f1492>] vfs_write+0xb2/0x1b0
 [<ffffffff811f27d6>] SyS_write+0x46/0xa0
 [<ffffffff814f7d32>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00
Parav Pandit ffae955d49 IB/rxe: Fix sending out loopback packet on netdev interface.
Both prepare4 and prepare6 sets loopback mask in pkt_info structure
instance of skb.  The xmit_packet and other requester side functions
use a pkt_info struct from the stack without the proper mask.  This
results in sending out the packet to the actual netdev device and
loopback functionality is broken.

Modify prepare() to pass its correctly marked pkt_info struct to
prepare4() and prepare6() instead of them using SKB_TO_PKT(skb) and
getting an incorrectly set mask.

Verified with perftest applications.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00
Parav Pandit 063af59597 IB/rxe: Avoid scheduling tasklet for userspace QP
This patch avoids scheduing tasklet for WQE and protocol processing
for user space QP. It performs the task in calling process context.

To improve code readability kernel specific post_send handling moved to
post_send_kernel() function.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-10-06 13:50:04 -04:00