Extend rdma_cm to support multicast communication. Multicast support
is added to the existing RDMA_PS_UDP port space, as well as a new
RDMA_PS_IPOIB port space. The latter port space allows joining the
multicast groups used by IPoIB, which enables offloading IPoIB traffic
to a separate QP. The port space determines the signature used in the
MGID when joining the group. The newly added RDMA_PS_IPOIB also
allows for unicast operations, similar to RDMA_PS_UDP.
Supporting the RDMA_PS_IPOIB requires changing how UD QPs are initialized,
since we can no longer assume that the qkey is constant. This requires
saving the Q_Key to use when attaching to a device, so that it is
available when creating the QP. The Q_Key information is exported to
the user through the existing rdma_init_qp_attr() interface.
Multicast support is also exported to userspace through the rdma_ucm.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IB SA tracks multicast join/leave requests on a per port basis and
does not do any reference counting: if two users of the same port join
the same group, and one leaves that group, then the SA will remove the
port from the group even though there is one user who wants to stay a
member left. Therefore, in order to support multiple users of the
same multicast group from the same port, we need to perform reference
counting locally.
To do this, add an multicast submodule to ib_sa to perform reference
counting of multicast join/leave operations. Modify ib_ipoib (the
only in-kernel user of multicast) to use the new interface.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_cm_alloc_rx_skb() might be called from IRQ context, so it must
use dev_kfree_skb_any(), not kfree_skb().
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove the Open Grid Computing copyright. It shouldn't be there.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For T3B devices, mark user QP in error once we transition
to TERMINATE.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
iwcm iw_cm_id destruction race condition fixes:
- iwcm_deref_id() always wakes up if there's another reference.
- clean up race condition in cm_work_handler().
- create static void free_cm_id() which deallocs the work entries and then
kfrees the cm_id memory. This reduces code replication.
- rem_ref() if this is the last reference -and- the IWCM owns freeing the
cm_id, then free it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set the port phys state as returned from ehca_query_port() to LINK_UP.
ehca actually represents a logical HCA, whose phys/link state always
is LINK_UP.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Allow users to en/disable scaling code when loading ib_ehca module,
rather than requiring the module to be rebuilt to change the setting.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a race condition in find_next_cpu_online() and some other locking
issues in ehca scaling code.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The change to allow allocating ICM chunks from coherent memory did not
increment the count of sg entries properly, so a chunk that required
more than allocation would not be mapped properly by the HCA.
Fix this by adding the missing increment of chunk->nsg.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
RESET->RESET is an allowed QP state transition, so mthca should handle
it correctly, by just returning success without involving the firmware.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.
To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.
Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband:
IB/mthca: Always fill MTTs from CPU
IB/mthca: Merge MR and FMR space on 64-bit systems
IB/mthca: Fix access to MTT and MPT tables on non-cache-coherent CPUs
IB/mthca: Give reserved MTTs a separate cache line
IB/mthca: Fix reserved MTTs calculation on mem-free HCAs
RDMA/cxgb3: Add driver for Chelsio T3 RNIC
IB: Remove redundant "_wq" from workqueue names
RDMA/cma: Increment port number after close to avoid re-use
IB/ehca: Fix memleak on module unloading
IB/mthca: Work around gcc bug on sparc64
IPoIB: Connected mode experimental support
IB/core: Use ARRAY_SIZE macro for mandatory_table
IB/mthca: Use correct structure size in call to memset()
Speed up memory registration by filling in MTTs directly when the CPU
can write directly to the whole table (all mem-free cards, and to
Tavor mode on 64-bit systems with the patch I posted earlier). This
reduces the number of FW commands needed to register an MR by at least
a factor of 2 and speeds up memory registration significantly.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For Tavor, we currently reserve separate MPT and MTT space for FMRs to
avoid abusing the vmalloc space on 32 bit kernels. No such problem
exists on 64 bit kernels so let's not do it there.
This way we have a shared pool for MR and FMR resources, used on
demand. This will also make it possible to write MTTs for regular
regions directly from driver.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We allocate the MTT table with alloc_pages() and then do pci_map_sg(),
so we must call pci_dma_sync_sg() after the CPU writes to the MTT
table. This works since the device will never write MTTs on mem-free
HCAs, once we get rid of the use of the WRITE_MTT firmware command.
This change is needed to make that work, and is an improvement for
now, since it gives FMRs a chance at working.
For MPTs, both the device and CPU might write there, so we must
allocate DMA coherent memory for these.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MTTs are allocated in non-cache-coherent memory, so we must give
reserved MTTs their own cache line, to prevent both device and
CPU from writing into the same cache line at the same time.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The reserved_mtts field has different meaning in Tavor and Arbel, so
we are wasting mtt entries on memfree. Fix the Arbel case to match
Tavor semantics.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an RDMA/iWARP driver for the Chelsio T3 1GbE and 10GbE adapters.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randomize the starting port number and avoid re-using port values
immediately after they are closed. Instead keep track of the last
port value used and increment it every time a new port number is
assigned, to better replicate other port spaces.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Percpu data is not freed on module unloading.
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Raisch <raisch@de.ibm.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For some reason gcc-3.4.5 on sparc64 does:
WARNING: "____ilog2_NaN" [drivers/infiniband/hw/mthca/ib_mthca.ko] undefined!
Points to note:
(1) The asm volatile flush/flushw are just markers for viewing what comes out
in the assembly; removing them has no effect on the result.
(2) Changing almost anything else in dwh__mthca_arbel_init_srq_context() or
dwh__mthca_alloc_srq() causes the problem to go away.
The compiler command line issued by the kernel build is:
/opt/crosstool/gcc-3.4.5-glibc-2.3.6/sparc64-unknown-linux-gnu/bin/sparc64-unknown-linux-gnu-gcc -fno-strict-aliasing -fno-common -Os -m64 -mno-fpu -mcpu=ultrasparc -mcmodel=medlow -ffixed-g4 -ffixed-g5 -fcall-used-g7 -Wa,--undeclared-regs -pg -fno-omit-frame-pointer -fno-optimize-sibling-calls -fasynchronous-unwind-tables -g -c -o drivers/infiniband/hw/mthca/.tmp_mthca_srq.o drivers/infiniband/hw/mthca/mthca_srq.c
This can be reduced to this whilst still retaining the problem:
/opt/crosstool/gcc-3.4.5-glibc-2.3.6/sparc64-unknown-linux-gnu/bin/sparc64-unknown-linux-gnu-gcc -m64 -c -o drivers/infiniband/hw/mthca/mthca_srq.o drivers/infiniband/hw/mthca/mthca_srq.c -Os
Removing -Os or changing it to -O or -O0 thru -O6 gets rid of the problem.
This patch to the kernel code fixes the problem:
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The following patch adds experimental support for IPoIB connected
mode, as defined by the draft from the IETF ipoib working group. The
idea is to increase performance by increasing the MTU from the maximum
of 2K (theoretically 4K) supported by IPoIB on top of UD. With this
code, I'm able to get 800MByte/sec or more with netperf without
options on a Mellanox 4x back-to-back DDR system.
Some notes on code:
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction, so
each connection is either active(TX) or passive (RX). 2 sides that
want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
this keeps the code simple without CQ resize and other tricks
6. To detect stale passive side connections (where the remote side is
down), we keep an LRU list of passive connections (updated once per
second per connection) and destroy a connection after it has been
unused for several seconds. The LRU rule makes it possible to avoid
scanning connections that have recently been active.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use ARRAY_SIZE() macro already defined in kernel.h instead of open
coding equivalent code.
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When clearing the ib_ah_attr parameter in to_ib_ah_attr(), use sizeof
*ib_ah_attr instead of sizeof *path.
Pointed out by Jack Morgenstein <jackm@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This lets the network core have the ability to handle suspend/resume
issues, if it wants to.
Thanks to Frederik Deweerdt <frederik.deweerdt@gmail.com> for the arm
driver fixes.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Remove prototypes for functions that don't exist.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch removes do_mmap() from ehca:
- Call remap_pfn_range() for hardware register block
- Use vm_insert_page() to register memory allocated for completion
queues and queue pairs
- The actual mmap() call/trigger is now controlled by user space,
ie. libehca
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The iWARP connection manager uses the ib_addr services to do route
resolution (neighbour discovery in the IP world). The ib_addr
netevent callback routine, however, currently only acts on InfiniBand
neighbour updates. It needs to act on ethernet neighbour updates as
well.
This patch just removes filtering on device type altogether and will
trigger on any neighour updates where the nud_type is valid. This
simplifies the code some.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When there is a call to send_tsk_mgmt SRP posts a send and waits for 5
seconds to get a response.
When the QP is in the error state it is obvious that there will be no
response so it is quite useless to wait. In fact, the timeout causes
SRP to wait a long time to reconnect when a QP error occurs. (Each
abort and each reset_device calls send_tsk_mgmt, which waits for the
timeout). The following patch solves this problem by identifying the
failure and returning an immediate error code.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_wc currently only includes the local QP number: this matches
the IB spec, but seems mostly useless. The following patch replaces
this with the pointer to qp itself, and updates all low level drivers
and all users.
This has the following advantages:
- Ability to get a per-qp context through wc->qp->qp_context
- Existing drivers already have the qp pointer ready in poll cq, so
this change actually saves a tiny bit (extra memory read) on data path
(for ehca it would actually be expensive to find the QP pointer when
polling a CQ, but ehca does not support SRQ so we can leave wc->qp as
NULL for ehca)
- Users that need the QP number can still get it through wc->qp->qp_num
Use case:
In IPoIB connected mode code, I have a common CQ shared by multiple
QPs. To track connection usage, I need a way to get at some per-QP
context upon the completion, and I would like to avoid allocating
context object per work request just to stick a QP pointer into it.
With this code, I can just use wc->qp->qp_context.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The lock is taken with _irqsave and hence must be released with
_irqrestore on all paths.
Signed-off-by Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Checks if the kmalloc in match_strdup() was successful, and bail out
on looking at the token if it failed.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a QP being queried is in the RESET state, don't execute the
QUERY_QP firmware command (because it will fail).
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Here is a patch for ehca to use proper flag, ie. GFP_ATOMIC
resp. GFP_KERNEL, when calling get_zeroed_page() to prevent "Bug:
scheduling while atomic...". This error does not cause a kernel panic
but makes ipoib un-usable afterwards. It is reproducible on
2.6.20-rc4 if one does ifconfig down during a flood ping test. I have
not observed this error in earlier releases incl. 2.6.20-rc1.
This error occurs when a qp event/irq is received and ehca event
handler allocates a control block/page to obtain HCA error data block.
Use of GFP_ATOMIC when in interrupt context prevents this issue.
Signed-off-by Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to the Tavor and Arbel programmer's reference manuals, the
number of bytes transferred is not provided in the byte_cnt field of
the CQ entry for atomic operation completions. For atomic operations,
the number of bytes transferred is always 8 (when the status is
"success"), and this constant value should always be used by the
driver in the ib_wc entry returned, rather than using the CQE.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There's a problem with how rdma cm events are reported to userspace
that can lead to application crashes.
When a new connection request arrives, a context for the connection is
allocated in the kernel. The connection event is then reported to
userspace. The userspace library retrieves the event and allocates
its own context for the connection. The userspace context is
associated with the kernel's context when accepting. This allows the
kernel to give userspace context with other events.
A problem occurs if a second event for the same connection occurs
before the user has had a chance to call accept. The userspace
context has not yet been set, which causes the librdmacm to crash.
(This has been seen when the app takes too long to call accept,
resulting in the remote side timing out and rejecting the connection)
Fix this by ignoring events for new connections until userspace has
set their context. This can only happen if an error occurs on a new
connection before the user accepts it. This is okay, since the accept
will just fail later.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We discard new connection requests while the listen backlog is full,
but leak a struct ucma_event in the process. Free the structure in
this case.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The iWARP CM should report timeouts as event RDMA_CM_EVENT_UNREACHABLE,
not event RDMA_CM_EVENT_REJECTED.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
iSER limits the number of outstanding PDUs to send. When this threshold
is reached, it should return an error code (-ENOBUFS) instead of setting
the suspend_tx bit (which should be used only by libiscsi).
Signed-off-by: Erez Zilber <erezz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_table_find() will return the wrong address when the table entry
being searched for is exactly at the beginning of a sglist entry
(other than the first), because it uses >= when it should use >.
Example: assume we have 2 entries in scatterlist, 4K each, offset is
4K. The current code will return first entry + 4K when we really want
the second entry.
In particular this means mapping an FMR on a memfree HCA may end up
writing the page table into the wrong place, leading to memory
corruption and also causing the HCA to use an incorrect address
translation table.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit bed8bdfd ("IB: kmemdup() cleanup") introduced one bad conversion to
kmemdup() in mthca_alloc_fmr(), where the structure allocated and the
structure copied are not the same size. Revert this back to the original
kmalloc()/memcpy() code.
Reported-by: Dotan Barak <dotanb@mellanox.co.il>.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
mthca_device_mutex() can be initialized automatically with
DEFINE_MUTEX() rather than explicitly calling mutex_init(). This
saves a bit of text and shrinks the source by a line, so we may as
well do it....
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add module parameters that enable settting some of the HCA
profile values, such as the number of QPs, CQs, etc.
Signed-off-by: Leonid Arsh <leonida@voltaire.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct srp_device.fmr_page_mask was unsigned long, which means that
the top part of addresses above 4G was being chopped off on 32-bit
architectures. Of course nothing good happens when data from SRP
targets is DMAed to the wrong place.
Fix this by changing fmr_page_mask to u64, to match the addresses
actually used by IB devices.
Thanks to Brian Cain <Brian.Cain@ge.com> and David McMillen
<davem@systemfabricworks.com> for help diagnosing the bug and testing
the fix.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the initialization of ipoib_neigh's skb_queue into
ipoib_neigh_alloc(), since commit 2745b5b7 ("IPoIB: Fix skb leak when
freeing neighbour") will make iterate over the skb_queue to free any
packets left over when freeing the ipoib_neigh structure.
This fixes a crash when freeing ipoib_neigh structures allocated in
ipoib_mcast_send(), which otherwise don't have their skb_queue
initialized.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert iSER to use the new verbs DMA mapping functions for kernel
verbs consumers.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert SRP to use the new verbs DMA mapping functions for kernel
verbs consumers.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert IPoIB to use the new DMA mapping functions
for kernel verbs consumers.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Convert code in core/ to use the new DMA mapping functions for kernel
verbs consumers.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch implements the interposing DMA mapping functions to allow
support for IOMMUs and remove the dependence on phys_to_virt() and
bus_to_virt().
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Export the rdma cm interfaces to userspace via a misc device.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Allow the use of UD QPs through the rdma_cm, in order to provide
address translation services for resolving IB addresses for datagram
messages using SIDR.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During connection establishment, the passive side of a connection can
receive messages from the active side before the connection event has
been delivered to the user. Allow the passive side to send messages
in response to received data before the event is delivered. To handle
the case where the connection messages are lost, a new rdma_notify()
function is added that users may invoke to force a connection into the
established state.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Connection information was never given to the recipient of a
connection request or reply message. Only the event was delivered.
Report the connection data with the event to allows user to
reject the connection based on the requested parameters, or adjust
their resources to match the request.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The qp_type parameter into the rdma_cm is unneeded, and can be
misleading. The QP type should be determined from the port space.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 51f65ebc ("IB/ipath - program intconfig register using new HT
irq hook"), which fixed interrupts for HyperTransport HCAs, broke PCI
Express HCAs, because for those HCAs, the driver uses the value of
pdev->irq before pci_enable_msi() and ends up getting a totally bogus
IRQ number. Fix this by using the value of pdev->irq after
pci_enable_msi().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove variables that are set but then never looked at in the iSER
initiator. These cleanups came from David Binderman's list of "set
but never used" warnings from icc.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove variables that are set but then never looked at in the ipath
driver. These cleanups came from David Binderman's list of "set but
never used" warnings from icc.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_flush_fmr_pool() stashes away the request generation number
properly, but then goes ahead and rereads it every time it tests
whether the flush generation number has caught up. This means that
there is a theoretical possibility of livelock, if the request
generation number keeps getting bumped and the flush generation number
never catches up. The fix is simple: use the request generation
number read at the beginning of the function.
Also, atomic_inc() followed by atomic_read() can be replaced with
atomic_int_return(). There's no real requirement for atomicity here
but we might as well shrink the code.
This bug was discovered using David Binderman's list of "set but never
used" warnings from icc.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This facility provides three entry points:
ilog2() Log base 2 of unsigned long
ilog2_u32() Log base 2 of u32
ilog2_u64() Log base 2 of u64
These facilities can either be used inside functions on dynamic data:
int do_something(long q)
{
...;
y = ilog2(x)
...;
}
Or can be used to statically initialise global variables with constant values:
unsigned n = ilog2(27);
When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant. They treat negative numbers as
unsigned.
When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.
[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_KERNEL is an alias of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_ATOMIC is an alias of GFP_ATOMIC
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conflicts:
drivers/ata/libata-scsi.c
include/linux/libata.h
Futher merge of Linus's head and compilation fixups.
Signed-Off-By: David Howells <dhowells@redhat.com>
Conflicts:
drivers/infiniband/core/iwcm.c
drivers/net/chelsio/cxgb2.c
drivers/net/wireless/bcm43xx/bcm43xx_main.c
drivers/net/wireless/prism54/islpci_eth.c
drivers/usb/core/hub.h
drivers/usb/input/hid-core.c
net/core/netpoll.c
Fix up merge failures with Linus's head and fix new compilation failures.
Signed-Off-By: David Howells <dhowells@redhat.com>
ib_ucm_cleanup_events() holds file_mutex while calling ib_destroy_cm_id().
This can deadlock since ib_destroy_cm_id() flushes event handlers, and
ib_ucm_event_handler() needs file_mutex, too. Therefore, drop the
file_mutex during the call to ib_destroy_cm_id().
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ib_cm_establish() function is replaced with a more generic
ib_cm_notify(). This routine is used to notify the CM that failover
has occurred, so that future CM messages (LAP, DREQ) reach the remote
CM. (Currently, we continue to use the original path) This bumps the
userspace CM ABI.
New alternate path information is captured when a LAP message is sent
or received. This allows QP attributes to be initialized for the user
when a new path is loaded after failover occurs.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_neigh_free() is sometimes called while neighbour is still alive,
so it might still have queued skbs. Fix skb leak in this case.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
SRP reallocates the IU buffers for tx_ring and rx_ring without freeing
the old buffers when it reconnects to a target. Fix this by keeping
the old IU buffers around.
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix following problems in process_req() relating to cancellation:
- Function is wrongly doing another addr_remote() when cancelled,
which is not required.
- Make failure reporting immediate by using time_after_eq().
- On cancellation, -ETIMEDOUT was returned to the callback routine
instead of the more appropriate -ECANCELLED (users getting notified
may want to print/return this status, eg ucma_event_handler).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
It is possible to swap the CQs used for send_cq and recv_cq when
creating two different QPs. If these two QPs are then destroyed at
the same time, an AB-BA deadlock can occur because the CQ locks are
taken our of order. Fix this by always taking CQ locks in a fixed
order.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When initializing an mthca SRQ, the log_srq_size field should be the
log of the number of SRQ WQEs, not the log of the number of bytes in
the SRQ.
This affects only mthca drivers for memfree HCAs which set the initial
srq wqe counter (in the SW2HW transition) to a non-zero value.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is a patch for ehca to fix a bug in prepare_sqe_to_rts(), which
used WQE address to iterate pending work requests. This might cause
an access violation since the queue pages can not be assumed to follow
each other consecutively. Thus, this patch introduces a few queue
functions to determine WQE offset based on its address and uses WQE
offset to iterate the pending work requests.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In iwcm_deref_id(), the comment says : "If the last reference is being
removed and iw_destroy_cm_id is waiting, wake up the waiting
thread". The second part of the comment, "and iw_destroy_cm_id is
waiting," is wrong, since this function either wakes the waiter
already waiting in iwcm_deref_id, or enables it (so that when
wait_for_completion() is performed later, it will immediately return).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove unnecessary cm_id_priv argument to copy_private_data(), and
change text to reflect the code. Fix couple of typos in comments.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If we get IW_CM_EVENT_CONNECT_REQUEST message and encounter an error
(not in the LISTEN state, cannot create an id, cannot alloc
work_entry, etc), then the memory allocated by cm_event_handler() in
the event->private_data gets leaked. Since cm_work_handler has already
put the event on the work_free_list, this allocated memory is
leaked. High backlog value can allow DoS attacks.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Possible memory corruption scenario: after putting the work entry back
on the work_free_list, we call process_event() which dereferences
work->event, which could have been modified to another value
meanwhile.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The amso1100 driver was missing a couple of __devinit/__devexit
annotations for init/cleanup functions that are called from
__devinit/__devexit functions.
Reported by Randy Dunlap <randy.dunlap@oracle.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit b3b30f5e ("IB/mthca: Recover from catastrophic errors")
introduced some section mismatch breakage, because the error recovery
code tears down and reinitializes the device, which calls into lots of
code originally marked __devinit and __devexit from regular .text.
Fix this by getting rid of these now-incorrect section markers.
Reported by Randy Dunlap <randy.dunlap@oracle.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set the Scsi_Host's max_cmd_len from 12 (default) to 16 for
SRP. Otherwise scsi_dispatch_cmd() won't pass down certain commands
such as READ CAPACITY 16, required for supporting disks > 2TB.
Signed-off-by: Arne Redlich <arne.redlich@xiranet.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The qp_access_flags are for remote access permissions only, so
IB_ACCESS_LOCAL_WRITE is an invalid value. Remove it from the values
set by cm_init_qp_init_attr() and cma_init_ib_qp().
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace open coded kmemdup() to save some screen space, and allow
inlining/not inlining to be triggered by gcc.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rewrite cma_req_handler error handling case to encapsulate
common code.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In queue_req(), use time_after_eq() instead of time_after()
for following reasons :
- Improves insert time if multiple entries with same time are
present.
- set_timeout need not be called if entry with same time
is added to the list (and that happens to be the entry
with the smallest time), saving atomic/locking operations.
- Earlier entries with same time are deleted first (fifo).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>