Commit Graph

47 Commits

Author SHA1 Message Date
Linus Torvalds fedfb947b2 Merge branch 'for-2.6.34' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux:
  svcrdma: RDMA support not yet compatible with RPC6
2010-04-12 18:34:56 -07:00
Tom Tucker bade732a28 svcrdma: RDMA support not yet compatible with RPC6
RPC6 requires that it be possible to create endpoints that listen
exclusively for IPv4 or IPv6 connection requests. This is not currently
supported by the RDMA API.

This fixes a server RDMA regression introduced by 37498292a "NFSD:
Create PF_INET6 listener in write_ports".

Signed-off-by: Tom Tucker<tom@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-04-05 12:10:22 -04:00
Tejun Heo 5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Alexey Dobriyan d43c36dc6b headers: remove sched.h from interrupt.h
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
2009-10-11 11:20:58 -07:00
Wei Yongjun 846d8e7cc8 svcrdma: fix error handling of rdma_alloc_frmr()
ib_alloc_fast_reg_mr() and ib_alloc_fast_reg_page_list() returns
ERR_PTR() and not NULL. Compile tested only.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-03 10:14:59 -04:00
Steve Wise 98779be861 svcrdma: dma unmap the correct length for the RPCRDMA header page.
The svcrdma module was incorrectly unmapping the RPCRDMA header page.
On IBM pserver systems this causes a resource leak that results in
running out of bus address space (10 cthon iterations will reproduce it).
The code was mapping the full page but only unmapping the actual header
length.  The fix is to only map the header length.

I also cleaned up the use of ib_dma_map_page() calls since the unmap
logic always uses ib_dma_unmap_single().  I made these symmetrical.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 18:57:24 -04:00
Steve Wise 21515e46bc svcrdma: clean up error paths.
These fixes resolved crashes due to resource leak BUG_ON checks. The
resource leaks were detected by introducing asynchronous transport errors.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-03 14:19:10 -04:00
Roel Kluin 5eaa65b240 net: Make static
Sparse asked whether these could be static.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-10 15:18:31 -08:00
Ingo Molnar ed72b9c6e0 sunrpc: fix warning in net/sunrpc/xprtrdma/svc_rdma_transport.c
this warning:

  net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’:
  net/sunrpc/xprtrdma/svc_rdma_transport.c:830: warning: ‘dma_mr_acc’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) flow connection
between need_dma_mr and dma_mr_acc.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:49:37 -08:00
Harvey Harrison 21454aaad3 net: replace NIPQUAD() in net/*/
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31 00:54:56 -07:00
Tom Tucker 67080c8236 svcrdma: Fix IRD/ORD polarity
The inititator/responder resources in the event have been swapped. They
no represent what the local peer would set their values to in order to
match the peer. Note that iWARP does not exchange these on the wire and
the provider is simply putting in the local device max.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:46:13 -05:00
Tom Tucker 04911b539c svcrdma: Update svc_rdma_send_error to use DMA LKEY
Update the svc_rdma_send_error code to use the DMA LKEY which is valid
regardless of the memory registration strategy in use.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:46:08 -05:00
Tom Tucker afd566ea08 svcrdma: Modify the RPC reply path to use FRMR when available
Use FRMR to map local RPC reply data. This allows RDMA_WRITE to send reply
data using a single WR. The FRMR is invalidated by linking the LOCAL_INV WR
to the RDMA_SEND message used to complete the reply.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:46:05 -05:00
Tom Tucker 146b6df6a5 svcrdma: Modify the RPC recv path to use FRMR when available
RPCRDMA requests that specify a read-list are fetched with RDMA_READ. Using
an FRMR to map the data sink improves NFSRDMA security on transports that
place the RDMA_READ data sink LKEY on the wire because the valid lifetime
of the MR is only the duration of the RDMA_READ. The LKEY is invalidated
when the last RDMA_READ WR completes.

Mapping the data sink also allows for very large amounts to data to be
fetched with a single WR, so if the client is also using FRMR, the entire
RPC read-list can be fetched with a single WR.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:46:01 -05:00
Tom Tucker 5b180a9a64 svcrdma: Add support to svc_rdma_send to handle chained WR
WR can be submitted as linked lists of WR. Update the svc_rdma_send
routine to handle WR chains. This will be used to submit a WR that
uses an FRMR with another WR that invalidates the FRMR.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:45:56 -05:00
Tom Tucker a5abf4e815 svcrdma: Modify post recv path to use local dma key
Update the svc_rdma_post_recv routine to use the adapter's global LKEY
instead of sc_phys_mr which is only valid when using a DMA MR.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:45:52 -05:00
Tom Tucker e118321062 svcrdma: Add a service to register a Fast Reg MR with the device
Fast Reg MR introduces a new WR type. Add a service to register the
region with the adapter and update the completion handling to support
completions with a NULL WR context.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:45:49 -05:00
Tom Tucker 3a5c63803d svcrdma: Query device for Fast Reg support during connection setup
Query the device capabilities in the svc_rdma_accept function to determine
what advanced memory management capabilities are supported by the device.
Based on the query, select the most secure model available given the
requirements of the transport and capabilities of the adapter.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:45:45 -05:00
Tom Tucker 64be8608c1 svcrdma: Add FRMR get/put services
Add services for the allocating, freeing, and unmapping Fast Reg MR. These
services will be used by the transport connection setup, send and receive
routines.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-10-06 14:45:18 -05:00
Tom Tucker 24b8b44780 svcrdma: Fix race between svc_rdma_recvfrom thread and the dto_tasklet
RDMA_READ completions are kept on a separate queue from the general
I/O request queue. Since a separate lock is used to protect the RDMA_READ
completion queue, a race exists between the dto_tasklet and the
svc_rdma_recvfrom thread where the dto_tasklet sets the XPT_DATA
bit and adds I/O to the read-completion queue. Concurrently, the
recvfrom thread checks the generic queue, finds it empty and resets
the XPT_DATA bit. A subsequent svc_xprt_enqueue will fail to enqueue
the transport for I/O and cause the transport to "stall".

The fix is to protect both lists with the same lock and set the XPT_DATA
bit with this lock held.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-08-13 16:57:31 -04:00
Tom Tucker 8948896c9e svcrdma: Change WR context get/put to use the kmem cache
Change the WR context pool to be shared across mount points. This
reduces the RDMA transport memory footprint significantly since
idle mounts don't consume WR context memory.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:02:02 -05:00
Tom Tucker 36ef25e464 svcrdma: Limit ORD based on client's advertised IRD
When adapters have differing IRD limits, the RDMA transport will fail to
connect properly. The RDMA transport should use the client's advertised
inbound read limit when computing its outbound read limit. For iWARP
transports, there is currently no standard for exchanging IRD/ORD
during connection establishment so the 'responder_resources' field in the
connect event is the local device's limit. The RDMA transport can be
configured to use a smaller ORD by writing the desired number to the
/proc/sys/sunrpc/svc_rdma/max_outbound_read_requests file.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:59 -05:00
Tom Tucker 94dba4918d svcrdma: Remove unneeded spin locks from __svc_rdma_free
At the time __svc_rdma_free is called, we are guaranteed that all references
to this transport are gone. There is, therefore, no need to protect the
resource lists with a spin lock.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:57 -05:00
Tom Tucker 87295b6c5c svcrdma: Add dma map count and WARN_ON
Add a dma map count in order to verify that all DMA mapping resources
have been freed when the transport is closed.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:56 -05:00
Tom Tucker e6ab914371 svcrdma: Move the DMA unmap logic to the CQ handler
Separate DMA unmap from context destruction and perform DMA unmapping
in the SQ/RQ CQ reap functions. This is necessary to support software
based RDMA implementations that actually copy the data in their
ib_dma_unmap callback functions and architectures that don't have
cache coherent I/O busses.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:55 -05:00
Tom Tucker 34d16e42a6 svcrdma: Use RPC reply map for RDMA_WRITE processing
Use the new svc_rdma_req_map data type for mapping the client side memory
to the server side memory. Move the DMA mapping to the context pointed to
by each WR individually so that it is unmapped after the WR completes.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:54 -05:00
Tom Tucker ab96dddbed svcrdma: Add a type for keeping NFS RPC mapping
Create a new data structure to hold the remote client address space
to local server address space mapping.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-07-02 15:01:53 -05:00
Tom Tucker 008fdbc571 svcrdma: Change svc_rdma_send_error return type to void
The svc_rdma_send_error function is called when an RPCRDMA protocol
error is detected. This function attempts to post an error reply message.
Since an error posting to a transport in error is ignored, change
the return type to void.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:34:01 -05:00
Tom Tucker af261af4db svcrdma: Copy transport address and arm CQ before calling rdma_accept
This race was found by inspection. Messages can be received from the peer
immediately following the rdma_accept call, however, the CQ have not yet
been armed and the transport address has not yet been set.

Set the transport address in the connect request handler and arm the CQ
prior to calling rdma_accept.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:34:00 -05:00
Tom Tucker 97a3df382e svcrdma: Use ib verbs version of dma_unmap
Use the ib_verbs version of the dma_unmap service in the
svc_rdma_put_context function. This should support providers
using software rdma.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:58 -05:00
Tom Tucker 356d0a1519 svcrdma: Cleanup queued, but unprocessed I/O in svc_rdma_free
When the transport is closing, the DTO tasklet may queue data
that never gets processed. Clean up resources associated with
this I/O.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:57 -05:00
Tom Tucker 1711386c62 svcrdma: Move the QP and cm_id destruction to svc_rdma_free
Move the destruction of the QP and CM_ID to the free path so that the
QP cleanup code doesn't race with the dto_tasklet handling flushed WR.
The QP reference is not needed because we now have a reference for
every WR.

Also add a guard in the SQ and RQ completion handlers to ignore
calls generated by some providers when the QP is destroyed.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:56 -05:00
Tom Tucker 0905c0f0a2 svcrdma: Add reference for each SQ/RQ WR
Add a reference on the transport for every outstanding WR.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:55 -05:00
Tom Tucker 8da91ea8de svcrdma: Move destroy to kernel thread
Some providers may wait while destroying adapter resources.
Since it is possible that the last reference is put on the
dto_tasklet, the actual destroy must be scheduled as a work item.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:54 -05:00
Tom Tucker 47698e083e svcrdma: Shrink scope of spinlock on RQ CQ
The rq_cq_reap function is only called from the dto_tasklet. The
only resource shared with other threads is the sc_rq_dto_q. Move the
spin lock to protect only this list.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:53 -05:00
Tom Tucker 8740767376 svcrdma: Use standard Linux lists for context cache
Replace the one-off linked list implementation used to implement the
context cache with the standard Linux list_head lists. Add a context
counter to catch resource leaks. A WARN_ON will be added later to
ensure that we've freed all contexts.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:52 -05:00
Tom Tucker 02e7452de7 svcrdma: Simplify RDMA_READ deferral buffer management
An NFS_WRITE requires a set of RDMA_READ requests to fetch the write
data from the client. There are two principal pieces of data that
need to be tracked: the list of pages that comprise the completed RPC
and the SGE of dma mapped pages to refer to this list of pages. Previously
this whole bit was managed as a linked list of contexts with the
context containing the page list buried in this list. This patch
simplifies this processing by not keeping a linked list, but rather only
a pionter from the last submitted RDMA_READ's context to the context
that maps the set of pages that describe the RPC.  This significantly
simplifies this code path. SGE contexts are cleaned up inline in the DTO
path instead of at read completion time.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:51 -05:00
Tom Tucker 10a38c33f4 svcrdma: Remove unused READ_DONE context flags bit
The RDMACTXT_F_READ_DONE bit is not longer used. Remove it.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:50 -05:00
Tom Tucker 58e8f62137 svcrdma: Fix error handling during listening endpoint creation
A listening endpoint isn't known to the generic transport switch until
the svc_create_xprt function returns without error. Calling
svc_xprt_put within the xpo_create function causes the module reference
count to be erroneously decremented.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:48 -05:00
Tom Tucker 05a0826a6e svcrdma: Free context on ib_post_recv error
If there is an error posting the recv WR to the RQ, free the
context associated with the WR. This would leak a context when
asynchronous errors occurred on the transport while conccurent threads
were processing their RPC.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:47 -05:00
Tom Tucker 120693d12c svcrdma: Add put of connection ESTABLISHED reference in rdma_cma_handler
The svcrdma transport takes a reference when it gets the ESTABLISHED
event from the provider. This reference is supposed to be removed when
the DISCONNECT event is received, however, the call to svc_xprt_put
was missing in the switch statement. This results in the memory
associated with the transport never being freed.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:46 -05:00
Tom Tucker 9d6347acd2 svcrdma: Fix return value in svc_rdma_send
Fix the return value on close to -ENOTCONN so caller knows to free context.
Also if a thread is waiting for free SQ space, check for close when waking
to avoid posting WR to a closing transport.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:45 -05:00
Tom Tucker dbcd00eba9 svcrdma: Fix race with dto_tasklet in svc_rdma_send
The svc_rdma_send function will attempt to reap SQ WR to make room for
a new request if it finds the SQ full. This function races with the
dto_tasklet that also reaps SQ WR. To avoid polling and arming the CQ
unnecessarily move the test_and_clear_bit of the RDMAXPRT_SQ_PENDING
flag and arming of the CQ to the sq_cq_reap function.

Refactor the rq_cq_reap function to match sq_cq_reap so that the
code is easier to follow.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:44 -05:00
Tom Tucker 0e7f011a19 svcrdma: Simplify receive buffer posting
The svcrdma transport provider currently allocates receive buffers
to the RQ through the xpo_release_rqst method. This approach is overly
complicated since it means that the rqstp rq_xprt_ctxt has to be
selectively set based on whether the RPC is going to be processed
immediately or deferred. Instead, just post the receive buffer when
we are certain that we are replying in the send_reply function.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
2008-05-19 07:33:43 -05:00
Tom Tucker 830bb59b6e SVCRDMA: Add check for XPT_CLOSE in svc_rdma_send
SVCRDMA: Add check for XPT_CLOSE in svc_rdma_send

The svcrdma transport can crash if a send is waiting for an
empty SQ slot and the connection is closed due to an asynchronous error.
The crash is caused when svc_rdma_send attempts to send on a deleted
QP.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-04-23 16:13:40 -04:00
Tom Tucker c48cbb405c SVCRDMA: Add xprt refs to fix close/unmount crash
RDMA connection shutdown on an SMP machine can cause a kernel crash due
to the transport close path racing with the I/O tasklet.

Additional transport references were added as follows:
- A reference when on the DTO Q to avoid having the transport
  deleted while queued for I/O.
- A reference while there is a QP able to generate events.
- A reference until the DISCONNECTED event is received on the CM ID

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-12 12:37:34 -07:00
Tom Tucker 377f9b2f45 rdma: SVCRDMA Core Transport Services
This file implements the core transport data management and I/O
path. The I/O path for RDMA involves receiving callbacks on interrupt
context. Since all the svc transport locks are _bh locks we enqueue the
transport on a list, schedule a tasklet to dequeue data indications from
the RDMA completion queue. The tasklet in turn takes _bh locks to
enqueue receive data indications on a list for the transport. The
svc_rdma_recvfrom transport function dequeues data from this list in an
NFSD thread context.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2008-02-01 16:42:14 -05:00