linux-sg2042

Commit Graph

Author	SHA1	Message	Date
Jeff Layton	83a712e0af	sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt These were useful when I was tracking down a race condition between svc_xprt_do_enqueue and svc_get_next_xprt. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:29:14 -05:00
Jeff Layton	b1691bc03d	sunrpc: convert to lockless lookup of queued server threads Testing has shown that the pool->sp_lock can be a bottleneck on a busy server. Every time data is received on a socket, the server must take that lock in order to dequeue a thread from the sp_threads list. Address this problem by eliminating the sp_threads list (which contains threads that are currently idle) and replacing it with a RQ_BUSY flag in svc_rqst. This allows us to walk the sp_all_threads list under the rcu_read_lock and find a suitable thread for the xprt by doing a test_and_set_bit. Note that we do still have a potential atomicity problem however with this approach. We don't want svc_xprt_do_enqueue to set the rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned zero (which indicates that the thread was idle). But, by the time we check that, the bit could be flipped by a waking thread. To address this, we acquire a new per-rqst spinlock (rq_lock) and take that before doing the test_and_set_bit. If that returns false, then we can set rq_xprt and drop the spinlock. Then, when the thread wakes up, it must set the bit under the same spinlock and can trust that if it was already set then the rq_xprt is also properly set. With this scheme, the case where we have an idle thread no longer needs to take the highly contended pool->sp_lock at all, and that removes the bottleneck. That still leaves one issue: What of the case where we walk the whole sp_all_threads list and don't find an idle thread? Because the search is lockess, it's possible for the queueing to race with a thread that is going to sleep. To address that, we queue the xprt and then search again. If we find an idle thread at that point, we can't attach the xprt to it directly since that might race with a different thread waking up and finding it. All we can do is wake the idle thread back up and let it attempt to find the now-queued xprt. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Tested-by: Chris Worley <chris.worley@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:22 -05:00
Jeff Layton	403c7b4444	sunrpc: fix potential races in pool_stats collection In a later patch, we'll be removing some spinlocking around the socket and thread queueing code in order to fix some contention problems. At that point, the stats counters will no longer be protected by the sp_lock. Change the counters to atomic_long_t fields, except for the "sockets_queued" counter which will still be manipulated under a spinlock. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Tested-by: Chris Worley <chris.worley@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:22 -05:00
Jeff Layton	812443865c	sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it ...also make the manipulation of sp_all_threads list use RCU-friendly functions. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Tested-by: Chris Worley <chris.worley@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:22 -05:00
Jeff Layton	0b5707e452	sunrpc: require svc_create callers to pass in meaningful shutdown routine Currently all svc_create callers pass in NULL for the shutdown parm, which then gets fixed up to be svc_rpcb_cleanup if the service uses rpcbind. Simplify this by instead having the the only caller that requires it (lockd) pass in svc_rpcb_cleanup and get rid of the special casing. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:21 -05:00
Jeff Layton	ceff739c53	sunrpc: have svc_wake_up only deal with pool 0 The way that svc_wake_up works is a bit inefficient. It walks all of the available pools for a service and either wakes up a task in each one or sets the SP_TASK_PENDING flag in each one. When svc_wake_up is called, there is no need to wake up more than one thread to do this work. In practice, only lockd currently uses this function and it's single threaded anyway. Thus, this just boils down to doing a wake up of a thread in pool 0 or setting a single flag. Eliminate the for loop in this function and change it to just operate on pool 0. Also update the comments that sit above it and get rid of some code that has been commented out for years now. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:21 -05:00
Jeff Layton	4d5db3f536	sunrpc: convert sp_task_pending flag to use atomic bitops In a later patch, we'll want to be able to handle this flag without holding the sp_lock. Change this field to an unsigned long flags field, and declare a new flag in it that can be managed with atomic bitops. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:21 -05:00
Jeff Layton	779fb0f3af	sunrpc: move rq_splice_ok flag into rq_flags Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:21 -05:00
Jeff Layton	78b65eb3fd	sunrpc: move rq_dropme flag into rq_flags Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:20 -05:00
Jeff Layton	30660e04b0	sunrpc: move rq_usedeferral flag to rq_flags Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:22:20 -05:00
Jeff Layton	7501cc2bcf	sunrpc: move rq_local field to rq_flags Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:21:21 -05:00
Jeff Layton	4d152e2c9a	sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it In a later patch, we're going to need some atomic bit flags. Since that field will need to be an unsigned long, we mitigate that space consumption by migrating some other bitflags to the new field. Start with the rq_secure flag. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-09 11:21:20 -05:00
J. Bruce Fields	2941b0e91b	NFS client updates for Linux 3.19 Highlights include: Features: - NFSv4.2 client support for hole punching and preallocation. - Further RPC/RDMA client improvements. - Add more RPC transport debugging tracepoints. - Add RPC debugging tools in debugfs. Bugfixes: - Stable fix for layoutget error handling - Fix a change in COMMIT behaviour resulting from the recent io code updates -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJUhRVTAAoJEGcL54qWCgDyfeUP/RoFo3ImTMbGxfcPJqoELjcO lZbQ+27pOE/whFDkWgiOVTwlgGct5a0WRo7GCZmpYJA4q1kmSv4ngTb3nMTCUztt xMJ0mBr0BqttVs+ouKiVPm3cejQXedEhttwWcloIXS8lNenlpL29Zlrx2NHdU8UU 13+souocj0dwIyTYYS/4Lm9KpuCYnpDBpP5ShvQjVaMe/GxJo6GyZu70c7FgwGNz Nh9onzZV3mz1elhfizlV38aVA7KWVXtLWIqOFIKlT2fa4nWB8Hc07miR5UeOK0/h r+icnF2qCQe83MbjOxYNxIKB6uiA/4xwVc90X4AQ7F0RX8XPWHIQWG5tlkC9jrCQ 3RGzYshWDc9Ud2mXtLMyVQxHVVYlFAe1WtdP8ZWb1oxDInmhrarnWeNyECz9xGKu VzIDZzeq9G8slJXATWGRfPsYr+Ihpzcen4QQw58cakUBcqEJrYEhlEOfLovM71k3 /S/jSHBAbQqiw4LPMw87bA5A6+ZKcVSsNE0XCtNnhmqFpLc1kKRrl5vaN+QMk5tJ v4/zR0fPqH7SGAJWYs4brdfahyejEo0TwgpDs7KHmu1W9zQ0LCVTaYnQuUmQjta6 WyYwIy3TTibdfR191O0E3NOW82Q/k/NBD6ySvabN9HqQ9eSk6+rzrWAslXCbYohb BJfzcQfDdx+lsyhjeTx9 =wOP3 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.19-1' into nfsd for-3.19 branch Mainly what I need is `860a0d9e51` "sunrpc: add some tracepoints in svc_rqst handling functions", which subsequent server rpc patches from jlayton depend on. I'm merging this later tag on the assumption that's more likely to be a tested and stable point.	2014-12-09 11:12:26 -05:00
Jeff Layton	067f96ef17	sunrpc: release svc_pool_map reference when serv allocation fails Currently, it leaks when the allocation fails. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-01 12:45:27 -07:00
Jeff Layton	8d65ef760d	sunrpc: eliminate the XPT_DETACHED flag All it does is indicate whether a xprt has already been deleted from a list or not, which is unnecessary since we use list_del_init and it's always set and checked under the sv_lock anyway. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-12-01 12:45:26 -07:00
Jeff Layton	388f0c7767	sunrpc: add a debugfs rpc_xprt directory with an info file in it Add a new directory heirarchy under the debugfs sunrpc/ directory: sunrpc/ rpc_xprt/ <xprt id>/ Within that directory, we can put files that give info about the xprts. We do have the (minor) problem that there is no succinct, unique identifier for rpc_xprts. So we generate them synthetically with a static atomic_t counter. For now, this directory just holds an "info" file, but we may add other files to it in the future. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-27 13:14:52 -05:00
Jeff Layton	b4b9d2ccf0	sunrpc: add debugfs file for displaying client rpc_task queue It's possible to get a dump of the RPC task queue by writing a value to /proc/sys/sunrpc/rpc_debug. If you write any value to that file, you get a dump of the RPC client task list into the log buffer. This is a rather inconvenient interface however, and makes it hard to get immediate info about the task queue. Add a new directory hierarchy under debugfs: sunrpc/ rpc_clnt/ <clientid>/ Within each clientid directory we create a new "tasks" file that will dump info similar to what shows up in the log buffer, but with a few small differences -- we avoid printing raw kernel addresses in favor of symbolic names and the XID is also displayed. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-27 13:14:51 -05:00
Trond Myklebust	ea5264138d	NFS: Client side changes for RDMA These patches various bugfixes and cleanups for using NFS over RDMA, including better error handling and performance improvements by using pad optimization. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJUdPv2AAoJENfLVL+wpUDrJkMQAKjtPZHLMcj+eHm4f1ZKLJxy GSrUZV21TU9tL0NVE/5An8US6hoLwHpNXsW8o+gHTAeGRyiCmIaNXGd1Ql/4PYRH zfzdNXoaJAh1N5iXX11fF3gOWqx/SolqzO2xLDVETK/3lAvq0VwMYoMElBQB6qQW 8sN3z8yVuz/9Ia9oGIFhqu1B6dcKPHkQDMtmsElGxeEX/+9yEg4HUKx+kZDtV0Uj 8/JM8Jh1FKRCQT/P6INkRItdY5KaSJGFc43BkC/8lbugfxa5XCyu/m/qMr9FJsDV nM6rwaiVcmR/mvD3fL82+Jg/M+P9VUHQ1/Az0sV9G+fEoHH/1Mey3LfMzNpUmf9v bykrPRuzXkPPQgN1VnjSaF2RF+CWwV9Nme1VVXM/zj8gHX1mcmQF/wPRxDuLjCrt EObAFsvHOwDTZZmYp9bG5kc6IvwvT8aeeVQMJ4q4PSGD3w8AtoIyJDn+Ee0LFD1K Zw0oZpTJpI4t7DVxGBSdo2wZWuMU/UKqGqGtGJ+ljXfTRuuq968Q5j5ujaA9vf0v C9igYTU8hq4teMzhZrfR1jtTWoSS+5zamb1KtvAZy8gsht2PQVgE9xka2k/AV8uE ul/w5HU4OV+QIrHNbiu7BE8B2Ags6smpdHMqn9fqLBwvG+JEwbWqk1zeTsajxzq+ hkvKkkMq6JjDbsDf96Yk =YIru -----END PGP SIGNATURE----- Merge tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next Pull NFS client RDMA changes for 3.19 from Anna Schumaker: "NFS: Client side changes for RDMA These patches various bugfixes and cleanups for using NFS over RDMA, including better error handling and performance improvements by using pad optimization. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>" * tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Display async errors xprtrdma: Enable pad optimization xprtrdma: Re-write rpcrdma_flush_cqs() xprtrdma: Refactor tasklet scheduling xprtrdma: unmap all FMRs during transport disconnect xprtrdma: Cap req_cqinit xprtrdma: Return an errno from rpcrdma_register_external()	2014-11-26 17:37:13 -05:00
Trond Myklebust	1702562db4	NFS: Generic client side changes from Chuck These patches fixes for iostats and SETCLIENTID in addition to cleaning up the nfs4_init_callback() function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJUdfhQAAoJENfLVL+wpUDr9x4QANWmG6xjEU7RuIBLalOoxit6 eXnEDNZlwYp6NCkYktHVWaTqXdwq7fdGX+3p4eiwNg3C3SrJHkpWvjeFd4KT/SyP B/w3vYG3o5H01i3Mb5kCD2uW0gbS9soZXpIg+uHHOl43yZzneC0vPTLhQ/h+9zPc B32XcgQhvLIHR3LWrC/+uolsa31lcyya0W45PX+iCpzHF7i9qRBrNLODkTlw1hNQ eEnYIvVy5oW00zlHJUYiTHP3e+0EJn5PAngdYbqiboJ9mK7DbB0QDwqyvJbIT7ql WAip6cNcJnSv1eiVYqDwlR1ok8drK5X7yQCT3lcLzAMDznLsSAL1Itu0h2Ay3z61 f8XCyTwI0izq0DbdrMcPoqPSitqyM8nkPElnOuitXwzEroPaG40OF67yss3+ixbl JeQZ+u35pnpCkKUaZdCK3Pn83StxmUaBcFx8eg30NBc0SN13Eiz6aZcGEperClrR RwMLDUhUtAMcMRunRRxiN9lHafPqqeDeJre7uky0p0sU9CsH+1n5qIKLmk+Ber2d ZS29TobdR7ktfjQ52XazMAIFzI7r1v4Zn7ziH/WRbvKoKw9ICcyoUGNy9CS5LyWu BxWCZAJmby5H1i7V7ituJK8TImb81L4aPW06hHX9k+0SoTrgFjas//4jZYfysJ9N PvR0MRNfMFhuBlsTj7Gy =PMdS -----END PGP SIGNATURE----- Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next Pull pull additional NFS client changes for 3.19 from Anna Schumaker: "NFS: Generic client side changes from Chuck These patches fixes for iostats and SETCLIENTID in addition to cleaning up the nfs4_init_callback() function. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>" * tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma: NFS: Clean up nfs4_init_callback() NFS: SETCLIENTID XDR buffer sizes are incorrect SUNRPC: serialize iostats updates	2014-11-26 17:34:14 -05:00
Chuck Lever	edef1297f3	SUNRPC: serialize iostats updates Occasionally mountstats reports a negative retransmission rate. Ensure that two RPCs completing concurrently don't confuse the sums in the transport's op_metrics array. Since pNFS filelayout can invoke rpc_count_iostats() on another transport from xprt_release(), we can't rely on simply holding the transport_lock in xprt_release(). There's nothing for it but hard serialization. One spin lock per RPC operation should make this as painless as it can be. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 16:22:15 -05:00
Chuck Lever	7ff11de1ba	xprtrdma: Display async errors An async error upcall is a hard error, and should be reported in the system log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	d5440e27d3	xprtrdma: Enable pad optimization The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when pad optimization was enabled. That bug was fixed by commit `e560e3b510` ("svcrdma: Add zero padding if the client doesn't send it"). We can now enable pad optimization on the client, which helps performance and is supported now by both Linux and Solaris servers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	5c166bef4f	xprtrdma: Re-write rpcrdma_flush_cqs() Currently rpcrdma_flush_cqs() attempts to avoid code duplication, and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall. 1. rpcrdma_flush_cqs() can run concurrently with provider upcalls. Both flush_cqs() and the upcalls were invoking ib_poll_cq() in different threads using the same wc buffers (ep->rep_recv_wcs and ep->rep_send_wcs), added by commit `1c00dd0776` ("xprtrmda: Reduce calls to ib_poll_cq() in completion handlers"). During transport disconnect processing, this sometimes resulted in the same reply getting added to the rpcrdma_tasklets_g list more than once, which corrupted the list. 2. The upcall functions drain only a limited number of CQEs, thanks to the poll budget added by commit `8301a2c047` ("xprtrdma: Limit work done by completion handler"). Fixes: `a7bc211ac9` ("xprtrdma: On disconnect, don't ignore ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	f1a03b76fe	xprtrdma: Refactor tasklet scheduling Restore the separate function that schedules the reply handling tasklet. I need to call it from two different paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	467c9674bc	xprtrdma: unmap all FMRs during transport disconnect When using RPCRDMA_MTHCAFMR memory registration, after a few transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to return EINVAL because the provider has exhausted its map pool. Make sure that all FMRs are unmapped during transport disconnect, and that ->send_request remarshals them during an RPC retransmit. This resets the transport's MRs to ensure that none are leaked during a disconnect. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	e7104a2a96	xprtrdma: Cap req_cqinit Recent work made FRMR registration and invalidation completions unsignaled. This greatly reduces the adapter interrupt rate. Every so often, however, a posted send Work Request is allowed to signal. Otherwise, the provider's Work Queue will wrap and the workload will hang. The number of Work Requests that are allowed to remain unsignaled is determined by the value of req_cqinit. Currently, this is set to the size of the send Work Queue divided by two, minus 1. For FRMR, the send Work Queue is the maximum number of concurrent RPCs (currently 32) times the maximum number of Work Requests an RPC might use (currently 7, though some adapters may need more). For mlx4, this is 224 entries. This leaves completion signaling disabled for 111 send Work Requests. Some providers hold back dispatching Work Requests until a CQE is generated. If completions are disabled, then no CQEs are generated for quite some time, and that can stall the Work Queue. I've seen this occur running xfstests generic/113 over NFSv4, where eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM because the Work Queue has overflowed. The connection is dropped and re-established. Cap the rep_cqinit setting so completions are not left turned off for too long. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Chuck Lever	92b98361f1	xprtrdma: Return an errno from rpcrdma_register_external() The RPC/RDMA send_request method and the chunk registration code expects an errno from the registration function. This allows the upper layers to distinguish between a recoverable failure (for example, temporary memory exhaustion) and a hard failure (for example, a bug in the registration logic). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-11-25 13:39:20 -05:00
Jeff Layton	1306729b0d	sunrpc: eliminate RPC_TRACEPOINTS It's always set to the same value as CONFIG_TRACEPOINTS, so we can just use that instead. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:33:12 -05:00
Jeff Layton	f895b252d4	sunrpc: eliminate RPC_DEBUG It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 17:31:46 -05:00
Jeff Layton	1a867a0898	sunrpc: add tracepoints in xs_tcp_data_recv Add tracepoints inside the main loop on xs_tcp_data_recv that allow us to keep an eye on what's happening during each phase of it. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 12:53:35 -05:00
Jeff Layton	3705ad64f1	sunrpc: add new tracepoints in xprt handling code ...so we can keep track of when calls are sent and replies received. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 12:53:35 -05:00
Jeff Layton	860a0d9e51	sunrpc: add some tracepoints in svc_rqst handling functions ...just around svc_send, svc_recv and svc_process for now. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-24 12:53:34 -05:00
J. Bruce Fields	56429e9b3b	merge nfs bugfixes into nfsd for-3.19 branch In addition to nfsd bugfixes, there are some fixes in -rc5 for client bugs that can interfere with my testing.	2014-11-19 12:06:30 -05:00
Trond Myklebust	093a1468b6	SUNRPC: Fix locking around callback channel reply receive Both xprt_lookup_rqst() and xprt_complete_rqst() require that you take the transport lock in order to avoid races with xprt_transmit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-11-19 12:03:20 -05:00
Jeff Layton	b3ecba0967	sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor Bruce reported that he was seeing the following BUG pop: BUG: sleeping function called from invalid context at mm/slab.c:2846 in_atomic(): 0, irqs_disabled(): 0, pid: 4539, name: mount.nfs 2 locks held by mount.nfs/4539: #0: (nfs_clid_init_mutex){+.+.+.}, at: [<ffffffffa01c0a9a>] nfs4_discover_server_trunking+0x4a/0x2f0 [nfsv4] #1: (rcu_read_lock){......}, at: [<ffffffffa00e3185>] gss_stringify_acceptor+0x5/0xb0 [auth_rpcgss] Preemption disabled at:[<ffffffff81a4f082>] printk+0x4d/0x4f CPU: 3 PID: 4539 Comm: mount.nfs Not tainted 3.18.0-rc1-00013-g5b095e9 #3393 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ffff880021499390 ffff8800381476a8 ffffffff81a534cf 0000000000000001 0000000000000000 ffff8800381476c8 ffffffff81097854 00000000000000d0 0000000000000018 ffff880038147718 ffffffff8118e4f3 0000000020479f00 Call Trace: [<ffffffff81a534cf>] dump_stack+0x4f/0x7c [<ffffffff81097854>] __might_sleep+0x114/0x180 [<ffffffff8118e4f3>] __kmalloc+0x1a3/0x280 [<ffffffffa00e31d8>] gss_stringify_acceptor+0x58/0xb0 [auth_rpcgss] [<ffffffffa00e3185>] ? gss_stringify_acceptor+0x5/0xb0 [auth_rpcgss] [<ffffffffa006b438>] rpcauth_stringify_acceptor+0x18/0x30 [sunrpc] [<ffffffffa01b0469>] nfs4_proc_setclientid+0x199/0x380 [nfsv4] [<ffffffffa01b04d0>] ? nfs4_proc_setclientid+0x200/0x380 [nfsv4] [<ffffffffa01bdf1a>] nfs40_discover_server_trunking+0xda/0x150 [nfsv4] [<ffffffffa01bde45>] ? nfs40_discover_server_trunking+0x5/0x150 [nfsv4] [<ffffffffa01c0acf>] nfs4_discover_server_trunking+0x7f/0x2f0 [nfsv4] [<ffffffffa01c8e24>] nfs4_init_client+0x104/0x2f0 [nfsv4] [<ffffffffa01539b4>] nfs_get_client+0x314/0x3f0 [nfs] [<ffffffffa0153780>] ? nfs_get_client+0xe0/0x3f0 [nfs] [<ffffffffa01c83aa>] nfs4_set_client+0x8a/0x110 [nfsv4] [<ffffffffa0069708>] ? __rpc_init_priority_wait_queue+0xa8/0xf0 [sunrpc] [<ffffffffa01c9b2f>] nfs4_create_server+0x12f/0x390 [nfsv4] [<ffffffffa01c1472>] nfs4_remote_mount+0x32/0x60 [nfsv4] [<ffffffff81196489>] mount_fs+0x39/0x1b0 [<ffffffff81166145>] ? __alloc_percpu+0x15/0x20 [<ffffffff811b276b>] vfs_kern_mount+0x6b/0x150 [<ffffffffa01c1396>] nfs_do_root_mount+0x86/0xc0 [nfsv4] [<ffffffffa01c1784>] nfs4_try_mount+0x44/0xc0 [nfsv4] [<ffffffffa01549b7>] ? get_nfs_version+0x27/0x90 [nfs] [<ffffffffa0161a2d>] nfs_fs_mount+0x47d/0xd60 [nfs] [<ffffffff81a59c5e>] ? mutex_unlock+0xe/0x10 [<ffffffffa01606a0>] ? nfs_remount+0x430/0x430 [nfs] [<ffffffffa01609c0>] ? nfs_clone_super+0x140/0x140 [nfs] [<ffffffff81196489>] mount_fs+0x39/0x1b0 [<ffffffff81166145>] ? __alloc_percpu+0x15/0x20 [<ffffffff811b276b>] vfs_kern_mount+0x6b/0x150 [<ffffffff811b5830>] do_mount+0x210/0xbe0 [<ffffffff811b54ca>] ? copy_mount_options+0x3a/0x160 [<ffffffff811b651f>] SyS_mount+0x6f/0xb0 [<ffffffff81a5c852>] system_call_fastpath+0x12/0x17 Sleeping under the rcu_read_lock is bad. This patch fixes it by dropping the rcu_read_lock before doing the allocation and then reacquiring it and redoing the dereference before doing the copy. If we find that the string has somehow grown in the meantime, we'll reallocate and try again. Cc: <stable@vger.kernel.org> # v3.17+ Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-11-13 13:15:49 -05:00
Dan Carpenter	eb63192bb8	SUNRPC: off by one in BUG_ON() The m->pool_to[] array has "maxpools" number of elements. It's allocated in svc_pool_map_alloc_arrays() which we called earlier in the function. This test should be >= instead of >. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-29 11:37:42 -04:00
J. Bruce Fields	280caac078	rpc: change comments to assertions Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 14:05:11 -04:00
J. Bruce Fields	ed38c06998	RPC: remove unneeded checks from xdr_truncate_encode() Thanks to Andrea Arcangeli for pointing out these checks are obviously unnecessary given the preceding calculations. Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-10-23 14:05:11 -04:00
Linus Torvalds	6dea0737bc	Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "Highlights: - support the NFSv4.2 SEEK operation (allowing clients to support SEEK_HOLE/SEEK_DATA), thanks to Anna. - end the grace period early in a number of cases, mitigating a long-standing annoyance, thanks to Jeff - improve SMP scalability, thanks to Trond" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits) nfsd: eliminate "to_delegation" define NFSD: Implement SEEK NFSD: Add generic v4.2 infrastructure svcrdma: advertise the correct max payload nfsd: introduce nfsd4_callback_ops nfsd: split nfsd4_callback initialization and use nfsd: introduce a generic nfsd4_cb nfsd: remove nfsd4_callback.cb_op nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence nfsd: fix nfsd4_cb_recall_done error handling nfsd4: clarify how grace period ends nfsd4: stop grace_time update at end of grace period nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls nfsd: serialize nfsdcltrack upcalls for a particular client nfsd: pass extra info in env vars to upcalls to allow for early grace period end nfsd: add a v4_end_grace file to /proc/fs/nfsd lockd: add a /proc/fs/lockd/nlm_end_grace file nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE nfsd: remove redundant boot_time parm from grace_done client tracking op ...	2014-10-08 12:51:44 -04:00
Trond Myklebust	72c23f0819	Merge branch 'bugfixes' into linux-next * bugfixes: NFSv4.1: Fix an NFSv4.1 state renewal regression NFSv4: fix open/lock state recovery error handling NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails NFS: Fabricate fscache server index key correctly SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT nfs: fix duplicate proc entries	2014-09-30 17:21:41 -04:00
Steve Wise	7e5be28827	svcrdma: advertise the correct max payload Svcrdma currently advertises 1MB, which is too large. The correct value is the minimum of RPCSVC_MAXPAYLOAD and the max scatter-gather allowed in an NFSRDMA IO chunk * the host page size. This bug is usually benign because the Linux X64 NFSRDMA client correctly limits the payload size to the correct value (64*4096 = 256KB). But if the Linux client is PPC64 with a 64KB page size, then the client will indeed use a payload size that will overflow the server. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-09-29 14:35:18 -04:00
Trond Myklebust	2aca5b869a	SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT The flag RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT was intended introduced in order to allow NFSv4 clients to disable resend timeouts. Since those cause the RPC layer to break the connection, they mess up the duplicate reply caches that remain indexed on the port number in NFSv4.. This patch includes the code that was missing in the original to set the appropriate flag in struct rpc_clnt, when the caller of rpc_create() sets RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT. Fixes: `8a19a0b6cb` (SUNRPC: Add RPC task and client level options to...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-25 21:25:17 -04:00
NeilBrown	1aff525629	NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-25 08:25:47 -04:00
Jason Baron	3dedbb5ca1	rpc: Add -EPERM processing for xs_udp_send_request() If an iptables drop rule is added for an nfs server, the client can end up in a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM is ignored since the prior bits of the packet may have been successfully queued and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request() thinks that because some bits were queued it should return -EAGAIN. We then try the request again and again, resulting in cpu spinning. Reproducer: 1) open a file on the nfs server '/nfs/foo' (mounted using udp) 2) iptables -A OUTPUT -d <nfs server ip> -j DROP 3) write to /nfs/foo 4) close /nfs/foo 5) iptables -D OUTPUT -d <nfs server ip> -j DROP The softlockup occurs in step 4 above. The previous patch, allows xs_sendpages() to return both a sent count and any error values that may have occurred. Thus, if we get an -EPERM, return that to the higher level code. With this patch in place we can successfully abort the above sequence and avoid the softlockup. I also tried the above test case on an nfs mount on tcp and although the system does not softlockup, I still ended up with the 'hung_task' firing after 120 seconds, due to the i/o being stuck. The tcp case appears a bit harder to fix, since -EPERM appears to get ignored much lower down in the stack and does not propogate up to xs_sendpages(). This case is not quite as insidious as the softlockup and it is not addressed here. Reported-by: Yigong Lou <ylou@akamai.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-24 23:13:46 -04:00
Jason Baron	f279cd008f	rpc: return sent and err from xs_sendpages() If an error is returned after the first bits of a packet have already been successfully queued, xs_sendpages() will return a positive 'int' value indicating success. Callers seem to treat this as -EAGAIN. However, there are cases where its not a question of waiting for the write queue to drain. For example, when there is an iptables rule dropping packets to the destination, the lower level code can return -EPERM only after parts of the packet have been successfully queued. In this case, we can end up continuously retrying resulting in a kernel softlockup. This patch is intended to make no changes in behavior but is in preparation for subsequent patches that can make decisions based on both on the number of bytes sent by xs_sendpages() and any errors that may have be returned. Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-24 23:13:37 -04:00
Benjamin Coddington	a743419f42	SUNRPC: Don't wake tasks during connection abort When aborting a connection to preserve source ports, don't wake the task in xs_error_report. This allows tasks with RPC_TASK_SOFTCONN to succeed if the connection needs to be re-established since it preserves the task's status instead of setting it to the status of the aborting kernel_connect(). This may also avoid a potential conflict on the socket's lock. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # 3.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-24 23:06:56 -04:00
Chris Perl	0f7a622ca6	rpc: xs_bind - do not bind when requesting a random ephemeral port When attempting to establish a local ephemeral endpoint for a TCP or UDP socket, do not explicitly call bind, instead let it happen implicilty when the socket is first used. The main motivating factor for this change is when TCP runs out of unique ephemeral ports (i.e. cannot find any ephemeral ports which are not a part of any TCP connection). In this situation if you explicitly call bind, then the call will fail with EADDRINUSE. However, if you allow the allocation of an ephemeral port to happen implicitly as part of connect (or other functions), then ephemeral ports can be reused, so long as the combination of (local_ip, local_port, remote_ip, remote_port) is unique for TCP sockets on the system. This doesn't matter for UDP sockets, but it seemed easiest to treat TCP and UDP sockets the same. This can allow mount.nfs(8) to continue to function successfully, even in the face of misbehaving applications which are creating a large number of TCP connections. Signed-off-by: Chris Perl <chris.perl@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-09-10 12:47:00 -07:00
Chuck Lever	71efecb3f5	sunrpc: fix byte-swapping of displayed XID xprt_lookup_rqst() and bc_send_request() display a byte-swapped XID, but receive_cb_reply() does not. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-28 16:00:07 -04:00
J. Bruce Fields	ae89254da6	SUNRPC: Fix compile on non-x86 current_task appears to be x86-only, oops. Let's just delete this check entirely: Any developer that adds a new user without setting rq_task will get a crash the first time they test it. I also don't think there are normally any important locks held here, and I can't see any other reason why killing a server thread would bring the whole box down. So the effort to fail gracefully here looks like overkill. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: `983c684466` "SUNRPC: get rid of the request wait queue" Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-28 15:51:35 -04:00
Trond Myklebust	f8d1ff47b6	SUNRPC: Optimise away svc_recv_available We really do not want to do ioctls in the server's fast path. Instead, let's use the fact that we managed to read a full record as the indicator that we should try to read the socket again. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:11 -04:00
Trond Myklebust	0c0746d03e	SUNRPC: More optimisations of svc_xprt_enqueue() Just move the transport locking out of the spin lock protected area altogether. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:11 -04:00
Trond Myklebust	a4aa8054a6	SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt We should definitely not be exiting svc_get_next_xprt() with the thread enqueued. Fix this by ensuring that we fall through to the dequeue. Also move the test itself outside the spin lock protected section. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:11 -04:00
Trond Myklebust	983c684466	SUNRPC: get rid of the request wait queue We're always _only_ waking up tasks from within the sp_threads list, so we know that they are enqueued and alive. The rq_wait waitqueue is just a distraction with extra atomic semantics. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:11 -04:00
Trond Myklebust	106f359cf4	SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:10 -04:00
Trond Myklebust	9e5b208dc9	SUNRPC: Do not override wspace tests in svc_handle_xprt We already determined that there was enough wspace when we called svc_xprt_enqueue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-17 12:00:10 -04:00
Linus Torvalds	06b8ab5528	NFS client updates for Linux 3.17 Highlights include: - Stable fix for a bug in nfs3_list_one_acl() - Speed up NFS path walks by supporting LOOKUP_RCU - More read/write code cleanups - pNFS fixes for layout return on close - Fixes for the RCU handling in the rpcsec_gss code - More NFS/RDMA fixes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1 TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7 gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc ivr2s7aIDV48jhnwBSVL =KE07 -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Highlights include: - stable fix for a bug in nfs3_list_one_acl() - speed up NFS path walks by supporting LOOKUP_RCU - more read/write code cleanups - pNFS fixes for layout return on close - fixes for the RCU handling in the rpcsec_gss code - more NFS/RDMA fixes" * tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) nfs: reject changes to resvport and sharecache during remount NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred NFS: fix two problems in lookup_revalidate in RCU-walk NFS: allow lockless access to access_cache NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU NFS: support RCU_WALK in nfs_permission() sunrpc/auth: allow lockless (rcu) lookup of credential cache. NFS: prepare for RCU-walk support but pushing tests later in code. NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. NFS: add checks for returned value of try_module_get() nfs: clear_request_commit while holding i_lock pnfs: add pnfs_put_lseg_async pnfs: find swapped pages on pnfs commit lists too nfs: fix comment and add warn_on for PG_INODE_REF nfs: check wait_on_bit_lock err in page_group_lock sunrpc: remove "ec" argument from encrypt_v2 operation sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c ...	2014-08-13 18:13:19 -06:00
Linus Torvalds	0d10c2c170	Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "This includes a major rewrite of the NFSv4 state code, which has always depended on a single mutex. As an example, open creates are no longer serialized, fixing a performance regression on NFSv3->NFSv4 upgrades. Thanks to Jeff, Trond, and Benny, and to Christoph for review. Also some RDMA fixes from Chuck Lever and Steve Wise, and miscellaneous fixes from Kinglong Mee and others" * 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits) svcrdma: remove rdma_create_qp() failure recovery logic nfsd: add some comments to the nfsd4 object definitions nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net nfsd: remove nfs4_lock_state: nfs4_laundromat nfsd: Remove nfs4_lock_state(): reclaim_complete() nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session() nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn() nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt() nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op() nfsd: remove old fault injection infrastructure nfsd: add more granular locking to *_delegations fault injectors nfsd: add more granular locking to forget_openowners fault injector nfsd: add more granular locking to forget_locks fault injector nfsd: add a list_head arg to nfsd_foreach_client_lock ...	2014-08-09 14:31:18 -07:00
Steve Wise	d1e458fe67	svcrdma: remove rdma_create_qp() failure recovery logic In svc_rdma_accept(), if rdma_create_qp() fails, there is useless logic to try and call rdma_create_qp() again with reduced sge depths. The assumption, I guess, was that perhaps the initial sge depths chosen were too big. However they initial depths are selected based on the rdma device attribute max_sge returned from ib_query_device(). If rdma_create_qp() fails, it would not be because the max_send_sge and max_recv_sge values passed in exceed the device's max. So just remove this code. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-08-05 16:09:21 -04:00
NeilBrown	122a8cda6a	SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred current_cred() can only be changed by 'current', and cred->group_info is never changed. If a new group_info is needed, a new 'cred' is created. Consequently it is always safe to access current_cred()->group_info without taking any further references. So drop the refcounting and the incorrect rcu_dereference(). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-04 09:22:08 -04:00
NeilBrown	bd95608053	sunrpc/auth: allow lockless (rcu) lookup of credential cache. The new flag RPCAUTH_LOOKUP_RCU to credential lookup avoids locking, does not take a reference on the returned credential, and returns -ECHILD if a simple lookup was not possible. The returned value can only be used within an rcu_read_lock protected region. The main user of this is the new rpc_lookup_cred_nonblock() which returns a pointer to the current credential which is only rcu-safe (no ref-count held), and might return -ECHILD if allocation was required. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 17:14:12 -04:00
Jeff Layton	ec25422c66	sunrpc: remove "ec" argument from encrypt_v2 operation It's always 0. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 17:05:24 -04:00
Jeff Layton	b36e9c44af	sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c Fix the endianness handling in gss_wrap_kerberos_v1 and drop the memset call there in favor of setting the filler bytes directly. In gss_wrap_kerberos_v2, get rid of the "ec" variable which is always zero, and drop the endianness conversion of 0. Sparse handles 0 as a special case, so it's not necessary. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 17:05:24 -04:00
Jeff Layton	6ac0fbbfc1	sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c Use u16 pointer in setup_token and setup_token_v2. None of the fields are actually handled as __be16, so this simplifies the code a bit. Also get rid of some unneeded pointer increments. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 17:05:23 -04:00
Jeff Layton	c5e6aecd03	sunrpc: fix RCU handling of gc_ctx field The handling of the gc_ctx pointer only seems to be partially RCU-safe. The assignment and freeing are done using RCU, but many places in the code seem to dereference that pointer without proper RCU safeguards. Fix them to use rcu_dereference and to rcu_read_lock/unlock, and to properly handle the case where the pointer is NULL. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 17:05:23 -04:00
Trond Myklebust	9806755c56	Merge branch 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next * 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (916 commits) xprtrdma: Handle additional connection events xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro xprtrdma: Make rpcrdma_ep_disconnect() return void xprtrdma: Schedule reply tasklet once per upcall xprtrdma: Allocate each struct rpcrdma_mw separately xprtrdma: Rename frmr_wr xprtrdma: Disable completions for LOCAL_INV Work Requests xprtrdma: Disable completions for FAST_REG_MR Work Requests xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external() xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect xprtrdma: Properly handle exhaustion of the rb_mws list xprtrdma: Chain together all MWs in same buffer pool xprtrdma: Back off rkey when FAST_REG_MR fails xprtrdma: Unclutter struct rpcrdma_mr_seg xprtrdma: Don't invalidate FRMRs if registration fails xprtrdma: On disconnect, don't ignore pending CQEs xprtrdma: Update rkeys after transport reconnect xprtrdma: Limit data payload size for ALLPHYSICAL xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs ...	2014-08-03 17:04:51 -04:00
Trond Myklebust	bae6746ff3	SUNRPC: Enforce an upper limit on the number of cached credentials In some cases where the credentials are not often reused, we may want to limit their total number just in order to make the negative lookups in the hash table more manageable. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-08-03 16:02:50 -04:00
Chuck Lever	8079fb785e	xprtrdma: Handle additional connection events Commit `38ca83a5` added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status is relevant only for consumers that re-use their QPs on new connections. xprtrdma creates a fresh QP on reconnection, so that event should be explicitly ignored. Squelch the alarming "unexpected CM event" message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:59 -04:00
Chuck Lever	a779ca5fa7	xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro Clean up. RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode. Since RPCRDMA_REGISTER has been removed, there's no need for the extra conditional compilation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:59 -04:00
Chuck Lever	282191cb72	xprtrdma: Make rpcrdma_ep_disconnect() return void Clean up: The return code is used only for dprintk's that are already redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:58 -04:00
Chuck Lever	bb96193d91	xprtrdma: Schedule reply tasklet once per upcall Minor optimization: grab rpcrdma_tk_lock_g and disable hard IRQs just once after clearing the receive completion queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:58 -04:00
Chuck Lever	2e84522c2e	xprtrdma: Allocate each struct rpcrdma_mw separately Currently rpcrdma_buffer_create() allocates struct rpcrdma_mw's as a single contiguous area of memory. It amounts to quite a bit of memory, and there's no requirement for these to be carved from a single piece of contiguous memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:57 -04:00
Chuck Lever	f590e878c5	xprtrdma: Rename frmr_wr Clean up: Name frmr_wr after the opcode of the Work Request, consistent with the send and local invalidation paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:57 -04:00
Chuck Lever	dab7e3b8da	xprtrdma: Disable completions for LOCAL_INV Work Requests Instead of relying on a completion to change the state of an FRMR to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:57 -04:00
Chuck Lever	050557220e	xprtrdma: Disable completions for FAST_REG_MR Work Requests Instead of relying on a completion to change the state of an FRMR to FRMR_IS_VALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:56 -04:00
Chuck Lever	440ddad51b	xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external() Any FRMR arriving in rpcrdma_register_frmr_external() is now guaranteed to be either invalid, or to be targeted by a queued LOCAL_INV that will invalidate it before the adapter processes the FAST_REG_MR being built here. The problem with current arrangement of chaining a LOCAL_INV to the FAST_REG_MR is that if the transport is not connected, the LOCAL_INV is flushed and the FAST_REG_MR is flushed. This leaves the FRMR valid with the old rkey. But rpcrdma_register_frmr_external() has already bumped the in-memory rkey. Next time through rpcrdma_register_frmr_external(), a LOCAL_INV and FAST_REG_MR is attempted again because the FRMR is still valid. But the rkey no longer matches the hardware's rkey, and a memory management operation error occurs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:56 -04:00
Chuck Lever	ddb6bebcc6	xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request When a LOCAL_INV Work Request is flushed, it leaves an FRMR in the VALID state. This FRMR can be returned by rpcrdma_buffer_get(), and must be knocked down in rpcrdma_register_frmr_external() before it can be re-used. Instead, capture these in rpcrdma_buffer_get(), and reset them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:55 -04:00
Chuck Lever	9f9d802a28	xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect FAST_REG_MR Work Requests update a Memory Region's rkey. Rkey's are used to block unwanted access to the memory controlled by an MR. The rkey is passed to the receiver (the NFS server, in our case), and is also used by xprtrdma to invalidate the MR when the RPC is complete. When a FAST_REG_MR Work Request is flushed after a transport disconnect, xprtrdma cannot tell whether the WR actually hit the adapter or not. So it is indeterminant at that point whether the existing rkey is still valid. After the transport connection is re-established, the next FAST_REG_MR or LOCAL_INV Work Request against that MR can sometimes fail because the rkey value does not match what xprtrdma expects. The only reliable way to recover in this case is to deregister and register the MR before it is used again. These operations can be done only in a process context, so handle it in the transport connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:55 -04:00
Chuck Lever	c2922c0235	xprtrdma: Properly handle exhaustion of the rb_mws list If the rb_mws list is exhausted, clean up and return NULL so that call_allocate() will delay and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:55 -04:00
Chuck Lever	3111d72c7c	xprtrdma: Chain together all MWs in same buffer pool During connection loss recovery, need to visit every MW in a buffer pool. Any MW that is in use by an RPC will not be on the rb_mws list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:54 -04:00
Chuck Lever	c93e986a29	xprtrdma: Back off rkey when FAST_REG_MR fails If posting a FAST_REG_MR Work Reqeust fails, revert the rkey update to avoid subsequent IB_WC_MW_BIND_ERR completions. Suggested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:54 -04:00
Chuck Lever	0dbb4108a6	xprtrdma: Unclutter struct rpcrdma_mr_seg Clean ups: - make it obvious that the rl_mw field is a pointer -- allocated separately, not as part of struct rpcrdma_mr_seg - promote "struct {} frmr;" to a named type - promote the state enum to a named type - name the MW state field the same way other fields in rpcrdma_mw are named Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:54 -04:00
Chuck Lever	539431a437	xprtrdma: Don't invalidate FRMRs if registration fails If FRMR registration fails, it's likely to transition the QP to the error state. Or, registration may have failed because the QP is _already_ in ERROR. Thus calling rpcrdma_deregister_external() in rpcrdma_create_chunks() is useless in FRMR mode: the LOCAL_INVs just get flushed. It is safe to leave existing registrations: when FRMR registration is tried again, rpcrdma_register_frmr_external() checks if each FRMR is already/still VALID, and knocks it down first if it is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:53 -04:00
Chuck Lever	a7bc211ac9	xprtrdma: On disconnect, don't ignore pending CQEs xprtrdma is currently throwing away queued completions during a reconnect. RPC replies posted just before connection loss, or successful completions that change the state of an FRMR, can be missed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:53 -04:00
Chuck Lever	6ab59945f2	xprtrdma: Update rkeys after transport reconnect Various reports of: rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ffff8800bfd3e848 Ensure that rkeys in already-marshalled RPC/RDMA headers are refreshed after the QP has been replaced by a reconnect. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249 Suggested-by: Selvin Xavier <Selvin.Xavier@Emulex.Com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:53 -04:00
Chuck Lever	43e9598817	xprtrdma: Limit data payload size for ALLPHYSICAL When the client uses physical memory registration, each page in the payload gets its own array entry in the RPC/RDMA header's chunk list. Therefore, don't advertise a maximum payload size that would require more array entries than can fit in the RPC buffer where RPC/RDMA headers are built. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:52 -04:00
Chuck Lever	73806c8832	xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs Ensure ia->ri_id remains valid while invoking dma_unmap_page() or posting LOCAL_INV during a transport reconnect. Otherwise, ia->ri_id->device or ia->ri_id->qp is NULL, which triggers a panic. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=259 Fixes: `ec62f40` 'xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting' Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:52 -04:00
Chuck Lever	5fc83f470d	xprtrdma: Fix panic in rpcrdma_register_frmr_external() seg1->mr_nsegs is not yet initialized when it is used to unmap segments during an error exit. Use the same unmapping logic for all error exits. "if (frmr_wr.wr.fast_reg.length < len) {" used to be a BUG_ON check. The broken code will never be executed under normal operation. Fixes: `c977dea` (xprtrdma: Remove BUG_ON() call sites) Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-31 16:22:52 -04:00
Trond Myklebust	518776800c	SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freed Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-29 16:10:20 -04:00
Trond Myklebust	c7fb3f0631	SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely If requests are queued in the socket inbuffer waiting for an svc_tcp_has_wspace() requirement to be satisfied, then we do not want to clear the SOCK_NOSPACE flag until we've satisfied that requirement. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-29 16:10:19 -04:00
Trond Myklebust	0971374e28	SUNRPC: Reduce contention in svc_xprt_enqueue() Ensure that all calls to svc_xprt_enqueue() except svc_xprt_received() check the value of XPT_BUSY, before attempting to grab spinlocks etc. This is to avoid situations such as the following "perf" trace, which shows heavy contention on the pool spinlock: 54.15% nfsd [kernel.kallsyms] [k] _raw_spin_lock_bh \| --- _raw_spin_lock_bh \| \|--71.43%-- svc_xprt_enqueue \| \| \| \|--50.31%-- svc_reserve \| \| \| \|--31.35%-- svc_xprt_received \| \| \| \|--18.34%-- svc_tcp_data_ready ... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-29 16:10:15 -04:00
Chuck Lever	e560e3b510	svcrdma: Add zero padding if the client doesn't send it See RFC 5666 section 3.7: clients don't have to send zero XDR padding. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-22 16:40:21 -04:00
Yan Burman	bf858ab0ad	xprtrdma: Fix DMA-API-DEBUG warning by checking dma_map result Fix the following warning when DMA-API debug is enabled by checking ib_dma_map_single result: [ 1455.345548] ------------[ cut here ]------------ [ 1455.346863] WARNING: CPU: 3 PID: 3929 at /home/yanb/kernel/net-next/lib/dma-debug.c:1140 check_unmap+0x4e5/0x990() [ 1455.349350] mlx4_core 0000:00:07.0: DMA-API: device driver failed to check map error[device address=0x000000007c9f2090] [size=2656 bytes] [mapped as single] [ 1455.349350] Modules linked in: xprtrdma netconsole configfs nfsv3 nfs_acl ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm autofs4 auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc dm_mirror dm_region_hash dm_log microcode pcspkr mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en ipv6 ptp pps_core vxlan mlx4_core virtio_balloon cirrus ttm drm_kms_helper drm sysimgblt sysfillrect syscopyarea i2c_piix4 i2c_core button ext3 jbd virtio_blk virtio_net virtio_pci virtio_ring virtio uhci_hcd ata_generic ata_piix libata [ 1455.349350] CPU: 3 PID: 3929 Comm: mount.nfs Not tainted 3.15.0-rc1-dbg+ #13 [ 1455.349350] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 1455.349350] 0000000000000474 ffff880069dcf628 ffffffff8151c341 ffffffff817b69d8 [ 1455.349350] ffff880069dcf678 ffff880069dcf668 ffffffff8105b5fc 0000000069dcf658 [ 1455.349350] ffff880069dcf778 ffff88007b0c9f00 ffffffff8255ec40 0000000000000a60 [ 1455.349350] Call Trace: [ 1455.349350] [<ffffffff8151c341>] dump_stack+0x52/0x81 [ 1455.349350] [<ffffffff8105b5fc>] warn_slowpath_common+0x8c/0xc0 [ 1455.349350] [<ffffffff8105b6e6>] warn_slowpath_fmt+0x46/0x50 [ 1455.349350] [<ffffffff812e6305>] check_unmap+0x4e5/0x990 [ 1455.349350] [<ffffffff81521fb0>] ? _raw_spin_unlock_irq+0x30/0x60 [ 1455.349350] [<ffffffff812e6a0a>] debug_dma_unmap_page+0x5a/0x60 [ 1455.349350] [<ffffffffa0389583>] rpcrdma_deregister_internal+0xb3/0xd0 [xprtrdma] [ 1455.349350] [<ffffffffa038a639>] rpcrdma_buffer_destroy+0x69/0x170 [xprtrdma] [ 1455.349350] [<ffffffffa03872ff>] xprt_rdma_destroy+0x3f/0xb0 [xprtrdma] [ 1455.349350] [<ffffffffa04a95ff>] xprt_destroy+0x6f/0x80 [sunrpc] [ 1455.349350] [<ffffffffa04a9625>] xprt_put+0x15/0x20 [sunrpc] [ 1455.349350] [<ffffffffa04a899a>] rpc_free_client+0x8a/0xe0 [sunrpc] [ 1455.349350] [<ffffffffa04a8a58>] rpc_release_client+0x68/0xa0 [sunrpc] [ 1455.349350] [<ffffffffa04a9060>] rpc_shutdown_client+0xb0/0xc0 [sunrpc] [ 1455.349350] [<ffffffffa04a8f5d>] ? rpc_ping+0x5d/0x70 [sunrpc] [ 1455.349350] [<ffffffffa04a91ab>] rpc_create_xprt+0xbb/0xd0 [sunrpc] [ 1455.349350] [<ffffffffa04a9273>] rpc_create+0xb3/0x160 [sunrpc] [ 1455.349350] [<ffffffff81129749>] ? __probe_kernel_read+0x69/0xb0 [ 1455.349350] [<ffffffffa053851c>] nfs_create_rpc_client+0xdc/0x100 [nfs] [ 1455.349350] [<ffffffffa0538cfa>] nfs_init_client+0x3a/0x90 [nfs] [ 1455.349350] [<ffffffffa05391c8>] nfs_get_client+0x478/0x5b0 [nfs] [ 1455.349350] [<ffffffffa0538e50>] ? nfs_get_client+0x100/0x5b0 [nfs] [ 1455.349350] [<ffffffff81172c6d>] ? kmem_cache_alloc_trace+0x24d/0x260 [ 1455.349350] [<ffffffffa05393f3>] nfs_create_server+0xf3/0x4c0 [nfs] [ 1455.349350] [<ffffffffa0545ff0>] ? nfs_request_mount+0xf0/0x1a0 [nfs] [ 1455.349350] [<ffffffffa031c0c3>] nfs3_create_server+0x13/0x30 [nfsv3] [ 1455.349350] [<ffffffffa0546293>] nfs_try_mount+0x1f3/0x230 [nfs] [ 1455.349350] [<ffffffff8108ea21>] ? get_parent_ip+0x11/0x50 [ 1455.349350] [<ffffffff812d6343>] ? __this_cpu_preempt_check+0x13/0x20 [ 1455.349350] [<ffffffff810d632b>] ? try_module_get+0x6b/0x190 [ 1455.349350] [<ffffffffa05449f7>] nfs_fs_mount+0x187/0x9d0 [nfs] [ 1455.349350] [<ffffffffa0545940>] ? nfs_clone_super+0x140/0x140 [nfs] [ 1455.349350] [<ffffffffa0543b20>] ? nfs_auth_info_match+0x40/0x40 [nfs] [ 1455.349350] [<ffffffff8117e360>] mount_fs+0x20/0xe0 [ 1455.349350] [<ffffffff811a1c16>] vfs_kern_mount+0x76/0x160 [ 1455.349350] [<ffffffff811a29a8>] do_mount+0x428/0xae0 [ 1455.349350] [<ffffffff811a30f0>] SyS_mount+0x90/0xe0 [ 1455.349350] [<ffffffff8152af52>] system_call_fastpath+0x16/0x1b [ 1455.349350] ---[ end trace f1f31572972e211d ]--- Signed-off-by: Yan Burman <yanb@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-07-22 13:55:30 -04:00
Trond Myklebust	22cb43855d	SUNRPC: xdr_get_next_encode_buffer should be declared static Quell another sparse warning. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-18 11:35:46 -04:00
Chuck Lever	3c45ddf823	svcrdma: Select NFSv4.1 backchannel transport based on forward channel The current code always selects XPRT_TRANSPORT_BC_TCP for the back channel, even when the forward channel was not TCP (eg, RDMA). When a 4.1 mount is attempted with RDMA, the server panics in the TCP BC code when trying to send CB_NULL. Instead, construct the transport protocol number from the forward channel transport or'd with XPRT_TRANSPORT_BC. Transports that do not support bi-directional RPC will not have registered a "BC" transport, causing create_backchannel_client() to fail immediately. Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-18 11:35:45 -04:00
NeilBrown	c1221321b7	sched: Allow wait_on_bit_action() functions to support a timeout It is currently not possible for various wait_on_bit functions to implement a timeout. While the "action" function that is called to do the waiting could certainly use schedule_timeout(), there is no way to carry forward the remaining timeout after a false wake-up. As false-wakeups a clearly possible at least due to possible hash collisions in bit_waitqueue(), this is a real problem. The 'action' function is currently passed a pointer to the word containing the bit being waited on. No current action functions use this pointer. So changing it to something else will be a little noisy but will have no immediate effect. This patch changes the 'action' function to take a pointer to the "struct wait_bit_key", which contains a pointer to the word containing the bit so nothing is really lost. It also adds a 'private' field to "struct wait_bit_key", which is initialized to zero. An action function can now implement a timeout with something like static int timed_out_waiter(struct wait_bit_key key) { unsigned long waited; if (key->private == 0) { key->private = jiffies; if (key->private == 0) key->private -= 1; } waited = jiffies - key->private; if (waited > 10 HZ) return -EAGAIN; schedule_timeout(waited - 10 * HZ); return 0; } If any other need for context in a waiter were found it would be easy to use ->private for some other purpose, or even extend "struct wait_bit_key". My particular need is to support timeouts in nfs_release_page() to avoid deadlocks with loopback mounted NFS. While wait_on_bit_timeout() would be a cleaner interface, it will not meet my need. I need the timeout to be sensitive to the state of the connection with the server, which could change. So I need to use an 'action' interface. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: David Howells <dhowells@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-07-16 15:10:41 +02:00
Daniel Walter	00cfaa943e	replace strict_strto calls Replace obsolete strict_strto calls with appropriate kstrto calls Signed-off-by: Daniel Walter <dwalter@google.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-07-12 18:45:49 -04:00
Himangi Saraogi	adcda652c9	rpc_pipe: Drop memory allocation cast Drop cast on the result of kmalloc and similar functions. The semantic patch that makes this change is as follows: // <smpl> @@ type T; @@ - (T *) (\(kmalloc\\|kzalloc\\|kcalloc\\|kmem_cache_alloc\\|kmem_cache_zalloc\\| kmem_cache_alloc_node\\|kmalloc_node\\|kzalloc_node\)(...)) // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-07-12 18:43:44 -04:00
Jeff Layton	a0337d1ddb	sunrpc: add a new "stringify_acceptor" rpc_credop ...and add an new rpc_auth function to call it when it exists. This is only applicable for AUTH_GSS mechanisms, so we only specify this for those sorts of credentials. Signed-off-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-07-12 18:41:20 -04:00
Jeff Layton	2004c726b9	auth_gss: fetch the acceptor name out of the downcall If rpc.gssd sends us an acceptor name string trailing the context token, stash it as part of the context. Signed-off-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-07-12 18:41:06 -04:00
Steve Wise	255942907e	svcrdma: send_write() must not overflow the device's max sge Function send_write() must stop creating sges when it reaches the device max and return the amount sent in the RDMA Write to the caller. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-07-11 15:03:48 -04:00
Trond Myklebust	2fc193cf92	SUNRPC: Handle EPIPE in xprt_connect_status The callback handler xs_error_report() can end up propagating an EPIPE error by means of the call to xprt_wake_pending_tasks(). Ensure that xprt_connect_status() does not automatically convert this into an EIO error. Reported-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-07-03 00:11:35 -04:00
Trond Myklebust	3601c4a91e	SUNRPC: Ensure that we handle ENOBUFS errors correctly. Currently, an ENOBUFS error will result in a fatal error for the RPC call. Normally, we will just want to wait and then retry. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-06-30 13:42:19 -04:00
Andy Adamson	66b0686049	NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support Fix nfs4_negotiate_security to create an rpc_clnt used to test each SECINFO returned pseudoflavor. Check credential creation (and gss_context creation) which is important for RPC_AUTH_GSS pseudoflavors which can fail for multiple reasons including mis-configuration. Don't call nfs4_negotiate in nfs4_submount as it was just called by nfs4_proc_lookup_mountpoint (nfs4_proc_lookup_common) Signed-off-by: Andy Adamson <andros@netapp.com> [Trond: fix corrupt return value from nfs_find_best_sec()] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-06-24 18:46:58 -04:00
Kinglong Mee	f15a5cf912	SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_ok rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-23 11:31:36 -04:00
Linus Torvalds	f9da455b93	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...	2014-06-12 14:27:40 -07:00
Tom Herbert	7e3cead517	net: Save software checksum complete In skb_checksum complete, if we need to compute the checksum for the packet (via skb_checksum) save the result as CHECKSUM_COMPLETE. Subsequent checksum verification can use this. Also, added csum_complete_sw flag to distinguish between software and hardware generated checksum complete, we should always be able to trust the software computation. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-06-11 15:46:13 -07:00
Linus Torvalds	d1e1cda862	NFS client updates for Linux 3.16 Highlights include: - Massive cleanup of the NFS read/write code by Anna and Dros - Support multiple NFS read/write requests per page in order to deal with non-page aligned pNFS striping. Also cleans up the r/wsize < page size code nicely. - stable fix for ensuring inode is declared uptodate only after all the attributes have been checked. - stable fix for a kernel Oops when remounting - NFS over RDMA client fixes - move the pNFS files layout driver into its own subdirectory -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJTl3pmAAoJEGcL54qWCgDyraIP/08ZbbDowVTP9572bxl+VR2i zNbrflBtl1R05D4Imi/IEySK0w6xj1CLsncNpXAT2bxTlyKPW70tpiiPlRKMPuO8 JW+iPiepR2t0mol6MEd46yuV8btXVk8I+7IYjPXANiMJG8O5dJzNQ8NiCQOERBNt FQ7rzTCFO0ESGXnT6vYrT4I0bwqYVklBiJRTT4PQVzhhhDq9qUdq21BlQjQJFXP4 9aBLurxKptlHBvE6A2Quja6ObEC0s31CxcijqHIJ+Ue4GbKcFbMG1tgjY7ESE/AD rqzDeF0jvWHT+frmvFEUUXWqzF1ReZ4x9pfDoOgeG6T9/K6DT91O0yMOgG8jvlbF 8DSATNYGDX5sSjpvaG5JokGG+cGCk9srVDx+itn7HlwzalRwn0PjKtIYwOJ7TJIr o/j20nOsPrRGF0OqLf9phyocgRrlbMKOzj1IXldHHfAbNkRcISTK08lxvsz96Ddn zRyDmbsbY6QFXdB3AVSeQmg5R0OOLtzNIcsFPmNdvy5eiy67qU0lsGg8UGNnoz8k PHN1pcGejkctLhQ32ee3w/W6zkrgpJZcNC9JSoG8Dc3SeXus0c3IgumRknFCmiep ssN+1jEITAGeS5a2aBxwLQLVI2JAr2lxs5e+R4D5EsQlFkCl6Mrgtzh/aToWTuFl Qt7l2zI3r3VieKT9u7Bh =OyXR -----END PGP SIGNATURE----- Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client updates from Trond Myklebust: "Highlights include: - massive cleanup of the NFS read/write code by Anna and Dros - support multiple NFS read/write requests per page in order to deal with non-page aligned pNFS striping. Also cleans up the r/wsize < page size code nicely. - stable fix for ensuring inode is declared uptodate only after all the attributes have been checked. - stable fix for a kernel Oops when remounting - NFS over RDMA client fixes - move the pNFS files layout driver into its own subdirectory" * tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) NFS: populate ->net in mount data when remounting pnfs: fix lockup caused by pnfs_generic_pg_test NFSv4.1: Fix typo in dprintk NFSv4.1: Comment is now wrong and redundant to code NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state xprtrdma: Disconnect on registration failure xprtrdma: Remove BUG_ON() call sites xprtrdma: Avoid deadlock when credit window is reset SUNRPC: Move congestion window constants to header file xprtrdma: Reset connection timeout after successful reconnect xprtrdma: Use macros for reconnection timeout constants xprtrdma: Allocate missing pagelist xprtrdma: Remove Tavor MTU setting xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting xprtrdma: Reduce the number of hardway buffer allocations xprtrdma: Limit work done by completion handler xprtrmda: Reduce calls to ib_poll_cq() in completion handlers xprtrmda: Reduce lock contention in completion handlers xprtrdma: Split the completion queue xprtrdma: Make rpcrdma_ep_destroy() return void ...	2014-06-10 15:02:42 -07:00
Linus Torvalds	5b174fd647	Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "The largest piece is a long-overdue rewrite of the xdr code to remove some annoying limitations: for example, there was no way to return ACLs larger than 4K, and readdir results were returned only in 4k chunks, limiting performance on large directories. Also: - part of Neil Brown's work to make NFS work reliably over the loopback interface (so client and server can run on the same machine without deadlocks). The rest of it is coming through other trees. - cleanup and bugfixes for some of the server RDMA code, from Steve Wise. - Various cleanup of NFSv4 state code in preparation for an overhaul of the locking, from Jeff, Trond, and Benny. - smaller bugfixes and cleanup from Christoph Hellwig and Kinglong Mee. Thanks to everyone! This summer looks likely to be busier than usual for knfsd. Hopefully we won't break it too badly; testing definitely welcomed" * 'for-3.16' of git://linux-nfs.org/~bfields/linux: (100 commits) nfsd4: fix FREE_STATEID lockowner leak svcrdma: Fence LOCAL_INV work requests svcrdma: refactor marshalling logic nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry nfs4: remove unused CHANGE_SECURITY_LABEL nfsd4: kill READ64 nfsd4: kill READ32 nfsd4: simplify server xdr->next_page use nfsd4: hash deleg stateid only on successful nfs4_set_delegation nfsd4: rename recall_lock to state_lock nfsd: remove unneeded zeroing of fields in nfsd4_proc_compound nfsd: fix setting of NFS4_OO_CONFIRMED in nfsd4_open nfsd4: use recall_lock for delegation hashing nfsd: fix laundromat next-run-time calculation nfsd: make nfsd4_encode_fattr static SUNRPC/NFSD: Remove using of dprintk with KERN_WARNING nfsd: remove unused function nfsd_read_file nfsd: getattr for FATTR4_WORD0_FILES_AVAIL needs the statfs buffer NFSD: Error out when getting more than one fsloc/secinfo/uuid NFSD: Using type of uint32_t for ex_nflavors instead of int ...	2014-06-10 11:50:57 -07:00
Steve Wise	83710fc753	svcrdma: Fence LOCAL_INV work requests Fencing forces the invalidate to only happen after all prior send work requests have been completed. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reported by : Devesh Sharma <Devesh.Sharma@Emulex.Com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-06 19:22:51 -04:00
Steve Wise	0bf4828983	svcrdma: refactor marshalling logic This patch refactors the NFSRDMA server marshalling logic to remove the intermediary map structures. It also fixes an existing bug where the NFSRDMA server was not minding the device fast register page list length limitations. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com>	2014-06-06 19:22:50 -04:00
J. Bruce Fields	05638dc73a	nfsd4: simplify server xdr->next_page use The rpc code makes available to the NFS server an array of pages to encod into. The server represents its reply as an xdr buf, with the head pointing into the first page in that array, the pages ** array starting just after that, and the tail (if any) sharing any leftover space in the page used by the head. While encoding, we use xdr_stream->page_ptr to keep track of which page we're currently using. Currently we set xdr_stream->page_ptr to buf->pages, which makes the head a weird exception to the rule that page_ptr always points to the page we're currently encoding into. So, instead set it to buf->pages - 1 (the page actually containing the head), and remove the need for a little unintuitive logic in xdr_get_next_encode_buffer() and xdr_truncate_encode. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-06-06 19:22:46 -04:00
Chuck Lever	c93c62231c	xprtrdma: Disconnect on registration failure If rpcrdma_register_external() fails during request marshaling, the current RPC request is killed. Instead, this RPC should be retried after reconnecting the transport instance. The most likely reason for registration failure with FRMR is a failed post_send, which would be due to a remote transport disconnect or memory exhaustion. These issues can be recovered by a retry. Problems encountered in the marshaling logic itself will not be corrected by trying again, so these should still kill a request. Now that we've added a clean exit for marshaling errors, take the opportunity to defang some BUG_ON's. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:53 -04:00
Chuck Lever	c977dea227	xprtrdma: Remove BUG_ON() call sites If an error occurs in the marshaling logic, fail the RPC request being processed, but leave the client running. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:53 -04:00
Chuck Lever	e7ce710a88	xprtrdma: Avoid deadlock when credit window is reset Update the cwnd while processing the server's reply. Otherwise the next task on the xprt_sending queue is still subject to the old credit window. Currently, no task is awoken if the old congestion window is still exceeded, even if the new window is larger, and a deadlock results. This is an issue during a transport reconnect. Servers don't normally shrink the credit window, but the client does reset it to 1 when reconnecting so the server can safely grow it again. As a minor optimization, remove the hack of grabbing the initial cwnd size (which happens to be RPC_CWNDSCALE) and using that value as the congestion scaling factor. The scaling value is invariant, and we are better off without the multiplication operation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:52 -04:00
Chuck Lever	4f4cf5ad6f	SUNRPC: Move congestion window constants to header file I would like to use one of the RPC client's congestion algorithm constants in transport-specific code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:51 -04:00
Chuck Lever	18906972aa	xprtrdma: Reset connection timeout after successful reconnect If the new connection is able to make forward progress, reset the re-establish timeout. Otherwise it keeps growing even if disconnect events are rare. The same behavior as TCP is adopted: reconnect immediately if the transport instance has been able to make some forward progress. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:51 -04:00
Chuck Lever	bfaee096de	xprtrdma: Use macros for reconnection timeout constants Clean up: Ensure the same max and min constant values are used everywhere when setting reconnect timeouts. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:50 -04:00
Shirley Ma	196c69989d	xprtrdma: Allocate missing pagelist GETACL relies on transport layer to alloc memory for reply buffer. However xprtrdma assumes that the reply buffer (pagelist) has been pre-allocated in upper layer. This problem was reported by IOL OFA lab test on PPC. Signed-off-by: Shirley Ma <shirley.ma@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Edward Mossman <emossman@iol.unh.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:49 -04:00
Chuck Lever	5bc4bc7292	xprtrdma: Remove Tavor MTU setting Clean up. Remove HCA-specific clutter in xprtrdma, which is supposed to be device-independent. Hal Rosenstock <hal@dev.mellanox.co.il> observes: > Note that there is OpenSM option (enable_quirks) to return 1K MTU > in SA PathRecord responses for Tavor so that can be used for this. > The default setting for enable_quirks is FALSE so that would need > changing. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:48 -04:00
Chuck Lever	ec62f40d35	xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting Devesh Sharma <Devesh.Sharma@Emulex.Com> reports that after a disconnect, his HCA is failing to create a fresh QP, leaving ia_ri->ri_id->qp set to NULL. But xprtrdma still allows RPCs to wake up and post LOCAL_INV as they exit, causing an oops. rpcrdma_ep_connect() is allowing the wake-up by leaking the QP creation error code (-EPERM in this case) to the RPC client's generic layer. xprt_connect_status() does not recognize -EPERM, so it kills pending RPC tasks immediately rather than retrying the connect. Re-arrange the QP creation logic so that when it fails on reconnect, it leaves ->qp with the old QP rather than NULL. If pending RPC tasks wake and exit, LOCAL_INV work requests will flush rather than oops. On initial connect, leaving ->qp == NULL is OK, since there are no pending RPCs that might use ->qp. But be sure not to try to destroy a NULL QP when rpcrdma_ep_connect() is retried. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:47 -04:00
Chuck Lever	65866f8259	xprtrdma: Reduce the number of hardway buffer allocations While marshaling an RPC/RDMA request, the inline_{rsize,wsize} settings determine whether an inline request is used, or whether read or write chunks lists are built. The current default value of these settings is 1024. Any RPC request smaller than 1024 bytes is sent to the NFS server completely inline. rpcrdma_buffer_create() allocates and pre-registers a set of RPC buffers for each transport instance, also based on the inline rsize and wsize settings. RPC/RDMA requests and replies are built in these buffers. However, if an RPC/RDMA request is expected to be larger than 1024, a buffer has to be allocated and registered for that RPC, and deregistered and released when the RPC is complete. This is known has a "hardway allocation." Since the introduction of NFSv4, the size of RPC requests has become larger, and hardway allocations are thus more frequent. Hardway allocations are significant overhead, and they waste the existing RPC buffers pre-allocated by rpcrdma_buffer_create(). We'd like fewer hardway allocations. Increasing the size of the pre-registered buffers is the most direct way to do this. However, a blanket increase of the inline thresholds has interoperability consequences. On my 64-bit system, rpcrdma_buffer_create() requests roughly 7000 bytes for each RPC request buffer, using kmalloc(). Due to internal fragmentation, this wastes nearly 1200 bytes because kmalloc() already returns an 8192-byte piece of memory for a 7000-byte allocation request, though the extra space remains unused. So let's round up the size of the pre-allocated buffers, and make use of the unused space in the kmalloc'd memory. This change reduces the amount of hardway allocated memory for an NFSv4 general connectathon run from 1322092 to 9472 bytes (99%). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:46 -04:00
Chuck Lever	8301a2c047	xprtrdma: Limit work done by completion handler Sagi Grimberg <sagig@dev.mellanox.co.il> points out that a steady stream of CQ events could starve other work because of the boundless loop pooling in rpcrdma_{send,recv}_poll(). Instead of a (potentially infinite) while loop, return after collecting a budgeted number of completions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:45 -04:00
Chuck Lever	1c00dd0776	xprtrmda: Reduce calls to ib_poll_cq() in completion handlers Change the completion handlers to grab up to 16 items per ib_poll_cq() call. No extra ib_poll_cq() is needed if fewer than 16 items are returned. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:44 -04:00
Chuck Lever	7f23f6f6e3	xprtrmda: Reduce lock contention in completion handlers Skip the ib_poll_cq() after re-arming, if the provider knows there are no additional items waiting. (Have a look at commit `ed23a727` for more details). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:43 -04:00
Chuck Lever	fc66448549	xprtrdma: Split the completion queue The current CQ handler uses the ib_wc.opcode field to distinguish between event types. However, the contents of that field are not reliable if the completion status is not IB_WC_SUCCESS. When an error completion occurs on a send event, the CQ handler schedules a tasklet with something that is not a struct rpcrdma_rep. This is never correct behavior, and sometimes it results in a panic. To resolve this issue, split the completion queue into a send CQ and a receive CQ. The send CQ handler now handles only struct rpcrdma_mw wr_id's, and the receive CQ handler now handles only struct rpcrdma_rep wr_id's. Fix suggested by Shirley Ma <shirley.ma@oracle.com> Reported-by: Rafael Reiter <rafael.reiter@ims.co.at> Fixes: `5c635e09ce` BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=73211 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Klemens Senn <klemens.senn@ims.co.at> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:42 -04:00
Chuck Lever	7f1d54191e	xprtrdma: Make rpcrdma_ep_destroy() return void Clean up: rpcrdma_ep_destroy() returns a value that is used only to print a debugging message. rpcrdma_ep_destroy() already prints debugging messages in all error cases. Make rpcrdma_ep_destroy() return void instead. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:41 -04:00
Chuck Lever	13c9ff8f67	xprtrdma: Simplify rpcrdma_deregister_external() synopsis Clean up: All remaining callers of rpcrdma_deregister_external() pass NULL as the last argument, so remove that argument. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:40 -04:00
Chuck Lever	cdd9ade711	xprtrdma: mount reports "Invalid mount option" if memreg mode not supported If the selected memory registration mode is not supported by the underlying provider/HCA, the NFS mount command reports that there was an invalid mount option, and fails. This is misleading. Reporting a problem allocating memory is a lot closer to the truth. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:39 -04:00
Chuck Lever	f10eafd3a6	xprtrdma: Fall back to MTHCAFMR when FRMR is not supported An audit of in-kernel RDMA providers that do not support the FRMR memory registration shows that several of them support MTHCAFMR. Prefer MTHCAFMR when FRMR is not supported. If MTHCAFMR is not supported, only then choose ALLPHYSICAL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:39 -04:00
Chuck Lever	0ac531c183	xprtrdma: Remove REGISTER memory registration mode All kernel RDMA providers except amso1100 support either MTHCAFMR or FRMR, both of which are faster than REGISTER. amso1100 can continue to use ALLPHYSICAL. The only other ULP consumer in the kernel that uses the reg_phys_mr verb is Lustre. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:38 -04:00
Chuck Lever	b45ccfd25d	xprtrdma: Remove MEMWINDOWS registration modes The MEMWINDOWS and MEMWINDOWS_ASYNC memory registration modes were intended as stop-gap modes before the introduction of FRMR. They are now considered obsolete. MEMWINDOWS_ASYNC is also considered unsafe because it can leave client memory registered and exposed for an indeterminant time after each I/O. At this point, the MEMWINDOWS modes add needless complexity, so remove them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:37 -04:00
Chuck Lever	03ff8821eb	xprtrdma: Remove BOUNCEBUFFERS memory registration mode Clean up: This memory registration mode is slow and was never meant for use in production environments. Remove it to reduce implementation complexity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:37 -04:00
Chuck Lever	254f91e2fa	xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context An IB provider can invoke rpcrdma_conn_func() in an IRQ context, thus rpcrdma_conn_func() cannot be allowed to directly invoke generic RPC functions like xprt_wake_pending_tasks(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:36 -04:00
Allen Andrews	4034ba0423	nfs-rdma: Fix for FMR leaks Two memory region leaks were found during testing: 1. rpcrdma_buffer_create: While allocating RPCRDMA_FRMR's ib_alloc_fast_reg_mr is called and then ib_alloc_fast_reg_page_list is called. If ib_alloc_fast_reg_page_list returns an error it bails out of the routine dropping the last ib_alloc_fast_reg_mr frmr region creating a memory leak. Added code to dereg the last frmr if ib_alloc_fast_reg_page_list fails. 2. rpcrdma_buffer_destroy: While cleaning up, the routine will only free the MR's on the rb_mws list if there are rb_send_bufs present. However, in rpcrdma_buffer_create while the rb_mws list is being built if one of the MR allocation requests fail after some MR's have been allocated on the rb_mws list the routine never gets to create any rb_send_bufs but instead jumps to the rpcrdma_buffer_destroy routine which will never free the MR's on rb_mws list because the rb_send_bufs were never created. This leaks all the MR's on the rb_mws list that were created prior to one of the MR allocations failing. Issue(2) was seen during testing. Our adapter had a finite number of MR's available and we created enough connections to where we saw an MR allocation failure on our Nth NFS connection request. After the kernel cleaned up the resources it had allocated for the Nth connection we noticed that FMR's had been leaked due to the coding error described above. Issue(1) was seen during a code review while debugging issue(2). Signed-off-by: Allen Andrews <allen.andrews@emulex.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:35 -04:00
Steve Wise	0fc6c4e7bb	xprtrdma: mind the device's max fast register page list depth Some rdma devices don't support a fast register page list depth of at least RPCRDMA_MAX_DATA_SEGS. So xprtrdma needs to chunk its fast register regions according to the minimum of the device max supported depth or RPCRDMA_MAX_DATA_SEGS. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2014-06-04 08:56:33 -04:00
Kinglong Mee	a48fd0f9f7	SUNRPC/NFSD: Remove using of dprintk with KERN_WARNING When debugging, rpc prints messages from dprintk(KERN_WARNING ...) with "^A4" prefixed, [ 2780.339988] ^A4nfsd: connect from unprivileged port: 127.0.0.1, port=35316 Trond tells, > dprintk != printk. We have NEVER supported dprintk(KERN_WARNING...) This patch removes using of dprintk with KERN_WARNING. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-30 20:25:28 -04:00
J. Bruce Fields	a5cddc885b	nfsd4: better reservation of head space for krb5 RPC_MAX_AUTH_SIZE is scattered around several places. Better to set it once in the auth code, where this kind of estimate should be made. And while we're at it we can leave it zero when we're not using krb5i or krb5p. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-30 17:32:17 -04:00
J. Bruce Fields	db3f58a95b	rpc: define xdr_restrict_buflen With this xdr_reserve_space can help us enforce various limits. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-30 17:32:01 -04:00
J. Bruce Fields	2825a7f907	nfsd4: allow encoding across page boundaries After this we can handle for example getattr of very large ACLs. Read, readdir, readlink are still special cases with their own limits. Also we can't handle a new operation starting close to the end of a page. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-30 17:31:54 -04:00
J. Bruce Fields	3e19ce762b	rpc: xdr_truncate_encode This will be used in the server side in a few cases: - when certain operations (read, readdir, readlink) fail after encoding a partial response. - when we run out of space after encoding a partial response. - in readlink, where we initially reserve PAGE_SIZE bytes for data, then truncate to the actual size. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-30 17:31:47 -04:00
David Rientjes	c6c8fe79a8	net, sunrpc: suppress allocation warning in rpc_malloc() rpc_malloc() allocates with GFP_NOWAIT without making any attempt at reclaim so it easily fails when low on memory. This ends up spamming the kernel log: SLAB: Unable to allocate memory on node 0 (gfp=0x4000) cache: kmalloc-8192, object size: 8192, order: 1 node 0: slabs: 207/207, objs: 207/207, free: 0 rekonq: page allocation failure: order:1, mode:0x204000 CPU: 2 PID: 14321 Comm: rekonq Tainted: G O 3.15.0-rc3-12.gfc9498b-desktop+ #6 Hardware name: System manufacturer System Product Name/M4A785TD-V EVO, BIOS 2105 07/23/2010 0000000000000000 ffff880010ff17d0 ffffffff815e693c 0000000000204000 ffff880010ff1858 ffffffff81137bd2 0000000000000000 0000001000000000 ffff88011ffebc38 0000000000000001 0000000000204000 ffff88011ffea000 Call Trace: [<ffffffff815e693c>] dump_stack+0x4d/0x6f [<ffffffff81137bd2>] warn_alloc_failed+0xd2/0x140 [<ffffffff8113be19>] __alloc_pages_nodemask+0x7e9/0xa30 [<ffffffff811824a8>] kmem_getpages+0x58/0x140 [<ffffffff81183de6>] fallback_alloc+0x1d6/0x210 [<ffffffff81183be3>] ____cache_alloc_node+0x123/0x150 [<ffffffff81185953>] __kmalloc+0x203/0x490 [<ffffffffa06b0ee2>] rpc_malloc+0x32/0xa0 [sunrpc] [<ffffffffa06a6999>] call_allocate+0xb9/0x170 [sunrpc] [<ffffffffa06b19d8>] __rpc_execute+0x88/0x460 [sunrpc] [<ffffffffa06b2da9>] rpc_execute+0x59/0xc0 [sunrpc] [<ffffffffa06a932b>] rpc_run_task+0x6b/0x90 [sunrpc] [<ffffffffa077b5c1>] nfs4_call_sync_sequence+0x51/0x80 [nfsv4] [<ffffffffa077d45d>] _nfs4_do_setattr+0x1ed/0x280 [nfsv4] [<ffffffffa0782a72>] nfs4_do_setattr+0x72/0x180 [nfsv4] [<ffffffffa078334c>] nfs4_proc_setattr+0xbc/0x140 [nfsv4] [<ffffffffa074a7e8>] nfs_setattr+0xd8/0x240 [nfs] [<ffffffff811baa71>] notify_change+0x231/0x380 [<ffffffff8119cf5c>] chmod_common+0xfc/0x120 [<ffffffff8119df80>] SyS_chmod+0x40/0x90 [<ffffffff815f4cfd>] system_call_fastpath+0x1a/0x1f ... If the allocation fails, simply return NULL and avoid spamming the kernel log. Reported-by: Marc Dietrich <marvin24@gmx.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-05-29 11:11:51 -04:00
Tom Herbert	0f8066bd48	sunrpc: Remove sk_no_check setting Setting sk_no_check to UDP_CSUM_NORCV seems to have no effect. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-23 16:28:53 -04:00
NeilBrown	ef11ce2487	SUNRPC: track whether a request is coming from a loop-back interface. If an incoming NFS request is coming from the local host, then nfsd will need to perform some special handling. So detect that possibility and make the source visible in rq_local. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-22 15:59:18 -04:00
Trond Myklebust	c789102c20	SUNRPC: Fix a module reference leak in svc_handle_xprt If the accept() call fails, we need to put the module reference. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-22 15:57:22 -04:00
Chuck Lever	16e4d93f6d	NFSD: Ignore client's source port on RDMA transports An NFS/RDMA client's source port is meaningless for RDMA transports. The transport layer typically sets the source port value on the connection to a random ephemeral port. Currently, NFS server administrators must specify the "insecure" export option to enable clients to access exports via RDMA. But this means NFS clients can access such an export via IP using an ephemeral port, which may not be desirable. This patch eliminates the need to specify the "insecure" export option to allow NFS/RDMA clients access to an export. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-22 15:55:48 -04:00
Trond Myklebust	7a9a7b774f	SUNRPC: Fix a module reference issue in rpcsec_gss We're not taking a reference in the case where _gss_mech_get_by_pseudoflavor loops without finding the correct rpcsec_gss flavour, so why are we releasing it? Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>	2014-05-18 13:47:14 -04:00
Kinglong Mee	ecca063b31	SUNRPC: Fix printk that is not only for nfsd Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2014-05-08 14:59:51 -04:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Linus Torvalds	454fd351f2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull yet more networking updates from David Miller: 1) Various fixes to the new Redpine Signals wireless driver, from Fariya Fatima. 2) L2TP PPP connect code takes PMTU from the wrong socket, fix from Dmitry Petukhov. 3) UFO and TSO packets differ in whether they include the protocol header in gso_size, account for that in skb_gso_transport_seglen(). From Florian Westphal. 4) If VLAN untagging fails, we double free the SKB in the bridging output path. From Toshiaki Makita. 5) Several call sites of sk->sk_data_ready() were referencing an SKB just added to the socket receive queue in order to calculate the second argument via skb->len. This is dangerous because the moment the skb is added to the receive queue it can be consumed in another context and freed up. It turns out also that none of the sk->sk_data_ready() implementations even care about this second argument. So just kill it off and thus fix all these use-after-free bugs as a side effect. 6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti. 7) pktgen needs to do locking properly for LLTX devices, from Daniel Borkmann. 8) xen-netfront driver initializes TX array entries in RX loop :-) From Vincenzo Maffione. 9) After refactoring, some tunnel drivers allow a tunnel to be configured on top itself. Fix from Nicolas Dichtel. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits) vti: don't allow to add the same tunnel twice gre: don't allow to add the same tunnel twice drivers: net: xen-netfront: fix array initialization bug pktgen: be friendly to LLTX devices r8152: check RTL8152_UNPLUG net: sun4i-emac: add promiscuous support net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO net: ipv6: Fix oif in TCP SYN+ACK route lookup. drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts drivers: net: cpsw: discard all packets received when interface is down net: Fix use after free by removing length arg from sk_data_ready callbacks. Drivers: net: hyperv: Address UDP checksum issues Drivers: net: hyperv: Negotiate suitable ndis version for offload support Drivers: net: hyperv: Allocate memory for all possible per-pecket information bridge: Fix double free and memory leak around br_allowed_ingress bonding: Remove debug_fs files when module init fails i40evf: program RSS LUT correctly i40evf: remove open-coded skb_cow_head ixgb: remove open-coded skb_cow_head igbvf: remove open-coded skb_cow_head ...	2014-04-12 17:31:22 -07:00
David S. Miller	676d23690f	net: Fix use after free by removing length arg from sk_data_ready callbacks. Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-11 16:15:36 -04:00

1 2 3 4 5 ...

2385 Commits