Commit Graph

58 Commits

Author SHA1 Message Date
Sage Weil cd634fb6ee libceph: subscribe to osdmap when cluster is full
When the cluster is marked full, subscribe to subsequent map updates to
ensure we find out promptly when it is no longer full.  This will prevent
us from spewing ENOSPC for (much) longer than necessary.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:11 -07:00
Sage Weil 7662d8ff57 libceph: handle new osdmap down/state change encoding
Old incrementals encode a 0 value (nearly always) when an osd goes down.
Change that to allow any state bit(s) to be flipped.  Special case 0 to
mean flip the CEPH_OSD_UP bit to mimic the old behavior.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-24 11:52:09 -07:00
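
The state-change rule described in 7662d8ff57 can be sketched roughly as follows. This is a minimal illustration, not the kernel code: the function name and the bit values assigned to OSD_EXISTS and OSD_UP are assumptions made for the example.

```c
#include <stdint.h>

/* Bit values are assumptions for this sketch. */
#define OSD_EXISTS (1u << 0)
#define OSD_UP     (1u << 1)

/*
 * Apply one incremental state entry to an OSD's state byte.
 * A value of 0 is the legacy "osd went down" encoding, so it is
 * treated as a request to flip the UP bit.
 */
static uint8_t apply_osd_state_change(uint8_t state, uint8_t xorstate)
{
	if (xorstate == 0)
		xorstate = OSD_UP;
	return (uint8_t)(state ^ xorstate);
}
```

With this rule an old-style entry of 0 and a new-style entry of OSD_UP have the same effect on an OSD that is currently up, while new incrementals are free to flip other bits as well.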
Sage Weil 9d6fcb081a ceph: check return value for start_request in writepages
Since we pass the nofail arg, we should never get an error; BUG if we do.
(And fix the function to not return an error if __map_request fails.)

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil a2a79609c0 libceph: add missing breaks in addr_set_port
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:05 -07:00
Sage Weil 0417788226 libceph: fix TAG_WAIT case
If we get a WAIT as a client something went wrong; error out.  And don't
fall through to an unrelated case.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:04 -07:00
Sage Weil 31456665a0 libceph: fix osdmap timestamp assignment
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil 12a2f643b0 libceph: use snprintf for unknown addrs
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:03 -07:00
Sage Weil 2dab036b8c libceph: use snprintf for formatting object name
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil e8f54ce169 libceph: fix uninitialized value when no get_authorizer method is set
If there is no get_authorizer method we set the out_kvec to a bogus
pointer.  The length is also zero in that case, so it doesn't much matter,
but it's better not to add the empty item in the first place.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:25:02 -07:00
Sage Weil 0da5d70369 libceph: handle connection reopen race with callbacks
If a connection is closed and/or reopened (ceph_con_close, ceph_con_open)
it can race with a callback.  con_work does various state checks for
closed or reopened sockets at the beginning, but drops con->mutex before
making callbacks.  We need to check for state bit changes after retaking
the lock to ensure we restart con_work and execute those CLOSED/OPENING
tests or else we may end up operating under stale assumptions.

In Jim's case, this was causing 'bad tag' errors.

There are four cases where we re-take the con->mutex inside con_work: catch
them all and return EAGAIN from try_{read,write} so that we can restart
con_work.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-19 11:21:05 -07:00
Sage Weil 4ad12621e4 libceph: fix ceph_osdc_alloc_request error checks
ceph_osdc_alloc_request returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:13 -07:00
Henry C Chang ca20892db7 libceph: fix ceph_msg_new error path
If memory allocation failed, calling ceph_msg_put() will cause GPF
since some of ceph_msg variables are not initialized first.

Fix Bug #970.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-05-03 09:28:11 -07:00
Linus Torvalds e6d2831834 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix linger request requeueing
2011-04-14 19:02:55 -07:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Sage Weil 77f38e0eea libceph: fix linger request requeueing
Fix the request transition from linger -> normal request.  The key is to
preserve r_osd and requeue on the same OSD.  Reregister as a normal request,
add the request to the proper queues, then unregister the linger.  Fix the
unregister helper to avoid clearing r_osd (and also simplify the parallel
check in __unregister_request()).

Reported-by: Henry Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-04-06 09:09:16 -07:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Tommi Virtanen 4b2a58abd1 libceph: Create a new key type "ceph".
This allows us to use existence of the key type as a feature test,
from userspace.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:24 -07:00
Tommi Virtanen e2c3d29b42 libceph: Get secret from the kernel keys api when mounting with key=NAME.
Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:19 -07:00
Tommi Virtanen 8323c3aa74 ceph: Move secret key parsing earlier.
This makes the base64 logic be contained in mount option parsing,
and prepares us for replacing the homebrew key management with the
kernel key retention service.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:16 -07:00
Sage Weil fbdb919048 libceph: fix null dereference when unregistering linger requests
We should only clear r_osd if we are neither registered as a linger nor a
regular request.  We may unregister as a linger while still registered as
a regular request (e.g., in reset_osd).  Incorrectly clearing r_osd there
leads to a null pointer dereference in __send_request.

Also simplify the parallel check in __unregister_request() where we just
removed r_osd_item and know it's empty.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 12:11:06 -07:00
Dan Carpenter 234af26ff1 ceph: unlock on error in ceph_osdc_start_request()
There was a missing unlock on the error path if __map_request() failed.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-29 08:59:54 -07:00
Mariusz Kozlowski 6b0ae4097c ceph: fix possible NULL pointer dereference
This patch fixes 'event_work' dereference before it is checked for NULL.

Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-26 13:41:20 -07:00
Sage Weil ef550f6f4f ceph: flush msgr_wq during mds_client shutdown
The release method for mds connections uses a backpointer to the
mds_client, so we need to flush the workqueue of any pending work (and
ceph_connection references) prior to freeing the mds_client.  This fixes
an oops easily triggered under UML by

 while true ; do mount ... ; umount ... ; done

Also fix an outdated comment: the flush in ceph_destroy_client only flushes
OSD connections out.  This bug is basically an artifact of the ceph ->
ceph+libceph conversion.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-25 13:27:48 -07:00
Yehuda Sadeh a40c4f10e3 libceph: add lingering request and watch/notify event framework
Lingering requests are requests that are sent to the OSD normally but
tracked also after we get a successful request.  This keeps the OSD
connection open and resends the original request if the object moves to
another OSD.  The OSD can then send notification messages back to us
if another client initiates a notify.

This framework will be used by RBD so that the client gets notification
when a snapshot is created by another node or tool.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-22 11:33:55 -07:00
Sage Weil 6f6c700675 libceph: fix osd request queuing on osdmap updates
If we send a request to osd A, and the request's pg remaps to osd B and
then back to A in quick succession, we need to resend the request to A. The
old code was only calling kick_requests after processing all incremental
maps in a message, so it was very possible to not resend a request that
needed to be resent.  This would make the osd eventually time out (at least
with the current default of osd timeouts enabled).

The correct approach is to scan requests on every map incremental.  This
patch refactors the kick code in a few ways:
 - all requests are either on req_lru (in flight), req_unsent (ready to
   send), or req_notarget (currently map to no up osd)
 - mapping always done by map_request (previous map_osds)
 - if the mapping changes, we requeue.  requests are resent only after all
   map incrementals are processed.
 - some osd reset code is moved out of kick_requests into a separate
   function
 - the "kick this osd" functionality is moved to kick_osd_requests, as it
   is unrelated to scanning for request->pg->osd mapping changes

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-21 12:24:19 -07:00
Tommi Virtanen b09734b1f4 libceph: Fix base64-decoding when input ends in newline.
It used to return -EINVAL because it thought the end was not aligned
to 4 bytes.

Clean up the superfluous src < end test in the if; the while loop itself
guarantees that.

Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-15 09:14:02 -07:00
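
As an illustration of the base64 fix in b09734b1f4, here is a minimal decoder sketch that skips newlines (including a trailing one) before checking that a full 4-character group remains. The names b64_val and b64_decode are hypothetical, and the code is written for this example rather than copied from the kernel.

```c
#include <errno.h>
#include <stddef.h>

/* Map one base64 character to its 6-bit value, or -1 if invalid. */
static int b64_val(char c)
{
	if (c >= 'A' && c <= 'Z') return c - 'A';
	if (c >= 'a' && c <= 'z') return c - 'a' + 26;
	if (c >= '0' && c <= '9') return c - '0' + 52;
	if (c == '+') return 62;
	if (c == '/') return 63;
	if (c == '=') return 0;	/* padding */
	return -1;
}

/* Decode len bytes at src into dst; returns bytes written or -EINVAL. */
static int b64_decode(unsigned char *dst, const char *src, size_t len)
{
	const char *end = src + len;
	int out = 0;

	while (src < end) {
		int a, b, c, d;

		if (*src == '\n') {	/* skip newlines before the length check */
			src++;
			continue;
		}
		if (end - src < 4)
			return -EINVAL;	/* incomplete group */
		a = b64_val(src[0]);
		b = b64_val(src[1]);
		c = b64_val(src[2]);
		d = b64_val(src[3]);
		if (a < 0 || b < 0 || c < 0 || d < 0)
			return -EINVAL;

		dst[out++] = (unsigned char)((a << 2) | (b >> 4));
		if (src[2] == '=')
			return out;
		dst[out++] = (unsigned char)(((b & 15) << 4) | (c >> 2));
		if (src[3] == '=')
			return out;
		dst[out++] = (unsigned char)(((c & 3) << 6) | d);
		src += 4;
	}
	return out;
}
```

Because the newline is consumed inside the loop, an input such as "aGV5\n" decodes to three bytes instead of failing the 4-byte alignment check.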
Sage Weil e00de341fd libceph: fix msgr standby handling
The standby logic used to be pretty dependent on the work requeueing
behavior that changed when we switched to WQ_NON_REENTRANT.  It was also
very fragile.

Restructure things so that:
 - We clear WRITE_PENDING when we set STANDBY.  This ensures we will
   requeue work when we wake up later.
 - con_work backs off if STANDBY is set.  There is nothing to do if we are
   in standby.
 - clear_standby() helper is called by both con_send() and con_keepalive(),
   the two actions that can wake us up again.  Move the connect_seq++
   logic here.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:25:05 -08:00
Sage Weil e76661d0a5 libceph: fix msgr keepalive flag
There was some broken keepalive code using a dead variable.  Shift to using
the proper bit flag.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:24:31 -08:00
Sage Weil 60bf8bf881 libceph: fix msgr backoff
With commit f363e45f we replaced a bunch of hacky workqueue mutual
exclusion logic with the WQ_NON_REENTRANT flag.  One piece of fallout is
that the exponential backoff breaks in certain cases:

 * con_work attempts to connect.
 * we get an immediate failure, and the socket state change handler queues
   immediate work.
 * con_work calls con_fault, we decide to back off, but can't queue delayed
   work.

In this case, we add a BACKOFF bit to make con_work reschedule delayed work
next time it runs (which should be immediately).

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-04 12:24:28 -08:00
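
The BACKOFF idea from 60bf8bf881 can be sketched as a small state machine. Everything here is an assumption for illustration: the struct layout, the helper names, and the BASE_DELAY and MAX_DELAY values are not taken from the kernel source.

```c
#include <stdbool.h>

#define BASE_DELAY 1u	/* assumed units and values */
#define MAX_DELAY  64u

struct conn {
	unsigned int delay;	/* current backoff delay */
	bool backoff;		/* set when delayed work could not be queued */
};

/*
 * Record a connection fault: grow the delay exponentially, and if the
 * delayed work cannot be queued right now, remember that with the
 * backoff flag so the next run of the worker can do it.
 */
static void note_fault(struct conn *c, bool can_queue_delayed)
{
	if (c->delay == 0)
		c->delay = BASE_DELAY;
	else if (c->delay < MAX_DELAY)
		c->delay *= 2;
	if (!can_queue_delayed)
		c->backoff = true;
}

/*
 * Called at the top of the worker. Returns the delay to reschedule
 * with, or 0 if the worker should proceed immediately.
 */
static unsigned int work_entry(struct conn *c)
{
	if (c->backoff) {
		c->backoff = false;
		return c->delay;
	}
	return 0;
}
```

The flag defers the rescheduling decision to the next worker run, which matches the commit's description that con_work reschedules delayed work "next time it runs".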
Sage Weil 692d20f576 libceph: retry after authorization failure
If we mark the connection CLOSED we will give up trying to reconnect to
this server instance.  That is appropriate for things like a protocol
version mismatch that won't change until the server is restarted, at which
point we'll get a new addr and reconnect.  An authorization failure like
this is probably due to the server not properly rotating its secret keys,
however, and should be treated as transient so that the normal backoff and
retry behavior kicks in.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 13:47:40 -08:00
Sage Weil 38815b7802 libceph: fix handling of short returns from get_user_pages
get_user_pages() can return fewer pages than we ask for.  We were returning
a bogus pointer/error code in that case.  Instead, loop until we get all
the pages we want or get an error we can return to the caller.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-03-03 13:47:39 -08:00
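
The retry loop described in 38815b7802 might look roughly like the sketch below. The pin_fn callback is a hypothetical stand-in for get_user_pages(), pin_at_most_three is a demo helper, and treating a return of 0 as -EFAULT is an assumption made here to keep the loop from spinning.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for get_user_pages(): pins up to `want` pages
 * starting at page index `first` and returns how many it pinned
 * (possibly fewer than requested), or a negative errno value.
 */
typedef int (*pin_fn)(size_t first, size_t want, void *ctx);

/*
 * Keep asking until all `total` pages are pinned or an error occurs.
 * Returns `total` on success or a negative errno value; on error the
 * caller is expected to release whatever was pinned so far.
 */
static int pin_all(pin_fn pin, void *ctx, size_t total)
{
	size_t got = 0;

	while (got < total) {
		int rc = pin(got, total - got, ctx);

		if (rc < 0)
			return rc;
		if (rc == 0)
			return -EFAULT;	/* assumption: no progress is an error */
		got += (size_t)rc;
	}
	return (int)got;
}

/* Demo helper: pins at most three pages per call and counts calls. */
static int pin_at_most_three(size_t first, size_t want, void *ctx)
{
	int *calls = ctx;

	(void)first;
	(*calls)++;
	return want < 3 ? (int)want : 3;
}
```

With the demo helper, a request for eight pages completes in three calls (3 + 3 + 2) rather than returning after the first short result.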
Linus Torvalds 8bd89ca220 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: keep reference to parent inode on ceph_dentry
  ceph: queue cap_snaps once per realm
  libceph: fix socket write error handling
  libceph: fix socket read error handling
2011-02-21 15:01:38 -08:00
Sage Weil 42961d2333 libceph: fix socket write error handling
Pass errors from writing to the socket up the stack.  If we get -EAGAIN,
return 0 from the helper to simplify the callers' checks.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:19:34 -08:00
Sage Weil 98bdb0aa00 libceph: fix socket read error handling
If we get EAGAIN when trying to read from the socket, it is not an error.
Return 0 from the helper in this case to simplify the error handling cases
in the caller (indirectly, try_read).

Fix try_read to pass any error to its caller (con_work) instead of almost
always returning 0.  This lets us respond to things like socket
disconnects.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-25 08:17:48 -08:00
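
The convention shared by 42961d2333 and 98bdb0aa00 (EAGAIN is not an error, everything else propagates) can be sketched as a small helper. The name sock_result is hypothetical; the real helpers wrap the kernel's socket send and receive calls.

```c
#include <errno.h>

/*
 * Normalize a raw socket result: -EAGAIN only means "nothing to do
 * right now", so report 0 bytes; any other negative value is a real
 * error and is passed up unchanged.
 */
static int sock_result(int r)
{
	if (r == -EAGAIN)
		return 0;
	return r;
}
```

Callers can then treat any negative return as a genuine failure, such as a disconnect, without special-casing EAGAIN at every call site.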
Linus Torvalds a170315420 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix cleanup when trying to mount inexistent image
  net/ceph: make ceph_msgr_wq non-reentrant
  ceph: fsc->*_wq's aren't used in memory reclaim path
  ceph: Always free allocated memory in osdmap_decode()
  ceph: Makefile: Remove unnessary code
  ceph: associate requests with opening sessions
  ceph: drop redundant r_mds field
  ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
  ceph: add dir_layout to inode
2011-01-13 10:25:24 -08:00
Tejun Heo f363e45fd1 net/ceph: make ceph_msgr_wq non-reentrant
ceph messenger code does a rather complex dancing around multithread
workqueue to make sure the same work item isn't executed concurrently
on different CPUs.  This restriction can be provided by workqueue with
WQ_NON_REENTRANT.

Make ceph_msgr_wq non-reentrant workqueue with the default concurrency
level and remove the QUEUED/BUSY logic.

* This removes backoff handling in con_work() but it couldn't reliably
  block execution of con_work() to begin with - queue_con() can be
  called after the work started but before BUSY is set.  It seems that
  it was an optimization for a rather cold path and can be safely
  removed.

* The number of concurrent work items is bound by the number of
  connections and connections are independent of each other.  With
  the default concurrency level, different connections will be
  executed independently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Jesper Juhl b0aee3516d ceph: Always free allocated memory in osdmap_decode()
Always free memory allocated to 'pi' in
net/ceph/osdmap.c::osdmap_decode().

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:14 -08:00
Sage Weil 6c0f3af72c ceph: add dir_layout to inode
Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function.  This is needed
because the old default Linux dcache hash function is extremely weak and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-01-12 15:15:12 -08:00
David S. Miller 17f7f4d9fc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/ipv4/fib_frontend.c
2010-12-26 22:37:05 -08:00
Linus Torvalds 9d5004fcf6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: handle partial result from get_user_pages
  ceph: mark user pages dirty on direct-io reads
  ceph: fix null pointer dereference in ceph_init_dentry for nfs reexport
  ceph: fix direct-io on non-page-aligned buffers
  ceph: fix msgr_init error path
2010-12-20 21:32:20 -08:00
Henry C Chang 361cf40519 ceph: handle partial result from get_user_pages
The get_user_pages() helper can return fewer than the requested pages.
Error out in that case, and clean up the partial result.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:55:59 -08:00
Henry C Chang b6aa5901c7 ceph: mark user pages dirty on direct-io reads
For read operation, we have to set the argument _write_ of get_user_pages
to 1 since we will write data to pages. Also, we need to SetPageDirty before
releasing these pages.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-17 09:54:40 -08:00
Sage Weil d96c9043d1 ceph: fix msgr_init error path
create_workqueue() returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-12-13 20:30:28 -08:00
David S. Miller fe6c791570 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/ath/ath9k/ar9003_eeprom.c
	net/llc/af_llc.c
2010-12-08 13:47:38 -08:00
Linus Torvalds a01af8e4a4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
  af_unix: limit recursion level
  pch_gbe driver: The wrong of initializer entry
  pch_gbe dreiver: chang author
  ucc_geth: fix ucc halt problem in half duplex mode
  inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners
  ehea: Add some info messages and fix an issue
  hso: fix disable_net
  NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty
  cxgb4vf: fix setting unicast/multicast addresses ...
  net, ppp: Report correct error code if unit allocation failed
  DECnet: don't leak uninitialized stack byte
  au1000_eth: fix invalid address accessing the MAC enable register
  dccp: fix error in updating the GAR
  tcp: restrict net.ipv4.tcp_adv_win_scale (#20312)
  netns: Don't leak others' openreq-s in proc
  Net: ceph: Makefile: Remove unnessary code
  vhost/net: fix rcu check usage
  econet: fix CVE-2010-3848
  econet: fix CVE-2010-3850
  econet: disallow NULL remote addr for sendmsg(), fixes CVE-2010-3849
  ...
2010-11-29 14:36:33 -08:00
Tracey Dent 4cb6a614ba Net: ceph: Makefile: Remove unnessary code
Remove the if and else conditional because the code is in mainline and there
is no need in it being there.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-27 17:39:29 -08:00
Linus Torvalds 3cbaa0f7a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  of/phylib: Use device tree properties to initialize Marvell PHYs.
  phylib: Add support for Marvell 88E1149R devices.
  phylib: Use common page register definition for Marvell PHYs.
  qlge: Fix incorrect usage of module parameters and netdev msg level
  ipv6: fix missing in6_ifa_put in addrconf
  SuperH IrDA: correct Baud rate error correction
  atl1c: Fix hardware type check for enabling OTP CLK
  net: allow GFP_HIGHMEM in __vmalloc()
  bonding: change list contact to netdev@vger.kernel.org
  e1000: fix screaming IRQ
2010-11-24 08:22:34 +09:00
Tracey Dent fa13bc3daa Net: ceph: Makefile: remove deprecated kbuild goal definitions
Changed Makefile to use <modules>-y instead of <modules>-objs
because -objs is deprecated and not mentioned in
Documentation/kbuild/makefiles.txt.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-22 08:16:10 -08:00
Eric Dumazet 7a1c8e5ab1 net: allow GFP_HIGHMEM in __vmalloc()
We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

In ceph, add the missing flag.

In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
cleaner and allows using HIGHMEM pages as well.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-21 10:04:04 -08:00
Sage Weil c5c6b19d4b ceph: explicitly specify page alignment in network messages
The alignment used for reading data into or out of pages used to be taken
from the data_off field in the message header.  This only worked as long
as the page alignment matched the object offset, breaking direct io to
non-page aligned offsets.

Instead, explicitly specify the page alignment next to the page vector
in the ceph_msg struct, and use that instead of the message header (which
probably shouldn't be trusted).  The alloc_msg callback is responsible for
filling in this field properly when it sets up the page vector.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-11-09 12:43:17 -08:00
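
To illustrate why the alignment in c5c6b19d4b has to travel with the page vector, the sketch below computes how many pages a transfer touches from an explicit alignment and length. The struct, the field names, and the 4096-byte page size are assumptions for this example and do not reflect the actual ceph_msg layout.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumes 4096-byte pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical message fragment: the alignment sits next to the pages. */
struct msg_pages {
	void **pages;		/* page vector set up by the allocator */
	size_t page_alignment;	/* offset of the data within the first page */
	size_t data_len;	/* number of data bytes */
};

/* Number of pages spanned by data_len bytes starting at page_alignment. */
static size_t pages_spanned(const struct msg_pages *m)
{
	size_t end = m->page_alignment + m->data_len;

	return ((end + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT) -
	       (m->page_alignment >> SKETCH_PAGE_SHIFT);
}
```

A 4096-byte transfer that starts 512 bytes into its first page spans two pages, not one, which is the kind of case that goes wrong when the alignment is inferred from an unrelated object offset.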