Commit Graph

3745 Commits

Author SHA1 Message Date
Linus Torvalds 2d4fe27850 Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "Lots of exciting new features in the NVM Express driver this time,
  including support for emulating SCSI commands, discard support and the
  ability to submit per-sector metadata with I/Os.

  It's still mostly bugfixes though!"

* git://git.infradead.org/users/willy/linux-nvme: (27 commits)
  NVMe: Use user defined admin ioctl timeout
  NVMe: Simplify Firmware Activate code slightly
  NVMe: Only clear the enable bit when disabling controller
  NVMe: Wait for device to acknowledge shutdown
  NVMe: Schedule timeout for sync commands
  NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
  NVMe: Device specific stripe size handling
  NVMe: Split non-mergeable bio requests
  NVMe: Remove dead code in nvme_dev_add
  NVMe: Check for NULL memory in nvme_dev_add
  NVMe: Fix error clean-up on nvme_alloc_queue
  NVMe: Free admin queue on request_irq error
  NVMe: Add scsi unmap to SG_IO
  NVMe: queue usage fixes in nvme-scsi
  NVMe: Set TASK_INTERRUPTIBLE before processing queues
  NVMe: Add a character device for each nvme device
  NVMe: Fix endian-related problems in user I/O submission path
  NVMe: Fix I/O cancellation status on big-endian machines
  NVMe: Fix sparse warnings in scsi emulation
  NVMe: Don't fail initialisation unnecessarily
  ...
2013-05-09 16:35:00 -07:00
Keith Busch 94f370cab6 NVMe: Use user defined admin ioctl timeout
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-09 16:03:50 -04:00
Linus Torvalds ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds 4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kent's prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejun's changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework; SCSI patches exist on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Matthew Wilcox ab3ea5bf37 NVMe: Simplify Firmware Activate code slightly
Add definitions for the three Firmware Activate actions, and change the
SCSI translation code to construct the command into a temporary variable
instead of translating the endianness back-and-forth.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
2013-05-08 09:55:05 -04:00
Matthew Wilcox 44af146a84 NVMe: Only clear the enable bit when disabling controller
Many of the bits in the Controller Configuration register may only be
modified when the Enable bit is clear.  Clearing them at the same time
as the Enable bit might be OK, but let's play it safe and only touch the
Enable bit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:54:31 -04:00
Matthew Wilcox ba47e3865e NVMe: Wait for device to acknowledge shutdown
A recent update to the specification makes it clear that the host
is expected to wait for the device to acknowledge the Enable bit
transitioning to 0 as well as waiting for the device to acknowledge a
transition to 1.

Reported-by: Khosrow Panah <Khosrow.Panah@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2013-05-08 09:53:49 -04:00
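The two commits above (clear only the Enable bit, then wait for the device to acknowledge the transition) can be sketched in user space as follows. This is a minimal illustration, not the driver's code: `struct fake_ctrl`, `read_csts()`, `wait_ready()` and `disable_ctrl()` are hypothetical names, and the simulated `lag` field stands in for a real device that takes some time to update its Ready bit. The bit positions (Enable is bit 0 of Controller Configuration, Ready is bit 0 of Controller Status) follow the NVMe register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CC_ENABLE (1u << 0)   /* Enable bit in Controller Configuration */
#define CSTS_RDY  (1u << 0)   /* Ready bit in Controller Status */

/* Hypothetical stand-in for a memory-mapped controller. */
struct fake_ctrl {
    uint32_t cc;
    uint32_t csts;
    unsigned lag;             /* polls remaining before RDY follows EN */
};

/* Simulated register read: RDY tracks EN once the lag has elapsed. */
static uint32_t read_csts(struct fake_ctrl *c)
{
    if (c->lag > 0)
        c->lag--;
    else if (c->cc & CC_ENABLE)
        c->csts |= CSTS_RDY;
    else
        c->csts &= ~CSTS_RDY;
    return c->csts;
}

/* Poll until RDY matches `enabled`; 0 on success, -1 if polls run out. */
static int wait_ready(struct fake_ctrl *c, bool enabled, unsigned max_polls)
{
    uint32_t want = enabled ? CSTS_RDY : 0;

    assert(max_polls > 0);
    while (max_polls--) {
        if ((read_csts(c) & CSTS_RDY) == want)
            return 0;
    }
    return -1;
}

/* Disable: touch only the Enable bit, then wait for RDY to drop. */
static int disable_ctrl(struct fake_ctrl *c, unsigned max_polls)
{
    c->cc &= ~CC_ENABLE;      /* leave every other CC field untouched */
    return wait_ready(c, false, max_polls);
}
```

A real driver would sleep between polls and bound the wait by a time deadline rather than a poll count; the counter here only keeps the sketch deterministic.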
Linus Torvalds 292088ee03 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "A couple of fixes + getting rid of __blkdev_put() return value"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  proc: Use PDE attribute setting accessor functions
  make blkdev_put() return void
  block_device_operations->release() should return void
  mtd_blktrans_ops->release() should return void
  hfs: SMP race on directory close()
2013-05-07 15:14:53 -07:00
Al Viro db2a144bed block_device_operations->release() should return void
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:21 -04:00
Linus Torvalds 91f8575685 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Alex Elder:
 "This is a big pull.

  Most of it is the culmination of Alex's work to implement RBD image
  layering, which is now complete (yay!).

  There is also some work from Yan to fix i_mutex behavior surrounding
  writes in cephfs, a sync write fix, a fix for RBD images that get
  resized while they are mapped, and a few patches from me that resolve
  annoying auth warnings and fix several bugs in the ceph auth code."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (254 commits)
  rbd: fix image request leak on parent read
  libceph: use slab cache for osd client requests
  libceph: allocate ceph message data with a slab allocator
  libceph: allocate ceph messages with a slab allocator
  rbd: allocate image object names with a slab allocator
  rbd: allocate object requests with a slab allocator
  rbd: allocate name separate from obj_request
  rbd: allocate image requests with a slab allocator
  rbd: use binary search for snapshot lookup
  rbd: clear EXISTS flag if mapped snapshot disappears
  rbd: kill off the snapshot list
  rbd: define rbd_snap_size() and rbd_snap_features()
  rbd: use snap_id not index to look up snap info
  rbd: look up snapshot name in names buffer
  rbd: drop obj_request->version
  rbd: drop rbd_obj_method_sync() version parameter
  rbd: more version parameter removal
  rbd: get rid of some version parameters
  rbd: stop tracking header object version
  rbd: snap names are pointer to constant data
  ...
2013-05-06 13:11:19 -07:00
Linus Torvalds 736a2dd257 Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull virtio & lguest updates from Rusty Russell:
 "Lots of virtio work which wasn't quite ready for last merge window.

  Plus I dived into lguest again, reworking the pagetable code so we can
  move the switcher page: our fixmaps sometimes take more than 2MB now..."

Ugh.  Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.

* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
  caif_virtio: Remove bouncing email addresses
  lguest: improve code readability in lg_cpu_start.
  virtio-net: fill only rx queues which are being used
  lguest: map Switcher below fixmap.
  lguest: cache last cpu we ran on.
  lguest: map Switcher text whenever we allocate a new pagetable.
  lguest: don't share Switcher PTE pages between guests.
  lguest: expost switcher_pages array (as lg_switcher_pages).
  lguest: extract shadow PTE walking / allocating.
  lguest: make check_gpte et. al return bool.
  lguest: assume Switcher text is a single page.
  lguest: rename switcher_page to switcher_pages.
  lguest: remove RESERVE_MEM constant.
  lguest: check vaddr not pgd for Switcher protection.
  lguest: prepare to make SWITCHER_ADDR a variable.
  virtio: console: replace EMFILE with EBUSY for already-open port
  virtio-scsi: reset virtqueue affinity when doing cpu hotplug
  virtio-scsi: introduce multiqueue support
  virtio-scsi: push vq lock/unlock into virtscsi_vq_done
  virtio-scsi: pass struct virtio_scsi to virtqueue completion function
  ...
2013-05-02 14:14:04 -07:00
Keith Busch 78f8d2577b NVMe: Schedule timeout for sync commands
Schedule a timeout on sync commands in case the command times out and
the device is not being polled for timeouts. This prevents device removal
from hanging forever if the device has stopped responding.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:36:02 -04:00
Keith Busch f410c680b5 NVMe: Meta-data support in NVME_IOCTL_SUBMIT_IO
This adds support for namespaces with separate meta-data formats in the
submit io ioctl. The meta-data buffer has to be contiguous, so such
a buffer is allocated and the mapped user pages are copied to/from this
buffer for write/read commands.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 15:35:09 -04:00
Keith Busch 159b67d7ae NVMe: Device specific stripe size handling
We have an nvme device that has a concept of a stripe size. IO requests
that do not transfer data crossing a stripe boundary have greater
performance compared to IO that does cross it. This patch sets the
stripe size for the device if the device and vendor ids match one with
this feature and splits IO requests that cross the stripe boundary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:41:05 -04:00
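The stripe-boundary split described above comes down to a little arithmetic on sector numbers. The sketch below is an illustration under stated assumptions, not the driver's implementation: the function names are hypothetical, sizes are counted in sectors, and the stripe size is assumed to be a power of two so the offset within a stripe can be taken with a mask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sectors from `sector` up to the next stripe boundary.
 * Assumes stripe_sectors is a non-zero power of two (sketch assumption).
 */
static uint32_t sectors_to_boundary(uint64_t sector, uint32_t stripe_sectors)
{
    assert(stripe_sectors != 0 &&
           (stripe_sectors & (stripe_sectors - 1)) == 0);
    return stripe_sectors - (uint32_t)(sector & (stripe_sectors - 1));
}

/* Length of the first piece: the whole request if it fits in the
 * current stripe, otherwise only the part up to the boundary. */
static uint32_t first_chunk(uint64_t sector, uint32_t nr_sectors,
                            uint32_t stripe_sectors)
{
    uint32_t room = sectors_to_boundary(sector, stripe_sectors);

    return nr_sectors < room ? nr_sectors : room;
}

/* Number of commands a request turns into once split at boundaries. */
static unsigned count_chunks(uint64_t sector, uint32_t nr_sectors,
                             uint32_t stripe_sectors)
{
    unsigned n = 0;

    while (nr_sectors > 0) {
        uint32_t len = first_chunk(sector, nr_sectors, stripe_sectors);

        sector += len;
        nr_sectors -= len;
        n++;
    }
    return n;
}
```

With a 256-sector stripe, a 100-sector request starting at sector 200 has 56 sectors of room before the boundary, so it becomes two commands of 56 and 44 sectors.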
Keith Busch 427e970801 NVMe: Split non-mergeable bio requests
It is possible a bio request cannot be submitted as a single NVMe IO
command if the bio_vec is not mergeable with the NVMe PRP alignment
constraints. This condition was handled by submitting an IO for the
mergeable portion then submitting a follow on IO for the remaining data
after the previous IO completes. The remainder to be sent was tracked
by manipulating the bio->bi_idx and bio->bi_sector. This patch splits
the request as many times as necessary and submits the bios together.

Since submitting the bio may cause it to be requeued on split,
nvme_resubmit_bios had to be modified to remove the wait queue when
the bio list is empty prior to submitting the bio since a split would
have added the wait queue a second time, corrupting the wait queue head
task list.

There are a few other benefits from doing this: it fixes a potential
issue with the previous handling of a non-mergeable bio as the requeuing
method could use an unlocked nvme_queue if the callback isn't
invoked on the queue's associated cpu; it will be possible to retry a
failed bio if desired at some later time since it does not manipulate
the original bio; the bio integrity extensions require the bio to be in
its original condition for the checks to work correctly if we implement
the end-to-end data protection in the future.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:38:59 -04:00
Keith Busch cbb6218fd4 NVMe: Remove dead code in nvme_dev_add
There is no situation that could occur where we could error out of this
function and require cleaning up allocated namespaces.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:36:45 -04:00
Keith Busch a9ef4343af NVMe: Check for NULL memory in nvme_dev_add
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:35:44 -04:00
Keith Busch 68b8eca5f8 NVMe: Fix error clean-up on nvme_alloc_queue
The nvme_queue's depth is not set if we fail to allocate submission queue
entries, which was being used to determine how much coherent memory to
free on error. Use the depth variable instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:34:35 -04:00
Keith Busch 025c557a71 NVMe: Free admin queue on request_irq error
Fixes a potential memory leak if requesting the admin queue irq fails.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:33:53 -04:00
Keith Busch ec50373350 NVMe: Add scsi unmap to SG_IO
Translates a scsi unmap request from SG_IO ioctl to NVMe
data-set-management deallocate.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:32:08 -04:00
Keith Busch 14385de117 NVMe: queue usage fixes in nvme-scsi
Fixes nvme queue usages in scsi-to-nvme translation code to not get
a queue more often than it is being put, and not use the queue in an
unsafe way without it being locked.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-05-02 14:30:53 -04:00
Alex Elder b5b09be30c rbd: fix image request leak on parent read
When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted.  When that completes, the callback routine was
not releasing that parent image request.  Fix that.

The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.

This resolves:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 12:15:28 -05:00
Alex Elder 78c2a44aae rbd: allocate image object names with a slab allocator
The names of objects used for image object requests are always fixed
size.  So create a slab cache to manage them.  Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder 868311b1eb rbd: allocate object requests with a slab allocator
Create a slab cache to manage rbd_obj_request allocation.  We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:30 -05:00
Alex Elder f907ad5596 rbd: allocate name separate from obj_request
The next patch will define a slab allocator for object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder 1c2a9dfe21 rbd: allocate image requests with a slab allocator
Create a slab cache to manage rbd_img_request allocation.  Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:29 -05:00
Alex Elder 30d1cff817 rbd: use binary search for snapshot lookup
Use bsearch(3) to make snapshot lookup by id more efficient.  (There
could be thousands of snapshots, and conceivably many more.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:58:17 -05:00
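A snapshot lookup with bsearch(3), as the commit above describes, might look like the sketch below. It assumes the snapshot ids are held in an array sorted in descending order (an assumption of this example, so the comparator is reversed); `snapid_compare_reverse()` and `snap_index()` are illustrative names rather than a confirmed copy of the rbd code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Comparator for an array sorted in DESCENDING order (sketch assumption):
 * a larger key sorts before a smaller element, so it compares as "less".
 */
static int snapid_compare_reverse(const void *key, const void *elem)
{
    uint64_t k = *(const uint64_t *)key;
    uint64_t e = *(const uint64_t *)elem;

    if (k < e)
        return 1;
    if (k > e)
        return -1;
    return 0;
}

/* Index of snap_id within ids[], or -1 if it is not present. */
static long snap_index(const uint64_t *ids, size_t count, uint64_t snap_id)
{
    const uint64_t *found;

    assert(ids != NULL);
    found = bsearch(&snap_id, ids, count, sizeof(*ids),
                    snapid_compare_reverse);
    return found ? (long)(found - ids) : -1;
}
```

The lookup costs O(log n) comparisons instead of a linear scan, which is the point when an image carries thousands of snapshots.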
Alex Elder 15228ede7d rbd: clear EXISTS flag if mapped snapshot disappears
This functionality inadvertently disappeared in the last patch.

Image snapshots can get removed at just about any time.  In
particular, a snapshot can disappear even if it is in use by an rbd
client as a mapped image.

The rbd client deals with such a disappearance by responding to new
requests with ENXIO.  This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.

This patch (re-)implements the clearing of that flag.

Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context.  If it is not, clear the
flag.

It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.

This resolves:
    http://tracker.ceph.com/issues/4880

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-02 11:57:03 -05:00
Alex Elder 33dca39f5c rbd: kill off the snapshot list
We no longer use the snapshot list for anything.  When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh.  The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.

So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.

This resolves:
    http://tracker.ceph.com/issues/4868

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:22 -07:00
Alex Elder 2ad3d7167e rbd: define rbd_snap_size() and rbd_snap_features()
This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.

Define rbd_snap_id_by_name() to look up a snapshot id given its
name.  This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.

Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id.  All this gets done
in rbd_dev_mapping_set().

As a result, snap_by_name() is no longer needed, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:20 -07:00
Alex Elder 54cac61fb6 rbd: use snap_id not index to look up snap info
In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.

This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().

This means the following functions take a snapshot id where they
previously used an index value:
    rbd_dev_snap_info()
    rbd_dev_v1_snap_info()
    rbd_dev_v2_snap_info()

A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:19 -07:00
Alex Elder 9682fc6d3a rbd: look up snapshot name in names buffer
Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.

Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name().  Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:18 -07:00
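Scanning a names buffer and returning a dynamically-allocated copy, as described above, can be sketched as follows. The layout is an assumption of this example: names are stored back to back, each NUL-terminated, in the same order as the snapshot ids, so the position of an id in the id array gives `which`. `snap_name_dup()` is a hypothetical user-space helper, using malloc() where kernel code would use a kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of the which-th name in a buffer holding
 * `count` back-to-back NUL-terminated strings, or NULL if `which`
 * is out of range or allocation fails. The caller frees the result.
 */
static char *snap_name_dup(const char *names, size_t count, size_t which)
{
    const char *p = names;
    size_t len;
    char *copy;

    assert(names != NULL);
    if (which >= count)
        return NULL;

    while (which--)
        p += strlen(p) + 1;   /* skip one name and its terminator */

    len = strlen(p) + 1;      /* include the terminating NUL */
    copy = malloc(len);
    if (copy)
        memcpy(copy, p, len);
    return copy;
}
```

Because the helper already returns a fresh copy, its caller has no reason to duplicate the string again, which matches the change the commit describes.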
Alex Elder dedc81ea84 rbd: drop obj_request->version
Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:17 -07:00
Alex Elder e2a58ee55b rbd: drop rbd_obj_method_sync() version parameter
Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:16 -07:00
Alex Elder cc4a38bdd5 rbd: more version parameter removal
Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:15 -07:00
Alex Elder 7097f8df6e rbd: get rid of some version parameters
Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out.  The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that.  As a result, these parameters are never actually
needed or used, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:14 -07:00
Alex Elder b21ebdddeb rbd: stop tracking header object version
The rbd code takes care to maintain the version of the header
object.  This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.

The mechanism was never fully implemented, however.  And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.

The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both).  There is therefore no need to
maintain the header's object version any more, so stop doing so.

We'll be able to simplify some more rbd code in the next few patches
as a result of this.

This resolves:
    http://tracker.ceph.com/issues/3952

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:13 -07:00
Alex Elder cb75223d2b rbd: snap names are pointer to constant data
Make explicit that snapshot names don't change by making functions
return and take parameters that point to const qualified data.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:12 -07:00
Alex Elder a3fbe5d447 rbd: don't revalidate so much
Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called.  This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion.  Although an event like this *might*
indicate the image has changed size, most of the time it does not.

Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:11 -07:00
Alex Elder 96882f55c4 rbd: fix up the layering warning message
A warning gets spewed for any image being probed, including parent
images.  Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.

Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead.  People experimenting with layering
will know what the message means.

This is part of:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:10 -07:00
Alex Elder 812164f8c3 ceph: use ceph_create_snap_context()
Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:09 -07:00
Alex Elder b536f69a3a rbd: set up devices only for mapped images
Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().

A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.

This resolves:
    http://tracker.ceph.com/issues/4774

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:07 -07:00
Alex Elder 8ad42cd0c0 rbd: don't have device release destroy rbd_dev
Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it.  Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:05 -07:00
Alex Elder 6fd48b3be9 rbd: define rbd_dev_unprobe()
Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).

Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.

This means that:
    - rbd_dev_device_release() gets called when the last device
      reference gets dropped;
    - that undoes everything done by the rbd_dev_device_setup() call
      at the end of rbd_dev_image_probe() (and nothing more), ending
      by calling rbd_dev_image_release(); and
    - rbd_dev_image_release() undoes everything else done by
      rbd_dev_image_probe() (and this includes a call to
      rbd_dev_unprobe()).

This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:04 -07:00
Alex Elder 200a6a8be5 rbd: don't destroy rbd_dev in device release function
Rename rbd_dev_probe_finish() to be rbd_dev_device_setup().  Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).

Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.

Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:03 -07:00
Alex Elder 79ab7558aa rbd: drop module later
Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:02 -07:00
Alex Elder b644de2ba0 rbd: set up watch in rbd_dev_image_probe()
Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish().  Move
it all the way up to before doing the initial probe.  This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.

This resolves:
    http://tracker.ceph.com/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:01 -07:00
Alex Elder 96f03e08f9 rbd: don't bother checking whether order changes
When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally.  This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.

An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check.  The order really can't
ever change anyway--this was just a sanity check.  So get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:20:00 -07:00
Alex Elder 0d8189e175 rbd: don't clean up watch in device release function
Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped.  Instead, tear
it down when removing the device.  If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.

All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent().  For now, ignore
any errors that occur in this case.

Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:59 -07:00
Alex Elder 332bb12db9 rbd: define rbd_header_name()
Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01 21:19:58 -07:00