Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other backends do.
Build on the fact that all response structure flavors are actually
identical (the old code did make this assumption too).
This is XSA-216.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Today disconnecting xen-blkback is broken in case there are still
I/Os in flight: xen_blkif_disconnect() will bail out early without
releasing all resources in the hope it will be called again when
the last request has terminated. This, however, won't happen as
xen_blkif_free() won't be called on termination of the last running
request: xen_blkif_put() won't decrement the blkif refcnt to 0 as
xen_blkif_disconnect() didn't finish before thus some xen_blkif_put()
calls in xen_blkif_disconnect() didn't happen.
To solve this deadlock xen_blkif_disconnect() and
xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and
xen_blkif_get() but use some other way to do their accounting of
resources.
This at once fixes another error in xen_blkif_disconnect(): when it
returned early with -EBUSY for another ring than 0 it would call
xen_blkif_put() again for already handled rings on a subsequent call.
This will lead to inconsistencies in the refcnt handling.
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for 4.5, with the exception of
NVMe, which is in a separate branch and will be posted after this one.
This pull request contains:
- A set of bcache stability fixes, which have been acked by Kent.
These have been used and tested for more than a year by the
community, so it's about time that they got in.
- A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
and Markus Elfring, Oleg Drokin.
- A set of fixes for xen blkback/front from the usual suspects, (Bob,
Konrad) as well as community based fixes from Kiri, Julien, and
Peng.
- A 2038 time fix for sx8 from Shraddha, with a fix from me.
- A small mtip32xx cleanup from Zhu Yanjun.
- A null_blk division fix from Arnd"
* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
null_blk: use sector_div instead of do_div
mtip32xx: restrict variables visible in current code module
xen/blkfront: Fix crash if backend doesn't follow the right states.
xen/blkback: Fix two memory leaks.
xen/blkback: make st_ statistics per ring
xen/blkfront: Handle non-indirect grant with 64KB pages
xen-blkfront: Introduce blkif_ring_get_request
xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
xen/blkback: Free resources if connect_ring failed.
xen/blocks: Return -EXX instead of -1
xen/blkback: make pool of persistent grants and free pages per-queue
xen/blkback: get the number of hardware queues/rings from blkfront
xen/blkback: pseudo support for multi hardware queues/rings
xen/blkback: separate ring information out of struct xen_blkif
xen/blkfront: correct setting for xen_blkif_max_ring_order
xen/blkfront: make persistent grants pool per-queue
xen/blkfront: Remove duplicate setting of ->xbdev.
xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
xen/blkfront: negotiate number of queues/rings to be used with backend
xen/blkfront: split per device io_lock
...
Make st_* statistics per ring and the VBD sysfs would iterate over all the
rings.
Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Aligned the variables on the same column.
Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.
Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB
[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting
Results:
iops1: After patch "xen/blkfront: make persistent grants per-queue".
iops2: After this patch.
Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%)
With 4 queues after this commit we can get ~75% increase in IOPS, and
performance won't drop if increasing queue numbers.
Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Backend advertises "multi-queue-max-queues" to front, also get the negotiated
number from "multi-queue-num-queues" written by blkfront.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
"xen/blkback: get the number of hardware queues/rings from blkfront".
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align variables in the structures.
Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.
Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.
This patch is a preparation for supporting multi hardware queues/rings.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align the variables in the structure.
A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.
When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().
This is part of XSA155.
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity behaving as a
block backend on a non-modified Xen.
It's only necessary to adapt the ring size and the number of request per
indirect frames. The rest of the code is relying on the grant table
code.
Note that the grant table code is allocating a Linux page per grant
which will result to waste 6OKB for every grant when Linux is using 64KB
page granularity. This could be improved by sharing the page between
multiple grants.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
- Add "make xenconfig" to assist in generating configs for Xen guests.
- Preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests.
- Automatically use hvc0 as the default console in ARM guests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVkpoqAAoJEFxbo/MsZsTRu3IH/2AMPx2i65hoSqfHtGf3sz/z
XNfcidVmOElFVXGaW83m0tBWMemT5LpOGRfiq5sIo8xt/8xD2vozEkl/3kkf3RrX
EmZDw3E8vmstBdBTjWdovVhNenRc0m0pB5daS7wUdo9cETq1ag1L3BHTB3fEBApO
74V6qAfnhnq+snqWhRD3XAk3LKI0nWuWaV+5HsmxDtnunGhuRLGVs7mwxZGg56sM
mILA0eApGPdwyVVpuDe0SwV52V8E/iuVOWTcomGEN2+cRWffG5+QpHxQA8bOtF6O
KfqldiNXOY/idM+5+oSm9hespmdWbyzsFqmTYz0LvQvxE8eEZtHHB3gIcHkE8QU=
=danz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Xen features and cleanups for 4.2-rc0:
- add "make xenconfig" to assist in generating configs for Xen guests
- preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests
- automatically use hvc0 as the default console in ARM guests"
* tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
block/xen-blkback: s/nr_pages/nr_segs/
block/xen-blkfront: Remove invalid comment
block/xen-blkfront: Remove unused macro MAXIMUM_OUTSTANDING_BLOCK_REQS
arm/xen: Drop duplicate define mfn_to_virt
xen/grant-table: Remove unused macro SPP
xen/xenbus: client: Fix call of virt_to_mfn in xenbus_grant_ring
xen: Include xen/page.h rather than asm/xen/page.h
kconfig: add xenconfig defconfig helper
kconfig: clarify kvmconfig is for kvm
xen/pcifront: Remove usage of struct timeval
xen/tmem: use BUILD_BUG_ON() in favor of BUG_ON()
hvc_xen: avoid uninitialized variable warning
xenbus: avoid uninitialized variable warning
xen/arm: allow console=hvc0 to be omitted for guests
arm,arm64/xen: move Xen initialization earlier
arm/xen: Correctly check if the event channel interrupt is present
Make the code less confusing to read now that Linux may not have the
same page size as Xen.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Extend xen/block to support multi-page ring, so that more requests can be
issued by using more than one pages as the request ring between blkfront
and backend.
As a result, the performance can get improved significantly.
We got some impressive improvements on our highend iscsi storage cluster
backend. If using 64 pages as the ring, the IOPS increased about 15 times
for the throughput testing and above doubled for the latency testing.
The reason was the limit on outstanding requests is 32 if use only one-page
ring, but in our case the iscsi lun was spread across about 100 physical
drives, 32 was really not enough to keep them busy.
Changes in v2:
- Rebased to 4.0-rc6.
- Document on how multi-page ring feature working to linux io/blkif.h.
Changes in v3:
- Remove changes to linux io/blkif.h and follow the protocol defined
in io/blkif.h of XEN tree.
- Rebased to 4.1-rc3
Changes in v4:
- Turn to use 'ring-page-order' and 'max-ring-page-order'.
- A few comments from Roger.
Changes in v5:
- Clarify with 4k granularity to comment
- Address more comments from Roger
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This is a pre-patch for multi-page ring feature.
In connect_ring, we can know exactly how many pages are used for the shared
ring, delay pending_req allocation here so that we won't waste too much memory.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Define pr_fmt macro with {xen-blkback: } prefix, then remove all use
of DRV_PFX in the pr sentences. Replace all DPRINTK with pr sentences,
and get rid of DPRINTK macro. It will simplify the code.
And if the pr sentences miss a \n, add it in the end. If the DPRINTK
sentences have redundant \n, remove it. It will format the code.
These all make the readability of the code become better.
Signed-off-by: Tao Chen <boby.chen@huawei.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Pull block driver changes from Jens Axboe:
"This contains:
- The 4k/partition fixes for brd from Boaz/Matthew.
- A few xen front/back block fixes from David Vrabel and Roger Pau
Monne.
- Floppy changes from Takashi, cleaning the device file creation.
- Switching libata to use the new blk-mq tagging policy, removing
code (and a suboptimal implementation) from libata. This will
throw you a merge conflict, since a bug in the original libata
tagging code was fixed since this code was branched. Trivial.
From Shaohua.
- Conversion of loop to blk-mq, from Ming Lei.
- Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
He claims it improves on unreadable code, which will cost him a
beer.
- Maintainer update or NDB, now handled by Markus Pargmann.
- NVMe:
- Optimization from me that avoids a kmalloc/kfree per IO for
smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
overhead.
- Removal of (now) dead RCU code, a relic from before NVMe was
converted to blk-mq"
* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
xen-blkback: default to X86_32 ABI on x86
xen-blkfront: fix accounting of reqs when migrating
xen-blkback,xen-blkfront: add myself as maintainer
block: Simplify bsg complete all
floppy: Avoid manual call of device_create_file()
NVMe: avoid kmalloc/kfree for smaller IO
MAINTAINERS: Update NBD maintainer
libata: make sata_sil24 use fifo tag allocator
libata: move sas ata tag allocation to libata-scsi.c
libata: use blk taging
NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
null_blk: suppress invalid partition info
brd: Request from fdisk 4k alignment
brd: Fix all partitions BUGs
axonram: Fix bug in direct_access
loop: add blk-mq.h include
block: loop: don't handle REQ_FUA explicitly
block: loop: introduce lo_discard() and lo_req_flush()
block: loop: say goodby to bio
block: loop: improve performance via blk-mq
Prior to the existance of 64-bit backends using the X86_64 ABI,
frontends used the X86_32 ABI. These old frontends do not specify the
ABI and when used with a 64-bit backend do not work.
On x86, default to the X86_32 ABI if one is not specified. Backends
on ARM continue to default to their NATIVE ABI.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Use gnttab_unmap_refs_async() to wait until the mapped pages are no
longer in use before unmapping them.
This allows blkback to use network storage which may retain refs to
pages in queued skbs after the block I/O has completed.
Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
requests to finish. If the VBD is attached to a hot-swappable disk, then
xenwatch can hang for a long period of time, stalling other watches.
INFO: task xenwatch:39 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
Call Trace:
[<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
[<ffffffff8128f761>] ? list_del+0x11/0x40
[<ffffffff8147a080>] ? wait_for_common+0x60/0x160
[<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
[<ffffffff8147a26a>] schedule+0x3a/0x60
[<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
[<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
[<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
[<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
[<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
[<ffffffff81346488>] device_release_driver+0x28/0x40
[<ffffffff813456e8>] bus_remove_device+0x78/0xe0
[<ffffffff81342c9f>] device_del+0x12f/0x1a0
[<ffffffff81342d2d>] device_unregister+0x1d/0x60
[<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
[<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
[<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
[<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
[<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
[<ffffffff8130b4cb>] frontend_changed+0xb/0x10
[<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
[<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
[<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
[<ffffffff810799d6>] kthread+0x96/0xa0
[<ffffffff81485934>] kernel_thread_helper+0x4/0x10
[<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
[<ffffffff81485930>] ? gs_change+0x13/0x13
With this patch, when there is still pending I/O, the actual disconnect
is done by the last reference holder (last pending I/O request). In this
case, xenwatch doesn't block indefinitely.
Signed-off-by: Valentin Priescu <priescuv@amazon.com>
Reviewed-by: Steven Kady <stevkady@amazon.com>
Reviewed-by: Steven Noonan <snoonan@amazon.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Initialize persistent_purge_work work_struct on xen_blkif_alloc (and
remove the previous initialization done in purge_persistent_gnt). This
prevents flush_work from complaining even if purge_persistent_gnt has
not been used.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was wrongly introduced in commit 402b27f9, the only difference
between blkif_request_segment_aligned and blkif_request_segment is
that the former has a named padding, while both share the same
memory layout.
Also correct a few minor glitches in the description, including for it
to no longer assume PAGE_SIZE == 4096.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[Description fix by Jan Beulich]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Introduce a new variable to keep track of the number of in-flight
requests. We need to make sure that when xen_blkif_put is called the
request has already been freed and we can safely free xen_blkif, which
was not the case before.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
I've at least identified two possible memory leaks in blkback, both
related to the shutdown path of a VBD:
- blkback doesn't wait for any pending purge work to finish before
cleaning the list of free_pages. The purge work will call
put_free_pages and thus we might end up with pages being added to
the free_pages list after we have emptied it. Fix this by making
sure there's no pending purge work before exiting
xen_blkif_schedule, and moving the free_page cleanup code to
xen_blkif_free.
- blkback doesn't wait for pending requests to end before cleaning
persistent grants and the list of free_pages. Again this can add
pages to the free_pages list or persistent grants to the
persistent_gnts red-black tree. Fixed by moving the persistent
grants and free_pages cleanup code to xen_blkif_free.
Also, add some checks in xen_blkif_free to make sure we are cleaning
everything.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Check that the ring does not have an insane amount of requests
(more than there could fit on the ring).
If we detect this case we will stop processing the requests
and wait until the XenBus disconnects the ring.
The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
many responses we have created in the past (rsp_prod_pvt) vs
requests consumed (req_cons) and whether said difference is greater or
equal to the size of the ring, does not catch this case.
Wha the condition does check if there is a need to process more
as we still have a backlog of responses to finish. Note that both
of those values (rsp_prod_pvt and req_cons) are not exposed on the
shared ring.
To understand this problem a mini crash course in ring protocol
response/request updates is in place.
There are four entries: req_prod and rsp_prod; req_event and rsp_event
to track the ring entries. We are only concerned about the first two -
which set the tone of this bug.
The req_prod is a value incremented by frontend for each request put
on the ring. Conversely the rsp_prod is a value incremented by the backend
for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
pushing the responses on the ring). Both values can
wrap and are modulo the size of the ring (in block case that is 32).
Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details.
The culprit here is that if the difference between the
req_prod and req_cons is greater than the ring size we have a problem.
Fortunately for us, the '__do_block_io_op' loop:
rc = blk_rings->common.req_cons;
rp = blk_rings->common.sring->req_prod;
while (rc != rp) {
..
blk_rings->common.req_cons = ++rc; /* before make_response() */
}
will loop up to the point when rc == rp. The macros inside of the
loop (RING_GET_REQUEST) is smart and is indexing based on the modulo
of the ring size. If the frontend has provided a bogus req_prod value
we will loop until the 'rc == rp' - which means we could be processing
already processed requests (or responses) often.
The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
b/c it only tracks how many responses we have internally produced
and whether we would should process more. The astute reader will
notice that the macro RING_REQUEST_CONS_OVERFLOW provides two
arguments - more on this later.
For example, if we were to enter this function with these values:
blk_rings->common.sring->req_prod = X+31415 (X is the value from
the last time __do_block_io_op was called).
blk_rings->common.req_cons = X
blk_rings->common.rsp_prod_pvt = X
The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
is doing:
req_cons - rsp_prod_pvt >= 32
Which is,
X - X >= 32 or 0 >= 32
And that is false, so we continue on looping (this bug).
If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
instead (sring->req_prod) of rc, the this macro can do the check:
req_prod - rsp_prov_pvt >= 32
Which is,
X + 31415 - X >= 32 , or 31415 >= 32
which is true, so we can error out and break out of the function.
Unfortunatly the difference between rsp_prov_pvt and req_prod can be
at 32 (which would error out in the macro). This condition exists when
the backend is lagging behind with the responses and still has not finished
responding to all of them (so make_response has not been called), and
the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able
to use said macro.
Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
a simple check of:
req_prod - rsp_prod_pvt > RING_SIZE
And with the X values from above:
X + 31415 - X > 32
Returns true. Also not that if the ring is full (which is where
the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
same condition:
X + 32 - X > 32
Which is false.
Lets use that macro.
Note that in v5 of this patchset the macro was different - we used an
earlier version.
Cc: stable@vger.kernel.org
[v1: Move the check outside the loop]
[v2: Add a pr_warn as suggested by David]
[v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
[v4: Move wake_up after kthread_stop as suggested by Jan]
[v5: Use RING_REQUEST_PROD_OVERFLOW instead]
[v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
gadsa
Allocate pending requests in smaller chunks instead of allocating them
all at the same time.
This change also removes the global array of pending_reqs, it is no
longer necessay.
Variables related to the grant mapping have been grouped into a struct
called "grant_page", this allows to allocate them in smaller chunks,
and also improves memory locality.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Indirect descriptors introduce a new block operation
(BLKIF_OP_INDIRECT) that passes grant references instead of segments
in the request. This grant references are filled with arrays of
blkif_request_segment_aligned, this way we can send more segments in a
request.
The proposed implementation sets the maximum number of indirect grefs
(frames filled with blkif_request_segment_aligned) to 256 in the
backend and 32 in the frontend. The value in the frontend has been
chosen experimentally, and the backend value has been set to a sane
value that allows expanding the maximum number of indirect descriptors
in the frontend if needed.
The migration code has changed from the previous implementation, in
which we simply remapped the segments on the shared ring. Now the
maximum number of segments allowed in a request can change depending
on the backend, so we have to requeue all the requests in the ring and
in the queue and split the bios in them if they are bigger than the
new maximum number of segments.
[v2: Fixed minor comments by Konrad.
[v1: Added padding to make the indirect request 64bit aligned.
Added some BUGs, comments; fixed number of indirect pages in
blkif_get_x86_{32/64}_req. Added description about the indirect operation
in blkif.h]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[v3: Fixed spaces and tabs mix ups]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Remove the last dependency from blkbk by moving the list of free
requests to blkif. This change reduces the contention on the list of
available requests.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This mechanism allows blkback to change the number of grants
persistently mapped at run time.
The algorithm uses a simple LRU mechanism that removes (if needed) the
persistent grants that have not been used since the last LRU run, or
if all grants have been used it removes the first grants in the list
(that are not in use).
The algorithm allows the user to change the maximum number of
persistent grants, by changing max_persistent_grants in sysfs.
Since we are storing the persistent grants used inside the request
struct (to be able to mark them as "unused" when unmapping), we no
longer need the bitmap (unmap_seg).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Using balloon pages for all granted pages allows us to simplify the
logic in blkback, especially in the xen_blkbk_map function, since now
we can decide if we want to map a grant persistently or not after we
have actually mapped it. This could not be done before because
persistent grants used ballooned pages, whereas non-persistent grants
used pages from the kernel.
This patch also introduces several changes, the first one is that the
list of free pages is no longer global, now each blkback instance has
it's own list of free pages that can be used to map grants. Also, a
run time parameter (max_buffer_pages) has been added in order to tune
the maximum number of free pages each blkback instance will keep in
it's buffer.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
dev_bus_addr returned in the grant ref map operation is the mfn of the
passed page, there's no need to store it in the persistent grant
entry, since we can always get it provided that we have the page.
This reduces the memory overhead of persistent grants in blkback.
While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as
it makes much more sense - as we use that value in bio_add_page
which as the fourth argument expects the offset.
We hadn't used the physical address as part of this at all.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
[v1: s/buf/offset/]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
These values shouldn't be negative, but after an overflow their value
can turn into negative, if they are signed. xentop can show bogus
values in this case.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If the frontend is using a non-native protocol (e.g., a 64-bit
frontend with a 32-bit backend) and it sent an unrecognized request,
the request was not translated and the response would have the
incorrect ID. This may cause the frontend driver to behave
incorrectly or crash.
Since the ID field in the request is always in the same place,
regardless of the request type we can get the correct ID and make a
valid response (which will report BLKIF_RSP_EOPNOTSUPP).
This bug affected 64-bit SLES 11 guests when using a 32-bit backend.
This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES
source) and would crash in blkif_int() as the ID in the response would
be invalid.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.
Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.
This patch improves performance by only unmapping when the connection
between blkfront and back is broken.
On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.
To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.
Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.
Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.
The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.
In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.
Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.
Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We weren't copying the id field so when we sent the response
back to the frontend (especially with a 64-bit host and 32-bit
guest), we ended up using a random value. This lead to the
frontend crashing as it would try to pass to __blk_end_request_all
a NULL 'struct request' (b/c it would use the 'id' to find the
proper 'struct request' in its shadow array) and end up crashing:
BUG: unable to handle kernel NULL pointer dereference at 000000e4
IP: [<c0646d4c>] __blk_end_request_all+0xc/0x40
.. snip..
EIP is at __blk_end_request_all+0xc/0x40
.. snip..
[<ed95db72>] blkif_interrupt+0x172/0x330 [xen_blkfront]
This fixes the bug by passing in the proper id for the response.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=824641
CC: stable@kernel.org
Tested-by: William Dauchy <wdauchy@gmail.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The only reason for the distinction was for the special case of
'file' (which is assumed to be loopback device), was to reach inside
the loopback device, find the underlaying file, and call fallocate on it.
Fortunately "xen-blkback: convert hole punching to discard request on
loop devices" removes that use-case and we now based the discard
support based on blk_queue_discard(q) and extract all appropriate
parameters from the 'struct request_queue'.
CC: Li Dongyang <lidongyang@novell.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
[v1: Dropping pointless initializer and keeping blank line]
[v2: Remove the kfree as it is not used anymore]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.
CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In a union type structure to deal with the overlapping
attributes in a easier manner.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* 'stable/vmalloc-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
net: xen-netback: use API provided by xenbus module to map rings
block: xen-blkback: use API provided by xenbus module to map rings
xen: use generic functions instead of xen_{alloc, free}_vm_area()
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
virtio-blk: use ida to allocate disk index
hpsa: add small delay when using PCI Power Management to reset for kump
cciss: add small delay when using PCI Power Management to reset for kump
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
drivers/block/loop.c: emit uevent on auto release
drivers/block/cpqarray.c: use pci_dev->revision
loop: always allow userspace partitions and optionally support automatic scanning
...
Fic up trivial header file includsion conflict in drivers/block/loop.c
The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
We emulate the barrier requests by draining the outstanding bio's
and then sending the WRITE_FLUSH command. To drain the I/Os
we use the refcnt that is used during disconnect to wait for all
the I/Os before disconnecting from the frontend. We latch on its
value and if it reaches either the threshold for disconnect or when
there are no more outstanding I/Os, then we have drained all I/Os.
Suggested-by: Christopher Hellwig <hch@infradead.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch fixes belows:
1. Fix code style issue.
2. Fix incorrect functions name in comments.
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
..aka ATA TRIM/SCSI UNMAP command to be passed through the frontend
and used as appropiately by the backend. We also advertise
certain granulity parameters to the frontend so it can plug them in.
If the backend is a realy device - we just end up using
'blkdev_issue_discard' while for loopback devices - we just punch
a hole in the image file.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Fixed up pr_debug and commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
It was pointed out by 'make versioncheck' that some includes of
linux/version.h are not needed in drivers/block/.
This patch removes them.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch fixes belows:
1. Fix code style issue.
2. Fix incorrect functions name in comments.
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>