This is a fix for my earlier patch: "virtio: Add memory statistics reporting to
the balloon driver (V4)".
I discovered that all_vm_events() can sleep and therefore stats collection
cannot be done in interrupt context. One solution is to handle the interrupt
by noting that stats need to be collected and waking the existing vballoon
kthread which will complete the work via stats_handle_request(). Rusty, is
this a saner way of doing business?
There is one issue that I would like a broader opinion on. In stats_request, I
update vb->need_stats_update and then wake up the kthread. The kthread uses
vb->need_stats_update as a condition variable. Do I need a memory barrier
between the update and wake_up to ensure that my kthread sees the correct
value? My testing suggests that it is not needed but I would like some
confirmation from the experts.
Signed-off-by: Adam Litke <agl@us.ibm.com>
To: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changes since V3:
- Do not do endian conversions as they will be done in the host
- Report stats that reference a quantity of memory in bytes
- Minor coding style updates
Changes since V2:
- Increase stat field size to 64 bits
- Report all sizes in kb (not pages)
- Drop anon_pages stat and fix endianness conversion
Changes since V1:
- Use a virtqueue instead of the device config space
When using ballooning to manage overcommitted memory on a host, a system for
guests to communicate their memory usage to the host can provide information
that will minimize the impact of ballooning on the guests. The current method
employs a daemon running in each guest that communicates memory statistics to a
host daemon at a specified time interval. The host daemon aggregates this
information and inflates and/or deflates balloons according to the level of
host memory pressure. This approach is effective but overly complex since a
daemon must be installed inside each guest and coordinated to communicate with
the host. A simpler approach is to collect memory statistics in the virtio
balloon driver and communicate them directly to the hypervisor.
This patch enables the guest-side support by adding stats collection and
reporting to the virtio balloon driver.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor fixes)
This is needed to compile with CONFIG_VIRTIO_PCI=y,
because virtio_pci_remove is marked __devexit.
Signed-off-by: Jamie Lokier <jamie@shareable.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix fixes the following warnings by renaming the driver structures to be
suffixed with _driver.
WARNING: drivers/virtio/virtio_balloon.o(.data+0x88): Section mismatch in reference from the variable virtio_balloon to the function .devexit.text:virtballoon_remove()
WARNING: drivers/char/hw_random/virtio-rng.o(.data+0x88): Section mismatch in reference from the variable virtio_rng to the function .devexit.text:virtrng_remove()
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On SMP guests, reads from the ring might bypass used index reads. This
causes guest crashes because host writes to used index to signal ring
data readiness. Fix this by inserting rmb before used ring reads.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Commit f68d24082e
in 2.6.32-rc1 broke requesting IRQs for per-VQ MSI-X vectors:
- vector number was used instead of the vector itself
- we try to request an IRQ for VQ which does not
have a callback handler
This is a regression that causes warnings in kernel log,
potentially lower performance as we need to scan vq list,
and might cause system failure if the interrupt
requested is in fact needed by another system.
This was not noticed earlier because in most cases
we were falling back on shared interrupt for all vqs.
The warnings often look like this:
virtio-pci 0000:00:03.0: irq 26 for MSI/MSI-X
virtio-pci 0000:00:03.0: irq 27 for MSI/MSI-X
virtio-pci 0000:00:03.0: irq 28 for MSI/MSI-X
IRQ handler type mismatch for IRQ 1
current handler: i8042
Pid: 2400, comm: modprobe Tainted: G W
2.6.32-rc3-11952-gf3ed8d8-dirty #1
Call Trace:
[<ffffffff81072aed>] ? __setup_irq+0x299/0x304
[<ffffffff81072ff3>] ? request_threaded_irq+0x144/0x1c1
[<ffffffff813455af>] ? vring_interrupt+0x0/0x30
[<ffffffff81346598>] ? vp_try_to_find_vqs+0x583/0x5c7
[<ffffffffa0015188>] ? skb_recv_done+0x0/0x34 [virtio_net]
[<ffffffff81346609>] ? vp_find_vqs+0x2d/0x83
[<ffffffff81345d00>] ? vp_get+0x3c/0x4e
[<ffffffffa0016373>] ? virtnet_probe+0x2f1/0x428 [virtio_net]
[<ffffffffa0015188>] ? skb_recv_done+0x0/0x34 [virtio_net]
[<ffffffffa00150d8>] ? skb_xmit_done+0x0/0x39 [virtio_net]
[<ffffffff8110ab92>] ? sysfs_do_create_link+0xcb/0x116
[<ffffffff81345cc2>] ? vp_get_status+0x14/0x16
[<ffffffff81345464>] ? virtio_dev_probe+0xa9/0xc8
[<ffffffff8122b11c>] ? driver_probe_device+0x8d/0x128
[<ffffffff8122b206>] ? __driver_attach+0x4f/0x6f
[<ffffffff8122b1b7>] ? __driver_attach+0x0/0x6f
[<ffffffff8122a9f9>] ? bus_for_each_dev+0x43/0x74
[<ffffffff8122a374>] ? bus_add_driver+0xea/0x22d
[<ffffffff8122b4a3>] ? driver_register+0xa7/0x111
[<ffffffffa001a000>] ? init+0x0/0xc [virtio_net]
[<ffffffff81009051>] ? do_one_initcall+0x50/0x148
[<ffffffff8106e117>] ? sys_init_module+0xc5/0x21a
[<ffffffff8100af02>] ? system_call_fastpath+0x16/0x1b
virtio-pci 0000:00:03.0: irq 26 for MSI/MSI-X
virtio-pci 0000:00:03.0: irq 27 for MSI/MSI-X
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The function virtballoon_remove is used only wrapped by __devexit_p so
define it using __devexit.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty,
commit 3ca4f5ca73
virtio: add virtio IDs file
moved all device IDs into a single file. While the change itself is
a very good one, it can break userspace applications. For example
if a userspace tool wanted to get the ID of virtio_net it used to
include virtio_net.h. This does no longer work, since virtio_net.h
does not include virtio_ids.h.
This patch moves all "#include <linux/virtio_ids.h>" from the C
files into the header files, making the header files compatible with
the old ones.
In addition, this patch exports virtio_ids.h to userspace.
CC: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Virtio IDs are spread all over the tree which makes assigning new IDs
bothersome. Putting them together should make the process less error-prone.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This API change means that virtio_net can tell how much capacity
remains for buffers. It's necessarily fuzzy, since
VIRTIO_RING_F_INDIRECT_DESC means we can fit any number of descriptors
in one, *if* we can kmalloc.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dinesh Subhraveti <dineshs@us.ibm.com>
1) Rename vp_request_vectors to vp_request_msix_vectors, and take
non-MSI-X case out to caller.
2) Comment weird pci_enable_msix API
3) Rename vp_find_vq to setup_vq.
4) Fix spaces to tabs
5) Make nvectors calc internal to vp_try_to_find_vqs()
6) Rename vector to msix_vector for more clarity.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
This refactors find_vqs, making it more readable and robust, and fixing
two regressions from 2.6.30:
- double free_irq causing BUG_ON on device removal
- probe failure when vq can't be assigned to msi-x vector
(reported on old host kernels)
Tested-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This makes delete vq the reverse of find vq.
This is required to make it possible to retry find_vqs
after a failure, otherwise the list gets corrupted.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make vp_free_vectors do the reverse of vq_request_vectors.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If pci_register_driver() fails we're incorrectly unregistering the root
device with device_unregister() rather than root_device_unregister().
Reported-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch allows a virtio driver to use VIRTIO_DEV_ANY_ID for the
device id. This will be used by a test module that can be bound to
any virtio device.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This bug never appeared, since all current virtio drivers use
VIRTIO_DEV_ANY_ID for the vendor field. If a real vendor would be used,
the check in virtio_id_match is wrong - it returns 0 if
id->vendor == dev->id.vendor.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Add a new feature flag for indirect ring entries. These are ring
entries which point to a table of buffer descriptors.
The idea here is to increase the ring capacity by allowing a larger
effective ring size whereby the ring size dictates the number of
requests that may be outstanding, rather than the size of those
requests.
This should be most effective in the case of block I/O where we can
potentially benefit by concurrently dispatching a large number of
large requests. Even in the simple case of single segment block
requests, this results in a threefold increase in ring capacity.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This implements optional MSI-X support in virtio_pci.
MSI-X is used whenever the host supports at least 2 MSI-X
vectors: 1 for configuration changes and 1 for virtqueues.
Per-virtqueue vectors are allocated if enough vectors
available.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ whitespace, style)
This reorganizes virtio-pci code in vp_interrupt slightly, so that
it's easier to add per-vq MSI support on top.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ lguest/9p compile fixes)
Add a linked list of all virtqueues for a virtio device: this helps for
debugging and is also needed for upcoming interface change.
Also, add a "name" field for clearer debug messages.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Virtio devices are supposed to negotiate features before they start using
the device, but the current code doesn't do this. This is because the
driver's probe() function invariably has to add buffers to a virtqueue,
or probe the disk (virtio_blk).
This currently doesn't matter since no existing backend is strict about
the feature negotiation. But it's possible to imagine a future feature
which completely changes how a device operates: in this case, we'd need
to acknowledge it before using the device.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Break out of wait_event_interruptible() if freezing has been requested,
in the vballoon thread. Without this change vballoon refuses to stop and
the system can't suspend.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
Impact: cleanup
Roel Kluin drew attention to these macros with his patch: here I
neaten them a little further:
1) Add a comment on what START_USE and END_USE are checking,
2) Brackets around _vq in BAD_RING,
3) Neaten formatting for START_USE so it's less than 80 cols.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
fix BAD_RING, START_US and END_USE macros
When these macros aren't called with a variable named vq as first
argument, this would result in a build failure.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The host really shouldn't be notifying us of config changes
before the device status is VIRTIO_CONFIG_S_DRIVER or
VIRTIO_CONFIG_S_DRIVER_OK.
However, if we do happen to be interrupted while we're not
attached to a driver, we really shouldn't oops. Prevent
this simply by checking that device->driver is non-NULL
before trying to notify the driver of config changes.
Problem observed by doing a "set_link virtio.0 down" with
QEMU before the net driver had been loaded.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We shouldn't be statically allocating the root device object,
so dynamically allocate it using root_device_register()
instead.
Also avoids this warning from 'rmmod virtio_pci':
Device 'virtio-pci' does not have a release() function, it is broken and must be fixed
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add a release() function for virtio_pci devices so as to avoid:
Device 'virtio0' does not have a release() function, it is broken and must be fixed
Move the code to free the resources associated with the device
from virtio_pci_remove() into this new function. virtio_pci_remove()
now merely unregisters the device which should cause the final
ref to be dropped and virtio_pci_release_dev() to be called.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make the balloon interface always use 4K pages, and convert Linux pfns if
necessary. This patch assumes that Linux's PAGE_SHIFT will never be less than
12.
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (modified)
This allows each virtio user to hand in the alignment appropriate to
their virtio_ring structures.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
The virtio PCI devices don't depend on the guest page size. This matters
now PowerPC virtio is gaining ground (they like 64k pages).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch is part of a larger patch series which will remove
the "char bus_id[20]" name string from struct device. The device
name is managed in the kobject anyway, and without any size
limitation, and just needlessly copied into "struct device".
To set and read the device name dev_name(dev) and dev_set_name(dev)
must be used. If your code uses static kobjects, which it shouldn't
do, "const char *init_name" can be used to statically provide the
name the registered device should have. At registration time, the
init_name field is cleared, to enforce the use of dev_name(dev) to
access the device name at a later time.
We need to get rid of all occurrences of bus_id in the entire tree
to be able to enable the new interface. Please apply this patch,
and possibly convert any remaining remaining occurrences of bus_id.
We want to submit a patch to -next, which will remove bus_id from
"struct device", to find the remaining pieces to convert, and finally
switch over to the new api, which will remove the 20 bytes array
and does no longer have a size limitation.
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
kzalloc() does not guarantee page alignment, and in fact this broke when
I enabled CONFIG_SLUB_DEBUG_ON.
(Thanks to Anthony Liguori for spotting the missing kfree sub)
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed kfree)
Tested-by: Anthony Liguori <aliguori@us.ibm.com>
Both v and vb->num_pages are u32 and unsigned int respectively. If v is less
than vb->num_pages (and it is, when deflating the balloon), the result is a
very large 32-bit number. Since we're returning a s64, instead of getting the
same negative number we desire, we get a very large positive number.
This handles the case where v < vb->num_pages and ensures we get a small,
negative, s64 as the result.
Rusty: please push this for 2.6.27-rc4. It's probably appropriate for the
stable tree too as it will cause an unexpected OOM when ballooning.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (simplified)
To prepare for virtio_ring transport feature bits, hook in a call in
all the users to manipulate them. This currently just clears all the
bits, since it doesn't understand any features.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rather than explicitly handing the features to the lower-level, we just
hand the virtio_device and have it set the features. This make it clear
that it has the chance to manipulate the features of the device at this
point (and that all feature negotiation is already done).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assign feature bits as required, but it makes sense to reserve some
for the particular transport, rather than the particular device.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Hook up to the probe() and remove() methods in bus_type
rather than device_driver. The latter has been preferred
since 2.6.16.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We force notification when the ring is full, even if the host has
indicated it doesn't want to know. This seemed like a good idea at
the time: if we fill the transmit ring, we should tell the host
immediately.
Unfortunately this logic also applies to the receiving ring, which is
refilled constantly. We should introduce real notification thesholds
to replace this logic. Meanwhile, removing the logic altogether breaks
the heuristics which KVM uses, so we use a hack: only notify if there are
outgoing parts of the new buffer.
Here are the number of exits with lguest's crappy network implementation:
Before:
network xmit 7859051 recv 236420
After:
network xmit 7858610 recv 118136
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
lguest (in rusty's use-tun-ringfd patch) assumes that the
guest has updated its feature bits before setting its status
to VIRTIO_CONFIG_S_DRIVER_OK.
That's pretty reasonable, so let's make it so.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
virtio allows drivers to suppress callbacks (ie. interrupts) for
efficiency (no locking, it's just an optimization).
There's a similar mechanism for the host to suppress notifications
coming from the guest: in that case, we ignore the suppression if the
ring is completely full.
It turns out that life is simpler if the host similarly ignores
callback suppression when the ring is completely empty: the network
driver wants to free up old packets in a timely manner, and otherwise
has to use a timer to poll.
We have to remove the code which ignores interrupts when the driver
has disabled them (again, it had no locking and hence was unreliable
anyway).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Hello Rusty,
seems that we still have a problem with virtio_net and the enable_cb callback.
During a long running network stress tests with virtio and got the following
oops:
------------[ cut here ]------------
kernel BUG at drivers/virtio/virtio_ring.c:230!
illegal operation: 0001 [#1] SMP
Modules linked in:
CPU: 0 Not tainted 2.6.26-rc2-kvm-00436-gc94c08b-dirty #34
Process netserver (pid: 2582, task: 000000000fbc4c68, ksp: 000000000f42b990)
Krnl PSW : 0704c00180000000 00000000002d0ec8 (vring_enable_cb+0x1c/0x60)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000000 000000000ef3d000 0000000010009800
0000000000000000 0000000000419ce0 0000000000000080 000000000000007b
000000000adb5538 000000000ef40900 000000000ef40000 000000000ef40920
0000000000000000 0000000000000005 000000000029c1b0 000000000fea7d18
Krnl Code: 00000000002d0ebc: a7110001 tmll %r1,1
00000000002d0ec0: a7740004 brc 7,2d0ec8
00000000002d0ec4: a7f40001 brc 15,2d0ec6
>00000000002d0ec8: a517fffe nill %r1,65534
00000000002d0ecc: 40103000 sth %r1,0(%r3)
00000000002d0ed0: 07f0 bcr 15,%r0
00000000002d0ed2: e31020380004 lg %r1,56(%r2)
00000000002d0ed8: a7480000 lhi %r4,0
Call Trace:
([<000000000029c0fc>] virtnet_poll+0x290/0x3b8)
[<0000000000333fb8>] net_rx_action+0x9c/0x1b8
[<00000000001394bc>] __do_softirq+0x74/0x108
[<000000000010d16a>] do_softirq+0x92/0xac
[<0000000000139826>] irq_exit+0x72/0xc8
[<000000000010a7b6>] do_extint+0xe2/0x104
[<0000000000110508>] ext_no_vtime+0x16/0x1a
Last Breaking-Event-Address:
[<00000000002d0ec4>] vring_enable_cb+0x18/0x60
I looked into the virtio_net code for some time and I think the following
scenario happened. Please look at virtnet_poll:
[...]
/* Out of packets? */
if (received < budget) {
netif_rx_complete(vi->dev, napi);
if (unlikely(!vi->rvq->vq_ops->enable_cb(vi->rvq))
&& napi_schedule_prep(napi)) {
vi->rvq->vq_ops->disable_cb(vi->rvq);
__netif_rx_schedule(vi->dev, napi);
goto again;
}
}
If an interrupt arrives after netif_rx_complete, a second poll routine can run
on a different cpu. The second check for napi_schedule_prep would prevent any
harm in the network stack, but we have called enable_cb possibly after the
disable_cb in skb_recv_done.
static void skb_recv_done(struct virtqueue *rvq)
{
struct virtnet_info *vi = rvq->vdev->priv;
/* Schedule NAPI, Suppress further interrupts if successful. */
if (netif_rx_schedule_prep(vi->dev, &vi->napi)) {
rvq->vq_ops->disable_cb(rvq);
__netif_rx_schedule(vi->dev, &vi->napi);
}
}
That means that the second poll routine runs with interrupts enabled, which is
ok, since we can handle additional interrupts. The problem is now that the
second poll routine might also call enable_cb, triggering the BUG.
The only solution I can come up with, is to remove the BUG statement in
enable_cb - similar to disable_cb. Opinions or better ideas where the oops
could come from?
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Anthony Liguori points out that three different transports use the virtio code,
but each one keeps its own counter to set the virtio_device's index field. In
theory (though not in current practice) this means that names could be
duplicated, and that risk grows as more transports are created.
So we move the selection of the unique virtio_device.index into the common code
in virtio.c, which has the side-benefit of removing duplicate code.
The only complexity is that lguest and S/390 use the index to uniquely identify
the device in case of catastrophic failure before register_virtio_device() is
called: now we use the offset within the descriptor page as a unique identifier
for the printks.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Lalancette <clalance@redhat.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
The common virtio code sets the bus_id, overriding anything virtio_pci
sets anyway.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Lalancette <clalance@redhat.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
Chris Lalancette <clalance@redhat.com> points out that virtio.c sets all device
names to '0', '1', etc, which looks silly in /proc/interrupts. We change this
from '%d' to 'virtio%d'.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Lalancette <clalance@redhat.com>
Cc: Anthony Liguori <anthony@codemonkey.ws>
A recent proposed feature addition to the virtio block driver revealed
some flaws in the API: in particular, we assume that feature
negotiation is complete once a driver's probe function returns.
There is nothing in the API to require this, however, and even I
didn't notice when it was violated.
So instead, we require the driver to specify what features it supports
in a table, we can then move the feature negotiation into the virtio
core. The intersection of device and driver features are presented in
a new 'features' bitmap in the struct virtio_device.
Note that this highlights the difference between Linux unsigned-long
bitmaps where each unsigned long is in native endian, and a
straight-forward little-endian array of bytes.
Drivers can still remove feature bits in their probe routine if they
really have to.
API changes:
- dev->config->feature() no longer gets and acks a feature.
- drivers should advertise their features in the 'feature_table' field
- use virtio_has_feature() for extra sanity when checking feature bits
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A recent proposed feature addition to the virtio block driver revealed
some flaws in the API, in particular how easy it is to break big
endian machines.
The virtio config space was originally chosen to be little-endian,
because we thought the config might be part of the PCI config space
for virtio_pci. It's actually a separate mmio region, so that
argument holds little water; as only x86 is currently using the virtio
mechanism, we can change this (but must do so now, before the
impending s390 merge).
API changes:
- __virtio_config_val() just becomes a striaght vdev->config_get() call.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A corrupt virtqueue (caused by the other end screwing up) can have
strange results such as a driver spinning: just bail when we try to
get a buffer from a known-broken queue.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The 'disable_cb' callback is designed as an optimization to tell the host
we don't need callbacks now. As it is not reliable, the debug check is
overzealous: it can happen on two CPUs at the same time. Document this.
Even if it were reliable, the virtio_net driver doesn't disable
callbacks on transmit so the START_USE/END_USE debugging reentrance
protection can be easily tripped even on UP.
Thanks to Balaji Rao for the bug report and testing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
CC: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure to call unregister_virtio_device() when a virtio device is removed.
Otherwise, virtio_pci.ko cannot be rmmod'd.
This was spotted by Marcelo Tosatti.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There is a race in virtio_net, dealing with disabling/enabling the callback.
I saw the following oops:
kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:218!
illegal operation: 0001 [#1] SMP
Modules linked in: sunrpc dm_mod
CPU: 2 Not tainted 2.6.25-rc1zlive-host-10623-gd358142-dirty #99
Process swapper (pid: 0, task: 000000000f85a610, ksp: 000000000f873c60)
Krnl PSW : 0404300180000000 00000000002b81a6 (vring_disable_cb+0x16/0x20)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
Krnl GPRS: 0000000000000001 0000000000000001 0000000010005800 0000000000000001
000000000f3a0900 000000000f85a610 0000000000000000 0000000000000000
0000000000000000 000000000f870000 0000000000000000 0000000000001237
000000000f3a0920 000000000010ff74 00000000002846f6 000000000fa0bcd8
Krnl Code: 00000000002b819a: a7110001 tmll %r1,1
00000000002b819e: a7840004 brc 8,2b81a6
00000000002b81a2: a7f40001 brc 15,2b81a4
>00000000002b81a6: a51b0001 oill %r1,1
00000000002b81aa: 40102000 sth %r1,0(%r2)
00000000002b81ae: 07fe bcr 15,%r14
00000000002b81b0: eb7ff0380024 stmg %r7,%r15,56(%r15)
00000000002b81b6: a7f13e00 tmll %r15,15872
Call Trace:
([<000000000fa0bcd0>] 0xfa0bcd0)
[<00000000002b8350>] vring_interrupt+0x5c/0x6c
[<000000000010ab08>] do_extint+0xb8/0xf0
[<0000000000110716>] ext_no_vtime+0x16/0x1a
[<0000000000107e72>] cpu_idle+0x1c2/0x1e0
The problem can be triggered with a high amount of host->guest traffic.
I think its the following race:
poll says netif_rx_complete
poll calls enable_cb
enable_cb opens the interrupt mask
a new packet comes, an interrupt is triggered----\
enable_cb sees that there is more work |
enable_cb disables the interrupt |
. V
. interrupt is delivered
. skb_recv_done does atomic napi test, ok
some waiting disable_cb is called->check fails->bang!
.
poll would do napi check
poll would do disable_cb
The fix is to let enable_cb not disable the interrupt again, but expect the
caller to do the cleanup if it returns false. In that case, the interrupt is
only disabled, if the napi test_set_bit was successful.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned up doco)
If the host asks for a huge target towards_target() can overflow, and
we up oops as we try to release more pages than we have. The simple
fix is to use a 64-bit value.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
virtio-pci acquires its spin lock in an interrupt context so it's necessary
to use spin_lock_irqsave/restore variants. This patch fixes guest SMP when
using virtio devices in KVM.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Include linux/delay.h to fix compiler error:
drivers/virtio/virtio_balloon.c: In function 'fill_balloon':
drivers/virtio/virtio_balloon.c:98: error: implicit declaration of function 'msleep'
Signed-off-by: Johann Felix Soden <johfel@users.sourceforge.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After discussions with Anthony Liguori, it seems that the virtio
balloon can be made even simpler. Here's my attempt.
The device configuration tells the driver how much memory it should
take from the guest (ie. balloon size). The guest feeds the page
numbers it has taken via one virtqueue.
A second virtqueue feeds the page numbers the driver wants back: if
the device has the VIRTIO_BALLOON_F_MUST_TELL_HOST bit, then this
queue is compulsory, otherwise it's advisory (and the guest can simply
fault the pages back in).
This driver can be enhanced later to deflate the balloon via a
shrinker, oom callback or we could even go for a complete set of
in-guest regulators.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
As Avi pointed out, as we continue to massage the virtio PCI ABI, we can make
things a little more friendly to users by utilizing the PCI revision field to
indicate which version of the ABI we're using. This is a hard ABI version
and incrementing it will cause the guest driver to break.
This is the necessary changes to virtio_pci to support this.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a PCI device that implements a transport for virtio. It allows virtio
devices to be used by QEMU based VMMs like KVM or Xen.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is needed for the virtio PCI device to be compiled as a module.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Anthony Liguori found double interrupt suppression in the virtio_net
driver, triggered by two skb_recv_done's in a row. This is because
virtio_ring's interrupt suppression is a best-effort optimization: it
contains no synchronization so the host can miss it and still send
interrupts.
But it's certainly nicer for virtio users if calling disable_cb
actually disables callbacks, so we check for the race in the interrupt
routine.
Note: SMP guests might require syncronization here, but since
disable_cb is actually called from interrupt context, there has to be
some form of synchronization before the next same interrupt handler is
called (Linux guarantees that the same device's irq handler will never
run simultanously on multiple CPUs).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A reset function solves three problems:
1) It allows us to renegotiate features, eg. if we want to upgrade a
guest driver without rebooting the guest.
2) It gives us a clean way of shutting down virtqueues: after a reset,
we know that the buffers won't be used by the host, and
3) It helps the guest recover from messed-up drivers.
So we remove the ->shutdown hook, and the only way we now remove
feature bits is via reset.
We leave it to the driver to do the reset before it deletes queues:
the balloon driver, for example, needs to chat to the host in its
remove function.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The other side (host) can set the NO_NOTIFY flag as an optimization,
to say "no need to kick me when you add things". Make it clear that
this is advisory only; especially that we should always notify when
the ring is full.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It seems that virtio_net wants to disable callbacks (interrupts) before
calling netif_rx_schedule(), so we can't use the return value to do so.
Rename "restart" to "cb_enable" and introduce "cb_disable" hook: callback
now returns void, rather than a boolean.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Previously we used a type/len pair within the config space, but this
seems overkill. We now simply define a structure which represents the
layout in the config space: the config space can now only be extended
at the end.
The main driver-visible changes:
1) We indicate what fields are present with an explicit feature bit.
2) Virtqueues are explicitly numbered, and not in the config space.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The virtio code never hooked through the ->remove callback. Although
noone supports device removal at the moment, this code is already
needed for module unloading.
This of course also revealed bugs in virtio_blk, virtio_net and lguest
unloading paths.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The virtio descriptor rings of size N-1 were nicely set up to be
aligned to an N-byte boundary. But as Anthony Liguori points out, the
free-running indices used by virtio require that the sizes be a power
of 2, otherwise we get problems on wrap (demonstrated with lguest).
So we replace the clever "2^n-1" scheme with a simple "align to page
boundary" scheme: this means that all virtio rings take at least two
pages, but it's safer than guessing cache alignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The more_used() function compares the vq->vring.used->idx with last_used_idx.
Since vq->vring.used->idx is a 16-bit integer, and last_used_idx is an
unsigned int, this results in unpredictable behavior when vq->vring.used->idx
wraps around.
This patch corrects this by changing last_used_idx to the correct type.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
These helper routines supply most of the virtqueue_ops for hypervisors
which want to use a ring for virtio. Unlike the previous lguest
implementation:
1) The rings are variable sized (2^n-1 elements).
2) They have an unfortunate limit of 65535 bytes per sg element.
3) The page numbers are always 64 bit (PAE anyone?)
4) They no longer place used[] on a separate page, just a separate
cacheline.
5) We do a modulo on a variable. We could be tricky if we cared.
6) Interrupts and notifies are suppressed using flags within the rings.
Users need only get the ring pages and provide a notify hook (KVM
wants the guest to allocate the rings, lguest does it sanely).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dor Laor <dor.laor@qumranet.com>
This adds the logic to convert the virtio ids into module aliases, and
includes a modalias entry in sysfs and the env var to make probing work.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This attempts to implement a "virtual I/O" layer which should allow
common drivers to be efficiently used across most virtual I/O
mechanisms. It will no-doubt need further enhancement.
The virtio drivers add buffers to virtio queues; as the buffers are consumed
the driver "interrupt" callbacks are invoked.
There is also a generic implementation of config space which drivers can query
to get setup information from the host.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dor Laor <dor.laor@qumranet.com>
Cc: Arnd Bergmann <arnd@arndb.de>