OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Michael S. Tsirkin	544fc7dbbf	virtio_mem: convert device block size into 64bit If subblock size is large (e.g. 1G) 32 bit math involving it can overflow. Rather than try to catch all instances of that, let's tweak block size to 64 bit. It ripples through UAPI which is an ABI change, but it's not too late to make it, and it will allow supporting >4Gbyte blocks while might become necessary down the road. Fixes: `5f1f79bbc9` ("virtio-mem: Paravirtualized memory hotplug") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand <david@redhat.com>	2020-06-09 06:42:06 -04:00
Michael S. Tsirkin	b3fb6de7c6	virtio-mem: drop unnecessary initialization rc is initialized to -ENIVAL but that's never used. Drop it. Fixes: `5f1f79bbc9` ("virtio-mem: Paravirtualized memory hotplug") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand <david@redhat.com>	2020-06-08 05:12:52 -04:00
David Hildenbrand	fce8afd76e	virtio-mem: Don't rely on implicit compiler padding for requests The compiler will add padding after the last member, make that explicit. The size of a request is always 24 bytes. The size of a response always 10 bytes. Add compile-time checks. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: teawater <teawaterz@linux.alibaba.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200515101402.16597-1-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	72f9525ad7	virtio-mem: Try to unplug the complete online memory block first Right now, we always try to unplug single subblocks when processing an online memory block. Let's try to unplug the complete online memory block first, in case it is fully plugged and the unplug request is large enough. Fallback to single subblocks in case the memory block cannot get unplugged as a whole. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-16-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	8d4edcfe78	virtio-mem: Use -ETXTBSY as error code if the device is busy Let's be able to distinguish if the device or if memory is busy. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-15-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	562e08cd24	virtio-mem: Unplug subblocks right-to-left We unplug blocks right-to-left, let's also unplug subblocks within a block right-to-left. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-14-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	3c42e198e6	virtio-mem: Drop manual check for already present memory Registering our parent resource will fail if any memory is still present (e.g., because somebody unloaded the driver and tries to reload it). No need for the manual check. Move our "unplug all" handling to after registering the resource. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-13-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	ebf71552bb	virtio-mem: Add parent resource for all added "System RAM" Let's add a parent resource, named after the virtio device (inspired by drivers/dax/kmem.c). This allows user space to identify which memory belongs to which virtio-mem device. With this change and two virtio-mem devices: :/# cat /proc/iomem 00000000-00000fff : Reserved 00001000-0009fbff : System RAM [...] 140000000-333ffffff : virtio0 140000000-147ffffff : System RAM 148000000-14fffffff : System RAM 150000000-157ffffff : System RAM [...] 334000000-3033ffffff : virtio1 338000000-33fffffff : System RAM 340000000-347ffffff : System RAM 348000000-34fffffff : System RAM [...] Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-12-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	23e77b5dc9	virtio-mem: Better retry handling Let's start with a retry interval of 5 seconds and double the time until we reach 5 minutes, in case we keep getting errors. Reset the retry interval in case we succeeded. The two main reasons for having to retry are - The hypervisor is busy and cannot process our request - We cannot reach the desired requested_size (esp., not enough memory can get unplugged because we can't allocate any subblocks). Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-11-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	a573238786	virtio-mem: Offline and remove completely unplugged memory blocks Let's offline+remove memory blocks once all subblocks are unplugged. We can use the new Linux MM interface for that. As no memory is in use anymore, this shouldn't take a long time and shouldn't fail. There might be corner cases where the offlining could still fail (especially, if another notifier NACKs the offlining request). Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-10-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	8e5c921ca0	virtio-mem: Allow to offline partially unplugged memory blocks Dropping the reference count of PageOffline() pages during MEM_GOING_ONLINE allows offlining code to skip them. However, we also have to clear PG_reserved, because PG_reserved pages get detected as unmovable right away. Take care of restoring the reference count when offlining is canceled. Clarify why we don't have to perform any action when unloading the driver. Also, let's add a warning if anybody is still holding a reference to unplugged pages when offlining. Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-8-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	255f598507	virtio-mem: Paravirtualized memory hotunplug part 2 We also want to unplug online memory (contained in online memory blocks and, therefore, managed by the buddy), and eventually replug it later. When requested to unplug memory, we use alloc_contig_range() to allocate subblocks in online memory blocks (so we are the owner) and send them to our hypervisor. When requested to plug memory, we can replug such memory using free_contig_range() after asking our hypervisor. We also want to mark all allocated pages PG_offline, so nobody will touch them. To differentiate pages that were never onlined when onlining the memory block from pages allocated via alloc_contig_range(), we use PageDirty(). Based on this flag, virtio_mem_fake_online() can either online the pages for the first time or use free_contig_range(). It is worth noting that there are no guarantees on how much memory can actually get unplugged again. All device memory might completely be fragmented with unmovable data, such that no subblock can get unplugged. We are not touching the ZONE_MOVABLE. If memory is onlined to the ZONE_MOVABLE, it can only get unplugged after that memory was offlined manually by user space. In normal operation, virtio-mem memory is suggested to be onlined to ZONE_NORMAL. In the future, we will try to make unplug more likely to succeed. Add a module parameter to control if online memory shall be touched. As we want to access alloc_contig_range()/free_contig_range() from kernel module context, export the symbols. Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages are on the same node, in the same zone, and contain no holes. Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	c627ff5d98	virtio-mem: Paravirtualized memory hotunplug part 1 Unplugging subblocks of memory blocks that are offline is easy. All we have to do is watch out for concurrent onlining activity. Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-5-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	f2af6d3978	virtio-mem: Allow to specify an ACPI PXM as nid We want to allow to specify (similar as for a DIMM), to which node a virtio-mem device (and, therefore, its memory) belongs. Add a new virtio-mem feature flag and export pxm_to_node, so it can be used in kernel module context. Acked-by: Michal Hocko <mhocko@suse.com> # for the export Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
David Hildenbrand	5f1f79bbc9	virtio-mem: Paravirtualized memory hotplug Each virtio-mem device owns exactly one memory region. It is responsible for adding/removing memory from that memory region on request. When the device driver starts up, the requested amount of memory is queried and then plugged to Linux. On request, further memory can be plugged or unplugged. This patch only implements the plugging part. On x86-64, memory can currently be plugged in 4MB ("subblock") granularity. When required, a new memory block will be added (e.g., usually 128MB on x86-64) in order to plug more subblocks. Only x86-64 was tested for now. The online_page callback is used to keep unplugged subblocks offline when onlining memory - similar to the Hyper-V balloon driver. Unplugged pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to skip them. User space is usually responsible for onlining the added memory. The memory hotplug notifier is used to synchronize virtio-mem activity against memory onlining/offlining. Each virtio-mem device can belong to a NUMA node, which allows us to easily add/remove small chunks of memory to/from a specific NUMA node by using multiple virtio-mem devices. Something that works even when the guest has no idea about the NUMA topology. One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many "sub-DIMMS". This patch directly introduces the basic infrastructure to implement memory unplug. Especially the memory block states and subblock bitmaps will be heavily used there. Notes: - In case memory is to be onlined by user space, we limit the amount of offline memory blocks, to not run out of memory. This is esp. an issue if memory is added faster than it is getting onlined. - Suspend/Hibernate is not supported due to the way virtio-mem devices behave. Limited support might be possible in the future. - Reloading the device driver is not supported. Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:52 -04:00
Alexander Duyck	fb69c2c896	virtio-balloon: Disable free page reporting if page poison reporting is not enabled We should disable free page reporting if page poisoning is enabled but we cannot report it via the balloon interface. This way we can avoid the possibility of corrupting guest memory. Normally the page poisoning feature should always be present when free page reporting is enabled on the hypervisor, however this allows us to correctly handle a case of the virtio-balloon device being possibly misconfigured. Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page reports to host") Cc: stable@vger.kernel.org Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: https://lore.kernel.org/r/20200508173732.17877.85060.stgit@localhost.localdomain Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-04 15:36:51 -04:00
Markus Elfring	0c35c67412	virtio-mmio: Delete an error message in vm_find_vqs() The function “platform_get_irq” can log an error already. Thus omit a redundant message for the exception handling in the calling function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Link: https://lore.kernel.org/r/9e27bc4a-cfa1-7818-dc25-8ad308816b30@web.de Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-02 02:45:13 -04:00
Matej Genci	e7c8cc35a6	virtio: add VIRTIO_RING_NO_LEGACY Add macro to disable legacy vring functions. Signed-off-by: Matej Genci <matej.genci@nutanix.com> Link: https://lore.kernel.org/r/20190911124942.243713-1-matej.genci@nutanix.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-06-02 02:45:13 -04:00
Alexander Duyck	31ba514b2f	virtio-balloon: Avoid using the word 'report' when referring to free page hinting It can be confusing to have multiple features within the same driver that are using the same verbage. As such this patch is creating a union of free_page_report_cmd_id with free_page_hint_cmd_id so that we can clean-up the userspace code a bit in terms of readability while maintaining the functionality of legacy code. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: https://lore.kernel.org/r/20200415174318.13597.99753.stgit@localhost.localdomain Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-17 06:05:30 -04:00
Jason Yan	dc39cbb4e8	virtio-balloon: make virtballoon_free_page_report() static Fix the following sparse warning: drivers/virtio/virtio_balloon.c:168:5: warning: symbol 'virtballoon_free_page_report' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20200409085047.45483-1-yanaijie@huawei.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-17 06:05:30 -04:00
Michael S. Tsirkin	58ad13729a	vdpa: make vhost, virtio depend on menu If user did not configure any vdpa drivers, neither vhost nor virtio vdpa are going to be useful. So there's no point in prompting for these and selecting vdpa core automatically. Simplify configuration by making virtio and vhost vdpa drivers depend on vdpa menu entry. Once done, we no longer need a separate menu entry, so also get rid of this. While at it, fix up the IFC entry: VDPA->vDPA for consistency with other places. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>	2020-04-17 06:05:30 -04:00
Michael S. Tsirkin	f091abe806	virtio_input: pull in slab.h In preparation to virtio header changes, include slab.h directly as this module is using it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-17 06:05:29 -04:00
Linus Torvalds	9bb715260e	virtio: fixes, vdpa Some bug fixes. The new vdpa subsystem with two first drivers. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl6MS7wPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpGp8H/2H49Gya1cfVbGU13qgmBSQqQXC8hS3iNLuG ltRgU+jafJT//kvkdm3/DUzfK3eRUWUfqZLKEbAQDtMY0OGHi/KGEBYVLDde7Zxt Lg4VnwBhkYDR/f01ZZDbHxzj9JAr83i28nILjLIqf3a1BX4zf203+ZE0/JM8a7wL dOPoH7NAfyz5ul2F67bR1IOF8vC6TidpavzR2+HC/MocHYXb6Bgfvt+i4EcrfuMf 9lnBfajgklKr9sNJniwvvR1pWVg+YyG3VeC6T8tIC/xzbCmIoNT+5b3q2XPSIHq1 EuQTeXH9CBFXS0qcFlq2ktR1xd1Lx95hKwZpqLwLFDmfgjhV2QU= =/84P -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - Some bug fixes - The new vdpa subsystem with two first drivers * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio-balloon: Revert "virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM" vdpa: move to drivers/vdpa virtio: Intel IFC VF driver for VDPA vdpasim: vDPA device simulator vhost: introduce vDPA-based backend virtio: introduce a vDPA based transport vDPA: introduce vDPA bus vringh: IOTLB support vhost: factor out IOTLB vhost: allow per device message handler vhost: refine vhost and vringh kconfig virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM virtio-net: Introduce hash report feature virtio-net: Introduce RSS receive steering feature virtio-net: Introduce extended RSC feature tools/virtio: option to build an out of tree module	2020-04-08 10:51:53 -07:00
David Hildenbrand	da10329cb0	virtio-balloon: switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM Commit `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") changed the behavior when deflation happens automatically. Instead of deflating when called by the OOM handler, the shrinker is used. However, the balloon is not simply some other slab cache that should be shrunk when under memory pressure. The shrinker does not have a concept of priorities yet, so this behavior cannot be configured. Eventually once that is in place, we might want to switch back after doing proper testing. There was a report that this results in undesired side effects when inflating the balloon to shrink the page cache. [1] "When inflating the balloon against page cache (i.e. no free memory remains) vmscan.c will both shrink page cache, but also invoke the shrinkers -- including the balloon's shrinker. So the balloon driver allocates memory which requires reclaim, vmscan gets this memory by shrinking the balloon, and then the driver adds the memory back to the balloon. Basically a busy no-op." The name "deflate on OOM" makes it pretty clear when deflation should happen - after other approaches to reclaim memory failed, not while reclaiming. This allows to minimize the footprint of a guest - memory will only be taken out of the balloon when really needed. Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because this has no such side effects. Always register the shrinker with VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free pages that are still to be processed by the guest. The hypervisor takes care of identifying and resolving possible races between processing a hinting request and the guest reusing a page. In contrast to pre commit `71994620bb` ("virtio_balloon: replace oom notifier with shrinker"), don't add a module parameter to configure the number of pages to deflate on OOM. Can be re-added if really needed. Also, pay attention that leak_balloon() returns the number of 4k pages - convert it properly in virtio_balloon_oom_notify(). Testing done by Tyler for future reference: Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42 GB file full of random bytes that we continually cat to /dev/null. This fills the page cache as the file is read. Meanwhile, we trigger the balloon to inflate, with a target size of 53 GB. This setup causes the balloon inflation to pressure the page cache as the page cache is also trying to grow. Afterwards we shrink the balloon back to zero (so total deflate == total inflate). Without this patch (kernel 4.19.0-5): Inflation never reaches the target until we stop the "cat file > /dev/null" process. Total inflation time was 542 seconds. The longest period that made no net forward progress was 315 seconds. Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 154828377 balloon_deflate 154828377 With this patch (kernel 5.6.0-rc4+): Total inflation duration was 63 seconds. No deflate-queue activity occurs when pressuring the page-cache. Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 12968539 balloon_deflate 12968539 Conclusion: This patch fixes the issue. In the test it reduced inflate/deflate activity by 12x, and reduced inflation time by 8.6x. But more importantly, if we hadn't killed the "cat file > /dev/null" process then, without the patch, the inflation process would never reach the target. [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html Link: http://lkml.kernel.org/r/20200311135523.18512-2-david@redhat.com Fixes: `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Tyler Sanderson <tysand@google.com> Tested-by: Tyler Sanderson <tysand@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Nadav Amit <namit@vmware.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:39 -07:00
Alexander Duyck	b0c504f154	virtio-balloon: add support for providing free page reports to host Add support for the page reporting feature provided by virtio-balloon. Reporting differs from the regular balloon functionality in that is is much less durable than a standard memory balloon. Instead of creating a list of pages that cannot be accessed the pages are only inaccessible while they are being indicated to the virtio interface. Once the interface has acknowledged them they are placed back into their respective free lists and are once again accessible by the guest system. Unlike a standard balloon we don't inflate and deflate the pages. Instead we perform the reporting, and once the reporting is completed it is assumed that the page has been dropped from the guest and will be faulted back in the next time the page is accessed. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:39 -07:00
Alexander Duyck	d74b78fabe	virtio-balloon: pull page poisoning config out of free page hinting Currently the page poisoning setting wasn't being enabled unless free page hinting was enabled. However we will need the page poisoning tracking logic as well for free page reporting. As such pull it out and make it a separate bit of config in the probe function. In addition we need to add support for the more recent init_on_free feature which expects a behavior similar to page poisoning in that we expect the page to be pre-zeroed. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224646.29318.695.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:38 -07:00
Michael S. Tsirkin	835a6a649d	virtio-balloon: Revert "virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM" This reverts commit `5a6b4cc5b7`. It has been queued properly in the akpm tree, this version is just creating conflicts. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-07 05:44:57 -04:00
Michael S. Tsirkin	c9b9f5f8c0	vdpa: move to drivers/vdpa We have both vhost and virtio drivers that depend on vdpa. It's easier to locate it at a top level directory otherwise we run into issues e.g. if vhost is built-in but virtio is modular. Let's just move it up a level. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-02 10:41:40 -04:00
Zhu Lingshan	5a2414bc45	virtio: Intel IFC VF driver for VDPA This commit introduced two layers to drive IFC VF: (1) ifcvf_base layer, which handles IFC VF NIC hardware operations and configurations. (2) ifcvf_main layer, which complies to VDPA bus framework, implemented device operations for VDPA bus, handles device probe, bus attaching, vring operations, etc. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Signed-off-by: Bie Tiwei <tiwei.bie@intel.com> Signed-off-by: Wang Xiao <xiao.w.wang@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-10-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-02 10:41:40 -04:00
Jason Wang	2c53d0f64c	vdpasim: vDPA device simulator This patch implements a software vDPA networking device. The datapath is implemented through vringh and workqueue. The device has an on-chip IOMMU which translates IOVA to PA. For kernel virtio drivers, vDPA simulator driver provides dma_ops. For vhost driers, set_map() methods of vdpa_config_ops is implemented to accept mappings from vhost. Currently, vDPA device simulator will loopback TX traffic to RX. So the main use case for the device is vDPA feature testing, prototyping and development. Note, there's no management API implemented, a vDPA device will be registered once the module is probed. We need to handle this in the future development. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-9-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-02 10:41:40 -04:00
Jason Wang	c043b4a8cf	virtio: introduce a vDPA based transport This patch introduces a vDPA transport for virtio. This is used to use kernel virtio driver to drive the vDPA device that is capable of populating virtqueue directly. A new virtio-vdpa driver will be registered to the vDPA bus, when a new virtio-vdpa device is probed, it will register the device with vdpa based config ops. This means it is a software transport between vDPA driver and vDPA device. The transport was implemented through bus_ops of vDPA parent. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-7-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-01 12:06:26 -04:00
Jason Wang	961e9c8407	vDPA: introduce vDPA bus vDPA device is a device that uses a datapath which complies with the virtio specifications with vendor specific control path. vDPA devices can be both physically located on the hardware or emulated by software. vDPA hardware devices are usually implemented through PCIE with the following types: - PF (Physical Function) - A single Physical Function - VF (Virtual Function) - Device that supports single root I/O virtualization (SR-IOV). Its Virtual Function (VF) represents a virtualized instance of the device that can be assigned to different partitions - ADI (Assignable Device Interface) and its equivalents - With technologies such as Intel Scalable IOV, a virtual device (VDEV) composed by host OS utilizing one or more ADIs. Or its equivalent like SF (Sub function) from Mellanox. >From a driver's perspective, depends on how and where the DMA translation is done, vDPA devices are split into two types: - Platform specific DMA translation - From the driver's perspective, the device can be used on a platform where device access to data in memory is limited and/or translated. An example is a PCIE vDPA whose DMA request was tagged via a bus (e.g PCIE) specific way. DMA translation and protection are done at PCIE bus IOMMU level. - Device specific DMA translation - The device implements DMA isolation and protection through its own logic. An example is a vDPA device which uses on-chip IOMMU. To hide the differences and complexity of the above types for a vDPA device/IOMMU options and in order to present a generic virtio device to the upper layer, a device agnostic framework is required. This patch introduces a software vDPA bus which abstracts the common attributes of vDPA device, vDPA bus driver and the communication method (vdpa_config_ops) between the vDPA device abstraction and the vDPA bus driver. This allows multiple types of drivers to be used for vDPA device like the virtio_vdpa and vhost_vdpa driver to operate on the bus and allow vDPA device could be used by either kernel virtio driver or userspace vhost drivers as: virtio drivers vhost drivers \| \| [virtio bus] [vhost uAPI] \| \| virtio device vhost device virtio_vdpa drv vhost_vdpa drv \ / [vDPA bus] \| vDPA device hardware drv \| [hardware bus] \| vDPA hardware With the abstraction of vDPA bus and vDPA bus operations, the difference and complexity of the under layer hardware is hidden from upper layer. The vDPA bus drivers on top can use a unified vdpa_config_ops to control different types of vDPA device. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200326140125.19794-6-jasowang@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-04-01 12:06:26 -04:00
David Hildenbrand	5a6b4cc5b7	virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM Commit `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") changed the behavior when deflation happens automatically. Instead of deflating when called by the OOM handler, the shrinker is used. However, the balloon is not simply some slab cache that should be shrunk when under memory pressure. The shrinker does not have a concept of priorities, so this behavior cannot be configured. There was a report that this results in undesired side effects when inflating the balloon to shrink the page cache. [1] "When inflating the balloon against page cache (i.e. no free memory remains) vmscan.c will both shrink page cache, but also invoke the shrinkers -- including the balloon's shrinker. So the balloon driver allocates memory which requires reclaim, vmscan gets this memory by shrinking the balloon, and then the driver adds the memory back to the balloon. Basically a busy no-op." The name "deflate on OOM" makes it pretty clear when deflation should happen - after other approaches to reclaim memory failed, not while reclaiming. This allows to minimize the footprint of a guest - memory will only be taken out of the balloon when really needed. Especially, a drop_slab() will result in the whole balloon getting deflated - undesired. While handling it via the OOM handler might not be perfect, it keeps existing behavior. If we want a different behavior, then we need a new feature bit and document it properly (although, there should be a clear use case and the intended effects should be well described). Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because this has no such side effects. Always register the shrinker with VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free pages that are still to be processed by the guest. The hypervisor takes care of identifying and resolving possible races between processing a hinting request and the guest reusing a page. In contrast to pre commit `71994620bb` ("virtio_balloon: replace oom notifier with shrinker"), don't add a moodule parameter to configure the number of pages to deflate on OOM. Can be re-added if really needed. Also, pay attention that leak_balloon() returns the number of 4k pages - convert it properly in virtio_balloon_oom_notify(). Note1: using the OOM handler is frowned upon, but it really is what we need for this feature. Note2: without VIRTIO_BALLOON_F_MUST_TELL_HOST (iow, always with QEMU) we could actually skip sending deflation requests to our hypervisor, making the OOM path very simple. Besically freeing pages and updating the balloon. If the communication with the host ever becomes a problem on this call path. [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html Test report by Tyler Sanderson: Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42 GB file full of random bytes that we continually cat to /dev/null. This fills the page cache as the file is read. Meanwhile we trigger the balloon to inflate, with a target size of 53 GB. This setup causes the balloon inflation to pressure the page cache as the page cache is also trying to grow. Afterwards we shrink the balloon back to zero (so total deflate = total inflate). Without patch (kernel 4.19.0-5): Inflation never reaches the target until we stop the "cat file > /dev/null" process. Total inflation time was 542 seconds. The longest period that made no net forward progress was 315 seconds (see attached graph). Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 154828377 balloon_deflate 154828377 With patch (kernel 5.6.0-rc4+): Total inflation duration was 63 seconds. No deflate-queue activity occurs when pressuring the page-cache. Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 12968539 balloon_deflate 12968539 Conclusion: This patch fixes the issue. In the test it reduced inflate/deflate activity by 12x, and reduced inflation time by 8.6x. But more importantly, if we hadn't killed the "grep balloon /proc/vmstat" process then, without the patch, the inflation process would never reach the target. Attached [1] is a png of a graph showing the problematic behavior without this patch. It shows deflate-queue activity increasing linearly while balloon size stays constant over the course of more than 8 minutes of the test. [1] https://lore.kernel.org/linux-mm/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com/2-without_patch.png Full test report and discussion [2]: [2] https://lore.kernel.org/r/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com Tested-by: Tyler Sanderson <tysand@google.com> Reported-by: Tyler Sanderson <tysand@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Nadav Amit <namit@vmware.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200205163402.42627-4-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-03-23 09:50:02 -04:00
Nathan Chancellor	6ae4edab2f	virtio_balloon: Adjust label in virtballoon_probe Clang warns when CONFIG_BALLOON_COMPACTION is unset: ../drivers/virtio/virtio_balloon.c:963:1: warning: unused label 'out_del_vqs' [-Wunused-label] out_del_vqs: ^~~~~~~~~~~~ 1 warning generated. Move the label within the preprocessor block since it is only used when CONFIG_BALLOON_COMPACTION is set. Fixes: `1ad6f58ea9` ("virtio_balloon: Fix memory leaks on errors in virtballoon_probe()") Link: https://github.com/ClangBuiltLinux/linux/issues/886 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Link: https://lore.kernel.org/r/20200216004039.23464-1-natechancellor@gmail.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>	2020-03-08 05:35:24 -04:00
Suman Anna	f13f09a12c	virtio_ring: Fix mem leak with vring_new_virtqueue() The functions vring_new_virtqueue() and __vring_new_virtqueue() are used with split rings, and any allocations within these functions are managed outside of the .we_own_ring flag. The commit `cbeedb72b9` ("virtio_ring: allocate desc state for split ring separately") allocates the desc state within the __vring_new_virtqueue() but frees it only when the .we_own_ring flag is set. This leads to a memory leak when freeing such allocated virtqueues with the vring_del_virtqueue() function. Fix this by moving the desc_state free code outside the flag and only for split rings. Issue was discovered during testing with remoteproc and virtio_rpmsg. Fixes: `cbeedb72b9` ("virtio_ring: allocate desc state for split ring separately") Signed-off-by: Suman Anna <s-anna@ti.com> Link: https://lore.kernel.org/r/20200224212643.30672-1-s-anna@ti.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>	2020-03-08 05:35:23 -04:00
David Hildenbrand	1ad6f58ea9	virtio_balloon: Fix memory leaks on errors in virtballoon_probe() We forget to put the inode and unmount the kernfs used for compaction. Fixes: `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Liang Li <liang.z.li@intel.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200205163402.42627-3-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-02-06 03:40:27 -05:00
David Hildenbrand	6c22dc61c7	virtio-balloon: Fix memory leak when unloading while hinting is in progress When unloading the driver while hinting is in progress, we will not release the free page blocks back to MM, resulting in a memory leak. Fixes: `86a559787e` ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT") Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Liang Li <liang.z.li@intel.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200205163402.42627-2-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-02-06 03:40:26 -05:00
Michael S. Tsirkin	6e9826e772	virtio_balloon: prevent pfn array overflow Make sure, at build time, that pfn array is big enough to hold a single page. It happens to be true since the PAGE_SHIFT value at the moment is 20, which is 1M - exactly 256 4K balloon pages. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>	2020-02-06 03:40:26 -05:00
Daniel Verkamp	303090b513	virtio-pci: check name when counting MSI-X vectors VQs without a name specified are not valid; they are skipped in the later loop that assigns MSI-X vectors to queues, but the per_vq_vectors loop above that counts the required number of vectors previously still counted any queue with a non-NULL callback as needing a vector. Add a check to the per_vq_vectors loop so that vectors with no name are not counted to make the two loops consistent. This prevents over-counting unnecessary vectors (e.g. for features which were not negotiated with the device). Cc: stable@vger.kernel.org Fixes: `86a559787e` ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT") Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Wang, Wei W <wei.w.wang@intel.com>	2020-02-06 03:40:26 -05:00
Daniel Verkamp	5790b53390	virtio-balloon: initialize all vq callbacks Ensure that elements of the callbacks array that correspond to unavailable features are set to NULL; previously, they would be left uninitialized. Since the corresponding names array elements were explicitly set to NULL, the uninitialized callback pointers would not actually be dereferenced; however, the uninitialized callbacks elements would still be read in vp_find_vqs_msix() and used to calculate the number of MSI-X vectors required. Cc: stable@vger.kernel.org Fixes: `86a559787e` ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT") Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-02-06 03:40:26 -05:00
Yangtao Li	c64eb62cfc	virtio-mmio: convert to devm_platform_ioremap_resource Use devm_platform_ioremap_resource() to simplify code, which contains platform_get_resource, devm_request_mem_region and devm_ioremap. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-02-06 03:40:26 -05:00
Michael S. Tsirkin	63b9b80e9f	virtio_balloon: divide/multiply instead of shifts We managed to get confused about the shift direction at least once. Let's switch to division/multiplcation instead. Add a number of pages macro for this purpose. We still keep the order macro around too since this is what alloc/free pages want. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com>	2019-12-11 08:14:07 -05:00
Michael S. Tsirkin	2a946fa1c8	virtio_balloon: name cleanups free_page_order is a confusing name. It's not a page order actually, it's the order of the block of memory we are hinting. Rename to hint_block_order. Also, rename SIZE to BYTES to make it clear it's the block size in bytes. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com>	2019-12-11 08:14:06 -05:00
David Hildenbrand	63341ab037	virtio-balloon: fix managed page counts when migrating pages between zones In case we have to migrate a ballon page to a newpage of another zone, the managed page count of both zones is wrong. Paired with memory offlining (which will adjust the managed page count), we can trigger kernel crashes and all kinds of different symptoms. One way to reproduce: 1. Start a QEMU guest with 4GB, no NUMA 2. Hotplug a 1GB DIMM and online the memory to ZONE_NORMAL 3. Inflate the balloon to 1GB 4. Unplug the DIMM (be quick, otherwise unmovable data ends up on it) 5. Observe /proc/zoneinfo Node 0, zone Normal pages free 16810 min 24848885473806 low 18471592959183339 high 36918337032892872 spanned 262144 present 262144 managed 18446744073709533486 6. Do anything that requires some memory (e.g., inflate the balloon some more). The OOM goes crazy and the system crashes [ 238.324946] Out of memory: Killed process 537 (login) total-vm:27584kB, anon-rss:860kB, file-rss:0kB, shmem-rss:00 [ 238.338585] systemd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [ 238.339420] CPU: 0 PID: 1 Comm: systemd Tainted: G D W 5.4.0-next-20191204+ #75 [ 238.340139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 238.341121] Call Trace: [ 238.341337] dump_stack+0x8f/0xd0 [ 238.341630] dump_header+0x61/0x5ea [ 238.341942] oom_kill_process.cold+0xb/0x10 [ 238.342299] out_of_memory+0x24d/0x5a0 [ 238.342625] __alloc_pages_slowpath+0xd12/0x1020 [ 238.343024] __alloc_pages_nodemask+0x391/0x410 [ 238.343407] pagecache_get_page+0xc3/0x3a0 [ 238.343757] filemap_fault+0x804/0xc30 [ 238.344083] ? ext4_filemap_fault+0x28/0x42 [ 238.344444] ext4_filemap_fault+0x30/0x42 [ 238.344789] __do_fault+0x37/0x1a0 [ 238.345087] __handle_mm_fault+0x104d/0x1ab0 [ 238.345450] handle_mm_fault+0x169/0x360 [ 238.345790] do_user_addr_fault+0x20d/0x490 [ 238.346154] do_page_fault+0x31/0x210 [ 238.346468] async_page_fault+0x43/0x50 [ 238.346797] RIP: 0033:0x7f47eba4197e [ 238.347110] Code: Bad RIP value. [ 238.347387] RSP: 002b:00007ffd7c0c1890 EFLAGS: 00010293 [ 238.347834] RAX: 0000000000000002 RBX: 000055d196a20a20 RCX: 00007f47eba4197e [ 238.348437] RDX: 0000000000000033 RSI: 00007ffd7c0c18c0 RDI: 0000000000000004 [ 238.349047] RBP: 00007ffd7c0c1c20 R08: 0000000000000000 R09: 0000000000000033 [ 238.349660] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001 [ 238.350261] R13: ffffffffffffffff R14: 0000000000000000 R15: 00007ffd7c0c18c0 [ 238.350878] Mem-Info: [ 238.351085] active_anon:3121 inactive_anon:51 isolated_anon:0 [ 238.351085] active_file:12 inactive_file:7 isolated_file:0 [ 238.351085] unevictable:0 dirty:0 writeback:0 unstable:0 [ 238.351085] slab_reclaimable:5565 slab_unreclaimable:10170 [ 238.351085] mapped:3 shmem:111 pagetables:155 bounce:0 [ 238.351085] free:720717 free_pcp:2 free_cma:0 [ 238.353757] Node 0 active_anon:12484kB inactive_anon:204kB active_file:48kB inactive_file:28kB unevictable:0kB iss [ 238.355979] Node 0 DMA free:11556kB min:36kB low:48kB high:60kB reserved_highatomic:0KB active_anon:152kB inactivB [ 238.358345] lowmem_reserve[]: 0 2955 2884 2884 2884 [ 238.358761] Node 0 DMA32 free:2677864kB min:7004kB low:10028kB high:13052kB reserved_highatomic:0KB active_anon:0B [ 238.361202] lowmem_reserve[]: 0 0 72057594037927865 72057594037927865 72057594037927865 [ 238.361888] Node 0 Normal free:193448kB min:99395541895224kB low:73886371836733356kB high:147673348131571488kB reB [ 238.364765] lowmem_reserve[]: 0 0 0 0 0 [ 238.365101] Node 0 DMA: 74kB (U) 58kB (UE) 616kB (UME) 232kB (UM) 164kB (U) 2128kB (UE) 3256kB (UME) 2512B [ 238.366379] Node 0 DMA32: 04kB 18kB (U) 216kB (UM) 232kB (UM) 264kB (UM) 1128kB (U) 1256kB (U) 1512kB (U)B [ 238.367654] Node 0 Normal: 19854kB (UME) 13218kB (UME) 84416kB (UME) 52432kB (UME) 30064kB (UME) 138128kB (B [ 238.369184] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 238.369915] 130 total pagecache pages [ 238.370241] 0 pages in swap cache [ 238.370533] Swap cache stats: add 0, delete 0, find 0/0 [ 238.370981] Free swap = 0kB [ 238.371239] Total swap = 0kB [ 238.371488] 1048445 pages RAM [ 238.371756] 0 pages HighMem/MovableOnly [ 238.372090] 306992 pages reserved [ 238.372376] 0 pages cma reserved [ 238.372661] 0 pages hwpoisoned In another instance (older kernel), I was able to observe this (negative page count :/): [ 180.896971] Offlined Pages 32768 [ 182.667462] Offlined Pages 32768 [ 184.408117] Offlined Pages 32768 [ 186.026321] Offlined Pages 32768 [ 187.684861] Offlined Pages 32768 [ 189.227013] Offlined Pages 32768 [ 190.830303] Offlined Pages 32768 [ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009 In another instance (older kernel), I was no longer able to start any process: [root@vm ~]# [ 214.348068] Offlined Pages 32768 [ 215.973009] Offlined Pages 32768 cat /proc/meminfo -bash: fork: Cannot allocate memory [root@vm ~]# cat /proc/meminfo -bash: fork: Cannot allocate memory Fix it by properly adjusting the managed page count when migrating if the zone changed. The managed page count of the zones now looks after unplug of the DIMM (and after deflating the balloon) just like before inflating the balloon (and plugging+onlining the DIMM). We'll temporarily modify the totalram page count. If this ever becomes a problem, we can fine tune by providing helpers that don't touch the totalram pages (e.g., adjust_zone_managed_page_count()). Please note that fixing up the managed page count is only necessary when we adjusted the managed page count when inflating - only if we don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the managed page count is not touched when inflating/deflating. Reported-by: Yumei Huang <yuhuang@redhat.com> Fixes: `3dcc0571cd` ("mm: correctly update zone->managed_pages") Cc: <stable@vger.kernel.org> # v3.11+ Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-12-11 08:14:06 -05:00
Wei Wang	c9a6820fc0	virtio_balloon: fix shrinker count Instead of multiplying by page order, virtio balloon divided by page order. The result is that it can return 0 if there are a bit less than MAX_ORDER - 1 pages in use, and then shrinker scan won't be called. Cc: stable@vger.kernel.org Fixes: `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>	2019-11-20 02:15:57 -05:00
Michael S. Tsirkin	60bd04f258	virtio_balloon: fix shrinker scan number of pages virtio_balloon_shrinker_scan should return number of system pages freed, but because it's calling functions that deal with balloon pages, it gets confused and sometimes returns the number of balloon pages. It does not matter practically as the exact number isn't used, but it seems better to be consistent in case someone starts using this API. Further, if we ever tried to iteratively leak pages as virtio_balloon_shrinker_scan tries to do, we'd run into issues - this is because freed_pages was accumulating total freed pages, but was also subtracted on each iteration from pages_to_free, which can result in either leaking less memory than we were supposed to free, or more if pages_to_free underruns. On a system with 4K pages we are lucky that we are never asked to leak more than 128 pages while we can leak up to 256 at a time, but it looks like a real issue for systems with page size != 4K. Fixes: `71994620bb` ("virtio_balloon: replace oom notifier with shrinker") Reported-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-11-20 02:15:57 -05:00
Halil Pasic	f7728002c1	virtio_ring: fix return code on DMA mapping fails Commit `780bc7903a` ("virtio_ring: Support DMA APIs") makes virtqueue_add() return -EIO when we fail to map our I/O buffers. This is a very realistic scenario for guests with encrypted memory, as swiotlb may run out of space, depending on it's size and the I/O load. The virtio-blk driver interprets -EIO form virtqueue_add() as an IO error, despite the fact that swiotlb full is in absence of bugs a recoverable condition. Let us change the return code to -ENOMEM, and make the block layer recover form these failures when virtio-blk encounters the condition described above. Cc: stable@vger.kernel.org Fixes: `780bc7903a` ("virtio_ring: Support DMA APIs") Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-11-19 05:13:49 -05:00
Marvin Liu	40ce7919d8	virtio_ring: fix stalls for packed rings When VIRTIO_F_RING_EVENT_IDX is negotiated, virtio devices can use virtqueue_enable_cb_delayed_packed to reduce the number of device interrupts. At the moment, this is the case for virtio-net when the napi_tx module parameter is set to false. In this case, the virtio driver selects an event offset and expects that the device will send a notification when rolling over the event offset in the ring. However, if this roll-over happens before the event suppression structure update, the notification won't be sent. To address this race condition the driver needs to check wether the device rolled over the offset after updating the event suppression structure. With VIRTIO_F_RING_PACKED, the virtio driver did this by reading the flags field of the descriptor at the specified offset. Unfortunately, checking at the event offset isn't reliable: if descriptors are chained (e.g. when INDIRECT is off) not all descriptors are overwritten by the device, so it's possible that the device skipped the specific descriptor driver is checking when writing out used descriptors. If this happens, the driver won't detect the race condition and will incorrectly expect the device to send a notification. For virtio-net, the result will be a TX queue stall, with the transmission getting blocked forever. With the packed ring, it isn't easy to find a location which is guaranteed to change upon the roll-over, except the next device descriptor, as described in the spec: Writes of device and driver descriptors can generally be reordered, but each side (driver and device) are only required to poll (or test) a single location in memory: the next device descriptor after the one they processed previously, in circular order. while this might be sub-optimal, let's do exactly this for now. Cc: stable@vger.kernel.org Cc: Jason Wang <jasowang@redhat.com> Fixes: `f51f982682` ("virtio_ring: leverage event idx in packed ring") Signed-off-by: Marvin Liu <yong.liu@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-10-28 04:24:46 -04:00
Matthias Lange	cf8f169670	virtio_ring: fix unmap of indirect descriptors The function virtqueue_add_split() DMA-maps the scatterlist buffers. In case a mapping error occurs the already mapped buffers must be unmapped. This happens by jumping to the 'unmap_release' label. In case of indirect descriptors the release is wrong and may leak kernel memory. Because the implementation assumes that the head descriptor is already mapped it starts iterating over the descriptor list starting from the head descriptor. However for indirect descriptors the head descriptor is never mapped in case of an error. The fix is to initialize the start index with zero in case of indirect descriptors and use the 'desc' pointer directly for iterating over the descriptor chain. Signed-off-by: Matthias Lange <matthias.lange@kernkonzept.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2019-09-09 10:43:15 -04:00
Linus Torvalds	933a90bf4f	Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ...	2019-07-19 10:42:02 -07:00

1 2 3 4 5 ...

504 Commits