Commit Graph

8183 Commits

Author SHA1 Message Date
Dan Williams 284ce4011b x86/memory_failure: Introduce {set, clear}_mce_nospec()
Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
UC, to hide it from speculative accesses.

Given that that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-08-20 09:22:45 -07:00
Vlastimil Babka 9df9516940 x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
On 32bit PAE kernels on 64bit hardware with enough physical bits,
l1tf_pfn_limit() will overflow unsigned long. This in turn affects
max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
observed in a 32bit guest with 42 bits physical address size, where
max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
the following warning to dmesg:

[    6.396845] Truncating oversized swap area, only using 0k out of 2047996k

Fix this by using unsigned long long instead.

Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Fixes: 377eeaa8e1 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Reported-by: Dominique Leuenberger <dimstar@suse.de>
Reported-by: Adrian Schroeter <adrian@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
2018-08-20 18:04:42 +02:00
Arnd Bergmann 704ae091b0 x86/mce: Add notifier_block forward declaration
Without linux/irq.h, there is no declaration of notifier_block, leading to
a build warning:

In file included from arch/x86/kernel/cpu/mcheck/threshold.c:10:
arch/x86/include/asm/mce.h:151:46: error: 'struct notifier_block' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]

It's sufficient to declare the struct tag here, which avoids pulling in
more header files.

Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolai Stange <nstange@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180817100156.3009043-1-arnd@arndb.de
2018-08-20 18:04:42 +02:00
Linus Torvalds e61cf2e3a5 Minor code cleanups for PPC.
For x86 this brings in PCID emulation and CR3 caching for shadow page
 tables, nested VMX live migration, nested VMCS shadowing, an optimized
 IPI hypercall, and some optimizations.
 
 ARM will come next week.
 
 There is a semantic conflict because tip also added an .init_platform
 callback to kvm.c.  Please keep the initializer from this branch,
 and add a call to kvmclock_init (added by tip) inside kvm_init_platform
 (added here).
 
 Also, there is a backmerge from 4.18-rc6.  This is because of a
 refactoring that conflicted with a relatively late bugfix and
 resulted in a particularly hellish conflict.  Because the conflict
 was only due to unfortunate timing of the bugfix, I backmerged and
 rebased the refactoring rather than force the resolution on you.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJbdwNFAAoJEL/70l94x66DiPEH/1cAGZWGd85Y3yRu1dmTmqiz
 kZy0V+WTQ5kyJF4ZsZKKOp+xK7Qxh5e9kLdTo70uPZCHwLu9IaGKN9+dL9Jar3DR
 yLPX5bMsL8UUed9g9mlhdaNOquWi7d7BseCOnIyRTolb+cqnM5h3sle0gqXloVrS
 UQb4QogDz8+86czqR8tNfazjQRKW/D2HEGD5NDNVY1qtpY+leCDAn9/u6hUT5c6z
 EtufgyDh35UN+UQH0e2605gt3nN3nw3FiQJFwFF1bKeQ7k5ByWkuGQI68XtFVhs+
 2WfqL3ftERkKzUOy/WoSJX/C9owvhMcpAuHDGOIlFwguNGroZivOMVnACG1AI3I=
 =9Mgw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull first set of KVM updates from Paolo Bonzini:
 "PPC:
   - minor code cleanups

  x86:
   - PCID emulation and CR3 caching for shadow page tables
   - nested VMX live migration
   - nested VMCS shadowing
   - optimized IPI hypercall
   - some optimizations

  ARM will come next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
  kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
  KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
  KVM: X86: Implement PV IPIs in linux guest
  KVM: X86: Add kvm hypervisor init time platform setup callback
  KVM: X86: Implement "send IPI" hypercall
  KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
  KVM: x86: Skip pae_root shadow allocation if tdp enabled
  KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
  KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
  KVM: vmx: move struct host_state usage to struct loaded_vmcs
  KVM: vmx: compute need to reload FS/GS/LDT on demand
  KVM: nVMX: remove a misleading comment regarding vmcs02 fields
  KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
  KVM: vmx: add dedicated utility to access guest's kernel_gs_base
  KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
  KVM: vmx: refactor segmentation code in vmx_save_host_state()
  kvm: nVMX: Fix fault priority for VMX operations
  kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
  ...
2018-08-19 10:38:36 -07:00
Linus Torvalds d5acba26bf Char/Misc driver patches for 4.19-rc1
Here is the bit set of char/misc drivers for 4.19-rc1
 
 There is a lot here, much more than normal, seems like everyone is
 writing new driver subsystems these days...  Anyway, major things here
 are:
 	- new FSI driver subsystem, yet-another-powerpc low-level
 	  hardware bus
 	- gnss, finally an in-kernel GPS subsystem to try to tame all of
 	  the crazy out-of-tree drivers that have been floating around
 	  for years, combined with some really hacky userspace
 	  implementations.  This is only for GNSS receivers, but you
 	  have to start somewhere, and this is great to see.
 Other than that, there are new slimbus drivers, new coresight drivers,
 new fpga drivers, and loads of DT bindings for all of these and existing
 drivers.
 
 Full details of everything is in the shortlog.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g7ew8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykfBgCeOG0RkSI92XVZe0hs/QYFW9kk8JYAnRBf3Qpm
 cvW7a+McOoKz/MGmEKsi
 =TNfn
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the bit set of char/misc drivers for 4.19-rc1

  There is a lot here, much more than normal, seems like everyone is
  writing new driver subsystems these days... Anyway, major things here
  are:

   - new FSI driver subsystem, yet-another-powerpc low-level hardware
     bus

   - gnss, finally an in-kernel GPS subsystem to try to tame all of the
     crazy out-of-tree drivers that have been floating around for years,
     combined with some really hacky userspace implementations. This is
     only for GNSS receivers, but you have to start somewhere, and this
     is great to see.

  Other than that, there are new slimbus drivers, new coresight drivers,
  new fpga drivers, and loads of DT bindings for all of these and
  existing drivers.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
  android: binder: Rate-limit debug and userspace triggered err msgs
  fsi: sbefifo: Bump max command length
  fsi: scom: Fix NULL dereference
  misc: mic: SCIF Fix scif_get_new_port() error handling
  misc: cxl: changed asterisk position
  genwqe: card_base: Use true and false for boolean values
  misc: eeprom: assignment outside the if statement
  uio: potential double frees if __uio_register_device() fails
  eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
  misc: ti-st: Fix memory leak in the error path of probe()
  android: binder: Show extra_buffers_size in trace
  firmware: vpd: Fix section enabled flag on vpd_section_destroy
  platform: goldfish: Retire pdev_bus
  goldfish: Use dedicated macros instead of manual bit shifting
  goldfish: Add missing includes to goldfish.h
  mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
  dt-bindings: mux: add adi,adgs1408
  Drivers: hv: vmbus: Cleanup synic memory free path
  Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
  Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
  ...
2018-08-18 11:04:51 -07:00
Sean Christopherson f19f5c49bb x86/speculation/l1tf: Exempt zeroed PTEs from inversion
It turns out that we should *not* invert all not-present mappings,
because the all zeroes case is obviously special.

clear_page() does not undergo the XOR logic to invert the address bits,
i.e. PTE, PMD and PUD entries that have not been individually written
will have val=0 and so will trigger __pte_needs_invert(). As a result,
{pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
(adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
because the page at physical address 0 is reserved early in boot
specifically to mitigate L1TF, so explicitly exempt them from the
inversion when reading the PFN.

Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
on a VMA that has VM_PFNMAP and was mmap'd to as something other than
PROT_NONE but never used. mprotect() sends the PROT_NONE request down
prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.

[ This is a very modified version of Sean's original patch, but all
  credit goes to Sean for doing this and also pointing out that
  sometimes the __pte_needs_invert() function only gets the protection
  bits, not the full eventual pte.  But zero remains special even in
  just protection bits, so that's ok.   - Linus ]

Fixes: f22cc87f6c ("x86/speculation/l1tf: Invert all not present mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 10:27:36 -07:00
Linus Torvalds 4e31843f68 pci-v4.19-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
 TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
 VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
 M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
 ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
 kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
 BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
 3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
 ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
 FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
 0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
 XkEme7cYJc8MGsA=
 =2Dmn
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci updates from Bjorn Helgaas:

 - Decode AER errors with names similar to "lspci" (Tyler Baicar)

 - Expose AER statistics in sysfs (Rajat Jain)

 - Clear AER status bits selectively based on the type of recovery (Oza
   Pawandeep)

 - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
   Gagniuc)

 - Don't clear AER status bits if we're using the "Firmware-First"
   strategy where firmware owns the registers (Alexandru Gagniuc)

 - Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
   Shevchenko)

 - Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)

 - Defer DPC event handling to work queue (Keith Busch)

 - Use threaded IRQ for DPC bottom half (Keith Busch)

 - Print AER status while handling DPC events (Keith Busch)

 - Work around IDT switch ACS Source Validation erratum (James
   Puthukattukaran)

 - Emit diagnostics for all cases of PCIe Link downtraining (Links
   operating slower than they're capable of) (Alexandru Gagniuc)

 - Skip VFs when configuring Max Payload Size (Myron Stowe)

 - Reduce Root Port Max Payload Size if necessary when hot-adding a
   device below it (Myron Stowe)

 - Simplify SHPC existence/permission checks (Bjorn Helgaas)

 - Remove hotplug sample skeleton driver (Lukas Wunner)

 - Convert pciehp to threaded IRQ handling (Lukas Wunner)

 - Improve pciehp tolerance of missed events and initially unstable
   links (Lukas Wunner)

 - Clear spurious pciehp events on resume (Lukas Wunner)

 - Add pciehp runtime PM support, including for Thunderbolt controllers
   (Lukas Wunner)

 - Support interrupts from pciehp bridges in D3hot (Lukas Wunner)

 - Mark fall-through switch cases before enabling -Wimplicit-fallthrough
   (Gustavo A. R. Silva)

 - Move DMA-debug PCI init from arch code to PCI core (Christoph
   Hellwig)

 - Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
   supplied (Heiner Kallweit)

 - Unify PCI and DMA direction #defines (Shunyong Yang)

 - Add PCI_DEVICE_DATA() macro (Andy Shevchenko)

 - Check for VPD completion before checking for timeout (Bert Kenward)

 - Limit Netronome NFP5000 config space size to work around erratum
   (Jakub Kicinski)

 - Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)

 - Document ACPI description of PCI host bridges (Bjorn Helgaas)

 - Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
   peer-to-peer DMA support (we don't have the peer-to-peer support yet;
   this is just one piece) (Logan Gunthorpe)

 - Clean up devm_of_pci_get_host_bridge_resources() resource allocation
   (Jan Kiszka)

 - Fixup resizable BARs after suspend/resume (Christian König)

 - Make "pci=earlydump" generic (Sinan Kaya)

 - Fix ROM BAR access routines to stay in bounds and check for signature
   correctly (Rex Zhu)

 - Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)

 - Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)

 - To avoid bus errors, enable PASID only if entire path supports
   End-End TLP prefixes (Sinan Kaya)

 - Unify slot and bus reset functions and remove hotplug knowledge from
   callers (Sinan Kaya)

 - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
   fix guest reboot issues (Alex Williamson)

 - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
   Controller (Bjorn Helgaas)

 - Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)

 - Remove Aardvark outbound window configuration (Evan Wang)

 - Fix Aardvark bridge window sizing issue (Zachary Zhang)

 - Convert Aardvark to use pci_host_probe() to reduce code duplication
   (Thomas Petazzoni)

 - Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)

 - Add Cadence support for optional generic PHYs (Alan Douglas)

 - Add Cadence power management ops (Alan Douglas)

 - Remove redundant variable from Cadence driver (Colin Ian King)

 - Add Kirin MSI support (Xiaowei Song)

 - Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
   armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
   Guo)

 - Move link notification settings from DesignWare core to individual
   drivers (Gustavo Pimentel)

 - Add endpoint library MSI-X interfaces (Gustavo Pimentel)

 - Correct signature of endpoint library IRQ interfaces (Gustavo
   Pimentel)

 - Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)

 - Add endpoint library MSI-X test support (Gustavo Pimentel)

 - Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
   (Jia-Ju Bai)

 - Add more devices to Broadcom PAXC quirk (Ray Jui)

 - Work around corrupted Broadcom PAXC config space to enable SMMU and
   GICv3 ITS (Ray Jui)

 - Disable MSI parsing to work around broken Broadcom PAXC logic in some
   devices (Ray Jui)

 - Hide unconfigured functions to work around a Broadcom PAXC defect
   (Ray Jui)

 - Lower iproc log level to reduce console output during boot (Ray Jui)

 - Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)

 - Fix mobiveil missing include file (Lorenzo Pieralisi)

 - Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)

 - Fix mvebu I/O space remapping issues (Thomas Petazzoni)

 - Use generic pci_host_bridge in mvebu instead of ARM-specific API
   (Thomas Petazzoni)

 - Whitelist VMD devices with fast interrupt handlers to avoid sharing
   vectors with slow handlers (Keith Busch)

* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
  PCI/AER: Don't clear AER bits if error handling is Firmware-First
  PCI: Limit config space size for Netronome NFP5000
  PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
  PCI/VPD: Check for VPD access completion before checking for timeout
  PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
  PCI: Match Root Port's MPS to endpoint's MPSS as necessary
  PCI: Skip MPS logic for Virtual Functions (VFs)
  PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
  PCI: Check for PCIe Link downtraining
  PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
  PCI: Add device-specific ACS Redirect disable infrastructure
  PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
  PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
  PCI: Allow specifying devices using a base bus and path of devfns
  PCI: Make specifying PCI devices in kernel parameters reusable
  PCI: Hide ACS quirk declarations inside PCI core
  PCI: Delay after FLR of Intel DC P3700 NVMe
  PCI: Disable Samsung SM961/PM961 NVMe before FLR
  PCI: Export pcie_has_flr()
  PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
  ...
2018-08-16 09:21:54 -07:00
Guenter Roeck 0a957467c5 x86: i8259: Add missing include file
i8259.h uses inb/outb and thus needs to include asm/io.h to avoid the
following build error, as seen with x86_64:defconfig and CONFIG_SMP=n.

  In file included from drivers/rtc/rtc-cmos.c:45:0:
  arch/x86/include/asm/i8259.h: In function 'inb_pic':
  arch/x86/include/asm/i8259.h:32:24: error:
	implicit declaration of function 'inb'

  arch/x86/include/asm/i8259.h: In function 'outb_pic':
  arch/x86/include/asm/i8259.h:45:2: error:
	implicit declaration of function 'outb'

Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Suggested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-15 13:44:10 -07:00
Linus Torvalds 31130a16d4 xen: features and fixes for 4.19-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW3LkCgAKCRCAXGG7T9hj
 vtyfAQDTMUqfBlpz9XqFyTBTFRkP3aVtnEeE7BijYec+RXPOxwEAsiXwZPsmW/AN
 up+NEHqPvMOcZC8zJZ9THCiBgOxligY=
 =F51X
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - add dma-buf functionality to Xen grant table handling

 - fix for booting the kernel as Xen PVH dom0

 - fix for booting the kernel as a Xen PV guest with
   CONFIG_DEBUG_VIRTUAL enabled

 - other minor performance and style fixes

* tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/balloon: fix balloon initialization for PVH Dom0
  xen: don't use privcmd_call() from xen_mc_flush()
  xen/pv: Call get_cpu_address_sizes to set x86_virt/phys_bits
  xen/biomerge: Use true and false for boolean values
  xen/gntdev: don't dereference a null gntdev_dmabuf on allocation failure
  xen/spinlock: Don't use pvqspinlock if only 1 vCPU
  xen/gntdev: Implement dma-buf import functionality
  xen/gntdev: Implement dma-buf export functionality
  xen/gntdev: Add initial support for dma-buf UAPI
  xen/gntdev: Make private routines/structures accessible
  xen/gntdev: Allow mappings for DMA buffers
  xen/grant-table: Allow allocating buffers suitable for DMA
  xen/balloon: Share common memory reservation routines
  xen/grant-table: Make set/clear page private code shared
2018-08-14 16:54:22 -07:00
Linus Torvalds 958f338e96 Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge L1 Terminal Fault fixes from Thomas Gleixner:
 "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
  engineering trainwreck. It's a hardware vulnerability which allows
  unprivileged speculative access to data which is available in the
  Level 1 Data Cache when the page table entry controlling the virtual
  address, which is used for the access, has the Present bit cleared or
  other reserved bits set.

  If an instruction accesses a virtual address for which the relevant
  page table entry (PTE) has the Present bit cleared or other reserved
  bits set, then speculative execution ignores the invalid PTE and loads
  the referenced data if it is present in the Level 1 Data Cache, as if
  the page referenced by the address bits in the PTE was still present
  and accessible.

  While this is a purely speculative mechanism and the instruction will
  raise a page fault when it is retired eventually, the pure act of
  loading the data and making it available to other speculative
  instructions opens up the opportunity for side channel attacks to
  unprivileged malicious code, similar to the Meltdown attack.

  While Meltdown breaks the user space to kernel space protection, L1TF
  allows to attack any physical memory address in the system and the
  attack works across all protection domains. It allows an attack of SGX
  and also works from inside virtual machines because the speculation
  bypasses the extended page table (EPT) protection mechanism.

  The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

  The mitigations provided by this pull request include:

   - Host side protection by inverting the upper address bits of a non
     present page table entry so the entry points to uncacheable memory.

   - Hypervisor protection by flushing L1 Data Cache on VMENTER.

   - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
     by offlining the sibling CPU threads. The knobs are available on
     the kernel command line and at runtime via sysfs

   - Control knobs for the hypervisor mitigation, related to L1D flush
     and SMT control. The knobs are available on the kernel command line
     and at runtime via sysfs

   - Extensive documentation about L1TF including various degrees of
     mitigations.

  Thanks to all people who have contributed to this in various ways -
  patches, review, testing, backporting - and the fruitful, sometimes
  heated, but at the end constructive discussions.

  There is work in progress to provide other forms of mitigations, which
  might be less horrible performance wise for a particular kind of
  workloads, but this is not yet ready for consumption due to their
  complexity and limitations"

* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  x86/microcode: Allow late microcode loading with SMT disabled
  tools headers: Synchronise x86 cpufeatures.h for L1TF additions
  x86/mm/kmmio: Make the tracer robust against L1TF
  x86/mm/pat: Make set_memory_np() L1TF safe
  x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
  x86/speculation/l1tf: Invert all not present mappings
  cpu/hotplug: Fix SMT supported evaluation
  KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
  x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
  x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
  Documentation/l1tf: Remove Yonah processors from not vulnerable list
  x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
  x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
  x86: Don't include linux/irq.h from asm/hardirq.h
  x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
  x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
  x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
  x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
  x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
  cpu/hotplug: detect SMT disabled by BIOS
  ...
2018-08-14 09:46:06 -07:00
Linus Torvalds 13e091b6dd Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 timer updates from Thomas Gleixner:
 "Early TSC based time stamping to allow better boot time analysis.

  This comes with a general cleanup of the TSC calibration code which
  grew warts and duct taping over the years and removes 250 lines of
  code. Initiated and mostly implemented by Pavel with help from various
  folks"

* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  x86/kvmclock: Mark kvm_get_preset_lpj() as __init
  x86/tsc: Consolidate init code
  sched/clock: Disable interrupts when calling generic_sched_clock_init()
  timekeeping: Prevent false warning when persistent clock is not available
  sched/clock: Close a hole in sched_clock_init()
  x86/tsc: Make use of tsc_calibrate_cpu_early()
  x86/tsc: Split native_calibrate_cpu() into early and late parts
  sched/clock: Use static key for sched_clock_running
  sched/clock: Enable sched clock early
  sched/clock: Move sched clock initialization and merge with generic clock
  x86/tsc: Use TSC as sched clock early
  x86/tsc: Initialize cyc2ns when tsc frequency is determined
  x86/tsc: Calibrate tsc only once
  ARM/time: Remove read_boot_clock64()
  s390/time: Remove read_boot_clock64()
  timekeeping: Default boot time offset to local_clock()
  timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
  s390/time: Add read_persistent_wall_and_boot_offset()
  x86/xen/time: Output xen sched_clock time from 0
  x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
  ...
2018-08-13 18:28:19 -07:00
Linus Torvalds eac3411944 Merge branch 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI updates from Thomas Gleixner:
 "The Speck brigade sadly provides yet another large set of patches
  destroying the perfomance which we carefully built and preserved

   - PTI support for 32bit PAE. The missing counter part to the 64bit
     PTI code implemented by Joerg.

   - A set of fixes for the Global Bit mechanics for non PCID CPUs which
     were setting the Global Bit too widely and therefore possibly
     exposing interesting memory needlessly.

   - Protection against userspace-userspace SpectreRSB

   - Support for the upcoming Enhanced IBRS mode, which is preferred
     over IBRS. Unfortunately we dont know the performance impact of
     this, but it's expected to be less horrible than the IBRS
     hammering.

   - Cleanups and simplifications"

* 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  x86/mm/pti: Move user W+X check into pti_finalize()
  x86/relocs: Add __end_rodata_aligned to S_REL
  x86/mm/pti: Clone kernel-image on PTE level for 32 bit
  x86/mm/pti: Don't clear permissions in pti_clone_pmd()
  x86/mm/pti: Fix 32 bit PCID check
  x86/mm/init: Remove freed kernel image areas from alias mapping
  x86/mm/init: Add helper for freeing kernel image pages
  x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
  mm: Allow non-direct-map arguments to free_reserved_area()
  x86/mm/pti: Clear Global bit more aggressively
  x86/speculation: Support Enhanced IBRS on future CPUs
  x86/speculation: Protect against userspace-userspace spectreRSB
  x86/kexec: Allocate 8k PGDs for PTI
  Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
  x86/mm: Remove in_nmi() warning from vmalloc_fault()
  x86/entry/32: Check for VM86 mode in slow-path check
  perf/core: Make sure the ring-buffer is mapped in all page-tables
  x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
  x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
  x86/entry/32: Add debug code to check entry/exit CR3
  ...
2018-08-13 17:54:17 -07:00
Linus Torvalds 4d5ac4b8ca Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Thomas Gleixner:
 "Two fixes for x86:

   - Provide a declaration for native_save_fl() which unbreaks the
     wreckage caused by making it 'extern inline'.

   - Fix the failing paravirt patching which is supposed to replace
     indirect with direct calls. The wreckage is caused by an incorrect
     clobber test"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/paravirt: Fix spectre-v2 mitigations for paravirt guests
  x86/irqflags: Provide a declaration for native_save_fl
2018-08-13 17:01:03 -07:00
Linus Torvalds 203b4fc903 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Thomas Gleixner:

 - Make lazy TLB mode even lazier to avoid pointless switch_mm()
   operations, which reduces CPU load by 1-2% for memcache workloads

 - Small cleanups and improvements all over the place

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove redundant check for kmem_cache_create()
  arm/asm/tlb.h: Fix build error implicit func declaration
  x86/mm/tlb: Make clear_asid_other() static
  x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
  x86/mm/tlb: Always use lazy TLB mode
  x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Restructure switch_mm_irqs_off()
  x86/mm/tlb: Leave lazy TLB mode at page table free time
  mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
  x86/mm: Add TLB purge to free pmd/pte page interfaces
  ioremap: Update pgtable free interfaces with addr
  x86/mm: Disable ioremap free page handling on x86-PAE
2018-08-13 16:29:35 -07:00
Linus Torvalds f499026456 Merge branch 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/hyper-v update from Thomas Gleixner:
 "Add fast hypercall support for guest running on the Microsoft HyperV(isor)"

* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/hyper-v: Fix wrong merge conflict resolution
  x86/hyper-v: Check for VP_INVAL in hyperv_flush_tlb_others()
  x86/hyper-v: Check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()
  x86/hyper-v: Trace PV IPI send
  x86/hyper-v: Use cheaper HVCALL_SEND_IPI hypercall when possible
  x86/hyper-v: Use 'fast' hypercall for HVCALL_SEND_IPI
  x86/hyper-v: Implement hv_do_fast_hypercall16
  x86/hyper-v: Use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
2018-08-13 15:49:04 -07:00
Linus Torvalds 7796916146 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Thomas Gleixner:
 "Two small updates for the CPU code:

   - Improve NUMA emulation

   - Add the EPT_AD CPU feature bit"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpufeatures: Add EPT_AD feature bit
  x86/numa_emulation: Introduce uniform split capability
  x86/numa_emulation: Fix emulated-to-physical node mapping
2018-08-13 14:41:53 -07:00
Linus Torvalds f24d6f2654 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Thomas Gleixner:
 "The lowlevel and ASM code updates for x86:

   - Make stack trace unwinding more reliable

   - ASM instruction updates for better code generation

   - Various cleanups"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Add two more instruction suffixes
  x86/asm/64: Use 32-bit XOR to zero registers
  x86/build/vdso: Simplify 'cmd_vdso2c'
  x86/build/vdso: Remove unused vdso-syms.lds
  x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder
  x86/unwind/orc: Detect the end of the stack
  x86/stacktrace: Do not fail for ORC with regs on stack
  x86/stacktrace: Clarify the reliable success paths
  x86/stacktrace: Remove STACKTRACE_DUMP_ONCE
  x86/stacktrace: Do not unwind after user regs
  x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation
2018-08-13 13:35:26 -07:00
Linus Torvalds 8603596a32 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf update from Thomas Gleixner:
 "The perf crowd presents:

  Kernel updates:

   - Removal of jprobes

   - Cleanup and consolidatation the handling of kprobes

   - Cleanup and consolidation of hardware breakpoints

   - The usual pile of fixes and updates to PMUs and event descriptors

  Tooling updates:

   - Updates and improvements all over the place. Nothing outstanding,
     just the (good) boring incremental grump work"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  perf trace: Do not require --no-syscalls to suppress strace like output
  perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
  perf tools: Allow overriding MAX_NR_CPUS at compile time
  perf bpf: Show better message when failing to load an object
  perf list: Unify metric group description format with PMU event description
  perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
  perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
  perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
  perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
  perf cs-etm: Fix start tracing packet handling
  perf build: Fix installation directory for eBPF
  perf c2c report: Fix crash for empty browser
  perf tests: Fix indexing when invoking subtests
  perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
  perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
  perf trace beauty: Do not print NULL strarray entries
  perf beauty: Add a generator for IPPROTO_ socket's protocol constants
  tools include uapi: Grab a copy of linux/in.h
  perf tests: Fix complex event name parsing
  perf evlist: Fix error out while applying initial delay and LBR
  ...
2018-08-13 12:55:49 -07:00
Linus Torvalds de5d1b39ea Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking/atomics update from Thomas Gleixner:
 "The locking, atomics and memory model brains delivered:

   - A larger update to the atomics code which reworks the ordering
     barriers, consolidates the atomic primitives, provides the new
     atomic64_fetch_add_unless() primitive and cleans up the include
     hell.

   - Simplify cmpxchg() instrumentation and add instrumentation for
     xchg() and cmpxchg_double().

   - Updates to the memory model and documentation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  locking/atomics: Rework ordering barriers
  locking/atomics: Instrument cmpxchg_double*()
  locking/atomics: Instrument xchg()
  locking/atomics: Simplify cmpxchg() instrumentation
  locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
  tools/memory-model: Rename litmus tests to comply to norm7
  tools/memory-model/Documentation: Fix typo, smb->smp
  sched/Documentation: Update wake_up() & co. memory-barrier guarantees
  locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
  sched/core: Use smp_mb() in wake_woken_function()
  tools/memory-model: Add informal LKMM documentation to MAINTAINERS
  locking/atomics/Documentation: Describe atomic_set() as a write operation
  tools/memory-model: Make scripts executable
  tools/memory-model: Remove ACCESS_ONCE() from model
  tools/memory-model: Remove ACCESS_ONCE() from recipes
  locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
  MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
  tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
  tools/memory-model: Add litmus test for full multicopy atomicity
  locking/refcount: Always allow checked forms
  ...
2018-08-13 12:23:39 -07:00
Joerg Roedel d878efce73 x86/mm/pti: Move user W+X check into pti_finalize()
The user page-table gets the updated kernel mappings in pti_finalize(),
which runs after the RO+X permissions got applied to the kernel page-table
in mark_readonly().

But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
mark_readonly() for insecure mappings.  This causes false-positive
warnings, because the user page-table did not get the updated mappings yet.

Move the W+X check for the user page-table into pti_finalize() after it
updated all required mappings.

[ tglx: Folded !NX supported fix ]

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
2018-08-10 21:12:45 +02:00
Joerg Roedel 6488a7f35e Merge branches 'arm/shmobile', 'arm/renesas', 'arm/msm', 'arm/smmu', 'arm/omap', 'x86/amd', 'x86/vt-d' and 'core' into next 2018-08-08 12:02:27 +02:00
Andi Kleen 0768f91530 x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
Some cases in THP like:
  - MADV_FREE
  - mprotect
  - split

mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.

Use the proper low level functions for pmd/pud_mknotpresent() to address
this.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-08 09:23:44 +02:00
Andi Kleen f22cc87f6c x86/speculation/l1tf: Invert all not present mappings
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.

Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-08 09:23:43 +02:00
Juergen Gross cd9139220b xen: don't use privcmd_call() from xen_mc_flush()
Using privcmd_call() for a singleton multicall seems to be wrong, as
privcmd_call() is using stac()/clac() to enable hypervisor access to
Linux user space.

Even if currently not a problem (pv domains can't use SMAP while HVM
and PVH domains can't use multicalls) things might change when
PVH dom0 support is added to the kernel.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2018-08-07 11:37:01 -04:00
Thomas Gleixner 315706049c Merge branch 'x86/pti-urgent' into x86/pti
Integrate the PTI Global bit fixes which conflict with the 32bit PTI
support.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-06 20:56:34 +02:00
Dave Hansen c40a56a781 x86/mm/init: Remove freed kernel image areas from alias mapping
The kernel image is mapped into two places in the virtual address space
(addresses without KASLR, of course):

	1. The kernel direct map (0xffff880000000000)
	2. The "high kernel map" (0xffffffff81000000)

We actually execute out of #2.  If we get the address of a kernel symbol,
it points to #2, but almost all physical-to-virtual translations point to

Parts of the "high kernel map" alias are mapped in the userspace page
tables with the Global bit for performance reasons.  The parts that we map
to userspace do not (er, should not) have secrets. When PTI is enabled then
the global bit is usually not set in the high mapping and just used to
compensate for poor performance on systems which lack PCID.

This is fine, except that some areas in the kernel image that are adjacent
to the non-secret-containing areas are unused holes.  We free these holes
back into the normal page allocator and reuse them as normal kernel memory.
The memory will, of course, get *used* via the normal map, but the alias
mapping is kept.

This otherwise unused alias mapping of the holes will, by default keep the
Global bit, be mapped out to userspace, and be vulnerable to Meltdown.

Remove the alias mapping of these pages entirely.  This is likely to
fracture the 2M page mapping the kernel image near these areas, but this
should affect a minority of the area.

The pageattr code changes *all* aliases mapping the physical pages that it
operates on (by default).  We only want to modify a single alias, so we
need to tweak its behavior.

This unmapping behavior is currently dependent on PTI being in place.
Going forward, we should at least consider doing this for all
configurations.  Having an extra read-write alias for memory is not exactly
ideal for debugging things like random memory corruption and this does
undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
Frame Ownership (XPFO).

Before this patch:

current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd

  current_user:---[ High Kernel Mapping ]---
  current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
  current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
  current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
  current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
  current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
  current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd

After this patch:

current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
current_kernel-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
current_kernel-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
current_kernel-0xffffffff82488000-0xffffffff82600000        1504K                               pte
current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82c0d000          52K     RW                     NX pte
current_kernel-0xffffffff82c0d000-0xffffffff82dc0000        1740K                               pte

  current_user:---[ High Kernel Mapping ]---
  current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
  current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
  current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
  current_user-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
  current_user-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
  current_user-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
  current_user-0xffffffff82488000-0xffffffff82600000        1504K                               pte
  current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd

[ tglx: Do not unmap on 32bit as there is only one mapping ]

Fixes: 0f561fce4d ("x86/pti: Enable global pages for shared areas")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
2018-08-06 20:54:16 +02:00
Wanpeng Li 4180bf1b65 KVM: X86: Implement "send IPI" hypercall
Using hypercall to send IPIs by one vmexit instead of one by one for
xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster
mode. Intel guest can enter x2apic cluster mode when interrupt remmaping
is enabled in qemu, however, latest AMD EPYC still just supports xapic
mode which can get great improvement by Exit-less IPIs. This patchset
lets a guest send multicast IPIs, with at most 128 destinations per
hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.

Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):

x2apic cluster mode, vanilla

 Dry-run:                         0,            2392199 ns
 Self-IPI:                  6907514,           15027589 ns
 Normal IPI:              223910476,          251301666 ns
 Broadcast IPI:                   0,         9282161150 ns
 Broadcast lock:                  0,         8812934104 ns

x2apic cluster mode, pv-ipi

 Dry-run:                         0,            2449341 ns
 Self-IPI:                  6720360,           15028732 ns
 Normal IPI:              228643307,          255708477 ns
 Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
 Broadcast lock:                  0,         8316124651 ns

x2apic physical mode, vanilla

 Dry-run:                         0,            3135933 ns
 Self-IPI:                  8572670,           17901757 ns
 Normal IPI:              226444334,          255421709 ns
 Broadcast IPI:                   0,        19845070887 ns
 Broadcast lock:                  0,        19827383656 ns

x2apic physical mode, pv-ipi

 Dry-run:                         0,            2446381 ns
 Self-IPI:                  6788217,           15021056 ns
 Normal IPI:              219454441,          249583458 ns
 Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
 Broadcast lock:                  0,         9143618799 ns

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:20 +02:00
Tianyu Lan b08660e59d KVM: x86: Add tlb remote flush callback in kvm_x86_ops.
This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:06 +02:00
Tianyu Lan 60cfce4c4f X86/Hyper-V: Add hyperv_nested_flush_guest_mapping ftrace support
This patch is to add hyperv_nested_flush_guest_mapping support to trace
hvFlushGuestPhysicalAddressSpace hypercall.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:05 +02:00
Tianyu Lan eb914cfe72 X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support
Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to
flush nested VM address space mapping in l1 hypervisor and it's to
reduce overhead of flushing ept tlb among vcpus. This patch is to
implement it.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:04 +02:00
Junaid Shahid 208320ba10 kvm: x86: Remove CR3_PCID_INVD flag
It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:02 +02:00
Junaid Shahid b94742c958 kvm: x86: Add multi-entry LRU cache for previous CR3s
Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:02 +02:00
Junaid Shahid faff87588d kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg*
This needs a minor bug fix. The updated patch is as follows.

Thanks,
Junaid

------------------------------------------------------------------------------

kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:59:01 +02:00
Junaid Shahid 08fb59d8a4 kvm: x86: Support selectively freeing either current or previous MMU root
kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
that either one of the roots (active/previous) can be individually freed
when needed.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:59 +02:00
Junaid Shahid 7eb77e9f5f kvm: x86: Add a root_hpa parameter to kvm_mmu->invlpg()
This allows invlpg() to be called using either the active root_hpa
or the prev_root_hpa.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:58 +02:00
Junaid Shahid ade61e2824 kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest
When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.

This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:58 +02:00
Junaid Shahid eb4b248e15 kvm: vmx: Support INVPCID in shadow paging mode
Implement support for INVPCID in shadow paging mode as well.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:57 +02:00
Junaid Shahid 6e42782f51 kvm: x86: Introduce KVM_REQ_LOAD_CR3
The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
current root_hpa.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:52 +02:00
Junaid Shahid 7c390d350f kvm: x86: Add fast CR3 switch code path
When using shadow paging, a CR3 switch in the guest results in a VM Exit.
In the common case, that VM exit doesn't require much processing by KVM.
However, it does acquire the MMU lock, which can start showing signs of
contention under some workloads even on a 2 VCPU VM when the guest is
using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
lock in the most common cases e.g. when switching back and forth between
the kernel and user mode CR3s used by KPTI with no guest page table
changes in between.

For now, this fast path is implemented only for 64-bit guests and hosts
to avoid the handling of PDPTEs, but it can be extended later to 32-bit
guests and/or hosts as well.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:51 +02:00
Jim Mattson 8fcc4b5923 kvm: nVMX: Introduce KVM_CAP_NESTED_STATE
For nested virtualization L0 KVM is managing a bit of state for L2 guests,
this state can not be captured through the currently available IOCTLs. In
fact the state captured through all of these IOCTLs is usually a mix of L1
and L2 state. It is also dependent on whether the L2 guest was running at
the moment when the process was interrupted to save its state.

With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
that is in VMX operation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
             address previous comments.
           - handle nested.smm state.
           - rebase & a bit of refactoring.
           - Merge 7/8 and 8/8 into one patch. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:30 +02:00
Paolo Bonzini 7f7f1ba33c KVM: x86: do not load vmcs12 pages while still in SMM
If the vCPU enters system management mode while running a nested guest,
RSM starts processing the vmentry while still in SMM.  In that case,
however, the pages pointed to by the vmcs12 might be incorrectly
loaded from SMRAM.  To avoid this, delay the handling of the pages
until just before the next vmentry.  This is done with a new request
and a new entry in kvm_x86_ops, which we will be able to reuse for
nested VMX state migration.

Extracted from a patch by Jim Mattson and KarimAllah Ahmed.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:57:58 +02:00
Nick Desaulniers 208cbb3255 x86/irqflags: Provide a declaration for native_save_fl
It was reported that the commit d0a8d9378d is causing users of gcc < 4.9
to observe -Werror=missing-prototypes errors.

Indeed, it seems that:
extern inline unsigned long native_save_fl(void) { return 0; }

compiled with -Werror=missing-prototypes produces this warning in gcc <
4.9, but not gcc >= 4.9.

Fixes: d0a8d9378d ("x86/paravirt: Make native_save_fl() extern inline").
Reported-by: David Laight <david.laight@aculab.com>
Reported-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: jgross@suse.com
Cc: kstewart@linuxfoundation.org
Cc: gregkh@linuxfoundation.org
Cc: boris.ostrovsky@oracle.com
Cc: astrachan@google.com
Cc: mka@chromium.org
Cc: arnd@arndb.de
Cc: tstellar@redhat.com
Cc: sedat.dilek@gmail.com
Cc: David.Laight@aculab.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
2018-08-05 22:30:37 +02:00
Dave Hansen 6ea2738e0c x86/mm/init: Add helper for freeing kernel image pages
When chunks of the kernel image are freed, free_init_pages() is used
directly.  Consolidate the three sites that do this.  Also update the
string to give an incrementally better description of that memory versus
what was there before.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
2018-08-05 22:21:03 +02:00
Paolo Bonzini 5b76a3cff0 KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry.  Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.

Add the relevant Documentation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 17:10:20 +02:00
Paolo Bonzini 8e0b2b9166 x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 17:10:19 +02:00
Thomas Gleixner f2701b77bb Merge 4.18-rc7 into master to pick up the KVM dependcy
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 16:39:29 +02:00
Nicolai Stange ffcba43ff6 x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
The last missing piece to having vmx_l1d_flush() take interrupts after
VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
irq entry.

Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
ipi_entering_ack_irq(), smp_reschedule_interrupt() and
uv_bau_message_interrupt().

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 09:53:13 +02:00
Nicolai Stange 447ae31667 x86: Don't include linux/irq.h from asm/hardirq.h
The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().

Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like

  asm/smp.h
    asm/apic.h
      asm/hardirq.h
        linux/irq.h
          linux/topology.h
            linux/smp.h
              asm/smp.h

or

  linux/gfp.h
    linux/mmzone.h
      asm/mmzone.h
        asm/mmzone_64.h
          asm/smp.h
            asm/apic.h
              asm/hardirq.h
                linux/irq.h
                  linux/irqdesc.h
                    linux/kobject.h
                      linux/sysfs.h
                        linux/kernfs.h
                          linux/idr.h
                            linux/gfp.h

and others.

This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.

A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.

However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.

Remove the linux/irq.h #include from x86' asm/hardirq.h.

Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.

Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 09:53:13 +02:00
Nicolai Stange 45b575c00d x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.

L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.

If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.

The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.

Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus, the vmx code is unable to tell if
any such has happened.

As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
strong analogy to the per-vcpu ->l1tf_flush_l1d.

A later patch will make interrupt handlers set it.

For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.

Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.

Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-05 09:53:12 +02:00
Nicolai Stange 9aee5f8a7e x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
An upcoming patch will extend KVM's L1TF mitigation in conditional mode
to also cover interrupts after VMEXITs. For tracking those, stores to a
new per-cpu flag from interrupt handlers will become necessary.

In order to improve cache locality, this new flag will be added to x86's
irq_cpustat_t.

Make some space available there by shrinking the ->softirq_pending bitfield
from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
i.e. 10.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-05 09:53:12 +02:00
Sai Praneeth 706d51681d x86/speculation: Support Enhanced IBRS on future CPUs
Future Intel processors will support "Enhanced IBRS" which is an "always
on" mode i.e. IBRS bit in SPEC_CTRL MSR is enabled once and never
disabled.

From the specification [1]:

 "With enhanced IBRS, the predicted targets of indirect branches
  executed cannot be controlled by software that was executed in a less
  privileged predictor mode or on another logical processor. As a
  result, software operating on a processor with enhanced IBRS need not
  use WRMSR to set IA32_SPEC_CTRL.IBRS after every transition to a more
  privileged predictor mode. Software can isolate predictor modes
  effectively simply by setting the bit once. Software need not disable
  enhanced IBRS prior to entering a sleep state such as MWAIT or HLT."

If Enhanced IBRS is supported by the processor then use it as the
preferred spectre v2 mitigation mechanism instead of Retpoline. Intel's
Retpoline white paper [2] states:

 "Retpoline is known to be an effective branch target injection (Spectre
  variant 2) mitigation on Intel processors belonging to family 6
  (enumerated by the CPUID instruction) that do not have support for
  enhanced IBRS. On processors that support enhanced IBRS, it should be
  used for mitigation instead of retpoline."

The reason why Enhanced IBRS is the recommended mitigation on processors
which support it is that these processors also support CET which
provides a defense against ROP attacks. Retpoline is very similar to ROP
techniques and might trigger false positives in the CET defense.

If Enhanced IBRS is selected as the mitigation technique for spectre v2,
the IBRS bit in SPEC_CTRL MSR is set once at boot time and never
cleared. Kernel also has to make sure that IBRS bit remains set after
VMEXIT because the guest might have cleared the bit. This is already
covered by the existing x86_spec_ctrl_set_guest() and
x86_spec_ctrl_restore_host() speculation control functions.

Enhanced IBRS still requires IBPB for full mitigation.

[1] Speculative-Execution-Side-Channel-Mitigations.pdf
[2] Retpoline-A-Branch-Target-Injection-Mitigation.pdf
Both documents are available at:
https://bugzilla.kernel.org/show_bug.cgi?id=199511

Originally-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim C Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Link: https://lkml.kernel.org/r/1533148945-24095-1-git-send-email-sai.praneeth.prakhya@intel.com
2018-08-03 12:50:34 +02:00
Peter Feiner 301d328a6f x86/cpufeatures: Add EPT_AD feature bit
Some Intel processors have an EPT feature whereby the accessed & dirty bits
in EPT entries can be updated by HW. MSR IA32_VMX_EPT_VPID_CAP exposes the
presence of this capability.

There is no point in trying to use that new feature bit in the VMX code as
VMX needs to read the MSR anyway to access other bits, but having the
feature bit for EPT_AD in place helps virtualization management as it
exposes "ept_ad" in /proc/cpuinfo/$proc/flags if the feature is present.

[ tglx: Amended changelog ]

Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180801180657.138051-1-pshier@google.com
2018-08-03 12:36:23 +02:00
Ingo Molnar 16e0e6a83b Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-08-02 09:59:20 +02:00
Sunil Muthuswamy 9d9c965687 Drivers: hv: vmbus: Get rid of MSR access from vmbus_drv.c
Get rid of ISA specific code from vmus_drv.c which is common code.

Fixes: 81b18bce48 ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic")

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-29 08:09:56 +02:00
Mark Rutland f9881cc43b locking/atomics: Instrument xchg()
While we instrument all of the (non-relaxed) atomic_*() functions and
cmpxchg(), we missed xchg().

Let's add instrumentation for xchg(), fixing up x86 to implement
arch_xchg().

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: andy.shevchenko@gmail.com
Cc: arnd@arndb.de
Cc: aryabinin@virtuozzo.com
Cc: catalin.marinas@arm.com
Cc: glider@google.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: parri.andrea@gmail.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/20180716113017.3909-5-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:53:59 +02:00
Mark Rutland 00d5551cc4 locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
Currently x86's arch_cmpxchg64() and arch_cmpxchg64_local() are
instrumented twice, as they call into instrumented atomics rather than
their arch_ equivalents.

A call to cmpxchg64() results in:

  cmpxchg64()
    kasan_check_write()
    arch_cmpxchg64()
      cmpxchg()
        kasan_check_write()
        arch_cmpxchg()

Let's fix this up and call the arch_ equivalents, resulting in:

  cmpxchg64()
    kasan_check_write()
    arch_cmpxchg64()
      arch_cmpxchg()

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: andy.shevchenko@gmail.com
Cc: arnd@arndb.de
Cc: aryabinin@virtuozzo.com
Cc: catalin.marinas@arm.com
Cc: glider@google.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: parri.andrea@gmail.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/20180716113017.3909-3-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:53:58 +02:00
Kan Liang ec71a398c1 perf/x86/intel/ds: Handle PEBS overflow for fixed counters
The pebs_drain() need to support fixed counters. The DS Save Area now
include "counter reset value" fields for each fixed counters.

Extend the related variables (e.g. mask, counters, error) to support
fixed counters. There is no extended PEBS in PEBS v2 and earlier PEBS
format. Only need to change the code for PEBS v3 and later PEBS format.

Extend the pebs_event_reset[] logic to support new "counter reset value" fields.

Increase the reserve space for fixed counters.

Based-on-code-from: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/20180309021542.11374-3-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:50:50 +02:00
Ingo Molnar 93081caaae Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:47:02 +02:00
Waiman Long c0dc373a78 locking/pvqspinlock/x86: Use LOCK_PREFIX in __pv_queued_spin_unlock() assembly code
The LOCK_PREFIX macro should be used in the __raw_callee_save___pv_queued_spin_unlock()
assembly code, so that the lock prefix can be patched out on UP systems.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joe Mario <jmario@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1531858560-21547-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25 11:22:20 +02:00
Linus Torvalds 43227e098c Merge branch 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti fixes from Ingo Molnar:
 "An APM fix, and a BTS hardware-tracing fix related to PTI changes"

* 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apm: Don't access __preempt_count with zeroed fs
  x86/events/intel/ds: Fix bts_interrupt_threshold alignment
2018-07-21 17:23:58 -07:00
Joerg Roedel 6df934b92a x86/ldt: Enable LDT user-mapping for PAE
This adds the needed special case for PAE to get the LDT mapped into the
user page-table when PTI is enabled. The big difference to the other paging
modes is that on PAE there is no full top-level PGD entry available for the
LDT, but only a PMD entry.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-37-git-send-email-joro@8bytes.org
2018-07-20 01:11:48 +02:00
Joerg Roedel 8195d869d1 x86/ldt: Define LDT_END_ADDR
It marks the end of the address-space range reserved for the LDT. The
LDT-code will use it when unmapping the LDT for user-space.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-35-git-send-email-joro@8bytes.org
2018-07-20 01:11:47 +02:00
Joerg Roedel f3e48e546c x86/ldt: Reserve address-space range on 32 bit for the LDT
Reserve 2MB/4MB of address-space for mapping the LDT to user-space on 32
bit PTI kernels.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-34-git-send-email-joro@8bytes.org
2018-07-20 01:11:47 +02:00
Joerg Roedel b976690f5d x86/mm/pti: Introduce pti_finalize()
Introduce a new function to finalize the kernel mappings for the userspace
page-table after all ro/nx protections have been applied to the kernel
mappings.

Also move the call to pti_clone_kernel_text() to that function so that it
will run on 32 bit kernels too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-30-git-send-email-joro@8bytes.org
2018-07-20 01:11:45 +02:00
Joerg Roedel 39d668e04e x86/mm/pti: Make pti_clone_kernel_text() compile on 32 bit
The pti_clone_kernel_text() function references __end_rodata_hpage_align,
which is only present on x86-64.  This makes sense as the end of the rodata
section is not huge-page aligned on 32 bit.

Nevertheless a symbol is required for the function that points at the right
address for both 32 and 64 bit. Introduce __end_rodata_aligned for that
purpose and use it in pti_clone_kernel_text().

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-28-git-send-email-joro@8bytes.org
2018-07-20 01:11:44 +02:00
Joerg Roedel 2c1b9fbe83 x86/mm/pti: Define X86_CR3_PTI_PCID_USER_BIT on x86_32
Move it out of the X86_64 specific processor defines so that its visible
for 32bit too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-26-git-send-email-joro@8bytes.org
2018-07-20 01:11:44 +02:00
Joerg Roedel 1f40a46cf4 x86/mm/legacy: Populate the user page-table with user pgd's
Also populate the user-spage pgd's in the user page-table.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-24-git-send-email-joro@8bytes.org
2018-07-20 01:11:43 +02:00
Joerg Roedel 9b7b8bbd7f x86/mm/pae: Populate the user page-table with user pgd's
When a PGD entry is populated, make sure to populate it in the user
page-table too.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-23-git-send-email-joro@8bytes.org
2018-07-20 01:11:43 +02:00
Joerg Roedel 6c0df86894 x86/mm/pae: Populate valid user PGD entries
Generic page-table code populates all non-leaf entries with _KERNPG_TABLE
bits set. This is fine for all paging modes except PAE.

In PAE mode only a subset of the bits is allowed to be set.  Make sure to
only set allowed bits by masking out the reserved bits.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-22-git-send-email-joro@8bytes.org
2018-07-20 01:11:42 +02:00
Joerg Roedel 76e258add7 x86/pgtable: Move two more functions from pgtable_64.h to pgtable.h
These two functions are required for PTI on 32 bit:

	* pgdp_maps_userspace()
	* pgd_large()

Also re-implement pgdp_maps_userspace() so that it will work on 64 and 32
bit kernels.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-21-git-send-email-joro@8bytes.org
2018-07-20 01:11:42 +02:00
Joerg Roedel fcbbd97757 x86/pgtable: Move pti_set_user_pgtbl() to pgtable.h
There it is also usable from 32 bit code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-20-git-send-email-joro@8bytes.org
2018-07-20 01:11:42 +02:00
Joerg Roedel 8372d66865 x86/pgtable: Move pgdp kernel/user conversion functions to pgtable.h
Make them available on 32 bit and clone_pgd_range() happy.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-19-git-send-email-joro@8bytes.org
2018-07-20 01:11:41 +02:00
Joerg Roedel 7ffcf1497c x86/pgtable/pae: Unshare kernel PMDs when PTI is enabled
With PTI the per-process LDT must be mapped into the kernel address-space
for each process, which requires separate kernel PMDs per PGD.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-17-git-send-email-joro@8bytes.org
2018-07-20 01:11:40 +02:00
Joerg Roedel 23b772883d x86/pgtable: Rename pti_set_user_pgd() to pti_set_user_pgtbl()
The way page-table folding is implemented on 32 bit, these functions are
not only setting, but also PUDs and even PMDs. Give the function a more
generic name to reflect that.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-16-git-send-email-joro@8bytes.org
2018-07-20 01:11:40 +02:00
Joerg Roedel 252e1a0526 x86/entry: Rename update_sp0 to update_task_stack
The function does not update sp0 anymore but updates makes the task-stack
visible for entry code. This is by either writing it to sp1 or by doing a
hypercall. Rename the function to get rid of the misleading name.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-15-git-send-email-joro@8bytes.org
2018-07-20 01:11:40 +02:00
Joerg Roedel 45d7b25574 x86/entry/32: Enter the kernel via trampoline stack
Use the entry-stack as a trampoline to enter the kernel. The entry-stack is
already in the cpu_entry_area and will be mapped to userspace when PTI is
enabled.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-8-git-send-email-joro@8bytes.org
2018-07-20 01:11:37 +02:00
Pavel Tatashin 8dbe438589 x86/tsc: Make use of tsc_calibrate_cpu_early()
During early boot enable tsc_calibrate_cpu_early() and switch to
tsc_calibrate_cpu() only later. Do this unconditionally, because it is
unknown what methods other cpus will use to calibrate once they are
onlined.

If by the time tsc_init() is called tsc frequency is still unknown do only
pit_hpet_ptimer_calibrate_cpu() to calibrate, as this function contains the
only methods wich have not been called and tried earlier.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-27-pasha.tatashin@oracle.com
2018-07-20 00:02:44 +02:00
Pavel Tatashin 03821f451d x86/tsc: Split native_calibrate_cpu() into early and late parts
During early boot TSC and CPU frequency can be calibrated using MSR, CPUID,
and quick PIT calibration methods. The other methods PIT/HPET/PMTIMER are
available only after ACPI is initialized.

Split native_calibrate_cpu() into early and late parts so they can be
called separately during early and late tsc calibration.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-26-pasha.tatashin@oracle.com
2018-07-20 00:02:44 +02:00
Pavel Tatashin cf7a63ef4e x86/tsc: Calibrate tsc only once
During boot tsc is calibrated twice: once in tsc_early_delay_calibrate(),
and the second time in tsc_init().

Rename tsc_early_delay_calibrate() to tsc_early_init(), and rework it so
the calibration is done only early, and make tsc_init() to use the values
already determined in tsc_early_init().

Sometimes it is not possible to determine tsc early, as the subsystem that
is required is not yet initialized, in such case try again later in
tsc_init().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-20-pasha.tatashin@oracle.com
2018-07-20 00:02:42 +02:00
Pavel Tatashin 6fffacb303 x86/alternatives, jumplabel: Use text_poke_early() before mm_init()
It supposed to be safe to modify static branches after jump_label_init().
But, because static key modifying code eventually calls text_poke() it can
end up accessing a struct page which has not been initialized yet.

Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:

| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
|        char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
|        vfs_caches_init_early();
|        sort_main_extable();
|        trap_init();
|+       {
|+       static_branch_enable(&__test);
|+       WARN_ON(!static_branch_likely(&__test));
|+       }
|        mm_init();

The following warnings show-up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
RIP: 0010:text_poke+0x20d/0x230
Call Trace:
 ? text_poke_bp+0x50/0xda
 ? arch_jump_label_transform+0x89/0xe0
 ? __jump_label_update+0x78/0xb0
 ? static_key_enable_cpuslocked+0x4d/0x80
 ? static_key_enable+0x11/0x20
 ? start_kernel+0x23e/0x4c8
 ? secondary_startup_64+0xa5/0xb0

---[ end trace abdc99c031b8a90a ]---

If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.

Use text_poke_early() in static branching until early boot IRQs are enabled
and from there switch to text_poke. Also, ensure text_poke() is never
invoked when unitialized memory access may happen by using adding a
!after_bootmem assertion.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com
2018-07-20 00:02:38 +02:00
Thomas Gleixner e499a9b6dc x86/kvmclock: Move kvmclock vsyscall param and init to kvmclock
There is no point to have this in the kvm code itself and call it from
there. This can be called from an initcall and the parameter is cleared
when the hypervisor is not KVM.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-7-pasha.tatashin@oracle.com
2018-07-20 00:02:37 +02:00
Thomas Gleixner 7a5ddc8fe0 x86/kvmclock: Decrapify kvm_register_clock()
The return value is pointless because the wrmsr cannot fail if
KVM_FEATURE_CLOCKSOURCE or KVM_FEATURE_CLOCKSOURCE2 are set.

kvm_register_clock() is only called locally so wants to be static.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Link: https://lkml.kernel.org/r/20180719205545.16512-4-pasha.tatashin@oracle.com
2018-07-20 00:02:36 +02:00
Thomas Gleixner 73ab603f44 Merge branch 'linus' into x86/timers
Pick up upstream changes to avoid conflicts
2018-07-19 23:11:52 +02:00
Jiang Biao d9f4426c73 x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation
SPECTRE_V2_IBRS in enum spectre_v2_mitigation is never used. Remove it.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: dwmw2@amazon.co.uk
Cc: konrad.wilk@oracle.com
Cc: bp@suse.de
Cc: zhong.weidong@zte.com.cn
Link: https://lkml.kernel.org/r/1531872194-39207-1-git-send-email-jiang.biao2@zte.com.cn
2018-07-19 12:31:00 +02:00
Rik van Riel 95b0e6357d x86/mm/tlb: Always use lazy TLB mode
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:35:34 +02:00
Rik van Riel 2ff6ddf19c x86/mm/tlb: Leave lazy TLB mode at page table free time
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.

The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.

Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:35:31 +02:00
Ingo Molnar 52b544bd38 Linux 4.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y
 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi
 f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e
 f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu
 VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK
 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI
 JjzNOkI=
 =ckcO
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:27:43 +02:00
Ville Syrjälä 6f6060a5c9 x86/apm: Don't access __preempt_count with zeroed fs
APM_DO_POP_SEGS does not restore fs/gs which were zeroed by
APM_DO_ZERO_SEGS. Trying to access __preempt_count with
zeroed fs doesn't really work.

Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS
invocations so that fs is actually restored before calling
preempt_enable().

Fixes the following sort of oopses:
[    0.313581] general protection fault: 0000 [#1] PREEMPT SMP
[    0.313803] Modules linked in:
[    0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a7971 #19
[    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170
[    0.316161] EFLAGS: 00210016 CPU: 0
[    0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000
[    0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0
[    0.316161]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0
[    0.316161] Call Trace:
[    0.316161]  ? cpumask_weight.constprop.15+0x20/0x20
[    0.316161]  on_cpu0+0x44/0x70
[    0.316161]  apm+0x54e/0x720
[    0.316161]  ? __switch_to_asm+0x26/0x40
[    0.316161]  ? __schedule+0x17d/0x590
[    0.316161]  kthread+0xc0/0xf0
[    0.316161]  ? proc_apm_show+0x150/0x150
[    0.316161]  ? kthread_create_worker_on_cpu+0x20/0x20
[    0.316161]  ret_from_fork+0x2e/0x38
[    0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45
[    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0
[    0.316161] ---[ end trace 656253db2deaa12c ]---

Fixes: dd84441a79 ("x86/speculation: Use IBRS if available before calling into firmware")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc:  David Woodhouse <dwmw@amazon.co.uk>
Cc:  "H. Peter Anvin" <hpa@zytor.com>
Cc:  x86@kernel.org
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com
2018-07-16 17:59:57 +02:00
Greg Kroah-Hartman 83cf9cd6d5 Merge 4.18-rc5 into char-misc-next
We want the char-misc fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-16 09:04:54 +02:00
Dan Williams 092b31aa20 x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
All copy_to_user() implementations need to be prepared to handle faults
accessing userspace. The __memcpy_mcsafe() implementation handles both
mmu-faults on the user destination and machine-check-exceptions on the
source buffer. However, the memcpy_mcsafe() wrapper may silently
fallback to memcpy() depending on build options and cpu-capabilities.

Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
available, and otherwise disable all of the copy_to_user_mcsafe()
infrastructure when __memcpy_mcsafe() is not available, i.e.
CONFIG_X86_MCE=n.

This fixes crashes of the form:
    run fstests generic/323 at 2018-07-02 12:46:23
    BUG: unable to handle kernel paging request at 00007f0d50001000
    RIP: 0010:__memcpy+0x12/0x20
    [..]
    Call Trace:
     copyout_mcsafe+0x3a/0x50
     _copy_to_iter_mcsafe+0xa1/0x4a0
     ? dax_alive+0x30/0x50
     dax_iomap_actor+0x1f9/0x280
     ? dax_iomap_rw+0x100/0x100
     iomap_apply+0xba/0x130
     ? dax_iomap_rw+0x100/0x100
     dax_iomap_rw+0x95/0x100
     ? dax_iomap_rw+0x100/0x100
     xfs_file_dax_read+0x7b/0x1d0 [xfs]
     xfs_file_read_iter+0xa7/0xc0 [xfs]
     aio_read+0x11c/0x1a0

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Fixes: 8780356ef6 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:05:05 +02:00
Jiri Kosina d90a7a0ec8 x86/bugs, kvm: Introduce boot-time control of L1TF mitigations
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
	Provides all available mitigations for the L1TF vulnerability. Disables
	SMT and enables all mitigations in the hypervisors. SMT control via
	/sys/devices/system/cpu/smt/control is still possible after boot.
	Hypervisors will issue a warning when the first VM is started in
	a potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  full,force
	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
	command line option. sysfs control of SMT and the hypervisor flush
	control is disabled.

  flush
	Leaves SMT enabled and enables the conditional hypervisor mitigation.
	Hypervisors will issue a warning when the first VM is started in a
	potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  flush,nosmt
	Disables SMT and enables the conditional hypervisor mitigation. SMT
	control via /sys/devices/system/cpu/smt/control is still possible
	after boot. If SMT is reenabled or flushing disabled at runtime
	hypervisors will issue a warning.

  flush,nowarn
	Same as 'flush', but hypervisors will not warn when
	a VM is started in a potentially insecure configuration.

  off
	Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.

Let KVM adhere to these semantics, which means:

  - 'lt1f=full,force'	: Performe L1D flushes. No runtime control
    			  possible.

  - 'l1tf=full'
  - 'l1tf-flush'
  - 'l1tf=flush,nosmt'	: Perform L1D flushes and warn on VM start if
			  SMT has been runtime enabled or L1D flushing
			  has been run-time enabled
			  
  - 'l1tf=flush,nowarn'	: Perform L1D flushes and no warnings are emitted.
  
  - 'l1tf=off'		: L1D flushes are not performed and no warnings
			  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
2018-07-13 16:29:56 +02:00
Thomas Gleixner a7b9020b06 x86/l1tf: Handle EPT disabled state proper
If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.

Invoke it after the hardware setup has be done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de
2018-07-13 16:29:53 +02:00
Thomas Gleixner 72c6d2db64 x86/litf: Introduce vmx status variable
Store the effective mitigation of VMX in a status variable and use it to
report the VMX state in the l1tf sysfs file.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de
2018-07-13 16:29:53 +02:00
Sunil Muthuswamy 81b18bce48 Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic
In the VM mode on Hyper-V, currently, when the kernel panics, an error
code and few register values are populated in an MSR and the Hypervisor
notified. This information is collected on the host. The amount of
information currently collected is found to be limited and not very
actionable. To gather more actionable data, such as stack trace, the
proposal is to write one page worth of kmsg data on an allocated page
and the Hypervisor notified of the page address through the MSR.

- Sysctl option to control the behavior, with ON by default.

Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-08 15:54:31 +02:00
Suravee Suthikulpanit 818b7587b4 x86: irq_remapping: Move irq remapping mode enum
The enum is currently defined in Intel-specific DMAR header file,
but it is also used by APIC common code. Therefore, move it to
a more appropriate interrupt-remapping common header file.
This will also be used by subsequent patches.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2018-07-06 14:43:47 +02:00
Thomas Gleixner 8f63e9230d Merge branch 'x86/urgent' into x86/hyperv
Integrate the upstream bug fix to resolve the resulting conflict in
__send_ipi_mask().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-06 12:35:56 +02:00
K. Y. Srinivasan 1268ed0c47 x86/hyper-v: Fix the circular dependency in IPI enlightenment
The IPI hypercalls depend on being able to map the Linux notion of CPU ID
to the hypervisor's notion of the CPU ID. The array hv_vp_index[] provides
this mapping. Code for populating this array depends on the IPI functionality.
Break this circular dependency.

[ tglx: Use a proper define instead of '-1' with a u32 variable as pointed
  	out by Vitaly ]

Fixes: 68bb7bfb79 ("X86/Hyper-V: Enable IPI enlightenments")
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Cc: gregkh@linuxfoundation.org
Cc: devel@linuxdriverproject.org
Cc: olaf@aepfle.de
Cc: apw@canonical.com
Cc: jasowang@redhat.com
Cc: hpa@zytor.com
Cc: sthemmin@microsoft.com
Cc: Michael.H.Kelley@microsoft.com
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180703230155.15160-1-kys@linuxonhyperv.com
2018-07-06 12:32:59 +02:00
Paolo Bonzini c595ceee45 x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.

The flags is set:
 - Always, if the flush module parameter is 'always'

 - Conditionally at:
   - Entry to vcpu_run(), i.e. after executing user space

   - From the sched_in notifier, i.e. when switching to a vCPU thread.

   - From vmexit handlers which are considered unsafe, i.e. where
     sensitive data can be brought into L1D:

     - The emulator, which could be a good target for other speculative
       execution-based threats,

     - The MMU, which can bring host page tables in the L1 cache.
     
     - External interrupts

     - Nested operations that require the MMU (see above). That is
       vmptrld, vmptrst, vmclear,vmwrite,vmread.

     - When handling invept,invvpid

[ tglx: Split out from combo patch and reduced to a single flag ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-04 20:49:39 +02:00
Paolo Bonzini 3fa045be4c x86/KVM/VMX: Add L1D MSR based flush
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.

The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."

Use it instead of the loop based L1 flush algorithm.

A copy of this document is available at
   https://bugzilla.kernel.org/show_bug.cgi?id=199511

[ tglx: Avoid allocating pages when the MSR is available ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-04 20:49:39 +02:00
Michael Kelley 7dc9b6b808 Drivers: hv: vmbus: Make TLFS #define names architecture neutral
The Hyper-V feature and hint flags in hyperv-tlfs.h are all defined
with the string "X64" in the name.  Some of these flags are indeed
x86/x64 specific, but others are not.  For the ones that are used
in architecture independent Hyper-V driver code, or will be used in
the upcoming support for Hyper-V for ARM64, this patch removes the
"X64" from the name.

This patch changes the flags that are currently known to be
used on multiple architectures. Hyper-V for ARM64 is still a
work-in-progress and the Top Level Functional Spec (TLFS) has not
been separated into x86/x64 and ARM64 areas.  So additional flags
may need to be updated later.

This patch only changes symbol names.  There are no functional
changes.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03 13:09:15 +02:00