powerpc: Provide initial documentation for PAPR hcalls
This doc patch provides an initial description of the hcall op-codes that are used by a Linux kernel running as a guest (LPAR) on top of PowerVM or any other sPAPR-compliant hypervisor (e.g. qemu). Apart from documenting the hcalls, the doc patch also provides a rudimentary overview of the hcall ABI, how hcalls are issued within the Linux kernel, and how information/control flows between the guest and the hypervisor.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add SPDX tag, add it to index.rst]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190828082729.16695-1-vaibhav@linux.ibm.com
parent
9933819099
commit
58b278f568
@@ -22,6 +22,7 @@ powerpc
isa-versions
kaslr-booke32
mpc52xx
papr_hcalls
pci_iov_resource_on_powernv
pmu-ebb
ptrace

@@ -0,0 +1,250 @@
.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
========

Virtualization on 64-bit Power Book3S platforms is based on the PAPR
specification [1]_, which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR-compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
  supports the full PAPR specification.

- **Qemu/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host.
  It implements only a subset of the PAPR specification, called LoPAPR [2]_.

On the PPC64 arch, a guest kernel running on top of a PAPR hypervisor is
called a *pSeries guest*. A pseries guest runs in supervisor mode (HV=0) and
must issue hypercalls to the hypervisor whenever it needs to perform an
action that is hypervisor privileged [3]_ or for other services managed by
the hypervisor.

Hence a hypercall (hcall) is essentially a request by the pseries guest
asking the hypervisor to perform a privileged operation on its behalf. The
guest issues the hcall with the necessary input operands. After performing
the privileged operation, the hypervisor returns a status code and output
operands back to the guest.

HCALL ABI
=========

The ABI specification for a hcall between a pseries guest and a PAPR
hypervisor is covered in section 14.5.3 of ref [2]_. The switch to the
hypervisor context is done via the instruction **HVCS**, which expects the
opcode for the hcall to be set in *r3* and any in-arguments for the hcall to
be provided in registers *r4-r12*. If values have to be passed through a
memory buffer, the data stored in that buffer should be in big-endian byte
order.

Once control returns to the guest after the hypervisor has serviced the
**HVCS** instruction, the return value of the hcall is available in *r3* and
any out values are returned in registers *r4-r12*. As with in-arguments, any
out values stored in a memory buffer will be in big-endian byte order.

The powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**,
defined in an arch-specific header [4]_, to issue hcalls from a Linux kernel
running as a pseries guest.

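The big-endian requirement for memory buffers can be illustrated with a small
user-space sketch. The helper names ``store_be64`` and ``load_be64`` are
hypothetical and exist only for this example; kernel code would normally rely
on the existing ``cpu_to_be64()`` and ``be64_to_cpu()`` helpers instead. The
sketch only shows how a 64-bit value is laid out in a buffer regardless of the
endianness of the guest:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): store a 64-bit value into buf
 * in big-endian byte order, independent of the CPU's own endianness.
 */
static void store_be64(uint8_t *buf, uint64_t val)
{
	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		buf[i] = (uint8_t)(val >> (56 - 8 * i));
}

/* Hypothetical helper: read a 64-bit big-endian value back from buf. */
static uint64_t load_be64(const uint8_t *buf)
{
	uint64_t val = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		val = (val << 8) | buf[i];
	return val;
}
```

Because the helpers work byte by byte, the most significant byte always ends
up at the lowest buffer address, which is what a buffer shared with the
hypervisor is expected to contain.
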
Register Conventions
====================

Any hcall should follow the same register convention as described in section
2.2.1.1 of "64-Bit ELF V2 ABI Specification: Power Architecture" [5]_. The
table below summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register |Volatile  |  Purpose                                  |
| Range    |(Y/N)     |                                           |
+==========+==========+===========================================+
| r0       |    Y     |  Optional-usage                           |
+----------+----------+-------------------------------------------+
| r1       |    N     |  Stack Pointer                            |
+----------+----------+-------------------------------------------+
| r2       |    N     |  TOC                                      |
+----------+----------+-------------------------------------------+
| r3       |    Y     |  hcall opcode/return value                |
+----------+----------+-------------------------------------------+
| r4-r10   |    Y     |  in and out values                        |
+----------+----------+-------------------------------------------+
| r11      |    Y     |  Optional-usage/Environmental pointer     |
+----------+----------+-------------------------------------------+
| r12      |    Y     |  Optional-usage/Function entry address at |
|          |          |  global entry point                       |
+----------+----------+-------------------------------------------+
| r13      |    N     |  Thread-Pointer                           |
+----------+----------+-------------------------------------------+
| r14-r31  |    N     |  Local Variables                          |
+----------+----------+-------------------------------------------+
| LR       |    Y     |  Link Register                            |
+----------+----------+-------------------------------------------+
| CTR      |    Y     |  Loop Counter                             |
+----------+----------+-------------------------------------------+
| XER      |    Y     |  Fixed-point exception register.          |
+----------+----------+-------------------------------------------+
| CR0-1    |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
| CR2-4    |    N     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
| CR5-7    |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
| Others   |    N     |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                                  Guest
     +--+        +------------+         +---------+
     |  | <----> |            |         |  User   |
     +--+  DRC1  |            |   DRC   |  Space  |
                 |    PAPR    |  Index  +---------+
     DR2         | Hypervisor |         |         |
     +--+        |            | <-----> |  Kernel |
     |  | <----> |            |  Hcall  |         |
     +--+  DRC2  +------------+         +---------+

The PAPR hypervisor terms shared hardware resources like PCI devices, NVDIMMs,
etc. available for use by LPARs as Dynamic Resources (DR). When a DR is
allocated to an LPAR, PHYP creates a data structure called a Dynamic Resource
Connector (DRC) to manage LPAR access. An LPAR refers to a DRC via an opaque
32-bit number called the DRC-Index. The DRC-Index value is provided to the
LPAR via the device tree, where it is present as an attribute in the device
tree node associated with the DR.

HCALL Return-values
===================

After servicing the hcall, the hypervisor sets the return value in *r3*,
indicating success or failure of the hcall. In case of a failure, an error
code indicates the cause of the error. These codes are defined and documented
in the arch-specific header [4]_.

In some cases a hcall can potentially take a long time and needs to be issued
multiple times in order to be completely serviced. These hcalls will usually
accept an opaque value *continue-token* within their argument list, and a
return value of *H_CONTINUE* indicates that the hypervisor has not yet
finished servicing the hcall.

To make such hcalls the guest needs to set *continue-token == 0* for the
initial call and use the hypervisor-returned value of *continue-token*
for each subsequent hcall until the hypervisor returns a return value other
than *H_CONTINUE*.

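The *continue-token* protocol described above can be sketched in user-space C
as follows. Everything here is an assumption made for illustration:
``demo_long_hcall`` is a hypothetical stand-in for the hypervisor, not a real
hcall, and the ``DEMO_H_*`` constants are placeholders for the status codes
that the arch-specific header [4]_ actually defines. The mock happens to
encode its progress in the token, whereas a real guest must treat the token as
opaque and simply pass it back:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder status codes for this sketch; see hvcall.h for real ones. */
enum { DEMO_H_SUCCESS = 0, DEMO_H_CONTINUE = 18 };

/*
 * Hypothetical stand-in for a long-running hcall. It services at most
 * four units of work per invocation and hands back a token describing
 * where to resume.
 */
static long demo_long_hcall(uint64_t total, uint64_t token_in,
			    uint64_t *token_out, uint64_t *done)
{
	uint64_t next = token_in + 4;

	if (next >= total) {
		*done = total;
		*token_out = 0;
		return DEMO_H_SUCCESS;
	}
	*done = next;
	*token_out = next;
	return DEMO_H_CONTINUE;
}

/*
 * Guest-side loop: start with token == 0, then feed back whatever token
 * the previous call returned until the status is no longer H_CONTINUE.
 */
static long demo_issue(uint64_t total, uint64_t *done, unsigned int *calls)
{
	uint64_t token = 0;
	long rc;

	assert(done && calls);
	*calls = 0;
	do {
		rc = demo_long_hcall(total, token, &token, done);
		(*calls)++;
	} while (rc == DEMO_H_CONTINUE);
	return rc;
}
```

With ``total`` set to 10, the mock needs three invocations before it reports
success, which mirrors how a guest may have to reissue the same hcall several
times.
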
HCALL Op-codes
==============

Below is a partial list of hcalls that are supported by PHYP. For the
corresponding opcode values, see the arch-specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N-bytes from the metadata area
associated with it, at a specified offset, and copy them to the provided
buffer. The metadata area stores configuration information such as label
information, bad blocks, etc. The metadata area is located out-of-band of
the NVDIMM storage area, hence separate access semantics are provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N-bytes to the metadata area
associated with it, at the specified offset and from the provided buffer.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a contiguous range of SCM blocks
*(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the
guest at *targetLogicalMemoryAddress* within the guest physical address
space. If *targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the hypervisor
assigns a target address to the guest. The hcall can fail if the guest has
an active PTE entry to the SCM block being bound.

**H_SCM_UNBIND_MEM**

| Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
| Out: *numScmBlocksUnbound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks
starting at *startingScmLogicalMemoryAddress* from the guest physical address
space. The hcall can fail if the guest has an active PTE entry to the SCM
block being unbound.

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM block index, return the guest physical address
to which the SCM block is mapped.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address, return the DRC Index and SCM block that are
mapped to that address.

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the target scope, unmap from the LPAR memory either all SCM
blocks belonging to all NVDIMMs or all SCM blocks belonging to a single
NVDIMM identified by its drcIndex.

**H_SCM_HEALTH**

| Input: *drcIndex*
| Out: *health-bitmap, health-bit-valid-bitmap*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index, return information on predictive failures and the overall
health of the NVDIMM. Each asserted bit in the health-bitmap indicates a
single predictive failure, and the health-bit-valid-bitmap indicates which
bits in the health-bitmap are valid.

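The relationship between the two bitmaps returned by **H_SCM_HEALTH** can be
sketched as below. The function names and the flag values used with them are
hypothetical; the meaning of individual bits is not covered in this document.
The sketch only shows the masking step: a bit in the health-bitmap is taken
into account only if the same bit is set in the health-bit-valid-bitmap:

```c
#include <assert.h>
#include <stdint.h>

/* Keep only those health bits that the valid bitmap marks as meaningful. */
static uint64_t valid_health_bits(uint64_t health, uint64_t valid)
{
	return health & valid;
}

/*
 * Test a single flag (hypothetical bit position chosen by the caller),
 * honoring the valid bitmap. Returns 1 if the flag is set and valid.
 */
static int health_flag_set(uint64_t health, uint64_t valid, uint64_t flag)
{
	assert(flag != 0);
	return (valid_health_bits(health, valid) & flag) != 0;
}
```

Masking first avoids acting on bits whose value the hypervisor has not marked
as valid.
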
**H_SCM_PERFORMANCE_STATS**

| Input: *drcIndex, resultBuffer Addr*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware,*
| *H_Authority, H_Privilege*

Given a DRC Index, collect the performance statistics for the NVDIMM and copy
them to the resultBuffer.

References
==========

.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture

@@ -1408,22 +1408,9 @@ EXC_VIRT_NONE(0x4b00, 0x100)
*
* Call convention:
*
* syscall register convention is in Documentation/powerpc/syscall64-abi.rst
*
* For hypercalls, the register convention is as follows:
* r0 volatile
* r1-2 nonvolatile
* r3 volatile parameter and return value for status
* r4-r10 volatile input and output value
* r11 volatile hypercall number and output value
* r12 volatile input and output value
* r13-r31 nonvolatile
* LR nonvolatile
* CTR volatile
* XER volatile
* CR0-1 CR5-7 volatile
* CR2-4 nonvolatile
* Other registers nonvolatile
* syscall and hypercalls register conventions are documented in
* Documentation/powerpc/syscall64-abi.rst and
* Documentation/powerpc/papr_hcalls.rst respectively.
*
* The intersection of volatile registers that don't contain possible
* inputs is: cr0, xer, ctr. We may use these as scratch regs upon entry