powerpc/pci: Add PCI resource alignment documentation
In order to enable SRIOV on PowerNV platform, the PF's IOV BAR needs to be adjusted: 1. size expanded 2. aligned to M64BT size This patch documents this change on the reason and how. [bhelgaas: reformat, clarify, expand] Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This commit is contained in:
parent
250c7b277c
commit
3305244042
|
@ -0,0 +1,301 @@
|
|||
Wei Yang <weiyang@linux.vnet.ibm.com>
|
||||
Benjamin Herrenschmidt <benh@au1.ibm.com>
|
||||
Bjorn Helgaas <bhelgaas@google.com>
|
||||
26 Aug 2014
|
||||
|
||||
This document describes the requirement from hardware for PCI MMIO resource
|
||||
sizing and assignment on PowerKVM and how generic PCI code handles this
|
||||
requirement. The first two sections describe the concepts of Partitionable
|
||||
Endpoints and the implementation on P8 (IODA2). The next two sections talks
|
||||
about considerations on enabling SRIOV on IODA2.
|
||||
|
||||
1. Introduction to Partitionable Endpoints
|
||||
|
||||
A Partitionable Endpoint (PE) is a way to group the various resources
|
||||
associated with a device or a set of devices to provide isolation between
|
||||
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
|
||||
to freeze a device that is causing errors in order to limit the possibility
|
||||
of propagation of bad data.
|
||||
|
||||
There is thus, in HW, a table of PE states that contains a pair of "frozen"
|
||||
state bits (one for MMIO and one for DMA, they get set together but can be
|
||||
cleared independently) for each PE.
|
||||
|
||||
When a PE is frozen, all stores in any direction are dropped and all loads
|
||||
return all 1's value. MSIs are also blocked. There's a bit more state that
|
||||
captures things like the details of the error that caused the freeze etc., but
|
||||
that's not critical.
|
||||
|
||||
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
|
||||
are matched to their corresponding PEs.
|
||||
|
||||
The following section provides a rough description of what we have on P8
|
||||
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
|
||||
is a completely separate HW entity that replicates the entire logic, so has
|
||||
its own set of PEs, etc.
|
||||
|
||||
2. Implementation of Partitionable Endpoints on P8 (IODA2)
|
||||
|
||||
P8 supports up to 256 Partitionable Endpoints per PHB.
|
||||
|
||||
* Inbound
|
||||
|
||||
For DMA, MSIs and inbound PCIe error messages, we have a table (in
|
||||
memory but accessed in HW by the chip) that provides a direct
|
||||
correspondence between a PCIe RID (bus/dev/fn) with a PE number.
|
||||
We call this the RTT.
|
||||
|
||||
- For DMA we then provide an entire address space for each PE that can
|
||||
contain two "windows", depending on the value of PCI address bit 59.
|
||||
Each window can be configured to be remapped via a "TCE table" (IOMMU
|
||||
translation table), which has various configurable characteristics
|
||||
not described here.
|
||||
|
||||
- For MSIs, we have two windows in the address space (one at the top of
|
||||
the 32-bit space and one much higher) which, via a combination of the
|
||||
address and MSI value, will result in one of the 2048 interrupts per
|
||||
bridge being triggered. There's a PE# in the interrupt controller
|
||||
descriptor table as well which is compared with the PE# obtained from
|
||||
the RTT to "authorize" the device to emit that specific interrupt.
|
||||
|
||||
- Error messages just use the RTT.
|
||||
|
||||
* Outbound. That's where the tricky part is.
|
||||
|
||||
Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
|
||||
from the CPU address space to the PCI address space. There is one M32
|
||||
window and sixteen M64 windows. They have different characteristics.
|
||||
First what they have in common: they forward a configurable portion of
|
||||
the CPU address space to the PCIe bus and must be naturally aligned
|
||||
power of two in size. The rest is different:
|
||||
|
||||
- The M32 window:
|
||||
|
||||
* Is limited to 4GB in size.
|
||||
|
||||
* Drops the top bits of the address (above the size) and replaces
|
||||
them with a configurable value. This is typically used to generate
|
||||
32-bit PCIe accesses. We configure that window at boot from FW and
|
||||
don't touch it from Linux; it's usually set to forward a 2GB
|
||||
portion of address space from the CPU to PCIe
|
||||
0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
|
||||
reserved for MSIs but this is not a problem at this point; we just
|
||||
need to ensure Linux doesn't assign anything there, the M32 logic
|
||||
ignores that however and will forward in that space if we try).
|
||||
|
||||
* It is divided into 256 segments of equal size. A table in the chip
|
||||
maps each segment to a PE#. That allows portions of the MMIO space
|
||||
to be assigned to PEs on a segment granularity. For a 2GB window,
|
||||
the segment granularity is 2GB/256 = 8MB.
|
||||
|
||||
Now, this is the "main" window we use in Linux today (excluding
|
||||
SR-IOV). We basically use the trick of forcing the bridge MMIO windows
|
||||
onto a segment alignment/granularity so that the space behind a bridge
|
||||
can be assigned to a PE.
|
||||
|
||||
Ideally we would like to be able to have individual functions in PEs
|
||||
but that would mean using a completely different address allocation
|
||||
scheme where individual function BARs can be "grouped" to fit in one or
|
||||
more segments.
|
||||
|
||||
- The M64 windows:
|
||||
|
||||
* Must be at least 256MB in size.
|
||||
|
||||
* Do not translate addresses (the address on PCIe is the same as the
|
||||
address on the PowerBus). There is a way to also set the top 14
|
||||
bits which are not conveyed by PowerBus but we don't use this.
|
||||
|
||||
* Can be configured to be segmented. When not segmented, we can
|
||||
specify the PE# for the entire window. When segmented, a window
|
||||
has 256 segments; however, there is no table for mapping a segment
|
||||
to a PE#. The segment number *is* the PE#.
|
||||
|
||||
* Support overlaps. If an address is covered by multiple windows,
|
||||
there's a defined ordering for which window applies.
|
||||
|
||||
We have code (fairly new compared to the M32 stuff) that exploits that
|
||||
for large BARs in 64-bit space:
|
||||
|
||||
We configure an M64 window to cover the entire region of address space
|
||||
that has been assigned by FW for the PHB (about 64GB, ignore the space
|
||||
for the M32, it comes out of a different "reserve"). We configure it
|
||||
as segmented.
|
||||
|
||||
Then we do the same thing as with M32, using the bridge alignment
|
||||
trick, to match to those giant segments.
|
||||
|
||||
Since we cannot remap, we have two additional constraints:
|
||||
|
||||
- We do the PE# allocation *after* the 64-bit space has been assigned
|
||||
because the addresses we use directly determine the PE#. We then
|
||||
update the M32 PE# for the devices that use both 32-bit and 64-bit
|
||||
spaces or assign the remaining PE# to 32-bit only devices.
|
||||
|
||||
- We cannot "group" segments in HW, so if a device ends up using more
|
||||
than one segment, we end up with more than one PE#. There is a HW
|
||||
mechanism to make the freeze state cascade to "companion" PEs but
|
||||
that only works for PCIe error messages (typically used so that if
|
||||
you freeze a switch, it freezes all its children). So we do it in
|
||||
SW. We lose a bit of effectiveness of EEH in that case, but that's
|
||||
the best we found. So when any of the PEs freezes, we freeze the
|
||||
other ones for that "domain". We thus introduce the concept of
|
||||
"master PE" which is the one used for DMA, MSIs, etc., and "secondary
|
||||
PEs" that are used for the remaining M64 segments.
|
||||
|
||||
We would like to investigate using additional M64 windows in "single
|
||||
PE" mode to overlay over specific BARs to work around some of that, for
|
||||
example for devices with very large BARs, e.g., GPUs. It would make
|
||||
sense, but we haven't done it yet.
|
||||
|
||||
3. Considerations for SR-IOV on PowerKVM
|
||||
|
||||
* SR-IOV Background
|
||||
|
||||
The PCIe SR-IOV feature allows a single Physical Function (PF) to
|
||||
support several Virtual Functions (VFs). Registers in the PF's SR-IOV
|
||||
Capability control the number of VFs and whether they are enabled.
|
||||
|
||||
When VFs are enabled, they appear in Configuration Space like normal
|
||||
PCI devices, but the BARs in VF config space headers are unusual. For
|
||||
a non-VF device, software uses BARs in the config space header to
|
||||
discover the BAR sizes and assign addresses for them. For VF devices,
|
||||
software uses VF BAR registers in the *PF* SR-IOV Capability to
|
||||
discover sizes and assign addresses. The BARs in the VF's config space
|
||||
header are read-only zeros.
|
||||
|
||||
When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
|
||||
base address for all the corresponding VF(n) BARs. For example, if the
|
||||
PF SR-IOV Capability is programmed to enable eight VFs, and it has a
|
||||
1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
|
||||
This region is divided into eight contiguous 1MB regions, each of which
|
||||
is a BAR0 for one of the VFs. Note that even though the VF BAR
|
||||
describes an 8MB region, the alignment requirement is for a single VF,
|
||||
i.e., 1MB in this example.
|
||||
|
||||
There are several strategies for isolating VFs in PEs:
|
||||
|
||||
- M32 window: There's one M32 window, and it is split into 256
|
||||
equally-sized segments. The finest granularity possible is a 256MB
|
||||
window with 1MB segments. VF BARs that are 1MB or larger could be
|
||||
mapped to separate PEs in this window. Each segment can be
|
||||
individually mapped to a PE via the lookup table, so this is quite
|
||||
flexible, but it works best when all the VF BARs are the same size. If
|
||||
they are different sizes, the entire window has to be small enough that
|
||||
the segment size matches the smallest VF BAR, which means larger VF
|
||||
BARs span several segments.
|
||||
|
||||
- Non-segmented M64 window: A non-segmented M64 window is mapped entirely
|
||||
to a single PE, so it could only isolate one VF.
|
||||
|
||||
- Single segmented M64 windows: A segmented M64 window could be used just
|
||||
like the M32 window, but the segments can't be individually mapped to
|
||||
PEs (the segment number is the PE#), so there isn't as much
|
||||
flexibility. A VF with multiple BARs would have to be in a "domain" of
|
||||
multiple PEs, which is not as well isolated as a single PE.
|
||||
|
||||
- Multiple segmented M64 windows: As usual, each window is split into 256
|
||||
equally-sized segments, and the segment number is the PE#. But if we
|
||||
use several M64 windows, they can be set to different base addresses
|
||||
and different segment sizes. If we have VFs that each have a 1MB BAR
|
||||
and a 32MB BAR, we could use one M64 window to assign 1MB segments and
|
||||
another M64 window to assign 32MB segments.
|
||||
|
||||
Finally, the plan to use M64 windows for SR-IOV, which will be described
|
||||
more in the next two sections. For a given VF BAR, we need to
|
||||
effectively reserve the entire 256 segments (256 * VF BAR size) and
|
||||
position the VF BAR to start at the beginning of a free range of
|
||||
segments/PEs inside that M64 window.
|
||||
|
||||
The goal is of course to be able to give a separate PE for each VF.
|
||||
|
||||
The IODA2 platform has 16 M64 windows, which are used to map MMIO
|
||||
range to PE#. Each M64 window defines one MMIO range and this range is
|
||||
divided into 256 segments, with each segment corresponding to one PE.
|
||||
|
||||
We decide to leverage this M64 window to map VFs to individual PEs, since
|
||||
SR-IOV VF BARs are all the same size.
|
||||
|
||||
But doing so introduces another problem: total_VFs is usually smaller
|
||||
than the number of M64 window segments, so if we map one VF BAR directly
|
||||
to one M64 window, some part of the M64 window will map to another
|
||||
device's MMIO range.
|
||||
|
||||
IODA supports 256 PEs, so segmented windows contain 256 segments, so if
|
||||
total_VFs is less than 256, we have the situation in Figure 1.0, where
|
||||
segments [total_VFs, 255] of the M64 window may map to some MMIO range on
|
||||
other devices:
|
||||
|
||||
0 1 total_VFs - 1
|
||||
+------+------+- -+------+------+
|
||||
| | | ... | | |
|
||||
+------+------+- -+------+------+
|
||||
|
||||
VF(n) BAR space
|
||||
|
||||
0 1 total_VFs - 1 255
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
| | | ... | | | ... | | |
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
|
||||
M64 window
|
||||
|
||||
Figure 1.0 Direct map VF(n) BAR space
|
||||
|
||||
Our current solution is to allocate 256 segments even if the VF(n) BAR
|
||||
space doesn't need that much, as shown in Figure 1.1:
|
||||
|
||||
0 1 total_VFs - 1 255
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
| | | ... | | | ... | | |
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
|
||||
VF(n) BAR space + extra
|
||||
|
||||
0 1 total_VFs - 1 255
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
| | | ... | | | ... | | |
|
||||
+------+------+- -+------+------+- -+------+------+
|
||||
|
||||
M64 window
|
||||
|
||||
Figure 1.1 Map VF(n) BAR space + extra
|
||||
|
||||
Allocating the extra space ensures that the entire M64 window will be
|
||||
assigned to this one SR-IOV device and none of the space will be
|
||||
available for other devices. Note that this only expands the space
|
||||
reserved in software; there are still only total_VFs VFs, and they only
|
||||
respond to segments [0, total_VFs - 1]. There's nothing in hardware that
|
||||
responds to segments [total_VFs, 255].
|
||||
|
||||
4. Implications for the Generic PCI Code
|
||||
|
||||
The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
|
||||
aligned to the size of an individual VF BAR.
|
||||
|
||||
In IODA2, the MMIO address determines the PE#. If the address is in an M32
|
||||
window, we can set the PE# by updating the table that translates segments
|
||||
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
|
||||
set the PE# for the window. But if it's in a segmented M64 window, the
|
||||
segment number is the PE#.
|
||||
|
||||
Therefore, the only way to control the PE# for a VF is to change the base
|
||||
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
|
||||
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
|
||||
and cannot be changed.
|
||||
|
||||
On the other hand, if the PCI core allocates additional space, the VF BAR
|
||||
value can be changed as long as the entire VF(n) BAR space remains inside
|
||||
the space allocated by the core.
|
||||
|
||||
Ideally the segment size will be the same as an individual VF BAR size.
|
||||
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
|
||||
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
|
||||
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
|
||||
|
||||
If the segment size is smaller than the VF BAR size, it will take several
|
||||
segments to cover a VF BAR, and a VF will be in several PEs. This is
|
||||
possible, but the isolation isn't as good, and it reduces the number of PE#
|
||||
choices because instead of consuming only numVFs segments, the VF(n) BAR
|
||||
space will consume (numVFs * n) segments. That means there aren't as many
|
||||
available segments for adjusting base of the VF(n) BAR space.
|
Loading…
Reference in New Issue