llvm-mca - LLVM Machine Code Analyzer
-------------------------------------

llvm-mca is a performance analysis tool that uses information already available
in LLVM (e.g. scheduling models) to statically measure the performance of
machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out-of-order
backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help diagnose potential performance issues.

Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per
Cycle), as well as hardware resource pressure. The analysis and reporting style
were inspired by the IACA tool from Intel.

The presence of long data dependency chains, as well as poor usage of hardware
resources, may lead to bottlenecks in the back-end. The tool is able to
generate a detailed report which should help with identifying and analyzing
sources of bottlenecks.

Scheduling models are mostly used to compute instruction latencies, to obtain
read-advance information, and to understand how processor resources are used by
instructions. By design, the quality of the performance analysis conducted by
the tool is inevitably affected by the quality of the target scheduling models.

However, scheduling models intentionally do not describe all processor details,
since the goal is just to enable the scheduling of machine instructions during
compilation. That means there are processor details which are not important for
the purpose of scheduling instructions (and therefore not described by the
scheduling model), but are very important for this tool.

A few examples of details that are missing in scheduling models are:
- Maximum number of instructions retired per cycle.
- Actual dispatch width (it often differs from the issue width).
- Number of temporary registers available for renaming.
- Number of read/write ports in the register file(s).
- Length of the load/store queue in the LSUnit.

It is also very difficult to find a "good" abstract model to describe the
behavior of out-of-order processors. So we have to keep in mind that all of
these aspects are going to affect the quality of the static analysis performed
by the tool.

An extensive list of known limitations is reported in one of the last sections
of this document. There is also a section on design problems which must be
addressed (hopefully with the help of the community). At the moment, the tool
has been mostly tested for x86 targets, but there are still several
limitations, some of which could be overcome by integrating extra information
into the scheduling models.

How the tool works
------------------

The tool takes assembly code as input. Assembly code is parsed into a sequence
of MCInst with the help of the existing LLVM target assembly parsers. The
parsed sequence of MCInst is then analyzed by a 'Backend' module to generate a
performance report.

The Backend module internally emulates the execution of the machine code
sequence in a loop of iterations (70 by default). At the end of this process,
the backend collects a number of statistics which are then printed out in the
form of a report.

Here is an example of a performance report generated by the tool for a
dot-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2.

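The report below was produced by an invocation along these lines, where
dot-product.s is a placeholder name for a file containing just the
three-instruction kernel (the vmulps followed by the two vhaddps):

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
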
///////////////////

Iterations:     300
Instructions:   900
Total Cycles:   610
Dispatch Width: 2
IPC:            1.48


Resources:
[0] - JALU0
[1] - JALU1
[2] - JDiv
[3] - JFPM
[4] - JFPU0
[5] - JFPU1
[6] - JLAGU
[7] - JSAGU
[8] - JSTC
[9] - JVIMUL


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -      -      -     2.00   1.00    -      -      -      -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -     1.00    -      -      -      -     vmulps %xmm0, %xmm1, %xmm2
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
 -      -      -      -     1.00    -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     1.00                        vmulps %xmm0, %xmm1, %xmm2
 1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
 1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4

///////////////////

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 instructions dynamically executed.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are a) the predicted total number of cycles, and b) the
IPC. IPC is probably the most important throughput indicator. A big delta
between the Dispatch Width and the computed IPC is an indicator of potential
performance issues.

The second section is the so-called "resource pressure view". This view reports
the average number of resource cycles consumed every iteration by instructions,
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instructions in the sequence. For example, in every
iteration of the dot-product, instruction 'vmulps' always executes on resource
unit [5] (JFPU1 - floating point pipeline #1), consuming an average of 1
resource cycle per iteration. Note that on Jaguar, vector FP multiply can only
be issued to pipeline JFPU1, while horizontal FP adds can only be issued to
pipeline JFPU0.

The third (and last) section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e. 'MayLoad', 'MayStore', and 'HasSideEffects').

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.

Timeline View
-------------

A detailed report of each instruction's state transitions over time can be
enabled using the command line flag '-timeline'. This prints an extra section
in the report which contains the so-called "timeline view". Below is the
timeline view for the dot-product example from the previous section.

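For example, a command along these lines (again with the placeholder input
file name used earlier) would enable it:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -timeline dot-product.s
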
///////////////
Timeline view:
                012345
Index 0123456789

[0,0] DeeER.    .    .   vmulps %xmm0, %xmm1, %xmm2
[0,1] D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
[0,2] .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4

[1,0] .DeeE-----R    .   vmulps %xmm0, %xmm1, %xmm2
[1,1] . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
[1,2] . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4

[2,0] .  DeeE-----R  .   vmulps %xmm0, %xmm1, %xmm2
[2,1] .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
[2,2] .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     1.0    1.0    3.3    vmulps %xmm0, %xmm1, %xmm2
1.     3     3.3    0.7    1.0    vhaddps %xmm2, %xmm2, %xmm3
2.     3     5.7    0.0    0.0    vhaddps %xmm3, %xmm3, %xmm4
///////////////

The timeline view is very interesting because it shows how instructions change
state during execution. It also gives an idea of how the tool "sees"
instructions executed on the target.

The timeline view is structured in two tables. The first table shows how
instructions change state over time (measured in cycles); the second table
(named "Average Wait times") reports useful timing statistics which should help
diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the 'first' index identifies an iteration, and the 'second' index is the actual
instruction index (i.e. where it appears in the code sequence).

Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0. The following characters are
used to describe the state of an instruction:

D : Instruction dispatched.
e : Instruction executing.
E : Instruction executed.
R : Instruction retired.
= : Instruction already dispatched, waiting to be executed.
- : Instruction executed, waiting to be retired.

Based on the timeline view from the example, we know that:
- Instruction [1,0] was dispatched at cycle 1.
- Instruction [1,0] started executing at cycle 2.
- Instruction [1,0] reached the write back stage at cycle 4.
- Instruction [1,0] was retired at cycle 10.

Instruction [1,0] (i.e. the vmulps from iteration #1) doesn't have to wait in
the scheduler's queue for the operands to become available. By the time the
vmulps is dispatched, operands are already available, and pipeline JFPU1 is
ready to serve another instruction. So the instruction can be immediately
issued on the JFPU1 pipeline. That is demonstrated by the fact that the
instruction only spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e. it has to wait until cycle 10).

In the dot-product example, all instructions are in a RAW (Read After Write)
dependency chain. Register %xmm2 written by the vmulps is immediately used by
the first vhaddps, and register %xmm3 written by the first vhaddps is used by
the second vhaddps. Long data dependencies negatively affect the ILP
(Instruction Level Parallelism).

In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming temporary registers).

Table "Average Wait times" helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data
dependencies which may limit the ILP. Note that the tool by default assumes at
least 1cy between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the "ready" state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. So the difference between the two counters is a good
indicator of how big of an impact data dependencies had on the execution of
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be bigger (i.e. more than 1-3cy), especially
when compared with other low latency instructions.

Extra statistics to further diagnose performance issues
--------------------------------------------------------

Flag '-verbose' enables extra statistics and performance counters for the
dispatch logic, the reorder buffer, the retire control unit and the register
file.

Below is an example of the verbose output generated by the tool for the
dot-product example discussed in the previous sections.

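A command along these lines (same placeholder input file as before) would
produce it:

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 -verbose dot-product.s
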
///////////////////
Iterations:     300
Instructions:   900
Total Cycles:   610
Dispatch Width: 2
IPC:            1.48


Dynamic Dispatch Stall Cycles:
RAT     - Register unavailable:                      0
RCU     - Retire tokens unavailable:                 0
SCHEDQ  - Scheduler full:                            272
LQ      - Load queue full:                           0
SQ      - Store queue full:                          0
GROUP   - Static restrictions on the dispatch group: 0


Register Alias Table:
Total number of mappings created: 900
Max number of mappings used:      35


Dispatch Logic - number of cycles where we saw N instructions dispatched:
[# dispatched], [# cycles]
 0,              24  (3.9%)
 1,              272 (44.6%)
 2,              314 (51.5%)


Schedulers - number of cycles where we saw N instructions issued:
[# issued], [# cycles]
 0,          7   (1.1%)
 1,          306 (50.2%)
 2,          297 (48.7%)


Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           109 (17.9%)
 1,           102 (16.7%)
 2,           399 (65.4%)


Scheduler's queue usage:
JALU01,  0/20
JFPU01,  18/18
JLSAGU,  0/12
///////////////////

Based on the verbose report, the backend was only able to dispatch two
instructions 51.5% of the time. The dispatch group was limited to one
instruction 44.6% of the cycles, which corresponds to 272 cycles.

If we look at section "Dynamic Dispatch Stall Cycles", we can see how counter
SCHEDQ reports 272 cycles. Counter SCHEDQ is incremented every time the
dispatch logic is unable to dispatch a full group of two instructions because
the scheduler's queue is full.

Section "Scheduler's queue usage" shows that the number of buffer entries (i.e.
scheduler's queue entries) used at runtime for resource JFPU01 reached its
maximum. Note that AMD Jaguar implements three schedulers:
* JALU01 - A scheduler for ALU instructions.
* JLSAGU - A scheduler for address generation.
* JFPU01 - A scheduler for floating point operations.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used according to section "Scheduler's queue
usage".

A full scheduler's queue is either caused by data dependency chains, or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.

In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

LLVM-MCA instruction flow
-------------------------

This section describes the instruction flow through the out-of-order backend,
as well as the functional units involved in the process.

An instruction goes through a default sequence of stages:
- Dispatch (Instruction is dispatched to the schedulers).
- Issue (Instruction is issued to the processor pipelines).
- Write Back (Instruction is executed, and results are written back).
- Retire (Instruction is retired; writes are architecturally committed).

The tool only models the out-of-order portion of a processor. Therefore, the
instruction fetch and decode stages are not modeled. Performance bottlenecks in
the frontend are not diagnosed by this tool. The tool assumes that instructions
have all been decoded and placed in a queue. Also, the tool doesn't know
anything about branch prediction.

The long term plan is to make this sequence of stages customizable, so that
processors can define their own. This is future work.

Instruction Dispatch
--------------------

During the Dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the hardware
schedulers. The dispatch logic is implemented by class DispatchUnit in file
Dispatch.h.

The size of a dispatch group depends on the availability of hardware resources,
and it cannot exceed the value of field 'DispatchWidth' in class DispatchUnit.
Note that field DispatchWidth defaults to the value of field 'IssueWidth' from
the scheduling model.

Users can override the DispatchWidth value with flag "-dispatch=<N>" (where 'N'
is an unsigned quantity).

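For example, to simulate a dispatch width of four (the value here is arbitrary,
picked purely for illustration):

  $ llvm-mca -mcpu=btver2 -dispatch=4 dot-product.s
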
An instruction can be dispatched if:
- The size of the dispatch group is smaller than DispatchWidth.
- There are enough entries in the reorder buffer.
- There are enough temporary registers to do register renaming.
- Schedulers are not full.

Scheduling models don't describe register files, and therefore the tool doesn't
know if there is more than one register file, or how many temporaries are
available for register renaming.

By default, the tool (optimistically) assumes a single register file with an
unbounded number of temporary registers. Users can limit the number of
temporary registers available for register renaming using flag
`-register-file-size=<N>`, where N is the number of temporaries. A value of
zero for N means 'unbounded'. Knowing how many temporaries are available for
register renaming, the tool can predict dispatch stalls caused by the lack of
temporaries.

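For instance, the following (with an arbitrarily chosen size of 64 temporaries)
would model a bounded register file:

  $ llvm-mca -mcpu=btver2 -register-file-size=64 dot-product.s
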
The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes it specifies in the target scheduling model (see field
'NumMicroOpcodes' of tablegen class ProcWriteResources and its derived classes;
TargetSchedule.td).

The reorder buffer is implemented by class RetireControlUnit (see Dispatch.h).
Its goal is to track the progress of instructions that are "in-flight", and
retire instructions in program order. The number of entries in the reorder
buffer defaults to the value of field 'MicroOpBufferSize' from the target
scheduling model.

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. The tool queries the scheduling model to figure out the set of
buffered resources consumed by an instruction. Buffered resources are treated
like "scheduler" resources, and the field 'BufferSize' (from the processor
resource tablegen definition) defines the size of the scheduler's queue.

Zero-latency instructions (for example NOP instructions) don't consume
scheduler resources. However, those instructions still reserve a number of
slots in the reorder buffer.

Instruction Issue
-----------------

As mentioned in the previous section, each scheduler resource implements a
queue of instructions. An instruction has to wait in the scheduler's queue
until input register operands become available. Only at that point does the
instruction become eligible for execution, and it may be issued (potentially
out-of-order) to a pipeline for execution.

Instruction latencies can be computed by the tool with the help of the
scheduling model; latency values are defined by the scheduling model through
ProcWriteResources objects.

Class Scheduler (see file Scheduler.h) knows how to emulate multiple processor
schedulers. A Scheduler is responsible for tracking data dependencies, and for
dynamically selecting which processor resources are consumed/used by
instructions.

Internally, the Scheduler class delegates the management of processor resource
units and resource groups to the ResourceManager class. ResourceManager is also
responsible for selecting the resource units that are effectively consumed by
instructions. For example, if an instruction consumes 1cy of a resource group,
the ResourceManager object selects one of the available units from the group;
by default, it uses a round-robin selector to guarantee that resource usage is
uniformly distributed between all units of a group.

Internally, class Scheduler implements three instruction queues:
- WaitQueue: a queue of instructions whose operands are not ready yet.
- ReadyQueue: a queue of instructions ready to execute.
- IssuedQueue: a queue of instructions executing.

Depending on operand availability, instructions that are dispatched to the
Scheduler are either placed into the WaitQueue or into the ReadyQueue.

Every cycle, class Scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued to the underlying pipelines. The algorithm prioritizes older
instructions over younger instructions.

Objects of class ResourceState (see Scheduler.h) describe processor resources.
There is an instance of class ResourceState for every single processor resource
specified by the scheduling model. A ResourceState object for a processor
resource with multiple units dynamically tracks the availability of every
single unit. For example, the ResourceState of a resource group tracks the
availability of every resource in that group. Internally, ResourceState
implements a round-robin selector to dynamically pick the next unit to use from
the group.

Write-Back and Retire Stage
---------------------------

Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

Upon an "instruction executed" event, the retire control unit flags the
instruction as "ready to retire".

Instructions are retired in program order; an "instruction retired" event is
sent to the register file, which frees the temporary registers allocated for
the instruction at the register renaming stage.

Load/Store Unit and Memory Consistency Model
--------------------------------------------

The tool attempts to emulate out-of-order execution of memory operations. Class
LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues for
speculative execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. The number
of slots in the load/store queues is unknown by the tool, since there is no
mention of it in the scheduling model. In practice, users can specify flag
`-lqueue=N` (respectively, `-squeue=N`) to limit the number of entries in the
queue to exactly N (an unsigned value). If N is zero, then the tool assumes an
unbounded queue (this is the default).

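For example, the following invocation (the queue sizes and the input file name
kernel.s are made up purely for illustration) would bound both queues:

  $ llvm-mca -mcpu=btver2 -lqueue=16 -squeue=8 kernel.s
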
LSUnit implements a relaxed consistency model for memory loads and stores. The
rules are:
1) A younger load is allowed to pass an older load only if there is no
   intervening store between the two loads.
2) A younger store is not allowed to pass an older store.
3) A younger store is not allowed to pass an older load.
4) A younger load is allowed to pass an older store provided that the load does
   not alias with the store.

By default, this class conservatively (i.e. pessimistically) assumes that loads
always may-alias store operations. Essentially, this LSUnit doesn't perform any
sort of alias analysis to rule out cases where loads and stores don't overlap
with each other. The downside of this approach, however, is that younger loads
are never allowed to pass older stores. To make it possible for a younger load
to pass an older store, users can use the command line flag -noalias. Under
'noalias', a younger load is always allowed to pass an older store.

Note that, in the case of write-combining memory, rule 2 could be relaxed a bit
to allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (flag -noalias is the
only option). Essentially, there is no option to specify a different memory
type (for example: write-back, write-combining, write-through, etc.) and
consequently to weaken or strengthen the memory model.

Other limitations are:
* LSUnit doesn't know when store-to-load forwarding may occur.
* LSUnit doesn't know anything about the cache hierarchy and memory types.
* LSUnit doesn't know how to identify serializing operations and memory fences.

No assumption is made on the store buffer size. As mentioned before, LSUnit
conservatively assumes a may-alias relation between loads and stores, and it
doesn't attempt to identify cases where store-to-load forwarding would occur in
practice.

LSUnit doesn't attempt to predict whether a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore". For loads,
the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for a hit in the L1D).

Class MCInstrDesc in LLVM doesn't know about serializing operations, nor about
memory-barrier like instructions. LSUnit conservatively assumes that an
instruction which has both 'MayLoad' and 'UnmodeledSideEffects' behaves like a
"soft" load-barrier. That means, it serializes loads without forcing a flush of
the load queue. Similarly, instructions flagged with both 'MayStore' and
'UnmodeledSideEffects' are treated like store barriers. A full memory barrier
is a 'MayLoad' and 'MayStore' instruction with 'UnmodeledSideEffects'. This is
inaccurate, but it is the best that we can do at the moment with the current
information available in LLVM.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load has
to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all the older loads/stores have been executed.

In conclusion, the full set of rules is:
1. A store may not pass a previous store.
2. A load may not pass a previous store unless flag 'NoAlias' is set.
3. A load may pass a previous load.
4. A store may not pass a previous load (regardless of flag 'NoAlias').
5. A load has to wait until an older load barrier is fully executed.
6. A store has to wait until an older store barrier is fully executed.

Known limitations
-----------------

Previous sections described cases where the tool is missing information to give
an accurate report. For example, the first sections of this document explained
how the lack of knowledge about the processor negatively affects the
performance analysis. The lack of knowledge is often a consequence of how
scheduling models are defined; as mentioned before, scheduling models
intentionally don't describe processors in fine detail. That being said, the
LLVM machine model can be extended to expose more details, as long as they are
opt-in for targets.

The accuracy of the performance analysis is also affected by assumptions made
by the processor model used by the tool.

Most recent Intel and AMD processors implement a dedicated LoopBuffer/OpCache
in the hardware frontend to speed up the throughput in the presence of tight
loops. The presence of these buffers complicates the decoding logic, and
requires knowledge of the branch predictor too. Class 'SchedMachineModel' in
tablegen provides a field named 'LoopMicroOpBufferSize' which is used to
describe loop buffers. However, the purpose of that field is to enable loop
unrolling of tight loops; essentially, it affects the cost model used by pass
loop-unroll.

In its current state, the tool only describes the out-of-order portion of a
processor, and consequently doesn't try to predict the frontend throughput.
That being said, this tool could certainly be extended in the future to also
account for the hardware frontend when doing performance analysis. This would
inevitably require extra (extensive) processor knowledge related to all the
available decoding paths in the hardware frontend, as well as branch
prediction.

Currently, the tool assumes a zero-latency "perfect" fetch&decode stage; the
full sequence of decoded instructions is immediately visible to the dispatch
logic from the start.

The tool doesn't know about simultaneous multithreading. According to the tool,
processor resources are not statically/dynamically partitioned. Processor
resources are fully available to the hardware thread executing the
microbenchmark.

The execution model implemented by this tool assumes that instructions are
first dispatched in groups to hardware schedulers, and then issued to pipelines
for execution. The model assumes dynamic scheduling of instructions.
Instructions are placed in a queue and potentially executed out-of-order (based
on operand availability). The dispatch stage is distinct from the issue stage.
This will change in the future; as mentioned in the first section, the end goal
is to let processors customize the process.

This model doesn't correctly describe processors where dispatch/issue is a
single stage. This is what happens, for example, in VLIW processors, where
instructions are packaged and statically scheduled at compile time; it is up to
the compiler to predict the latency of instructions and package issue groups
accordingly. For such targets, there is no dynamic scheduling done by the
hardware.

Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted to
support processors with a single dispatch/issue stage. The execution flow would
require some changes in the way existing components (i.e. DispatchUnit,
Scheduler, etc.) interact. This could be a future development.

The following sections describe other known limitations. The goal is not to
provide an extensive list of limitations; we want to report what we believe are
the most important limitations, and suggest possible methods to overcome them.

Load/Store barrier instructions and serializing operations
----------------------------------------------------------

Section "Load/Store Unit and Memory Consistency Model" already mentioned how
|
|
LLVM doesn't know about serializing operations and memory barriers. Most of it
|
|
boils down to the fact that class MCInstrDesc (intentionally) doesn't expose
|
|
those properties. Instead, both serializing operations and memory barriers
|
|
"have side-effects" according to MCInstrDesc. That is because, at least for
|
|
scheduling purposes, knowing that an instruction has unmodeled side effects is
|
|
often enough to treat the instruction like a compiler scheduling barrier.
|
|
|
|
A performance analysis tool could use the extra knowledge on barriers and
|
|
serializing operations to generate a more accurate performance report. One way
|
|
to improve this is by reserving a couple of bits in field 'Flags' from class
|
|
MCInstrDesc: one bit for barrier operations, and another bit to mark
|
|
instructions as serializing operations.
|
|
|
|
Lack of support for instruction itineraries
-------------------------------------------

The current version of the tool doesn't know how to process instruction
itineraries. This is probably one of the most important limitations, since it
affects a few out-of-order processors in LLVM.

As mentioned in section 'Instruction Issue', class Scheduler delegates to an
instance of class ResourceManager the handling of processor resources.
ResourceManager is where most of the scheduling logic is implemented.

Adding support for instruction itineraries requires that we teach
ResourceManager how to handle functional units and instruction stages. This
development could be a future extension, and it would probably require a few
changes to the ResourceManager interface.

Instructions that affect control flow are not correctly modeled
---------------------------------------------------------------

Examples of instructions that affect the control flow are: returns, indirect
branches, calls, etc. The tool doesn't try to predict/evaluate branch targets.
In particular, the tool doesn't model any sort of branch prediction, nor does
it attempt to track changes to the program counter. The tool always assumes
that the input assembly sequence is the body of a microbenchmark (a simple loop
executed for a number of iterations). The "next" instruction in sequence is
always the next instruction to dispatch.

Call instructions default to an arbitrarily high latency of 100cy. A warning is
generated if the tool encounters a call instruction in the sequence. Return
instructions are not evaluated, and therefore control flow is not affected.
However, the tool still queries the processor scheduling model to obtain
latency information for instructions that affect the control flow.

Possible extensions to the scheduling model
-------------------------------------------

Section "Instruction Dispatch" explained how the tool doesn't know about the
|
|
register files, and temporaries available in each register file for register
|
|
renaming purposes.
|
|
|
|
The LLVM scheduling model could be extended to better describe register files.
|
|
Ideally, scheduling model should be able to define:
|
|
- The size of each register file
|
|
- How many temporary registers are available for register renaming
|
|
- How register classes map to register files
|
|
|
|
The scheduling model doesn't specify the retire throughput (i.e. how many
|
|
instructions can be retired every cycle). Users can specify flag
|
|
`-max-retire-per-cycle=<uint>` to limit how many instructions the retire control
|
|
unit can retire every cycle. Ideally, every processor should be able to specify
|
|
the retire throughput (for example, by adding an extra field to the scheduling
|
|
model tablegen class).
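For example, the following (the limit of two retirements per cycle is an
arbitrary value picked for illustration) would constrain the retire control
unit:

  $ llvm-mca -mcpu=btver2 -max-retire-per-cycle=2 dot-product.s
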
Known limitations on X86 processors
-----------------------------------

1) Partial register updates versus full register updates.

On x86-64, a 32-bit GPR write fully updates the super-register. Example:
    add %edi, %eax    ## eax += edi

Here, register %eax aliases the lower half of 64-bit register %rax. On x86-64,
register %rax is fully updated by the 'add' (the upper half of %rax is zeroed).
Essentially, it "kills" any previous definition of (the upper half of) register
%rax.

On the other hand, 8/16-bit register writes only perform a so-called "partial
register update". Example:
    add %di, %ax    ## ax += di

Here, register %eax is only partially updated. To be more specific, the lower
half of %eax is set, and the upper half is left unchanged. There is also no
change in the upper 48 bits of register %rax.

To get accurate performance analysis, the tool has to know which instructions
perform a partial register update, and which instructions fully update the
destination's super-register.

One way to expose this information is (again) via tablegen. For example, we
could add a flag in the tablegen instruction class to tag instructions that
perform partial register updates. Something like this: 'bit
hasPartialRegisterUpdate = 1'. However, this would force a 'let
hasPartialRegisterUpdate = 0' on several instruction definitions.

Another approach is to have a MCSubtargetInfo hook similar to this:
    virtual bool updatesSuperRegisters(unsigned short opcode) { return false; }

Targets will be able to override this method if needed. Again, this is just an
idea. But the plan is to have this fixed as a future development.

2) Macro Op fusion.

The tool doesn't know about macro-op fusion. On modern x86 processors, a
'cmp/test' followed by a 'jmp' is fused into a single macro operation. The
advantage is that the fused pair only consumes a single slot in the dispatch
group.

As a future development, the tool should be extended to address macro-fusion.
Ideally, we could have LLVM generate a table enumerating all the opcode pairs
that can be fused together. That table could be exposed to the tool via the
MCSubtargetInfo interface. This is just an idea; there may be better ways to
implement this.

3) Intel processors: mixing legacy SSE with AVX instructions.

On modern Intel processors with AVX, mixing legacy SSE code with AVX code
negatively impacts the performance. The tool is not aware of this issue, and
the performance penalty is not accounted for in the analysis. This is
something that we would like to improve in the future.

4) Zero-latency register moves and zero-idioms.

Most modern AMD/Intel processors know how to optimize out register-register
moves and zero idioms at the register renaming stage. The tool doesn't know
about these patterns, and this may negatively impact the performance analysis.

Known design problems
---------------------

This section describes two design issues that currently affect the tool. The
long term plan is to "fix" these issues. Both limitations would be easily fixed
if we taught the tool how to directly manipulate MachineInstr objects (instead
of MCInst objects).

1) Variant instructions are not correctly modeled.

The tool doesn't know how to analyze instructions with a "variant" scheduling
class descriptor. A variant scheduling class needs to be resolved dynamically.
The "actual" scheduling class often depends on the subtarget, as well as
properties of the specific MachineInstr object.

Unfortunately, the tool manipulates MCInst, and it doesn't know anything about
MachineInstr. As a consequence, the tool cannot use the existing machine
subtarget hooks that are normally used to resolve the variant scheduling class.
This is a major design issue which mostly affects ARM/AArch64 targets. It
mostly boils down to the fact that the existing scheduling framework was meant
to work for MachineInstr.

When the tool encounters a "variant" instruction, it assumes a generic 1cy
latency. However, the tool would not be able to tell which processor resources
are effectively consumed by the variant instruction.

2) MCInst and MCInstrDesc.

Performance analysis tools require data dependency information to correctly
predict the runtime performance of the code. This tool must always be able to
obtain the set of implicit/explicit register defs/uses for every instruction of
the input assembly sequence.

In the first section of this document, it was mentioned how the tool takes as
input an assembly sequence. That sequence is parsed into an MCInst sequence
with the help of the assembly parsers available from the targets.

An MCInst is a very low-level instruction representation. The tool can inspect
the MCOperand sequence of an MCInst to identify register operands. However,
there is no way to tell apart register operands that are definitions from
register operands that are uses.

In LLVM, class MCInstrDesc is used to fully describe target instructions and
their operands. The opcode of a machine instruction (a MachineInstr object) can
be used to query the instruction set through method 'MCInstrInfo::get' to
obtain the associated MCInstrDesc object.

However, class MCInstrDesc describes properties and operands of MachineInstr
objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst
objects. To be more specific, MCInstrDesc objects are automatically generated
via tablegen from the instruction set description in the target .td files. For
example, field 'MCInstrDesc::NumDefs' is always equal to the cardinality of the
'(outs)' set from the tablegen instruction definition.

By construction, register definitions always appear at the beginning of the
MachineOperands list in MachineInstr. Basically, the (outs) are the first
operands of a MachineInstr, and the (ins) come after them in the machine
operand list. Knowing the number of register definitions is enough to identify
all the register operands that are definitions.

In a normal compilation process, MCInst objects are generated from MachineInstr
objects through a lowering step. By default, the lowering logic simply iterates
over the machine operands of a MachineInstr, and converts/expands them into
equivalent MCOperand objects.

The default lowering strategy has the advantage of preserving all of the above
mentioned assumptions on the machine operand sequence. That means, register
definitions would still be at the beginning of the MCOperand sequence, and
register uses would come after.

Targets may still define custom lowering routines for specific opcodes. Some of
these routines may lower operands in a way that potentially breaks (some of)
the assumptions on the machine operand sequence which were valid for
MachineInstr. Luckily, this is not the most common form of lowering done by the
targets, and the vast majority of MachineInstr are lowered based on the default
strategy, which preserves the original machine operand sequence. This is
especially true for x86, where the custom lowering logic always preserves the
original (i.e. from the MachineInstr) operand sequence.

This tool currently works under the strong (and potentially incorrect)
assumption that register def/uses in an MCInst can always be identified by
querying the machine instruction descriptor for the opcode. This assumption
made it possible to develop this tool and get good numbers at least for the
processors available in the x86 backend.

That being said, the analysis is still potentially incorrect for other targets.
So we plan (with the help of the community) to find a proper mechanism to map,
when possible, MCOperand indices back to MachineOperand indices of the
equivalent MachineInstr. This would be equivalent to describing changes made by
the lowering step which affected the operand sequence. For example, we could
have an index for every register MCOperand (or -1, if the operand didn't exist
in the original MachineInstr). The mapping could look like this: <0,1,3,2>.
Here, MCOperand #2 was obtained from the lowering of MachineOperand #3, and so
on.

This information could be automatically generated via tablegen for all the
instructions whose custom lowering step breaks assumptions made by the tool on
the register operand sequence (in general, these instructions should be a
minority of a target's instruction set). Unfortunately, we don't have that
information now. As a consequence, we assume that the number of explicit
register definitions is the same number specified in MCInstrDesc. We also
assume that register definitions always come first in the operand sequence.

In conclusion, these are for now the strong assumptions made by the tool:
* The number of explicit and implicit register definitions in an MCInst
  matches the number of explicit and implicit definitions specified by the
  MCInstrDesc object.
* Register uses always come after register definitions.
* If an opcode specifies an optional definition, then the optional definition
  is always the last register operand in the sequence.

Note that some of the information accessible from the MCInstrDesc is always
valid for MCInst. For example: implicit register defs, implicit register uses,
and 'MayLoad/MayStore/HasUnmodeledSideEffects' opcode properties still apply to
MCInst. The tool knows about this, and uses that information during its
analysis.

Future work
-----------

* Address limitations (described in section "Known limitations").
* Integrate extra descriptions into the processor models, and make them opt-in
  for the targets (see section "Possible extensions to the scheduling model").
* Let processors specify the selection strategy for processor resource groups
  and resources with multiple units. The tool currently uses a round-robin
  selector to pick the next resource to use.
* Address limitations specifically described in section "Known limitations on
  X86 processors".
* Address design issues identified in section "Known design problems".
* Define a standard interface for "Views". This would let users customize the
  performance report generated by the tool.
* Simplify the Backend interface.

When interfaces are mature/stable:
* Move the logic into a library. This will enable a number of other
  interesting use cases.