[llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.
llvm-svn: 338394
parent 0b8fdd2847
commit bdcf6ad60d
@@ -207,23 +207,23 @@ EXIT STATUS
 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 to standard error, and the tool returns 1.

-HOW MCA WORKS
--------------
+HOW LLVM-MCA WORKS
+------------------

-MCA takes assembly code as input. The assembly code is parsed into a sequence
-of MCInst with the help of the existing LLVM target assembly parsers. The
-parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
-a performance report.
+:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
+into a sequence of MCInst with the help of the existing LLVM target assembly
+parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
+to generate a performance report.

 The Pipeline module simulates the execution of the machine code sequence in a
 loop of iterations (default is 100). During this process, the pipeline collects
 a number of execution related statistics. At the end of this process, the
 pipeline generates and prints a report from the collected statistics.

-Here is an example of a performance report generated by MCA for a dot-product
-of two packed float vectors of four elements. The analysis is conducted for
-target x86, cpu btver2. The following result can be produced via the following
-command using the example located at
+Here is an example of a performance report generated by the tool for a
+dot-product of two packed float vectors of four elements. The analysis is
+conducted for target x86, cpu btver2. The following result can be produced via
+the following command using the example located at
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

 .. code-block:: bash
@@ -316,7 +316,7 @@ pressure should be uniformly distributed between multiple resources.

 Timeline View
 ^^^^^^^^^^^^^
-MCA's timeline view produces a detailed report of each instruction's state
+The timeline view produces a detailed report of each instruction's state
 transitions through an instruction pipeline. This view is enabled by the
 command line option ``-timeline``. As instructions transition through the
 various stages of the pipeline, their states are depicted in the view report.
@@ -331,7 +331,7 @@ These states are represented by the following characters:

 Below is the timeline view for a subset of the dot-product example located in
 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
-MCA using the following command:
+:program:`llvm-mca` using the following command:

 .. code-block:: bash

@@ -366,7 +366,7 @@ MCA using the following command:
 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4

 The timeline view is interesting because it shows instruction state changes
-during execution. It also gives an idea of how MCA processes instructions
+during execution. It also gives an idea of how the tool processes instructions
 executed on the target, and how their timing information might be calculated.

 The timeline view is structured in two tables. The first table shows
@@ -415,8 +415,8 @@ and therefore consuming temporary registers).

 Table *Average Wait times* helps diagnose performance issues that are caused by
 the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP. Note that MCA, by default, assumes at least 1cy
-between the dispatch event and the issue event.
+which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
+least 1cy between the dispatch event and the issue event.

 When the performance is limited by data dependencies and/or long latency
 instructions, the number of cycles spent while in the *ready* state is expected
@@ -602,9 +602,9 @@ entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
 the target scheduling model.

 Instructions that are dispatched to the schedulers consume scheduler buffer
-entries. MCA queries the scheduling model to determine the set of
-buffered resources consumed by an instruction. Buffered resources are treated
-like scheduler resources.
+entries. :program:`llvm-mca` queries the scheduling model to determine the set
+of buffered resources consumed by an instruction. Buffered resources are
+treated like scheduler resources.

 Instruction Issue
 """""""""""""""""
@@ -612,22 +612,21 @@ Each processor scheduler implements a buffer of instructions. An instruction
 has to wait in the scheduler's buffer until input register operands become
 available. Only at that point, does the instruction becomes eligible for
 execution and may be issued (potentially out-of-order) for execution.
-Instruction latencies are computed by MCA with the help of the scheduling
-model.
+Instruction latencies are computed by :program:`llvm-mca` with the help of the
+scheduling model.

-MCA's scheduler is designed to simulate multiple processor schedulers. The
-scheduler is responsible for tracking data dependencies, and dynamically
-selecting which processor resources are consumed by instructions.
-
-The scheduler delegates the management of processor resource units and resource
-groups to a resource manager. The resource manager is responsible for
-selecting resource units that are consumed by instructions. For example, if an
-instruction consumes 1cy of a resource group, the resource manager selects one
-of the available units from the group; by default, the resource manager uses a
+:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
+schedulers. The scheduler is responsible for tracking data dependencies, and
+dynamically selecting which processor resources are consumed by instructions.
+It delegates the management of processor resource units and resource groups to a
+resource manager. The resource manager is responsible for selecting resource
+units that are consumed by instructions. For example, if an instruction
+consumes 1cy of a resource group, the resource manager selects one of the
+available units from the group; by default, the resource manager uses a
 round-robin selector to guarantee that resource usage is uniformly distributed
 between all units of a group.

-MCA's scheduler implements three instruction queues:
+:program:`llvm-mca`'s scheduler implements three instruction queues:

 * WaitQueue: a queue of instructions whose operands are not ready.
 * ReadyQueue: a queue of instructions ready to execute.
@@ -638,8 +637,8 @@ scheduler are either placed into the WaitQueue or into the ReadyQueue.

 Every cycle, the scheduler checks if instructions can be moved from the
 WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
-issued. The algorithm prioritizes older instructions over younger
-instructions.
+issued to the underlying pipelines. The algorithm prioritizes older instructions
+over younger instructions.

 Write-Back and Retire Stage
 """""""""""""""""""""""""""
@@ -656,15 +655,13 @@ for the instruction during the register renaming stage.

 Load/Store Unit and Memory Consistency Model
 """"""""""""""""""""""""""""""""""""""""""""
-To simulate an out-of-order execution of memory operations, MCA utilizes a
-simulated load/store unit (LSUnit) to simulate the speculative execution of
-loads and stores.
+To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
+utilizes a simulated load/store unit (LSUnit) to simulate the speculative
+execution of loads and stores.

-Each load (or store) consumes an entry in the load (or store) queue. The
-number of slots in the load/store queues is unknown by MCA, since there is no
-mention of it in the scheduling model. In practice, users can specify flags
-``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
-store queues respectively. The queues are unbounded by default.
+Each load (or store) consumes an entry in the load (or store) queue. Users can
+specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
+load and store queues respectively. The queues are unbounded by default.

 The LSUnit implements a relaxed consistency model for memory loads and stores.
 The rules are:
@@ -701,15 +698,15 @@ cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
 loads, the scheduling model provides an "optimistic" load-to-use latency (which
 usually matches the load-to-use latency for when there is a hit in the L1D).

-MCA does not know about serializing operations or memory-barrier like
-instructions. The LSUnit conservatively assumes that an instruction which has
-both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
-That means, it serializes loads without forcing a flush of the load queue.
-Similarly, instructions that "MayStore" and have unmodeled side effects are
-treated like store barriers. A full memory barrier is a "MayLoad" and
-"MayStore" instruction with unmodeled side effects. This is inaccurate, but it
-is the best that we can do at the moment with the current information available
-in LLVM.
+:program:`llvm-mca` does not know about serializing operations or memory-barrier
+like instructions. The LSUnit conservatively assumes that an instruction which
+has both "MayLoad" and unmodeled side effects behaves like a "soft"
+load-barrier. That means, it serializes loads without forcing a flush of the
+load queue. Similarly, instructions that "MayStore" and have unmodeled side
+effects are treated like store barriers. A full memory barrier is a "MayLoad"
+and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
+it is the best that we can do at the moment with the current information
+available in LLVM.

 A load/store barrier consumes one entry of the load/store queue. A load/store
 barrier enforces ordering of loads/stores. A younger load cannot pass a load
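
Note on the "Instruction Issue" text reworded by this commit: the documentation says the resource manager picks one available unit from a resource group and, by default, uses a round-robin selector so that usage is spread evenly across the group's units. The sketch below illustrates that idea only. It is not taken from the llvm-mca sources: the class name ``RoundRobinGroup``, its methods, and the unit names are hypothetical, and the busy-until bookkeeping is an assumption made to keep the example self-contained.

```python
class RoundRobinGroup:
    """Hypothetical model of a resource group with a round-robin selector.

    Illustrative only; names and bookkeeping are assumptions, not llvm-mca API.
    """

    def __init__(self, units):
        self.units = list(units)
        # Index of the unit that gets first refusal on the next request.
        self.next_index = 0
        # First cycle at which each unit is free again.
        self.busy_until = {unit: 0 for unit in self.units}

    def select(self, cycle, duration=1):
        """Reserve one free unit for `duration` cycles starting at `cycle`.

        Returns the chosen unit, or None if every unit in the group is busy.
        """
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if self.busy_until[unit] <= cycle:
                self.busy_until[unit] = cycle + duration
                # Start the next search after the unit just used, so that
                # requests rotate through the group instead of reusing one unit.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["U0", "U1"])
first = group.select(cycle=0)   # takes U0
second = group.select(cycle=0)  # U0 is busy, takes U1
third = group.select(cycle=0)   # both units busy in this cycle
fourth = group.select(cycle=1)  # U0 is free again
```

With two units and 1cy reservations, two requests in the same cycle land on different units and a third has to wait, which is the uniform distribution the documentation describes.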