As suggested on the bug, to help (but not completely....) stop folded instructions crossing the inline asm barriers used for llvm-mca analysis, we should recommend tagging with memory captures/attributes.
Differential Revision: https://reviews.llvm.org/D117788
memory-barrier instructions to providing targets and developers a convenient
way to explicitly declare which instructions are memory-barriers.
Differential Revision: https://reviews.llvm.org/D116779
Moved View.h and View.cpp from /tools/llvm-mca/Views/ to /lib/MCA/ and
/include/llvm/MCA/. This is so that targets can define their own Views within
the /lib/Target/ directory (so that the View can use backend functionality).
To enable these Views within mca, targets will need to add them to the vector of
Views returned by their target's CustomBehaviour::getViews() methods.
Differential Revision: https://reviews.llvm.org/D108520
Based on the discussion in PR50922, minor changes have been done to properly
output a valid JSON. Removed "not implemented" keys.
Differential Revision: https://reviews.llvm.org/D105064
Change --max-timeline-cycles=0 to mean no limit on the number of cycles.
Use this in AMDGPU tests to show all instructions in the timeline view
instead of having it arbitrarily truncated.
Differential Revision: https://reviews.llvm.org/D104846
The original change was pushed in main as commit f7a23ecece.
It was then reverted by commit a04f01bab2 because it caused linker failures
on buildbots that don't build the AMDGPU target.
--
Some instructions are not defined well enough within the target’s scheduling
model for llvm-mca to be able to properly simulate its behaviour. The ideal
solution to this situation is to modify the scheduling model, but that’s not
always a viable strategy. Maybe other parts of the backend depend on that
instruction being modelled the way that it is. Or maybe the instruction is quite
complex and it’s difficult to fully capture its behaviour with tablegen. The
CustomBehaviour class (which I will refer to as CB frequently) is designed to
provide intuitive scaffolding for developers to implement the correct modelling
for these instructions.
More details are available in the original commit log message (f7a23ecece).
Differential Revision: https://reviews.llvm.org/D104149
Some instructions are not defined well enough within the target’s scheduling
model for llvm-mca to be able to properly simulate its behaviour. The ideal
solution to this situation is to modify the scheduling model, but that’s not
always a viable strategy. Maybe other parts of the backend depend on that
instruction being modelled the way that it is. Or maybe the instruction is quite
complex and it’s difficult to fully capture its behaviour with tablegen. The
CustomBehaviour class (which I will refer to as CB frequently) is designed to
provide intuitive scaffolding for developers to implement the correct modelling
for these instructions.
Implementation details:
llvm-mca does its best to extract relevant register, resource, and memory
information from every MCInst when lowering them to an mca::Instruction. It then
uses this information to detect dependencies and simulate stalls within the
pipeline. For some instructions, the information that gets captured within the
mca::Instruction is not enough for mca to simulate them properly. In these
cases, there are two main possibilities:
1. The instruction has a dependency that isn’t detected by mca.
2. mca is incorrectly enforcing a dependency that shouldn’t exist.
For the rest of this discussion, I will be focusing on (1), but I have put some
thought into (2) and I may revisit it in the future.
So we have an instruction that has dependencies that aren’t picked up by mca.
The basic idea for both pipelines in mca is that when an instruction wants to be
dispatched, we first check for register hazards and then we check for resource
hazards. This is where CB is injected. If no register or resource hazards have
been detected, we make a call to CustomBehaviour::checkCustomHazard() to give
the target specific CB the chance to detect and enforce any custom dependencies.
The return value for checkCustomHazaard() is an unsigned int representing the
(minimum) number of cycles that the instruction needs to stall for. It’s fine to
underestimate this value because when StallCycles gets down to 0, we’ll end up
checking for all the hazards again before the instruction is actually
dispatched. However, it’s important not to overestimate the value and the more
accurate your estimate is, the more efficient mca’s execution can be.
In general, for checkCustomHazard() to be able to detect these custom
dependencies, it needs information about the current instruction and also all of
the instructions that are still executing within the pipeline. The mca pipeline
uses mca::Instruction rather than MCInst and the current information encoded
within each mca::Instruction isn’t sufficient for my use cases. I had to add a
few extra attributes to the mca::Instruction class and have them get set by the
MCInst during instruction building. For example, the current mca::Instruction
doesn’t know its opcode, and it also doesn’t know anything about its immediate
operands (both of which I had to add to the class).
With information about the current instruction, a list of all currently
executing instructions, and some target specific objects (MCSubtargetInfo and
MCInstrInfo which the base CB class has references to), developers should be
able to detect and enforce most custom dependencies within checkCustomHazard. If
you need more information than is present in the mca::Instruction, feel free to
add attributes to that class and have them set during the lowering sequence from
MCInst.
Fortunately, in the in-order pipeline, it’s very convenient for us to pass these
arguments to checkCustomHazard. The hazard checking is taken care of within
InOrderIssueStage::canExecute(). This function takes a const InstRef as a
parameter (representing the instruction that currently wants to be dispatched)
and the InOrderIssueStage class maintains a SmallVector<InstRef, 4> which holds
all of the currently executing instructions. For the out-of-order pipeline, it’s
a bit trickier to get the list of executing instructions and this is why I have
held off on implementing it myself. This is the main topic I will bring up when
I eventually make a post to discuss and ask for feedback.
CB is a base class where targets implement their own derived classes. If a
target specific CB does not exist (or we pass in the -disable-cb flag), the base
class is used. This base class trivially returns 0 from its checkCustomHazard()
implementation (meaning that the current instruction needs to stall for 0 cycles
aka no hazard is detected). For this reason, targets or users who choose not to
use CB shouldn’t see any negative impacts to accuracy or performance (in
comparison to pre-patch llvm-mca).
Differential Revision: https://reviews.llvm.org/D104149
This is a follow-up for:
D98604 [MCA] Ensure that writes occur in-order
When instructions are aligned by the order of writes, they retire
in-order naturally. There is no need for an RCU, so it is disabled.
Differential Revision: https://reviews.llvm.org/D98628
This patch adds a pipeline to support in-order CPUs such as ARM
Cortex-A55.
In-order pipeline implements a simplified version of Dispatch,
Scheduler and Execute stages as a single stage. Entry and Retire
stages are common for both in-order and out-of-order pipelines.
Differential Revision: https://reviews.llvm.org/D94928
Summary:
As disscused in https://bugs.llvm.org/show_bug.cgi?id=43219,
i believe it may be somewhat useful to show //some// aggregates
over all the sea of statistics provided.
Example:
```
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
[0] [1] [2] [3]
0. 3 1.0 1.0 4.7 vmulps %xmm0, %xmm1, %xmm2
1. 3 2.7 0.0 2.3 vhaddps %xmm2, %xmm2, %xmm3
2. 3 6.0 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
3 3.2 0.3 2.3 <total>
```
I.e. we average the averages.
Reviewers: andreadb, mattd, RKSimon
Reviewed By: andreadb
Subscribers: gbedwell, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68714
llvm-svn: 374361
Flag -show-encoding enables the printing of instruction encodings as part of the
the instruction info view.
Example (with flags -mtriple=x86_64-- -mcpu=btver2):
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[7]: Encoding Size
[1] [2] [3] [4] [5] [6] [7] Encodings: Instructions:
1 2 1.00 4 c5 f0 59 d0 vmulps %xmm0, %xmm1, %xmm2
1 4 1.00 4 c5 eb 7c da vhaddps %xmm2, %xmm2, %xmm3
1 4 1.00 4 c5 e3 7c e3 vhaddps %xmm3, %xmm3, %xmm4
In this example, column Encoding Size is the size in bytes of the instruction
encoding. Column Encodings reports the actual instruction encodings as byte
sequences in hex (objdump style).
The computation of encodings is done by a utility class named mca::CodeEmitter.
In future, I plan to expose the CodeEmitter to the instruction builder, so that
information about instruction encoding sizes can be used by the simulator. That
would be a first step towards simulating the throughput from the decoders in the
hardware frontend.
Differential Revision: https://reviews.llvm.org/D65948
llvm-svn: 368432
This patch adds a new llvm-mca flag named -print-imm-hex.
By default, the instruction printer prints immediate operands as decimals. Flag
-print-imm-hex enables the instruction printer to print those operands in hex.
This patch also adds support for MASM binary and hex literal numbers (example
0FFh, 101b).
Added tests to verify the behavior of the new flag. Tests also verify that masm
numeric literal operands are now recognized.
Differential Revision: https://reviews.llvm.org/D65588
llvm-svn: 367671
Sphinx allows for definitions of command-line options using
`.. option <name>` and references to those options via `:option:<name>`.
However, it looks like there is no scoping of these options by default,
meaning that links can end up pointing to incorrect documents. See for
example the llvm-mca document, which contains references to -o that,
prior to this patch, pointed to a different document. What's worse is
that these links appear to be non-deterministic in which one is picked
(on my machine, some references end up pointing to opt, whereas on the
live docs, they point to llvm-dwarfdump, for example).
The fix is to add the .. program <name> tag. This essentially namespaces
the options (definitions and references) to the named program, ensuring
that the links are kept correct.
Reviwed by: andreadb
Differential Revision: https://reviews.llvm.org/D63873
llvm-svn: 364538
This patch fixes PR41523
https://bugs.llvm.org/show_bug.cgi?id=41523
Regions can now nest/overlap provided that they have different names.
Anonymous regions cannot overlap.
Region end markers must specify the region name. The only exception is for when
there is only one user-defined region; in that particular case, the region end
marker doesn't need to specify a name.
Incorrect region end markers are no longer ignored. Instead, the tool reports an
error and we exit with an error code.
Added test cases to verify the new diagnostic error messages.
Updated the llvm-mca docs to reflect this feature change.
Differential Revision: https://reviews.llvm.org/D61676
llvm-svn: 360351
It makes more sense to print out the number of micro opcodes that are issued
every cycle rather than the number of instructions issued per cycle.
This behavior is also consistent with the dispatch-stats: numbers from the two
views can now be easily compared.
llvm-svn: 357919
This patch adds a new flag named -bottleneck-analysis to print out information
about throughput bottlenecks.
MCA knows how to identify and classify dynamic dispatch stalls. However, it
doesn't know how to analyze and highlight kernel bottlenecks. The goal of this
patch is to teach MCA how to correlate increases in backend pressure to backend
stalls (and therefore, the loss of throughput).
From a Scheduler point of view, backend pressure is a function of the scheduler
buffer usage (i.e. how the number of uOps in the scheduler buffers changes over
time). Backend pressure increases (or decreases) when there is a mismatch
between the number of opcodes dispatched, and the number of opcodes issued in
the same cycle. Since buffer resources are limited, continuous increases in
backend pressure would eventually leads to dispatch stalls. So, there is a
strong correlation between dispatch stalls, and how backpressure changed over
time.
This patch teaches how to identify situations where backend pressure increases
due to:
- unavailable pipeline resources.
- data dependencies.
Data dependencies may delay execution of instructions and therefore increase the
time that uOps have to spend in the scheduler buffers. That often translates to
an increase in backend pressure which may eventually lead to a bottleneck.
Contention on pipeline resources may also delay execution of instructions, and
lead to a temporary increase in backend pressure.
Internally, the Scheduler classifies instructions based on whether register /
memory operands are available or not.
An instruction is marked as "ready to execute" only if data dependencies are
fully resolved.
Every cycle, the Scheduler attempts to execute all instructions that are ready
to execute. If an instruction cannot execute because of unavailable pipeline
resources, then the Scheduler internally updates a BusyResourceUnits mask with
the ID of each unavailable resource.
ExecuteStage is responsible for tracking changes in backend pressure. If backend
pressure increases during a cycle because of contention on pipeline resources,
then ExecuteStage sends a "backend pressure" event to the listeners.
That event would contain information about instructions delayed by resource
pressure, as well as the BusyResourceUnits mask.
Note that ExecuteStage also knows how to identify situations where backpressure
increased because of delays introduced by data dependencies.
The SummaryView observes "backend pressure" events and prints out a "bottleneck
report".
Example of bottleneck report:
```
Cycles with backend pressure increase [ 99.89% ]
Throughput Bottlenecks:
Resource Pressure [ 0.00% ]
Data Dependencies: [ 99.89% ]
- Register Dependencies [ 0.00% ]
- Memory Dependencies [ 99.89% ]
```
A bottleneck report is printed out only if increases in backend pressure
eventually caused backend stalls.
About the time complexity:
Time complexity is linear in the number of instructions in the
Scheduler::PendingSet.
The average slowdown tends to be in the range of ~5-6%.
For memory intensive kernels, the slowdown can be significant if flag
-noalias=false is specified. In the worst case scenario I have observed a
slowdown of ~30% when flag -noalias=false was specified.
We can definitely recover part of that slowdown if we optimize class LSUnit (by
doing extra bookkeeping to speedup queries). For now, this new analysis is
disabled by default, and it can be enabled via flag -bottleneck-analysis. Users
of MCA as a library can enable the generation of pressure events through the
constructor of ExecuteStage.
This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494
Differential Revision: https://reviews.llvm.org/D58728
llvm-svn: 355308
RetireControlUnitStatistics now reports extra information about the ROB and the
avg/maximum number of entries consumed over the entire simulation.
Example:
Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
0, 109 (17.9%)
1, 102 (16.7%)
2, 399 (65.4%)
Total ROB Entries: 64
Max Used ROB Entries: 35 ( 54.7% )
Average Used ROB Entries per cy: 32 ( 50.0% )
Documentation in llvm/docs/CommandGuide/llvmn-mca.rst has been updated to
reflect this change.
llvm-svn: 347493
This patch introduces the following changes to the DispatchStatistics view:
* DispatchStatistics now reports the number of dispatched opcodes instead of
the number of dispatched instructions.
* The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
stall cycles against the total simulated cycles.
This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.
The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
* There can be multiple dispatch events associated with a same instruction
* Each dispatch event now encapsulates the number of dispatched micro opcodes.
The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage knows already how to handle instructions declaring a number of
opcodes bigger that DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for instructions dispatched.
With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.
A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.
Tests (and docs) have been updated.
Differential Revision: https://reviews.llvm.org/D51430
llvm-svn: 341055
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.
Example:
```
Iterations: 100
Instructions: 300
Total Cycles: 414
Total uOps: 700
Dispatch Width: 4
uOps Per Cycle: 1.69
IPC: 0.72
Block RThroughput: 4.0
```
This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.
llvm-svn: 340946
Before this patch, the SchedulerStatistics only printed the maximum number of
buffer entries consumed in each scheduler's queue at a given point of the
simulation.
This patch restructures the reported table, and adds an extra field named
"Average number of used buffer entries" to it.
This patch also uses different colors to help identifying bottlenecks caused by
high scheduler's buffer pressure.
llvm-svn: 340746
This patch is a follow-up to r338702.
We don't need to use a map to model the wait/ready/issued sets. It is much more
efficient to use a vector instead.
This patch gives us an average 7.5% speedup (on top of the ~12% speedup obtained
after r338702).
llvm-svn: 338883
This patch replaces all the remaining occurrences of string "MCA" with
":program:`llvm-mca`". Somehow I missed those strings when I committed r338394.
This patch also improves section "Instruction Dispatch".
llvm-svn: 338881
Summary:
This patch mostly copies the existing Instruction Flow, and stage descriptions
from the mca README. I made a few text tweaks, but no semantic changes,
and made reference to the "default pipeline." I also removed the internals
references (e.g., reference to class names and header files). I did leave the
LSUnit name around, but only as an abbreviated word for the load-store unit.
Reviewers: andreadb, courbet, RKSimon, gbedwell, filcab
Reviewed By: andreadb
Subscribers: tschuett, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D49692
llvm-svn: 338319
Summary: The original text was lifted from the MCA README. I re-ran the dot-product example and updated the output seen in the docs. I also added a few paragraphs discussing the instruction issued and retired histograms, as well as discussing the register file stats.
Reviewers: andreadb, RKSimon, courbet, gbedwell, filcab
Reviewed By: andreadb
Subscribers: tschuett
Differential Revision: https://reviews.llvm.org/D49614
llvm-svn: 337648
For the most part, these changes were from the RFC. I made a few minor
word/structure changes, but nothing significant. I also regenerated the
example output, and adjusted the text accordingly.
Differential Revision: https://reviews.llvm.org/D49527
llvm-svn: 337496
We're going to work on this in a separate review focusing more on documenting
the View and probably removing some of the less-interesting/less-useful pieces.
This reverts r337219,337225
llvm-svn: 337295
This patch introduces a brief description of the components of MCA. The main
focus is on Views. This is a work in progress, and more descriptions will be
introduced later. I want to flesh-out the Views section more and provide a
detailed description of eventing in MCA. Eventually a brief code example of a
View should accompany the description.
Also, we should consider moving the MCA internals guide elsewhere at some point.
llvm-svn: 337219