[llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC
Summary:
The original text was lifted from the MCA README. I re-ran the dot-product
example and updated the output seen in the docs. I also added a few paragraphs
discussing the instruction issued and retired histograms, as well as
discussing the register file stats.

Reviewers: andreadb, RKSimon, courbet, gbedwell, filcab

Reviewed By: andreadb

Subscribers: tschuett

Differential Revision: https://reviews.llvm.org/D49614

llvm-svn: 337648

parent 2c5b18f70f
commit f2603c0767

@@ -305,9 +305,9 @@ spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly

@@ -427,3 +427,125 @@ instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.

Below is an example of ``-all-stats`` output generated by MCA for the
dot-product example discussed in the previous sections.

.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0


  Dispatch Logic - number of cycles where we saw N instructions dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)


  Schedulers - number of cycles where we saw N instructions issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)


  Scheduler's queue usage:
  JALU01,  0/20
  JFPU01,  18/18
  JLSAGU,  0/12


  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)


  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a group of two instructions because the scheduler's
queue is full.

Looking at the *Dispatch Logic* table, we see that the pipeline was only able
to dispatch two instructions in 51.5% of the cycles. The dispatch group was
limited to one instruction in 44.6% of the cycles, which corresponds to 272
cycles. The dispatch statistics are displayed by using either the
``-all-stats`` or the ``-dispatch-stats`` command option.

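The relationship between these counters can be checked by hand. The following
Python sketch is only an illustration: the numbers are copied from the example
output above, and the variable names are made up here (they are not part of
llvm-mca).

```python
# Values copied from the -all-stats example output above.
schedq_stall_cycles = 272  # "SCHEDQ - Scheduler full" counter

# Dispatch Logic histogram: instructions dispatched -> number of cycles.
dispatch_histogram = {0: 24, 1: 272, 2: 314}

# Every simulated cycle falls into exactly one histogram bucket.
total_cycles = sum(dispatch_histogram.values())

# Share of cycles per bucket, rounded to one decimal like the report.
dispatch_share = {
    n: round(100.0 * cycles / total_cycles, 1)
    for n, cycles in dispatch_histogram.items()
}

print(f"simulated cycles: {total_cycles}")
for n, pct in dispatch_share.items():
    print(f"{n} dispatched: {pct}% of cycles")

# In this run, the cycles limited to one instruction equal the SCHEDQ stalls.
print(dispatch_histogram[1] == schedq_stall_cycles)
```

In this particular run, the number of cycles with a single dispatched
instruction equals the SCHEDQ stall counter, which is consistent with a full
scheduler queue being the reason the dispatch group was cut short.
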
The next table, *Schedulers*, presents a histogram of the number of
instructions issued per cycle. In this case, of the 610 simulated cycles, a
single instruction was issued in 306 cycles (50.2%), and there were 7 cycles
where no instructions were issued.

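As a rough cross-check, the histogram can be folded back into a total
instruction count and an average issue rate. This is a sketch under the
assumption that each bucket counts the cycles in which exactly N instructions
were issued; the names below are illustrative only.

```python
# Schedulers histogram from the example output above:
# instructions issued in a cycle -> number of cycles.
issue_histogram = {0: 7, 1: 306, 2: 297}

# Total simulated cycles: each cycle belongs to exactly one bucket.
total_cycles = sum(issue_histogram.values())

# Total instructions issued: weight each bucket by its issue count.
total_issued = sum(n * cycles for n, cycles in issue_histogram.items())

# Average number of instructions issued per cycle.
average_issue_rate = total_issued / total_cycles

print(total_cycles, total_issued, round(average_issue_rate, 2))
```

The weighted sum comes to 900, which agrees with the 900 instructions
mentioned in the register file discussion.
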
The *Scheduler's queue usage* table shows the maximum number of buffer
entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.

The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.

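When reading longer reports, it can help to pick out saturated and idle
queues mechanically. The sketch below assumes the ``NAME, used/size`` line
format seen in the example output; ``parse_queue_usage`` is a hypothetical
helper written for this illustration, not an llvm-mca API.

```python
# Lines copied from the "Scheduler's queue usage" table above.
queue_usage_lines = ["JALU01, 0/20", "JFPU01, 18/18", "JLSAGU, 0/12"]


def parse_queue_usage(line):
    """Split a 'NAME, used/size' line into (name, used, size)."""
    name, usage = (field.strip() for field in line.split(","))
    used, size = (int(value) for value in usage.split("/"))
    return name, used, size


queues = [parse_queue_usage(line) for line in queue_usage_lines]

# Queues whose peak usage reached their full capacity.
saturated = [name for name, used, size in queues if used == size]
# Queues that this kernel never used.
unused = [name for name, used, size in queues if used == 0]

print("saturated:", saturated)
print("unused:", unused)
```
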
A full scheduler queue is either caused by data dependency chains or by a
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies. The scheduler
statistics are displayed by using either the ``-all-stats`` or the
``-scheduler-stats`` command option.

The next table, *Retire Control Unit*, presents a histogram of the number of
instructions retired per cycle. In this case, of the 610 simulated cycles, two
instructions were retired during the same cycle 399 times (65.4%), and there
were 109 cycles where no instructions were retired. The retire statistics are
displayed by using either the ``-all-stats`` or the ``-retire-stats`` command
option.

The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers
(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that 900
mappings were created for the 900 instructions processed. Since this
dot-product example used only floating point registers, the JFpuPRF was
responsible for creating all 900 mappings. However, we see that the pipeline
only used a maximum of 35 of the 72 available register slots at any given time.
We can conclude that the floating point PRF was the only register file used for
the example, and that it was never resource constrained. The register file
statistics are displayed by using either the ``-all-stats`` or the
``-register-file-stats`` command option.

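The "never resource constrained" observation can be expressed as a peak
utilization figure per register file. The following is a minimal sketch using
the numbers from the example output; the dictionary layout and key names are
assumptions made for this illustration.

```python
# Values copied from the "Register File statistics" output above.
register_files = {
    "JFpuPRF": {"physical_registers": 72, "max_mappings_used": 35},
    "JIntegerPRF": {"physical_registers": 64, "max_mappings_used": 0},
}

# Peak share of each register file that was in use at any one time.
peak_utilization = {
    name: round(100.0 * stats["max_mappings_used"] / stats["physical_registers"], 1)
    for name, stats in register_files.items()
}

for name, pct in peak_utilization.items():
    print(f"{name}: peak utilization {pct}%")
```

A peak well below 100% for JFpuPRF means the dispatch logic never had to wait
for a free floating point register in this run, which matches the RAT stall
counter of 0 in the first table.
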
In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.