[llvm-mca][docs] Add documentation for the statistic outputs from mca. NFC

Summary: The original text was lifted from the MCA README.  I re-ran the dot-product example and updated the output seen in the docs.  I also added a few paragraphs discussing the instruction issued and retired histograms, as well as discussing the register file stats.

Reviewers: andreadb, RKSimon, courbet, gbedwell, filcab

Reviewed By: andreadb

Subscribers: tschuett

Differential Revision: https://reviews.llvm.org/D49614

llvm-svn: 337648
Matt Davis 2018-07-21 18:32:47 +00:00
parent 2c5b18f70f
commit f2603c0767
1 changed files with 125 additions and 3 deletions

@@ -305,9 +305,9 @@ spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on Jaguar, vector floating-point multiply can only be
issued to pipeline JFPU1, while horizontal floating-point additions can only be
issued to pipeline JFPU0.
per iteration. Note that on AMD Jaguar, vector floating-point multiply can
only be issued to pipeline JFPU1, while horizontal floating-point additions can
only be issued to pipeline JFPU0.
The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
@@ -427,3 +427,125 @@ instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.
Below is an example of ``-all-stats`` output generated by MCA for the
dot-product example discussed in the previous sections.
.. code-block:: none

  Dynamic Dispatch Stall Cycles:
  RAT     - Register unavailable:                      0
  RCU     - Retire tokens unavailable:                 0
  SCHEDQ  - Scheduler full:                            272
  LQ      - Load queue full:                           0
  SQ      - Store queue full:                          0
  GROUP   - Static restrictions on the dispatch group: 0

  Dispatch Logic - number of cycles where we saw N instructions dispatched:
  [# dispatched], [# cycles]
   0,              24  (3.9%)
   1,              272  (44.6%)
   2,              314  (51.5%)

  Schedulers - number of cycles where we saw N instructions issued:
  [# issued], [# cycles]
   0,          7  (1.1%)
   1,          306  (50.2%)
   2,          297  (48.7%)

  Scheduler's queue usage:
  JALU01,  0/20
  JFPU01,  18/18
  JLSAGU,  0/12

  Retire Control Unit - number of cycles where we saw N instructions retired:
  [# retired], [# cycles]
   0,           109  (17.9%)
   1,           102  (16.7%)
   2,           399  (65.4%)

  Register File statistics:
  Total number of mappings created:    900
  Max number of mappings used:         35

  *  Register File #1 -- JFpuPRF:
     Number of physical registers:     72
     Total number of mappings created: 900
     Max number of mappings used:      35

  *  Register File #2 -- JIntegerPRF:
     Number of physical registers:     64
     Total number of mappings created: 0
     Max number of mappings used:      0

If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a group of two instructions because the scheduler's
queue is full.
Looking at the *Dispatch Logic* table, we see that the pipeline was only able
to dispatch two instructions 51.5% of the time. The dispatch group was limited
to one instruction in 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the command option
``-all-stats`` or ``-dispatch-stats``.
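
The percentages in the *Dispatch Logic* table can be rechecked by hand. The
sketch below is only an illustration: the numbers are copied from the example
output, while ``dispatch_histogram``, ``schedq_stall_cycles`` and
``cycle_shares`` are hypothetical names that are not part of llvm-mca.

```python
# Values copied from the example -all-stats output; names are illustrative.
# Key: instructions dispatched in one cycle. Value: number of such cycles.
dispatch_histogram = {0: 24, 1: 272, 2: 314}
schedq_stall_cycles = 272  # "SCHEDQ - Scheduler full" counter


def cycle_shares(histogram):
    """Return the total cycle count and the percentage of cycles per bucket."""
    total = sum(histogram.values())
    shares = {n: round(100.0 * cycles / total, 1) for n, cycles in histogram.items()}
    return total, shares


total_cycles, shares = cycle_shares(dispatch_histogram)
print(total_cycles, shares)

# In this example, the cycles limited to one dispatched instruction match
# the SCHEDQ stall counter.
print(dispatch_histogram[1] == schedq_stall_cycles)
```

The three buckets add up to the 610 simulated cycles, and the one-instruction
bucket equals the SCHEDQ counter, which is what the text above relies on.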
The next table, *Schedulers*, presents a histogram of the number of cycles in
which a given number of instructions was issued. In this case, of the 610
simulated cycles, a single instruction was issued in 306 cycles (50.2%), and
there were 7 cycles in which no instructions were issued.
The *Scheduler's queue usage* table shows the maximum number of buffer
entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:
* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.
The dot-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.
A full scheduler queue is caused either by data dependency chains or by
sub-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.
The scheduler statistics are displayed by
using the command option ``-all-stats`` or ``-scheduler-stats``.
The next table, *Retire Control Unit*, presents a histogram of the number of
cycles in which a given number of instructions was retired. In this case, of
the 610 simulated cycles, two instructions were retired in the same cycle 399
times (65.4%), and there were 109 cycles in which no instructions were
retired. The retire statistics are displayed by using the command option
``-all-stats`` or ``-retire-stats``.
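
As a consistency check, the three histograms should describe the same run:
each should cover the same number of cycles and account for the same number
of instructions. The sketch below assumes only the numbers printed in the
example output; ``histograms`` and ``stage_totals`` are hypothetical names
used for illustration.

```python
# Histograms copied from the example output: {instructions per cycle: cycles}.
histograms = {
    "dispatched": {0: 24, 1: 272, 2: 314},
    "issued": {0: 7, 1: 306, 2: 297},
    "retired": {0: 109, 1: 102, 2: 399},
}


def stage_totals(histogram):
    """Return (cycles covered, instructions accounted for) for one histogram."""
    cycles = sum(histogram.values())
    instructions = sum(n * count for n, count in histogram.items())
    return cycles, instructions


totals = {stage: stage_totals(h) for stage, h in histograms.items()}
for stage, (cycles, instructions) in totals.items():
    print(f"{stage}: {instructions} instructions over {cycles} cycles")

# Instructions per cycle implied by the retire histogram.
retired_cycles, retired_instructions = totals["retired"]
ipc = retired_instructions / retired_cycles
print(round(ipc, 2))
```

Every stage accounts for 900 instructions over 610 cycles, so the implied
throughput is roughly 1.48 instructions per cycle.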
The last table presented is *Register File statistics*. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating-point registers
(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
the 900 instructions processed, there were 900 mappings created. Since this
dot-product example utilized only floating point registers, the JFpuPRF was
responsible for creating the 900 mappings. However, we see that the pipeline
only used a maximum of 35 of 72 available register slots at any given time. We
can conclude that the floating point PRF was the only register file used for
the example, and that it was never resource constrained. The register file
statistics are displayed by using the command option ``-all-stats`` or
``-register-file-stats``.
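
The claim that the floating point PRF was never resource constrained can be
expressed as a peak utilization figure. This is a minimal sketch using the
values from the example output; ``register_files`` and ``peak_utilization``
are hypothetical names, not llvm-mca identifiers.

```python
# (max mappings used, number of physical registers), from the example output.
register_files = {
    "JFpuPRF": (35, 72),
    "JIntegerPRF": (0, 64),
}


def peak_utilization(max_used, physical_registers):
    """Return the peak share of physical registers in use, as a percentage."""
    return round(100.0 * max_used / physical_registers, 1)


utilization = {
    name: peak_utilization(used, size)
    for name, (used, size) in register_files.items()
}
print(utilization)
```

A peak below 100% means the register file never ran out of physical
registers during the simulation.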
In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.