Commit Graph

366 Commits

Author SHA1 Message Date
Andrea Di Biagio 083addf751 [llvm-mca] [llvm-mca] Improved error handling and error reporting from class InstrBuilder.
A new class named InstructionError has been added to Support.h in order to
improve the error reporting from class InstrBuilder.
The llvm-mca driver is responsible for handling InstructionError objects, and
printing them out to stderr.

The goal of this patch is to remove all the remaining error handling logic from
the library code.
In particular, this allows us to:
 - Simplify the logic in InstrBuilder by removing a needless dependency from
MCInstrPrinter.
 - Centralize all the error halding logic in a new function named 'runPipeline'
(see llvm-mca.cpp).

This is also a first step towards generalizing class InstrBuilder, so that in
future, we will be able to reuse its logic to also "lower" MachineInstr to
mca::Instruction objects.

Differential Revision: https://reviews.llvm.org/D53585

llvm-svn: 345129
2018-10-24 10:56:47 +00:00
Simon Pilgrim 7d27cfdcb2 [X86] Fix Skylake ReadAfterLd for PADDrm etc.
Missed in rL343868 as due to their custom InstrRW.

llvm-svn: 344600
2018-10-16 09:50:16 +00:00
Andrea Di Biagio 6eebbe0a97 [tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen.
This patch adds the ability to identify instructions that are "move elimination
candidates". It also allows scheduling models to describe processor register
files that allow move elimination.

A move elimination candidate is an instruction that can be eliminated at
register renaming stage.
Each subtarget can specify which instructions are move elimination candidates
with the help of tablegen class "IsOptimizableRegisterMove" (see
llvm/Target/TargetInstrPredicate.td).

For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated.
The definition of 'IsOptimizableRegisterMove' for BtVer2 looks like this:

```
def : IsOptimizableRegisterMove<[
  InstructionEquivalenceClass<[
    // GPR variants.
    MOV32rr, MOV64rr,

    // MMX variants.
    MMX_MOVQ64rr,

    // SSE variants.
    MOVAPSrr, MOVUPSrr,
    MOVAPDrr, MOVUPDrr,
    MOVDQArr, MOVDQUrr,

    // AVX variants.
    VMOVAPSrr, VMOVUPSrr,
    VMOVAPDrr, VMOVUPDrr,
    VMOVDQArr, VMOVDQUrr
  ], CheckNot<CheckSameRegOperand<0, 1>> >
]>;
```

Definitions of IsOptimizableRegisterMove from processor models of a same
Target are processed by the SubtargetEmitter to auto-generate a target-specific
override for each of the following predicate methods:

```
bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI)
const;
bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned
CPUID) const;
```

By default, those methods return false (i.e. conservatively assume that there
are no move elimination candidates).

Tablegen class RegisterFile has been extended with the following information:
 - The set of register classes that allow move elimination.
 - Maxium number of moves that can be eliminated every cycle.
 - Whether move elimination is restricted to moves from registers that are
   known to be zero.

This patch is structured in three part:

A first part (which is mostly boilerplate) adds the new
'isOptimizableRegisterMove' target hooks, and extends existing register file
descriptors in MC by introducing new fields to describe properties related to
move elimination.

A second part, uses the new tablegen constructs to describe move elimination in
the BtVer2 scheduling model.

A third part, teaches llm-mca how to query the new 'isOptimizableRegisterMove'
hook to mark instructions that are candidates for move elimination. It also
teaches class RegisterFile how to describe constraints on move elimination at
PRF granularity.

llvm-mca tests for btver2 show differences before/after this patch.

Differential Revision: https://reviews.llvm.org/D53134

llvm-svn: 344334
2018-10-12 11:23:04 +00:00
Andrea Di Biagio 6a0b319549 [llvm-mca][BtVer2] Add tests for optimizable GPR register moves. NFC
llvm-svn: 344253
2018-10-11 14:54:54 +00:00
Andrea Di Biagio 1b29ec6531 [llvm-mca][BtVer2] Add two more move-elimination tests. NFC
These should test all the optimizable moves on Jaguar.
A follow-up patch will teach how to recognize these optimizable register moves.

llvm-svn: 344144
2018-10-10 14:46:54 +00:00
Simon Pilgrim f09fc3bc12 [X86] Move ReadAfterLd functionality into X86FoldableSchedWrite (PR36957)
Currently we hardcode instructions with ReadAfterLd if the register operands don't need to be available until the folded load has completed. This doesn't take into account the different load latencies of different memory operands (PR36957).

This patch adds a ReadAfterFold def into X86FoldableSchedWrite to replace ReadAfterLd, allowing us to specify the load latency at a scheduler class level.

I've added ReadAfterVec*Ld classes that match the XMM/Scl, XMM and YMM/ZMM WriteVecLoad classes that we currently use, we can tweak these values in future patches once this infrastructure is in place.

Differential Revision: https://reviews.llvm.org/D52886

llvm-svn: 343868
2018-10-05 17:57:29 +00:00
Simon Pilgrim 6ad03ad34b [llvm-mca][x86] Add PR36951 ReadAfterLd test case
llvm-svn: 343795
2018-10-04 16:26:56 +00:00
Greg Bedwell dee7bfdb9f [utils] Ensure that update_mca_test_checks.py writes prefixes in alphabetical order
llvm-svn: 343783
2018-10-04 14:42:19 +00:00
Simon Pilgrim 82a3b1c687 [llvm-mca][x86] Add tests demonstrating ReadAfterLd delay
llvm-svn: 343773
2018-10-04 13:05:42 +00:00
Simon Pilgrim 0b451a2983 [X86][Btver2] Fix MMX PSHUFB schedule
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343701
2018-10-03 18:18:50 +00:00
Andrea Di Biagio 207e0217f9 [llvm-mca] Add support for move elimination in class RegisterFile.
This patch teaches class RegisterFile how to analyze register writes from
instructions that are move elimination candidates.
In particular, it teaches it how to check if a move can be effectively eliminated
by the underlying PRF, and (if necessary) how to perform move elimination.

The long term goal is to allow processor models to describe instructions that
are valid move elimination candidates.
The idea is to let register file definitions in tablegen declare if/when moves
can be eliminated.

This patch is a non functional change.
The logic that performs move elimination is currently disabled.  A future patch
will add support for move elimination in the processor models, and enable this
new code path.

llvm-svn: 343691
2018-10-03 15:02:44 +00:00
Simon Pilgrim c68cc4efbe [X86][Btver2] Most RMW instructions don't require an additional uop
Remove uop on WriteRMW and move it into the few instructions that need it.

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343671
2018-10-03 10:28:43 +00:00
Simon Pilgrim d11015861c [X86] ALU/ADC RMW instructions should use the WriteRMW sequence class
I was expecting this to be a nfc but Silvermont seems to be setup a little differently:

// A folded store needs a cycle on MEC_RSV for the store data, but it does not need an extra port cycle to recompute the address.
def : WriteRes<WriteRMW, [SLM_MEC_RSV]>;

So moving from WriteStore to WriteRMW reduces predicted port pressure, confirmed by @craig.topper that this is correct.

Differential Revision: https://reviews.llvm.org/D52740

llvm-svn: 343670
2018-10-03 10:01:13 +00:00
Simon Pilgrim 860cb5c071 [X86][Btver2] Fix BLENDV and AESDEC schedules
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343597
2018-10-02 15:13:18 +00:00
Simon Pilgrim e0d2019052 [X86][Btver2] Fix BT(C|R|S)mr & BT(C|R|S)mi schedule latency + uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343494
2018-10-01 16:31:30 +00:00
Simon Pilgrim 6ddc4e821c [X86][Btver2] Fix BTmr schedule uop counts
Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343484
2018-10-01 14:42:16 +00:00
Simon Pilgrim a982236e59 [X86][Btver2] Fix masked load schedule
JFPU01 resource usage should match JFPX

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343468
2018-10-01 13:12:05 +00:00
Andrea Di Biagio 24ea163007 [X86][BtVer2] Teach how to identify zero-idiom VPERM2F128rr instructions.
This patch adds another variant class to identify zero-idiom VPERM2F128rr
instructions.

On Jaguar, a VPERM wih bit 3 and 7 of the mask set, is a zero-idiom.

Differential Revision: https://reviews.llvm.org/D52663

llvm-svn: 343452
2018-10-01 10:35:13 +00:00
Clement Courbet a933fb237e [X86][Sched] Update scheduling information for VZEROALL on HWS, BDW, SKX, SNB.
Summary:
    While looking at PR35606, I found out that the scheduling info is incorrect.

    One can check that it's really a P5+P6 and not a 2*P56 with:
    echo -e 'vzeroall\nvandps %xmm1, %xmm2, %xmm3' | ./bin/llvm-exegesis -mode=uops -snippets-file=-
    (vandps executes on P5 only)

    Reviewers: craig.topper, RKSimon

    Subscribers: llvm-commits

    Differential Revision: https://reviews.llvm.org/D52541

llvm-svn: 343447
2018-10-01 08:37:48 +00:00
Simon Pilgrim f21083870d [X86] Fix scheduler class for BTmi instructions
This wasn't treated as a folded load instruction

llvm-svn: 343424
2018-09-30 20:19:16 +00:00
Simon Pilgrim b1108399bd [LLVM-MCA][X86] Add missing VCMPESTR/VCMPESTR tests
llvm-svn: 343421
2018-09-30 18:19:00 +00:00
Simon Pilgrim 20623f2343 [LLVM-MCA][X86] Add some AVX512 tests
These are going to be necessary to check I don't mess up when I start cleaning up all the remaining vector integer overrides

llvm-svn: 343414
2018-09-30 17:01:59 +00:00
Simon Pilgrim 4f5693ac8d [X86][Btver2] Fix PCmpIStrI/PCmpIStrM schedules
Missing JFPU0 pipe and double JFPU1 pipe (to match JVALU1) resources

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343413
2018-09-30 16:38:38 +00:00
Andrea Di Biagio 6e218d0a57 [llvm-mca] Add a test for zero-idiom VPERM2F128rr. NFC
We don't correctly model the latency and resource usage information for
zero-idiom VPERM2F128rr on Jaguar.

This is demonstrated by the incorrect numbers in the resource pressure view, and
the timeline view.
A follow up patch will fix this problem.

llvm-svn: 343346
2018-09-28 17:47:09 +00:00
Greg Bedwell becbbe0383 [utils] Stricter checking from update_mca_test_checks.py
If any prefixes have been specified on the RUN lines that do not end up
ever actually getting printed, raise an Error. This is either an
indication that the run lines just need cleaning up, or that something
is more fundamentally wrong with the test.

Also raise an Error if there are any blocks which cannot be checked
because they are not uniquely covered by a prefix.

Fixed up a couple of tests where the extra checking flagged up issues.

Differential Revision: https://reviews.llvm.org/D48276

llvm-svn: 343332
2018-09-28 15:39:09 +00:00
Greg Bedwell 2f528f8c1e [utils] Allow better identification of matching blocks in update_mca_test_checks.py
Insert empty blocks to cause the positions of matching blocks to match
across lists where possible so that later stages of the algorithm can
actually identify them as being identical.

Regenerated all tests with this change.

Differential Revision: https://reviews.llvm.org/D52560

llvm-svn: 343331
2018-09-28 15:38:56 +00:00
Simon Pilgrim 428c1196d8 [X86][Btver2] PSUBS/PSUBUS instructions are zero-idioms
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB

llvm-svn: 343321
2018-09-28 14:20:42 +00:00
Simon Pilgrim 3216fd3602 [X86][Btver2] Add zero-idiom tests for PSUBS/PSUBUS instructions
Noticed during llvm-exegesis tests, the PSUBS/PSUBUS instructions have the same zero-idiom behaviour to PSUB

llvm-svn: 343319
2018-09-28 13:53:11 +00:00
Simon Pilgrim 66da1ed29d [X86][Btver2] CVTSS2I/CVTSD2I - add missing JFPU0 pipe
We issue JFPU1->JSTC then JFPU0->JFPA then -> JALU0 (integer pipe)

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343314
2018-09-28 13:19:22 +00:00
Simon Pilgrim 17e5981ebf [X86][Btver2] Fix BSF/BSR schedule
Double throughput to account for 2 pipes + fix BSF's latency/uop counts

Match AMD Fam16h SOG + llvm-exegesis tests

llvm-svn: 343311
2018-09-28 10:26:48 +00:00
Simon Pilgrim 280af1c7f0 [X86][BtVer2] Fix PHMINPOS schedule resources typo
PHMINPOS can run on either JFPU pipe

llvm-svn: 343299
2018-09-28 08:21:39 +00:00
Simon Pilgrim 86c7b07ecd [X86][Btver2] (V)MPSADBW instructions take 3uops not 1
llvm-svn: 343238
2018-09-27 17:13:57 +00:00
Simon Pilgrim dd744f158a [X86][Btver2] BTC/BTR/BTS instructions take 2uops not 1
llvm-svn: 343234
2018-09-27 16:39:52 +00:00
Simon Pilgrim c2a88ea64e [X86][Btver2] BLSI/BLSMSK/BLSR instructions take 2uops not 1 (same as TZCNT)
llvm-svn: 343227
2018-09-27 14:57:57 +00:00
Simon Pilgrim 98f503a326 [X86][Btver2] TZCNT instructions take 2uops not 1
llvm-svn: 343200
2018-09-27 12:28:47 +00:00
Simon Pilgrim b56be79e0c Revert rL342916: [X86] Remove shift/rotate by CL memory (RMW) overrides
As suggested by Craig Topper - I'm going to look at cleaning up the RMW sequences instead.

The uops are slightly different to the register variant, so requires a +1uop tweak

llvm-svn: 342969
2018-09-25 13:01:26 +00:00
Simon Pilgrim 0b4ad7596f [X86] Remove shift/rotate by CL memory (RMW) overrides
The uops are slightly different to the register variant, so requires a +1uop tweak

llvm-svn: 342916
2018-09-24 20:11:50 +00:00
Simon Pilgrim 00865a48d1 [X86] Split WriteIMul into 8/16/32/64 implementations (PR36931)
Split WriteIMul by size and also by IMUL multiply-by-imm and multiply-by-reg cases.

This removes all the scheduler overrides for gpr multiplies and stops WriteMULH being ignored for BMI2 MULX instructions.

llvm-svn: 342892
2018-09-24 15:21:57 +00:00
Simon Pilgrim 9202c9fb47 [X86] ROR*mCL instruction models should match ROL*mCL etc.
Confirmed with Craig Topper - fix a typo that was missing a Port4 uop for ROR*mCL instructions on some Intel models.

Yet another step on the scheduler model cleanup marathon......

llvm-svn: 342846
2018-09-23 19:16:01 +00:00
Simon Pilgrim 19952add7c [X86] Added missing RCL/RCR schedule overrides to the generic SNB model
The SandyBridge model was missing schedule values for the RCL/RCR values - instead using the (incredibly optimistic) WriteShift (now WriteRotate) defaults.

I've added overrides with more realistic (slow) values, based on a mixture of Agner/instlatx64 numbers and what later Intel models do as well.

This is necessary to allow WriteRotate to be updated to remove other rotate overrides.

It'd probably be a good idea to investigate a WriteRotateCarry class at some point but its not high priority given the unusualness of these instructions.

llvm-svn: 342842
2018-09-23 17:40:24 +00:00
Clement Courbet 8171bd8e0f [X86][Sched] Add zero idiom sched data to the SNB model.
Summary:
On SNB, renamer-based zeroing does not work for:
 - 16 and 8-bit GPRs[1].
 - MMX [2].
 - ANDN variants [3]

[1] echo 'sub %ax, %ax' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[2] echo 'pxor %mm0, %mm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-
[3] echo 'andnps %xmm0, %xmm0' | /tmp/llvm-exegesis -mode=uops -snippets-file=-

Reviewers: RKSimon, andreadb

Subscribers: gbedwell, craig.topper, llvm-commits

Differential Revision: https://reviews.llvm.org/D52358

llvm-svn: 342736
2018-09-21 14:07:20 +00:00
Andrea Di Biagio 4cd5cf9fc8 [X86][BtVer2] Fix latency and resource cycles of AVX 256-bit zero-idioms.
This patch introduces a SchedWriteVariant to describe zero-idiom VXORP(S|D)Yrr
and VANDNP(S|D)Yrr.

This is a follow-up of r342555.

On Jaguar, a VXORPSYrr is 2 macro opcodes. Only one opcode is eliminated at
register-renaming stage. The other opcode has to be executed to set the upper
half of the destination YMM.
Same for VANDNP(S|D)Yrr.

Differential Revision: https://reviews.llvm.org/D52347

llvm-svn: 342728
2018-09-21 12:43:07 +00:00
Andrea Di Biagio 0aea310391 [llvm-mca][BtVer2] Modify ANDN tests in zero-idioms-avx-256.s. NFC
Two test cases should have tested 256-bit variants of VANDN zero-idioms instead
of the 128-bit variants.

llvm-svn: 342655
2018-09-20 15:48:23 +00:00
Andrea Di Biagio 8b6c314be1 [TableGen][SubtargetEmitter] Add the ability for processor models to describe dependency breaking instructions.
This patch adds the ability for processor models to describe dependency breaking
instructions.

Different processors may specify a different set of dependency-breaking
instructions.
That means, we cannot assume that all processors of the same target would use
the same rules to classify dependency breaking instructions.

The main goal of this patch is to provide the means to describe dependency
breaking instructions directly via tablegen, and have the following
TargetSubtargetInfo hooks redefined in overrides by tabegen'd
XXXGenSubtargetInfo classes (here, XXX is a Target name).

```
virtual bool isZeroIdiom(const MachineInstr *MI, APInt &Mask) const {
  return false;
}

virtual bool isDependencyBreaking(const MachineInstr *MI, APInt &Mask) const {
  return isZeroIdiom(MI);
}
```

An instruction MI is a dependency-breaking instruction if a call to method
isDependencyBreaking(MI) on the STI (TargetSubtargetInfo object) evaluates to
true. Similarly, an instruction MI is a special case of zero-idiom dependency
breaking instruction if a call to STI.isZeroIdiom(MI) returns true.
The extra APInt is used for those targets that may want to select which machine
operands have their dependency broken (see comments in code).
Note that by default, subtargets don't know about the existence of
dependency-breaking. In the absence of external information, those method calls
would always return false.

A new tablegen class named STIPredicate has been added by this patch to let
processor models classify instructions that have properties in common. The idea
is that, a MCInstrPredicate definition can be used to "generate" an instruction
equivalence class, with the idea that instructions of a same class all have a
property in common.

STIPredicate definitions are essentially a collection of instruction equivalence
classes.
Also, different processor models can specify a different variant of the same
STIPredicate with different rules (i.e. predicates) to classify instructions.
Tablegen backends (in this particular case, the SubtargetEmitter) will be able
to process STIPredicate definitions, and automatically generate functions in
XXXGenSubtargetInfo.

This patch introduces two special kind of STIPredicate classes named
IsZeroIdiomFunction and IsDepBreakingFunction in tablegen. It also adds a
definition for those in the BtVer2 scheduling model only.

This patch supersedes the one committed at r338372 (phabricator review: D49310).

The main advantages are:
 - We can describe subtarget predicates via tablegen using STIPredicates.
 - We can describe zero-idioms / dep-breaking instructions directly via
   tablegen in the scheduling models.

In future, the STIPredicates framework can be used for solving other problems.
Examples of future developments are:
 - Teach how to identify optimizable register-register moves
 - Teach how to identify slow LEA instructions (each subtarget defining its own
   concept of "slow" LEA).
 - Teach how to identify instructions that have undocumented false dependencies
   on the output registers on some processors only.

It is also (in my opinion) an elegant way to expose knowledge to both external
tools like llvm-mca, and codegen passes.
For example, machine schedulers in LLVM could reuse that information when
internally constructing the data dependency graph for a code region.

This new design feature is also an "opt-in" feature. Processor models don't have
to use the new STIPredicates. It has all been designed to be as unintrusive as
possible.

Differential Revision: https://reviews.llvm.org/D52174

llvm-svn: 342555
2018-09-19 15:57:45 +00:00
Simon Pilgrim 1c1335a10d [X86][BMI1] Fix BLSI/BLSMSK/BLSR BMI1 scheduling on btver2
These have the same behaviour as tzcnt on btver2 - confirmed with AMD 16h SOG, Agner and instlatx64.

llvm-svn: 342235
2018-09-14 13:31:14 +00:00
Andrea Di Biagio fb3d9e1449 [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s.
A ReadAdvance was incorrectly added to the SchedReadWrite list associated with
the following SSE instructions:

sqrtss
sqrtsd
rsqrtss
rcpss

As a consequence, a wrong operand latency was computed for the register operand
used as the base address of the folded load operand.

This patch removes the wrong ReadAdvance, and updates the llvm-mca test cases.
There is still a problem with correctly modeling partial register writes on XMM
registers This other problem is currently tracked here:
https://bugs.llvm.org/show_bug.cgi?id=38813

Differential Revision: https://reviews.llvm.org/D51542

llvm-svn: 341326
2018-09-03 16:47:34 +00:00
Andrea Di Biagio a59ec4efa0 [X86][BtVer2] Remove wrong ReadAdvance from AVX vbroadcast(ss|sd|f128) instructions.
The presence of a ReadAdvance for input operand #0 is problematic
because it changes the input latency of the register used as the base address
for the folded load.

A broadcast cannot start executing if the load address hasn't been computed yet.

In the llvm-mca example, the VBROADCASTSS is dependent on the address generated
by the LEAQ.  That means, it cannot start until LEAQ reaches the write-back
stage. If we apply ReadAdvance, then we wrongly assume that the load can start 3
cycles in advance.

Differential Revision: https://reviews.llvm.org/D51534

llvm-svn: 341222
2018-08-31 16:05:48 +00:00
Andrea Di Biagio 69da3f3df6 [X86] Add llvm-mca tests that show how operand latency is wrongly computed for SSE sqrtss/sd and rcpss.
According to the timeline view, sqrtss/sd/rcpss start executing before the load
address for the memory operand is available.
This problem is caused by the presence of a ReadAfterLd (a ReadAdvance). Those
unary operations should not specify a ReadAdvance at all.

llvm-svn: 341213
2018-08-31 14:12:13 +00:00
Andrea Di Biagio 0e21ca1278 [X86][BtVer2] Add an llvm-mca test that shows how the read latency of AVX broadcastss on ymm registers is incorrectly set.
llvm-svn: 341197
2018-08-31 10:39:33 +00:00
Andrea Di Biagio b998eae2f2 [X86][BtVer2] Fix WriteFShuffle256 schedule write info.
This patch fixes the number of micro opcodes, and processor resource cycles for
the following AVX instructions:

vinsertf128rr/rm
vperm2f128rr/rm
vbroadcastf128

Tests have been regenerated using the usual scripts in the llvm/utils directory.

Differential Revision: https://reviews.llvm.org/D51492

llvm-svn: 341185
2018-08-31 08:30:47 +00:00
Andrea Di Biagio 8b647dcf4b [llvm-mca] Report the number of dispatched micro opcodes in the DispatchStatistics view.
This patch introduces the following changes to the DispatchStatistics view:
 * DispatchStatistics now reports the number of dispatched opcodes instead of
   the number of dispatched instructions.
 * The "Dynamic Dispatch Stall Cycles" table now also reports the percentage of
   stall cycles against the total simulated cycles.

This change allows users to easily compare dispatch group sizes with the
processor DispatchWidth.
Before this change, it was difficult to correlate the two numbers, since
DispatchStatistics view reported numbers of instructions (instead of opcodes).
DispatchWidth defines the maximum size of a dispatch group in terms of number of
micro opcodes.

The other change introduced by this patch is related to how DispatchStage
generates "instruction dispatch" events.
In particular:
 * There can be multiple dispatch events associated with a same instruction
 * Each dispatch event now encapsulates the number of dispatched micro opcodes.

The number of micro opcodes declared by an instruction may exceed the processor
DispatchWidth. Therefore, we cannot assume that instructions are always fully
dispatched in a single cycle.
DispatchStage knows already how to handle instructions declaring a number of
opcodes bigger that DispatchWidth. However, DispatchStage always emitted a
single instruction dispatch event (during the first simulated dispatch cycle)
for instructions dispatched.

With this patch, DispatchStage now correctly notifies multiple dispatch events
for instructions that cannot be dispatched in a single cycle.

A few views had to be modified. Views can no longer assume that there can only
be one dispatch event per instruction.

Tests (and docs) have been updated.

Differential Revision: https://reviews.llvm.org/D51430

llvm-svn: 341055
2018-08-30 10:50:20 +00:00
Andrew V. Tischenko 62f7a3207b [X86] Improved sched model for X86 CMPXCHG* instructions.
Differential Revision: https://reviews.llvm.org/D50070 

llvm-svn: 341024
2018-08-30 06:26:00 +00:00
Andrea Di Biagio a2eee47450 [llvm-mca] Add fields "Total uOps" and "uOps Per Cycle" to the report generated by the SummaryView.
This patch adds two new fields to the perf report generated by the SummaryView.
Fields are now logically organized into two small groups; only the second group
contains throughput indicators.

Example:
```
Iterations:        100
Instructions:      300
Total Cycles:      414
Total uOps:        700

Dispatch Width:    4
uOps Per Cycle:    1.69
IPC:               0.72
Block RThroughput: 4.0
```

This patch also updates the docs for llvm-mca.
Due to the nature of this change, several tests in the tools/llvm-mca directory
were affected, and had to be updated using script `update_mca_test_checks.py`.

llvm-svn: 340946
2018-08-29 17:56:39 +00:00
Andrea Di Biagio 5221e17fd6 [llvm-mca] Don't disable the SummaryView if flag `-all-stats` is false.
llvm-svn: 340945
2018-08-29 17:40:04 +00:00
Andrea Di Biagio d17d371c40 [llvm-mca][TimelineView] Force the same number of executions for every entry in the 'wait-times' table.
This patch also uses colors to highlight problematic wait-time entries.
A problematic entry is an entry with an high wait time that tends to match (or
exceed) the size of the scheduler's buffer.

Color RED is used if an instruction had to wait an average number of cycles
which is bigger than (or equal to) the size of the underlying scheduler's
buffer.
Color YELLOW is used if the time (in cycles) spend waiting for the
operands or pipeline resources is bigger than half the size of the underlying
scheduler's buffer.
Color MAGENTA is used if an instruction does not consume buffer resources
according to the scheduling model.

llvm-svn: 340825
2018-08-28 14:27:01 +00:00
Andrea Di Biagio b89b96c1b2 [llvm-mca] Improved report generated by the SchedulerStatistics view.
Before this patch, the SchedulerStatistics only printed the maximum number of
buffer entries consumed in each scheduler's queue at a given point of the
simulation.

This patch restructures the reported table, and adds an extra field named
"Average number of used buffer entries" to it.
This patch also uses different colors to help identifying bottlenecks caused by
high scheduler's buffer pressure.

llvm-svn: 340746
2018-08-27 14:52:52 +00:00
Andrea Di Biagio a03f2a77f8 [llvm-mca] Fix PR38575: Avoid an invalid implicit truncation of a processor resource mask (an uint64_t value) to unsigned.
This patch fixes a regression introduced at revision 338702.

A processor resource mask was incorrectly implicitly truncated to an unsigned
quantity. Later on, the truncated mask was used to initialize an element of a
vector of processor resource descriptors.
On targets with more than 32 processor resources, some elements of the vector
are left uninitialized. As a consequence, this bug might have eventually caused
a crash due to null dereference in the Scheduler.

This patch fixes PR38575, and adds a test for it.

llvm-svn: 339768
2018-08-15 12:53:38 +00:00
Andrew V. Tischenko 1fe3375620 [X86] MCA tests for XCHG*, XADD* and CMPXCHG* instructions
Differential Revision: https://reviews.llvm.org/D49912

llvm-svn: 339145
2018-08-07 14:36:43 +00:00
Simon Pilgrim b911d6721d [llvm-mca][x86] Add CMPXCHG instruction resource tests
I've put CMPXCHG8B/CMPXCHG16B in the same file, even though technically they are under separate CPUID bits all targets seem to support both (or neither).

llvm-svn: 338595
2018-08-01 17:25:11 +00:00
Simon Pilgrim 5c4fb14e07 [llvm-mca][x86] Add PREFETCHW instruction resource tests
These aren't just available via 3DNow! so test for them separately as well.

llvm-svn: 338584
2018-08-01 16:34:39 +00:00
Simon Pilgrim dcfa732b2f [llvm-mca][x86] Add PCLMUL instruction resource tests
Renamed the btver2 file that already contained them - the other targets were only testing the AVX versions

llvm-svn: 338583
2018-08-01 16:25:50 +00:00
Andrea Di Biagio 7f3bf5c1f9 [llvm-mca] Correctly update the rank in `Scheduler::select()`.
Found by inspection.

llvm-svn: 338579
2018-08-01 16:06:33 +00:00
Simon Pilgrim 34ac6533f4 [llvm-mca][x86] Add SET/TEST instruction resource tests
llvm-svn: 338576
2018-08-01 15:29:47 +00:00
Simon Pilgrim e364e57ac9 [llvm-mca][x86] Add LEA instruction resource tests
We already added these to btver2, now add them to other targets, even though none of their models treat them specially (yet).

llvm-svn: 338565
2018-08-01 14:25:33 +00:00
Simon Pilgrim 6754913e95 [llvm-mca][x86] Add more x86-64 system instruction resource tests
CPUID, IN/OUT, INS/OUTS, INT, PAUSE, SCAS, UD2, XLAT

llvm-svn: 338563
2018-08-01 14:18:09 +00:00
Simon Pilgrim 5f41ab79c0 [llvm-mca][x86] Add CLFLUSHOPT instruction resource tests
llvm-svn: 338550
2018-08-01 13:34:17 +00:00
Simon Pilgrim bd014f4d91 [llvm-mca][x86] Add CMPS/LODS/MOVS/STOS string instruction resource tests
llvm-svn: 338532
2018-08-01 13:14:45 +00:00
Simon Pilgrim 18d025a732 [llvm-mca][x86] Add STC + STD instruction resource tests
llvm-svn: 338514
2018-08-01 11:00:11 +00:00
Simon Pilgrim 1f4b9cb6fe [llvm-mca][x86] Add 32-bit instruction resource tests
These aren't exhaustive, but cover some instructions that are only available in 32-bit mode (where would we be without good BCD math performance?).

llvm-svn: 338404
2018-07-31 17:33:08 +00:00
Andrea Di Biagio a1852b6194 [llvm-mca][BtVer2] Teach how to identify dependency-breaking idioms.
This patch teaches llvm-mca how to identify dependency breaking instructions on
btver2.

An example of dependency breaking instructions is the zero-idiom XOR (example:
`XOR %eax, %eax`), which always generates zero regardless of the actual value of
the input register operands.
Dependency breaking instructions don't have to wait on their input register
operands before executing. This is because the computation is not dependent on
the inputs.

Not all dependency breaking idioms are also zero-latency instructions. For
example, `CMPEQ %xmm1, %xmm1` is independent on
the value of XMM1, and it generates a vector of all-ones.
That instruction is not eliminated at register renaming stage, and its opcode is
issued to a pipeline for execution. So, the latency is not zero. 

This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis
interface. That method takes as input an instruction (i.e. MCInst) and a
MCSubtargetInfo.
The default implementation of isDependencyBreaking() conservatively returns
false for all instructions. Targets may override the default behavior for
specific CPUs, and return a value which better matches the subtarget behavior.

In future, we should teach to Tablegen how to automatically generate the body of
isDependencyBreaking from scheduling predicate definitions. This would allow us
to expose the knowledge about dependency breaking instructions to the machine
schedulers (and, potentially, other codegen passes).

Differential Revision: https://reviews.llvm.org/D49310

llvm-svn: 338372
2018-07-31 13:21:43 +00:00
Roman Lebedev 52b85377eb [NFC][MCA] ZnVer1: Update RegisterFile to identify false dependencies on partially written registers.
Summary:
Pretty mechanical follow-up for D49196.

As microarchitecture.pdf notes, "20 AMD Ryzen pipeline",
"20.8 Register renaming and out-of-order schedulers":
  The integer register file has 168 physical registers of 64 bits each.
  The floating point register file has 160 registers of 128 bits each.
"20.14 Partial register access":
  The processor always keeps the different parts of an integer register together.
  ...
  An instruction that writes to part of a register will therefore have a false dependence
  on any previous write to the same register or any part of it.

Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh

Reviewed By: GGanesh

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D49393

llvm-svn: 337676
2018-07-23 10:10:13 +00:00
Roman Lebedev d57bd45acc [NFC][MCA] ZnVer1: add partial-reg-update tests
Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh

Reviewed By: GGanesh

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D49392

llvm-svn: 337675
2018-07-23 10:10:04 +00:00
Simon Pilgrim 5e729dcc03 [llvm-mca][x86] Add movsx/movzx instructions to general x86_64 resource tests
llvm-svn: 337586
2018-07-20 17:43:42 +00:00
Andrea Di Biagio b6022aa8d9 [X86][BtVer2] correctly model the latency/throughput of LEA instructions.
This patch fixes the latency/throughput of LEA instructions in the BtVer2
scheduling model.

On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of
1. That is because it uses one cycle of SAGU followed by 1cy of ALU1.  An LEA
with a "Scale" operand is also slow, and it has the same latency profile as the
3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e.
RThrouhgput of 2.0).

This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td.
The tablegen backend (for instruction-info) expands that definition into this
(file X86GenInstrInfo.inc):
```
static bool isThreeOperandsLEA(const MachineInstr &MI) {
  return (
    (
      MI.getOpcode() == X86::LEA32r
      || MI.getOpcode() == X86::LEA64r
      || MI.getOpcode() == X86::LEA64_32r
      || MI.getOpcode() == X86::LEA16r
    )
    && MI.getOperand(1).isReg()
    && MI.getOperand(1).getReg() != 0
    && MI.getOperand(3).isReg()
    && MI.getOperand(3).getReg() != 0
    && (
      (
        MI.getOperand(4).isImm()
        && MI.getOperand(4).getImm() != 0
      )
      || (MI.getOperand(4).isGlobal())
    )
  );
}
```

A similar method is generated in the X86_MC namespace, and included into
X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h).

Back to the BtVer2 scheduling model:
A new scheduling predicate named JSlowLEAPredicate now checks if either the
instruction is a three-operands LEA, or it is an LEA with a Scale value
different than 1.
A variant scheduling class uses that new predicate to correctly select the
appropriate latency profile.

Differential Revision: https://reviews.llvm.org/D49436

llvm-svn: 337469
2018-07-19 16:42:15 +00:00
Simon Pilgrim 03164dfa5e [llvm-mca][x86] Add extend, carry-flag and CMP instructions to general x86_64 resource tests
llvm-svn: 337306
2018-07-17 17:47:35 +00:00
Simon Pilgrim 92da01fed9 [llvm-mca][x86] Add MOVBE resource tests to all supporting targets
SNB doesn't support MOVBE but the numbers in Generic (which use the SNB model) look sane.

llvm-svn: 337305
2018-07-17 17:41:45 +00:00
Simon Pilgrim 94049e8b15 [llvm-mca][x86] Add BSWAP resource tests
llvm-svn: 337302
2018-07-17 17:10:47 +00:00
Simon Pilgrim 99a4f3195b [llvm-mca][x86] Add displacement-only and additional scale=1 LEA tests
llvm-svn: 337298
2018-07-17 16:17:33 +00:00
Simon Pilgrim 17d89ca70e [llvm-mca][x86] Add LEA resource tests (PR32326)
Add llvm-mca tests demonstrating how LEA instructions are currently modelled. Once this is working on btver2 I'll copy the test file to the other target directories.

llvm-svn: 337297
2018-07-17 16:13:29 +00:00
Andrea Di Biagio f84b0a6914 [llvm-mca] Regenerate X86 specific tests. NFC
Not all tests were correctly updated by the update script after r336797.

llvm-svn: 337124
2018-07-15 11:43:11 +00:00
Andrea Di Biagio ff630c2cdc [llvm-mca][BtVer2] teach how to identify false dependencies on partially written
registers.

The goal of this patch is to improve the throughput analysis in llvm-mca for the
case where instructions perform partial register writes.

On x86, partial register writes are quite difficult to model, mainly because
different processors tend to implement different register merging schemes in
hardware.

When the code contains partial register writes, the IPC (instructions per
cycles) estimated by llvm-mca tends to diverge quite significantly from the
observed IPC (using perf).

Modern AMD processors (at least, from Bulldozer onwards) don't rename partial
registers. Quoting Agner Fog's microarchitecture.pdf:
" The processor always keeps the different parts of an integer register together.
For example, AL and AH are not treated as independent by the out-of-order
execution mechanism. An instruction that writes to part of a register will
therefore have a false dependence on any previous write to the same register or
any part of it."

This patch is a first important step towards improving the analysis of partial
register updates. It changes the semantic of RegisterFile descriptors in
tablegen, and teaches llvm-mca how to identify false dependences in the presence
of partial register writes (for more details: see the new code comments in
include/Target/TargetSchedule.h - class RegisterFile).

This patch doesn't address the case where a write to a part of a register is
followed by a read from the whole register.  On Intel chips, high8 registers
(AH/BH/CH/DH)) can be stored in separate physical registers. However, a later
(dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which
adds extra latency (and potentially affects the pipe usage).
This is a very interesting article on the subject with a very informative answer
from Peter Cordes:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to

In future, the definition of RegisterFile can be extended with extra information
that may be used to identify delays caused by merge opcodes triggered by a dirty
read of a partial write.

Differential Revision: https://reviews.llvm.org/D49196

llvm-svn: 337123
2018-07-15 11:01:38 +00:00
Andrea Di Biagio e86e6efea1 [llvm-mca][BtVer2] Add tests for dependency breaking instructions.
llvm-svn: 337024
2018-07-13 16:46:51 +00:00
Andrea Di Biagio 483db141e3 [X86] Fix MayLoad/HasSideEffect flag for (V)MOVLPSrm instructions.
Before revision 336728, the "mayLoad" flag for instruction (V)MOVLPSrm was
inferred directly from the "default" pattern associated with the instruction
definition.

r336728 removed special node X86Movlps, and all the patterns associated to it.
Now instruction (V)MOVLPSrm doesn't have a pattern associated to it, and the
'mayLoad/hasSideEffects' flags are left unset.

When the instruction info is emitted by tablegen, method
CodeGenDAGPatterns::InferInstructionFlags() sees that (V)MOVLPSrm doesn't have a
pattern, and flags are undefined. So, it conservatively sets the
"hasSideEffects" flag for it.

As a consequence, we were losing the 'mayLoad' flag, and we were gaining a
'hasSideEffect' flag in its place.
This patch fixes the issue (originally reported by Michael Holmen).

The mca tests show the differences in the instruction info flags.  Instructions
that were affected by this problem were: MOVLPSrm/VMOVLPSrm/VMOVLPSZ128rm.

Differential Revision: https://reviews.llvm.org/D49182

llvm-svn: 336818
2018-07-11 15:27:50 +00:00
Andrea Di Biagio d2e2c053cf [llvm-mca] Use a different character to flag instructions with side-effects in the Instruction Info View. NFC
This makes easier to identify changes in the instruction info flags.  It also
helps spotting potential regressions similar to the one recently introduced at
r336728.

Using the same character to mark MayLoad/MayStore/HasSideEffects is problematic
for llvm-lit. When pattern matching substrings, llvm-lit consumes tabs and
spaces. A change in position of the flag marker may not trigger a test failure.

This patch only changes the character used for flag `hasSideEffects`. The reason
why I didn't touch other flags is because I want to avoid spamming the mailing
because of the massive diff due to the numerous tests affected by this change.

In future, each instruction flag should be associated with a different character
in the Instruction Info View.

llvm-svn: 336797
2018-07-11 12:44:44 +00:00
Andrea Di Biagio 2b3a4f9c9b [llvm-mca] Add tests for partial register writes.
llvm-mca doesn't know that on modern AMD processors, portions of a general
purpose register are not treated independently. So, a partial register write has
a false dependency on the super-register.

The issue with partial register writes will be addressed by a follow-up patch.

llvm-svn: 336778
2018-07-11 09:50:00 +00:00
Andrea Di Biagio 8834779644 [llvm-mca] report an error if the assembly sequence contains an unsupported instruction.
This is a short-term fix for PR38093.
For now, we llvm::report_fatal_error if the instruction builder finds an
unsupported instruction in the instruction stream.

We need to revisit this fix once we start addressing PR38101.
Essentially, we need a better framework for error handling.

llvm-svn: 336543
2018-07-09 12:30:55 +00:00
Roman Lebedev 0e58dee284 [MCA][X86][NFC] Add BSF/BSR resource tests
Reviewers: RKSimon, andreadb, courbet

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48997

llvm-svn: 336510
2018-07-08 09:50:14 +00:00
Andrea Di Biagio 61c52af9d9 [llvm-mca] improve the instruction issue logic implemented by the Scheduler.
This patch modifies the Scheduler heuristic used to select the next instruction
to issue to the pipelines.

The motivating example is test X86/BtVer2/add-sequence.s, for which llvm-mca
wrongly reported an estimated IPC of 1.50. According to perf, the actual IPC for
that test should have been ~2.00.
It turns out that an IPC of 2.00 for test add-sequence.s cannot possibly be
predicted by a Scheduler that only prioritizes instructions based on their
"age". A similar issue also affected test X86/BtVer2/dependent-pmuld-paddd.s,
for which llvm-mca wrongly estimated an IPC of 0.84 instead of an IPC of 1.00.

Instructions in the ReadyQueue are now ranked based on two factors:
 - The "age" of an instruction.
 - The number of unique users of writes associated with an instruction.

The new logic still prioritizes older instructions over younger instructions to
minimize the pressure on the reorder buffer. However, the number of users of an
instruction now also affects the overall rank. This potentially increases the
ability of the Scheduler to extract instruction level parallelism.  This patch
fixes the problem with the wrong IPC reported for test add-sequence.s and test
dependent-pmuld-paddd.s.

llvm-svn: 336420
2018-07-06 08:08:30 +00:00
Roman Lebedev 0dd27042c6 [X86][BtVer2][MCA][NFC] Add CMPEQ dependency-breaking one-idioms tests
Summary: As per `Agner's Microarchitecture doc
(21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions)`,
these, like zero-idioms, are dependency-breaking,
although they produce ones and still consume resources.

FIXME: as discussed in D48877, llvm-mca handling is broken for these.

Reviewers: andreadb

Reviewed By: andreadb

Subscribers: gbedwell, RKSimon, llvm-commits

Differential Revision: https://reviews.llvm.org/D48876

llvm-svn: 336292
2018-07-04 17:32:44 +00:00
Fangrui Song f50ad6c311 Replace unused output filenames with /dev/null in tests
Similar to rLLD336129

llvm-svn: 336131
2018-07-02 18:16:44 +00:00
Simon Pilgrim 83125594ed [llvm-mca][x86] Add FMA4 resource tests
We should be ensuring we have (near) complete test coverage of instructions, at least for the generic model.

llvm-svn: 335870
2018-06-28 16:24:13 +00:00
Simon Pilgrim 12f9503d40 [llvm-mca][x86] Add 3dnow! resource tests
We should be ensuring we have (near) complete test coverage of instructions, at least for the generic model.

llvm-svn: 335869
2018-06-28 16:21:22 +00:00
Andrea Di Biagio 2145b13fc9 [llvm-mca][X86] Teach how to identify register writes that implicitly clear the upper portion of a super-register.
This patch teaches llvm-mca how to identify register writes that implicitly zero
the upper portion of a super-register.

On X86-64, a general purpose register is implemented in hardware as a 64-bit
register. Quoting the Intel 64 Software Developer's Manual: "an update to the
lower 32 bits of a 64 bit integer register is architecturally defined to zero
extend the upper 32 bits".  Also, a write to an XMM register performed by an AVX
instruction implicitly zeroes the upper 128 bits of the aliasing YMM register.

This patch adds a new method named clearsSuperRegisters to the MCInstrAnalysis
interface to help identify instructions that implicitly clear the upper portion
of a super-register.  The rest of the patch teaches llvm-mca how to use that new
method to obtain the information, and update the register dependencies
accordingly.

I compared the kernels from tests clear-super-register-1.s and
clear-super-register-2.s against the output from perf on btver2.  Previously
there was a large discrepancy between the estimated IPC and the measured IPC.
Now the differences are mostly in the noise.

Differential Revision: https://reviews.llvm.org/D48225

llvm-svn: 335113
2018-06-20 10:08:11 +00:00
Roman Lebedev d23b6831de [X86][Znver1] Specify Register Files, RCU; FP scheduler capacity.
Summary:
First off: i do not have any access to that processor,
so this is purely theoretical, no benchmarks.

I have been looking into b**d**ver2 scheduling profile, and while cross-referencing
the existing b**t**ver2, znver1 profiles, and the reference docs
(`Software Optimization Guide for AMD Family {15,16,17}h Processors`),
i have noticed that only b**t**ver2 scheduling profile specifies these.

Also, there is no mca test coverage.

Reviewers: RKSimon, craig.topper, courbet, GGanesh, andreadb

Reviewed By: GGanesh

Subscribers: gbedwell, vprasad, ddibyend, shivaram, Ashutosh, javed.absar, llvm-commits

Differential Revision: https://reviews.llvm.org/D47676

llvm-svn: 335099
2018-06-20 07:01:14 +00:00
Clement Courbet e0aa30008f [X86] Fix r335097
Missed `Generic` test in llvm-mca.

llvm-svn: 335098
2018-06-20 06:44:13 +00:00
Clement Courbet 7b9913fb9f [X86] Add sched class WriteLAHFSAHF and fix values.
Summary:
I ran llvm-exegesis on SKX, SKL, BDW, HSW, SNB.
Atom is from Agner and SLM is a guess.
I've left AMD processors alone.

Reviewers: RKSimon, craig.topper

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D48079

llvm-svn: 335097
2018-06-20 06:13:39 +00:00
Roman Lebedev ae0527aac9 [MCA][NFC] Add generic XOP resource tests
Summary:
Based on
* [[ https://support.amd.com/TechDocs/43479.pdf | AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions ]],
* [[ https://support.amd.com/TechDocs/24594.pdf | AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions]],
* https://en.wikipedia.org/wiki/XOP_instruction_set

Appears to be only supported in AMD's 15h generation, so only in b**d**ver[1-4],
for which currently llvm has no scheduling profiles.

Reviewers: RKSimon, craig.topper, andreadb, spatel

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48264

llvm-svn: 335034
2018-06-19 09:21:27 +00:00
Roman Lebedev 0d12c1685b [MCA][NFC] Add generic TBM resource tests
Summary:
Based on https://support.amd.com/TechDocs/24594.pdf,
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_(Trailing_Bit_Manipulation)

Appears to be only supported in AMD's 15h generation, so only in b**d**ver[1-4],
for which currently llvm has no scheduling profiles.

Reviewers: RKSimon, craig.topper, simark, andreadb

Reviewed By: RKSimon

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48252

llvm-svn: 335033
2018-06-19 09:21:22 +00:00
Andrea Di Biagio a88281d8ae [llvm-mca] Use an ordered map to collect hardware statistics. NFC.
Histogram entries are now ordered by key.  This should improves their
readability when statistics are printed.

llvm-svn: 334961
2018-06-18 17:04:56 +00:00
Andrea Di Biagio 487da729a2 [llvm-mca] Add tests for XOP and AVX512 instructions that implicitly clear the upper portion of a super-register.
When the destination register of a XOP instruction is an XMM register, bits
[255:128] of the corresponding YMM register are cleared.

When the destination register of a EVEX encoded instruction is an XMM/YMM
register, the upper bits of the corresponding ZMM are cleared.
On processors that feature AVX512, a write to an XMM registers always clears the
upper portion of the corresponding ZMM register if the instruction is VEX or
EVEX encoded.

These new tests show some interesting cases which aren't correctly analyzed by
llvm-mca. The lack of knowledge related to the implicit update on the
super-registers is addressed by D48225.

llvm-svn: 334945
2018-06-18 14:00:30 +00:00
Clement Courbet 0d9da88d18 [X86] Fix NOOP sched overrides on BDW/HSW/SKL.
Summary: Noop certainly does not use resources.

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits, gchatelet

Differential Revision: https://reviews.llvm.org/D48028

llvm-svn: 334927
2018-06-18 06:48:22 +00:00
Simon Pilgrim e930f569f7 [llvm-mca][X86] Add some avx512f/avx512vl resource test placeholders
There are a lot of instructions to add under these ISAs (and the other AVX512 variants) but this should demonstrate how to test for the EVEX instructions with different maskings

llvm-svn: 334907
2018-06-17 16:25:48 +00:00
Simon Pilgrim f5ecd8d50d [llvm-mca][x86] Add Generic cpu resource tests
Added a Generic x86 cpu set of resource tests to allow us to check all ISAs.

We currently use SandyBridge as our generic CPU model, but it's better if we actually duplicate these tests for if/when we change the model, it also means we don't end up polluting the SandyBridge folder with tests for ISAs it doesn't support.

llvm-svn: 334853
2018-06-15 18:35:25 +00:00
Roman Lebedev 9ddf128f79 [MCA] Add -summary-view option
Summary:
While that is indeed a quite interesting summary stat,
there are cases where it does not really add anything
other than consuming extra lines.

Declutters the output of D48190.

Reviewers: RKSimon, andreadb, courbet, craig.topper

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48209

llvm-svn: 334833
2018-06-15 14:01:43 +00:00
Roman Lebedev 7c423001e4 [MCA][x86][NFC] Add tests for -register-file-stats, -scheduler-stats
Summary:
There does not seem to be any other tests for this.
Split off from D47676.

Reviewers: RKSimon, craig.topper, courbet, andreadb

Reviewed By: andreadb

Subscribers: javed.absar, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D48190

llvm-svn: 334832
2018-06-15 14:01:35 +00:00
Andrea Di Biagio 4cafb297d5 [llvm-mca] Add tests for instructions that implicitly clear the upper portion of a super-register.
On x86-64, a write to register EAX implicitly clears the upper half or RAX.
128-bit AVX instructions clear the upper 128-bit of the YMM register that
aliases the XMM definition register.

llvm-mca doesn't know about register writes that implicitly clear the upper
portion of an aliasing super-register. This issue will be fixed in a future patch.

llvm-svn: 334742
2018-06-14 17:48:42 +00:00
Andrea Di Biagio 4729d1ff27 [llvm-mca] Add another test for partial register stalls.
This test checks that a physical register is correctly allocated for the partial
write to register BX.
The ADD instruction has to wait for the write to RBX (and BX) before being
executed.

llvm-svn: 334730
2018-06-14 15:54:34 +00:00
Andrea Di Biagio 0ffb2271a1 [llvm-mca] Fixed a bug in the logic that checks if a memory operation is ready to execute.
Fixes PR37790.

In some (very rare) cases, the LSUnit (Load/Store unit) was wrongly marking a
load (or store) as "ready to execute" effectively bypassing older memory barrier
instructions.

To reproduce this bug, the memory barrier must be the first instruction in the
input assembly sequence, and it doesn't have to perform any register writes.

llvm-svn: 334633
2018-06-13 18:30:14 +00:00
Clement Courbet 7db69cc08a [X86] Fix skylake server scheduling info.
Summary:
This fixes most of the scheduling info for SKX vector operations.
I had to split a lot of the YMM/ZMM classes into separate classes for YMM and ZMM.

The before/after llvm-exegesis analysis are in the phabricator diff.

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D47721

llvm-svn: 334407
2018-06-11 14:37:53 +00:00
Simon Pilgrim 89deac6694 [X86][BtVer2] Add support for all SUB/XOR 32/64 scalar instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), these instructions are dependency breaking and fast-path zero the destination register (and appropriate EFLAGS bits).

llvm-svn: 334303
2018-06-08 17:00:45 +00:00
Simon Pilgrim efb4806bb9 [X86][BtVer2] Remove SBB tests that were accidentally added in rL334296
These aren't true zero-idiom instructions (just dependency breaking).

llvm-svn: 334297
2018-06-08 15:43:00 +00:00
Simon Pilgrim 53766a986d [X86][BtVer2] Add tests for scalar SUB/XOR instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions).

llvm-svn: 334296
2018-06-08 15:28:43 +00:00
Simon Pilgrim aafcf9e4a1 [X86][BtVer2] Limit zero idiom tests to a single iteration.
Reduces output size and we're only wanting to check that the instructions are fast-path'd (just Dispatch+Retire) anyhow

llvm-svn: 334292
2018-06-08 15:01:40 +00:00
Simon Pilgrim aef5bdbea1 [X86][BtVer2] Add support for all vector instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.

llvm-svn: 334119
2018-06-06 19:06:09 +00:00
Simon Pilgrim 7a48bb6e44 [llvm-mca][x86] Fix all resources-x86_64.s tests to use different registers in reg-reg cases
I noticed while working on zero-idiom + dependency-breaking support (PR36671) that most of our binary instruction tests were reusing the same src registers, which would cause the tests to fail once we enable scalar zero-idiom support on btver2. Fixed in all targets to keep them in sync.

llvm-svn: 334110
2018-06-06 18:20:25 +00:00
Simon Pilgrim 64541ff297 [X86][BtVer2] Add tests for all vector instructions that should match the dependency-breaking 'zero-idiom'
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), all these instructions are dependency breaking and zero the destination register.

TODO: Scalar instructions still need to be tested (need to check EFLAGS handling).
llvm-svn: 334104
2018-06-06 16:14:37 +00:00
Sanjay Patel 59313be8d3 [CodeGen] assume max/default throughput for unspecified instructions
This is a fix for the problem arising in D47374 (PR37678):
https://bugs.llvm.org/show_bug.cgi?id=37678

We may not have throughput info because it's not specified in the model 
or it's not available with variant scheduling, so assume that those
instructions can execute/complete at max-issue-width.

Differential Revision: https://reviews.llvm.org/D47723

llvm-svn: 334055
2018-06-05 23:34:45 +00:00
Andrea Di Biagio 757600bccb [llvm-mca] Correctly update the CyclesLeft of a register read in the presence of partial register updates.
This patch fixe the logic in ReadState::cycleEvent(). That method was not
correctly updating field `TotalCycles`.

Added extra code comments in class ReadState to better describe each field.

llvm-svn: 334028
2018-06-05 17:12:02 +00:00
Andrea Di Biagio 39e5a5695f [RFC][patch 3/3] Add support for variant scheduling classes in llvm-mca.
This patch is the last of a sequence of three patches related to LLVM-dev RFC
"MC support for variant scheduling classes".
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html

This fixes PR36672.

The main goal of this patch is to teach llvm-mca how to solve variant scheduling
classes.  This patch does that, plus it adds new variant scheduling classes to
the BtVer2 scheduling model to identify so-called zero-idioms (i.e. so-called
dependency breaking instructions that are known to generate zero, and that are
optimized out in hardware at register renaming stage).

Without the BtVer2 change, this patch would not have had any meaningful tests.
This patch is effectively the union of two changes:
 1) a change that teaches llvm-mca how to resolve variant scheduling classes.
 2) a change to the BtVer2 scheduling model that allows us to special-case
    packed XOR zero-idioms (this partially fixes PR36671).

Differential Revision: https://reviews.llvm.org/D47374 

llvm-svn: 333909
2018-06-04 15:43:09 +00:00
Greg Bedwell bbe64af0a0 [llvm-mca] Regenerate a test to remove a double newline
Command used: py update_mca_test_checks.py ..\test\tools\llvm-mca\*\*.s ..\test\tools\llvm-mca\*\*\*.s

llvm-svn: 333893
2018-06-04 12:30:03 +00:00
Roman Lebedev 7b53d1454f [llvm-mca] Make sure not to end the test files with an empty line.
Summary:
It's super irritating.

[properly configured] git client then complains about that double-newline,
and you have to use `--force` to ignore the warning, since even if you
fix it manually, it will be reintroduced the very next runtime :/

Reviewers: RKSimon, andreadb, courbet, craig.topper, javed.absar, gbedwell

Reviewed By: gbedwell

Subscribers: javed.absar, tschuett, gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47697

llvm-svn: 333887
2018-06-04 11:48:46 +00:00
Clement Courbet 2e41c5a79c [X86] Introduce WriteFLDC for x87 constant loads.
Summary:
{FLDL2E, FLDL2T, FLDLG2, FLDLN2, FLDPI} were using WriteMicrocoded.

 - I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
 - For ZnVer1 and Atom, values were transferred form InstRWs.
 - For SLM and BtVer2, I've guessed some values :(

Reviewers: RKSimon, craig.topper, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47585

llvm-svn: 333656
2018-05-31 14:22:01 +00:00
Clement Courbet b78ab5097d [X86] Extract latency of fldz/fld1 in separate classes.
Summary:
 - I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
 - For ZnVer1 and Atom, values were transferred form `InstRW`s.
 - For SLM and BtVer2, values are from Agner.

This is split off from https://reviews.llvm.org/D47377

Reviewers: RKSimon, andreadb

Subscribers: gbedwell, llvm-commits

Differential Revision: https://reviews.llvm.org/D47523

llvm-svn: 333642
2018-05-31 11:41:27 +00:00
Clement Courbet 07c9ec6f2e [X86][Sched] Add InstRW for CLC on Intel after SNB.
Summary:
After SNB, Intel CPUs can rename CF independently of other EFLAGS,
so the renamer can zero it for free. Note that STC still consumes resources.

To reproduce: `$ llvm-exegesis -mode=uops -opcode-name=CLC`

On SNB:
```
---
key:
  opcode_name:     CLC
  mode:            uops
  config:          ''
cpu_name:        sandybridge
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: '3', value: 0.0014, debug_string: SBPort0 }
  - { key: '4', value: 0.0013, debug_string: SBPort1 }
  - { key: '5', value: 0.0003, debug_string: SBPort4 }
  - { key: '6', value: 0.0029, debug_string: SBPort5 }
  - { key: '10', value: 0.0003, debug_string: SBPort23 }
error:           ''
info:            'instruction is serial, repeating a random one.
Snippet:
CLC
'
...
```

On HSW:
```
---
key:
  opcode_name:     CLC
  mode:            uops
  config:          ''
cpu_name:        haswell
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: '3', value: 0.001, debug_string: HWPort0 }
  - { key: '4', value: 0.0009, debug_string: HWPort1 }
  - { key: '5', value: 0.0004, debug_string: HWPort2 }
  - { key: '6', value: 0.0006, debug_string: HWPort3 }
  - { key: '7', value: 0.0002, debug_string: HWPort4 }
  - { key: '8', value: 0.0012, debug_string: HWPort5 }
  - { key: '9', value: 0.0022, debug_string: HWPort6 }
  - { key: '10', value: 0.0001, debug_string: HWPort7 }
error:           ''
info:            'instruction is serial, repeating a random one.
Snippet:
CLC
'
...

```

Reviewers: craig.topper, RKSimon

Subscribers: gchatelet, llvm-commits

Differential Revision: https://reviews.llvm.org/D47362

llvm-svn: 333392
2018-05-29 06:19:39 +00:00
Simon Pilgrim 0155bf0da9 [X86][SNB] Fix differences between vex/non-vex XMM vector moves (PR37286)
As confirmed by llvm-exegesis, there is no scheduler difference between MOVDQA/MOVDQU and VMOVDQA/VMOVDQU xmm reg-reg moves

Another chapter in the never ending crusade to remove useless InstRW overrides from the x86 scheduler models......

llvm-svn: 333271
2018-05-25 12:18:11 +00:00
Greg Bedwell e790f6fb06 [UpdateTestChecks] Improved update_mca_test_checks block analysis
Previously update_mca_test_checks worked entirely at "block" level where
a block is some sequence of lines delimited by at least one empty line.
This generally worked well, but could sometimes lead to excessive
repetition of check lines for various prefixes if some block was almost
identical between prefixes, but not quite (for example, due to a
different dispatch width in the otherwise identical summary views).

This new analyis attempts to split blocks further in the case where the
following conditions are met:
  a) There is some prefix common to every RUN line (typically 'ALL').
  b) The first line of the block is common to the output with every prefix.
  c) The block has the same number of lines for the output with every prefix.

Also, regenerated all llvm-mca test files with the following command:
update_mca_test_checks.py "../test/tools/llvm-mca/*/*.s" "../test/tools/llvm-mca/*/*/*.s"

The new analysis showed a "multiple lines not disambiguated by prefixes" warning
for test "AArch64/Exynos/scheduler-queue-usage.s" so I've also added some
explicit prefixes to each of the RUN lines in that test.

Differential Revision: https://reviews.llvm.org/D47321

llvm-svn: 333204
2018-05-24 16:36:44 +00:00
Andrea Di Biagio 3fc20c9c7f [llvm-mca] Print the "Block RThroughput" in the SummaryView.
This patch implements the "block reciprocal throughput" computation in the
SummaryView.

The block reciprocal throughput is computed as the MAX of:
  - NumMicroOps / DispatchWidth
  - Resource Cycles / #Units   (for every resource consumed).

The block throughput is bounded from above by the hardware dispatch throughput.
That is because the DispatchWidth is an upper bound on how many opcodes can be part
of a single dispatch group.

The block throughput is also limited by the amount of hardware parallelism. The
number of available resource units affects how the resource pressure is
distributed, and also how many blocks can be delivered every cycle.

llvm-svn: 333095
2018-05-23 15:59:27 +00:00
Andrea Di Biagio cb1ed400a4 [llvm-mca] Removed an empty line generated by the timeline view. NFC.
Also, regenerate all tests.

llvm-svn: 332853
2018-05-21 17:11:56 +00:00
Andrea Di Biagio b5757abefb [X86][BtVer2] Add a 'J' prefix to the PRF/RCU defs. NFC
This is to keep the Jaguar model's naming convention. Processor resources all
have a 'J' prefix in the BtVer2 scheduling model.

llvm-svn: 332851
2018-05-21 16:30:26 +00:00
Simon Pilgrim 1273f4ad93 [X86] Add GPR<->XMM Schedule Tags
BtVer2 - fix NumMicroOp and account for the Lat+6cy GPR->XMM and Lat+1cy XMm->GPR delays (see rL332737)

The high number of MOVD/MOVQ equivalent instructions meant that there were a number of missed patterns in SNB/Znver1:
SNB - add missing GPR<->MMX costs (taken from Agner / Intel AOM)
Znver1 - add missing GPR<->XMM MOVQ costs (taken from Agner)

llvm-svn: 332745
2018-05-18 17:58:36 +00:00
Simon Pilgrim 007b50fd35 [X86][BtVer2] Improve simulation of (V)PINSR values
Include the 6cy delay transferring from the GPR to FPU.

llvm-svn: 332737
2018-05-18 17:09:41 +00:00
Simon Pilgrim 3ecb0b80f6 [X86][BtVer2] Partial vector stores (inc MMX) have a 2cy latency
llvm-svn: 332722
2018-05-18 14:22:22 +00:00
Simon Pilgrim c4b8d367a8 [X86][SSE] Ensure vector partial load/stores use the WriteVecLoad/WriteVecStore scheduler classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVQ/MOVD etc.

Fixes BtVer2/SLM which have different behaviours for GPR stores.

llvm-svn: 332718
2018-05-18 14:08:01 +00:00
Simon Pilgrim d749b321b2 [X86][SSE] Ensure float load/stores use the WriteFLoad/WriteFStore scheduler classes
Retag some instructions that were missed when we split off vector load/store/moves - MOVSS/MOVSD/MOVHPD/MOVHPD/MOVLPD/MOVLPS etc.

Fixes BtVer2/SLM which have different behaviours for GPR stores.

llvm-svn: 332714
2018-05-18 13:13:59 +00:00
Simon Pilgrim e389ea0e3e [llvm-mca][X86] Add CMOV test files
llvm-svn: 332622
2018-05-17 16:29:12 +00:00
Simon Pilgrim b5741f5c3d [X86][BtVer2] ADC/SBB take 2cy on an ALU pipe, not 1cy like ADD/SUB
llvm-svn: 332616
2018-05-17 15:43:23 +00:00
Andrea Di Biagio 650b5fc6cb [llvm-mca] add flag -all-views and flag -all-stats.
Flag -all-views enables all the views.
Flag -all-stats enables all the views that print hardware statistics.

llvm-svn: 332602
2018-05-17 12:27:03 +00:00
Simon Pilgrim b4fd145fc3 [llvm-mca][X86] Add ADX test files
llvm-svn: 332595
2018-05-17 11:32:38 +00:00
Simon Pilgrim d5d77dcb46 [X86] Fix typo in instregex for CVTSI642SDrr
llvm-svn: 332510
2018-05-16 18:31:17 +00:00
Andrea Di Biagio 45ccdd1785 [llvm-mca] Regenerate tests after r332381 and r332361. NFC
llvm-svn: 332447
2018-05-16 10:12:06 +00:00
Simon Pilgrim be9a206883 [X86] Split WriteCvtF2F into F32->F64 and F64->F32 scheduler classes
BtVer2 - Fixes schedules for (V)CVTPS2PD instructions

A lot of the Intel models still have too many InstRW overrides for these new classes - this needs cleaning up but I wanted to get the classes in first

llvm-svn: 332376
2018-05-15 17:36:49 +00:00
Simon Pilgrim 891ebcdbaa [X86] Split off F16C WriteCvtPH2PS/WriteCvtPS2PH scheduler classes
Btver2 - VCVTPH2PSYrm needs to double pump the AGU
Broadwell - missing VCVTPS2PH*mr stores extra latency

Allows us to remove the WriteCvtF2FSt conversion store class

llvm-svn: 332357
2018-05-15 14:12:32 +00:00
Simon Pilgrim 2aa395abcf [llvm-mca][x86] Add F16C instruction tests
llvm-svn: 332347
2018-05-15 12:50:06 +00:00
Simon Pilgrim 5bd5e2fd3e [llvm-mca][X86] Add missing SSE4A test file
llvm-svn: 332270
2018-05-14 18:20:40 +00:00
Simon Pilgrim 228d24a2d6 [X86][BtVer2] Fix MMX/YMM integer vector nt store schedules
MMX was missing and YMM was tagged as a fp nt store

llvm-svn: 332269
2018-05-14 18:07:28 +00:00
Simon Pilgrim 4135de2e93 [llvm-mca][x86] Add scalar nt-store instruction tests
llvm-svn: 332262
2018-05-14 17:10:33 +00:00
Simon Pilgrim 7340d88740 [llvm-mca][x86] Add and/not/or/xor instruction tests
llvm-svn: 332257
2018-05-14 16:26:24 +00:00
Simon Pilgrim 661ae7778d [X86][BtVer2] Model ymm move as double pumped instructions
We still need to handle mmx/xmm moves as 'decode-only' no-pipe instructions

llvm-svn: 332109
2018-05-11 17:38:36 +00:00
Simon Pilgrim 706403bab8 [X86][MMX] Tag MMX Move/Load/Store as WriteVec schedule classes
Fixes an issue on SLM/Btver2 where we had instructions were being treated as scalar loads/stores

llvm-svn: 332104
2018-05-11 16:38:59 +00:00
Simon Pilgrim 032a01f74a [X86][SLM] Vector stores only use the MEC port.
Confirmed by both Agner and Intel's AOM - the IEC/FPC are not required for pure load/stores (even if its a partial update).

Can't fix WriteStore until all RMW instructions are cleaned up though....

llvm-svn: 332096
2018-05-11 15:16:15 +00:00