I've put CMPXCHG8B/CMPXCHG16B in the same file, even though technically they are under separate CPUID bits all targets seem to support both (or neither).
llvm-svn: 338595
These aren't exhaustive, but cover some instructions that are only available in 32-bit mode (where would we be without good BCD math performance?).
llvm-svn: 338404
This patch teaches llvm-mca how to identify dependency breaking instructions on
btver2.
An example of dependency breaking instructions is the zero-idiom XOR (example:
`XOR %eax, %eax`), which always generates zero regardless of the actual value of
the input register operands.
Dependency breaking instructions don't have to wait on their input register
operands before executing. This is because the computation is not dependent on
the inputs.
Not all dependency breaking idioms are also zero-latency instructions. For
example, `CMPEQ %xmm1, %xmm1` is independent on
the value of XMM1, and it generates a vector of all-ones.
That instruction is not eliminated at register renaming stage, and its opcode is
issued to a pipeline for execution. So, the latency is not zero.
This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis
interface. That method takes as input an instruction (i.e. MCInst) and a
MCSubtargetInfo.
The default implementation of isDependencyBreaking() conservatively returns
false for all instructions. Targets may override the default behavior for
specific CPUs, and return a value which better matches the subtarget behavior.
In future, we should teach to Tablegen how to automatically generate the body of
isDependencyBreaking from scheduling predicate definitions. This would allow us
to expose the knowledge about dependency breaking instructions to the machine
schedulers (and, potentially, other codegen passes).
Differential Revision: https://reviews.llvm.org/D49310
llvm-svn: 338372
Summary:
Pretty mechanical follow-up for D49196.
As microarchitecture.pdf notes, "20 AMD Ryzen pipeline",
"20.8 Register renaming and out-of-order schedulers":
The integer register file has 168 physical registers of 64 bits each.
The floating point register file has 160 registers of 128 bits each.
"20.14 Partial register access":
The processor always keeps the different parts of an integer register together.
...
An instruction that writes to part of a register will therefore have a false dependence
on any previous write to the same register or any part of it.
Reviewers: andreadb, courbet, RKSimon, craig.topper, GGanesh
Reviewed By: GGanesh
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D49393
llvm-svn: 337676
This patch fixes the latency/throughput of LEA instructions in the BtVer2
scheduling model.
On Jaguar, A 3-operands LEA has a latency of 2cy, and a reciprocal throughput of
1. That is because it uses one cycle of SAGU followed by 1cy of ALU1. An LEA
with a "Scale" operand is also slow, and it has the same latency profile as the
3-operands LEA. An LEA16r has a latency of 3cy, and a throughput of 0.5 (i.e.
RThrouhgput of 2.0).
This patch adds a new TIIPredicate named IsThreeOperandsLEAFn to X86Schedule.td.
The tablegen backend (for instruction-info) expands that definition into this
(file X86GenInstrInfo.inc):
```
static bool isThreeOperandsLEA(const MachineInstr &MI) {
return (
(
MI.getOpcode() == X86::LEA32r
|| MI.getOpcode() == X86::LEA64r
|| MI.getOpcode() == X86::LEA64_32r
|| MI.getOpcode() == X86::LEA16r
)
&& MI.getOperand(1).isReg()
&& MI.getOperand(1).getReg() != 0
&& MI.getOperand(3).isReg()
&& MI.getOperand(3).getReg() != 0
&& (
(
MI.getOperand(4).isImm()
&& MI.getOperand(4).getImm() != 0
)
|| (MI.getOperand(4).isGlobal())
)
);
}
```
A similar method is generated in the X86_MC namespace, and included into
X86MCTargetDesc.cpp (the declaration lives in X86MCTargetDesc.h).
Back to the BtVer2 scheduling model:
A new scheduling predicate named JSlowLEAPredicate now checks if either the
instruction is a three-operands LEA, or it is an LEA with a Scale value
different than 1.
A variant scheduling class uses that new predicate to correctly select the
appropriate latency profile.
Differential Revision: https://reviews.llvm.org/D49436
llvm-svn: 337469
Add llvm-mca tests demonstrating how LEA instructions are currently modelled. Once this is working on btver2 I'll copy the test file to the other target directories.
llvm-svn: 337297
registers.
The goal of this patch is to improve the throughput analysis in llvm-mca for the
case where instructions perform partial register writes.
On x86, partial register writes are quite difficult to model, mainly because
different processors tend to implement different register merging schemes in
hardware.
When the code contains partial register writes, the IPC (instructions per
cycles) estimated by llvm-mca tends to diverge quite significantly from the
observed IPC (using perf).
Modern AMD processors (at least, from Bulldozer onwards) don't rename partial
registers. Quoting Agner Fog's microarchitecture.pdf:
" The processor always keeps the different parts of an integer register together.
For example, AL and AH are not treated as independent by the out-of-order
execution mechanism. An instruction that writes to part of a register will
therefore have a false dependence on any previous write to the same register or
any part of it."
This patch is a first important step towards improving the analysis of partial
register updates. It changes the semantic of RegisterFile descriptors in
tablegen, and teaches llvm-mca how to identify false dependences in the presence
of partial register writes (for more details: see the new code comments in
include/Target/TargetSchedule.h - class RegisterFile).
This patch doesn't address the case where a write to a part of a register is
followed by a read from the whole register. On Intel chips, high8 registers
(AH/BH/CH/DH)) can be stored in separate physical registers. However, a later
(dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which
adds extra latency (and potentially affects the pipe usage).
This is a very interesting article on the subject with a very informative answer
from Peter Cordes:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
In future, the definition of RegisterFile can be extended with extra information
that may be used to identify delays caused by merge opcodes triggered by a dirty
read of a partial write.
Differential Revision: https://reviews.llvm.org/D49196
llvm-svn: 337123
Before revision 336728, the "mayLoad" flag for instruction (V)MOVLPSrm was
inferred directly from the "default" pattern associated with the instruction
definition.
r336728 removed special node X86Movlps, and all the patterns associated to it.
Now instruction (V)MOVLPSrm doesn't have a pattern associated to it, and the
'mayLoad/hasSideEffects' flags are left unset.
When the instruction info is emitted by tablegen, method
CodeGenDAGPatterns::InferInstructionFlags() sees that (V)MOVLPSrm doesn't have a
pattern, and flags are undefined. So, it conservatively sets the
"hasSideEffects" flag for it.
As a consequence, we were losing the 'mayLoad' flag, and we were gaining a
'hasSideEffect' flag in its place.
This patch fixes the issue (originally reported by Michael Holmen).
The mca tests show the differences in the instruction info flags. Instructions
that were affected by this problem were: MOVLPSrm/VMOVLPSrm/VMOVLPSZ128rm.
Differential Revision: https://reviews.llvm.org/D49182
llvm-svn: 336818
This makes easier to identify changes in the instruction info flags. It also
helps spotting potential regressions similar to the one recently introduced at
r336728.
Using the same character to mark MayLoad/MayStore/HasSideEffects is problematic
for llvm-lit. When pattern matching substrings, llvm-lit consumes tabs and
spaces. A change in position of the flag marker may not trigger a test failure.
This patch only changes the character used for flag `hasSideEffects`. The reason
why I didn't touch other flags is because I want to avoid spamming the mailing
because of the massive diff due to the numerous tests affected by this change.
In future, each instruction flag should be associated with a different character
in the Instruction Info View.
llvm-svn: 336797
llvm-mca doesn't know that on modern AMD processors, portions of a general
purpose register are not treated independently. So, a partial register write has
a false dependency on the super-register.
The issue with partial register writes will be addressed by a follow-up patch.
llvm-svn: 336778
This is a short-term fix for PR38093.
For now, we llvm::report_fatal_error if the instruction builder finds an
unsupported instruction in the instruction stream.
We need to revisit this fix once we start addressing PR38101.
Essentially, we need a better framework for error handling.
llvm-svn: 336543
This patch modifies the Scheduler heuristic used to select the next instruction
to issue to the pipelines.
The motivating example is test X86/BtVer2/add-sequence.s, for which llvm-mca
wrongly reported an estimated IPC of 1.50. According to perf, the actual IPC for
that test should have been ~2.00.
It turns out that an IPC of 2.00 for test add-sequence.s cannot possibly be
predicted by a Scheduler that only prioritizes instructions based on their
"age". A similar issue also affected test X86/BtVer2/dependent-pmuld-paddd.s,
for which llvm-mca wrongly estimated an IPC of 0.84 instead of an IPC of 1.00.
Instructions in the ReadyQueue are now ranked based on two factors:
- The "age" of an instruction.
- The number of unique users of writes associated with an instruction.
The new logic still prioritizes older instructions over younger instructions to
minimize the pressure on the reorder buffer. However, the number of users of an
instruction now also affects the overall rank. This potentially increases the
ability of the Scheduler to extract instruction level parallelism. This patch
fixes the problem with the wrong IPC reported for test add-sequence.s and test
dependent-pmuld-paddd.s.
llvm-svn: 336420
Summary: As per `Agner's Microarchitecture doc
(21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions)`,
these, like zero-idioms, are dependency-breaking,
although they produce ones and still consume resources.
FIXME: as discussed in D48877, llvm-mca handling is broken for these.
Reviewers: andreadb
Reviewed By: andreadb
Subscribers: gbedwell, RKSimon, llvm-commits
Differential Revision: https://reviews.llvm.org/D48876
llvm-svn: 336292
This patch teaches llvm-mca how to identify register writes that implicitly zero
the upper portion of a super-register.
On X86-64, a general purpose register is implemented in hardware as a 64-bit
register. Quoting the Intel 64 Software Developer's Manual: "an update to the
lower 32 bits of a 64 bit integer register is architecturally defined to zero
extend the upper 32 bits". Also, a write to an XMM register performed by an AVX
instruction implicitly zeroes the upper 128 bits of the aliasing YMM register.
This patch adds a new method named clearsSuperRegisters to the MCInstrAnalysis
interface to help identify instructions that implicitly clear the upper portion
of a super-register. The rest of the patch teaches llvm-mca how to use that new
method to obtain the information, and update the register dependencies
accordingly.
I compared the kernels from tests clear-super-register-1.s and
clear-super-register-2.s against the output from perf on btver2. Previously
there was a large discrepancy between the estimated IPC and the measured IPC.
Now the differences are mostly in the noise.
Differential Revision: https://reviews.llvm.org/D48225
llvm-svn: 335113
Summary:
First off: i do not have any access to that processor,
so this is purely theoretical, no benchmarks.
I have been looking into b**d**ver2 scheduling profile, and while cross-referencing
the existing b**t**ver2, znver1 profiles, and the reference docs
(`Software Optimization Guide for AMD Family {15,16,17}h Processors`),
i have noticed that only b**t**ver2 scheduling profile specifies these.
Also, there is no mca test coverage.
Reviewers: RKSimon, craig.topper, courbet, GGanesh, andreadb
Reviewed By: GGanesh
Subscribers: gbedwell, vprasad, ddibyend, shivaram, Ashutosh, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D47676
llvm-svn: 335099
Summary:
I ran llvm-exegesis on SKX, SKL, BDW, HSW, SNB.
Atom is from Agner and SLM is a guess.
I've left AMD processors alone.
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D48079
llvm-svn: 335097
Summary:
Based on
* [[ https://support.amd.com/TechDocs/43479.pdf | AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions ]],
* [[ https://support.amd.com/TechDocs/24594.pdf | AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions]],
* https://en.wikipedia.org/wiki/XOP_instruction_set
Appears to be only supported in AMD's 15h generation, so only in b**d**ver[1-4],
for which currently llvm has no scheduling profiles.
Reviewers: RKSimon, craig.topper, andreadb, spatel
Reviewed By: RKSimon
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D48264
llvm-svn: 335034
When the destination register of a XOP instruction is an XMM register, bits
[255:128] of the corresponding YMM register are cleared.
When the destination register of a EVEX encoded instruction is an XMM/YMM
register, the upper bits of the corresponding ZMM are cleared.
On processors that feature AVX512, a write to an XMM registers always clears the
upper portion of the corresponding ZMM register if the instruction is VEX or
EVEX encoded.
These new tests show some interesting cases which aren't correctly analyzed by
llvm-mca. The lack of knowledge related to the implicit update on the
super-registers is addressed by D48225.
llvm-svn: 334945
There are a lot of instructions to add under these ISAs (and the other AVX512 variants) but this should demonstrate how to test for the EVEX instructions with different maskings
llvm-svn: 334907
Added a Generic x86 cpu set of resource tests to allow us to check all ISAs.
We currently use SandyBridge as our generic CPU model, but it's better if we actually duplicate these tests for if/when we change the model, it also means we don't end up polluting the SandyBridge folder with tests for ISAs it doesn't support.
llvm-svn: 334853
Summary:
While that is indeed a quite interesting summary stat,
there are cases where it does not really add anything
other than consuming extra lines.
Declutters the output of D48190.
Reviewers: RKSimon, andreadb, courbet, craig.topper
Reviewed By: andreadb
Subscribers: javed.absar, gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D48209
llvm-svn: 334833
Summary:
There does not seem to be any other tests for this.
Split off from D47676.
Reviewers: RKSimon, craig.topper, courbet, andreadb
Reviewed By: andreadb
Subscribers: javed.absar, gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D48190
llvm-svn: 334832
On x86-64, a write to register EAX implicitly clears the upper half or RAX.
128-bit AVX instructions clear the upper 128-bit of the YMM register that
aliases the XMM definition register.
llvm-mca doesn't know about register writes that implicitly clear the upper
portion of an aliasing super-register. This issue will be fixed in a future patch.
llvm-svn: 334742
This test checks that a physical register is correctly allocated for the partial
write to register BX.
The ADD instruction has to wait for the write to RBX (and BX) before being
executed.
llvm-svn: 334730