This is preparation for making clang default to -mtune=generic when no -march is specified. This will allow the default tuning to be "generic" even though our default march is "pentium4" or "x86-64".
To avoid llc lit test regressions, if no mcpu is specified, I've defaulted the tune CPU to i586 to match the old tuning settings used when no CPU was specified. Some tests explicitly used -mcpu=generic; I've removed that so they instead get this default: architecture features from generic and tuning from i586.
I updated one llvm-mca test to check a different CPU since generic now has a scheduler model.
Differential Revision: https://reviews.llvm.org/D86312
LDRD and STRD along with UBFX and SBFX are selected from DAGToDAG
transforms, so do not have tblgen patterns. Because of that they do not get
marked as not having side effects, so they cannot be scheduled as efficiently
as you would like.
This patch explicitly marks them as not having side effects.
Differential Revision: https://reviews.llvm.org/D82358
Summary:
Some instructions, like VPMULDQ, are NOT size-suffixed variants of another
instruction (VPMULD) but new instructions in their own right.
So we should make sure the suffix matcher only applies to memory variants
whose size matches the suffix.
Currently we only check SSE/AVX* instructions, because many legacy
instructions did not declare the alias instructions for their variants.
Differential Revision: https://reviews.llvm.org/D80608
This fixes a bug reported by Alex Renda on LLVMDev where mca did not correctly
mark a resource group as "reserved".
(See http://lists.llvm.org/pipermail/llvm-dev/2020-May/141485.html).
The issue was caused by a wrong check in function `initializeUsedResources`.
As a consequence of this, a resource group was left unreserved, and its field
`NumUnits` incorrectly reported an unrealistic number of consumed resource
units.
This patch fixes the issue with the handling of reserved resources in the
InstrBuilder class, and adds a simple test for it. Ideally, as suggested by
Andy Trick, most of these problems will disappear if in the future we
introduce an (optional) DelayCycles vector for SchedWriteRes.
This fixes a regression introduced by a very old commit 280ac1fd1d (was
llvm-svn 361950).
Commit 280ac1fd1d redesigned the logic in the LSUnit with the goal of
speeding up isReady() queries, and stabilising the LSUnit API (while also making
the load store unit more customisable).
The concept of MemoryGroup (effectively an alias set) was added by that commit
to better describe and track dependencies between memory operations. However,
that concept was not just used for alias dependencies, but it was also used for
describing memory "order" dependencies (enforced by the memory consistency
model).
Instructions of the same memory group were considered "equivalent":
independent operations that can potentially execute in parallel. The problem
was that the cost of a dependency (in terms of number of cycles) should have
been different for "order" dependencies. Instructions in an order dependency
simply have to wait until their predecessors are "issued" to an
underlying pipeline (rather than having to wait until predecessors have been
fully executed). For simple "order" dependencies, this was effectively
introducing an artificial delay on the "issue" of independent loads and stores.
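As an illustration, consider a sequence of independent loads and stores (a sketch only, not taken from the actual tests):
```
# Independent loads and stores: no data dependencies and no aliasing, only
# memory "order" edges between them. Before this fix, each operation could be
# artificially delayed until its predecessors had fully executed; now it only
# waits for them to issue.
movq  (%rdi), %rax
movq 8(%rdi), %rcx
movq %rdx,  (%rsi)
movq %rbx, 8(%rsi)
```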
This patch fixes the issue and adds a new test named 'independent-load-stores.s'
to a bunch of x86 targets. That test contains the reproducer posted by Fabian
Ritter on PR45793.
I had to rerun the update-mca-tests script on several files. To avoid expected
regressions on some Exynos tests, I have added a -noalias=false flag (to match
the old strict behavior on latencies).
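For reference, a RUN line of the kind used to keep the old strict behavior (illustrative only; the CPU and checks in the actual tests may differ):
```
# RUN: llvm-mca -mtriple=aarch64 -mcpu=exynos-m3 -noalias=false %s | FileCheck %s
```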
Some tests for processor Barcelona are improved/fixed by this change and they
now show better results. In a few tests we were incorrectly counting the time
spent by instructions in a scheduler queue. In one case in particular we now
correctly see a store executed out of order. That test was affected by the same
underlying issue reported as PR45793.
Reviewers: mattd
Differential Revision: https://reviews.llvm.org/D79351
There is no need to use `--check-prefix` multiple times; a single
`--check-prefixes` improves readability and test maintainability.
This patch does it for all tools at once.
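For example (a hypothetical RUN line with made-up prefix names, not taken from a specific test):
```
# Before:
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 %s | FileCheck %s --check-prefix=ALL --check-prefix=BTVER2
# After:
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 %s | FileCheck %s --check-prefixes=ALL,BTVER2
```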
Differential revision: https://reviews.llvm.org/D78217
uops.info says these should be 15-cycle instructions. uops.info also shows the 512-bit form using ports 0 and 5 for both the register and memory forms. We had the memory form using ports 0 and 1.
Differential Revision: https://reviews.llvm.org/D75549
Based on uops.info these should have 5 cycle latency as they did on Haswell/Broadwell. I have no additional internal information from Intel.
This was also shown as a discrepancy in the spreadsheet that was sent with an early llvm-dev post about llvm-exegesis.
It also matches Agner Fog.
Differential Revision: https://reviews.llvm.org/D74357
The load ports need a cycle for each potentially loaded element just like Haswell and Skylake. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead the load uops run for multiple cycles.
I've taken the latency number from uops.info. The port binding for the non-load uops is taken from the original IACA data I have.
Differential Revision: https://reviews.llvm.org/D74000
Broadwell was missing half the gather instructions. Both models
had some mixups in the resource costs and number of uops.
I've updated here based on what I think the original IACA source
says with some cross checking against the microcode.
I'm not sure about latency as the IACA source I have doesn't have
that information. So I'm using the latency from uops.info.
I plan to update Skylake models as well, but I'll do that in a
separate patch.
Differential Revision: https://reviews.llvm.org/D73844
Summary:
As determined with `llvm-exegesis`.
Some of these look like typos/misunderstandings of the sched model td
spec:
- latency defaults to `1` when not set => maybe we can avoid having a default?
- problems with regexps not being anchored by default (XCHG matching CMPXCHG)
Note that this is not complete, it fixes only the most obvious mistakes,
and only for latency (not uops).
Reviewers: RKSimon, GGanesh
Subscribers: hiraditya, jfb, mstojanovic, hfinkel, craig.topper, andreadb, lebedev.ri, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73172
Based on exhaustive llvm-exegesis measurements.
There may still be some imperfections for LEA16r/LEA32r.
Much like what was observed in D68646, I'm also measuring some outliers
with some specific registers.
Summary:
This patch fixes PR23772: [ARM] r226200 can emit an illegal Thumb2 instruction: "sub sp, r12, #80".
The violation was that SUB and ADD (reg, immediate) instructions can only write to SP if the source register is also SP, so the above instruction was unpredictable.
To enforce that the t2(ADD|SUB)ri instructions do not write to SP, we now restrict the destination register to rGPR (which excludes PC and SP).
Unlike the ARM specification, which defines one instruction that can read from SP and one that cannot, here we added one instruction that cannot write to SP and another that can only write to SP, so as to reuse most of the hard-coded size optimizations.
While making this change, it became apparent that emitting Thumb2 Reg plus Immediate could not previously emit all variants of ADD SP, SP, #imm, so it was refactored to be able to (see test/CodeGen/Thumb2/mve-stacksplot.mir, where we use a subw sp, sp, Imm12 variant).
It also uncovered a disassembly issue with adr.w instructions, which were only printed as SUBW instructions (see llvm/test/MC/Disassembler/ARM/thumb2.txt).
Reviewers: eli.friedman, dmgreen, carwil, olista01, efriedma, andreadb
Reviewed By: efriedma
Subscribers: gbedwell, john.brawn, efriedma, ostannard, kristof.beyls, hiraditya, dmgreen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70680
The patch provides the details of the znver2 scheduler model.
There are a few improvements with respect to execution units, latencies and
throughput when compared with znver1.
The tests that were present for znver1 for the llvm-mca tool were replicated, and
the latencies, execution units, timeline and throughput information are updated for znver2.
Reviewers: craig.topper, Simon Pilgrim
Differential Revision: https://reviews.llvm.org/D66088
Noticed while fixing the reduction costs for D59710 - the SLM model doesn't account for the poor throughput of v2i64 ops.
Numbers taken from Intel AOM (+ checked against Agner)
This patch introduces the following changes to the btver2 scheduling model:
- The number of micro opcodes for YMM loads and stores is now 2 (it was
incorrectly set to 1 for both aligned and misaligned loads/stores).
- Increased the number of AGU resource cycles for YMM loads and stores
to 2cy (instead of 1cy).
- Removed JFPU01 and JFPX from the list of resources consumed by pure
float/vector loads (no MMX).
I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those
are dispatched to the FPU but not really issued on JFPU01.
Differential Revision: https://reviews.llvm.org/D68871
llvm-svn: 374765
Summary:
As discussed in https://bugs.llvm.org/show_bug.cgi?id=43219,
I believe it may be somewhat useful to show //some// aggregates
over all the sea of statistics provided.
Example:
```
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
         [0]    [1]    [2]    [3]
0.        3     1.0    1.0    4.7   vmulps  %xmm0, %xmm1, %xmm2
1.        3     2.7    0.0    2.3   vhaddps %xmm2, %xmm2, %xmm3
2.        3     6.0    0.0    0.0   vhaddps %xmm3, %xmm3, %xmm4
          3     3.2    0.3    2.3   <total>
```
I.e. we average the averages.
Reviewers: andreadb, mattd, RKSimon
Reviewed By: andreadb
Subscribers: gbedwell, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68714
llvm-svn: 374361
Before this patch, loads and stores were only tracked by their corresponding
queues in the LSUnit from dispatch until execute stage. In practice we should be
more conservative and assume that memory opcodes leave their queues at
retirement stage.
Basically, loads should leave the load queue only when they have completed and
delivered their data. We conservatively assume that a load is completed when it
is retired. Stores should be tracked by the store queue from dispatch until
retirement. In practice, stores can only leave the store queue if their data can
be written to the data cache.
This is mostly a mechanical change. With this patch, the retire stage notifies
the LSUnit when a memory instruction is retired; that triggers the release
of LDQ/STQ entries. The only visible change is in memory tests for the bdver2
model. That is because bdver2 is the only model that defines the load/store
queue size.
This patch partially addresses PR39830.
Differential Revision: https://reviews.llvm.org/D68266
llvm-svn: 374034
This adds a -mattr flag to llvm-mca, for cases where the -mcpu option does not
contain all optional features.
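For example (a hypothetical invocation; the feature name is chosen only for illustration):
```
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -mattr=+avx2 %s
```
Here -mattr enables an extra ISA feature on top of those implied by -mcpu.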
Differential Revision: https://reviews.llvm.org/D68190
llvm-svn: 373358
This patch introduces a cut-off threshold for dependency edge frequencies with
the goal of simplifying the critical sequence computation. This patch also
removes the cost normalization for loop carried dependencies. We didn't really
need to artificially amplify the cost of loop-carried dependencies since it is
already computed as the integral over time of the delay (in cycles).
In the absence of backend stalls there is no need for computing a critical
sequence. With this patch we early exit from the critical sequence computation
if no bottleneck was reported during the simulation.
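For context, the critical sequence is reported as part of the bottleneck analysis view; a typical invocation that enables it looks like this (illustrative, not taken from a specific test):
```
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -bottleneck-analysis %s
```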
llvm-svn: 372337
On BtVer2 conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted
from the input vector. Also, according to micro-benchmarks, most of the
computation seems to be done in the integer unit.
Only a minority of the uOPs is executed by the FPU. The observed behaviour on
the FPU looks similar to this:
- The input MASK value is moved to the Integer Unit
-- [a VMOVMSK-like uOP executed on JFPU0].
- In parallel, each element of the input XMM/YMM is extracted and then sent to
the IntegerUnit through JFPU1.
As expected, a (conditional) store is executed for every extracted element.
Interestingly, a (speculative) load is executed for every extracted element too.
It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the
integer unit for every conditionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2
(presumably, one load per quadword). That means extra shifting and masking is
performed on (one of) the loaded quadwords before each conditional store (which
also explains the large number of non-FP uOPs retired).
This patch replaces the existing writes for conditional SIMD stores (i.e.
WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:
WriteFMaskedStore32 [ XMM Packed Single ]
WriteFMaskedStore32Y [ YMM Packed Single ]
WriteFMaskedStore64 [ XMM Packed Double ]
WriteFMaskedStore64Y [ YMM Packed Double ]
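As a sketch, these are the kinds of conditional store forms the new writes are meant to cover (AT&T syntax; the mapping is inferred from the descriptions above, not copied from the patch):
```
vmaskmovps %xmm2, %xmm1, (%rdi)   # WriteFMaskedStore32  - XMM packed single
vmaskmovps %ymm2, %ymm1, (%rdi)   # WriteFMaskedStore32Y - YMM packed single
vmaskmovpd %xmm2, %xmm1, (%rdi)   # WriteFMaskedStore64  - XMM packed double
vmaskmovpd %ymm2, %ymm1, (%rdi)   # WriteFMaskedStore64Y - YMM packed double
```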
Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe
both RM and MR variants for conditional SIMD moves in a single tablegen
definition.
Instances of that class are then passed as input to the multiclass avx_movmask_rm
when constructing MASKMOVPS/PD definitions.
Since this patch introduces new writes, I had to update all the X86 scheduling
models.
Differential Revision: https://reviews.llvm.org/D66801
llvm-svn: 370649
This is a follow-up to r369642.
This patch assigns a ReadAfterLd to every implicit register use of instruction
CMPXCHG8B and instruction CMPXCHG16B. Perf micro-benchmarks show that implicit
registers are read after 3cy from the start of execution.
llvm-svn: 369750
Excluding ADC/SBB and the bit-test instructions (BTR/BTS/BTC), the observed
latency of all other RMW integer arithmetic/logic instructions is 6cy and not
5cy.
Example (ADD):
```
addb $0, (%rsp) # Latency: 6cy
addb $7, (%rsp) # Latency: 6cy
addb %sil, (%rsp) # Latency: 6cy
addw $0, (%rsp) # Latency: 6cy
addw $511, (%rsp) # Latency: 6cy
addw %si, (%rsp) # Latency: 6cy
addl $0, (%rsp) # Latency: 6cy
addl $511, (%rsp) # Latency: 6cy
addl %esi, (%rsp) # Latency: 6cy
addq $0, (%rsp) # Latency: 6cy
addq $511, (%rsp) # Latency: 6cy
addq %rsi, (%rsp) # Latency: 6cy
```
The same latency profile applies to SUB/AND/OR/XOR/INC/DEC.
The observed latency of ADC/SBB is 7-8cy. So we need a different write to model
those. Latency of BTS/BTR/BTC is not fixed by this patch (they are much slower
than what the model for btver2 currently reports).
Differential Revision: https://reviews.llvm.org/D66636
llvm-svn: 369748
On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes. Maximum
throughput for XCHG is 1 IPC. The byte exchange has worse latency and decodes to
1 extra uOP; maximum observed throughput is 0.5 IPC.
```
xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU
xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 2 ALU
xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 2 ALU
xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 2 ALU
```
The reg-mem forms of XCHG are atomic operations with an observed latency of
16cy. The resource usage is similar to the XCHGrr variants. The biggest
difference is obviously the bus-locking, which prevents the LS from issuing other
memory uOPs in parallel until the unlocking store uOP is executed.
```
xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 - ECX latency: 11cy
```
The exchanged in/out register operand becomes available after 11cy from the
start of execution. Added test xchg.s to verify that we correctly see that
register write committed in 11cy (and not 16cy).
Reg-reg XADD instructions have the same latency/throughput as the byte
exchange (register-register variant).
```
xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 3 ALU
xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 3 ALU
xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 3 ALU
xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 3 ALU
```
The non-atomic RM variants have a latency of 11cy, and decode to 4
macro-opcodes. They still consume 2 ALU pipes, and the exchanged in/out register
operand becomes available in 3cy (matching the 'load-to-use latency').
```
xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 3 ALU
```
The atomic XADD variants execute in 16cy. The in/out register operand is
available after 11cy from the start of execution.
```
lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 3 ALU -- ECX latency: 11cy
```
Added test xadd.s to verify those latencies as well as read-advance values.
Differential Revision: https://reviews.llvm.org/D66535
llvm-svn: 369642
Latency and throughput of LOCK INC/DEC/NEG/NOT is always 19cy.
Number of uOPs is still 1.
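Illustrative examples, annotated with the measured values above:
```
lock incl (%rsp)   # Latency: 19cy - uOPs: 1
lock decl (%rsp)   # Latency: 19cy - uOPs: 1
lock negl (%rsp)   # Latency: 19cy - uOPs: 1
lock notl (%rsp)   # Latency: 19cy - uOPs: 1
```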
Differential Revision: https://reviews.llvm.org/D66469
llvm-svn: 369388