Summary:
This patch removes InstRW overrides for basic arithmetic/logic instructions. To do this, I've added the store address port to RMW and used a WriteSequence to make the latency additive. It does not cover ADC/SBB because they have different latency.
Apparently we were inconsistent about whether the store has latency or not, hence the test changes.
I've also left out Sandy Bridge because the load latency there is currently 4 cycles and should be 5.
Reviewers: RKSimon, andreadb
Reviewed By: andreadb
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D45351
llvm-svn: 329416
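As a rough illustration of the additive latency described in r329416 above, a read-modify-write instruction can be viewed as a load, an ALU operation, and a store executed back to back. The sketch below is a hypothetical Python model, not TableGen; the function name and the cycle counts are placeholder assumptions and are not taken from any scheduler model.

```python
# Hypothetical model of a read-modify-write (RMW) instruction whose total
# latency is the sum of its stages, which is what a sequence of writes expresses.
def sequence_latency(*stage_latencies):
    """Return the total latency of stages that execute back to back."""
    return sum(stage_latencies)

# Placeholder cycle counts (assumptions for illustration only).
LOAD_LATENCY = 5
ALU_LATENCY = 1
STORE_LATENCY = 1

rmw_latency = sequence_latency(LOAD_LATENCY, ALU_LATENCY, STORE_LATENCY)
print(rmw_latency)
```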
As mentioned on D44647, this patch increases the default memory latency to +5cy, which more closely matches what most custom cases are doing for reg-mem instructions.
I've bumped LoadLatency, ReadAfterLd and WriteLoad values to 5cy to be consistent.
As Sandy Bridge is currently our default generic model, this affects a lot of scheduling tests...
Differential Revision: https://reviews.llvm.org/D44654
llvm-svn: 329388
MFI.LocalFrameSize was not serialized.
It is usually set from LocalStackSlotAllocation, so if that pass doesn't
run it is impossible to deduce it from the stack objects. Until now, this
information was lost.
llvm-svn: 329382
VSX D-form load/store instructions of POWER9 require the offset to be a multiple of 16, and a helper `isOffsetMultipleOf` is used to check this.
So far, the helper handles the FrameIndex + offset case, but not the FrameIndex without offset case. Due to this, we are missing opportunities to exploit D-form instructions when accessing an object or array allocated on the stack.
For example, an X-form store (stxvx) is used for int a[4] = {0}; instead of a D-form store (stxv). For larger arrays, a D-form instruction is not used when accessing the first 16 bytes. Using D-form instructions reduces register pressure as well as the number of instructions.
Differential Revision: https://reviews.llvm.org/D45079
llvm-svn: 329377
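The gap described in r329377 above can be sketched as follows: a bare frame index is treated as having offset 0, which is trivially a multiple of 16. This is a simplified Python illustration only. The function below borrows the helper's name, but its signature and the use of `None` for "no explicit offset" are assumptions, and the real helper may check more than the offset.

```python
from typing import Optional

# Hypothetical sketch: decide whether a stack access satisfies the
# "displacement is a multiple of 16" requirement of a D-form instruction.
def is_offset_multiple_of(offset: Optional[int], multiple: int = 16) -> bool:
    """offset is None for a bare frame index (no explicit offset)."""
    # Assumption: a frame index without an offset behaves like offset 0.
    effective = 0 if offset is None else offset
    return effective % multiple == 0
```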
This is the 32-bit mode version of LEAVE64. It should be at least somewhat similar to LEAVE64.
The Sandy Bridge version was missing a load port use.
llvm-svn: 329347
Currently it is 6. If the "feature" was not used, report a dummy
hidden argument. Otherwise it does not match the kernarg size
reported in the kernel header.
Differential Revision: https://reviews.llvm.org/D45129
llvm-svn: 329341
We were forcing the latency of these instructions to 5 cycles, but every other scheduler model had them as 1 cycle. I'm sure I didn't get everything, but this gets a big portion.
llvm-svn: 329339
This patch adds a way for users to create their own custom sections to
be added to wasm files. At the LLVM IR layer, they are defined through
the "wasm.custom_sections" named metadata. The expected use case for
this is bindings generators such as wasm-bindgen.
Patch by Dan Gohman
Differential Revision: https://reviews.llvm.org/D45297
llvm-svn: 329315
A noreturn nounwind function can be expected to never return in any way, and by
never returning it will also never have to restore any callee-saved registers
for its caller. This makes it possible to skip spills of those registers during
function entry, saving some stack space and time in the process. This is rather
useful for embedded targets with limited stack space.
Should fix PR9970.
Patch by myeisha (pmb).
llvm-svn: 329287
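The reasoning in r329287 above can be summarized with a small sketch: if a function can neither return nor unwind, nothing it clobbers is ever observed by its caller, so no callee-saved register needs to be spilled. The Python below is a hypothetical model; the function and parameter names are assumptions, and the actual implementation may apply further conditions.

```python
# Hypothetical sketch of the spill decision for callee-saved registers.
def registers_to_spill(callee_saved_used, noreturn, nounwind):
    """Return the callee-saved registers that must be saved on entry."""
    if noreturn and nounwind:
        # Control never goes back to the caller, so nothing needs restoring.
        return []
    return list(callee_saved_used)
```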
It's failing on the bots and I'm not sure why.
This reverts:
[X86] Synchronize the SchedRW on some EVEX instructions with their VEX equivalents.
[X86] Use WriteFShuffle256 for VEXTRACTF128 to be consistent with VEXTRACTI128 which uses WriteShuffle256.
[X86] Remove some InstRWs for plain store instructions on Sandy Bridge.
[X86] Auto-generate complete checks. NFC
llvm-svn: 329256
We were forcing the latency of these instructions to 5 cycles, but every other scheduler model had them as 1 cycle. I'm sure I didn't get everything, but this gets a big portion.
llvm-svn: 329252
The implementation of shadow call stack on aarch64 is quite different to
the implementation on x86_64. Instead of reserving a segment register for
the shadow call stack, we reserve the platform register, x18. Any function
that spills lr to sp also spills it to the shadow call stack, a pointer to
which is stored in x18.
Differential Revision: https://reviews.llvm.org/D45239
llvm-svn: 329236
Summary:
These new image intrinsics contain the texture type as part of
their name and have each component of the address/coordinate as
individual parameters.
This is a preparatory step for implementing the A16 feature, where
coordinates are passed as half-floats or -ints, but the Z compare
value and texel offsets are still full dwords, making it difficult
or impossible to distinguish between A16 on or off in the old-style
intrinsics.
Additionally, these intrinsics pass the 'texfailpolicy' and
'cachectrl' as i32 bit fields to reduce operand clutter and allow
for future extensibility.
v2:
- gather4 supports 2darray images
- fix a bug with 1D images on SI
Change-Id: I099f309e0a394082a5901ea196c3967afb867f04
Reviewers: arsenm, rampitec, b-sumner
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D44939
llvm-svn: 329166
Summary:
When an i1-value is defined inside of a loop and used outside of it, we
cannot simply use the SGPR bitmask from the loop's last iteration.
There are also useful and correct cases of an i1-value being copied between
basic blocks, e.g. when a condition is computed outside of a loop and used
inside it. The concept of dominators is not sufficient to capture what is
going on, so I propose the notion of "lane-dominators".
Fixes a bug encountered in Nier: Automata.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103743
Change-Id: If37b969ddc71d823ab3004aeafb9ea050e45bd9a
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D40547
llvm-svn: 329164
Recommitting rL321259. Previously this caused an issue with PPCBE but
I didn't receive a reproducer and didn't have the time to follow up.
If the issue appears again, please provide a reproducer so I can fix
it.
Original commit message:
If the SRL node is only used by an AND, we may be able to set the
ExtVT to the width of the mask, making the AND redundant. To support
this, another check has been added in isLegalNarrowLoad which queries
whether the load is valid.
Differential Revision: https://reviews.llvm.org/D41350
llvm-svn: 329160
Summary:
The ShadowCallStack pass instruments functions marked with the
shadowcallstack attribute. The instrumented prolog saves the return
address to [gs:offset] where offset is stored and updated in [gs:0].
The instrumented epilog loads/updates the return address from [gs:0]
and checks that it matches the return address on the stack before
returning.
Reviewers: pcc, vitalybuka
Reviewed By: pcc
Subscribers: cryptoad, eugenis, craig.topper, mgorny, llvm-commits, kcc
Differential Revision: https://reviews.llvm.org/D44802
llvm-svn: 329139
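The prolog/epilog behaviour described in r329139 above can be mimicked with a small simulation. This is a hypothetical Python model of the idea only: the class and method names are assumptions, and a Python list stands in for the memory addressed through the segment register.

```python
class ShadowCallStackError(Exception):
    """Raised when the saved and on-stack return addresses differ."""

class ShadowStack:
    def __init__(self):
        self._slots = []  # stands in for the separate shadow stack memory

    def prolog(self, return_address):
        # Save a copy of the return address on the shadow stack.
        self._slots.append(return_address)

    def epilog(self, return_address_on_stack):
        # Reload the saved copy and compare it with the regular stack's value.
        saved = self._slots.pop()
        if saved != return_address_on_stack:
            raise ShadowCallStackError("return address mismatch")
        return saved
```

A corrupted return address on the regular stack would then be caught in `epilog` before the function returns.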
This commit is similar to r329120, but uses the existing getUsesRedZone() function
in X86MachineFunctionInfo. This teaches the outliner to look at whether or not a
function *truly* uses a redzone instead of just the noredzone attribute on a
function.
Thus, after this commit, it's possible to outline from x86 without using
-mno-red-zone and still get outlining results.
This also adds a new test for the new redzone behaviour.
llvm-svn: 329134
Summary: There are no packed instructions for min3 or max3. So, performMinMaxCombine should not optimize vectors of f16 to min3/max3.
Author: FarhanaAleen
Reviewed By: arsenm
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D45219
llvm-svn: 329131
This patch adds a hasRedZone() function to AArch64MachineFunctionInfo. It
returns true if the function is known to use a redzone, false if it is known
to not use a redzone, and no value otherwise.
This removes the requirement to pass -mno-red-zone when outlining for AArch64.
https://reviews.llvm.org/D45189
llvm-svn: 329120
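The three-state result described in r329120 above (true, false, or no value) maps naturally onto an optional boolean. The sketch below is a hypothetical Python illustration; the function name is invented here, and treating the "no value" case conservatively as if a red zone were in use is an assumption rather than something the commit message states.

```python
from typing import Optional

# Hypothetical sketch: None models "not known whether a red zone is used".
def is_safe_to_outline(has_red_zone: Optional[bool]) -> bool:
    # Assumption: only a definite "no red zone" answer permits outlining.
    return has_red_zone is False
```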
The linkage type on outlined functions was private before. This meant that if
you set a breakpoint in an outlined function, the debugger wouldn't be able to
give a sane name to the outlined function.
This commit changes the linkage type to internal and updates any tests that
relied on the prefixes on the names of outlined functions.
llvm-svn: 329116
We don't constant fold any of these, but we could...but if we
do, we must produce the right answer.
Unlike the IR fptosi instruction or its DAG node counterpart
ISD::FP_TO_SINT, these are not undef for an out-of-range input.
llvm-svn: 329100
fact use regular expression syntax to use regular expressions.
Should restore the bots. Sorry for the noise on this test.
Thanks to Philip for spotting the bug!
llvm-svn: 329057
This adds the basic test cases from all the EFLAGS bugs in more direct
forms. It also switches to generated check lines, and includes both
32-bit and 64-bit variations.
No functionality changing here, just setting things up to have a nice
clean asm diff in my EFLAGS patch.
llvm-svn: 329056
do explicit scrubbing of the offsets of stack spills and reloads.
You can always turn this off in order to test specific stack slot usage.
We were already hiding most of this, but the new logic hides it more
generically. Notably, we should effectively hide stack slot churn in
functions that have a frame pointer now, and should also hide it when
changing a function from stack pointer to frame pointer. That transition
already changes enough to be clearly noticed in the test case diff,
showing *every* spill and reload is really noisy without benefit. See
the test case I ran this on as a classic example.
llvm-svn: 329055
Commit 37962a331c77 ("bpf: Improve expanding logic in LowerSELECT_CC")
intended to improve code quality for certain jmp conditions. The
commit, however, has a couple of issues:
(1). In the code, swapping the operands alone is not enough;
the ConditionalCode CC should also be swapped, otherwise
incorrect code will be generated.
(2). The ConditionalCode swap should be subject to
getHasJmpExt(). If getHasJmpExt() is False, certain
conditional codes will not be supported and swap
may generate incorrect code.
The original goal for this patch is to optimize jmp operations
that do not have JmpExt turned on. If JmpExt is on,
better code could be generated. For example, the test
select_ri.ll is introduced to demonstrate the optimization.
The same result can be achieved with -mcpu=v2 flag.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
llvm-svn: 329043
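Issue (1) in r329043 above is the general rule that swapping the operands of a comparison also requires swapping the condition code. The Python sketch below checks that rule on a few values; the string condition codes and helper names are assumptions for illustration and do not correspond to BPF backend identifiers.

```python
import operator

# Hypothetical condition codes mapped to their comparison functions.
COMPARE = {
    "lt": operator.lt, "gt": operator.gt,
    "le": operator.le, "ge": operator.ge,
    "eq": operator.eq, "ne": operator.ne,
}

# Condition code to use after the two operands have been exchanged.
SWAPPED_CC = {"lt": "gt", "gt": "lt", "le": "ge", "ge": "le", "eq": "eq", "ne": "ne"}

def swap_operands(lhs, rhs, cc):
    """Exchange the operands and adjust the condition code to match."""
    return rhs, lhs, SWAPPED_CC[cc]

def evaluate(lhs, rhs, cc):
    return COMPARE[cc](lhs, rhs)
```

Swapping only the operands while keeping the original code would turn `1 < 2` into `2 < 1`, which is the kind of miscompile the commit describes.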
Just adds basic block labels and tidies up where comments go in the test
case and then generates fresh CHECK lines with the script. This way, the
check lines are much easier to maintain. They were already close to this
but not quite there.
llvm-svn: 329040
If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to directly obtain the data instead of accessing the data from cache or memory.
A "store forward block" occurs in cases where a store cannot be forwarded to the load. The most typical case of a store forward block on the Intel Core microarchitecture is that a small store cannot be forwarded to a large load.
The estimated penalty for a store forward block is ~13 cycles.
This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence
of a load and a store.
The pass currently only handles cases where memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies.
Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
Differential revision: https://reviews.llvm.org/D41330
Change-Id: Ib48836ccdf6005989f7d4466fa2035b7b04415d9
llvm-svn: 328973
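The idea of breaking a wide copy into smaller ones, as described in r328973 above, can be sketched as follows: the copy is split so that one piece covers exactly the bytes written by the earlier, smaller store, which lets that piece be forwarded. This Python model is a simplified assumption about the approach, not the pass's actual algorithm; the function name, the 8-byte maximum piece width, and the lack of alignment handling are all choices made for illustration.

```python
# Hypothetical sketch: split a copy of bytes [0, copy_size) around an earlier
# store that wrote bytes [store_offset, store_offset + store_size).
def split_copy(copy_size, store_offset, store_size):
    """Return (offset, width) pieces; assumes the store lies inside the copy."""
    pieces = []
    store_end = store_offset + store_size
    segments = ((0, store_offset), (store_offset, store_end), (store_end, copy_size))
    for start, end in segments:
        pos = start
        while pos < end:
            # Pick the largest power-of-two width (at most 8 bytes) that fits.
            width = 8
            while width > end - pos:
                width //= 2
            pieces.append((pos, width))
            pos += width
    return pieces

print(split_copy(16, 4, 4))
```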
fptosi / fptoui round towards zero, and that's the same behavior as ISD::FTRUNC,
so replace a pair of casts with the equivalent node. We don't have to account for
special cases (NaN, INF) because out-of-range casts are undefined.
Differential Revision: https://reviews.llvm.org/D44909
llvm-svn: 328921
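The equivalence used in r328921 above, that a float-to-integer-to-float round trip rounds toward zero just like a truncation, can be checked on a few in-range values. In the Python sketch below, `int()` stands in for fptosi and the floor/ceil formulation stands in for ISD::FTRUNC; both stand-ins and the function names are assumptions for illustration. The comparison uses `==`, which does not distinguish signed zeros.

```python
import math

def cast_round_trip(x: float) -> float:
    # Python's int() truncates toward zero for finite, in-range floats.
    return float(int(x))

def truncate(x: float) -> float:
    # Round toward zero: floor for non-negative inputs, ceil for negative ones.
    return float(math.floor(x) if x >= 0 else math.ceil(x))
```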
Summary:
It seems many CPUs don't implement this instruction as well as the other vector multiplies, often using a multi-uop flow. Silvermont in particular has a 7 uop flow with 11 cycle throughput. Sandy Bridge implements it as a single uop with 5 cycle latency and 1 cycle throughput. But Haswell and later use 2 uops with 10 cycle latency and 2 cycle throughput.
This patch adds a new X86SchedWritePair we can use to tag this instruction separately. I've provided correct information for Silvermont, Btver2, and Sandy Bridge. I've removed the InstRWs for SandyBridge. I've left Haswell/Broadwell/Skylake InstRWs in place because I wasn't sure how to account for the different load latency between 128 and 256 bits. I also left Znver1 InstRWs in place because the existing values don't match Agner's spreadsheet.
I also left a FIXME in the SandyBridge model because using it as the "generic" model is too optimistic for the 256/512-bit versions, since those are multiple uops on all known CPUs.
Reviewers: RKSimon, GGanesh, courbet
Reviewed By: RKSimon
Subscribers: gchatelet, gbedwell, andreadb, llvm-commits
Differential Revision: https://reviews.llvm.org/D44972
llvm-svn: 328914
Two memory instructions with a dependency only on the address register
between the two (the first one of them being post-increment) can be
packetized together after the offset on the second was updated to the
increment value. Make sure that the new offset is valid for the
instruction.
llvm-svn: 328897
The code has bugs dealing with -0.0.
Since D44550 introduced FABS pattern folding in InstCombine,
this patch removes the now-redundant code that causes
https://bugs.llvm.org/show_bug.cgi?id=36600.
Patch by Mikhail Dvoretckii!
Differential Revision: https://reviews.llvm.org/D44683
llvm-svn: 328872
Summary:
The phase attempts to transform operations that extract a portion of a value
into an SDWA src operand in cases where that value is used only once. It
was not prepared for this use to be the preserved portion of a value for
dst:UNUSED_PRESERVE, resulting in a crash or assert.
This change either rejects the illegal SDWA attempt, or in the case where
dst:WORD_1 and the src_sel would be WORD_0, removes the unneeded
extract instruction.
Reviewers: arsenm, #amdgpu
Reviewed By: arsenm, #amdgpu
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D44364
llvm-svn: 328856
MachineCopyPropagation::CopyPropagateBlock has a bunch of special
handling for COPY instructions. This handling assumes that COPY
instructions do not modify the source of the copy; this is wrong if
the COPY destination overlaps the source.
To fix the bug, check explicitly for this situation, and fall back to
the generic instruction handling.
This bug can't happen for most register classes because they don't
have this sort of overlap, but there are a few register classes
where this is possible. The testcase uses the AArch64 QQQQ register
class.
Differential Revision: https://reviews.llvm.org/D44911
llvm-svn: 328851
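The overlap check described in r328851 above can be pictured by modelling each register tuple as the set of underlying registers it covers. The Python sketch below is a hypothetical model: the helper names are invented, and a four-register tuple is represented simply by the index of its first register.

```python
# Hypothetical sketch: a 4-register tuple starting at index n covers
# registers n, n+1, n+2, n+3 (for example Q1_Q2_Q3_Q4 starts at 1).
def register_units(tuple_start, width=4):
    return set(range(tuple_start, tuple_start + width))

def copy_clobbers_source(dst_units, src_units):
    """True if writing the destination also modifies part of the source."""
    return bool(dst_units & src_units)

# Copying Q0_Q1_Q2_Q3 into Q1_Q2_Q3_Q4 overwrites Q1..Q3 of the source.
print(copy_clobbers_source(register_units(1), register_units(0)))
```

When the check reports an overlap, the special COPY handling's assumption no longer holds, which is why the fix falls back to the generic path.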
While the stack access instructions don't care about
alignment > 4, some transformations on the pointer calculation
do make assumptions based on knowing the low bits of a pointer
are 0. If a stack object ends up being accessed through its
absolute address (relative to the kernel scratch wave offset),
the addressing expression may depend on the stack frame being
properly aligned. This was breaking in a testcase due to the
add->or combine.
I think some of the SP/FP handling logic is still backwards,
and overly simplistic to support all of the stack features.
Code which tries to modify the SP with inline asm for example
or variable sized objects will probably require redoing this.
llvm-svn: 328831
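The add->or combine mentioned in r328831 above is only valid when the low bits of the base really are zero, which is why a misaligned stack frame breaks it. A small Python check of that arithmetic fact follows; the function name and the example addresses are assumptions for illustration.

```python
# Hypothetical sketch: replacing base + offset with base | offset is only
# correct when the bits that offset occupies are all zero in base.
def add_equals_or(base, offset):
    return base + offset == (base | offset)

aligned_base = 0x100      # low 8 bits are zero
misaligned_base = 0x104   # bit 2 is set although alignment was assumed

print(add_equals_or(aligned_base, 4))     # the fold is valid here
print(add_equals_or(misaligned_base, 4))  # the fold gives a wrong address here
```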
These instructions have the memory operand before the register operand. So we need to put ReadDefault for all the load ops first, then the ReadAfterLd.
Differential Revision: https://reviews.llvm.org/D44838
llvm-svn: 328823
DWARF v5 specifies that the root file (also given in the DW_AT_name
attribute of the compilation unit DIE) should be emitted explicitly to
the line table's list of files. This makes the line table more
independent of the .debug_info section.
We emit the new syntax only for DWARF v5 and later.
Fixes the bug found by asan. Also XFAIL the new test for Darwin, which
is stuck on DWARF v2, and fix up other tests so they stop failing on
Windows. Last but not least, don't break "clang -g" of an assembler
file that has .file directives in it.
Differential Revision: https://reviews.llvm.org/D44054
llvm-svn: 328805
Summary: Mark the CFG as preserved since this pass does not make any changes to the CFG.
Reviewers: sebpop, mzolotukhin, mcrosier
Reviewed By: mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D44845
llvm-svn: 328727
These instructions have been around for a long time, but we
haven't supported intrinsics for them. The "new" versions use
the CSx register for the start of the buffer instead of the K
field in the Mx register.
We need to use pseudo instructions for these instructions until
after register allocation. The problem is that these instructions
allocate a M0/CS0 or M1/CS1 pair. But, we can't generate code for
the CSx set-up until after register allocation when the Mx
register has been fixed for the instruction.
There is a related clang patch.
Patch by Brendon Cahoon.
llvm-svn: 328724
This commit simplifies the call outlining logic by removing references to the
Function associated with the callee. To do this, it requires that valid
callee save info is available to the outliner.
llvm-svn: 328719
This reverts commit r328676.
Commit r328676 broke the -no-integrated-as flag necessary to build Linux kernel with Clang:
$ cat t.c
void foo() {}
$ clang -no-integrated-as -c t.c -g
/tmp/t-dcdec5.s: Assembler messages:
/tmp/t-dcdec5.s:8: Error: file number less than one
clang-7.0: error: assembler command failed with exit code 1 (use -v to see invocation)
llvm-svn: 328699
This reverts commit 0daf86291d3aa04d3cc280cd0ef24abdb0174981.
It was causing an assert in test/CodeGen/AMDGPU/amdpal.ll only on a
release-with-asserts build. I will resubmit the change when I have fixed
that.
Change-Id: If270594eba27a7dc4076bdeab3fa8e6bfda3288a
llvm-svn: 328695
Follow up patch of r328313 to support the UseVMOVSR constraint. Removed
some unneeded instructions from the test and removed some stray
comments.
Differential Revision: https://reviews.llvm.org/D44941
llvm-svn: 328691
Summary:
RegisterCoalescer::removePartialRedundancy tries to hoist B = A from
BB0/BB2 to BB1:
  BB1:
       ...
  BB0/BB2:  ----
       B = A;   |
       ...      |
       A = B;   |
         |-------
         |
It does so if a number of conditions are fulfilled. However, it failed
to check if B was used by any of the terminators in BB1. Since we must
insert B = A before the terminators (since it's not a terminator itself),
this means that we could erroneously insert a new definition of B before a
use of it.
Reviewers: wmi, qcolombet
Reviewed By: wmi
Subscribers: MatzeB, llvm-commits, sdardis
Differential Revision: https://reviews.llvm.org/D44918
llvm-svn: 328689
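The missing condition described in r328689 above amounts to checking that no terminator of the target block reads the register being defined. The Python sketch below models that check in isolation; the function name and the representation of terminators as sets of register names are assumptions, and the other hoisting conditions are not modelled.

```python
# Hypothetical sketch: terminator_uses holds, for each terminator of BB1,
# the set of registers that terminator reads.
def can_hoist_copy_before_terminators(dest_reg, terminator_uses):
    """The copy 'dest_reg = ...' must be placed before all terminators, so it
    is only safe if none of them reads dest_reg's previous value."""
    return all(dest_reg not in uses for uses in terminator_uses)
```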
DWARF v5 specifies that the root file (also given in the DW_AT_name
attribute of the compilation unit DIE) should be emitted explicitly to
the line table's list of files. This makes the line table more
independent of the .debug_info section.
Fixes the bug found by asan. Also XFAIL the new test for Darwin, which
is stuck on DWARF v2, and fix up other tests so they stop failing on
Windows. Last but not least, don't break "clang -g" of an assembler
file that has .file directives in it.
Differential Revision: https://reviews.llvm.org/D44054
llvm-svn: 328676
If an ADRP appears with, say, a CPI operand, we shouldn't outline it.
This moves the check for unsafe operands so that it occurs before the special-case
for ADRPs. Also add a test for outlining ADRPs.
llvm-svn: 328674
Summary:
For OS type AMDPAL, the scratch descriptor is loaded from offset 0 of
the GIT, whose 32 bit pointer is in s0 (s8 for gfx9 merged shaders).
This commit fixes that to use offset 0x10 instead of offset 0 for a
compute shader, per the PAL ABI spec.
Reviewers: kzhuravl, nhaehnle, timcorringham
Subscribers: kzhuravl, wdng, yaxunl, t-tye, llvm-commits, dstuttard, nhaehnle, arsenm
Differential Revision: https://reviews.llvm.org/D44468
Change-Id: I93dffa647758e37f613bb5e0dfca840d82e6d26f
llvm-svn: 328673
If a given split type unit does not have source locations, don't have
it refer to the split line table.
If no split type unit refers to the split line table, don't emit the
line table at all.
This will save a little space on rare occasions, but also refactors
things a bit to improve which class is responsible for what.
Responding to review comments on r326395.
Differential Revision: https://reviews.llvm.org/D44220
llvm-svn: 328670
Summary:
Rev 327580 "[CodeGen] Use MIR syntax for MachineMemOperand printing"
broke -print-machineinstrs for us on AMDGPU, because we have custom
pseudo source values, and MIR serialization does not implement that.
This commit at least restores the functionality of -print-machineinstrs,
even if it does not properly implement the missing MIR serialization
functionality.
Differential Revision: https://reviews.llvm.org/D44871
Change-Id: I44961c0b90bf6d48c01484ed7a4e466fd300db66
llvm-svn: 328668
Currently MOVMSK instructions use the WriteVecLogic class, which is a very poor choice given that MOVMSK involves an SSE->GPR transfer.
Differential Revision: https://reviews.llvm.org/D44924
llvm-svn: 328664
Before this was not done if the function had no calls in it. This
is still a possible issue with any callable function, regardless
of calls present.
llvm-svn: 328659
The combine on a select of a load only triggers for
addrspace 0, and discards the MachinePointerInfo. The
conservative default needs to be used for this.
llvm-svn: 328652
In a function, s5 is used as the frame base SGPR. If a function
is calling another function, during the call sequence
it is copied to a preserved SGPR and restored.
Before it was possible for the scheduler to move stack operations
before the restore of s5, since there's nothing to associate
a frame index access with the restore.
Add an implicit use of s5 to the adjcallstack pseudo which ends
the call sequence to prevent this from happening. I'm not 100%
satisfied with this solution, but I'm not sure what else would be
better.
llvm-svn: 328650