This patch adds functions to allow MachineLICM to hoist invariant stores.
Currently, MachineLICM does not hoist any store instructions, however
when storing the same value to a constant spot on the stack, the store
instruction should be considered invariant and be hoisted. The function
isInvariantStore iterates over each operand of the store instruction and checks
that each register operand satisfies isCallerPreservedPhysReg. The store
may be fed by a copy, which is hoisted by isCopyFeedingInvariantStore.
This patch also adds the PowerPC changes needed to consider the stack
register as caller preserved.
Differential Revision: https://reviews.llvm.org/D40196
llvm-svn: 328326
Loads and stores can only shift the offset register by the size of the value
being loaded, but currently the DAGCombiner will reduce the width of the load
if it's followed by a trunc making it impossible to later combine the shift.
Solve this by implementing shouldReduceLoadWidth for the AArch64 backend and
making it prevent the width reduction if this is what would happen, though still
allow it if reducing the load width will let us eliminate a later sign or zero
extend.
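As a hedged illustration (the function and names below are hypothetical, not taken from the patch), a 64-bit load whose result is truncated is the kind of pattern affected:
```
unsigned int low_word(unsigned long long *table, unsigned long long i) {
  /* The 64-bit load can use the [x0, x1, lsl #3] addressing mode; narrowing
     it to 32 bits for the truncating use would make that shifted offset
     impossible to fold later. */
  return (unsigned int)table[i];
}
```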
Differential Revision: https://reviews.llvm.org/D44794
llvm-svn: 328321
When targeting execute-only and fp-armv8, float constants in a compare
resulted in instruction selection failures. This is now fixed by using
vmov.f32 where possible, otherwise the floating point constant is
lowered into an integer constant that is moved into a floating point
register.
This patch also restores using fpcmp with immediate 0 under fp-armv8.
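A minimal sketch of the kind of source that exercises this (hypothetical example, assuming execute-only and fp-armv8 are enabled):
```
int above_half(float x) {
  /* The 0.5f constant cannot come from a literal pool under execute-only;
     it is materialized with vmov.f32 where the encoding allows, otherwise
     via an integer constant moved into a floating point register. */
  return x > 0.5f;
}
```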
Change-Id: Ie87229706f4ed879a0c0cf66631b6047ed6c6443
llvm-svn: 328313
This was being masked because GISel is enabled by default for -O0 and
the abort was disabled. Modified test to explicitly enable abort.
llvm-svn: 328311
The VMOVMSKBrr was in a separate InstRW with a lower latency, but I assume they should be the same, and the higher latency matches Agner's table, so I'm going with that.
llvm-svn: 328291
This makes the Y position consistent with other instructions.
This should have been NFC, but while refactoring the multiclass I noticed that VROUNDPD memory forms were using the register itinerary.
llvm-svn: 328254
In our real world application, we found the following optimization is missed in DAGCombiner
(zext (and/or/xor (shl/shr (load x), cst), cst)) -> (and/or/xor (shl/shr (zextload x), (zext cst)), (zext cst))
If the user of original zext is an add, it may enable further lea optimization on x86.
This patch adds a new function CombineZExtLogicopShiftLoad to do this optimization.
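A hedged example of C source that could produce this DAG shape (names are illustrative only):
```
unsigned long long index_of(unsigned int *p, unsigned long long base) {
  unsigned int v = (*p >> 2) & 0xff;  /* (and (shr (load x), cst), cst) */
  return base + v;                    /* the zext feeds an add, enabling lea */
}
```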
Differential Revision: https://reviews.llvm.org/D44402
llvm-svn: 328252
Summary:
This pass sinks COPY instructions into a successor block, if the COPY is not
used in the current block and the COPY is live-in to a single successor
(i.e., doesn't require the COPY to be duplicated). This avoids executing the
copy on paths where its result isn't needed. This also exposes
additional opportunities for dead copy elimination and shrink wrapping.
These copies were either not handled by or are inserted after the MachineSink
pass. As an example of the former case, the MachineSink pass cannot sink
COPY instructions with allocatable source registers; for AArch64 this type
of copy instruction is frequently used to move function parameters (PhysReg)
into virtual registers in the entry block.
For the machine IR below, this pass will sink %w19 from the entry block into its
successor (%bb.1) because %w19 is only live-in in %bb.1.
```
%bb.0:
%wzr = SUBSWri %w1, 1
%w19 = COPY %w0
Bcc 11, %bb.2
%bb.1:
Live Ins: %w19
BL @fun
%w0 = ADDWrr %w0, %w19
RET %w0
%bb.2:
%w0 = COPY %wzr
RET %w0
```
As we sink %w19 (CSR in AArch64) into %bb.1, the shrink-wrapping pass will be
able to see %bb.0 as a candidate.
With this change I observed 12% more shrink-wrapping candidates and 13% more dead copies deleted in spec2000/2006/2017 on AArch64.
Reviewers: qcolombet, MatzeB, thegameg, mcrosier, gberry, hfinkel, john.brawn, twoh, RKSimon, sebpop, kparzysz
Reviewed By: sebpop
Subscribers: evandro, sebpop, sfertile, aemerson, mgorny, javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D41463
llvm-svn: 328237
As in the SystemZ backend, correctly propagate node ids when inserting new
unselected nodes into the DAG during instruction selection for the X86
target.
Fixes PR36865.
Reviewers: jyknight, craig.topper
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D44797
llvm-svn: 328233
This is needed for the upcoming implementation of the
new 8x32x16 and 32x8x16 variants of WMMA instructions
introduced in CUDA 9.1.
Differential Revision: https://reviews.llvm.org/D44719
llvm-svn: 328158
The pipeliner needs to remove instructions from the SlotIndexes
structure when they are deleted. Otherwise, the SlotIndexes map
has stale data, and an assert will occur when adding new
instructions.
This patch also changes the pipeliner to make the back-edge of
a loop carried dependence 1 cycle. The 1 cycle latency is added
to the anti-dependence that represents the back-edge. This
change eliminates a couple of hacks added to the pipeliner to
handle the latency of the back-edge. It is needed to correctly
pipeline the test case for the sub-register elimination pass.
llvm-svn: 328113
This patch also includes extensive tests targeted at select and br+fcmp IR
inputs. A sequence of br+fcmp required support for FPR32 registers to be added
to RISCVInstrInfo::storeRegToStackSlot and
RISCVInstrInfo::loadRegFromStackSlot.
llvm-svn: 328104
Also restrict to ports 0 and 1 for SkylakeClient. It looks like the scheduler models don't account for the client parts not having a full vector ALU on port 5 like the server parts.
Fixes PR36808.
llvm-svn: 328061
The default thread model for wasm is single, and in this mode thread-local
global variables can be lowered identically to non-thread-local variables.
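A minimal sketch, assuming the default single thread model for the wasm target:
```
_Thread_local int counter;  /* lowered exactly like a plain global */

int next_id(void) { return ++counter; }
```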
Differential Revision: https://reviews.llvm.org/D44703
llvm-svn: 328049
Mingw uses the same stack protector functions as GCC provides
on other platforms as well.
Patch by Valentin Churavy!
Differential Revision: https://reviews.llvm.org/D27296
llvm-svn: 328039
I'm not entirely sure these hacks are still needed. If you remove the hacks completely, the name of the library call that gets generated doesn't match the grep the test previously had. So the test wasn't really checking anything.
If the hack is still needed it belongs in PPC specific code. I believe the FP_TO_SINT code here is the only place in the tree where a FP_ROUND_INREG node is created today. And I don't think it's even being used correctly because the legalization returned a BUILD_PAIR with the same value twice. That doesn't seem right to me. By moving the code entirely to PPC we can avoid creating the FP_ROUND_INREG at all.
I replaced the grep in the existing test with full checks generated by hacking update_llc_test_checks.py to support ppc32 just long enough to generate it.
Differential Revision: https://reviews.llvm.org/D44061
llvm-svn: 328017
Registers E[A-D]X, E[SD]I, E[BS]P, and EIP have 16-bit subregisters
that cover the low halves of these registers. This change adds artificial
subregisters for the high halves in order to differentiate (in terms of
register units) between the 32- and the low 16-bit registers.
This patch contains parts that aim to preserve the calculated register
pressure. This is in order to preserve the current codegen (minimize the
impact of this patch). The approach of having artificial subregisters
could be used to fix PR23423, but the pressure calculation would need
to be changed.
Differential Revision: https://reviews.llvm.org/D43353
llvm-svn: 328016
This way we can support address-space specific variants without explicitly
encoding the space in the name of the intrinsic. Fewer intrinsics to deal with ->
less boilerplate.
Added a bit of tablegen magic to match/replace an intrinsic with a pointer
argument in a particular address space with the space-specific instruction
variant.
Updated tests to use non-default address spaces.
Differential Revision: https://reviews.llvm.org/D43268
llvm-svn: 328006
TopReadyCycle and BotReadyCycle were off by one cycle when an SU is either
the first instruction or the last instruction in a packet.
Patch by Ikhlas Ajbar.
llvm-svn: 328000
Summary:
Currently X-Ray Instrumentation pass has a dependency on MachineLoopInfo
(and thus on MachineDominatorTree as well) and we have to compute them
even if X-Ray is not used. This patch changes it to a lazy computation
to save compile time by avoiding these redundant computations.
Reviewers: dberris, kubamracek
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D44666
llvm-svn: 327999
The test is taken from
https://github.com/avr-rust/rust/issues/57
The original implementation of struct return lowering was made in
r325474.
Patch by Peter Nimmervoll
llvm-svn: 327967
In these cases, both parameters and return values are passed
as a pointer to a stack allocation.
MSVC doesn't use the f80 data type at all, while it is used
for long doubles on mingw.
Normally, this part of the calling convention is handled
within clang, but for intrinsics that are lowered to libcalls,
it may need to be handled within llvm as well.
Differential Revision: https://reviews.llvm.org/D44592
llvm-svn: 327957
They were incorrectly marked as RMW operations. Some of the CMP instructions worked, but the ones that use a similar encoding as the RMW form of ADD ended up marked as RMW.
TEST used the same tablegen class as some of the CMPs.
llvm-svn: 327947
When scanning the function for CSR uses and defs, also check whether
the basic blocks are landing pads.
Consider that landing pads need the CSRs to be properly set.
That way we force the prologue/epilogue to always be pushed out
of the problematic "throw" region. The "throw" region is
problematic because the jumps are not properly modeled.
Fixes PR36513
llvm-svn: 327942
E.g.
void foo(char *);
void bar(int x)
{
  char p[x];   /* variable size object */
  foo(p);      /* outgoing arguments for the call to foo are pushed here */
}
We need to generate stack adjustment instructions for outgoing arguments in
eliminateCallFramePseudoInstr when the function contains variable size
objects, to avoid the outgoing arguments corrupting the variable size object.
The default hasReservedCallFrame returns !hasFP().
We don't want to generate extra sp adjustment instructions when hasFP()
returns true, so we override hasReservedCallFrame to return !hasVarSizedObjects().
Differential Revision: https://reviews.llvm.org/D43752
llvm-svn: 327938
Summary:
DbgValue nodes were not transferred when integer DAG nodes were promoted. For example, if an i32 add node was promoted to an i64 add node by DAGTypeLegalizer::PromoteIntegerResult(), its DbgValue node was not transferred to the new node. The simple fix is to update SetPromotedInteger() to transfer DbgValues.
Add AArch64/dbg-value-i8.ll to test this change and fix ARM/debug-info-d16-reg.ll which had the wrong DILocalVariable nodes with arg numbers even though they are not for function parameters.
Patch by Se Jong Oh!
Reviewers: vsk, JDevlieghere, aprantl
Reviewed By: JDevlieghere
Subscribers: javed.absar, kristof.beyls, llvm-commits
Tags: #debug-info
Differential Revision: https://reviews.llvm.org/D44546
llvm-svn: 327919
When outlining calls, the outliner needs to update CFI to ensure that, say,
exception handling works. This commit adds that functionality and adds a test
just for call outlining.
Call outlining stuff in machine-outliner.mir should be moved into
machine-outliner-calls.mir in a later commit.
llvm-svn: 327917
This extends the use of this attribute on ARM and AArch64 from
SVN r325900 (where it was only checked for fixed stack
allocations on ARM/AArch64, but for all stack allocations on X86).
This also adds a testcase for the existing use of disabling the
fixed stack probe with the attribute on ARM and AArch64.
Differential Revision: https://reviews.llvm.org/D44291
llvm-svn: 327897
PR35590 was already filed for this information being wrong. It's probably better to default to WriteSystem behavior instead of using something completely wrong.
llvm-svn: 327882
JRCXZ was already present, but not the others.
We never codegen these instructions, so this doesn't affect much; it just gets them all into a single generated scheduler class in the output.
llvm-svn: 327881
The regex was looking for JECXZ_32 or JECXZ_64, but there is just one instruction called JECXZ. They used to exist as separate instructions, but were merged over 3 years ago.
llvm-svn: 327880
PowerPC targets do not use address spaces. As a result, we can get selection
failures with address space casts. This patch makes those casts noops.
Patch by Valentin Churavy.
Differential revision: https://reviews.llvm.org/D43781
llvm-svn: 327877
With the SRAs removed from the SSE2 code in D44267, there doesn't appear to be any advantage to the SSE41 code. The punpcklbw instruction and pmovsx seem to have the same latency and throughput on most CPUs. And the SSE41 code requires moving the upper 64 bits into the lower 64 bits before the sign extend can be done. The unpckhbw in the SSE2 code can do better than that.
llvm-svn: 327869
Sometimes we used the same itinerary for MEM and REG forms, but that seems inconsistent with our usual usage.
We also used the MUL8 itinerary for MULX32/64 which was also weird.
The test changes are because we were using IIC_IMUL32_RR and IIC_IMUL64_RR instead of IIC_IMUL32_REG/IIC_IMUL64_REG for the 32 and 64 bit multiplies that produce a double width result.
llvm-svn: 327866
Summary:
This patch prevents DBG_VALUE instructions from influencing
LivePhysRegs::stepBackwards and stepForwards. In at least one case,
specifically branch folding, the stepBackwards logic was having an
influence on code generation. The result was that certain code
compiled with '-g -O2' would differ from that compiled with '-O2'
alone. It seems that the original logic, accounting for DBG_VALUE,
was influencing the placement of an IMPLICIT_DEF which had a later
impact on how blocks were processed in branch folding.
Reviewers: kparzysz, MatzeB
Reviewed By: kparzysz
Subscribers: bjope, llvm-commits
Tags: #debug-info
Differential Revision: https://reviews.llvm.org/D43850
llvm-svn: 327862
This patch adds functions to allow MachineLICM to hoist invariant stores.
Currently, MachineLICM does not hoist any store instructions, however
when storing the same value to a constant spot on the stack, the store
instruction should be considered invariant and be hoisted. The function
isInvariantStore iterates over each operand of the store instruction and checks
that each register operand satisfies isCallerPreservedPhysReg. The store
may be fed by a copy, which is hoisted by isCopyFeedingInvariantStore.
This patch also adds the PowerPC changes needed to consider the stack
register as caller preserved.
Differential Revision: https://reviews.llvm.org/D40196
llvm-svn: 327856
Currently the WriteResPair-style multi-classes take a single pipeline stage and latency; this patch generalizes this to make it easier to create complex schedules where ResourceCycles and NumMicroOps can be overridden from their defaults.
This has already been done for the Jaguar scheduler to remove a number of custom schedule classes and adding it to the other x86 targets will make it much tidier as we add additional classes in the future to try and replace so many custom cases.
I've converted some instructions, but a lot of the models will need a bit of cleanup after the patch has been committed - memory latencies not being consistent, the class not actually being used when we could remove some/all custom entries, etc. I'd prefer to keep this as close to NFC as possible so later patches can be smaller and target specific.
Differential Revision: https://reviews.llvm.org/D44612
llvm-svn: 327855
1. Given that we already have a classification bucket with 'nop' in the name,
that's where 'nop' belongs. Right now, it's only used for prefix bytes and 'pause'.
2. Make the latency of this class '1' for Jaguar to tell the scheduler (and presumably
llvm-mca) how to model the resource requirements better even though a nop has no
dependencies.
Differential Revision: https://reviews.llvm.org/D44608
llvm-svn: 327853
Normally DCE kills these, but at -O0 these get left behind
leaving suspicious looking illegal copies.
Replace with IMPLICIT_DEF to avoid iterator issues.
llvm-svn: 327842
This is the groundwork for adding the Armv8.2-A FP16 vector intrinsics, which
uses v4f16 and v8f16 vector operands and return values. All the moving parts
are tested with two intrinsics, a 1-operand v8f16 and a 2-operand v4f16
intrinsic. In a follow-up patch the rest of the intrinsics and tests will be
added.
Differential Revision: https://reviews.llvm.org/D44538
llvm-svn: 327839
If DoneMBB becomes empty it must have CC added to its live-in list, since it
will fall-through into EndMBB. This happens when the CLC loop does the
complete range.
Review: Ulrich Weigand
llvm-svn: 327834
Also move ADC8i8 and SBB8i8 in the Sandy Bridge model to the same class as ADC8ri and SBB8ri. That seems more accurate since the 8i8 form is just the register forced to AL instead of coming from the modrm.
llvm-svn: 327820
This patch adds i128 division support by instructing LLVM to lower
128-bit divisions to the __udivmodti4 and __divmodti4 rtlib functions.
This also adds tests for 64-bit division and 128-bit division.
Patch by Peter Nimmervoll.
llvm-svn: 327814
The information was so wildly inaccurate and incomplete it's better to just remove it.
MMX_MASKMOVQ64 showed up twice in several scheduler models. In Haswell and Broadwell they were on adjacent lines. On Skylake the copies had different information.
MMX_MASKMOVQ and MASKMOVDQU were completely missing.
MMX_MASKMOVQ64 was listed on Haswell/Broadwell as 1 cycle on port 1 despite it being a store instruction.
Filed PR36780 to track fixing this right.
llvm-svn: 327783
X86 Supports Indirect Branch Tracking (IBT) as part of Control-Flow Enforcement Technology (CET).
IBT uses ENDBR instructions to mark the valid targets of an indirect call / jmp.
The `nocf_check` attribute has two roles in the context of X86 IBT technology:
1. Appertains to a function - do not add ENDBR instruction at the beginning of the function.
2. Appertains to a function pointer - do not track the target function of this pointer by adding nocf_check prefix to the indirect-call instruction.
This patch implements `nocf_check` context for Indirect Branch Tracking.
It also auto-generates `nocf_check` prefixes before indirect branches to jump tables that are guarded by range checks.
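A hedged sketch of both roles (hypothetical example; it assumes the Clang/GCC attribute spelling and that -fcf-protection=branch is in effect):
```
/* Role 1: no ENDBR is emitted at the start of this function. */
__attribute__((nocf_check)) void not_a_tracked_target(void) {}

/* Role 2: indirect calls through this pointer are not tracked, so the call
   is emitted with the nocf_check (notrack) prefix. */
void (*untracked_fp)(void) __attribute__((nocf_check)) = not_a_tracked_target;

void call_it(void) { untracked_fp(); }
```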
Differential Revision: https://reviews.llvm.org/D41879
llvm-svn: 327767
Improve/implement these methods to improve DAG combining. This mainly
concerns intrinsics.
Some constant operands to SystemZISD nodes have been marked Opaque to avoid
transforming back and forth between generic and target nodes infinitely.
Review: Ulrich Weigand
llvm-svn: 327765
The BITCAST handling in computeKnownBits() previously only worked for little
endian.
This patch reverses the iteration over elements for a big endian target which
allows this to work in this case also.
SystemZ test case.
Review: Eli Friedman
https://reviews.llvm.org/D44249
llvm-svn: 327764
At the point the outliner runs, KILLs don't impact anything, but they're still
considered unique instructions. This commit makes them invisible like
DebugValues so that they can still be outlined without impacting outlining
decisions.
llvm-svn: 327760
Avoid scheduling two loads in such a way that they would end up in the
same packet. If there is a load in a packet, try to schedule a non-load
next.
Patch by Brendon Cahoon.
llvm-svn: 327742
Now the Windows mangling modes ('w' and 'x') do not do any mangling for
symbols starting with '?'. This means that clang can stop adding the
hideous '\01' leading escape. This means LLVM debug logs are less likely
to contain ASCII escape characters and it will be easier to copy and
paste MS symbol names from IR.
Finally.
For non-Windows platforms, names starting with '?' still get IR
mangling, so once clang stops escaping MS C++ names, we will get extra
'_' prefixing on MachO. That's fine, since it is currently impossible to
construct a triple that uses the MS C++ ABI in clang and emits macho
object files.
Differential Revision: https://reviews.llvm.org/D7775
llvm-svn: 327734
We previously avoided inserting these moves during isel in a few cases, implemented using a whitelist of opcodes. But it's too difficult to generate a perfect list of opcodes to whitelist, especially with AVX512F without AVX512VL using 512 bit vectors to implement some 128/256 bit operations. Since isel is done bottom-up, we'd have to check the VT, opcode, and subtarget in order to determine whether an EXTRACT_SUBREG would be generated for some operations.
So instead of doing that, this patch adds a post-processing step that detects when the moves are unnecessary after isel. At that point any EXTRACT_SUBREGs would have already been created and appear in the DAG. So then we just need to ensure the input to the move isn't one.
Differential Revision: https://reviews.llvm.org/D44289
llvm-svn: 327724
This implements lowering of SELECT_CC for f16s, which enables
codegen of VSEL with f16 types.
Differential Revision: https://reviews.llvm.org/D44518
llvm-svn: 327695
Previously if getSetccResultType returned an illegal type we just fell back to using the default promoted type. This appears to have been to handle the case where for vectors getSetccResultType returns the input type, but the input type itself isn't legal and will need to be promoted. Without the legality check we would never reach a legal type.
But just picking the promoted type to be the setcc type can create strange setccs where the result type is 128 bits and the operand type is 256 bits: for example, if the result type was promoted to v8i16 from v8i1, but the input type was promoted from v8i23 to v8i32. We currently handle this with custom lowering code in X86.
This legality check also caused us to reject the getSetccResultType when the input type needed to be widened or split, even though that result wouldn't have caused legalization to get stuck.
This patch tries to fix this by detecting when the getSetccResultType needs to be promoted. If its input type also needs to be promoted we'll ask for a new setcc result type based on its eventual promoted value. Otherwise we fall back to the default type to promote to.
For any other illegal values we might get back from the initial call to getSetccResultType, we just keep them and allow them to be re-legalized later via splitting or widening or scalarizing.
llvm-svn: 327683
The FADD part of the addsub/subadd pattern can have its operands commuted, but when checking for fsubadd we were using the fadd as reference and commuting the fsub node.
llvm-svn: 327660
The code that creates fmsubadd from shuffle vector has some code to allow commuting the operands of the fadd node. This code was originally created when we only recognized fmaddsub. When fmsubadd support was added this code was not updated and is now commuting the fsub operands instead.
llvm-svn: 327659
PR35402 triggered this case. It bswaps and stores a 48-bit value; the current STBRX optimization transforms it into STBRX. Unfortunately 48-bit is not a simple MVT, there is no PPC instruction to support it, and it can't be automatically expanded by llvm, so this caused a crash.
This patch detects the non-simple MVT and returns early.
Differential Revision: https://reviews.llvm.org/D44500
llvm-svn: 327651
This patch adds new load/store instructions for integer scalar types
which can be used for X-Form when fed by add with an @tls relocation.
Differential Revision: https://reviews.llvm.org/D43315
llvm-svn: 327635
As discussed on D44428 and PR36726, this patch splits off WriteFMove/WriteVecMove, WriteFLoad/WriteVecLoad and WriteFStore/WriteVecStore scheduler classes to permit vectors to be handled separately from gpr/scalar types.
I've minimised the diff here by only moving various basic SSE/AVX vector instructions across - we can fix the rest when called for. This does fix the MOVDQA vs MOVAPS/MOVAPD discrepancies mentioned on D44428.
Differential Revision: https://reviews.llvm.org/D44471
llvm-svn: 327630
This is a follow up of the AArch64 FP16 intrinsics work;
the codegen tests had not been added yet.
Differential Revision: https://reviews.llvm.org/D44510
llvm-svn: 327624
There is no 512 bit addsub instruction, but we partially match it as part of fmaddsub matching. We explicitly bail out for 512 bit vectors after failing the fmaddsub match, but we had no test coverage for that bail out.
We might want to consider splitting and using 256 bit instructions instead of the long sequence seen here.
llvm-svn: 327605
Summary:
Local values are constants, global addresses, and stack addresses that
can't be folded into the instruction that uses them. For example, when
storing the address of a global variable into memory, we need to
materialize that address into a register.
FastISel doesn't want to materialize any given local value more than
once, so it generates all local value materialization code at
EmitStartPt, which always dominates the current insertion point. This
allows it to maintain a map of local value registers, and it knows that
the local value area will always dominate the current insertion point.
The downside is that local value instructions are always emitted without
a source location. This is done to prevent jumpy line tables, but it
means that the local value area will be considered part of the previous
statement. Consider this C code:
call1(); // line 1
++global; // line 2
++global; // line 3
call2(&global, &local); // line 4
Today we end up with assembly and line tables like this:
.loc 1 1
callq call1
leaq global(%rip), %rdi
leaq local(%rsp), %rsi
.loc 1 2
addq $1, global(%rip)
.loc 1 3
addq $1, global(%rip)
.loc 1 4
callq call2
The LEA instructions in the local value area have no source location and
are treated as being on line 1. Stepping through the code in a debugger
and correlating it with the assembly won't make much sense, because
these materializations are only required for line 4.
This is actually problematic for the VS debugger "set next statement"
feature, which effectively assumes that there are no registers live
across statement boundaries. By sinking the local value code into the
statement and fixing up the source location, we can make that feature
work. This was filed as https://bugs.llvm.org/show_bug.cgi?id=35975 and
https://crbug.com/793819.
This change is obviously not enough to make this feature work reliably
in all cases, but I felt that it was worth doing anyway because it
usually generates smaller, more comprehensible -O0 code. I measured a
0.12% regression in code generation time with LLC on the sqlite3
amalgamation, so I think this is worth doing.
There are some special cases worth calling out in the commit message:
1. local values materialized for phis
2. local values used by no-op casts
3. dead local value code
Local values can be materialized for phis, and this does not show up as
a vreg use in MachineRegisterInfo. In this case, if there are no other
uses, this patch sinks the value to the first terminator, EH label, or
the end of the BB if nothing else exists.
Local values may also be used by no-op casts, which adds the register to
the RegFixups table. Without reversing the RegFixups map direction, we
don't have enough information to sink these instructions.
Lastly, if the local value register has no other uses, we can delete it.
This comes up when fastisel tries two instruction selection approaches
and the first materializes the value but fails and the second succeeds
without using the local value.
Reviewers: aprantl, dblaikie, qcolombet, MatzeB, vsk, echristo
Subscribers: dotdash, chandlerc, hans, sdardis, amccarth, javed.absar, zturner, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D43093
llvm-svn: 327581
Get rid of the "; mem:" suffix and use the one we use in MIR: ":: (load 2)".
rdar://38163529
Differential Revision: https://reviews.llvm.org/D42377
llvm-svn: 327580
Optionally allow the order of restoring the callee-saved registers in the
epilogue to be reversed.
The flag -reverse-csr-restore-seq generates the following code:
```
stp x26, x25, [sp, #-64]!
stp x24, x23, [sp, #16]
stp x22, x21, [sp, #32]
stp x20, x19, [sp, #48]
; [..]
ldp x24, x23, [sp, #16]
ldp x22, x21, [sp, #32]
ldp x20, x19, [sp, #48]
ldp x26, x25, [sp], #64
ret
```
Note how the CSRs are restored in the same order as they are saved.
One exception to this rule is the last `ldp`, which allows us to merge
the stack adjustment and the ldp into a post-index ldp. This is done by
first generating:
ldp x26, x25, [sp]
add sp, sp, #64
which gets merged by the arm64 load store optimizer into
ldp x26, x25, [sp], #64
The flag is disabled by default.
llvm-svn: 327569
I removed this in r316797 because the coverage report showed no coverage and I thought it should have been handled by the auto generated table. I now see that there is code that bypasses the table if the shift amount is out of bounds.
This adds back the code. We'll codegen out-of-bounds i8 shifts to effectively (amount & 0x1f). The 0x1f is a strange quirk of x86 in that shift amounts are always masked to 5 bits (except for 64-bit shifts). So if the masked value is still out of bounds the result will be 0.
Fixes PR36731.
llvm-svn: 327540
I had to modify the bswap recognition to allow unshrunk masks to make this work.
Fixes PR36689.
Differential Revision: https://reviews.llvm.org/D44442
llvm-svn: 327530
swifterror llvm values model the swifterror register as memory at the
LLVM IR level. ISel will perform ad-hoc mem-to-reg on them. swifterror
values are constrained in how they can be used. Spilling them to memory
is not allowed.
SjLjEHPrepare tried to lower swifterror values to memory, which is
unnecessary since the back-end will spill and reload the register as
necessary (as long as clobbering calls are marked as such, which is the
case here) and further leads to invalid IR because swifterror values
can't be stored to memory.
rdar://38164004
llvm-svn: 327521
Support G_LSHR/G_ASHR/G_SHL. We have 3 variants of
shift instructions: shift gpr, shift imm, shift 1.
Currently GlobalISel TableGen generates patterns for
shift imm and shift 1, but with shiftCount i8.
In G_LSHR/G_ASHR/G_SHL, as in LLVM IR, both arguments
have the same type, so for now only shift i8 can use
auto-generated TableGen patterns.
The support of G_SHL/G_ASHR enables tryCombineSExt
from LegalizationArtifactCombiner.h to hit, which
results in different legalization for the following tests:
LLVM :: CodeGen/X86/GlobalISel/ext-x86-64.ll
LLVM :: CodeGen/X86/GlobalISel/gep.ll
LLVM :: CodeGen/X86/GlobalISel/legalize-ext-x86-64.mir
-; X64-NEXT: movsbl %dil, %eax
+; X64-NEXT: movl $24, %ecx
+; X64-NEXT: # kill: def $cl killed $ecx
+; X64-NEXT: shll %cl, %edi
+; X64-NEXT: movl $24, %ecx
+; X64-NEXT: # kill: def $cl killed $ecx
+; X64-NEXT: sarl %cl, %edi
+; X64-NEXT: movl %edi, %eax
...which is not optimal and should be addressed later.
Rework of the patch by igorb
Reviewed By: igorb
Differential Revision: https://reviews.llvm.org/D44395
llvm-svn: 327499
This is better able to detect undef and zero pieces in the concat, or cases when only one subvector is non-zero. This allows us to avoid silly things like double inserts into progressively larger undefs.
This still builds 512 bit concats of 128 bits by building up through 256 bits first. But I don't know if that's best.
We probably want to merge this with the vXi1 concat code since they are very similar.
llvm-svn: 327454
BUILD_VECTORs aren't themselves legalized until LegalizeDAG so we should still be able to create an "illegal" one before that. This helps combine with BUILD_VECTORS that are introduced during LegalizeVectorOps due to unrolling.
llvm-svn: 327446
Nothing prevents us from having both frame-setup and frame-destroy on
the same instruction.
When merging:
* frame-setup OPCODE1
* frame-destroy OPCODE2
into
* frame-setup frame-destroy OPCODE3
we want to be able to print and parse both flags.
llvm-svn: 327442
Nops should have zero latency because there is no result.
Idioms like 'xorps xmm0, xmm0' may have zero latency because
they are handled without using an execution unit.
llvm-svn: 327435
Under some circumstances the divrems won't have been combined together before getting to this code.
So replace the assertion with an if() guard that avoids expanding to X-((X/C)*C), to give the other combine a chance to happen.
Reduced from OSS-Fuzz #6883: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=6883
llvm-svn: 327424
For the MIPS O32 ABI, the current call lowering logic naively lowers each
call, creating the reserved argument area to hold the argument spill areas for
$a0..$a3 and the outgoing parameter area if one is required at each call site.
In the case of a sufficiently large byval argument, a call to memcpy is used
to write the start+16..end of the argument into the outgoing parameter area.
This is done within the CALLSEQ_START..CALLSEQ_END of the callee. The CALLSEQ
nodes are responsible for performing the necessary stack adjustments.
Since the O32/N32/N64 MIPS ABIs do not have a red-zone and writing below the
stack pointer and reading the values back is unpredictable, the call to memcpy
cannot be hoisted out of the callee's CALLSEQ nodes.
However, the O32 ABI requires the reserved argument area for functions
which have parameters. The naive lowering of calls will then create nested
CALLSEQ sequences. For N32 and N64 these nodes are also created, but with
zero stack adjustments as those ABIs do not have a reserved argument area.
This patch addresses the correctness issue by recognizing the special case
of lowering a byval argument that uses memcpy. By recognizing that the
incoming chain already has a CALLSEQ_START node on it when calling memcpy,
the CALLSEQ nodes are not created. For the N32 and N64 ABIs, this is not an
issue, as no stack adjustment has to be performed.
For the O32 ABI, the correctness reasoning is different. In the case of a
sufficiently large byval argument, registers a0..a3 are going to be used for
the callee's arguments, mandating the creation of the reserved argument area.
The call to memcpy in the naive case will also create its own reserved
argument area. However, since the reserved argument area consists of undefined
values, both calls can use the same reserved argument area.
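As a hedged illustration (hypothetical names; the size is chosen only to exceed the 16-byte register area), a call like the following is the byval case described above:
```
struct big { int words[32]; };  /* larger than the 16 bytes covered by a0..a3 */

void callee(struct big b);

void caller(struct big *p) {
  /* The first 16 bytes travel in a0..a3; the remainder of the byval copy is
     written to the outgoing area, which may be lowered as a call to memcpy. */
  callee(*p);
}
```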
Reviewers: abeserminji, atanasyan
Differential Revision: https://reviews.llvm.org/D44296
llvm-svn: 327388
splitMergedValStore will split a store into two if the target prefers this, or if
-force-split-store is passed.
This patch adds the missing handling for endianness in this function along
with a test case.
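A hedged sketch of a store that can be split this way (hypothetical example; whether it is actually split depends on the target or -force-split-store):
```
void store_halves(unsigned long long *p, unsigned int lo, unsigned int hi) {
  /* If split into two 32-bit stores, endianness decides which half is
     written at the lower address. */
  *p = ((unsigned long long)hi << 32) | lo;
}
```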
Review: Eli Friedman
https://reviews.llvm.org/D44396
llvm-svn: 327375
The current zero extension elimination was restricted to operands of
comparison. It actually could be extended to more cases.
For example:
int *inc_p (int *p, unsigned a)
{
  return p + a;
}
'a' will be promoted to i64 during addition, and the zero extension could
be eliminated as well.
For the elimination optimization, it should be much better to start
recognizing the candidate sequence from the SRL instruction instead of J*
instructions.
This patch makes it a generic zero extension elimination pass instead of
one restricted with comparison.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 327367
There is a mistake in the current code: we "break" out of the optimization
when the first operand of J*_RR doesn't qualify for the elimination. This
caused some elimination opportunities to be missed, for example the one in the
testcase.
The code should just fall through to handle the second operand.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 327366
The current subregister definition check stops after the MOV_32_64
instruction.
This means we assume all of the following instruction sequences
are safe to eliminate:
MOV_32_64 rB, wA
SLL_ri rB, rB, 32
SRL_ri rB, rB, 32
However, this is *not* true. The source subregister wA of MOV_32_64 could
come from an implicit truncation of a 64-bit register, in which case the high
bits of the 64-bit register are not zeroed, therefore we can't eliminate
above sequence.
For example, for i32_val, we shouldn't do the elimination:
long long bar ();
int foo (int b, int c)
{
  unsigned int i32_val = (unsigned int) bar();
  if (i32_val < 10)
    return b;
  else
    return c;
}
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 327365
Improve the test accuracy by adding more check directives.
Shifts are expected to be eliminated for zero extension but not for sign
extension.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
llvm-svn: 327364
This adds two features: "packets", and "nvj".
Enabling "packets" allows the compiler to generate instruction packets,
while disabling it will prevent it and disable all optimizations that
generate them. This feature is enabled by default on all subtargets.
The feature "nvj" allows the compiler to generate new-value jumps and it
implies "packets". It is enabled on all subtargets.
The exception is made for packets with endloop instructions, since they
require a certain minimum number of instructions in the packets to which
they apply. Disabling "packets" will not prevent hardware loops from
being generated.
llvm-svn: 327302
Since the enqueued kernels have internal linkage, their names may be dropped.
In this case, give them unique names __amdgpu_enqueued_kernel or
__amdgpu_enqueued_kernel.n where n is a sequential number starting from 1.
Differential Revision: https://reviews.llvm.org/D44322
llvm-svn: 327291
64-bit MMX vector generation usually ends up lowering into SSE instructions before being spilled/reloaded as an MMX type.
This patch creates a MMX vector from MMX source values, taking the lowest element from each source and constructing broadcasts/build_vectors with direct calls to the MMX PUNPCKL/PSHUFW intrinsics.
We're missing a few consecutive load combines that could be handled in a future patch if that would be useful - my main interest here is just avoiding a lot of the MMX/SSE crossover.
Differential Revision: https://reviews.llvm.org/D43618
llvm-svn: 327247
Same as the VPERMILPS/VPERMILPD approach for the v8f32/v4f64 cases, rely on PSHUFB using bits[3:0] for indexing - we can ignore the sign bit (zero element) as those index vector values are considered undefined. Then select between the lo/hi permute results based on the index size.
llvm-svn: 327242
As VPERMILPS/VPERMILPD only select elements based on bits[1:0]/bit[1], we can permute both the (repeated) lo/hi 128-bit vectors in each case and then select between these results based on whether the index was for the lo/hi half.
For v4i64/v4f64 this avoids some rather nasty v4i64 multiples on the AVX2 implementation, which seems to be worse than the extra port5 pressure from the additional shuffles/blends.
llvm-svn: 327239
Summary:
There are 3 different operand orders for FMA instructions so figuring out the exact operation being performed requires a lot of thought.
This patch adds a comment to the end of the assembly line to print the exact operation.
I think I've got all the instructions in here except the ones with builtin rounding.
I didn't update all tests, but I assume we can get them as we regenerate tests in the future.
Reviewers: spatel, v_klochkov, RKSimon
Reviewed By: spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D44345
llvm-svn: 327225
r327100 made us stop producing vecreduce-propagate-sd-flags.s, but it's
still sticking around on some bots. This makes the bots unhappy.
I'll revert this tomorrow.
llvm-svn: 327199
This fixes pr36674.
While it is valid for shouldAssumeDSOLocal to return false anytime,
always returning false for intrinsics is not optimal on i386 and also
hits a bug in the backend.
To use a plt, the caller must first setup ebx to handle the case of
that file being linked into a PIE executable or shared library. In
those cases the generated PLT uses ebx.
Currently we can produce "calll expf@plt" without setting ebx. We
could fix that by correctly setting ebx, but this would produce worse
code for the case where the runtime library is statically linked. It
would also required other tools to handle R_386_PLT32.
llvm-svn: 327198
r327171 "Improve Dependency analysis when doing multi-node Instruction Selection"
r327170 "[DAG] Enforce stricter NodeId invariant during Instruction selection"
Reverting patch as NodeId invariant change is causing pathological
increases in compile time on PPC
llvm-svn: 327197
These instructions have 3 operands that can be commuted. The first commute we find may not be the best. So we should keep searching if we performed an aggressive commute. There may still be an operand that is killed or a physical register constraint that might be better.
Differential Revision: https://reviews.llvm.org/D44324
llvm-svn: 327188
Relanding after fixing NodeId Invariant.
Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies, pruning the search when the topological
property of NodeId allows.
As part of this, propagate the NodeId-based cutoffs to narrow
hasPreprocessorHelper searches.
Reviewers: craig.topper, bogner
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D41293
llvm-svn: 327171
Instruction Selection makes use of the topological ordering of nodes
by node id (a node's operands have smaller node id than it) when doing
cycle detection. During selection we may violate this property as a
selection of multiple nodes may induce a use dependence (and thus a
node id restriction) between two unrelated nodes. If a selected node
has an unselected successor, this may allow us to miss a cycle during
detection and make an invalid selection.
This patch fixes this by marking all unselected successors of a
selected node with a negated node id. We avoid pruning on such negative
ids but still can reconstruct the original id for pruning.
In-tree targets have been updated to replace DAG-level replacements
with ISel-level ones which enforce this property.
This preemptively fixes PR36312 before the triggering commit r324359 relands.
Reviewers: craig.topper, bogner, jyknight
Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D43198
llvm-svn: 327170
The retpoline mitigation for variant 2 of CVE-2017-5715 inhibits the
branch predictor, and as a result it can lead to a measurable loss of
performance. We can reduce the performance impact of retpolined virtual
calls by replacing them with a special construct known as a branch
funnel, which is an instruction sequence that implements virtual calls
to a set of known targets using a binary tree of direct branches. This
allows the processor to speculatively execute valid implementations of the
virtual function without allowing for speculative execution of calls
to arbitrary addresses.
This patch extends the whole-program devirtualization pass to replace
certain virtual calls with calls to branch funnels, which are
represented using a new llvm.icall.jumptable intrinsic. It also extends
the LowerTypeTests pass to recognize the new intrinsic, generate code
for the branch funnels (x86_64 only for now) and lay out virtual tables
as required for each branch funnel.
The implementation supports full LTO as well as ThinLTO, and extends the
ThinLTO summary format used for whole-program devirtualization to
support branch funnels.
For more details see RFC:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120672.html
Differential Revision: https://reviews.llvm.org/D42453
llvm-svn: 327163
Summary: Starting from the GCN 2nd generation, the ISA supports ds_read_b128 on top of ds_read_b64.
This patch adds the ds_read_b128 instruction pattern and generation of this instruction.
In the vectorizer, this patch also widens the vector length so that the vectorizer generates
128 bit loads for the local address space, which get translated to ds_read_b128.
Since the performance benefit is not clear, the compiler generates ds_read_b128 only under -amdgpu-ds128.
Author: FarhanaAleen
Reviewed By: rampitec, arsenm
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44210
llvm-svn: 327153
The code to match and produce more x86 vector blends was enabled for all
architectures even though the transform may pessimize the code for other
architectures that do not provide a vector blend instruction.
Added an aarch64 testcase to check that a VZIP instruction is generated instead
of byte movs.
Differential Revision: https://reviews.llvm.org/D44118
llvm-svn: 327132
Previously we unpacked the even bytes of each input into the high byte of 16-bit elements, then did a v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and masked the upper 8 bits of each result to zero. The same was done for all the odd bytes. The results are then packed together with packuswb.
Since we are masking each multiply result element to 8 bits, and those 8 bits are determined only by the lower 8 bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does.
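A hedged example of a loop that could be vectorized into the v16i8 multiplies this lowering handles (hypothetical names):
```
void mul_bytes(const unsigned char *a, const unsigned char *b,
               unsigned char *c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = (unsigned char)(a[i] * b[i]);  /* only the low 8 bits survive */
}
```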
Differential Revision: https://reviews.llvm.org/D44267
llvm-svn: 327093
This patch is a fix for PR36642.
While legalizing long vector types, make sure the smaller types get the
flags of the wider type.
bugzilla link: https://bugs.llvm.org/show_bug.cgi?id=36642
Change-Id: I0c2829639f094c862c10a6b51b342d4c2563e1fa
llvm-svn: 327079
This instruction can be thought of as reading either the even elements of a vXi32 input or the lower half of each element of a vXi64 input. We currently use the vXi32 interpretation, but vXi64 matches better with its broadcast behavior in EVEX.
I'm looking at moving MULDQ/MULUDQ creation to a DAG combine so we can do it when AVX512DQ is enabled without having to go through Custom lowering. But in some of the test cases we failed to use a broadcast load due to the size difference. This should help with that.
I'm also wondering if we can model these instructions in native IR and remove the intrinsics and I think using a vXi64 type will work better with that.
llvm-svn: 326991
These patterns weren't checking the alignment of the load, but were using the aligned instructions. This will cause a GP fault if the data isn't aligned.
I believe these were introduced in r312450.
llvm-svn: 326967
The attached testcase started failing after the patch to define
isExtractSubvectorCheap with the following pattern mismatch:
ISEL: Starting pattern match
Initial Opcode index to 85068
Match failed at index 85076
LLVM ERROR: Cannot select: t47: v8i16 = insert_subvector undef:v8i16, t43, Constant:i64<0>
The code generated from llvm/lib/Target/AArch64/AArch64InstrInfo.td
def : Pat<(insert_subvector undef, (v4i16 FPR64:$src), (i32 0)),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), FPR64:$src, dsub)>;
is in ninja/lib/Target/AArch64/AArch64GenDAGISel.inc
At the location of the error it is:
/* 85076*/ OPC_CheckChild2Type, MVT::i32,
And it failed to match the type of operand 2.
Adding another def-pat for i64 fixes the failed def-pat error:
def : Pat<(insert_subvector undef, (v4i16 FPR64:$src), (i64 0)),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), FPR64:$src, dsub)>;
llvm-svn: 326949
Summary:
A desired property of the node order in Swing Modulo Scheduling is
that for nodes outside circuits the following holds: none of them is
scheduled after both a successor and a predecessor. We call
node orders that meet this property valid.
Although invalid node orders do not lead to the generation of incorrect
code, they can leave the pipeliner unable to find a pipelined schedule
for arbitrary II. The reason is that after scheduling the successor and the
predecessor of a node, no room may be left to schedule the node itself.
For data flow graphs with 0-latency edges, the node ordering algorithm
of Swing Modulo Scheduling can generate such undesired invalid node orders.
This patch fixes that.
In the remainder of this commit message, I will give an example
demonstrating the issue, explain the fix, and explain how the fix is tested.
Consider, as an example, the following data flow graph with all
edge latencies 0 and all edges pointing downward.
```
n0
/ \
n1 n3
\ /
n2
|
n4
```
Consider the implemented node order algorithm in top-down mode. In that mode,
the algorithm orders the nodes based on greatest Height and in case of equal
Height on lowest Movability. Finally, in case of equal Height and
Movability, given two nodes with an edge between them, the algorithm prefers
the source-node.
In the graph, for every node, the Height and Movability are equal to 0.
As will be explained below, the algorithm can generate the order n0, n1, n2, n3, n4.
So, node n3 is scheduled after its predecessor n0 and after its successor n2.
The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3.
To solve this, this patch introduces the ZeroLatencyHeight (ZLH) property
for nodes. It is defined as the maximum unweighted length of a path from the
given node to an arbitrary node in which each edge has latency 0.
So, ZLH(n0)=3, ZLH(n1)=ZLH(n3)=2, ZLH(n2)=1, and ZLH(n4)=0
In this patch, the preference for a greater ZeroLatencyHeight
is added in the top-down mode of the node ordering algorithm, after the
preference for a greater Height, and before the preference for a
lower Movability.
Therefore, the two allowed node-orders are n0, n1, n3, n2, n4 and n0, n3, n1, n2, n4.
Both of them are valid node orders.
In the same way, the bottom-up mode of the node ordering algorithm is adapted
by introducing the ZeroLatencyDepth property for nodes.
The patch is tested by adding extra checks to the following existing
lit-tests:
test/CodeGen/Hexagon/SUnit-boundary-prob.ll
test/CodeGen/Hexagon/frame-offset-overflow.ll
test/CodeGen/Hexagon/vect/vect-shuffle.ll
Before this patch, the pipeliner failed to pipeline the loops in these tests
due to invalid node-orders. After the patch, the pipeliner successfully
pipelines all these loops.
Reviewers: bcahoon
Reviewed By: bcahoon
Subscribers: Ayal, mgrang, llvm-commits
Differential Revision: https://reviews.llvm.org/D43620
llvm-svn: 326925
The v8i32 conversion on AVX1 targets was only working after LowerMUL splits 256-bit vectors.
While I was there I've also made it so we don't have to check for AVX2 and BWI directly and instead just ask if the type is legal.
Differential Revision: https://reviews.llvm.org/D44190
llvm-svn: 326917
This is a follow-up to r325169, this time for all types, not just HVX
vector types.
Disable this by default, since it's not always safe.
llvm-svn: 326915
Summary: GCN ISA supports instructions that can read 16 consecutive dwords from memory through the scalar data cache;
loadstoreVectorizer should take advantage of the wider vector length and pack 16/8 elements of dwords/quadwords.
Author: FarhanaAleen
Reviewed By: rampitec
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44179
llvm-svn: 326910
Fixes the bug found by asan. Also XFAIL the new test for Darwin,
which is stuck on DWARF v2, and fix up other tests so they stop
failing on Windows.
llvm-svn: 326839
Following the ARM-neon backend, define isExtractSubvectorCheap to return true
when extracting low and high part of a neon register.
The patch disables a test in llvm/test/CodeGen/AArch64/arm64-ext.ll This
testcase is fragile in the sense that it requires a BUILD_VECTOR to "survive"
all DAG transforms until ISelLowering. The testcase is supposed to check that
AArch64TargetLowering::ReconstructShuffle() works, and for that we need a
BUILD_VECTOR in ISelLowering. As we now transform the BUILD_VECTOR earlier into
an VEXT + vector_shuffle, we don't have the BUILD_VECTOR pattern when we get to
ISelLowering. As there is no way to disable the combiner to only exercise the
code in ISelLowering, the patch disables the testcase.
Differential revision: https://reviews.llvm.org/D43973
llvm-svn: 326811
One addrspacecast disappeared in clang emitted IR for
block invoke function due to adoption of the new
addr space mapping.
Differential Revision: https://reviews.llvm.org/D43785
llvm-svn: 326806
Before I started maintaining the AVR backend, this instruction
never originally used to have an earlyclobber flag.
Some time afterwards (years ago), I must've added it back in, not realising that it
was left out for a reason.
This pseudo instruction exists solely to work around a long standing bug
in the register allocator.
Before this commit, the LDDWRdYQ pseudo was not actually working around
any bug. With the earlyclobber flag removed again, the LDDWRdYQ pseudo
now correctly works around PR13375 again.
llvm-svn: 326774
EAX can turn out to be alive here, when shrink wrapping is done
(which is allowed when using dwarf exceptions, contrary to the
normal case with WinCFI).
This fixes PR36487.
Differential Revision: https://reviews.llvm.org/D43968
llvm-svn: 326764
DWARF v5 specifies that the root file (also given in the DW_AT_name
attribute of the compilation unit DIE) should be emitted explicitly to
the line table's list of files. This makes the line table more
independent of the .debug_info section.
Differential Revision: https://reviews.llvm.org/D44054
llvm-svn: 326758
Summary:
Fabs is a common floating-point operation, especially for some expansions. This patch adds
a new generic opcode for llvm.fabs.* intrinsic in order to avoid building/matching this intrinsic.
Reviewers: qcolombet, aditya_nandakumar, dsanders, rovka
Reviewed By: aditya_nandakumar
Subscribers: kristof.beyls, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D43864
llvm-svn: 326749
Up until Power9, the performance profile for rlwinm., rldicl. and andi. looked
more or less equivalent. However with Power9, the rotates are still 2-way
cracked whereas the and-immediate is not.
This patch just ensures that we don't emit record-form rotates when an andi.
is adequate.
As first pointed out by Carrot in https://bugs.llvm.org/show_bug.cgi?id=30833
(this patch is a fix for that PR).
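A minimal sketch of the pattern (hypothetical example; the mask is chosen to fit an andi. immediate):
```
int any_low_bits(unsigned int x) {
  /* The mask fits a 16-bit unsigned immediate, so andi. (which also sets
     CR0) is preferred over a record-form rotate on Power9. */
  return (x & 0xff) != 0;
}
```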
Differential Revision: https://reviews.llvm.org/D43977
llvm-svn: 326736
The error occurs when reading i16 elements (as in the testcase) from a v8i8
with a pattern of <0,2,4,6>. As all the data in the vector is accessed, the
operation is not a VUZP. The patch stops the pattern recognition of VUZP when
EXTRACT_VECTOR_ELT has a different element type than BUILD_VECTOR.
llvm-svn: 326722
Use the whole gamut of constant immediates available to set up a vector.
Instead of using, for example, `mov w0, #0xffff; dup v0.4s, w0`, which
transfers between register files, use the more efficient `movi v0.4s, #-1`
instead. This is not limited to just a few values, but covers any immediate value that can
be encoded by all the variants of `FMOV`, `MOVI`, `MVNI`, thus eliminating
the need for there to be patterns to optimize special cases.
Differential revision: https://reviews.llvm.org/D42133
llvm-svn: 326718
Loading a constant into a k-register in AVX512 requires a bitcast from a scalar constant. In the test case here we have a k-register store that gets split into multiple parts on KNL. MergeConsecutiveStores sees each of these pieces as a consecutive store and looks through the bitcast to find the underlying scalar constant. But when we went to create the combined store we didn't look through the same bitcast.
llvm-svn: 326677
Add tests for FIADD/FISUB/FISUBR/FIMUL/FIDIV/FIDIVR
Shows we have more FILD stack usage than necessary (arg load, spill, reload to x87)
llvm-svn: 326674
We were previously doing this with isel patterns. Moving it to op legalization gives us chance to see the required bitcast earlier. And it lets us remove some isel patterns.
llvm-svn: 326669
X86 considers v1i1 a legal type under AVX512 and as such a truncate from a v1iX type to v1i1 can be turned into a scalar truncate plus a conversion to v1i1. We would much prefer a v1i1 SCALAR_TO_VECTOR over a one element BUILD_VECTOR.
During lowering we were detecting the v1i1 BUILD_VECTOR as a splat BUILD_VECTOR like we try to do for v2i1/v4i1/etc. In this case we create (select i1 splat_elt, v1i1 all-ones, v1i1 all-zeroes). That goes through some more legalization and we end up with a CMOV choosing between 0 and 1 in scalar and a scalar_to_vector.
Arguably we could detect the v1i1 BUILD_VECTOR and do this better in X86 target code. But just using a SCALAR_TO_VECTOR in legalization is much easier.
llvm-svn: 326637
This adds back-end support for the anyregcc calling convention
for use with patchpoints.
Since all registers are considered call-saved with anyregcc
(except for 0 and 1 which may still be clobbered by PLT stubs
and the like), this required adding support for saving and
restoring vector registers in prologue/epilogue code for the
first time. This is not used by any other calling convention.
llvm-svn: 326612