candidates with coldcc attribute.
This patch adds support for the coldcc calling convention for Power.
This changes the set of non-volatile registers. It includes a pass to stress
test the implementation by marking all static directly called functions with
the coldcc attribute through the option -enable-coldcc-stress-test. It also
includes an option, -ppc-enable-coldcc, to add the coldcc attribute to
functions which are cold at all call sites based on BlockFrequencyInfo when
the containing function does not call any non cold functions.
Differential Revision: https://reviews.llvm.org/D38413
llvm-svn: 322721
BRCTH is capable of a long branch which needs to be recognized during branch
relaxation. This is done by checking for ExtraRelaxSize == 0.
Review: Ulrich Weigand
llvm-svn: 322688
Summary:
Loading a vector of 4 half-precision FP sometimes results in an LD1
of 2 single-precision FP + a reversal. This results in an incorrect
byte swap due to the conversion from little endian to big endian.
In order to generate the correct byte swap, it is easier to
generate the correct LD1 of 4 half-precision FP, thus avoiding the
subsequent reversal.
Reviewers: craig.topper, jmolloy, olista01
Reviewed By: olista01
Subscribers: efriedma, samparker, SjoerdMeijer, rogfer01, aemerson, rengolin, javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D41863
llvm-svn: 322663
Mark G_FPEXT and G_FPTRUNC as legal or libcall, depending on hardware
support, but only for conversions between float and double.
Also add the necessary boilerplate so that the LegalizerHelper can
introduce the required libcalls. This also works only for float and
double, but isn't too difficult to extend when the need arises.
llvm-svn: 322651
The match* functions have the annoying behavior of modifying its inputs.
Save and restore the inputs, just in case the early out for AVX512 is
hit. This is still not great and its only a matter of time this kind of
bug happens again, but I couldn't come up with a better pattern without
rewriting significant chunks of this code. Fixes PR35977.
llvm-svn: 322644
Change symbol values in the stack_size section from being 8 bytes, to being a target dependent size.
Differential Revision: https://reviews.llvm.org/D42108
llvm-svn: 322619
Summary:
This patch adds CustomRenderer which renders the matched
operands to the specified instruction.
Targets can enable the matching of SDNodeXForm by adding
a definition that inherits from GICustomOperandRenderer and
GISDNodeXFormEquiv as follows.
def gi_imm8 : GICustomOperandRenderer<"renderImm8”>,
GISDNodeXFormEquiv<imm8_xform>;
Custom renderer functions should be of the form:
void render(MachineInstrBuilder &MIB, const MachineInstr &I);
Reviewers: dsanders, ab, rovka
Reviewed By: dsanders
Subscribers: kristof.beyls, javed.absar, llvm-commits, mgrang, qcolombet
Differential Revision: https://reviews.llvm.org/D42012
llvm-svn: 322582
As commented on the existing code:
// The Reg operand should be a virtual register, which is defined
// outside the current basic block. DAG combiner has done a pretty
// good job in removing truncating inside a single basic block.
However, when the Reg operand comes from bpf_load_[byte | half | word]
intrinsics, the generic optimizer doesn't understand their results are
zero extended, so these single basic block elimination opportunities were
missed.
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
llvm-svn: 322534
As mentioned on PR35869, (and came up recently on D41517) we don't create a MMX zero register via the PXOR but instead perform a spill to stack from a XMM zero register.
This patch adds support for direct MMX zero vector creation and should make it easier to add better constant vector creation in the future as well.
Differential Revision: https://reviews.llvm.org/D41908
llvm-svn: 322525
Add support for custom execution domain fixing and implement support for BLENDPD/BLENDPS/PBLENDD/PBLENDW.
Differential Revision: https://reviews.llvm.org/D42042
llvm-svn: 322524
Since a load and test instruction treat its operands as signed, it can only
replace a logical compare for EQ/NE uses.
Review: Ulrich Weigand
https://bugs.llvm.org/show_bug.cgi?id=35662
llvm-svn: 322488
Summary: This patch changes the kunpck intrinsic autoupgrade to use vXi1 shufflevector operations to perform vector extracts and concats. This more closely matches the definition of the kunpck instructions. Currently we rely on a DAG combine to turn the scalar shift/and/or code into a concat vectors operation. By doing it in the IR we get this for free.
Reviewers: spatel, RKSimon, zvi, jina.nahias
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D42018
llvm-svn: 322462
We have to take special care to avoid the cases where the result of the truncate would be padded with zero elements.
Ideally we'd just use ISD::TRUNCATE for these cases instead.
llvm-svn: 322454
Extend vXi1 conditions of vXi8/vXi16 selects even before type legalization gets a chance to split wide vectors. Previously we would only extend 128 and 256 bit vectors. But if we start with a 512 bit vector or wider that needs to be split we wouldn't extend until after the split had taken place. By extending early we improve the results of type legalization.
Don't widen condition of 128/256 bit vXi16/vXi8 selects when we have BWI but not VLX. We can still use a mask register by widening the select to 512-bits instead. This is similar to what we do for compares already.
llvm-svn: 322450
Additional test cases cover selects with i16/i8 conditions that are only 128/256-bits wide, but the compares are 512-bits wide and can only produce k-registers. We should be able to artificially widen the selects to avoid moving the k-register to an xmm/ymm register.
llvm-svn: 322449
In addition to the existing match as part of a loop-reduction, add a
straightforward pattern match for DAG-contained patterns.
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D41811
llvm-svn: 322446
This avoids having the result type stick around until lowering where we have to extend the setcc and insert a truncate. If we get the types converted early we can do more to optimize it.
llvm-svn: 322432
*Mostly* NFC. Still updating the test though just for completeness.
This moves the hasAddressTaken check to MachineOutliner.cpp and replaces it
with a per-basic block test rather than a per-function test. The old test was
too conservative and was preventing functions in C programs from being
outlined even though they were safe to outline.
This was mostly a problem in C sources.
llvm-svn: 322425
Summary:
A recent change
321556: AMDGPU: Remove mayLoad/hasSideEffects from MIMG stores
can allow the machine instruction scheduler to move an image store past
an image load using the same descriptor.
V2: Fixed by marking image ops as mayAlias and isAliased. This may be
overly conservative, and we may need to revisit.
V3: Reverted test change done on 321556.
Reviewers: arsenm, nhaehnle, dstuttard
Subscribers: llvm-commits, t-tye, yaxunl, wdng, kzhuravl
Differential Revision: https://reviews.llvm.org/D41969
llvm-svn: 322419
Pass MD5 checksums through from IR to assembly/object files.
After this, getting Clang to compute the MD5 should be the last step
to supporting MD5 in the DWARF v5 line table header.
Differential Revision: https://reviews.llvm.org/D41926
llvm-svn: 322391
Part of the fix for https://bugs.llvm.org/show_bug.cgi?id=35812.
This patch ensures that the compare operand for the atomic compare and swap
is properly zero-extended to 32 bits if applicable.
A follow-up commit will fix the extension for the SETCC node generated when
expanding an ATOMIC_CMP_SWAP_WITH_SUCCESS. That will complete the bug fix.
Differential Revision: https://reviews.llvm.org/D41856
llvm-svn: 322372
For hard float with VFP4, it is legal. Otherwise, we use libcalls.
This needs a bit of support in the LegalizerHelper for soft float
because we didn't handle G_FMA libcalls yet. The support is trivial, as
the only difference between G_FMA and other libcalls that we already
handle is that it has 3 input operands rather than just 2.
llvm-svn: 322366
This patch teaches the Arm back-end to generate the SMMULR, SMMLAR and SMMLSR
instructions from equivalent IR patterns.
Differential Revision: https://reviews.llvm.org/D41775
llvm-svn: 322361
While the suffix isn't required to disambiguate the instructions, it is required in order to parse the instructions when the suffix is specified in order to match the GNU assembler.
llvm-svn: 322354
The PeepholeOptimizer would fail for vregs without a definition. If this
was caused by an undef operand abort to keep the code simple (so we
don't need to add logic everywhere to replicate the undef flag).
Differential Revision: https://reviews.llvm.org/D40763
llvm-svn: 322319
While updating clang tests for having clang set dso_local I noticed
that:
- There are *a lot* of tests to update.
- Many of the updates are redundant.
They are redundant because a GV is "obviously dso_local". This patch
starts formalizing that a bit by requiring that internal and private
GVs be dso_local too. Since they all are, we don't have to print
dso_local to the textual representation, making it a bit more compact
and easier to read.
llvm-svn: 322317
When replacing a PHI the PeepholeOptimizer currently takes the register
class of the register at the first operand. This however is not correct
if this argument has a subregister index.
As there is currently no API to query the register class resulting from
applying a subregister index to all registers in a class, we can only
abort in these cases and not perform the transformation.
This changes findNextSource() to require the end of all copy chains to
not use a subregister if there is any PHI in the chain. I had to rewrite
the overly complicated inner loop there to have a good place to insert
the new check.
This fixes https://llvm.org/PR33071 (aka rdar://32262041)
Differential Revision: https://reviews.llvm.org/D40758
llvm-svn: 322313
Summary:
Fold cases such as:
(v8i8 truncate (v8i32 extract_subvector (v16i32 sext (v16i8 V), Idx)))
->
(v8i8 extract_subvector (v16i8 V), Idx)
This can be generalized to cases where the truncate and extend do not
fully cancel each other out, but it may require querying the target
about profitability.
Reviewers: RKSimon, craig.topper, spatel, efriedma
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41927
llvm-svn: 322300
Broken test from old attempt for folding tables - we don't peek through extract_subvector spills at all (which is why it doesn't fold), and we already have foldMemoryOperandCustom to handle insertps immediate correction anyway.
llvm-svn: 322292
This was originally planned as the fix for:
https://bugs.llvm.org/show_bug.cgi?id=35834
...but simpler transforms handled that case, so I implemented a
lesser solution. It turns out we need to handle the case with 'not'
ops too because the real code example that we are trying to solve:
https://bugs.llvm.org/show_bug.cgi?id=35875
...has extra uses of the intermediate values, so we can't rely on
smaller canonicalizations to get us to the goal.
As with rL321672, I've tried to show every possibility in the
codegen tests because that's the simplest way to prove we're doing
the right thing in the wide variety of permutations of this pattern.
We can also show an InstCombine win because we added a fold for
this case in:
rL321998 / D41603
An Alive proof for one variant of the pattern to show that the
InstCombine and codegen results are correct:
https://rise4fun.com/Alive/vd1
Name: min3_nots
%nx = xor i8 %x, -1
%ny = xor i8 %y, -1
%nz = xor i8 %z, -1
%cmpxz = icmp slt i8 %nx, %nz
%minxz = select i1 %cmpxz, i8 %nx, i8 %nz
%cmpyz = icmp slt i8 %ny, %nz
%minyz = select i1 %cmpyz, i8 %ny, i8 %nz
%cmpyx = icmp slt i8 %y, %x
%r = select i1 %cmpyx, i8 %minxz, i8 %minyz
=>
%cmpxyz = icmp slt i8 %minxz, %ny
%r = select i1 %cmpxyz, i8 %minxz, i8 %ny
Name: min3_nots_alt
%nx = xor i8 %x, -1
%ny = xor i8 %y, -1
%nz = xor i8 %z, -1
%cmpxz = icmp slt i8 %nx, %nz
%minxz = select i1 %cmpxz, i8 %nx, i8 %nz
%cmpyz = icmp slt i8 %ny, %nz
%minyz = select i1 %cmpyz, i8 %ny, i8 %nz
%cmpyx = icmp slt i8 %y, %x
%r = select i1 %cmpyx, i8 %minxz, i8 %minyz
=>
%xz = icmp sgt i8 %x, %z
%maxxz = select i1 %xz, i8 %x, i8 %z
%xyz = icmp sgt i8 %maxxz, %y
%maxxyz = select i1 %xyz, i8 %maxxz, i8 %y
%r = xor i8 %maxxyz, -1
llvm-svn: 322283
Primarily, this allows us to use the aggressive extraction mechanisms in combineExtractWithShuffle earlier and make use of UNDEF elements that may be lost during lowering.
llvm-svn: 322279
Summary:
As RKSimon suggested in pr35820, in the case that Src is smaller in
bit-size than Indices, need to widen Src to avoid type mismatch.
Fixes pr35820
Reviewers: RKSimon, craig.topper
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41865
llvm-svn: 322272
Although the register scavenger can often find a spare register, an emergency
spill slot is needed to guarantee success. Reserve this slot in cases where
the function is known to have a large stack (meaning the scavenger may be
needed when forming stack addresses).
llvm-svn: 322269
If the index is v2i64 we can use the scatter instruction that has v4i32/v4f32 data register, v2i64 index, and v2i1 mask. Similar was already done for gather.
Implement custom widening for v2i32 data to remove the code that reverses type legalization during lowering.
llvm-svn: 322254
Like rL321668 / rL321672, the planned optimizer change to
fix these will be in ValueTracking, but we can test the
changes cleanly here with AArch64 codegen.
llvm-svn: 322238
Revert for now as the testcase is hitting a pre-existing verifier error
that manifest as a failure when expensive checks are enabled (or
-verify-machineinstrs) is used.
This reverts commit r322200.
llvm-svn: 322231
Branch relaxation is needed to support branch displacements that overflow the
instruction's immediate field.
Differential Revision: https://reviews.llvm.org/D40830
llvm-svn: 322224
Make sure I really get back to the beahvior before my rewrite in r321035
which turned out not to be completely NFC as I changed the behavior for
the ios simulator environment.
llvm-svn: 322223
This is a prerequisite for the branch relaxation pass, and allows a number of
optimisation passes (e.g. BranchFolding and MachineBlockPlacement) to work.
Differential Revision: https://reviews.llvm.org/D40808
llvm-svn: 322222
Includes support for expanding va_copy. Also adds support for using 'aligned'
registers when necessary for vararg calls, and ensure the frame pointer always
points to the bottom of the vararg spill region. This is necessary to ensure
that the saved return address and stack pointer are always available at fixed
known offsets of the frame pointer.
Differential Revision: https://reviews.llvm.org/D40805
llvm-svn: 322215
Currently we infer the scale at isel time by analyzing whether the base is a constant 0 or not. If it is we assume scale is 1, else we take it from the element size of the pass thru or stored value. This seems a little weird and I think it makes more sense to make it explicit in the DAG rather than doing tricky things in the backend.
Most of this patch is just making sure we copy the scale around everywhere.
Differential Revision: https://reviews.llvm.org/D40055
llvm-svn: 322210
ADRP instructions weren't being outlined because they're PC-relative and thus
fail the LR checks. This patch adds a special case for ADRPs to
getOutliningType to make sure that ADRPs can be outlined and updates the MIR
test.
llvm-svn: 322207
Large callframes (calls with several hundreds or thousands or
parameters) could lead to situations in which the emergency spillslot is
out of range to be addressed relative to the stack pointer.
This commit forces the use of a frame pointer in the presence of large
callframes.
This commit does several things:
- Compute max callframe size at the end of instruction selection.
- Add mirFileLoaded target callback. Use it to compute the max callframe size
after loading a .mir file when the size wasn't specified in the file.
- Let TargetFrameLowering::hasFP() return true if there exists a
callframe > 255 bytes.
- Always place the emergency spillslot close to FP if we have a frame
pointer.
- Note that `useFPForScavengingIndex()` would previously return false
when a base pointer was available leading to the emergency spillslot
getting allocated late (that's the whole effect of this callback).
Which made no sense to me so I took this case out: Even though the
emergency spillslot is technically not referenced by FP in this case
we still want it allocated early.
Differential Revision: https://reviews.llvm.org/D40876
llvm-svn: 322200
Adds missing coverage for SHUFPD undef argument lowering, and also shows a missed opportunity to remove a unnecessary move compared to 02 shuffle mask.
llvm-svn: 322175
For hard float, it is legal.
For soft float, we need to lower to 0 - x first, and then we can use the
libcall for G_FSUB. This is undoing some of the canonicalization
performed by the IRTranslator (which introduces G_FNEG when it sees a
0 - x). Ideally, that canonicalization would be performed by a
pre-legalizer pass that would allow targets to opt out of this behaviour
rather than dance around it in the legalizer.
llvm-svn: 322168
Prefetches used to always be chained between any previous and following
memory accesses. The problem with this was that later optimizations, such as
folding of a load into the user instruction, got disrupted.
This patch relaxes the chaining of prefetches in order to remedy this.
Reveiw: Hal Finkel
https://reviews.llvm.org/D38886
llvm-svn: 322163
Since a load and test instruction treat its operands as signed, it can only
replace a logical compare for EQ/NE uses.
Review: Ulrich Weigand
https://bugs.llvm.org/show_bug.cgi?id=35662
llvm-svn: 322161
Planning to add support for named vregs. This puts is in a conundrum since
physregs are named as well. To rectify this we need to use a sigil other than
'%' for physregs in MIR. We've settled on using '$' for physregs but first we
must repurpose it from external symbols using it, which is what this commit is
all about. We think '&' will have familiar semantics for C/C++ users.
llvm-svn: 322146
Adds option /guard:cf to clang-cl and -cfguard to cc1 to emit function IDs
of functions that have their address taken into a section named .gfids$y for
compatibility with Microsoft's Control Flow Guard feature.
The original patch didn't have the lit.local.cfg file that restricts the new
test to x86, thus the new test was failing on the non-x86 bots.
Differential Revision: https://reviews.llvm.org/D40531
The reverts r322008, which was a revert of r322005.
This reverts commit a05b89f9aca70597dc79fe97bc49b50b51f525ba.
llvm-svn: 322136
This patch makes the following changes to the schedule of instructions in the
prologue and epilogue.
The stack pointer update is moved down in the prologue so that the callee saves
do not have to wait for the update to happen.
Saving the lr is moved down in the prologue to hide the latency of the mflr.
The stack pointer is moved up in the epilogue so that restoring of the lr can
happen sooner.
The mtlr is moved up in the epilogue so that it is away form the blr at the end
of the epilogue. The latency of the mtlr can now be hidden by the loads of the
callee saved registers.
This commit is almost identical to this one: r322036 except that two warnings
that broke build bots have been fixed.
The revision number is D41737 as before.
llvm-svn: 322124
Summary:
In the case of an fp_extend of v1f16 to v1f32 where the v1f16 is the
result of a bitcast from i16, avoid creating an illegal fp16_to_fp where
the input is not a vector and the result is a v1f32.
V2: The fix is now to avoid vector scalarization creating a v1->scalar
bitcast.
Reviewers: srhines, t.p.northover
Subscribers: nhaehnle, llvm-commits, dstuttard, t-tye, yaxunl, wdng, kzhuravl, arsenm
Differential Revision: https://reviews.llvm.org/D41126
llvm-svn: 322120
Summary:
I had a case where multiple nested uniform ifs resulted in code that did
v_cmp comparisons, combining the results with s_and_b64, s_or_b64 and
s_xor_b64 and using the resulting mask in s_cbranch_vccnz, without first
ensuring that bits for inactive lanes were clear.
There was already code for inserting an "s_and_b64 vcc, exec, vcc" to
clear bits for inactive lanes in the case that the branch is instruction
selected as s_cbranch_scc1 and is then changed to s_cbranch_vccnz in
SIFixSGPRCopies. I have added the same code into SILowerControlFlow for
the case that the branch is instruction selected as s_cbranch_vccnz.
This de-optimizes the code in some cases where the s_and is not needed,
because vcc is the result of a v_cmp, or multiple v_cmp instructions
combined by s_and/s_or. We should add a pass to re-optimize those cases.
Reviewers: arsenm, kzhuravl
Subscribers: wdng, yaxunl, t-tye, llvm-commits, dstuttard, timcorringham, nhaehnle
Differential Revision: https://reviews.llvm.org/D41292
llvm-svn: 322119
Normally target independent DAG combine would do this combine based on getSetCCResultType, but with VLX getSetCCResultType returns a vXi1 type preventing the DAG combining from kicking in.
But doing this combine can allow us to remove the explicit sign extend that would otherwise be emitted.
This patch adds a target specific DAG combine to combine the sext+setcc when the result type is the same size as the input to the setcc. I've restricted this to FP compares and things that can be represented with PCMPEQ and PCMPGT since we don't have full integer compare support on the older ISAs.
Differential Revision: https://reviews.llvm.org/D41850
llvm-svn: 322101
In -debug output we print "pred:" whenever a MachineOperand is a
predicate operand in the instruction descriptor, and "opt:" whenever a
MachineOperand is an optional def in the instruction descriptor.
Differential Revision: https://reviews.llvm.org/D41870
llvm-svn: 322096
Summary:
Added the FastVariableShuffle feature to cases that resembled processors
for which this fearure is on.
For AVX2 there are processors with and w/o this fearue enable.
For AVX512 only KNL does enable this feature so cases which only have
+avx512f were left without the FastVariableShuffle enabled.
Reviewers: RKSimon, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41851
llvm-svn: 322090
Currently the MachineInstr::print function prints the
frame-setup/frame-destroy differently than it does in MIR.
Instead of:
%x21 = LDR %sp, -16; flags: FrameDestroy
print:
%x21 = frame-destroy LDR %sp, -16
llvm-svn: 322088
Ingredients in this patch:
1. Add HANDLE_LIBCALL defs for finite mathlib functions that correspond to LLVM intrinsics.
2. Plumbing to send TargetLibraryInfo down to SelectionDAGLegalize.
3. Relaxed math and library checking in SelectionDAGLegalize::ConvertNodeToLibcall() to choose finite libcalls.
There was a bug about determining the availability of the finite calls that should be fixed with:
rL322010
Not in this patch:
This doesn't resolve the question/bug of clang creating the intrinsic IR in the first place.
There's likely follow-up work needed to support the long double variants better.
There's room for improvement to reduce the code duplication.
Create finite calls that don't originate from a corresponding intrinsic or DAG node?
Differential Revision: https://reviews.llvm.org/D41338
llvm-svn: 322087
Since register classes and banks are already printed with the register
definition, don't print it at the end of every instruction anymore.
This follows MIR in this regard and is another step to the unification
of the two formats.
llvm-svn: 322086
We are printing / parsing the `frame-setup` MachineInstr flag but not
the `frame-destroy` one.
Differential Revision: https://reviews.llvm.org/D41509
llvm-svn: 322071
Summary:
This commit enables some of the arithmetic instructions for Nios2 ISA (for both
R1 and R2 revisions), implements facilities required to emit those instructions
and provides LIT tests for added instructions.
Reviewed By: hfinkel
Differential Revision: https://reviews.llvm.org/D41236
Author: belickim <mateusz.belicki@intel.com>
llvm-svn: 322069
CET (Control-Flow Enforcement Technology) introduces a new mechanism called IBT (Indirect Branch Tracking).
According to IBT, each Indirect branch should land on dedicated ENDBR instruction (End Branch).
The new pass adds ENDBR instructions for every indirect jmp/call (including jumps using jump tables / switches).
For more information, please see the following:
https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf
Differential Revision: https://reviews.llvm.org/D40482
Change-Id: Icb754489faf483a95248f96982a4e8b1009eb709
llvm-svn: 322062
The code that checks the immediate wasn't masking to the lower 3-bits like the code in X86InstrInfo.cpp that's used by the peephole pass does.
llvm-svn: 322060
The CTRLoop pass performs checks on the argument of certain libcalls/intrinsics,
and assumes the arguments must be of a simple type. This isn't always the case
though. For example if we unroll and vectorize a loop we may end up with vectors
larger then the largest legal type, along with intrinsics that operate on those
wider types. This happened in the ffmpeg build, where we unrolled a loop and
ended up with a sqrt intrinsic that operated on V16f64, triggering an assertion.
Differential Revision: https://reviews.llvm.org/D41758
llvm-svn: 322055
I had to drop fast-isel-abort from a test because we can't fast isel some of the mask stuff. When we used intrinsics we implicitly fell back to SelectionDAG for the intrinsic call without triggering the abort error. But with native IR that doesn't happen the same way.
llvm-svn: 322050
This commit does two things. Firstly, it adds a collection of flags which can
be passed along to the target to encode information about the MBB that an
instruction lives in to the outliner.
Second, it adds some of those flags to the AArch64 outliner in order to add
more stack instructions to the list of legal instructions that are handled
by the outliner. The two flags added check if
- There are calls in the MachineBasicBlock containing the instruction
- The link register is available in the entire block
If the link register is available and there are no calls, then a stack
instruction can always be outlined without fixups, regardless of what it is,
since in this case, the outliner will never modify the stack to create a
call or outlined frame.
The motivation for doing this was checking which instructions are most often
missed by the outliner. Instructions like, say
%sp<def> = ADDXri %sp, 32, 0; flags: FrameDestroy
are very common, but cannot be outlined in the case that the outliner might
modify the stack. This commit allows us to outline instructions like this.
llvm-svn: 322048
This patch makes the following changes to the schedule of instructions in the
prologue and epilogue.
The stack pointer update is moved down in the prologue so that the callee saves
do not have to wait for the update to happen.
Saving the lr is moved down in the prologue to hide the latency of the mflr.
The stack pointer is moved up in the epilogue so that restoring of the lr can
happen sooner.
The mtlr is moved up in the epilogue so that it is away form the blr at the end
of the epilogue. The latency of the mtlr can now be hidden by the loads of the
callee saved registers.
Differential Revision: https://reviews.llvm.org/D41737
llvm-svn: 322036
The new test fails on the Hexagon bot. Reverting while I investigate.
This reverts https://reviews.llvm.org/rL322005
This reverts commit b7e0026b4385180c378edc658ec91a39566f2942.
llvm-svn: 322008
Adds option /guard:cf to clang-cl and -cfguard to cc1 to emit function IDs
of functions that have their address taken into a section named .gfids$y for
compatibility with Microsoft's Control Flow Guard feature.
Differential Revision: https://reviews.llvm.org/D40531
llvm-svn: 322005
This patch allows `r7` to be used, regardless of its use as a frame pointer, as
a temporary register when popping `lr`, and also falls back to using a high
temporary register if, for some reason, we weren't able to find a suitable low
one.
Differential revision: https://reviews.llvm.org/D40961
Fixes https://bugs.llvm.org/show_bug.cgi?id=35481
llvm-svn: 321989
Allow SimplifyDemandedBits to use TargetLoweringOpt::computeKnownBits to look through bitcasts. This can help simplifying in some cases where bitcasts of constants generated during or after legalization can't be folded away, and thus didn't get picked up by SimplifyDemandedBits. This fixes PR34620, where a redundant pand created during legalization from lowering and lshr <16xi8> wasn't being simplified due to the presence of a bitcasted build_vector as an operand.
Committed on the behalf of @sameconrad (Sam Conrad)
Differential Revision: https://reviews.llvm.org/D41643
llvm-svn: 321969
Summary:
There are few oddities that occur due to v1i1, v8i1, v16i1 being legal without v2i1 and v4i1 being legal when we don't have VLX. Particularly during legalization of v2i32/v4i32/v2i64/v4i64 masked gather/scatter/load/store. We end up promoting the mask argument to these during type legalization and then have to widen the promoted type to v8iX/v16iX and truncate it to get the element size back down to v8i1/v16i1 to use a 512-bit operation. Since need to fill the upper bits of the mask we have to fill with 0s at the promoted type.
It would be better if we could just have the v2i1/v4i1 types as legal so they don't undergo any promotion. Then we can just widen with 0s directly in a k register. There are no real v4i1/v2i1 instructions anyway. Everything is done on a larger register anyway.
This also fixes an issue that we couldn't implement a masked vextractf32x4 from zmm to xmm properly.
We now have to support widening more compares to 512-bit to get a mask result out so new tablegen patterns got added.
I had to hack the legalizer for widening the operand of a setcc a bit so it didn't try create a setcc returning v4i32, extract from it, then try to promote it using a sign extend to v2i1. Now we create the setcc with v4i1 if the original setcc's result type is v2i1. Then extract that and don't sign extend it at all.
There's definitely room for improvement with some follow up patches.
Reviewers: RKSimon, zvi, guyblank
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41560
llvm-svn: 321967
The memory form of the xmm->xmm version only writes 64-bits. If we use it in the folding tables and its get used for a stack spill, only half the slot will be written. Then a reload may read all 128-bits which will pull in garbage. But without the spill the upper bits of the register would have been zero. By not folding we would preserve the zeros.
llvm-svn: 321950
This is the last step needed to fix PR33325:
https://bugs.llvm.org/show_bug.cgi?id=33325
We're trading branch and compares for loads and logic ops.
This makes the code smaller and hopefully faster in most cases.
The 24-byte test shows an interesting construct: we load the trailing scalar
elements into vector registers and generate the same pcmpeq+movmsk code that
we expected for a pair of full vector elements (see the 32- and 64-byte tests).
Differential Revision: https://reviews.llvm.org/D41714
llvm-svn: 321934