AVX-512: Implemented GETEXP instruction for KNL and SKX
Added rounding mode modifier for SQRTPS/PD
Added tests for encoding and intrinsics.
CR:
http://reviews.llvm.org/D9991
llvm-svn: 238923
Intel® Memory Protection Extensions (Intel® MPX) is a new feature in Skylake.
It is a part of KNL and SKX sets. It is also a part of Skylake client.
I added definition of %bnd0 - %bnd3 registers, each register is a pair of 64-bit integers.
llvm-svn: 238916
This patch removes the old X86ISD::FSRL op - which allowed float vectors to use the byte right shift operations (causing a domain switch....).
Since the refactoring of the shuffle lowering code this no longer has any use.
Differential Revision: http://reviews.llvm.org/D10169
llvm-svn: 238906
The first try (r238051) to land this was reverted due to bot failures
that were hopefully addressed by r238788.
This patch adds a TargetRecip class for processing many recip codegen possibilities.
The class is intended to handle both command-line options to llc as well
as options passed in from a front-end such as clang with the -mrecip option.
The x86 backend is updated to use the new functionality.
Only -mcpu=btver2 with -ffast-math should see a functional change from this patch.
All other x86 CPUs continue to *not* use reciprocal estimates by default with -ffast-math.
Differential Revision: http://reviews.llvm.org/D8982
llvm-svn: 238842
This is important because of different addressing modes
depending on the address space for GPU targets.
This only adds the argument, and does not update
any of the uses to provide the correct address space.
llvm-svn: 238723
best approach of each.
For vNi16, we use SHL + ADD + SRL pattern that seem easily the best.
For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases
there is a huge improvement with this in IACA's estimated throughput --
over 2x higher throughput!!!! -- but the measurements are too good to be
true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks
slightly faster, but I'm not sure I believe any of the measurements at
this point. Both are the exact same uops though. Hard to be confident of
anything past that.
If anyone wants to collect very detailed (Agner-level) timings with the
result of this patch, or with the i32 case replaced with SHL + ADD + SHl
+ ADD + SRL, I'd be very interested. Note that you'll need to test it on
both Ivybridge and Haswell, with both SSE3, SSSE3, and AVX selected as
I saw unique behavior in each of these buckets with IACA all of which
should be checked against measured performance.
But this patch is still a useful improvement by dropping duplicate work
and getting the much nicer PSADBW lowering for v2i64.
I'd still like to rephrase this in terms of generic horizontal sum. It's
a bit lame to have a special case of that just for popcount.
llvm-svn: 238652
shorter one. NFC.
In addition to being much shorter to type and requiring fewer arguments,
this change saves over 30 lines from this one file, all wasted on total
boilerplate...
llvm-svn: 238640
shifting vectors of bytes as x86 doesn't have direct support for that.
This removes a bunch of redundant masking in the generated code for SSE2
and SSE3.
In order to avoid the really significant code size growth this would
have triggered, I also factored the completely repeatative logic for
shifting and masking into two lambdas which in turn makes all of this
much easier to read IMO.
llvm-svn: 238637
in-register LUT technique.
Summary:
A description of this technique can be found here:
http://wm.ite.pl/articles/sse-popcount.html
The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
The base implemantion of this, and most of the work, was done by Bruno
in a follow up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.
I have extended Bruno's patch in the following ways:
0) I committed the new tests with baseline sequences so this shows
a diff, and regenerated the tests using the update scripts.
1) Bruno had noticed and mentioned in IRC a redundant mask that
I removed.
2) I introduced a particular optimization for the i32 vector cases where
we use PSHL + PSADBW to compute the the low i32 popcounts, and PSHUFD
+ PSADBW to compute doubled high i32 popcounts. This takes advantage
of the fact that to line up the high i32 popcounts we have to shift
them anyways, and we can shift them by one fewer bit to effectively
divide the count by two. While the PSHUFD based horizontal add is no
faster, it doesn't require registers or load traffic the way a mask
would, and provides more ILP as it happens on different ports with
high throughput.
3) I did some code cleanups throughout to simplify the implementation
logic.
4) I refactored it to continue to use the parallel bitmath lowering when
SSSE3 is not available to preserve the performance of that version on
SSE2 targets where it is still much better than scalarizing as we'll
still do a bitmath implementation of popcount even in scalar code
there.
With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.
With this, we can lower every integer vector popcount implementation
using the LUT strategy if we have SSSE3 or better (and thus have
PSHUFB). I've updated the operation lowering to reflect this. This also
fixes an issue where we were scalarizing horribly some AVX lowerings.
Finally, there are some remaining cleanups. There is duplication between
the two techniques in how they perform the horizontal sum once the byte
population count is computed. I'm going to factor and merge those two in
a separate follow-up commit.
Differential Revision: http://reviews.llvm.org/D10084
llvm-svn: 238636
a separate routine, generalize it to work for all the integer vector
sizes, and do general code cleanups.
This dramatically improves lowerings of byte and short element vector
popcount, but more importantly it will make the introduction of the
LUT-approach much cleaner.
The biggest cleanup I've done is to just force the legalizer to do the
bitcasting we need. We run these iteratively now and it makes the code
much simpler IMO. Other changes were minor, and mostly naming and
splitting things up in a way that makes it more clear what is going on.
The other significant change is to use a different final horizontal sum
approach. This is the same number of instructions as the old method, but
shifts left instead of right so that we can clear everything but the
final sum with a single shift right. This seems likely better than
a mask which will usually have to read the mask from memory. It is
certaily fewer u-ops. Also, this will be temporary. This and the LUT
approach share the need of horizontal adds to finish the computation,
and we have more clever approaches than this one that I'll switch over
to.
llvm-svn: 238635
It turns out that _except_handler3 and _except_handler4 really use the
same stack allocation layout, at least today. They just make different
choices about encoding the LSDA.
This is in preparation for lowering the llvm.eh.exceptioninfo().
llvm-svn: 238627
Small (really small!) C++ exception handling examples work on 32-bit x86
now.
This change disables the use of .seh_* directives in WinException when
CFI is not in use. It also uses absolute symbol references in the tables
instead of imagerel32 relocations.
Also fixes a cache invalidation bug in MMI personality classification.
llvm-svn: 238575
MIOperands/ConstMIOperands are classes iterating over the MachineOperand
of a MachineInstr, however MachineInstr::mop_iterator does the same
thing.
I assume these two iterators exist to have a uniform interface to
iterate over the operands of a machine instruction bundle and a single
machine instruction. However in practice I find it more confusing to have 2
different iterator classes, so this patch transforms (nearly all) the
code to use mop_iterators.
The only exception being MIOperands::anlayzePhysReg() and
MIOperands::analyzeVirtReg() still needing an equivalent, I leave that
as an exercise for the next patch.
Differential Revision: http://reviews.llvm.org/D9932
This version is slightly modified from the proposed revision in that it
introduces MachineInstr::getOperandNo to avoid the extra counting
variable in the few loops that previously used MIOperands::getOperandNo.
llvm-svn: 238539
This moves all the state numbering code for C++ EH to WinEHPrepare so
that we can call it from the X86 state numbering IR pass that runs
before isel.
Now we just call the same state numbering machinery and insert a bunch
of stores. It also populates MachineModuleInfo with information about
the current function.
llvm-svn: 238514
For x86 targets, do not do sibling call optimization when materializing
the callee's address would require a GOT relocation. We can still do
tail calls to internal functions, hidden functions, and protected
functions, because they do not require this kind of relocation. It is
still possible to get GOT relocations when the user explicitly asks for
it with musttail or -tailcallopt, both of which are supposed to
guarantee TCO.
Based on a patch by Chih-hung Hsieh.
Reviewers: srhines, timmurray, danalbert, enh, void, nadav, rnk
Subscribers: joerg, davidxl, llvm-commits
Differential Revision: http://reviews.llvm.org/D9799
llvm-svn: 238487
With this patch the x86 backend is now shrink-wrapping capable
and this functionality can be tested by using the
-enable-shrink-wrap switch.
The next step is to make more test and enable shrink-wrapping by
default for x86.
Related to <rdar://problem/20821487>
llvm-svn: 238293
This gets gas and llc -filetype=obj to agree on the order of prefixes.
For llvm-mc we need to fix the asm parser to know that it makes a difference
on which line the "lock" is in.
Part of pr23594.
llvm-svn: 238232
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first several times this was committed (e.g. r229831, r233055), it caused several buildbot failures.
Apparently the reason for most failures was both clang and gcc's inability to deal with large numbers (> 10K) of bitset constructor calls in tablegen-generated initializers of instruction info tables.
This should now be fixed.
llvm-svn: 238192
Part of D9474, this patch extends AVX2 v16i16 types to 2 x 8i32 vectors and uses i32 shift variable shifts before packing back to i16.
Adds AVX2 tests for v8i16 and v16i16
llvm-svn: 238149
The semantics of the scalar FMA intrinsics are that the high vector elements are copied from the first source.
The existing pattern switches src1 and src2 around, to match the "213" order, which ends up tying the original src2 to the dest. Since the actual scalar fma3 instructions copy the high elements from the dest register, the wrong values are copied.
This modifies the pattern to leave src1 and src2 in their original order.
Differential Revision: http://reviews.llvm.org/D9908
llvm-svn: 238131
This patch adds a class for processing many recip codegen possibilities.
The TargetRecip class is intended to handle both command-line options to llc as well
as options passed in from a front-end such as clang with the -mrecip option.
The x86 backend is updated to use the new functionality.
Only -mcpu=btver2 with -ffast-math should see a functional change from this patch.
All other CPUs continue to *not* use reciprocal estimates by default with -ffast-math.
Differential Revision: http://reviews.llvm.org/D8982
llvm-svn: 238051
The problem was that I slipped a change required for shrink-wrapping, namely I
used getFirstTerminator instead of the getLastNonDebugInstr that was here before
the refactoring, whereas the surrounding code is not yet patched for that.
Original message:
[X86] Refactor the prologue emission to prepare for shrink-wrapping.
- Add a late pass to expand pseudo instructions (tail call and EH returns).
Instead of doing it in the prologue emission.
- Factor some static methods in X86FrameLowering to ease code sharing.
NFC.
Related to <rdar://problem/20821487>
llvm-svn: 238035
Revert "[X86] Refactor the prologue emission to prepare for shrink-wrapping."
This reverts commit 6b3b93fc8b68a2c806aa992ee4bd3d7f61898d4b.
This reverts commit ab0b15dff8539826283a59c2dd700a18a9680e0f.
llvm-svn: 238011
- Add a late pass to expand pseudo instructions (tail call and EH returns).
Instead of doing it in the prologue emission.
- Factor some static methods in X86FrameLowering to ease code sharing.
NFC.
Related to <rdar://problem/20821487>
llvm-svn: 237977
This starts merging MCSection and MCSectionData.
There are a few issues with the current split between MCSection and
MCSectionData.
* It optimizes the the not as important case. We want the production
of .o files to be really fast, but the split puts the information used
for .o emission in a separate data structure.
* The ELF/COFF/MachO hierarchy is not represented in MCSectionData,
leading to some ad-hoc ways to represent the various flags.
* It makes it harder to remember where each item is.
The attached patch starts merging the two by moving the alignment from
MCSectionData to MCSection.
Most of the patch is actually just dropping 'const', since
MCSectionData is mutable, but MCSection was not.
llvm-svn: 237936
Predicate UseAVX depricates pattern selection on AVX-512.
This predicate is necessary for DAG selection to select EVEX form.
But mapping SSE intrinsics to AVX-512 instructions is not ready yet.
So I replaced UseAVX with HasAVX for intrinsics patterns.
llvm-svn: 237903
This patch improves support for sign extension of the lower lanes of vectors of integers by making use of the SSE41 pmovsx* sign extension instructions where possible, and optimizing the sign extension by shifts on pre-SSE41 targets (avoiding the use of i64 arithmetic shifts which require scalarization).
It converts SIGN_EXTEND nodes to SIGN_EXTEND_VECTOR_INREG where necessary, that more closely matches the pmovsx* instruction than the default approach of using SIGN_EXTEND_INREG which splits the operation (into an ANY_EXTEND lowered to a shuffle followed by shifts) making instruction matching difficult during lowering. Necessary support for SIGN_EXTEND_VECTOR_INREG has been added to the DAGCombiner.
Differential Revision: http://reviews.llvm.org/D9848
llvm-svn: 237885
fixed extract-insert i1 element,
load i1, zextload i1 should be with "and $1, %reg" to prevent loading garbage.
added a bunch of new tests.
llvm-svn: 237793
This changes the ABI used on 32-bit x86 for passing vector arguments.
Historically, clang passes the first 4 vector arguments in-register, and additional vector arguments on the stack, regardless of platform. That is different from the behavior of gcc, icc, and msvc, all of which pass only the first 3 arguments in-register.
The 3-register convention is documented, unofficially, in Agner's calling convention guide, and, officially, in the recently released version 1.0 of the i386 psABI.
Darwin is kept as is because the OS X ABI Function Call Guide explicitly documents the current (4-register) behavior.
This fixes PR21510
Differential revision: http://reviews.llvm.org/D9644
llvm-svn: 237682
This reverts commit r237210.
Also fix X86/complex-fca.ll to match the code that we used to generate
on win32 and now generate everwhere to conform to SysV.
llvm-svn: 237639
This was previously returning int. However there are no negative opcode
numbers and more importantly this was needlessly different from
MCInstrDesc::getOpcode() (which even is the value returned here) and
SDValue::getOpcode()/SDNode::getOpcode().
llvm-svn: 237611
instructions. These intrinsics are comming with rounding mode.
Added intrinsics for MAXSS/D, MINSS/D - with and without sae.
By Asaf Badouh (asaf.badouh@intel.com)
llvm-svn: 237560
Summary:
But still handle them the same way since I don't know how they differ on
this target.
Of these, 'o' and 'v' are not tested but were already implemented.
I'm not sure why 'i' is required for X86 since it's supposed to be an
immediate constraint rather than a memory constraint. A test asserts
without it so I've included it for now.
No functional change intended.
Reviewers: nadav
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D8254
llvm-svn: 237517
Other pieces of CodeGen want to negate frame object offsets to account
for architectures where the stack grows down. Our object is a pseudo
object so it's offset doesn't matter. However, we shouldn't choose an
offset which results in undefined behavior if you negate it.
llvm-svn: 237474
i1 type is a legal type on AVX-512 and can be passed as parameter or return value.
i1 is promoted to i8 on return and to i32 for call arguments (i8 is also promoted to i32 here).
The result code is similar to the previous X86 targets, where i1 is allways promoted to i8.
llvm-svn: 237350
Previously, subtarget features were a bitfield with the underlying type being uint64_t.
Since several targets (X86 and ARM, in particular) have hit or were very close to hitting this bound, switching the features to use a bitset.
No functional change.
The first two times this was committed (r229831, r233055), it caused several buildbot failures.
At least some of the ARM and MIPS ones were due to gcc/binutils issues, and should now be fixed.
llvm-svn: 237234
Summary:
This change adds two new parameters to the statepoint intrinsic, `i64 id`
and `i32 num_patch_bytes`. `id` gets propagated to the ID field
in the generated StackMap section. If the `num_patch_bytes` is
non-zero then the statepoint is lowered to `num_patch_bytes` bytes of
nops instead of a call (the spill and reload code remains unchanged).
A non-zero `num_patch_bytes` is useful in situations where a language
runtime requires complete control over how a call is lowered.
This change brings statepoints one step closer to patchpoints. With
some additional work (that is not part of this patch) it should be
possible to get rid of `TargetOpcode::STATEPOINT` altogether.
PlaceSafepoints generates `statepoint` wrappers with `id` set to
`0xABCDEF00` (the old default value for the ID reported in the stackmap)
and `num_patch_bytes` set to `0`. This can be made more sophisticated
later.
Reviewers: reames, pgavlin, swaroop.sridhar, AndyAyers
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9546
llvm-svn: 237214
Summary:
This rule was always in the old SysV i386 ABI docs and the new ones that
H.J. Lu has put together, but we never noticed:
EAX scratch register; also used to return integer and pointer values
from functions; also stores the address of a returned struct or union
Fixes PR23491.
Reviewers: majnemer
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9715
llvm-svn: 237175
The other changes in the LowerShift() are not functional,
just to make the code more convenient.
So, the functional changes for SKX only.
llvm-svn: 237129
Before revision 171146, function 'PerformTruncateCombine' used to perform
a premature lowering of TRUNCATE dag nodes.
Revision 171146 then moved all the logic implemented by PerformTruncateCombine
to a custom lowering hook. However, that revision forgot to delete
function PerformTruncateCombine from the code.
This patch removes function 'PerformTruncateCombine' since it has no effect
on the SelectionDAG. No functional change intended.
llvm-svn: 237122
The X86-specific DAGCombine for stores should not assume vector types are always simple.
This fixes PR23476.
Differential Revision: http://reviews.llvm.org/D9659
llvm-svn: 237097
to use the information in the module rather than TargetOptions.
We've had and clang has used the use-soft-float attribute for some
time now so have the backends set a subtarget feature based on
a particular function now that subtargets are created based on
functions and function attributes.
For the one middle end soft float check go ahead and create
an overloadable TargetLowering::useSoftFloat function that
just checks the TargetSubtargetInfo in all cases.
Also remove the command line option that hard codes whether or
not soft-float is set by using the attribute for all of the
target specific test cases - for the generic just go ahead and
add the attribute in the one case that showed up.
llvm-svn: 237079
Summary:
r235215 adds support for f16 to be considered as a load/store type and
promote f16 operations to f32.
This patch has miscellaneous fixes for the X86 backend so all f16
operations are handled:
1. Set loadextaction for f16 vectors to expand.
2. Handle FP_EXTEND in a switch statement when handling v2f32
3. Do not fold (FP_TO_SINT (load f16)) into FP_TO_INT*_IN_MEM or
(store (SINT_TO_FP )) to a FILD.
Tests included.
Reviewers: ab, srhines, delena
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D9092
llvm-svn: 237004
The code that builds the dependence graph assumes that two PseudoSourceValues
don't alias. In a tail calling function two FixedStackObjects might refer to the
same location. Worse 'immutable' fixed stack objects like function arguments are
not immutable and will be clobbered.
Change this so that a load from a FixedStackObject is not invariant in a tail
calling function and don't return a PseudoSourceValue for an instruction in tail
calling functions when building the dependence graph so that we handle function
arguments conservatively.
Fix for PR23459.
rdar://20740035
llvm-svn: 236916
A trunc from i32 to i1 on x86_64 generates an instruction such as
%vreg19<def> = COPY %vreg9:sub_8bit<kill>; GR8:%vreg19 GR32:%vreg9
However, the copy here should only have the kill flag on the 32-bit path, not the 64-bit one.
Otherwise, we are killing the source of the truncate which could be used later in the program.
llvm-svn: 236890
This changes the shape of the statepoint intrinsic from:
@llvm.experimental.gc.statepoint(anyptr target, i32 # call args, i32 unused, ...call args, i32 # deopt args, ...deopt args, ...gc args)
to:
@llvm.experimental.gc.statepoint(anyptr target, i32 # call args, i32 flags, ...call args, i32 # transition args, ...transition args, i32 # deopt args, ...deopt args, ...gc args)
This extension offers the backend the opportunity to insert (somewhat) arbitrary code to manage the transition from GC-aware code to code that is not GC-aware and back.
In order to support the injection of transition code, this extension wraps the STATEPOINT ISD node generated by the usual lowering lowering with two additional nodes: GC_TRANSITION_START and GC_TRANSITION_END. The transition arguments that were passed passed to the intrinsic (if any) are lowered and provided as operands to these nodes and may be used by the backend during code generation.
Eventually, the lowering of the GC_TRANSITION_{START,END} nodes should be informed by the GC strategy in use for the function containing the intrinsic call; for now, these nodes are instead replaced with no-ops.
Differential Revision: http://reviews.llvm.org/D9501
llvm-svn: 236888
The function 'getTargetShuffleMask' already knows how to deal with PSHUFB nodes
where the mask node is a load from constant pool, and the constant pool node
is wrapped by a X86ISD::Wrapper node. This patch extends that logic by teaching
it how to also look through X86ISD::WrapperRIP.
This helps function combineX86ShufflesRecusively to combine more shuffle
sequences containing PSHUFB nodes if we are in RIPRel PIC mode.
Before this change, llc (with -relocation-model=pic -march=x86-64) was unable
to decode a pshufb where the mask was loaded from a constant pool. For example,
the no-op shuffle from test 'x86-fold-pshufb.ll' was not folded into its
operand, so instead of generating a single 'movaps' the backend always
generated a sub-optimal 'movdqa + pshufb' sequence.
Added test x86-fold-pshufb.ll.
llvm-svn: 236863
This is a follow-on to r236740 where I took Andrea's advice
in D9504 to remove a redundant pattern...except that I removed
the wrong pattern!
AFAICT, there is no change in the final code produced because
subsequent passes would clean up the extra instructions created
by the more complicated pattern.
llvm-svn: 236743
Finish the job that was abandoned in D6958 following the refactoring in
http://reviews.llvm.org/rL230221:
1. Uncomment the intrinsic def for the AVX r_Int instruction.
2. Add missing r_Int entries to the load folding tables; there are already
tests that check these in "test/Codegen/X86/fold-load-unops.ll", so I
haven't added any more in this patch.
3. Add patterns to solve PR21507 ( https://llvm.org/bugs/show_bug.cgi?id=21507 ).
So instead of this:
movaps %xmm0, %xmm1
rcpss %xmm1, %xmm1
movss %xmm1, %xmm0
We should now get:
rcpss %xmm0, %xmm0
And instead of this:
vsqrtss %xmm0, %xmm0, %xmm1
vblendps $1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]
We should now get:
vsqrtss %xmm0, %xmm0, %xmm0
Differential Revision: http://reviews.llvm.org/D9504
llvm-svn: 236740
Added intrinsics for the instructions. CC parameter of the intrinsics was changed from i8 to i32 according to the spec.
By Igor Breger (igor.breger@intel.com)
llvm-svn: 236714
When folding a load in to another instruction, we need to fix the class of the index register
Otherwise, it could be something like GR64 not GR64_NOSP and would fail the machine verifier.
llvm-svn: 236644
The patch disabled unrolling in loop vectorization pass when VF==1 on x86 architecture,
by setting MaxInterleaveFactor to 1. Unrolling in loop vectorization pass may introduce
the cost of overflow check, memory boundary check and extra prologue/epilogue code when
regular unroller will unroll the loop another time. Disable it when VF==1 remove the
unnecessary cost on x86. The same can be done for other platforms after verifying
interleaving/memory bound checking to be not perf critical on those platforms.
Differential Revision: http://reviews.llvm.org/D9515
llvm-svn: 236613
This patch adds the minimum plumbing necessary to use IR-level
fast-math-flags (FMF) in the backend without actually using
them for anything yet. This is a follow-on to:
http://reviews.llvm.org/rL235997
...which split the existing nsw / nuw / exact flags and FMF
into their own struct.
There are 2 structural changes here:
1. The main diff is that we're preparing to extend the optimization
flags to affect more than just binary SDNodes. Eg, IR intrinsics
( https://llvm.org/bugs/show_bug.cgi?id=21290 ) or non-binop nodes
that don't even exist in IR such as FMA, FNEG, etc.
2. The other change is that we're actually copying the FP fast-math-flags
from the IR instructions to SDNodes.
Differential Revision: http://reviews.llvm.org/D8900
llvm-svn: 236546
This reverts commit r236360.
This change exposed a bug in WinEHPrepare by opting win32 code into EH
preparation. We already knew that WinEHPrepare has bugs, and is the
status quo for x64, so I don't think that's a reason to hold off on this
change. I disabled exceptions in the sanitizer tests in r236505 and an
earlier revision.
llvm-svn: 236508
This patch introduces a new pass that computes the safe point to insert the
prologue and epilogue of the function.
The interest is to find safe points that are cheaper than the entry and exits
blocks.
As an example and to avoid regressions to be introduce, this patch also
implements the required bits to enable the shrink-wrapping pass for AArch64.
** Context **
Currently we insert the prologue and epilogue of the method/function in the
entry and exits blocks. Although this is correct, we can do a better job when
those are not immediately required and insert them at less frequently executed
places.
The job of the shrink-wrapping pass is to identify such places.
** Motivating example **
Let us consider the following function that perform a call only in one branch of
a if:
define i32 @f(i32 %a, i32 %b) {
%tmp = alloca i32, align 4
%tmp2 = icmp slt i32 %a, %b
br i1 %tmp2, label %true, label %false
true:
store i32 %a, i32* %tmp, align 4
%tmp4 = call i32 @doSomething(i32 0, i32* %tmp)
br label %false
false:
%tmp.0 = phi i32 [ %tmp4, %true ], [ %a, %0 ]
ret i32 %tmp.0
}
On AArch64 this code generates (removing the cfi directives to ease
readabilities):
_f: ; @f
; BB#0:
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
LBB0_2: ; %false
mov sp, x29
ldp x29, x30, [sp], #16
ret
With shrink-wrapping we could generate:
_f: ; @f
; BB#0:
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
add sp, x29, #16 ; =16
ldp x29, x30, [sp], #16
LBB0_2: ; %false
ret
Therefore, we would pay the overhead of setting up/destroying the frame only if
we actually do the call.
** Proposed Solution **
This patch introduces a new machine pass that perform the shrink-wrapping
analysis (See the comments at the beginning of ShrinkWrap.cpp for more details).
It then stores the safe save and restore point into the MachineFrameInfo
attached to the MachineFunction.
This information is then used by the PrologEpilogInserter (PEI) to place the
related code at the right place. This pass runs right before the PEI.
Unlike the original paper of Chow from PLDI’88, this implementation of
shrink-wrapping does not use expensive data-flow analysis and does not need hack
to properly avoid frequently executed point. Instead, it relies on dominance and
loop properties.
The pass is off by default and each target can opt-in by setting the
EnableShrinkWrap boolean to true in their derived class of TargetPassConfig.
This setting can also be overwritten on the command line by using
-enable-shrink-wrap.
Before you try out the pass for your target, make sure you properly fix your
emitProlog/emitEpilog/adjustForXXX method to cope with basic blocks that are not
necessarily the entry block.
** Design Decisions **
1. ShrinkWrap is its own pass right now. It could frankly be merged into PEI but
for debugging and clarity I thought it was best to have its own file.
2. Right now, we only support one save point and one restore point. At some
point we can expand this to several save point and restore point, the impacted
component would then be:
- The pass itself: New algorithm needed.
- MachineFrameInfo: Hold a list or set of Save/Restore point instead of one
pointer.
- PEI: Should loop over the save point and restore point.
Anyhow, at least for this first iteration, I do not believe this is interesting
to support the complex cases. We should revisit that when we motivating
examples.
Differential Revision: http://reviews.llvm.org/D9210
<rdar://problem/3201744>
llvm-svn: 236507
Removed code that was replicating v8i16 'shift + mask' implementation that is done more nicely by making use of LowerScalarImmediateShift
llvm-svn: 236388
This pass is responsible for constructing the EH registration object
that gets linked into fs:00, which is all it does in this change. In the
future, it will also insert stores to update the EH state number.
I considered keeping this functionality in WinEHPrepare, but it's pretty
separable and X86 specific. It has conceptually very little to do with
the task of WinEHPrepare, which is currently outlining. WinEHPrepare is
also in theory useful on ARM, but this logic is pretty x86 specific.
Reviewers: andrew.w.kaylor, majnemer
Differential Revision: http://reviews.llvm.org/D9422
llvm-svn: 236339