Summary: The lib paths are not correctly picked up for OpenEmbedded sysroots
(like arm-oe-linux-gnueabi). I fix this in a follow-up clang patch. But in
order to add the correct libs I need to detect if the vendor is oe. For this
reason, it is first necessary to teach llvm to detect oe vendor, which is what
this patch does.
Reviewers: chandlerc, compnerd, rengolin, javed.absar
Reviewed By: compnerd
Subscribers: kristof.beyls, dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D48861
llvm-svn: 336401
Summary:
If LOCK prefix is not the first prefix in an instruction, LLVM
disassembler silently drops the prefix.
The fix is to select a proper instruction with a builtin LOCK prefix if
one exists.
Reviewers: craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D49001
llvm-svn: 336400
Summary:
CompileOnDemandLayer.cpp uses function in these libraries, and builds
with `-DSHARED_LIB=ON` fail without this.
Reviewers: lhames
Subscribers: mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D48995
llvm-svn: 336389
a deficiency in TableGen that has been addressed in r336334.
[AArch64][SVE] Asm: Support for predicated FP rounding instructions.
This patch also adds instructions for predicated FP square-root and
reciprocal exponent.
The added instructions are:
- FRINTI Round to integral value (current FPCR rounding mode)
- FRINTX Round to integral value (current FPCR rounding mode, signalling inexact)
- FRINTA Round to integral value (to nearest, with ties away from zero)
- FRINTN Round to integral value (to nearest, with ties to even)
- FRINTZ Round to integral value (toward zero)
- FRINTM Round to integral value (toward minus Infinity)
- FRINTP Round to integral value (toward plus Infinity)
- FSQRT Floating-point square root
- FRECPX Floating-point reciprocal exponent
llvm-svn: 336387
writing them to a buffer and re-loading them.
Also introduces a multithreaded variant of SimpleCompiler
(MultiThreadedSimpleCompiler) for compiling IR concurrently on multiple
threads.
These changes are required to JIT IR on multiple threads correctly.
No test case yet. I will be looking at how to modify LLI / LLJIT to test
multithreaded JIT support soon.
llvm-svn: 336385
Better NaN handling for AMDGCN fmed3.
All operands are checked for NaN now. The checks
were moved before the canonicalization to provide
a better mapping from fclamp. Changed the behaviour
of fmed3(x,y,NaN) to return max(x,y) instead of
min(x,y) in light of this. Updated tests as a result
and added some new cases to cover the fix.
Patch by Alan Baker
llvm-svn: 336375
Avoid using allocateKernArg / AssignFn. We do not want any
of the type splitting properties of normal calling convention
lowering.
For now at least this exists alongside the IR argument lowering
pass. This is necessary to handle struct padding correctly while
some arguments are still skipped by the IR argument lowering
pass.
llvm-svn: 336373
Map the following instructions to the proper float128 lib calls:
pow[i], exp[2], log[2|10], sin, cos, fmin, fmax
Differential Revision: https://reviews.llvm.org/D48544
llvm-svn: 336361
This is an early step towards matching Instructions by attributes other than the opcode. This will be necessary for cast/call alternates which share the same opcode but have different types/intrinsicIDs etc. - which we could vectorize as long as we split them using the alternate mechanism.
Differential Revision: https://reviews.llvm.org/D48945
llvm-svn: 336344
Wait states are not properly being inserted after buffer_store for v_interp instructions.
Add VALU to V_INTERP instructions so that the GCNHazardRecognizer can
check and insert the appropriate wait states when needed.
Differential Revision: https://reviews.llvm.org/D48772
Change-Id: Id540c9b074fc69b5c1de6b182276aa089c74aa64
llvm-svn: 336339
Similar to PR/25526, fast-regalloc introduces spills at the end of basic
blocks. When this occurs in between an ll and sc, the stores can cause the
atomic sequence to fail.
This patch fixes the issue by introducing more pseudos to represent atomic
operations and moving their lowering to after the expansion of postRA
pseudos.
This version addresses issues with the initial implementation and covers
all atomic operations.
This resolves PR/32020.
Thanks to James Cowgill for reporting the issue!
Patch By: Simon Dardis
Differential Revision: https://reviews.llvm.org/D31287
llvm-svn: 336328
This patch also adds instructions for predicated FP square-root and
reciprocal exponent.
The added instructions are:
- FRINTI Round to integral value (current FPCR rounding mode)
- FRINTX Round to integral value (current FPCR rounding mode, signalling inexact)
- FRINTA Round to integral value (to nearest, with ties away from zero)
- FRINTN Round to integral value (to nearest, with ties to even)
- FRINTZ Round to integral value (toward zero)
- FRINTM Round to integral value (toward minus Infinity)
- FRINTP Round to integral value (toward plus Infinity)
- FSQRT Floating-point square root
- FRECPX Floating-point reciprocal exponent
llvm-svn: 336322
We were miscompiling i8 loads, so reject them as unsupported narrow operations
for now.
Differential Revision: https://reviews.llvm.org/D48944
llvm-svn: 336319
Optimize code sequences for integer conversion to fp128 when the integer is a result of:
* float->int
* float->long
* double->int
* double->long
Differential Revision: https://reviews.llvm.org/D48429
llvm-svn: 336316
The alignment specified by a constant for the field
`BumpPointerAllocator::InitialBuffer` exceeded the alignment
guaranteed by `malloc` and `new` on Windows. This change set
the alignment value to that of `long double`, which is defined
by the used platform.
It fixes https://bugs.llvm.org/show_bug.cgi?id=37944.
Differential Revision: https://reviews.llvm.org/D48889
llvm-svn: 336311
Non-homogenous aggregates are passed in consecutive GPRs, in GPRs and in memory,
or in memory. This patch ensures that float128 members of non-homogenous
aggregates are passed via VSX registers.
This is done via custom lowering a bitcast of a build_pari(i64,i64) to float128
to a new PPCISD node, BUILD_FP128.
Differential Revision: https://reviews.llvm.org/D48308
llvm-svn: 336310
Legalize and emit code for quad-precision floating point operation conversion of
single-precision value to quad-precision.
Differential Revision: https://reviews.llvm.org/D47569
llvm-svn: 336307
This patch enable parameter passing and return by value for float128 types.
Passing aggregate/union which contain float128 members will be submitted in
subsequent patches.
Differential Revision: https://reviews.llvm.org/D47552
llvm-svn: 336306
We have patterns for SELECTS that top at v1i1 and we have a pattern for (v1i1 (scalar_to_vector GR8)). The patterns being removed here do the same thing as the two other patterns combined so there is no need for them.
llvm-svn: 336305
Previously we could only negate the FMADD opcodes. This used to be mostly ok when we lowered FMA intrinsics during lowering. But with the move to llvm.fma from target specific intrinsics, we can combine (fneg (fma)) to (fmsub) earlier. So if we start with (fneg (fma (fneg))) we would get stuck at (fmsub (fneg)).
This patch fixes that so we can also combine things like (fmsub (fneg)).
llvm-svn: 336304
There's a regression in here due to inability to combine fneg inputs of X86ISD::FMSUB/FNMSUB/FNMADD nodes.
More removals to come, but I wanted to stop and fix the regression that showed up in this first.
llvm-svn: 336303
Legalize and emit code for round & convert float128 to double precision and
single precision.
Differential Revision: https://reviews.llvm.org/D46997
llvm-svn: 336299
CRC and GINV ASE require revision 6, Virtualization requires revision 5.
Print a warning when revision is older than required.
Differential Revision: https://reviews.llvm.org/D48843
llvm-svn: 336296
We want to run the Machine Scheduler instead of the List Scheduler after RA.
Checked with a performance run on a Power 9 machine with SPEC 2006 and while
some benchmarks improved and others degraded the geomean was slightly improved
with the Machine Scheduler.
Differential Revision: https://reviews.llvm.org/D45265
llvm-svn: 336295
We have bailout hacks based on min/max in various places in instcombine
that shouldn't be necessary. The affected test was added for:
D48930
...which is a consequence of the improvement in:
D48584 (https://reviews.llvm.org/rL336172)
I'm assuming the visitTrunc bailout in this patch was added specifically
to avoid a change from SimplifyDemandedBits, so I'm just moving that
below the EvaluateInDifferentType optimization. A narrow min/max is still
a min/max.
llvm-svn: 336293
Support for negative immediates was implemented in
https://reviews.llvm.org/rL298380, however few instruction options were missing.
This change adds negative immediates support and respective tests
for the following:
ADD
ADDS
ADDS.W
AND.W
ANDS
BIC.W
BICS
BICS.W
SUB
SUBS
SUBS.W
Differential Revision: https://reviews.llvm.org/D48649
llvm-svn: 336286
This patch adds both a vector and an immediate form, e.g.
- Vector form:
subr z0.h, p0/m, z0.h, z1.h
subtract active elements of z0 from z1, and store the result in z0.
- Immediate form:
subr z0.h, z0.h, #255
subtract elements of z0, and store the result in z0.
llvm-svn: 336274
When creating `phi` instructions to resume at the scalar part of the loop,
copy the DebugLoc from the original phi over to the new one.
Differential Revision: https://reviews.llvm.org/D48769
llvm-svn: 336256
When zext is EvaluatedInDifferentType, InstCombine
drops the dbg.value intrinsic. This patch tries to
preserve said DI, by inserting the zext's old DI in the
resulting instruction. (Only for integer type for now)
Differential Revision: https://reviews.llvm.org/D48331
llvm-svn: 336254
We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case.
Reapplied with a fixed (extra null tests) version of rL336113 after reversion in rL336189 - extra test case added at rL336247.
llvm-svn: 336250
SVE overloads the AArch64 PSTATE condition flags and introduces
a set of condition code aliases for the assembler. The
details are described in section 2.2 of the architecture
reference manual supplement for SVE.
In short:
SVE alias => AArch64 name
--------------------------
NONE => EQ
ANY => NE
NLAST => HS
LAST => LO
FIRST => MI
NFRST => PL
PMORE => HI
PLAST => LS
TCONT => GE
TSTOP => LT
Reviewers: rengolin, fhahn, SjoerdMeijer, samparker, javed.absar
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D48869
llvm-svn: 336245
The following code pattern:
mov %rax, %rcx
test %rax, %rax
%rax = ....
je throw_npe
mov(%rcx), %r9
mov(%rax), %r10
gets transformed into the following incorrect code after implicit null check pass:
mov %rax, %rcx
%rax = ....
faulting_load_op("movl (%rax), %r10", throw_npe)
mov(%rcx), %r9
For implicit null check pass, if the register that is checked for null value (ie, the register used in the 'test' instruction) is written into before the condition jump, we should avoid doing the optimization.
Patch by Surya Kumari Jangala!
Differential Revision: https://reviews.llvm.org/D48627
Reviewed By: skatkov
llvm-svn: 336241
This patch adds a new token type specifically for (%dx). We will now always create this token when we parse (%dx). After all operands have been parsed, if the mnemonic is in/out we'll morph this token to a regular register token. Otherwise we keep it as the special DX token which won't match any instructions.
This removes the need for passing Mnemonic through the parsing functions. It also seems closer to gas where when its used on the wrong instruction it just gets diagnosed as an invalid operand rather than a bad memory address.
llvm-svn: 336218
This might make the error message added in r335668 unneeded, but I'm not sure yet.
The check for RIP is technically unnecessary since RIP is in GR64, but that fact is kind of surprising so be explicit.
llvm-svn: 336217
As the test diffs show, the current users of getBinOpIdentity()
are InstCombine and Reassociate. SLP vectorizer is a candidate
for using this functionality too (D28907).
The InstCombine shuffle improvements are part of the planned
enhancements noted in D48830.
InstCombine actually has several other uses of getBinOpIdentity()
via SimplifyUsingDistributiveLaws(), but we don't call that for
any FP ops. Fixing that might be another part of removing the
custom reassociation in InstCombine that is only done for fadd+fmul.
llvm-svn: 336215
r336120 resulted in falling back to SelectionDAG more often due to the G_STORE
MMOs not matching the vreg size. This fixes that by explicitly any-extending the
value.
llvm-svn: 336209
Unpredicated FP-multiply of SVE vector with a vector-element given by
vector[index], for example:
fmul z0.s, z1.s, z2.s[0]
which performs an unpredicated FP-multiply of all 32-bit elements in
'z1' with the first element from 'z2'.
This patch adds restricted register classes for SVE vectors:
ZPR_3b (only z0..z7 are allowed) - for indexed vector of 16/32-bit elements.
ZPR_4b (only z0..z15 are allowed) - for indexed vector of 64-bit elements.
Reviewers: rengolin, fhahn, SjoerdMeijer, samparker, javed.absar
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D48823
llvm-svn: 336205
This is the last significant change suggested in PR37806:
https://bugs.llvm.org/show_bug.cgi?id=37806#c5
...though there are several follow-ups noted in the code comments
in this patch to complete this transform.
It's possible that a binop feeding a select-shuffle has been eliminated
by earlier transforms (or the code was just written like this in the 1st
place), so we'll fail to match the patterns that have 2 binops from:
D48401,
D48678,
D48662,
D48485.
In that case, we can try to materialize identity constants for the remaining
binop to fill in the "ghost" lanes of the vector (where we just want to pass
through the original values of the source operand).
I added comments to ConstantExpr::getBinOpIdentity() to show planned follow-ups.
For now, we only handle the 5 commutative integer binops (add/mul/and/or/xor).
Differential Revision: https://reviews.llvm.org/D48830
llvm-svn: 336196
With a view to support parallel operations that have their results
stored to memory, refactor the consecutive access helper out so it
could support stores instructions.
Differential Revision: https://reviews.llvm.org/D48872
llvm-svn: 336195
This adds the following system registers:
- RAS registers,
- MPAM registers,
- Activitiy monitor registers,
- Trace Extension registers,
- Timing insensitivity of data processing instructions,
- Enhanced Support for Nested Virtualization.
Differential Revision: https://reviews.llvm.org/D48871
llvm-svn: 336193
Summary:
When salvaging a dbg.declare/dbg.addr we should not add
DW_OP_stack_value to the DIExpression
(see test/Transforms/InstCombine/salvage-dbg-declare.ll).
Consider this example
%vla = alloca i32, i64 2
call void @llvm.dbg.declare(metadata i32* %vla, metadata !1, metadata !DIExpression())
Instcombine will turn it into
%vla1 = alloca [2 x i32]
%vla1.sub = getelementptr inbounds [2 x i32], [2 x i32]* %vla, i64 0, i64 0
call void @llvm.dbg.declare(metadata [2 x i32]* %vla1.sub, metadata !19, metadata !DIExpression())
If the GEP can be eliminated, then the dbg.declare will be salvaged
and we should get
%vla1 = alloca [2 x i32]
call void @llvm.dbg.declare(metadata [2 x i32]* %vla1, metadata !19, metadata !DIExpression())
The problem was that salvageDebugInfo did not recognize dbg.declare
as being indirect (%vla1 points to the value, it does not hold the
value), so we incorrectly got
call void @llvm.dbg.declare(metadata [2 x i32]* %vla1, metadata !19, metadata !DIExpression(DW_OP_stack_value))
I also made sure that llvm::salvageDebugInfo and
DIExpression::prependOpcodes do not add DW_OP_stack_value to
the DIExpression in case no new operands are added to the
DIExpression. That way we avoid to, unneccessarily, turn a
register location expression into an implicit location expression
in some situations (see test11 in test/Transforms/LICM/sinking.ll).
Reviewers: aprantl, vsk
Reviewed By: aprantl, vsk
Subscribers: JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D48837
llvm-svn: 336191
Lower more than 4 arguments using stack. This patch targets MIPS32.
It supports only functions with arguments of type i32.
Patch by Petar Avramovic.
Differential Revision: https://reviews.llvm.org/D47934
llvm-svn: 336185
unswitching loops.
Original patch trying to address this was sent in D47624, but that
didn't quite handle things correctly. There are two key principles used
to select whether and how to invalidate SCEV-cached information about
loops:
1) We must invalidate any info SCEV has cached before unswitching as we
may change (or destroy) the loop structure by the act of unswitching,
and make it hard to recover everything we want to invalidate within
SCEV.
2) We need to invalidate all of the loops whose CFGs are mutated by the
unswitching. Notably, this isn't the *entire* loop nest, this is
every loop contained by the outermost loop reached by an exit block
relevant to the unswitch.
And we need to do this even when doing trivial unswitching.
I've added more focused tests that directly check that SCEV starts off
with imprecise information and after unswitching (and simplifying
instructions) re-querying SCEV will produce precise information. These
tests also specifically work to check that an *outer* loop's information
becomes precise.
However, the testing here is still a bit imperfect. Crafting test cases
that reliably fail to be analyzed by SCEV before unswitching and succeed
afterward proved ... very, very hard. It took me several hours and
careful work to build these, and I'm not optimistic about necessarily
coming up with more to cover more elaborate possibilities. Fortunately,
the code pattern we are testing here in the pass is really
straightforward and reliable.
Thanks to Max Kazantsev for the initial work on this as well as the
review, and to Hal Finkel for helping me talk through approaches to test
this stuff even if it didn't come to much.
Differential Revision: https://reviews.llvm.org/D47624
llvm-svn: 336183
This patch changes order of transform in InstCombineCompares to avoid
performing transforms based on ranges which produce complex bit arithmetics
before more simple things (like folding with constants) are done. See PR37636
for the motivating example.
Differential Revision: https://reviews.llvm.org/D48584
Reviewed By: spatel, lebedev.ri
llvm-svn: 336172
Summary:
This patch is the first in a series of patches related to the [[ http://lists.llvm.org/pipermail/llvm-dev/2018-June/123883.html | RFC - A new dominator tree updater for LLVM ]].
This patch introduces the DomTreeUpdater class, which provides a cleaner API to perform updates on available dominator trees (none, only DomTree, only PostDomTree, both) using different update strategies (eagerly or lazily) to simplify the updating process.
—Prior to the patch—
- Directly calling update functions of DominatorTree updates the data structure eagerly while DeferredDominance does updates lazily.
- DeferredDominance class cannot be used when a PostDominatorTree also needs to be updated.
- Functions receiving DT/DDT need to branch a lot which is currently necessary.
- Functions using both DomTree and PostDomTree need to call the update function separately on both trees.
- People need to construct an additional DeferredDominance class to use functions only receiving DDT.
—After the patch—
Patch by Chijun Sima <simachijun@gmail.com>.
Reviewers: kuhar, brzycki, dmgreen, grosser, davide
Reviewed By: kuhar, brzycki
Author: NutshellySima
Subscribers: vsk, mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D48383
llvm-svn: 336163
Summary:
When we import an alias (which will import a copy of the aliasee), but
aren't going to import the aliasee directly, the distributed backend
index will not contain the aliasee summary. Handle this in the summary
assembly printer by printing "null" as the aliasee.
Reviewers: davidxl, dexonsmith
Subscribers: mehdi_amini, inglorion, eraman, steven_wu, llvm-commits
Differential Revision: https://reviews.llvm.org/D48699
llvm-svn: 336160
The verifier identified several modules that were broken due to incorrect
linkage on declarations. To fix this, CompileOnDemandLayer2::extractFunction
has been updated to change decls to external linkage.
llvm-svn: 336150
Summary:
In the individual index files emitted for distributed ThinLTO backends,
the module path ids are not contiguous. Assign slots to module paths in
order to handle this better and also to get contiguous numbering in the
summary assembly.
Reviewers: davidxl, dexonsmith
Subscribers: mehdi_amini, inglorion, eraman, llvm-commits, steven_wu
Differential Revision: https://reviews.llvm.org/D48698
llvm-svn: 336148
Summary:
Comment on Transforms/LoopVersioning/incorrect-phi.ll: With the change
SCEV is able to prove that the loop doesn't wrap-self (due to zext i16
to i64), disabling the entire loop versioning pass. Removed the zext and
just use i64.
Reviewers: sanjoy
Subscribers: jlebar, hiraditya, javed.absar, bixia, llvm-commits
Differential Revision: https://reviews.llvm.org/D48409
llvm-svn: 336140
LLVM doesn't guarantee anything about the high bits of a register holding
an i1 value at the IR level, so don't translate LLVM IR i1 values directly
into WebAssembly conditional branch operands. WebAssembly's conditional
branches do demand all 32 bits be valid.
Fixes PR38019.
llvm-svn: 336138
Add registers still missing after r328016 (D43353):
- for bits 15-8 of SI, DI, BP, SP (*H), and R8-R15 (*BH),
- for bits 31-16 of R8-R15 (*WH).
Thanks to Craig Topper for pointing it out.
llvm-svn: 336134
Summary: It is common to have the following min/max pattern during the intermediate stages of SLP since we only optimize at the end. This patch tries to catch such patterns and allow more vectorization.
%1 = extractelement <2 x i32> %a, i32 0
%2 = extractelement <2 x i32> %a, i32 1
%cond = icmp sgt i32 %1, %2
%3 = extractelement <2 x i32> %a, i32 0
%4 = extractelement <2 x i32> %a, i32 1
%select = select i1 %cond, i32 %3, i32 %4
Author: FarhanaAleen
Reviewed By: ABataev, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D47608
llvm-svn: 336130
This extends D48485 to allow another pair of binops (add/or) to be combined either
with or without a leading shuffle:
or X, C --> add X, C (when X and C have no common bits set)
Here, we need value tracking to determine that the 'or' can be reversed into an 'add',
and we've added general infrastructure to allow extending to other opcodes or moving
to where other passes could use that functionality.
Differential Revision: https://reviews.llvm.org/D48662
llvm-svn: 336128
On darwin, all virtual sections have zerofill type, and having a
.zerofill directive in a non-virtual section is not allowed. Instead of
asserting, show a nicer error.
In order to use the equivalent of .zerofill in a non-virtual section,
the usage of .zero of .space is required.
This patch replaces the assert with an error.
Differential Revision: https://reviews.llvm.org/D48517
llvm-svn: 336127
Similarily, don't fold fp128 loads into SSE instructions if the load isn't aligned. Unless we're targeting an AMD CPU that doesn't check alignment on arithmetic instructions.
Should fix PR38001
llvm-svn: 336121
We currently don't any-extend vararg parameters before storing them to the stack
locations on Darwin. However, SelectionDAG however does this, and so user code
is in the wild which inadvertently relies on this extension. This can manifest
in cases where the value stored is (int)0, but the actual parameter is interpreted
by va_arg as a pointer, and so not extending to 64 bits causes the callee to
load additional undefined bits.
llvm-svn: 336120
Summary:
This patch is the first in a series of patches related to the [[ http://lists.llvm.org/pipermail/llvm-dev/2018-June/123883.html | RFC - A new dominator tree updater for LLVM ]].
This patch introduces the DomTreeUpdater class, which provides a cleaner API to perform updates on available dominator trees (none, only DomTree, only PostDomTree, both) using different update strategies (eagerly or lazily) to simplify the updating process.
—Prior to the patch—
- Directly calling update functions of DominatorTree updates the data structure eagerly while DeferredDominance does updates lazily.
- DeferredDominance class cannot be used when a PostDominatorTree also needs to be updated.
- Functions receiving DT/DDT need to branch a lot which is currently necessary.
- Functions using both DomTree and PostDomTree need to call the update function separately on both trees.
- People need to construct an additional DeferredDominance class to use functions only receiving DDT.
—After the patch—
Patch by Chijun Sima <simachijun@gmail.com>.
Reviewers: kuhar, brzycki, dmgreen, grosser, davide
Reviewed By: kuhar, brzycki
Subscribers: vsk, mgorny, llvm-commits
Author: NutshellySima
Differential Revision: https://reviews.llvm.org/D48383
llvm-svn: 336114
We were only doing this for basic blends, despite shuffle lowering now being good enough to handle more complex blends. This means that the two v8i16 splat shifts are performed in parallel instead of serially as the general shift case.
llvm-svn: 336113
Summary:
Replace use of a SmallPtrSet with a SmallSetVector to make the worklist
iteration order deterministic. This is done as the order the blocks are
removed may affect whether or not PHI nodes in successor blocks are
removed.
For example, consider the following case where %bb1 and %bb2 are
removed:
bb1:
br i1 undef, label %bb3, label %bb4
bb2:
br i1 undef, label %bb4, label %bb3
bb3:
pv1 = phi type [ undef, %bb1 ], [ undef, %bb2], [ v0, %other ]
br label %bb4
bb4:
pv2 = phi type [ undef, %bb1 ], [ undef, %bb2 ],
[ pv1, %bb3 ], [ v0, %other ]
If %bb2 is removed before %bb1, the incoming values from %bb1 and %bb2
to pv1 will be removed before %bb1 is removed as a predecessor to %bb4.
The pv1 node will thus be optimized out (to v0) at the time %bb1 is
removed as a predecessor to %bb4, leaving the blocks as following when
the incoming value from %bb1 has been removed:
bb3: ; pv1 optimized out, incoming value to pv2 is v0
br label %bb4
bb4:
pv2 = phi type [ v0, %bb3 ], [ v0, %other ]
The pv2 PHI node will be optimized away by removePredecessor() as all
incoming values are identical.
In case %bb2 is removed after %bb1, pv1 will not be optimized out at the
time %bb2 is removed as a predecessor to %bb4, leaving the blocks as
following when the incoming value from %bb2 to pv2 has been removed:
bb3:
pv1 = phi type [ undef, %bb2 ], [ v0, %other ]
br label %bb4
bb4:
pv2 = phi type [ pv1, %bb3 ], [ v0, %other ]
The pv2 PHI node will thus not be removed in this case, ultimately
leading to the following output
bb3: ; pv1 optimized out, incoming value to pv2 is v0
br label %bb4
bb4:
pv2 = phi type [ v0, %bb3 ], [ v0, %other ]
I have not looked into changing DeleteDeadBlock() so that the redundant
PHI nodes are removed.
I have not added a test case, as I was not able to create a particularly
small and (not messy) reproducer. This is likely due to SmallPtrSet
behaving deterministically when in small mode.
Reviewers: void, dexonsmith, spatel, skatkov, fhahn, bkramer, nhaehnle
Reviewed By: fhahn
Subscribers: mgrang, llvm-commits
Differential Revision: https://reviews.llvm.org/D48369
llvm-svn: 336109
The X86 asm parser currently has custom parsing logic for .word. Rather than
use this custom logic, we can just use addAliasForDirective to enable the
reuse of AsmParser::parseDirectiveValue.
See also similar changes to Sparc (rL333078), AArch64 (rL333077), and Hexagon
(rL332607) backends.
Differential Revision: https://reviews.llvm.org/D47004
This is a fixed reland of rL336100. This should have been caught in
pre-commit testing so apologies for the noise.
llvm-svn: 336104
This code is only used by alternate opcodes so the InstructionsState has already confirmed that every Value is an Instruction, plus we use cast<Instruction> which will assert on failure.
llvm-svn: 336102
The X86 asm parser currently has custom parsing logic for .word. Rather than
use this custom logic, we can just use addAliasForDirective to enable the
reuse of AsmParser::parseDirectiveValue.
See also similar changes to Sparc (rL333078), AArch64 (rL333077), and Hexagon
(rL332607) backends.
Differential Revision: https://reviews.llvm.org/D47004
llvm-svn: 336100
This version contains a fix to add values for which the state in ParamState change
to the worklist if the state in ValueState did not change. To avoid adding the
same value multiple times, mergeInValue returns true, if it added the value to
the worklist. The value is added to the worklist depending on its state in
ValueState.
Original message:
For comparisons with parameters, we can use the ParamState lattice
elements which also provide constant range information. This improves
the code for PR33253 further and gets us closer to use
ValueLatticeElement for all values.
Also, as we are using the range information in the solver directly, we
do not need tryToReplaceWithConstantRange afterwards anymore.
Reviewers: dberlin, mssimpso, davide, efriedma
Reviewed By: mssimpso
Differential Revision: https://reviews.llvm.org/D43762
llvm-svn: 336098
We were always using the opcodes of the first 2 scalars for the costs of the alternate opcode + shuffle. This made sense when we used SK_Alternate and opcodes were guaranteed to be alternating, but this fails for the more general SK_Select case.
This fix exposes an issue demonstrated by the fmul_fdiv_v4f32_const test - the SLM model has v4f32 fdiv costs which are more than twice those of the f32 scalar cost, meaning that the cost model determines that the vectorization is not performant. Unfortunately it completely ignores the fact that the fdiv by a constant will be changed into a fmul by InstCombine for a much lower cost vectorization. But at least we're seeing this now...
llvm-svn: 336095
Increment/decrement vector by multiple of predicate constraint
element count.
The variants added by this patch are:
- INCH, INCW, INC
and (saturating):
- SQINCH, SQINCW, SQINCD
- UQINCH, UQINCW, UQINCW
- SQDECH, SQINCW, SQINCD
- UQDECH, UQINCW, UQINCW
For example:
incw z0.s, all, mul #4
llvm-svn: 336090
We don't have PMCs to cover many of the Jaguar resources but we can at least monitor the FPU issue pipes which give an indication of the fpu uop count, just not the execution resources.
llvm-svn: 336089
This change fixes the issue that arises when we duplicate condition from
the predecessor block. If the condition's arguments are not considered alive
across the blocks, fast regalloc gets confused and starts generating reloads
from the slots that have never been spilled to. This change also leads to
smaller code given that, unlike on architectures with condition codes, on
Mips we can branch directly on register value, thus we gain nothing by
duplication.
Patch by Dragan Mladjenovic.
Differential Revision: https://reviews.llvm.org/D48642
llvm-svn: 336084
These patches were previously reverted as they led to
buildbot time-outs caused by large switch statement in
printAliasInstr when using UBSan and O3. The issue has
been addressed with a workaround (r335525).
llvm-svn: 336079
It looks like someone ran clang-format over this entire file which reformatted these switches into a multiline form. But I think the single line form is more useful here.
llvm-svn: 336077
I separated out the rounding and broadcast groups into their own tables because it made the ordering in the main table easier.
Further splitting of the tables might make it possible to directly index using bits from the TSFlags, but its probably not worth it right now.
llvm-svn: 336075
For the below case, pre-inc prep think it's a good candidate to use pre-inc for the bucket, but 64bit integer load/store update (pre-inc) instruction on Power requires the displacement field should be DS-form (4's multiple). Since it can't satisfy the constraint, we have to do some fix ups later. As below, the original load/stores could be well-form, it makes things worse.
unsigned long long result = 0;
unsigned long long foo(char *p, unsigned long long n) {
for (unsigned long long i = 0; i < n; i++) {
unsigned long long x1 = *(unsigned long long *)(p - 50000 + i);
unsigned long long x2 = *(unsigned long long *)(p - 61024 + i);
unsigned long long x3 = *(unsigned long long *)(p - 62048 + i);
unsigned long long x4 = *(unsigned long long *)(p - 64096 + i);
result *= x1 * x2 * x3 * x4;
}
return result;
}
Patch by jedilyn(Kewen Lin).
Differential Revision: https://reviews.llvm.org/D48813
--This line, and those below, will be ignored--
M lib/Target/PowerPC/PPCLoopPreIncPrep.cpp
A test/CodeGen/PowerPC/preincprep-i64-check.ll
llvm-svn: 336074
Summary:
This patch introduce new intrinsic -
strip.invariant.group that was described in the
RFC: Devirtualization v2
Reviewers: rsmith, hfinkel, nlopes, sanjoy, amharc, kuhar
Subscribers: arsenm, nhaehnle, JDevlieghere, hiraditya, xbolva00, llvm-commits
Differential Revision: https://reviews.llvm.org/D47103
Co-authored-by: Krzysztof Pszeniczny <krzysztof.pszeniczny@gmail.com>
llvm-svn: 336073
section flags in the ELF assembler. This matches the defaults
given in the rest of MC.
Fixes PR37997 where we couldn't assemble our own assembly output
without warnings.
llvm-svn: 336072
findCommutedOpIndices does the pre-checking for whether commuting is possible. There should be no reason left to fail in commuteInstructionImpl. There was a missing pre-check that I've added there and changed the check to an assert in commuteInstructionImpl.
llvm-svn: 336070
I've check the disassembler tables and this shouldn't be reachable. Which is good since if it was reachable there should have been a 'return' after the addOperand line.
llvm-svn: 336066
This is a simple implementation of the unroll-and-jam classical loop
optimisation.
The basic idea is that we take an outer loop of the form:
for i..
ForeBlocks(i)
for j..
SubLoopBlocks(i, j)
AftBlocks(i)
Instead of doing normal inner or outer unrolling, we unroll as follows:
for i... i+=2
ForeBlocks(i)
ForeBlocks(i+1)
for j..
SubLoopBlocks(i, j)
SubLoopBlocks(i+1, j)
AftBlocks(i)
AftBlocks(i+1)
Remainder Loop
So we have unrolled the outer loop, then jammed the two inner loops into
one. This can lead to a simpler inner loop if memory accesses can be shared
between the now jammed loops.
To do this we have to prove that this is all safe, both for the memory
accesses (using dependence analysis) and that ForeBlocks(i+1) can move before
AftBlocks(i) and SubLoopBlocks(i, j).
Differential Revision: https://reviews.llvm.org/D41953
llvm-svn: 336062
Also move the static folding tables, their search functions and the new class into new cpp/h files.
The unfolding table is effectively static data. It's just a different ordering and a subset of the static folding tables.
By putting it in a separate ManagedStatic we ensure we only have one copy instead of one per X86InstrInfo object. This way also makes it only get initialized when really needed.
llvm-svn: 336056
The class only exists to hold a DenseMap and is only created as a ManagedStatic. It used to expose a single static method that outside code was expected to use.
This patch moves that static function out of the class and moves it implementation into the cpp file. It can now access the ManagedStatic directly by name without the need for the other static method that accessed the ManagedStatic.
llvm-svn: 336055
There are no instructions that use them so they weren't causing any bad matches. But they weren't being diagnosed as "invalid register name" if they were used and would instead trigger some form of invalid operand.
llvm-svn: 336054
I believe all of these are constants so legalizing them should be pretty trivial, but this saves a step.
In one case it looks like we may have been creating a shift amount larger than the shift input itself.
llvm-svn: 336052
This combine runs pretty late and causes us to introduce a shift after the op legalization phase has run. We need to be sure we create the shift with the proper type for the shift amount. If we don't do this, we will still re-legalize the operation properly, but we won't get a chance to fully optimize the truncate that gets inserted.
So this patch adds the necessary truncate when the shift is created. I've also narrowed the subtract that gets created to always be an i32 type. The truncate would have trigered SimplifyDemandedBits to optimize it anyway. But using a more appropriate VT here is free and saves an optimization step.
llvm-svn: 336051
The combine added in commit 329525 overlooked the case where one, but not all, of the divisor elements is -1, -1 is the only power of two value for which the sdiv expansion recipe breaks.
Thanks to @zvi for the original patch.
Differential Revision: https://reviews.llvm.org/D45806
llvm-svn: 336048
Summary:
We could split sizes that are not power of two into smaller sized
G_IMPLICIT_DEF instructions, but this ends up generating
G_MERGE_VALUES instructions which we then have to handle in the instruction
selector. Since G_IMPLICIT_DEF is really a no-op it's easier just to
keep everything that can fit into a register legal.
Reviewers: arsenm
Reviewed By: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, rovka, kristof.beyls, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D48777
llvm-svn: 336041
This adds functionality to the outliner that allows targets to
specify certain functions that should be outlined from by default.
If a target supports default outlining, then it specifies that in
its TargetOptions. In the case that it does, and the user hasn't
specified that they *never* want to outline, the outliner will
be added to the pass pipeline and will run on those default functions.
This is a preliminary patch for turning the outliner on by default
under -Oz for AArch64.
https://reviews.llvm.org/D48776
llvm-svn: 336040
and diretory.
Also cleans up all the associated naming to be consistent and removes
the public access to the pass ID which was unused in LLVM.
Also runs clang-format over parts that changed, which generally cleans
up a bunch of formatting.
This is in preparation for doing some internal cleanups to the pass.
Differential Revision: https://reviews.llvm.org/D47352
llvm-svn: 336028
We have a function which switches on the type of a symbol record
to return a hardcoded offset into the record that contains the
symbol name. Not all symbols have names to begin with, and for
those records we return -1 for the offset.
Names are used for various things. Importantly for this particular
bug, a hash of the record name is used as a key for certain hash
tables which are serialied into the PDB file. One of these hash
tables is for the global symbol stream, which is basically a
collection of S_PROCREF symbols which contain the name of the
symbol, a module, and an address offset.
However, for S_PROCREF symbols, the function to return the offset
of the name was returning -1: basically it wasn't implemented.
As a result of this, all global symbols were hashing to the same
value, essentially it was as if every single global symbol's name
was the empty string.
This manifests in the VS debugger when you try to call a function
(global or member, doesn't matter) through the immediate window
and the debugger simply reports an error because it can't find the
function. This makes perfect sense, because it is hashing the name
for real, looking in the global symbol hash table, and there is only
1 entry there which corresponds to a symbol whose name is the empty
string.
Fixing this fixes the MSVC debugger in this case.
llvm-svn: 336024
Summary:
MemoryPhis now have APIs analogous to BB Phis to remove an incoming value/block.
The MemorySSAUpdater uses the above APIs when updating MemorySSA given a set of dead blocks about to be deleted.
Reviewers: george.burgess.iv
Subscribers: sanjoy, jlebar, Prazek, llvm-commits
Differential Revision: https://reviews.llvm.org/D48396
llvm-svn: 336015
Summary:
Retagging allocas before returning from the function might help
detecting use after return bugs, but it does not work at all in real
life, when instrumented and non-instrumented code is intermixed.
Consider the following code:
F_non_instrumented() {
T x;
F1_instrumented(&x);
...
}
{
F_instrumented();
F_non_instrumented();
}
- F_instrumented call leaves the stack below the current sp tagged
randomly for UAR detection
- F_non_instrumented allocates its own vars on that tagged stack,
not generating any tags, that is the address of x has tag 0, but the
shadow memory still contains tags left behind by F_instrumented on the
previous step
- F1_instrumented verifies &x before using it and traps on tag mismatch,
0 vs whatever tag was set by F_instrumented
Reviewers: eugenis
Subscribers: srhines, llvm-commits
Differential Revision: https://reviews.llvm.org/D48664
llvm-svn: 336011
When instructions with metadata are accidentally leaked, the result is a
difficult-to-find memory corruption in ~LLVMContextImpl that leads to
random crashes.
Patch by Arvīds Kokins!
llvm-svn: 336010
This was introducing unnecessary padding after the explicit
arguments, depending on the alignment of the total struct type.
Also has the side effect of avoiding creating an extra GEP for
the offset from the base kernel argument to the explicit kernel
argument offset.
llvm-svn: 335999
The important part is the creation of the SHLD/SHRD nodes. The compare and the conditional move can use target independent nodes that can be legalized on their own. This gives some opportunities to trigger the optimizations present in the lowering for those things. And its just better to limit the number of places we emit target specific nodes.
The changed test cases still aren't optimal.
Differential Revision: https://reviews.llvm.org/D48619
llvm-svn: 335998
Extends the CFGPrinter and CallPrinter with heat colors based on heuristics or
profiling information. The colors are enabled by default and can be toggled
on/off for CFGPrinter by using the option -cfg-heat-colors for both
-dot-cfg[-only] and -view-cfg[-only]. Similarly, the colors can be toggled
on/off for CallPrinter by using the option -callgraph-heat-colors for both
-dot-callgraph and -view-callgraph.
Patch by Rodrigo Caetano Rocha!
Differential Revision: https://reviews.llvm.org/D40425
llvm-svn: 335996
Previously we used a DenseMap which is costly to set up due to multiple full table rehashes as the size increases and causes the table to be reallocated.
This patch changes the table to a vector of structs. We now walk the reg->mem tables and push new entries in the mem->reg table for each row not marked TB_NO_REVERSE. Once all the table entries have been created, we sort the vector. Then we can use a binary search for lookups.
Differential Revision: https://reviews.llvm.org/D48585
llvm-svn: 335994
This allows to hoist code portion to compute reciprocal of loop
invariant denominator in integer division after codegen prepare
expansion.
Differential Revision: https://reviews.llvm.org/D48604
llvm-svn: 335988
This is a recommit of r335887, which was erroneously committed earlier.
To enable the MachineOutliner by default on AArch64, we need to be able to
disable the MachineOutliner and also provide an option to "always" enable the
outliner.
This adds that capability. It allows the user to still use the old
-enable-machine-outliner option, which defaults to "always". This is building
up to allowing the user to specify "always" versus the target default
outlining behaviour.
https://reviews.llvm.org/D48682
llvm-svn: 335986
Summary:
.debug_loc section is not supported for NVPTX target. If there is an
object whose location can change during its lifetime, we do not generate
debug location info for this variable.
Reviewers: echristo
Subscribers: jholewinski, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D48730
llvm-svn: 335976
This was discussed in D48401 as another improvement for:
https://bugs.llvm.org/show_bug.cgi?id=37806
If we have 2 different variable values, then we shuffle (select) those lanes,
shuffle (select) the constants, and then perform the binop. This eliminates a binop.
The new shuffle uses the same shuffle mask as the existing shuffle, so there's no
danger of creating a difficult shuffle.
All of the earlier constraints still apply, but we also check for extra uses to
avoid creating more instructions than we'll remove.
Additionally, we're disallowing the fold for div/rem because that could expose a
UB hole.
Differential Revision: https://reviews.llvm.org/D48678
llvm-svn: 335974
We can have AddRec with loops having many predecessors.
This changes an assert to an early return.
Differential Revision: https://reviews.llvm.org/D48766
llvm-svn: 335965
This uses the same technique as for shifts - split the rotation into 4/2/1-bit partial rotations and select those partials based on the amount bit, making use of PBLENDVB if available. This halves the use of PBLENDVB compared to expanding to shifts, which can be a slow op.
Unfortunately I haven't found a decent way to share much of this code with the shift equivalent.
Differential Revision: https://reviews.llvm.org/D48655
llvm-svn: 335957
Initial patch adding assembly support for Armv8.4-A.
Besides adding v8.4 as a supported architecture to the usual places, this also
adds target features for the different crypto algorithms. Armv8.4-A introduced
new crypto algorithms, made them optional, and allows different combinations:
- none of the v8.4 crypto functions are supported, which is independent of the
implementation of the Armv8.0 SHA1 and SHA2 instructions.
- the v8.4 SHA512 and SHA3 support is implemented, in this case the Armv8.0
SHA1 and SHA2 instructions must also be implemented.
- the v8.4 SM3 and SM4 support is implemented, which is independent of the
implementation of the Armv8.0 SHA1 and SHA2 instructions.
- all of the v8.4 crypto functions are supported, in this case the Armv8.0 SHA1
and SHA2 instructions must also be implemented.
The v8.4 crypto instructions are added to AArch64 only, and not AArch32,
and are made optional extensions to Armv8.2-A.
The user-facing Clang options will map on these new target features, their
naming will be compatible with GCC and added in follow-up patches.
The Armv8.4-A instruction sets can be downloaded here:
https://developer.arm.com/products/architecture/a-profile/exploration-tools
Differential Revision: https://reviews.llvm.org/D48625
llvm-svn: 335953
Summary:
An alternative to D48597.
Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=37936 | PR37936 ]].
The problem is as follows:
1. `indvars` marks `%dec` as `NUW`.
2. `loop-instsimplify` runs `instsimplify`, which constant-folds `%dec` to -1 (D47908)
3. `loop-reduce` tries to do some further modification, but crashes
with an type assertion in cast, because `%dec` is no longer an `Instruction`,
If the runline is split into two, i.e. you first run `-indvars -loop-instsimplify`,
store that into a file, and then run `-loop-reduce`, there is no crash.
So it looks like the problem is due to `-loop-instsimplify` not discarding SCEV.
But in this case we can just not crash if it's not an `Instruction`.
This is just a local fix, unlike D48597, so there may very well be other problems.
Reviewers: mkazantsev, uabelho, sanjoy, silviu.baranga, wmi
Reviewed By: mkazantsev
Subscribers: evstupac, javed.absar, spatel, llvm-commits
Differential Revision: https://reviews.llvm.org/D48599
llvm-svn: 335950
Summary:
We now have two sets of generated TableGen files, one for R600 and one
for GCN, so each sub-target now has its own tables of instructions,
registers, ISel patterns, etc. This should help reduce compile time
since each sub-target now only has to consider information that
is specific to itself. This will also help prevent the R600
sub-target from slowing down new features for GCN, like disassembler
support, GlobalISel, etc.
Reviewers: arsenm, nhaehnle, jvesely
Reviewed By: arsenm
Subscribers: MatzeB, kzhuravl, wdng, mgorny, yaxunl, dstuttard, tpr, t-tye, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D46365
llvm-svn: 335942
We don't ever check these again (unless you're using
-fno-integrated-as), so make sure the extracted bits are well-defined.
I don't think it's possible to trigger any of the assertions on trunk,
but it's difficult to prove. (The first one depends on DAGCombine to
minimize the number of set bits in AND masks; I think the others are
mathematically impossible to hit.)
llvm-svn: 335931
This is a recommit of r335879.
We shouldn't add the outliner when compiling at -O0 even if
-enable-machine-outliner is passed in. This makes sure that we
don't add it in this case.
This also removes -O0 from the outliner DWARF test.
llvm-svn: 335930
This change adds experimental support for SHT_RELR sections, proposed
here: https://groups.google.com/forum/#!topic/generic-abi/bX460iggiKg
Definitions for the new ELF section type and dynamic array tags, as well
as the encoding used in the new section are all under discussion and are
subject to change. Use with caution!
Author: rahulchaudhry
Differential Revision: https://reviews.llvm.org/D47919
llvm-svn: 335922
There's no way to expose this difference currently,
but we should use the updated variable because the
original opcodes can go stale if we transform into
something new.
llvm-svn: 335920
This fixes a regression since SVN r334523, where the object files
built targeting MinGW were rejected by GNU binutils tools. Prior to
that commit, we only put constants in comdat for MSVC configurations.
Differential Revision: https://reviews.llvm.org/D48567
llvm-svn: 335918
Summary:
The InlinerFunctionImportStats will collect and dump stats regarding how
many function inlined into the module were imported by ThinLTO.
Reviewers: wmi, dexonsmith
Subscribers: mehdi_amini, inglorion, llvm-commits, eraman
Differential Revision: https://reviews.llvm.org/D48729
llvm-svn: 335914
Mostly just adding checks for Thumb2 instructions which correspond to
ARM instructions which already had diagnostics. While I'm here, also fix
ARM-mode strd to check the input registers correctly.
Differential Revision: https://reviews.llvm.org/D48610
llvm-svn: 335909
When rewriting an alloca partition copy the DL from the
old alloca over the the new one.
Differential Revision: https://reviews.llvm.org/D48640
llvm-svn: 335904
FileOutputBuffer creates a temp file and on commit atomically
renames the temp file to the destination file. Sometimes we
want to modify an existing file in place, but still have the
atomicity guarantee. To do this we can initialize the contents
of the temp file from the destination file (if it exists), that
way the resulting FileOutputBuffer can have only selective
bytes modified. Committing will then atomically replace the
destination file as desired.
llvm-svn: 335902
This is a follow up to r335753. At the time I forgot about isProfitableToFold which makes this pretty easy.
Differential Revision: https://reviews.llvm.org/D48706
llvm-svn: 335895
Reverting because this is causing failures in the LLDB test suite on
GreenDragon.
LLVM ERROR: unsupported relocation with subtraction expression, symbol
'__GLOBAL_OFFSET_TABLE_' can not be undefined in a subtraction
expression
llvm-svn: 335894
This is an enhancement to D48401 that was discussed in:
https://bugs.llvm.org/show_bug.cgi?id=37806
We can convert a shift-left-by-constant into a multiply (we canonicalize IR in the other
direction because that's generally better of course). This allows us to remove the shuffle
as we do in the regular opcodes-are-the-same cases.
This requires a small hack to make sure we don't introduce any extra poison:
https://rise4fun.com/Alive/ZGv
Other examples of opcodes where this would work are add+sub and fadd+fsub, but we already
canonicalize those subs into adds, so there's nothing to do for those cases AFAICT. There
are planned enhancements for opcode transforms such or -> add.
Note that there's a different fold needed if we've already managed to simplify away a binop
as seen in the test based on PR37806, but we manage to get that one case here because this
fold is positioned above the demanded elements fold currently.
Differential Revision: https://reviews.llvm.org/D48485
llvm-svn: 335888
Targets should be able to define whether or not they support the outliner
without the outliner being added to the pass pipeline. Before this, the
outliner pass would be added, and ask the target whether or not it supports the
outliner.
After this, it's possible to query the target in TargetPassConfig, before the
outliner pass is created. This ensures that passing -enable-machine-outliner
will not modify the pass pipeline of any target that does not support it.
https://reviews.llvm.org/D48683
llvm-svn: 335887
We could get away with it for constant folded cases, but not for rL335719.
Thanks to Krzysztof Parzyszek for noticing.
Reapply original commit rL335821 which was reverted at rL335871 due to a WebAssembly bug that was fixed at rL335884.
llvm-svn: 335886
This reverts commit 9c7c10e4073a0bc6a759ce5cd33afbac74930091.
It relies on r335872 since that introduces the machine outliner
flags test. I meant to commit D48683 in that commit, but got mixed
up and committed D48682 instead. So, I'm reverting this and
r335872, since D48682 hasn't made it through review yet.
llvm-svn: 335882
We shouldn't add the outliner when compiling at -O0 even if
-enable-machine-outliner is passed in. This makes sure that we
don't add it in this case.
This also updates machine-outliner-flags to reflect the change
and improves the comment describing what that test does.
llvm-svn: 335879
Add NoTrapAfterNoreturn target option which skips emission of traps
behind noreturn calls even if TrapUnreachable is enabled.
Enable the feature on Mach-O to save code size; Comments suggest it is
not possible to enable it for the other users of TrapUnreachable.
rdar://41530228
DifferentialRevision: https://reviews.llvm.org/D48674
llvm-svn: 335877
To enable the MachineOutliner by default on AArch64, we need to be able to
disable the MachineOutliner and also provide an option to "always" enable the
outliner.
This adds that capability. It allows the user to still use the old
-enable-machine-outliner option, which defaults to "always". This is building
up to allowing the user to specify "always" versus the target-default
outlining behaviour.
llvm-svn: 335872
This allows hoisting of a common code, for instance if denominator
is loop invariant. Current change is expansion only, adding licm to
the target pass list going to be a separate patch. Given this patch
changes to codegen are minor as the expansion is similar to that on
DAG. DAG expansion still must remain for R600.
Differential Revision: https://reviews.llvm.org/D48586
llvm-svn: 335868
This pass is being added in order to make the information available to BasicAA,
which can't do caching of this information itself, but possibly this information
may be useful for other passes.
Incorporates code based on Daniel Berlin's implementation of Tarjan's algorithm.
Differential Revision: https://reviews.llvm.org/D47893
llvm-svn: 335857
Armv6 introduced instructions to perform 32-bit SIMD operations. The purpose of
this pass is to do some straightforward IR pattern matching to create ACLE DSP
intrinsics, which map on these 32-bit SIMD operations.
Currently, only the SMLAD instruction gets recognised. This instruction
performs two multiplications with 16-bit operands, and stores the result in an
accumulator. We will follow this up with patches to recognise SMLAD in more
cases, and also to generate other DSP instructions (like e.g. SADD16).
Patch by: Sam Parker and Sjoerd Meijer
Differential Revision: https://reviews.llvm.org/D48128
llvm-svn: 335850
We have too many mechanisms for tracking the various offsets
used for kernel arguments, so remove one. There's still a lot of
confusion with these because there are two different "implicit"
argument areas located at the beginning and end of the kernarg
segment.
Additionally, the offset was determined based on the memory
size of the split element types. This would break in a future
commit where v3i32 is decomposed into separate i32 pieces.
llvm-svn: 335830
In principle nothing should stop these from working, but
work is necessary to create an ABI for dealing with the stack
related registers.
llvm-svn: 335829
Just fix the crash for now by not doing the optimization since
figuring out how to properly convert the bits for an arbitrary
struct is a pain.
Also fix a crash when there is only an empty struct argument.
llvm-svn: 335827
These are all benign races and only visible in !NDEBUG. tsan complains
about it, but a simple atomic bool is sufficient to make it happy.
llvm-svn: 335823
SCCP does not change the CFG, so we can mark it as preserved.
Reviewers: dberlin, efriedma, davide
Reviewed By: davide
Differential Revision: https://reviews.llvm.org/D47149
llvm-svn: 335820
If a trunc has a user in a block which is not reachable from entry,
we can safely perform trunc elimination as if this user didn't exist.
llvm-svn: 335816
Remove unused ByteStreamer argument from function emitDebugLocValue.
Patch by Nikola Prica.
Differential Revision: https://reviews.llvm.org/D48590
llvm-svn: 335811
This more efficient for the isel table generator since we can use CheckChildInteger instead of MoveChild, CheckPredicate, MoveParent. This reduced the table size by 1-2K.
I wish there was a way to share the values with X86BaseInfo.h and still use a PatFrag like this. These numbers are fixed by the X86 intrinsic spec going back many years and we should never need to change them. So we shouldn't waste table bytes to support sharing.
llvm-svn: 335806
BMI2 added new shift by register instructions that have the ability to fold a load.
Normally without doing anything special isel would prefer folding a load over folding an immediate because the load folding pattern has higher "complexity". This would require an instruction to move the immediate into a register. We would rather fold the immediate instead and have a separate instruction for the load.
We used to enforce this priority by artificially lowering the complexity of the load pattern.
This patch changes this to instead reject the load fold in isProfitableToFoldLoad if there is an immediate. This is more consistent with other binops and feels less hacky.
llvm-svn: 335804
=== Generating the CG Profile ===
The CGProfile module pass simply gets the block profile count for each BB and scans for call instructions. For each call instruction it adds an edge from the current function to the called function with the current BB block profile count as the weight.
After scanning all the functions, it generates an appending module flag containing the data. The format looks like:
```
!llvm.module.flags = !{!0}
!0 = !{i32 5, !"CG Profile", !1}
!1 = !{!2, !3, !4} ; List of edges
!2 = !{void ()* @a, void ()* @b, i64 32} ; Edge from a to b with a weight of 32
!3 = !{void (i1)* @freq, void ()* @a, i64 11}
!4 = !{void (i1)* @freq, void ()* @b, i64 20}
```
Differential Revision: https://reviews.llvm.org/D48105
llvm-svn: 335794
The code to emit the pieces of the MSF file were actually in
PDBFileBuilder. Move this to MSFBuilder so that we can
theoretically emit an MSF without having a PDB file.
llvm-svn: 335789
If we turn X86ISD::AND into ISD::AND, we delete N. But we were continuing onto the next block of code even though N no longer existed.
Just happened to notice it. I assume asan didn't notice it because we explicitly unpoison deleted nodes and give them a DELETE_NODE opcode.
llvm-svn: 335787
Summary:
In r333455 we added a peephole to fix the corner cases that result
from separating base + offset lowering of global address.The
peephole didn't handle some of the cases because it only has a basic
block view instead of a function level view.
This patch replaces that logic with a machine function pass. In
addition to handling the original cases it handles uses of the global
address across blocks in function and folding an offset from LW\SW
instruction. This pass won't run for OptNone compilation, so there
will be a negative impact overall vs the old approach at O0.
Reviewers: asb, apazos, mgrang
Reviewed By: asb
Subscribers: MartinMosbeck, brucehoult, the_o, rogfer01, mgorny, rbar, johnrusso, simoncook, niosHD, kito-cheng, shiva0217, zzheng, llvm-commits, edward-jones
Differential Revision: https://reviews.llvm.org/D47857
llvm-svn: 335786
The %eiz/%riz are dummy registers that force the encoder to emit a SIB byte when it normally wouldn't. By emitting them in the disassembly output we ensure that assembling the disassembler output would also produce a SIB byte.
This should match the behavior of objdump from binutils.
llvm-svn: 335768
Now that we have the ability to legalize based on MMO's. Add support for
legalizing based on AtomicOrdering and use it to correct the legalization
of the atomic instructions.
Also extend all() to be a variadic template as this ruleset now requires
3 and 4 argument versions.
llvm-svn: 335767
As noted in the D44909 review, the transform from (fptosi+sitofp) to ftrunc
can produce -0.0 where the original code does not:
#include <stdio.h>
int main(int argc) {
float x;
x = -0.8 * argc;
printf("%f\n", (float)((int)x));
return 0;
}
$ clang -O0 -mavx fp.c ; ./a.out
0.000000
$ clang -O1 -mavx fp.c ; ./a.out
-0.000000
Ideally, we'd use IR/node flags to predicate the transform, but the IR parser
doesn't currently allow fast-math-flags on the cast instructions. So for now,
just use the function attribute that corresponds to clang's "-fno-signed-zeros"
option.
Differential Revision: https://reviews.llvm.org/D48085
llvm-svn: 335761
Summary:
Rather than just print the GUID, when it is available in the index,
print the global name as well in the function import thin link debug
messages. Names will be available when the combined index is being
built by the same process, e.g. a linker or "llvm-lto2 run".
Reviewers: davidxl
Subscribers: mehdi_amini, inglorion, eraman, steven_wu, llvm-commits
Differential Revision: https://reviews.llvm.org/D48612
llvm-svn: 335760
It isn't safe to outline sequences of instructions where x16/x17/nzcv live
across the sequence.
This teaches the outliner to check whether or not a specific canidate has
x16/x17/nzcv live across it and discard the candidate in the case that that is
true.
https://bugs.llvm.org/show_bug.cgi?id=37573https://reviews.llvm.org/D47655
llvm-svn: 335758
If we are just modifying a single bit at a variable bit position we can use the BT* instructions to make the change instead of shifting a 1(or rotating a -1) and doing a binop. These instruction also ignore the upper bits of their index input so we can also remove an and if one is present on the index.
Fixes PR37938.
llvm-svn: 335754
Summary:
AliasSet::print uses `I->printAsOperand` to print UnknownInstructions. The problem is that not all UnknownInstructions have names (e.g. call instructions). When such instructions are printed, they appear as `<badref>` in AliasSets, which is very confusing, as the values are perfectly valid.
This patch fixes that by printing UnknownInstructions without a name using `print` instead of `printAsOperand`.
Reviewers: asbirlea, chandlerc, sanjoy, grosser
Reviewed By: asbirlea
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D48609
llvm-svn: 335751
I think the intrinsics named 'avx512.mask.' should refer to the previous behavior of taking a mask argument in the intrinsic instead of using a 'select' or 'and' instruction in IR to accomplish the masking. This is more consistent with the goal that eventually we will have no intrinsics that have masking builtin. When we reach that goal, we should have no intrinsics named "avx512.mask".
llvm-svn: 335744
If a source of rcp instruction is a result of any conversion from
an integer convert it into rcp_iflag instruction. No FP exception
can ever happen except division by zero if a single precision rcp
argument is a representation of an integral number.
Differential Revision: https://reviews.llvm.org/D48569
llvm-svn: 335742
This patch adds a custom trunc store lowering for v4i8 vector types.
Since there is not v.4b register, the v4i8 is promoted to v4i16 (v.4h)
and default action for v4i8 is to extract each element and issue 4
byte stores.
A better strategy would be to extended the promoted v4i16 to v8i16
(with undef elements) and extract and store the word lane which
represents the v4i8 subvectores. The construction:
define void @foo(<4 x i16> %x, i8* nocapture %p) {
%0 = trunc <4 x i16> %x to <4 x i8>
%1 = bitcast i8* %p to <4 x i8>*
store <4 x i8> %0, <4 x i8>* %1, align 4, !tbaa !2
ret void
}
Can be optimized from:
umov w8, v0.h[3]
umov w9, v0.h[2]
umov w10, v0.h[1]
umov w11, v0.h[0]
strb w8, [x0, #3]
strb w9, [x0, #2]
strb w10, [x0, #1]
strb w11, [x0]
ret
To:
xtn v0.8b, v0.8h
str s0, [x0]
ret
The patch also adjust the memory cost for autovectorization, so the C
code:
void foo (const int *src, int width, unsigned char *dst)
{
for (int i = 0; i < width; i++)
*dst++ = *src++;
}
can be vectorized to:
.LBB0_4: // %vector.body
// =>This Inner Loop Header: Depth=1
ldr q0, [x0], #16
subs x12, x12, #4 // =4
xtn v0.4h, v0.4s
xtn v0.8b, v0.8h
st1 { v0.s }[0], [x2], #4
b.ne .LBB0_4
Instead of byte operations.
llvm-svn: 335735
This patch adds support for the q versions of the dup
(load-to-all-lanes) NEON intrinsics, such as vld2q_dup_f16() for
example.
Currently, non-q versions of the dup intrinsics are implemented
in clang by generating IR that first loads the elements of the
structure into the first lane with the lane (to-single-lane)
intrinsics, and then propagating it other lanes. There are at
least two problems with this approach. First, there are no
double-spaced to-single-lane byte-element instructions. For
example, there is no such instruction as 'vld2.8 { d0[0], d2[0]
}, [r0]'. That means we cannot rely on the to-single-lane
intrinsics and instructions to implement the q versions of the
dup intrinsics. Note that to-all-lanes instructions do support
all sizes of data items, including bytes.
The second problem with the current approach is that we need a
separate vdup instruction to propagate the structure to each
lane. So for vld4q_dup_f16() we would need four vdup instructions
in addition to the initial vld instruction.
This patch introduces dup LLVM intrinsics and reworks handling of
the currently supported (non-q) NEON dup intrinsics to expand
them into those LLVM intrinsics, thus eliminating the need for
using to-single-lane intrinsics and instructions.
Additionally, this patch adds support for u64 and s64 dup NEON
intrinsics. These are marked as Arch64-only in the ARM NEON
Reference, but it seems there are no reasons to not support them
in AArch32 mode. Please correct, if that is wrong.
That's what we generate with this patch applied:
vld2q_dup_f16:
vld2.16 {d0[], d2[]}, [r0]
vld2.16 {d1[], d3[]}, [r0]
vld3q_dup_f16:
vld3.16 {d0[], d2[], d4[]}, [r0]
vld3.16 {d1[], d3[], d5[]}, [r0]
vld4q_dup_f16:
vld4.16 {d0[], d2[], d4[], d6[]}, [r0]
vld4.16 {d1[], d3[], d5[], d7[]}, [r0]
Differential Revision: https://reviews.llvm.org/D48439
llvm-svn: 335733
This prevents InstCombine from creating mis-sized dbg.values when
replacing a sequence of casts with a simpler cast. For example, in:
(fptrunc (floor (fpext X))) -> (floorf X)
We no longer emit dbg.value(X) (with a 32-bit float operand) to describe
(fpext X) (which is a 64-bit float).
This was diagnosed by the debugify check added in r335682.
llvm-svn: 335696
Nothing was using this relationship. By splitting them we no longer need to worry about register or memory entries being empty in a group.
The memory folding tables in X86InstrInfo.cpp can be used to access this relationship if needed.
llvm-svn: 335694
Summary:
When recording uses we need to rewrite after cloning a loop we need to
check if the use is not dominated by the original def. The initial
assumption was that the cloned basic block will introduce a new path and
thus the original def will only dominate the use if they are in the same
BB, but as the reproducer from PR37745 shows it's not always the case.
This fixes PR37745.
Reviewers: haicheng, Ka-Ka
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D48111
llvm-svn: 335675
LLJIT is a prefabricated ORC based JIT class that is meant to be the go-to
replacement for MCJIT. Unlike OrcMCJITReplacement (which will continue to be
supported) it is not API or bug-for-bug compatible, but targets the same
use cases: Simple, non-lazy compilation and execution of LLVM IR.
LLLazyJIT extends LLJIT with support for function-at-a-time lazy compilation,
similar to what was provided by LLVM's original (now long deprecated) JIT APIs.
This commit also contains some simple utility classes (CtorDtorRunner2,
LocalCXXRuntimeOverrides2, JITTargetMachineBuilder) to support LLJIT and
LLLazyJIT.
Both of these classes are works in progress. Feedback from JIT clients is very
welcome!
llvm-svn: 335670
This addresses post-commit feedback about the name 'skipDebugInfo' being
misleading. This name could be interpreted as meaning 'a function that
skips instructions with debug locations'.
The new name, 'skipDebugIntrinsics', makes it clear that this function
only skips debug info intrinsics.
Thanks to Adrian Prantl for pointing this out!
llvm-svn: 335667
AsynchronousSymbolQuery::canStillFail checks the value of the callback to
prevent sending it redundant error notifications, so we need to reset it after
running it.
llvm-svn: 335664
Right now, when we use RIP-relative instructions in 32-bit mode, we'll just
assert and crash.
This adds an error message which tells the user that they can't do that in
32-bit mode, so that we don't crash (and also can see the issue outside of
assert builds).
llvm-svn: 335658
This replaces most argument uses with loads, but for
now not all.
The code in SelectionDAG for calling convention lowering
is actively harmful for amdgpu_kernel. It attempts to
split the argument types into register legal types, which
results in low quality code for arbitary types. Since
all kernel arguments are passed in memory, we just want the
raw types.
I've tried a couple of methods of mitigating this in SelectionDAG,
but it's easier to just bypass this problem alltogether. It's
possible to hack around the problem in the initial lowering,
but the real problem is the DAG then expects to be able to use
CopyToReg/CopyFromReg for uses of the arguments outside the block.
Exposing the argument loads in the IR also has the advantage
that the LoadStoreVectorizer can merge them.
I'm not sure the best approach to dealing with the IR
argument list is. The patch as-is just leaves the IR arguments
in place, so all the existing code will still compute the same
kernarg size and pointlessly lowers the arguments.
Arguably the frontend should emit kernels with an empty argument
list in the first place. Alternatively a dummy array could be
inserted as a single argument just to reserve space.
This does have some disadvantages. Local pointer kernel arguments can
no longer have AssertZext placed on them as the equivalent !range
metadata is not valid on pointer typed loads. This is mostly bad
for SI which needs to know about the known bits in order to use the
DS instruction offset, so in this case this is not done.
More importantly, this skips noalias arguments since this pass
does not yet convert this to the equivalent !alias.scope and !noalias
metadata. Producing this metadata correctly seems to be tricky,
although this logically is the same as inlining into a function which
doesn't exist. Additionally, exposing these loads to the vectorizer
may result in degraded aliasing information if a pointer load is
merged with another argument load.
I'm also not entirely sure this is preserving the current clover
ABI, although I would greatly prefer if it would stop widening
arguments and match the HSA ABI. As-is I think it is extending
< 4-byte arguments to 4-bytes but doesn't align them to 4-bytes.
llvm-svn: 335650
Not sure why this logic seems to be repeated in 2 different places,
one called by the other.
On AMDGPU addrspace(3) globals start allocating at 0, so these
checks will be incorrect (not that real code actually tries
to compare these addresses)
llvm-svn: 335649
Summary: This is trying to add support for r334428.
Reviewers: sanjoy
Subscribers: jlebar, hiraditya, bixia, llvm-commits
Differential Revision: https://reviews.llvm.org/D48399
llvm-svn: 335646