Summary:
If horizontal reduction tree starts from the binary operation that is
used in PHI node, but this PHI is not used in horizontal reduction, we
may end up with extra addition of this PHI node after vectorization.
Here is an example:
```
%phi = phi i32 [ %tmp, %end], ...
...
%tmp = add i32 %tmp1, %tmp2
end:
```
after vectorization we always have something like:
```
%phi = phi i32 [ %tmp, %end], ...
...
%red = extractelement <8 x 32> %vec.red, 0
%tmp = add i32 %red, %phi
end:
```
even if `%phi` is not used in reduction tree. Patch considers these PHI
nodes as extra arguments and considers them in the final result iff they
really used in reduction.
Reviewers: mkuper, hfinkel, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30409
llvm-svn: 296606
Windows does not treat `~` as a reference to home directory, so the call
to `llvm::sys::path::native` on, say, `~/somedir` produces `~\somedir`,
which has different meaning than the original path. With this change
tilde is expanded on Windows to user profile directory. Such behavior
keeps original meaning of the path and is consistent with the algorithm
of `llvm::sys::path::home_directory`.
Differential Revision: https://reviews.llvm.org/D27527
llvm-svn: 296590
Summary:
Solves PR 31990.
The bad rewrite could replace a memcpy of one word with
store i4 -1
while it should actually be
store i8 -1
Hopefully opt and llc has improved enough so the original optimization
done by the code isn't needed anymore.
One already existing testcase is affected. It originally tested that
the memcpy was replaced with
load double
but since we now remove that rewrite it will be
load i64
instead.
Patch suggestion by Eli Friedman.
Reviewers: eli.friedman, majnemer, efriedma
Reviewed By: efriedma
Subscribers: efriedma, llvm-commits
Differential Revision: https://reviews.llvm.org/D30254
llvm-svn: 296585
The practice in LV is that we emit analysis remarks and then finally report
either a missed or applied remark on the final decision whether vectorization
is taking place. On this code path, we were closing with an analysis remark.
llvm-svn: 296578
for VectorizeTree() API.This API uses it for proper mask computation to be used in shufflevector IR.
The fix is to compute the mask for out of order memory accesses while building the vectorizable tree
instead of actual vectorization of vectorizable tree.
Reviewers: mkuper
Differential Revision: https://reviews.llvm.org/D30159
Change-Id: Id1e287f073fa4959713ba545fa4254db5da8b40d
llvm-svn: 296575
When SDAGISel (top-down) selects a tail-call, it skips the remainder
of the block.
If, before that, FastISel (bottom-up) selected some of the (no-op) next
few instructions, we can end up with dead instructions following the
terminator (selected by SDAGISel).
We need to erase them, as we know they aren't necessary (in addition to
being incorrect).
We already do this when FastISel falls back on the tail-call itself.
Also remove the FastISel-emitted code if we fallback on the
instructions between the tail-call and the return.
llvm-svn: 296552
Iterating on the use-list we're modifying doesn't work: after the first
iteration, the use-list iterator will point to a MachineOperand
referencing the new register. This caused us to skip the other uses to
replace.
Instead, use MRI.replaceRegWith(), which accounts for this behavior.
llvm-svn: 296551
Conflicting debug info for function arguments causes hard-to-debug
assertions in the DWARF backend, so the Verifier should reject it.
For performance reasons this only checks function arguments from
non-inlined debug intrinsics for now.
rdar://problem/30520286
This reapplies r295749 after fixing PR32042.
llvm-svn: 296543
This is for running the assembler with -g (to emit DWARF describing
the assembler source).
Differential Revision: http://reviews.llvm.org/D30475
llvm-svn: 296541
To facilitate this, add a new hidden command-line option to disable
the explicit-locals pass. That causes llc to emit invalid code that doesn't
have all locals converted to get_local/set_local, however it simplifies
testwriting in many cases.
llvm-svn: 296540
This prevents generating stm r1!, {r0, r1} on Thumb1, where value
stored for r1 is UNKONWN.
Patch by Zhaoshi Zheng.
Differential Revision: https://reviews.llvm.org/D27910
llvm-svn: 296538
Summary:
Currently, our post-dom tree tries to ignore and remove the effects of
infinite loops. It fails miserably at this, because it tries to do it
ahead of time, and thus can only detect self-loops, and any other type
of infinite loop, it pretends doesn't exist at all.
This can, in a bunch of cases, lead to wrong answers and a completely
empty post-dom tree.
Wrong answer:
```
declare void foo()
define internal void @f() {
entry:
br i1 undef, label %bb35, label %bb3.i
bb3.i:
call void @foo()
br label %bb3.i
bb35.loopexit3:
br label %bb35
bb35:
ret void
}
```
We get:
```
Inorder PostDominator Tree:
[1] <<exit node>> {0,7}
[2] %bb35 {1,6}
[3] %bb35.loopexit3 {2,3}
[3] %entry {4,5}
```
This is a trivial modification of the testcase for PR 6047
Note that we pretend bb3.i doesn't exist.
We also pretend that bb35 post-dominates entry.
While it's true that it does not exit in a theoretical sense, it's not
really helpful to try to ignore the effect and pretend that bb35
post-dominates entry. Worse, we pretend the infinite loop does
nothing (it's usually considered a side-effect), and doesn't even
exist, even when it calls a function. Sadly, this makes it impossible
to use when you are trying to move code safely. All compilers also
create virtual or real single exit nodes (including us), and connect
infinite loops there (which this patch does). In fact, others have
worked around our behavior here, to the point of building their own
post-dom trees:
https://zneak.github.io/fcd/2016/02/17/structuring.html and pointing
out the region infrastructure is near-useless for them with postdom in
this state :(
Completely empty post-dom tree:
```
define void @spam() #0 {
bb:
br label %bb1
bb1: ; preds = %bb1, %bb
br label %bb1
bb2: ; No predecessors!
ret void
}
```
Printing analysis 'Post-Dominator Tree Construction' for function 'foo':
=============================--------------------------------
Inorder PostDominator Tree:
[1] <<exit node>> {0,1}
:(
(note that even if you ignore the effects of infinite loops, bb2
should be present as an exit node that post-dominates nothing).
This patch changes post-dom to properly handle infinite loops and does
root finding during calculation to prevent empty tress in such cases.
We match gcc's (and the canonical theoretical) behavior for infinite
loops (find the backedge, connect it to the exit block).
Testcases coming as soon as i finish running this on a ton of random graphs :)
Reviewers: chandlerc, davide
Subscribers: bryant, llvm-commits
Differential Revision: https://reviews.llvm.org/D29705
llvm-svn: 296535
other tables. Providing a helpful error message to what the error is and
where the error occurred based on which opcode it was associated with.
There have been handful of bug fixes dealing with bad bind info in
object files, r294021 and r249845, which only put a band aid on the
problem after a bad bind table was created after unpacking from
its compact info. In these cases a bind table should have never been
created and an error should have simply been generated.
This change puts in place the plumbing to allow checking and returning
of an error when the compact info is unpacked. This follows the model
of iterators that can fail that Lang Hanes designed when fixing the problem
for bad archives r275316 (or r275361).
This change uses one of the existing test cases that now causes an
error instead of printing <<bad library ordinal>> after a bad bind table
is created. The error uses the offset into the opcode table as shown with
the macOS dyldinfo(1) tool to indicate where the error is and which
opcode and which parameter is in error.
For example the exiting test case has this lazy binding opcode table:
% dyldinfo -opcodes test/tools/llvm-objdump/Inputs/bad-ordinal.macho-x86_64
…
lazy binding opcodes:
0x0000 BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB(0x02, 0x00000010)
0x0002 BIND_OPCODE_SET_DYLIB_ORDINAL_IMM(2)
In the test case the binary only has one library so setting the library
ordinal to the value of 2 in the BIND_OPCODE_SET_DYLIB_ORDINAL_IMM
opcode at 0x0002 above is an error. This now produces this error message:
% llvm-objdump -lazy-bind bad-ordinal.macho-x86_64
…
llvm-objdump: 'bad-ordinal.macho-x86_64': truncated or malformed object (for BIND_OPCODE_SET_DYLIB_ORDINAL_ULEB bad library ordinal: 2 (max 1) for opcode at: 0x2)
This change provides the plumbing for the error handling and one example
of an error message. Other error checks and test cases will be added in follow
on commits.
llvm-svn: 296527
Requesting DWARF v5 will now get you the new compile-unit and
type-unit headers. llvm-dwarfdump will also recognize them.
Differential Revision: http://reviews.llvm.org/D30206
llvm-svn: 296514
It's not clear to me if this is always better than
doing ds_write2_b64 This adds the constraint of
a 128-bit register input instead of a pair of
64-bit.
llvm-svn: 296512
If during scheduling we have identified that we cannot keep optimistic
occupancy increase critical register pressure limit and try scheduling
of the whole function again. In this case blocks with smaller pressure
will have a chance for better scheduling.
Differential Revision: https://reviews.llvm.org/D30442
llvm-svn: 296506
Summary: For SamplePGO, the profile may contain cross-module inline stacks. As we need to make sure the profile annotation happens when all the hot inline stacks are expanded, we need to pass this info to the module importer so that it can import proper functions if necessary. This patch implemented this feature by emitting cross-module targets as part of function entry metadata. In the module-summary phase, the metadata is used to build call edges that points to functions need to be imported.
Reviewers: mehdi_amini, tejohnson
Reviewed By: tejohnson
Subscribers: davidxl, llvm-commits
Differential Revision: https://reviews.llvm.org/D30053
llvm-svn: 296498
This change introduces new method to estimate register pressure in
GCNScheduler. Standard RPTracker gives huge error due to the following
reasons:
1. It does not account for live-ins or live-outs if value is not used
in the region itself. That creates a huge error in a very common case
if there are a lot of live-thu registers.
2. It does not properly count subregs.
3. It assumes a register used as an input operand can be reused as an
output. This is not always possible by itself, this is not what RA
will finally do in many cases for various reasons not limited to RA's
inability to do so, and this is not so if the value is actually a
live-thu.
In addition we can now see clear separation between live-in pressure
which we cannot change with the scheduling and tentative pressure
which we can change.
Differential Revision: https://reviews.llvm.org/D30439
llvm-svn: 296491
The LLVM backend cannot produce any debug info for an llvm::Function
without a DISubprogram attachment. When inlining a debug-info-carrying
function into a nodebug function, there is therefore no reason to keep
any debug info intrinsic calls or debug locations on the instructions.
This fixes a problem discovered in PR32042.
rdar://problem/30679307
llvm-svn: 296488
This recovers a test case that was severely broken by r296476, my making sure we don't create ADD/ADC that loads and stores when there is also a flag dependency.
llvm-svn: 296486
If two subregs of the same register are defined and we need to revert
schedule changing def order, we will end up with both instructions
having def,read-undef flags because adjustLaneLiveness() will only set
this flag but will not remove it.
Fix this by removing read-undef flags before calling adjustLaneLiveness.
Differential Revision: https://reviews.llvm.org/D30428
llvm-svn: 296484
Stack Smash Protection is not completely free, so in hot code, the overhead it causes can cause performance issues. By adding diagnostic information for which functions have SSP and why, a user can quickly determine what they can do to stop SSP being applied to a specific hot function.
This change adds a remark that is reported by the stack protection code when an instruction or attribute is encountered that causes SSP to be applied.
Patch by: James Henderson
Differential Revision: https://reviews.llvm.org/D29023
llvm-svn: 296483
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as the separation of non-interfering loads/stores from the
store-merging logic.
When merging stores search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly constructs
wider loads, but then promotes them to float operations which appear
but requires more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit as
some in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296476
Summary:
This will allow future patches to inspect the details of the LLT. The implementation is now split between
the Support and CodeGen libraries to allow TableGen to use this class without introducing layering concerns.
Thanks to Ahmed Bougacha for finding a reasonable way to avoid the layering issue and providing the version of this patch without that problem.
Reviewers: t.p.northover, qcolombet, rovka, aditya_nandakumar, ab, javed.absar
Subscribers: arsenm, nhaehnle, mgorny, dberris, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D30046
llvm-svn: 296474
Lower i32, float and double parameters that need to live on the stack. This
boils down to creating some G_GEPs starting from the stack pointer and storing
the values there. During the process we also keep track of the stack size and
use the final value in the ADJCALLSTACKDOWN/UP instructions.
We currently assert for smaller types, since they usually require extensions.
They will be handled in a separate patch.
llvm-svn: 296473
In Thumb2, instructions which write to the PC are UNPREDICTABLE if they are in
an IT block but not the last instruction in the block.
Previously, we only diagnosed this for LDM instructions, this patch extends the
diagnostic to cover all of the relevant instructions.
Differential Revision: https://reviews.llvm.org/D30398
llvm-svn: 296459
This is also useful in cases when llvm is in a shared library. First we dlopen
the llvm shared library and then we register it as a permanent library in order
to keep the JIT and other services working.
Patch reviewed by Vedant Kumar (D29955)!
llvm-svn: 296442
Summary:
With this change ImplicitNullCheck optimization uses alias analysis
and can use load/store memory access for implicit null check if there
are other load/store before but memory accesses do not alias.
Patch by Serguei Katkov!
Reviewers: sanjoy
Reviewed By: sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30331
llvm-svn: 296440
This is a patch for the outliner described in the RFC at:
http://lists.llvm.org/pipermail/llvm-dev/2016-August/104170.html
The outliner is a code-size reduction pass which works by finding
repeated sequences of instructions in a program, and replacing them with
calls to functions. This is useful to people working in low-memory
environments, where sacrificing performance for space is acceptable.
This adds an interprocedural outliner directly before printing assembly.
For reference on how this would work, this patch also includes X86
target hooks and an X86 test.
The outliner is run like so:
clang -mno-red-zone -mllvm -enable-machine-outliner file.c
Patch by Jessica Paquette<jpaquette@apple.com>!
rdar://29166825
Differential Revision: https://reviews.llvm.org/D26872
llvm-svn: 296418
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296416
Before the endianness was specified on each call to read
or write of the StreamReader / StreamWriter, but in practice
it's extremely rare for streams to have data encoded in
multiple different endiannesses, so we should optimize for the
99% use case.
This makes the code cleaner and more general, but otherwise
has NFC.
llvm-svn: 296415
The Fuchsia ASan runtime reserves the low part of the address space.
Patch by Roland McGrath
Differential Revision: https://reviews.llvm.org/D30426
llvm-svn: 296405
Instead of requiring every non-COFF MCObjectStreamer to implement the
COFF hooks just to do an llvm_unreachable to say that they're not
supported, do the llvm_unreachable in the default implementation, as
suggested by rnk in https://reviews.llvm.org/D26722.
llvm-svn: 296403
This was reverted because it was breaking some builds, and
because of incorrect error code usage. Since the CL was
large and contained many different things, I'm resubmitting
it in pieces.
This portion is NFC, and consists of:
1) Renaming classes to follow a consistent naming convention.
2) Fixing the const-ness of the interface methods.
3) Adding detailed doxygen comments.
4) Fixing a few instances of passing `const BinaryStream& X`. These
are now passed as `BinaryStreamRef X`.
llvm-svn: 296394
Summary: Should use the Valuekind read from the profile.
Reviewers: davidxl
Reviewed By: davidxl
Subscribers: llvm-commits, xur
Differential Revision: https://reviews.llvm.org/D30420
llvm-svn: 296391
The transform in question claims to be doing:
// fold (add (select cc, 0, c), x) -> (select cc, x, (add, x, c))
...starting in PerformADDCombineWithOperands(), but it wasn't actually checking for a setcc node
for the sext/zext patterns.
This is exactly the opposite of a transform I'd like to add to DAGCombiner's foldSelectOfConstants(),
so I was seeing infinite loops with my draft of a patch applied.
The changes in select_const.ll look positive (less instructions). The change in arm-and-tst-peephole.ll
is unrelated. We're changing the input IR in that test to preserve the intent of the test, but that's
not affected by this code change.
Differential Revision:
https://reviews.llvm.org/D30355
llvm-svn: 296389
DAGCombiner already supports peeking thorough shuffles to improve vector element extraction, but legalization often leaves us in situations where we need to extract vector elements after shuffles have already been lowered.
This patch adds support for VECTOR_EXTRACT_ELEMENT/PEXTRW/PEXTRB instructions to attempt to handle target shuffles as well. I've covered some basic scenarios including handling shuffle mask scaling and the implicit zero-extension of PEXTRW/PEXTRB, there is more that could be done here (that I've mentioned in TODOs) but I haven't found many cases where its worth it.
Differential Revision: https://reviews.llvm.org/D30176
llvm-svn: 296381
Summary: Existing implementation of duplicateSimpleBB function drops DebugLoc metadata of branch instructions during the transformation. This patch addresses this issue by making newly created branch instructions to keep the metadata of replaced branch instructions.
Reviewers: qcolombet, craig.topper, aprantl, MatzeB, sanjoy, dblaikie
Reviewed By: dblaikie
Subscribers: dblaikie, llvm-commits
Differential Revision: https://reviews.llvm.org/D30026
llvm-svn: 296371
This was suggested in D27855: have the inliner add assumptions, so we don't
lose nonnull info provided by argument attributes.
This still doesn't solve PR28430 (dyn_cast), but this gets us closer.
https://reviews.llvm.org/D29999
llvm-svn: 296366
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than those number of elements and therefore cause a malloc.
APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30392
llvm-svn: 296355
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than those number of elements and therefore cause a malloc.
APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30390
llvm-svn: 296354
Some of the vectors are under sized to avoid heap allocation. In one case the vector was oversized.
Differential Revision: https://reviews.llvm.org/D30387
llvm-svn: 296353
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than those number of elements and therefore cause a malloc.
APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt. This will incur a minor increase in stack usage due to APInt storing the bit count separately from the data bits unlike SmallBitVector, but that should be ok.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30386
llvm-svn: 296352
This is a fix for a loop predication bug which resulted in malformed IR generation.
Loop invariant side of the widened condition is not guaranteed to be available in the preheader as is, so we need to expand it as well. See added unsigned_loop_0_to_n_hoist_length test for example.
Reviewed By: sanjoy, mkazantsev
Differential Revision: https://reviews.llvm.org/D30099
llvm-svn: 296345
This is a cleanup/rewrite of the printSysAlias function. This was not using the
tablegen instruction descriptions, but was "manually" decoding the
instructions. This has been replaced with calls to lookup_XYZ_ByEncoding
tablegen calls.
This revealed several problems. First, instruction IVAU had the wrong encoding.
This was cancelled out by the parser that incorrectly matched the wrong
encoding. Second, instruction CVAP was missing from the SystemOperands tablegen
descriptions, so this has been added. And third, the required target features
were not captured in the tablegen descriptions, so support for this has also
been added.
Differential Revision: https://reviews.llvm.org/D30329
llvm-svn: 296343
Currently we handle this correctly in arm, but in thumb we don't which leads to
an unpredictable instruction being emitted for LSL #0 in an IT block and SP not
being permitted in some cases when it should be.
For the thumb2 LSL we can handle this by making LSL #0 an alias of MOV in the
.td file, but for thumb1 we need to handle it in checkTargetMatchPredicate to
get the IT handling right. We also need to adjust the handling of
MOV rd, rn, LSL #0 to avoid generating the 16-bit encoding in an IT block. We
should also adjust it to allow SP in the same way that it is allowed in
MOV rd, rn, but I haven't done that here because it looks like it would take
quite a lot of work to get right.
Additionally correct the selection of the 16-bit shift instructions in
processInstruction, where it was checking if the two registers were equal when
it should have been checking if they were low. It appears that previously this
code was never executed and the 16-bit encoding was selected by default, but
the other changes I've done here have somehow made it start being used.
Differential Revision: https://reviews.llvm.org/D30294
llvm-svn: 296342
This pattern is essentially a i16 load from p+1 address:
%p1.i16 = bitcast i8* %p to i16*
%p2.i8 = getelementptr i8, i8* %p, i64 2
%v1 = load i16, i16* %p1.i16
%v2.i8 = load i8, i8* %p2.i8
%v2 = zext i8 %v2.i8 to i16
%v1.shl = shl i16 %v1, 8
%res = or i16 %v1.shl, %v2
Current implementation would identify %v1 load as the first byte load and would mistakenly emit a i16 load from %p1.i16 address. This patch adds a check that the first byte is loaded from a non-zero offset of the first load address. This way this address can be used as the base address for the combined value. Otherwise just give up combining.
llvm-svn: 296336
There are no instructions that have "[1]" as part of the assembly string;
FMOVXDhighr is out of date. This removes dead code.
Differential Revision: https://reviews.llvm.org/D30165
llvm-svn: 296327
- Verify that runtime metadata is actually valid runtime metadata when assembling, otherwise we could accept the following when assembling, but ocl runtime will reject it:
.amdgpu_runtime_metadata
{ amd.MDVersion: [ 2, 1 ], amd.RandomUnknownKey, amd.IsaInfo: ...
- Make IsaInfo optional, and always emit it.
Differential Revision: https://reviews.llvm.org/D30349
llvm-svn: 296324
Summary:
BranchInst, SwitchInst (with non-default case) with Undef as input is not
possible at this point. As we always default-fold terminator to one target in
ResolvedUndefsIn and set the input accordingly.
So we should only have constantint/blockaddress here.
If ConstantFoldTerminator fails, that could mean 2 things.
1. ConstantFoldTerminator is doing something unexpected, i.e. not folding on constantint
or blockaddress and not making blocks that should be dead dead.
2. This is not a terminator on constantint or blockaddress. Its on a constant or
overdefined, then this block should not be dead.
In both cases, we should assert.
Reviewers: davide, efriedma, sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30381
llvm-svn: 296281
Summary:
Previously we used to return a bogus result, 0, for IR like `ashr %val,
-1`.
I've also added an assert checking that `ComputeNumSignBits` at least
returns 1. That assert found an already checked in test case where we
were returning a bad result for `ashr %val, -1`.
Fixes PR32045.
Reviewers: spatel, majnemer
Reviewed By: spatel, majnemer
Subscribers: efriedma, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D30311
llvm-svn: 296273
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296272
Summary:
While collecting operands we make copies of the LiveReg objects which are stored in the LiveRegs array. If the instruction uses the same register multiple times we end up with multiple copies. Later we iterate through the collected list of LiveReg objects and merge DomainValues. In the process of doing this the merge function can change the contents of the original LiveReg object in the LiveRegs array, but not the copies that have been made. So when we get to the second usage of the register we end up seeing a stale copy of the LiveReg object.
To fix this I've stopped copying and now just store a pointer to the original LiveReg object. Another option might be to avoid adding the same register to the Regs array twice, but this approach seemed simpler.
The included test case exposes this bug due to an AVX-512 masked OR instruction using the same register for the passthru operand and one of the inputs to the OR operation.
Fixes PR30284.
Reviewers: RKSimon, stoklund, MatzeB, spatel, myatsina
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30242
llvm-svn: 296260
r296215, "[PDB] General improvements to Stream library."
r296217, "Disable BinaryStreamTest.StreamReaderObject temporarily."
r296220, "Re-enable BinaryStreamTest.StreamReaderObject."
r296244, "[PDB] Disable some tests that are breaking bots."
r296249, "Add static_cast to silence -Wc++11-narrowing."
std::errc::no_buffer_space should be used for OS-oriented errors for socket transmission.
(Seek discussions around llvm/xray.)
I could substitute s/no_buffer_space/others/g, but I revert whole them ATM.
Could we define and use LLVM errors there?
llvm-svn: 296258
When dumping .debug_info section we loop through all attributes mentioned in
.debug_abbrev section and dump values using DWARFFormValue::extractValue().
We need to skip implicit_const attributes here as their values are not
really located in .debug_info but directly in .debug_abbrev. This patch fixes
triggered assert() in DWARFFormValue::extractValue() caused by trying to
access implicit_const values from .debug_info.
llvm-svn: 296253
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This is cleaner
as the separation of non-interfering loads/stores from the
store-merging logic.
When merging stores search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly constructs
wider loads, but then promotes them to float operations which appear
but requires more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyfromReg
nodes as these are captured by data dependence
6. Forward loads-store values through tokenfactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit as
some in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296252
This adds various new functionality and cleanup surrounding the
use of the Stream library. Major changes include:
* Renaming of all classes for more consistency / meaningfulness
* Addition of some new methods for reading multiple values at once.
* Full suite of unit tests for reader / writer functionality.
* Full set of doxygen comments for all classes.
* Streams now store their own endianness.
* Fixed some bugs in a few of the classes that were discovered
by the unit tests.
llvm-svn: 296215
This is part of a larger effort to get the Stream code moved
up to Support. I don't want to do it in one large patch, in
part because the changes are so big that it will treat everything
as file deletions and add, losing history in the process.
Aside from that though, it's just a good idea in general to
make small changes.
So this change only changes the names of the Stream related
source files, and applies necessary source fix ups.
llvm-svn: 296211
Current internal option -static-func-full-module-prefix keeps all the
directory path the profile counter names for static functions. The default
of this option is false. This strips the directory names from the source
filename which is problematic:
(1) it creates linker errors for profile-generation compilation, exposed in
our internal benchmarks. We are seeing messages like
"warning: relocation refers to discarded section".
This is due to the name conflicts after the stripping.
(2) the stripping only applies to getPGOFuncName.
Current Thin-LTO module importing for the indirect-calls assumes
the source directory name not being stripped. Current default value
for this option can potentially prevent some inter-module
indirect-call-promotions.
This patch turns the default value for -static-func-full-module-prefix to true.
The second part of the patch is to have an alternative implementation under
the internal option -static-func-strip-dirname-prefix=<value>
This options specifies level of directories to be stripped from the source
filename. Using a large value as the parameter has the same effect as
-static-func-full-module-prefix.
Differential Revision: http://reviews.llvm.org/D29512
llvm-svn: 296206
With the "wasm32-unknown-unknown-wasm" triple, this allows writing out
simple wasm object files, and is another step in a larger series toward
migrating from ELF to general wasm object support. Note that this code
and the binary format itself is still experimental.
llvm-svn: 296190
This reverts commit r296009. It broke one out of tree target and also
does not account for all partial lines added or removed when calculating
PressureDiff.
llvm-svn: 296182
All G_CONSTANTS created by the MachineIRBuilder have an operand of type CImm
(i.e. a ConstantInt), so that's what the selector needs to look for.
llvm-svn: 296176
When we construct addressing modes, we use isNoopAddrSpaceCast to ignore
addrspacecast instructions. Make sure we insert the correct addrspacecast
when we reconstruct the addressing mode.
Differential Revision: https://reviews.llvm.org/D30114
llvm-svn: 296167
This optimisation was crashing when there was a chain of more than one bitcast
instruction to replace, as a result of the changes in D27283.
Patch by James Price.
Differential Revision: https://reviews.llvm.org/D30347
llvm-svn: 296163
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296149
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296147
This patch merges the existing floating-point induction variable widening code
into the integer induction variable widening code, creating a single set of
functions for both kinds of inductions. The primary motivation for doing this
is to enable vector phi node creation for floating-point induction variables.
Differential Revision: https://reviews.llvm.org/D30211
llvm-svn: 296145
Provide a 64-bit pattern to use SUBFIC for subtracting from a 16-bit immediate.
The corresponding pattern already exists for 32-bit integers.
Committing on behalf of Hiroshi Inoue.
Differential Revision: https://reviews.llvm.org/D29387
llvm-svn: 296144
Emit clrrdi (extended mnemonic for rldicr) for AND-ing with masks that
clear bits from the right hand size.
Committing on behalf of Hiroshi Inoue.
Differential Revision: https://reviews.llvm.org/D29388
llvm-svn: 296143
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296141
The motivation for filling out these select-of-constants cases goes back to D24480,
where we discussed removing an IR fold from add(zext) --> select. And that goes back to:
https://reviews.llvm.org/rL75531https://reviews.llvm.org/rL159230
The idea is that we should always canonicalize patterns like this to a select-of-constants
in IR because that's the smallest IR and the best for value tracking. Note that we currently
do the opposite in some cases (like the cases in *this* patch). Ie, the proposed folds in
this patch already exist in InstCombine today:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSelect.cpp#L1151
As this patch shows, most targets generate better machine code for simple ext/add/not ops
rather than a select of constants. So the follow-up steps to make this less of a patchwork
of special-case folds and missing IR canonicalization:
1. Have DAGCombiner convert any select of constants into ext/add/not ops.
2 Have InstCombine canonicalize in the other direction (create more selects).
Differential Revision: https://reviews.llvm.org/D30180
llvm-svn: 296137
This time with the missing files.
Similar to PR/25526, fast-regalloc introduces spills at the end of basic
blocks. When this occurs in between an ll and sc, the store can cause the
atomic sequence to fail.
This patch fixes the issue by introducing more pseudos to represent atomic
operations and moving their lowering to after the expansion of postRA
pseudos.
This resolves PR/32020.
Thanks to James Cowgill for reporting the issue!
Reviewers: slthakur
Differential Revision: https://reviews.llvm.org/D30257
llvm-svn: 296134
Similar to PR/25526, fast-regalloc introduces spills at the end of basic
blocks. When this occurs in between an ll and sc, the store can cause the
atomic sequence to fail.
This patch fixes the issue by introducing more pseudos to represent atomic
operations and moving their lowering to after the expansion of postRA
pseudos.
This resolves PR/32020.
Thanks to James Cowgill for reporting the issue!
Reviewers: slthakur
Differential Revision: https://reviews.llvm.org/D30257
llvm-svn: 296132
Summary:
This isn't testable for AArch64 by itself so this patch also adds
support for constant immediates in the pattern and physical
register uses in the result.
The new IntOperandMatcher matches the constant in patterns such as
'(set $rd:GPR32, (G_XOR $rs:GPR32, -1))'. It's always safe to fold
immediates into an instruction so this is the first rule that will match
across multiple BB's.
The Renderer hierarchy is responsible for adding operands to the result
instruction. Renderers can copy operands (CopyRenderer) or add physical
registers (in particular %wzr and %xzr) to the result instruction
in any order (OperandMatchers now import the operand names from
SelectionDAG to allow renderers to access any operand). This allows us to
emit the result instruction for:
%1 = G_XOR %0, -1 --> %1 = ORNWrr %wzr, %0
%1 = G_XOR -1, %0 --> %1 = ORNWrr %wzr, %0
although the latter is untested since the matcher/importer has not been
taught about commutativity yet.
Added BuildMIAction which can build new instructions and mutate them where
possible. W.r.t the mutation aspect, MatchActions are now told the name of
an instruction they can recycle and BuildMIAction will emit mutation code
when the renderers are appropriate. They are appropriate when all operands
are rendered using CopyRenderer and the indices are the same as the matcher.
This currently assumes that all operands have at least one matcher.
Finally, this change also fixes a crash in
AArch64InstructionSelector::select() caused by an immediate operand
passing isImm() rather than isCImm(). This was uncovered by the other
changes and was detected by existing tests.
Depends on D29711
Reviewers: t.p.northover, ab, qcolombet, rovka, aditya_nandakumar, javed.absar
Reviewed By: rovka
Subscribers: aemerson, dberris, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D29712
llvm-svn: 296131
Noticed while profiling PR32037, the target shuffle ops were being stored in SmallVector<*,8> types but the combiner could store as many as 16 ops at maximum depth (2 per depth).
llvm-svn: 296130
This one seems more obvious than D30270 that it can't make improvements because an extension always needs
all of the incoming bits. There's one specific transform in SimplifyDemandedInstructionBits of converting
a sext to a zext when the sign-bit is known zero, but that is handled explicitly in visitSext() with
ComputeSignBit().
Like D30270, there are no IR differences (other than instruction names) for the case in PR32037:
https://bugs.llvm.org//show_bug.cgi?id=32037
...and no regression test differences.
Zext/sext are a smaller part of the profile, but this still appears to shave off another 0.5% or so from
'opt -O2'.
Differential Revision: https://reviews.llvm.org/D30280
llvm-svn: 296129
Previously LLVM was assuming 32-bit signed immediates which results in and with
a bitmask that has bit 31 set to incorrectly include bits 63-32 in the result.
After applying this patch I can now compile all of the FreeBSD mips assembly
code with clang.
This issue also affects the nor, slt and sltu macros and I will fix those in a
separate review.
Patch By: Alexander Richardson
Commit message reformatted by sdardis.
Reviewers: atanasyan, theraven, sdardis
Differential Revision: https://reviews.llvm.org/D30298
llvm-svn: 296125
Make the MIPS disassembler consistent with the other targets in returning
a Size of zero when the input buffer cannot contain an instruction due
to it's size. Previously it reported the minimum instruction size when
it failed due to the buffer not being big enough for an instruction
causing llvm-objdump to crash when disassembling all sections.
Reviewers: slthakur
Differential Revision: https://reviews.llvm.org/D29984
llvm-svn: 296105
The current pattern for setting bits in range is typically:
Mask |= APInt::getBitsSet(MaskSizeInBits, LoPos, HiPos);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation memory for the temporary variable.
This is one of the key compile time issues identified in PR32037.
This patch adds the APInt::setBits() helper method which avoids the temporary memory allocation completely, this first implementation uses setBit() internally instead but already significantly reduces the regression in PR32037 (~10% drop). Additional optimization may be possible.
I investigated whether there is need for APInt::clearBits() and APInt::flipBits() equivalents but haven't seen these patterns to be particularly common, but reusing the code would be trivial.
Differential Revision: https://reviews.llvm.org/D30265
llvm-svn: 296102
The Fuchsia ABI defines slots from the thread pointer where the
stack-guard value for stack-protector, and the unsafe stack pointer
for safe-stack, are stored. This parallels the Android ABI support.
Patch by Roland McGrath
Differential Revision: https://reviews.llvm.org/D30237
llvm-svn: 296081
LoopUnswitch/simplify-with-nonvalness.ll is the test case for this.
The LIC has 2 users and deleting the 1st user when it can be simplified
invalidated the iterator for the 2nd user.
llvm-svn: 296069
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit about ~100
defs of registers containing constants, which we in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about a 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296060
This allows the ability to call IPDBSession::getGlobalScope with a NativeSession and
to then query it for some basic fields from the PDB's InfoStream.
Note that the symbols now have non-const references back to the Session so that
NativeRawSymbol can access the PDBFile through the Session.
Differential Revision: https://reviews.llvm.org/D30314
llvm-svn: 296049
We were stopping the translation of the parent block when the
translation of an instruction failed, but we were still trying to
translate the other blocks of the parent function.
Don't do that.
llvm-svn: 296047
Summary: In case we do not know what the condition is in an unswitched loop, but we know its definitely NOT a known constant. We can perform simplifcations based on this information.
Reviewers: sanjoy, hfinkel, chenli, efriedma
Reviewed By: efriedma
Subscribers: david2050, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D28968
llvm-svn: 296041
Summary:
The helper will be used in a later change. This change itself is NFC
since the only user of this new function is its unit test.
Reviewers: majnemer, efriedma
Reviewed By: efriedma
Subscribers: efriedma, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D30184
llvm-svn: 296035
This patch enables support for .f16x2 operations.
Added new register type Float16x2.
Added support for .f16x2 instructions.
Added handling of vectorized loads/stores of v2f16 values.
Differential Revision: https://reviews.llvm.org/D30057
Differential Revision: https://reviews.llvm.org/D30310
llvm-svn: 296032
FastISel wasn't checking the isFPOnlySP subtarget feature before emitting
double-precision operations, so it got completely invalid CodeGen for doubles
on Cortex-M4F.
The normal ISel testing wasn't spectacular either so I added a second RUN line
to improve that while I was in the area.
llvm-svn: 296031
While not CVP's fault, this caused miscompiles (PR31181). Reverting
until those are resolved.
(This also reverts the follow-ups r288154 and r288161 which removed the
flag.)
llvm-svn: 296030
Summary: SamplePGO uses branch_weight annotation to represent callsite hotness. When ICP promotes an indirect call to direct call, we need to make sure the direct call is annotated with branch_weight in SamplePGO mode, so that downstream function inliner can use hot callsite heuristic.
Reviewers: davidxl, eraman, xur
Reviewed By: davidxl, xur
Subscribers: mehdi_amini, llvm-commits
Differential Revision: https://reviews.llvm.org/D30282
llvm-svn: 296028
In the bit tracker, references to other bit values in which the register
is 0 are prohibited. This means that generating self-referential register
cells like { w:32 [0-15]:s[0-15] [16-31]:s[15] } is impossible. In order
to get a self-referential cell, it had to be stored into a map and then
reloaded from it. To avoid this step, add a function that will set the
register to a given value without going through the map.
llvm-svn: 296025
Last use was killed in my previous patch. The preferred way is now to
construct the remark, pipe things to it and pass it to ORE.emit.
llvm-svn: 296019
Rename ComputedTrellisEdges to ComputedEdges to allow for other methods of
pre-computing edges.
Differential Revision: https://reviews.llvm.org/D30308
llvm-svn: 296018
Having more fine-grained information on the specific construct that
caused us to fallback is valuable for large-scale data collection.
We still have the fallback warning, that's also used for FastISel.
We still need to remove the fallback warning, and teach FastISel to also
emit remarks (it currently has a combination of the warning, stats, and
debug prints: the remarks could unify all three).
The abort-on-fallback path could also be better handled using remarks:
one could imagine a "-Rpass-error", analoguous to "-Werror", which would
promote missed/failed remarks to errors. It's not clear whether that
would be useful for other remarks though, so we're not there yet.
llvm-svn: 296013
If a subreg is used in an instruction it counts as a whole superreg
for the purpose of register pressure calculation. This patch corrects
improper register pressure calculation by examining operand's lane mask.
Differential Revision: https://reviews.llvm.org/D29835
llvm-svn: 296009
This reverts commit r295749 while investigating PR32042.
It looks like this check uncovered a problem in the frontend that
needs to be fixed before the check can be enabled again.
llvm-svn: 296005
In OptimizeAdd, we scan the operand list to see if there are any common factors
between operands that can be factored out to reduce the number of multiplies
(e.g., 'A*A+A*B*C+D' -> 'A*(A+B*C)+D'). For each operand of the operand list, we
only consider unique factors (which is tracked by the Duplicate set). Now if we
find a factor that is a negative constant, we add the negated value as a factor
as well, because we can percolate the negate out. However, we mistakenly don't
add this negated constant to the Duplicates set.
Consider the expression A*2*-2 + B. Obviously, nothing to factor.
For the added value A*2*-2 we over count 2 as a factor without this change,
which causes the assert reported in PR30256. The problem is that this code is
assuming that all the multiply operands of the add are already reassociated.
This change avoids the issue by making OptimizeAdd tolerate multiplies which
haven't been completely optimized; this sort of works, but we're doing wasted
work: we'll end up revisiting the add later anyway.
Another possible approach would be to enforce RPO iteration order more strongly.
If we have RedoInsts, we process them immediately in RPO order, rather than
waiting until we've finished processing the whole function. Intuitively, it
seems like the natural approach: reassociation works on expression trees, so
the optimization only works in one direction. That said, I'm not sure how
practical that is given the current Reassociate; the "optimal" form for an
expression depends on its use list (see all the uses of "user_back()"), so
Reassociate is really an iterative optimization of sorts, so any changes here
would probably get messy.
PR30256
Differential Revision: https://reviews.llvm.org/D30228
llvm-svn: 296003
Summary: The discriminator has been encoded, and only the base discriminator should be used during profile matching.
Reviewers: dblaikie, davidxl
Reviewed By: dblaikie, davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30218
llvm-svn: 295999
Since LoopInfo is not available in machine passes as universally as in IR
passes, using the same approach for OptimizationRemarkEmitter as we did for IR
will run LoopInfo and DominatorTree unnecessarily. (LoopInfo is not used
lazily by ORE.)
To fix this, I am modifying the approach I took in D29836. LazyMachineBFI now
uses its client passes including MachineBFI itself that are available or
otherwise compute them on the fly.
So for example GreedyRegAlloc, since it's already using MBFI, will reuse that
instance. On the other hand, AsmPrinter in Justin's patch will generate DT,
LI and finally BFI on the fly.
(I am of course wondering now if the simplicity of this approach is even
preferable in IR. I will do some experiments.)
Testing is provided by an updated version of D29837 which requires Justin's
patch to bring ORE to the AsmPrinter.
Differential Revision: https://reviews.llvm.org/D30128
llvm-svn: 295996
Introduce a common ValueHandler for call returns and formal arguments, and
inherit two different versions for handling the differences (at the moment the
only difference is the way physical registers are marked as used).
llvm-svn: 295973
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295972
Add support for lowering calls with parameters than can fit into regs. Use the
same ValueHandler that we used for function returns, but rename it to match its
new, extended purpose.
llvm-svn: 295971
This patch adjusts the most relaxed predicate of immediate operands to accept
immediate forms such as ~(0xf0000000|0x000f00000). Previously these forms
would be accepted by GAS and rejected by IAS.
This partially resolves PR/30383.
Thanks to Sean Bruno for reporting the issue!
Reviewers: slthakur, seanbruno
Differential Revision: https://reviews.llvm.org/D29218
llvm-svn: 295965
The ARMConstantIslandPass didn't have support for handling accesses to
constant island objects through ARM::t2LDRBpci instructions. This adds
support for that.
This fixes PR31997.
llvm-svn: 295964
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295956
result
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295949
AVX versions of the converts work on f32/f64 types, while AVX512 version work on vectors.
Differential Revision: https://reviews.llvm.org/D29988
llvm-svn: 295940
Implement isLegalToVectorizeLoadChain for AMDGPU to avoid
producing private address spaces accesses that will need to be
split up later. This was doing the wrong thing in the case
where the queried chain was an even number of elements.
A possible <4 x i32> store was being split into
store <2 x i32>
store i32
store i32
rather than
store <2 x i32>
store <2 x i32>
when legal.
llvm-svn: 295933
There were some older intrinsics that only existed for less than a month in 2012 that still exist in some out of tree test files that start with this string, but aren't able to be handled by the current upgrade code and fire an assert. Now we'll go back to treating them as not intrinsics at all and just passing them through to output.
Fixes PR32041, sort of.
llvm-svn: 295930
The manual is unclear on the details of this. It's not
clear to me if denormals are not allowed with clamp,
or if that is only omod. Not allowing denorms for
fp16 or fp64 isn't useful so I also question if that
is really a restriction. Same with whether this is valid
without IEEE mode enabled.
llvm-svn: 295905
Notably, no regression tests change when we remove these calls, and these are expensive calls.
The motivation comes from the general acknowledgement that the compiler is getting slower:
http://lists.llvm.org/pipermail/llvm-dev/2017-January/109188.htmlhttp://lists.llvm.org/pipermail/llvm-dev/2016-December/108279.html
And specifically the test case attached to PR32037:
https://bugs.llvm.org//show_bug.cgi?id=32037
Profiling the middle-end (opt) part of the compile:
$ ./opt -O2 row_common.bc -o /dev/null
...visitAdd and visitSub are near the top of the instcombine list, and the calls to SimplifyDemandedInstructionBits()
are high within each of those. Those calls account for 1%+ of the opt time in either debug or release profiles. And
that's the rough win I see from this patch when testing opt built release from r295864 on an iMac with Haswell 4GHz
(model 4790K).
It seems unlikely that we'd be able to eliminate add/sub or change their operands given that add/sub normally affect
all bits, and the PR32037 example shows no IR difference after this change using -O2.
Also worth noting - the code comment in visitAdd:
// This handles stuff like (X & 254)+1 -> (X&254)|1
...isn't true. That transform is handled later with a call to haveNoCommonBitsSet().
Differential Revision: https://reviews.llvm.org/D30270
llvm-svn: 295898
This should avoid reporting any stack needs to be allocated in the
case where no stack is truly used. An unused stack slot is still
left around in other cases where there are real stack objects
but no spilling occurs.
llvm-svn: 295891
Summary:
Depends on D29606 and D29682
Makes us pass GVN's edge.ll (we also will pass a few other testcases
they just need cleaning up).
Thoughts on the Predicate* hiearchy of classes especially welcome :)
(it's not clear to me how best to organize it, and currently, the getBlock* seems ... uglier than maybe wasting a field somewhere or something).
Reviewers: davide
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29747
llvm-svn: 295889
Add updater to passes that now need it.
Move around code in MemorySSA to expose needed functions.
Summary: Mostly cleanup
Reviewers: george.burgess.iv
Subscribers: llvm-commits, Prazek
Differential Revision: https://reviews.llvm.org/D30221
llvm-svn: 295887
After rL294814, LSR formula can have multiple SCEVAddRecExprs inside of its BaseRegs.
Previous canonicalization will swap the first SCEVAddRecExpr in BaseRegs with ScaledReg.
But now we want to swap the SCEVAddRecExpr Reg related with current loop with ScaledReg.
Otherwise, we may generate code like this: RegA + lsr.iv + RegB, where loop invariant
parts RegA and RegB are not grouped together and cannot be promoted outside of loop.
With this patch, it will ensure lsr.iv to be generated later in the expr:
RegA + RegB + lsr.iv, so that RegA + RegB can be promoted outside of loop.
Differential Revision: https://reviews.llvm.org/D26781
llvm-svn: 295884
This allows us to ensure that 0 is never a valid pointer
to a user object, and ensures that the offset is always legal
without needing a register to access it. This comes at the cost
of usable offsets and wasted stack space.
llvm-svn: 295877
Summary:
If the same value is used several times as an extra value, SLP
vectorizer takes it into account only once instead of actual number of
using.
For example:
```
int val = 1;
for (int y = 0; y < 8; y++) {
for (int x = 0; x < 8; x++) {
val = val + input[y * 8 + x] + 3;
}
}
```
We have 2 extra rguments: `1` - initial value of horizontal reduction
and `3`, which is added 8*8 times to the reduction. Before the patch we
added `1` to the reduction value and added once `3`, though it must be
added 64 times.
Reviewers: mkuper, mzolotukhin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30262
llvm-svn: 295868
Summary:
Extend AArch64RedundantCopyElimination to catch cases where the register
that is known to be zero is COPY'd in the predecessor block. Before
this change, this pass would catch cases like:
CBZW %W0, <BB#1>
BB#1:
%W0 = COPY %WZR // removed
After this change, cases like the one below are also caught:
%W0 = COPY %W1
CBZW %W1, <BB#1>
BB#1:
%W0 = COPY %WZR // removed
This change results in a 4% increase in static copies removed by this
pass when compiling the llvm test-suite. It also fixes regressions
caused by doing post-RA copy propagation (a separate change to be put up
for review shortly).
Reviewers: junbuml, mcrosier, t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, rengolin, llvm-commits
Differential Revision: https://reviews.llvm.org/D30113
llvm-svn: 295863
Prevent memory objects of different address spaces to be part of
the same load/store groups when analysing interleaved accesses.
This is fixing pr31900.
Reviewers: HaoLiu, mssimpso, mkuper
Reviewed By: mssimpso, mkuper
Subscribers: llvm-commits, efriedma, mzolotukhin
Differential Revision: https://reviews.llvm.org/D29717
This reverts r295042 (re-applies r295038) with an additional fix for the
buildbot problem.
llvm-svn: 295858
LLVM CodeGen emits references to external symbols that are never declared in
LLVM IR level, so they have no declared signature. However, WebAssembly requires
all functions be declared with signatures. This patch adds a table for providing
signatures for known runtime libcalls that will be used in subsequent patches to
emit declarations for such functions.
llvm-svn: 295857
The function for distinguishing local and remote files added in r295768
unconditionally uses linux/magic.h header to provide necessary
filesystem magic numbers. However, in kernel headers predating 2.6.18
the magic numbers are spread throughout multiple include files.
Furthermore, LLVM did not require kernel headers being installed so far.
To increase the portability across different versions of Linux kernel
and different Linux systems, add CMake header checks for linux/magic.h
and -- if it is missing -- the linux/nfs_fs.h and linux/smb.h headers
which contained the numbers previously.
Furthermore, since the numbers are static and the feature does not seem
critical enough to make LLVM require kernel headers at all, add fallback
constants for the case when none of the necessary headers is available.
Differential Revision: https://reviews.llvm.org/D30261
llvm-svn: 295854
Summary: The CallTargetProfile should be added to FProfile to be consistent with other profile readers.
Reviewers: dnovillo, davidxl
Reviewed By: davidxl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30233
llvm-svn: 295852
Minor optimization, don't create temporary mask APInts that are just going to be OR'd into the accumulate masks - insert directly instead.
llvm-svn: 295848
The pass tries to fix a spill of LR that turns out to be unnecessary.
So it removes the tPOP but forgets to remove tPUSH.
This causes the stack be misaligned upon returning the function.
Thus, remove the tPUSH as well in this case.
Differential Revision: https://reviews.llvm.org/D30207
llvm-svn: 295816
This needed a const_cast for the dominator tree recalculation in
OptimizationRemarkEmitter, but we do that all over the place already
and it's safe.
llvm-svn: 295812
This patch adds missing sched classes for Thumb2 instructions.
This has been missing so far, and as a consequence, machine
scheduler models for individual sub-targets have tended to
be larger than they needed to be. These patches should help
write schedulers better and faster in the future
for ARM sub-targets.
Reviewer: Diana Picus
Differential Revision: https://reviews.llvm.org/D29953
llvm-svn: 295811
This patch introduces new X86ISD::FMAXS and X86ISD::FMINS opcodes. The legacy intrinsics now lower to this node. As do the AVX-512 masked intrinsics when the rounding mode is CUR_DIRECTION.
I've merged a copy of the tablegen multiclass avx512_fp_scalar into avx512_fp_scalar_sae. avx512_fp_scalar still needs to support CUR_DIRECTION appearing as a rounding mode for X86ISD::FADD_ROUND and others.
Differential revision: https://reviews.llvm.org/D30186
llvm-svn: 295810
Summary:
Motivation: fix PR31181 without regression (the actual fix is still in
progress). However, the actual content of PR31181 is not relevant
here.
This change makes poison propagation more aggressive in the following
cases:
1. poision * Val == poison, for any Val. In particular, this changes
existing intentional and documented behavior in these two cases:
a. Val is 0
b. Val is 2^k * N
2. poison << Val == poison, for any Val
3. getelementptr is poison if any input is poison
I think all of these are justified (and are axiomatically true in the
new poison / undef model):
1a: we need poison * 0 to be poison to allow transforms like these:
A * (B + C) ==> A * B + A * C
If poison * 0 were 0 then the above transform could not be allowed
since e.g. we could have A = poison, B = 1, C = -1, making the LHS
poison * (1 + -1) = poison * 0 = 0
and the RHS
poison * 1 + poison * -1 = poison + poison = poison
1b: we need e.g. poison * 4 to be poison since we want to allow
A * 4 ==> A + A + A + A
If poison * 4 were a value with all of their bits poison except the
last four; then we'd not be able to do this transform since then if A
were poison the LHS would only be "partially" poison while the RHS
would be "full" poison.
2: Same reasoning as (1b), we'd like have the following kinds
transforms be legal:
A << 1 ==> A + A
Reviewers: majnemer, efriedma
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D30185
llvm-svn: 295809
This just adds the basic skeleton for supporting a new object file format.
All of the actual encoding will be implemented in followup patches.
Differential Revision: https://reviews.llvm.org/D26722
llvm-svn: 295803
This enables peeling of loops with low dynamic iteration count by default,
when profile information is available.
Differential Revision: https://reviews.llvm.org/D27734
llvm-svn: 295796
Change implementation to use max instead of add.
min/max/med3 do not flush denormals regardless of the mode,
so it is OK to use it whether or not they are enabled.
Also allow using clamp with f16, and use knowledge
of dx10_clamp.
llvm-svn: 295788
Original code only used vector loads/stores for explicit vector arguments.
It could also do more loads/stores than necessary (e.g v5f32 would
touch 8 f32 values). Aggregate types were loaded one element at a time,
even the vectors contained within.
This change attempts to generalize (and simplify) parameter space
loads/stores so that vector loads/stores can be used more broadly.
Functionality of the patch has been verified by compiling thrust
test suite and manually checking the differences between PTX
generated by llvm with and without the patch.
General algorithm:
* ComputePTXValueVTs() flattens input/output argument into a flat list
of scalars to load/store and returns their types and offsets.
* VectorizePTXValueVTs() uses that data to create vectorization plan
which returns an array of flags marking boundaries of vectorized
load/stores. Scalars are represented as 1-element vectors.
* Code that generates loads/stores implements a simple state machine
that constructs a vector according to the plan.
Differential Revision: https://reviews.llvm.org/D30011
llvm-svn: 295784
Since I'm only seeing failures on OSX, and it's saying
permission denied, I'm suspecting this is due to the addition
of the MAP_RESILIENT_CODESIGN and/or MAP_RESILIENT_MEDIA flags.
Speculatively trying to remove those to get the bots working.
llvm-svn: 295770
For whatever reason ld64 requires that member headers (not the member
themselves) should be aligned. The only way to do that is to edit the
previous member so that it ends at an aligned boundary.
Since modifying data put in an archive is an undesirable property,
llvm-ar should only do it when it is absolutely necessary.
llvm-svn: 295765
This is part of trying to clean up our handling of min/max patterns in IR.
By converting these to canonical form, we're more likely to recognize them
because there are various places in InstCombine that don't use
matchSelectPattern or m_SMax and friends.
The backend fixups referenced in the now deleted TODO comment were added with:
https://reviews.llvm.org/rL291392https://reviews.llvm.org/rL289738
If there's any codegen fallout from this change, we should be able to address
it in DAGCombiner or target-specific lowering.
llvm-svn: 295758
There are still over 3400 files remaining with this property set, but there are tens of thousands more with the property not set. Until we decide what to do on a global scale, this at least unblocks me temporarily.
llvm-svn: 295756
Before frame offsets are calculated, try to eliminate the
frame indexes used by SGPR spills. Then we can delete them
after.
I think for now we can be sure that no other instruction
will be re-using the same frame indexes. It should be easy
to notice if this assumption ever breaks since everything
asserts if it tries to use a dead frame index later.
The unused emergency stack slot seems to still be left behind,
so an additional 4 bytes is still wasted.
llvm-svn: 295753
Conflicting debug info for function arguments causes hard-to-debug
assertions in the DWARF backend, so the Verifier should reject it.
For performance reasons this only checks function arguments from
non-inlined debug intrinsics for now.
rdar://problem/30520286
llvm-svn: 295749
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used. One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, mcrosier, sebpop, llvm-commits
Differential Revision: https://reviews.llvm.org/D28813
llvm-svn: 295746
PC isn't allowed in the source operand of t2MOVr, so change the register class
to one without PC. SP handling is slightly trickier and changes depending on if
we're in ARMv8, so do that in checkTargetMatchPredicate.
Differential Revision: https://reviews.llvm.org/D30199
llvm-svn: 295732
This matches what is already done during shuffle lowering and helps prevent the need for a zero-vector in cases where shuffles match both patterns.
llvm-svn: 295723
Summary:
This is a fix for assertion failure in
`getInverseMinMaxSelectPattern` when ABS is passed in as a select pattern.
We should not be invoking the simplification rule for
ABS(MIN(~ x,y))) or ABS(MAX(~x,y)) combinations.
Added a test case which would cause an assertion failure without the patch.
Reviewers: sanjoy, majnemer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30051
llvm-svn: 295719
The new method introduced under "-lsr-exp-narrow" option (currenlty set to true).
Summary:
The method is based on registers number mathematical expectation and should be
generally closer to optimal solution.
Please see details in comments to
"LSRInstance::NarrowSearchSpaceByDeletingCostlyFormulas()" function
(in lib/Transforms/Scalar/LoopStrengthReduce.cpp).
Reviewers: qcolombet
Differential Revision: http://reviews.llvm.org/D29862
From: Evgeny Stupachenko <evstupac@gmail.com>
llvm-svn: 295704
Summary:
Sandy Bridge and later CPUs have better throughput using a SHLD to implement rotate versus the normal rotate instructions. Additionally it saves one uop and avoids a partial flag update dependency.
This patch implements this change on any Sandy Bridge or later processor without BMI2 instructions. With BMI2 we will use RORX as we currently do.
Reviewers: zvi
Reviewed By: zvi
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30181
llvm-svn: 295697
- Fix doxygen comments (do not repeat documented name, remove definition
comment if there is already one at the declaration, add \p, ...)
- Add some const modifiers
- Use range based for
llvm-svn: 295688
- Fix doxygen comments
- Remove duplicated comments
- Remove section comments (which became wrong over time)
- Use more `const` and references but less `auto`
llvm-svn: 295687
Summary:
Currently, BranchFolder drops DebugLoc for branch instructions in some places. For example, for the test code attached, the branch instruction of 'entry' block has a DILocation of
```
!12 = !DILocation(line: 6, column: 3, scope: !11)
```
, but this information is gone when then block is lowered because BranchFolder misses it. This patch is a fix for this issue.
Reviewers: qcolombet, aprantl, craig.topper, MatzeB
Reviewed By: aprantl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29902
llvm-svn: 295684
Summary:
This lets one add aliasing stores to the updater.
(i'm next going to move the creation/etc functions to the updater)
Reviewers: george.burgess.iv
Subscribers: llvm-commits, Prazek
Differential Revision: https://reviews.llvm.org/D30154
llvm-svn: 295677
Pull out repeated code for extraction index operand and source vector value type.
Use isNullConstant helper to check for zero extraction index.
llvm-svn: 295670
There used to be a check in the IRTranslator that prevented us from having to
deal with atomic loads/stores. That check has been removed in r294993 and the
AArch64 backend was updated accordingly. This commit does the same thing for the
ARM backend.
In general, in the ARM backend we introduce fences during the atomic expand
pass, so we don't have to worry about atomics, *except* for the 32-bit ARMv8
target, which handles atomics more like AArch64. Since we don't want to worry
about that yet, just bail out of instruction selection if we find any atomic
loads.
llvm-svn: 295662
Its more profitable to go through memory (1 cycles throughput)
than using VMOVD + VPERMV/PSHUFB sequence ( 2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with variable index.
IACA tool was used to get performace estimation (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
For example for var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
Removing the VINSERT node, we don't need it any more.
Differential Revision: https://reviews.llvm.org/D29690
llvm-svn: 295660