A bit more painful than G_INSERT because it was more widely used, but this
should simplify the handling of extract operations in most locations.
llvm-svn: 297100
This is known incomplete and not called in the right order relative to
other folds, but that's the current behavior. I'm just trying to clean
this up before making actual functional changes to make the patch smaller.
The logic here should mimic the IR equivalents that are in InstSimplify's
simplifyDivRem().
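As a rough illustration, the DAG-level folds would take a shape like the
following (a hypothetical sketch with illustrative names, not the
committed code):
```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Sketch only: mirror InstSimplify's simplifyDivRem() at the DAG level.
static SDValue simplifyDivRemSketch(unsigned Opcode, EVT VT, SDValue N0,
                                    SDValue N1, SelectionDAG &DAG) {
  // X / undef, X % undef, X / 0, X % 0 --> undef
  if (N1.isUndef() || isNullConstant(N1))
    return DAG.getUNDEF(VT);
  // X % 1 --> 0, X / 1 --> X
  if (isOneConstant(N1))
    return (Opcode == ISD::UREM || Opcode == ISD::SREM)
               ? DAG.getConstant(0, SDLoc(N0), VT)
               : N0;
  return SDValue(); // no simplification
}
```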
llvm-svn: 297086
Some late additions to DWARF v5 were not in Dwarf.def; also one form
was redefined. Add the new cases to relevant switches in different
parts of LLVM. Replace DW_FORM_ref_sup with DW_FORM_ref_sup[4,8].
I did not add support for DW_FORM_strx3/addrx3 other than defining the
constants; we don't have any infrastructure to support these.
Differential Revision: http://reviews.llvm.org/D30664
llvm-svn: 297085
Fixed the asan bot failure which led to the last commit of the outliner being reverted.
The change is in lib/CodeGen/MachineOutliner.cpp in the SuffixTree's constructor. LeafVector
is no longer initialized using reserve(), but with the standard constructor.
llvm-svn: 297081
If a block has non-analyzable branches, the probabilities of its listed
successors don't need to add up to one. For example, if a block has a conditional tail call,
that tail call will not have a corresponding successor in the successor
list, but will still be a possible branch.
Differential Revision: https://reviews.llvm.org/D30556
llvm-svn: 297054
Before, we were producing G_INSERT instructions that were actually closer to a
cast or even a COPY when both input and output sizes are the same. This doesn't
really make sense and means that everything interpreting a G_INSERT also has to
handle all these kinds of casts.
So now we detect these degenerate cases and emit real casts instead.
llvm-svn: 297051
Now that G_INSERT instructions can only insert one register, this code was
overly general. In another direction it didn't handle registers that crossed
split boundaries properly, which needed to be fixed.
llvm-svn: 297042
Refactoring of duplicated code and more fixes to follow.
This is motivated by the post-commit comments for r296699:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170306/435182.html
Ie, we can crash if we're missing obvious simplifications like this that
exist in the IR simplifier or if these occur later than expected.
The x86 change for non-splat division shows a potential opportunity to improve
vector codegen: we assumed that since only one lane had meaningful results, we
should do the math in scalar. But that means moving back and forth from vector
registers.
llvm-svn: 297026
Summary:
Functions with the "xray-log-args" attribute will have a special XRay sled kind
emitted, for compiler-rt to copy any call arguments to your logging handler.
For practical and performance reasons, only the first argument is supported, and
only up to 64 bits.
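For illustration, opting in from C++ might look like this (a sketch; the
attribute spelling lives on the clang side, and the function name is made
up):
```
#include <cstdint>

// The sled emitted for this function lets compiler-rt hand the first
// argument (up to 64 bits) to the installed argument-logging handler.
[[clang::xray_always_instrument, clang::xray_log_args(1)]]
void handle_request(uint64_t request_id) {
  // ... instrumented body ...
}
```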
Reviewers: dberris
Reviewed By: dberris
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29702
llvm-svn: 296998
As described on PR31712, we miss a variety of legalization combines because we lower these to X86ISD::VSEXT/VZEXT despite them having the same functionality. This patch makes 128-bit (SSE41) SIGN/ZERO_EXTEND_VECTOR_IN_REG ops legal, adds the necessary tablegen plumbing and uses a helper 'getExtendInVec' to decide when to use SIGN/ZERO_EXTEND_VECTOR_IN_REG or VSEXT/VZEXT.
We're missing a couple of shuffle combines that will be added in a future patch for review.
Later patches can then support the AVX2 cases as a mixture of SIGN/ZERO_EXTEND and SIGN/ZERO_EXTEND_VECTOR_IN_REG, and then finally deal with the AVX512 cases.
Differential Revision: https://reviews.llvm.org/D30549
llvm-svn: 296985
This is more efficient by itself. But this is prep for a future patch that may remove APInt::Or while making operator| support rvalue references similar to add/sub.
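A minimal sketch of the rvalue-reference pattern alluded to, mirroring how
add/sub are written (illustrative only, not the final signatures):
```
#include "llvm/ADT/APInt.h"
#include <utility>
using namespace llvm;

// Take the LHS by value so a temporary's allocation is reused.
inline APInt orSketch(APInt a, const APInt &b) {
  a |= b;             // mutate the copy in place
  return a;           // moved out; no extra allocation
}
// OR is commutative, so a temporary RHS's storage can be reused too.
inline APInt orSketch(const APInt &a, APInt &&b) {
  b |= a;
  return std::move(b);
}
```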
llvm-svn: 296981
select Cond, C +/- 1, C --> add(ext Cond), C -- with a target hook.
This is part of the ongoing process to obsolete D24480. The motivation is to
canonicalize to select IR in InstCombine whenever possible, so we need to have a way to
undo that easily in codegen.
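Concretely, the fold has roughly this shape (a hypothetical DAG-level
sketch; in the patch it is gated by the new target hook):
```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// select Cond, C+1, C --> add (zext Cond), C   (sketch only)
static SDValue selectOfConstantsSketch(SDValue Cond, const APInt &TV,
                                       const APInt &FV, EVT VT,
                                       const SDLoc &DL, SelectionDAG &DAG) {
  if (TV - FV == 1) {
    SDValue Ext = DAG.getZExtOrTrunc(Cond, DL, VT);
    return DAG.getNode(ISD::ADD, DL, VT, Ext, DAG.getConstant(FV, DL, VT));
  }
  if (FV - TV == 1) { // select Cond, C, C+1 --> add (zext (not Cond)), C
    SDValue NotCond = DAG.getNOT(DL, Cond, Cond.getValueType());
    SDValue Ext = DAG.getZExtOrTrunc(NotCond, DL, VT);
    return DAG.getNode(ISD::ADD, DL, VT, Ext, DAG.getConstant(TV, DL, VT));
  }
  return SDValue();
}
```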
PowerPC is an obvious winner for this kind of transform because it has fast and complete
bit-twiddling abilities but generally lousy conditional execution perf (although this might
have changed in recent implementations).
x86 also sees some wins, but the effect is limited because these transforms already mostly
exist in its target-specific combineSelectOfTwoConstants(). The fact that we see any x86
changes just shows that that code is a mess of special-case holes. We may be able to remove
some of that logic now.
My guess is that other targets will want to enable this hook for most cases. The likely
follow-ups would be to add value type and/or the constants themselves as parameters for the
hook. As the tests in select_const.ll show, we can transform any select-of-constants to
math/logic, but the general transform for any 2 constants needs one more instruction
(multiply or 'and').
ARM is one target that I think may not want this for most cases. I see infinite loops there
because it wants to use selects to enable conditionally executed instructions.
Differential Revision: https://reviews.llvm.org/D30537
llvm-svn: 296977
Summary:
When replacing a SDValue, we should remove the replaced value from
SoftenedFloats (and possibly the other maps as well?).
When we revisit a Node because it needs analyzing again, we have to
remove all result values from SoftenedFloats (and possibly other maps?).
This fixes the fp128 test failures with expensive checks for X86.
I think we probably should also remove the values from the other maps
(PromotedIntegers and so on), let me know what you think.
Reviewers: baldrick, bogner, davidxl, ab, arsenm, pirama, chh, RKSimon
Reviewed By: chh
Subscribers: danalbert, wdng, srhines, hfinkel, sepavloff, llvm-commits
Differential Revision: https://reviews.llvm.org/D29265
llvm-svn: 296964
We can now end up in situations where we initiate LiveIntervalUnion
queries with different SubRanges against the same register unit, so the
assert() no longer holds in all cases. Just recalculate now when we know
the cache is out of date.
llvm-svn: 296928
These are simplified variants of the current G_SEQUENCE and G_EXTRACT, which
assume the individual parts will be contiguous, homogeneous, and occupy the
entirety of the larger register. This makes reasoning about them much easier
since you only have to look at the first register being merged and the result
to know what the instruction is doing.
I intend to gradually replace all uses of the more complicated sequence/extract
with these (or single-element insert/extracts), and then remove the older
variants. For now we start with legalization.
llvm-svn: 296921
This patch causes compile times for some patterns to explode. I have
a (large, unreduced) test case that slows down by more than 20x and
several test cases slow down by 2x. I'm sending some of the test cases
directly to Nirav and following up with more details in the review log,
but this should unblock anyone else hitting this.
llvm-svn: 296862
A call should never modify the stack pointer, but some backends are
not so sure about this and never list SP in the regmask. For the
purposes of LiveDebugValues we assume a call never clobbers SP. We
already have a similar workaround in DbgValueHistoryCalculator (which
we hopefully can retire soon).
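The workaround is essentially (sketch; helper name invented):
```
#include "llvm/CodeGen/MachineInstr.h"
using namespace llvm;

// Sketch: when scanning a call's regmask in LiveDebugValues, pretend the
// stack pointer survives even if the mask claims it is clobbered.
static bool isClobberedByCallSketch(const MachineOperand &RegMaskOp,
                                    unsigned Reg, unsigned SP) {
  if (Reg == SP)
    return false; // assume a call never modifies SP
  return RegMaskOp.isRegMask() && RegMaskOp.clobbersPhysReg(Reg);
}
```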
This fixes the availability of local ASANified variables on AArch64.
rdar://problem/27757381
llvm-svn: 296847
For chains of triangles with small join blocks that can be tail duplicated, a
simple calculation of probabilities is insufficient. Tail duplication
can be profitable in 3 different ways for these cases:
1) The post-dominators marked 50% are actually taken 56% (This shrinks with
longer chains)
2) The chains are statically correlated. Branch probabilities have a very
U-shaped distribution.
[http://nrs.harvard.edu/urn-3:HUL.InstRepos:24015805]
If the branches in a chain are likely to be from the same side of the
distribution as their predecessor, but are independent at runtime, this
transformation is profitable. (Because the cost of being wrong is a small
fixed cost, unlike the standard triangle layout where the cost of being
wrong scales with the # of triangles.)
3) The chains are dynamically correlated. If the probability that a previous
branch was taken positively influences whether the next branch will be
taken, the transformation is likewise profitable.
We believe that 2 and 3 are common enough to justify the small margin in 1.
The code pre-scans a function's CFG to identify this pattern and marks the edges
so that the standard layout algorithm can use the computed results.
llvm-svn: 296845
Summary:
Currently, when 't1: i1 = setcc t2, t3, cc' followed by 't4: i1 = xor t1, Constant:i1<-1>' is folded into 't5: i1 = setcc t2, t3 !cc', the SDLoc of the newly created SDValue 't5' follows the SDLoc of 't4', not 't1'. However, as the opcode of the newly created SDValue is 'setcc', it makes more sense to take the DebugLoc from 't1' than 't4'. For the code below
```
extern int bar();
extern int baz();
int foo(int x, int y) {
if (x != y)
return bar();
else
return baz();
}
```
, the following is the IR representation of 'foo' at the end of LLVM-IR-level optimization:
```
define i32 @foo(i32 %x, i32 %y) !dbg !4 {
entry:
tail call void @llvm.dbg.value(metadata i32 %x, i64 0, metadata !9, metadata !11), !dbg !12
tail call void @llvm.dbg.value(metadata i32 %y, i64 0, metadata !10, metadata !11), !dbg !13
%cmp = icmp ne i32 %x, %y, !dbg !14
br i1 %cmp, label %if.then, label %if.else, !dbg !16
if.then: ; preds = %entry
%call = tail call i32 (...) @bar() #3, !dbg !17
br label %return, !dbg !18
if.else: ; preds = %entry
%call1 = tail call i32 (...) @baz() #3, !dbg !19
br label %return, !dbg !20
return: ; preds = %if.else, %if.then
%retval.0 = phi i32 [ %call, %if.then ], [ %call1, %if.else ]
ret i32 %retval.0, !dbg !21
}
!14 = !DILocation(line: 5, column: 9, scope: !15)
!16 = !DILocation(line: 5, column: 7, scope: !4)
```
As you can see, in the 'entry' block, the 'icmp' instruction and the 'br' instruction have different debug locations. However, with the current implementation, there's no distinction between the debug locations of these two when they are lowered to asm instructions. This is because 'icmp' and 'br' become 'setcc', 'xor' and 'brcond' in SelectionDAG, where the SDLoc of 'setcc' follows the debug location of 'icmp' but the SDLoc of 'xor' and 'brcond' follows the debug location of the 'br' instruction, and the SDLoc of 'xor' overwrites the SDLoc of 'setcc' when they are folded. This patch addresses this issue.
Reviewers: atrick, bogner, andreadb, craig.topper, aprantl
Reviewed By: andreadb
Subscribers: jlebar, mkuper, jholewinski, andreadb, llvm-commits
Differential Revision: https://reviews.llvm.org/D29813
llvm-svn: 296825
After several smaller patches to get most of the core improvements
finished up, this patch is a straight move and header fixup of
the source.
Differential Revision: https://reviews.llvm.org/D30266
llvm-svn: 296810
This bug was introduced with:
https://reviews.llvm.org/rL296699
There may be a way to loosen the restriction, but for now just bail out
on any opaque constant.
The tests show that opacity is target-specific. This goes back to cost
calculations in ConstantHoisting based on TTI->getIntImmCost().
llvm-svn: 296768
If the dominator tree is not calculated or is invalidated, set the
corresponding pointer in the pass state to nullptr. A null pointer
indicates that operations with the dominator tree are not allowed. In
particular, it allows skipping verification for such a pass state. The
dominator tree is not calculated if the machine dominator pass was
skipped, which occurs for entities with available_externally linkage.
The change fixes some test fails observed when expensive checks
are enabled.
Differential Revision: https://reviews.llvm.org/D29280
llvm-svn: 296742
Surprisingly, one of the three interference checks in LiveRegMatrix was
using the main live range instead of the appropriate subregister range
resulting in unnecessarily conservative results.
llvm-svn: 296722
Summary:
This can be used to optimize large multiplications after legalization.
Depends on D29565
Reviewers: mkuper, spatel, RKSimon, zvi, bkramer, aaboud, craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29587
llvm-svn: 296711
Until now, we've had to use -global-isel to enable GISel. But using
that on other targets that don't support it will result in an abort, as we
can't build a full pipeline.
Additionally, we want to experiment with enabling GISel by default for
some targets: we can't just enable GISel by default, even among those
targets that do have some support, because the level of support varies.
This first step adds an override for the target to explicitly define its
level of support. For AArch64, do that using
a new command-line option (I know..):
-aarch64-enable-global-isel-at-O=<N>
Where N is the opt-level below which GISel should be used.
Default that to -1, so that we still don't enable GISel anywhere.
We're not there yet!
While there, remove a couple LLVM_UNLIKELYs. Building the pipeline is
such a cold path that in practice that shouldn't matter at all.
llvm-svn: 296710
This is part of the ongoing attempt to improve select codegen for all targets and select
canonicalization in IR (see D24480 for more background). The transform is a subset of what
is done in InstCombine's FoldOpIntoSelect().
I first noticed a regression in the x86 avx512-insert-extract.ll tests with a patch that
hopes to convert more selects to basic math ops. This appears to be a general missing DAG
transform though, so I added tests for all standard binops in rL296621
(PowerPC was chosen semi-randomly; it has scripted FileCheck support, but so do ARM and x86).
The poor output for "sel_constants_shl_constant" is tracked with:
https://bugs.llvm.org/show_bug.cgi?id=32105
Differential Revision: https://reviews.llvm.org/D30502
llvm-svn: 296699
Take DW_FORM_implicit_const attribute value into account when profiling
DIEAbbrevData.
Currently if we have two similar types with implicit_const attributes and
different values we end up with only one abbrev in .debug_abbrev section.
For example consider two structures: S1 with implicit_const attribute ATTR
and value VAL1 and S2 with implicit_const ATTR and value VAL2.
The .debug_abbrev section will contain only 1 related record:
[N] DW_TAG_structure_type DW_CHILDREN_yes
DW_AT_ATTR DW_FORM_implicit_const VAL1
// ....
This is incorrect, as struct S2 (with VAL2) will use the abbrev record with VAL1.
With this patch we will have two different abbreviations here:
[N] DW_TAG_structure_type DW_CHILDREN_yes
DW_AT_ATTR DW_FORM_implicit_const VAL1
// ....
[M] DW_TAG_structure_type DW_CHILDREN_yes
DW_AT_ATTR DW_FORM_implicit_const VAL2
// ....
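The fix boils down to hashing the value as well; roughly (a sketch, member
names assumed):
```
// Sketch: include the implicit_const value in the FoldingSet profile so
// abbreviations that differ only in the value no longer collide.
void DIEAbbrevData::Profile(FoldingSetNodeID &ID) const {
  ID.AddInteger(unsigned(Attribute));
  ID.AddInteger(unsigned(Form));
  if (Form == dwarf::DW_FORM_implicit_const)
    ID.AddInteger(Value); // previously omitted, causing the collision above
}
```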
llvm-svn: 296691
- We only need the information from the base class, not the additional
details in the LiveInterval class.
- Spread more `const`
- Some code cleanup
llvm-svn: 296684
Summary:
Avoids tons of prologue boilerplate when arguments are passed in memory
and left in memory. This can happen in a debug build or in a release
build when an argument alloca is escaped. This will dramatically affect
the code size of x86 debug builds, because X86 fast isel doesn't handle
arguments passed in memory at all. It only handles the x86_64 case of up
to 6 basic register parameters.
This is implemented by analyzing the entry block before ISel to identify
copy elision candidates. A copy elision candidate is an argument that is
used to fully initialize an alloca before any other possibly escaping
uses of that alloca. If an argument is a copy elision candidate, we set
a flag on the InputArg. If the target generates loads from a fixed
stack object that matches the size and alignment requirements of the
alloca, the SelectionDAG builder will delete the stack object created
for the alloca and replace it with the fixed stack object. The load is
left behind to satisfy any remaining uses of the argument value. The
store is now dead and is therefore elided. The fixed stack object is
also marked as mutable, as it may now be modified by the user, and it
would be invalid to rematerialize the initial load from it.
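A minimal example of the pattern this targets (illustrative):
```
// At -O0, each argument typically gets an alloca initialized by a store of
// the incoming value and then only reloaded. When the argument already
// arrives in a fixed stack object of matching size and alignment, that
// slot can stand in for the alloca and the copy disappears.
int sum(int a, int b) {
  return a + b; // debug builds keep a/b in allocas; this elides the copies
}
```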
Supersedes D28388
Fixes PR26328
Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans
Subscribers: igorb, llvm-commits
Differential Revision: https://reviews.llvm.org/D29668
llvm-svn: 296683
This patch adds a MachineSSA pass that coalesces blocks that branch
on the same condition.
Committing on behalf of Lei Huang.
Differential Revision: https://reviews.llvm.org/D28249
llvm-svn: 296670
Add a check that deleted nodes do not get added to the worklist. This can
occur when a node's operand is simplified to an existing node.
This fixes PR32108.
Reviewers: jyknight, hfinkel, chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30506
llvm-svn: 296668
DWARF may define a default lower-bound for arrays in languages defined
in a particular DWARF version. But the logic to suppress an
unnecessary lower-bound attribute was looking at the hard-coded
default DWARF version, not the version that had been requested.
Also updated the list with all languages defined in DWARF v5.
Differential Revision: http://reviews.llvm.org/D30484
llvm-svn: 296652
Resubmit r295336 after the bug with non-zero offset patterns on BE targets is fixed (r296336).
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patterns.
Reviewed By: filcab
Differential Revision: https://reviews.llvm.org/D29591
llvm-svn: 296651
When SDAGISel (top-down) selects a tail-call, it skips the remainder
of the block.
If, before that, FastISel (bottom-up) selected some of the (no-op) next
few instructions, we can end up with dead instructions following the
terminator (selected by SDAGISel).
We need to erase them, as we know they aren't necessary (in addition to
being incorrect).
We already do this when FastISel falls back on the tail-call itself.
Also remove the FastISel-emitted code if we fall back on the
instructions between the tail-call and the return.
llvm-svn: 296552
Iterating on the use-list we're modifying doesn't work: after the first
iteration, the use-list iterator will point to a MachineOperand
referencing the new register. This caused us to skip the other uses to
replace.
Instead, use MRI.replaceRegWith(), which accounts for this behavior.
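In outline (a sketch of the broken and fixed shapes):
```
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

static void replaceAllUsesSketch(MachineRegisterInfo &MRI, unsigned OldReg,
                                 unsigned NewReg) {
  // Buggy version (sketch): the use-list is edited under our feet; after
  // the first setReg(), the iterator follows NewReg's use-list and the
  // remaining uses of OldReg are skipped.
  //   for (MachineOperand &MO : MRI.use_operands(OldReg))
  //     MO.setReg(NewReg);

  // Fixed: MachineRegisterInfo walks its own use-list safely.
  MRI.replaceRegWith(OldReg, NewReg);
}
```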
llvm-svn: 296551
Requesting DWARF v5 will now get you the new compile-unit and
type-unit headers. llvm-dwarfdump will also recognize them.
Differential Revision: http://reviews.llvm.org/D30206
llvm-svn: 296514
This recovers a test case that was severely broken by r296476, by making sure we don't create ADD/ADC that loads and stores when there is also a flag dependency.
llvm-svn: 296486
Stack Smash Protection is not completely free, so in hot code, the overhead it incurs can cause performance issues. By adding diagnostic information for which functions have SSP and why, a user can quickly determine what they can do to stop SSP being applied to a specific hot function.
This change adds a remark that is reported by the stack protection code when an instruction or attribute is encountered that causes SSP to be applied.
Patch by: James Henderson
Differential Revision: https://reviews.llvm.org/D29023
llvm-svn: 296483
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This gives a
cleaner separation of non-interfering loads/stores from the
store-merging logic.
When merging stores, we search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads but then promote them to float operations, which
require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyFromReg
nodes as these are captured by data dependence
6. Forward load-store values through TokenFactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296476
Summary:
This will allow future patches to inspect the details of the LLT. The implementation is now split between
the Support and CodeGen libraries to allow TableGen to use this class without introducing layering concerns.
Thanks to Ahmed Bougacha for finding a reasonable way to avoid the layering issue and providing the version of this patch without that problem.
Reviewers: t.p.northover, qcolombet, rovka, aditya_nandakumar, ab, javed.absar
Subscribers: arsenm, nhaehnle, mgorny, dberris, llvm-commits, kristof.beyls
Differential Revision: https://reviews.llvm.org/D30046
llvm-svn: 296474
Summary:
With this change the ImplicitNullCheck optimization uses alias analysis
and can use a load/store memory access for an implicit null check even if
there are other loads/stores before it, as long as the memory accesses do not alias.
Patch by Serguei Katkov!
Reviewers: sanjoy
Reviewed By: sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30331
llvm-svn: 296440
This is a patch for the outliner described in the RFC at:
http://lists.llvm.org/pipermail/llvm-dev/2016-August/104170.html
The outliner is a code-size reduction pass which works by finding
repeated sequences of instructions in a program, and replacing them with
calls to functions. This is useful to people working in low-memory
environments, where sacrificing performance for space is acceptable.
This adds an interprocedural outliner directly before printing assembly.
For reference on how this would work, this patch also includes X86
target hooks and an X86 test.
The outliner is run like so:
clang -mno-red-zone -mllvm -enable-machine-outliner file.c
Patch by Jessica Paquette <jpaquette@apple.com>!
rdar://29166825
Differential Revision: https://reviews.llvm.org/D26872
llvm-svn: 296418
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296416
Before the endianness was specified on each call to read
or write of the StreamReader / StreamWriter, but in practice
it's extremely rare for streams to have data encoded in
multiple different endiannesses, so we should optimize for the
99% use case.
This makes the code cleaner and more general, but is otherwise NFC.
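Usage after the change looks roughly like this (sketch; class names
approximate, as they were being renamed in this series):
```
#include "llvm/Support/BinaryByteStream.h"
#include "llvm/Support/BinaryStreamReader.h"
#include "llvm/Support/Error.h"
using namespace llvm;

// Endianness is a property of the stream, not of each read call.
static Error readWordSketch(ArrayRef<uint8_t> Data, uint32_t &Word) {
  BinaryByteStream Stream(Data, support::little);
  BinaryStreamReader Reader(Stream);
  return Reader.readInteger(Word); // no per-call endianness argument
}
```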
llvm-svn: 296415
This was reverted because it was breaking some builds, and
because of incorrect error code usage. Since the CL was
large and contained many different things, I'm resubmitting
it in pieces.
This portion is NFC, and consists of:
1) Renaming classes to follow a consistent naming convention.
2) Fixing the const-ness of the interface methods.
3) Adding detailed doxygen comments.
4) Fixing a few instances of passing `const BinaryStream& X`. These
are now passed as `BinaryStreamRef X`.
llvm-svn: 296394
DAGCombiner already supports peeking through shuffles to improve vector element extraction, but legalization often leaves us in situations where we need to extract vector elements after shuffles have already been lowered.
This patch adds support for VECTOR_EXTRACT_ELEMENT/PEXTRW/PEXTRB instructions to attempt to handle target shuffles as well. I've covered some basic scenarios including handling shuffle mask scaling and the implicit zero-extension of PEXTRW/PEXTRB; there is more that could be done here (mentioned in TODOs), but I haven't found many cases where it's worth it.
Differential Revision: https://reviews.llvm.org/D30176
llvm-svn: 296381
Summary: The existing implementation of the duplicateSimpleBB function drops the DebugLoc metadata of branch instructions during the transformation. This patch addresses this issue by making newly created branch instructions keep the metadata of the replaced branch instructions.
Reviewers: qcolombet, craig.topper, aprantl, MatzeB, sanjoy, dblaikie
Reviewed By: dblaikie
Subscribers: dblaikie, llvm-commits
Differential Revision: https://reviews.llvm.org/D30026
llvm-svn: 296371
This pattern is essentially an i16 load from the p+1 address:
%p1.i16 = bitcast i8* %p to i16*
%p2.i8 = getelementptr i8, i8* %p, i64 2
%v1 = load i16, i16* %p1.i16
%v2.i8 = load i8, i8* %p2.i8
%v2 = zext i8 %v2.i8 to i16
%v1.shl = shl i16 %v1, 8
%res = or i16 %v1.shl, %v2
The current implementation would identify the %v1 load as the first byte load and would mistakenly emit an i16 load from the %p1.i16 address. This patch adds a check that the first byte is loaded from a non-zero offset of the first load address. This way this address can be used as the base address for the combined value. Otherwise just give up combining.
llvm-svn: 296336
Summary:
While collecting operands we make copies of the LiveReg objects which are stored in the LiveRegs array. If the instruction uses the same register multiple times we end up with multiple copies. Later we iterate through the collected list of LiveReg objects and merge DomainValues. In the process of doing this the merge function can change the contents of the original LiveReg object in the LiveRegs array, but not the copies that have been made. So when we get to the second usage of the register we end up seeing a stale copy of the LiveReg object.
To fix this I've stopped copying and now just store a pointer to the original LiveReg object. Another option might be to avoid adding the same register to the Regs array twice, but this approach seemed simpler.
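A self-contained analogue of the bug and fix (illustrative types and
names, not the actual pass code):
```
#include <vector>

struct LiveReg { int DomainValue; };

// Sketch: collecting value copies hides later in-place merges; collecting
// pointers makes every use of a repeated register see the merged state.
void collectOperandsSketch(std::vector<LiveReg> &LiveRegs,
                           const std::vector<int> &UsedRegs) {
  std::vector<LiveReg *> Regs;        // was std::vector<LiveReg> (copies)
  for (int RX : UsedRegs)
    Regs.push_back(&LiveRegs[RX]);    // repeated RX now aliases one object
  // ... merging DomainValues through Regs[i] updates LiveRegs[RX], so a
  // second use of the same register no longer sees a stale copy ...
}
```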
The included test case exposes this bug due to an AVX-512 masked OR instruction using the same register for the passthru operand and one of the inputs to the OR operation.
Fixes PR30284.
Reviewers: RKSimon, stoklund, MatzeB, spatel, myatsina
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30242
llvm-svn: 296260
r296215, "[PDB] General improvements to Stream library."
r296217, "Disable BinaryStreamTest.StreamReaderObject temporarily."
r296220, "Re-enable BinaryStreamTest.StreamReaderObject."
r296244, "[PDB] Disable some tests that are breaking bots."
r296249, "Add static_cast to silence -Wc++11-narrowing."
std::errc::no_buffer_space should be used for OS-oriented errors for socket transmission.
(Seek discussions around llvm/xray.)
I could substitute s/no_buffer_space/others/g, but I am reverting them wholesale for now.
Could we define and use LLVM errors there?
llvm-svn: 296258
Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner.
* Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search and chain alias analysis which only
checks for parallel stores through the chain subgraph. This gives a
cleaner separation of non-interfering loads/stores from the
store-merging logic.
When merging stores, we search up the chain through a single load, and
find all possible stores by looking down through a load and a
TokenFactor to all stores visited.
This improves the quality of the output SelectionDAG and the output
Codegen (save perhaps for some ARM cases where we correctly construct
wider loads but then promote them to float operations, which
require more expensive constant generation).
Some minor peephole optimizations to deal with improved SubDAG shapes (listed below)
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the chain aggregation in the merged stores across code
paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seems sufficient to not cause regressions in
tests.
5. Remove Chain dependencies of Memory operations on CopyFromReg
nodes as these are captured by data dependence
6. Forward load-store values through TokenFactors containing
{CopyToReg,CopyFromReg} Values.
7. Peephole to convert buildvector of extract_vector_elt to
extract_subvector if possible (see
CodeGen/AArch64/store-merge.ll)
8. Store merging for the ARM target is restricted to 32-bit, as
in some contexts invalid 64-bit operations are being
generated. This can be removed once appropriate checks are
added.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable, improving load-store forwarding. One test in
particular is worth noting:
CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store
forwarding converts a load-store pair into a parallel store and
a memory-realized bitcast of the same value. However, because we
lose the sharing of the explicit and implicit store values we
must create another local store. A similar transformation
happens before SelectionDAG as well.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
llvm-svn: 296252
This adds various new functionality and cleanup surrounding the
use of the Stream library. Major changes include:
* Renaming of all classes for more consistency / meaningfulness
* Addition of some new methods for reading multiple values at once.
* Full suite of unit tests for reader / writer functionality.
* Full set of doxygen comments for all classes.
* Streams now store their own endianness.
* Fixed some bugs in a few of the classes that were discovered
by the unit tests.
llvm-svn: 296215
This is part of a larger effort to get the Stream code moved
up to Support. I don't want to do it in one large patch, in
part because the changes are so big that it will treat everything
as file deletions and additions, losing history in the process.
Aside from that though, it's just a good idea in general to
make small changes.
So this change only changes the names of the Stream related
source files, and applies necessary source fix ups.
llvm-svn: 296211
With the "wasm32-unknown-unknown-wasm" triple, this allows writing out
simple wasm object files, and is another step in a larger series toward
migrating from ELF to general wasm object support. Note that this code
and the binary format itself are still experimental.
llvm-svn: 296190
This reverts commit r296009. It broke one out of tree target and also
does not account for all partial lanes added or removed when calculating
PressureDiff.
llvm-svn: 296182
All G_CONSTANTS created by the MachineIRBuilder have an operand of type CImm
(i.e. a ConstantInt), so that's what the selector needs to look for.
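In selector terms that means checking isCImm() rather than isImm(); a
sketch:
```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetOpcodes.h"
#include "llvm/IR/Constants.h"
using namespace llvm;

// Sketch: extract the value of a MachineIRBuilder-created G_CONSTANT.
static const ConstantInt *getGConstantSketch(const MachineInstr &I) {
  if (I.getOpcode() == TargetOpcode::G_CONSTANT && I.getOperand(1).isCImm())
    return I.getOperand(1).getCImm(); // a ConstantInt, not a plain immediate
  return nullptr;
}
```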
llvm-svn: 296176
When we construct addressing modes, we use isNoopAddrSpaceCast to ignore
addrspacecast instructions. Make sure we insert the correct addrspacecast
when we reconstruct the addressing mode.
Differential Revision: https://reviews.llvm.org/D30114
llvm-svn: 296167
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296149
The motivation for filling out these select-of-constants cases goes back to D24480,
where we discussed removing an IR fold from add(zext) --> select. And that goes back to:
https://reviews.llvm.org/rL75531
https://reviews.llvm.org/rL159230
The idea is that we should always canonicalize patterns like this to a select-of-constants
in IR because that's the smallest IR and the best for value tracking. Note that we currently
do the opposite in some cases (like the cases in *this* patch). Ie, the proposed folds in
this patch already exist in InstCombine today:
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSelect.cpp#L1151
As this patch shows, most targets generate better machine code for simple ext/add/not ops
rather than a select of constants. So the follow-up steps to make this less of a patchwork
of special-case folds and missing IR canonicalization are:
1. Have DAGCombiner convert any select of constants into ext/add/not ops.
2. Have InstCombine canonicalize in the other direction (create more selects).
Differential Revision: https://reviews.llvm.org/D30180
llvm-svn: 296137
Summary:
This isn't testable for AArch64 by itself so this patch also adds
support for constant immediates in the pattern and physical
register uses in the result.
The new IntOperandMatcher matches the constant in patterns such as
'(set $rd:GPR32, (G_XOR $rs:GPR32, -1))'. It's always safe to fold
immediates into an instruction so this is the first rule that will match
across multiple BB's.
The Renderer hierarchy is responsible for adding operands to the result
instruction. Renderers can copy operands (CopyRenderer) or add physical
registers (in particular %wzr and %xzr) to the result instruction
in any order (OperandMatchers now import the operand names from
SelectionDAG to allow renderers to access any operand). This allows us to
emit the result instruction for:
%1 = G_XOR %0, -1 --> %1 = ORNWrr %wzr, %0
%1 = G_XOR -1, %0 --> %1 = ORNWrr %wzr, %0
although the latter is untested since the matcher/importer has not been
taught about commutativity yet.
Added BuildMIAction which can build new instructions and mutate them where
possible. W.r.t the mutation aspect, MatchActions are now told the name of
an instruction they can recycle and BuildMIAction will emit mutation code
when the renderers are appropriate. They are appropriate when all operands
are rendered using CopyRenderer and the indices are the same as the matcher.
This currently assumes that all operands have at least one matcher.
Finally, this change also fixes a crash in
AArch64InstructionSelector::select() caused by an immediate operand
passing isImm() rather than isCImm(). This was uncovered by the other
changes and was detected by existing tests.
Depends on D29711
Reviewers: t.p.northover, ab, qcolombet, rovka, aditya_nandakumar, javed.absar
Reviewed By: rovka
Subscribers: aemerson, dberris, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D29712
llvm-svn: 296131
Splitting critical edges when one of the source edges is an indirectbr
is hard in general (because it requires changing the memory the indirectbr
reads). But if a block only has a single indirectbr predecessor (which is
the common case), we can simulate splitting that edge by splitting
the destination block, and retargeting the *direct* branches.
This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame()
ends up using an indirect branch with ~100 successors, and passing a constant to
each of those. Since MachineSink can't break indirect critical edges on demand
(and doing this in MIR doesn't look feasible), this causes us to emit ~100
defs of registers containing constants in the predecessor block, where
only one of those constants is used in each successor. So, at each computed goto,
we needlessly spill about 100 constants to stack. The end result is that a
clang-compiled python interpreter can be about ~2.5x slower on a simple python
reduction loop than a gcc-compiled interpreter.
Differential Revision: https://reviews.llvm.org/D29916
llvm-svn: 296060
We were stopping the translation of the parent block when the
translation of an instruction failed, but we were still trying to
translate the other blocks of the parent function.
Don't do that.
llvm-svn: 296047
Rename ComputedTrellisEdges to ComputedEdges to allow for other methods of
pre-computing edges.
Differential Revision: https://reviews.llvm.org/D30308
llvm-svn: 296018
Having more fine-grained information on the specific construct that
caused us to fallback is valuable for large-scale data collection.
We still have the fallback warning, that's also used for FastISel.
We still need to remove the fallback warning, and teach FastISel to also
emit remarks (it currently has a combination of the warning, stats, and
debug prints: the remarks could unify all three).
The abort-on-fallback path could also be better handled using remarks:
one could imagine a "-Rpass-error", analogous to "-Werror", which would
promote missed/failed remarks to errors. It's not clear whether that
would be useful for other remarks though, so we're not there yet.
llvm-svn: 296013
If a subreg is used in an instruction it counts as a whole superreg
for the purpose of register pressure calculation. This patch corrects
improper register pressure calculation by examining the operand's lane mask.
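Schematically (a sketch; the helpers exist on TargetRegisterInfo and
MachineRegisterInfo):
```
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Target/TargetRegisterInfo.h"
using namespace llvm;

// Sketch: a subreg use contributes only its lanes to pressure, not the
// whole superregister.
static LaneBitmask usedLanesSketch(const MachineOperand &MO,
                                   const TargetRegisterInfo &TRI,
                                   const MachineRegisterInfo &MRI) {
  return MO.getSubReg() ? TRI.getSubRegIndexLaneMask(MO.getSubReg())
                        : MRI.getMaxLaneMaskForVReg(MO.getReg());
}
```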
Differential Revision: https://reviews.llvm.org/D29835
llvm-svn: 296009
Since LoopInfo is not available in machine passes as universally as in IR
passes, using the same approach for OptimizationRemarkEmitter as we did for IR
will run LoopInfo and DominatorTree unnecessarily. (LoopInfo is not used
lazily by ORE.)
To fix this, I am modifying the approach I took in D29836. LazyMachineBFI now
reuses the analyses, including MachineBFI itself, that its client passes
already have available, or otherwise computes them on the fly.
So for example GreedyRegAlloc, since it's already using MBFI, will reuse that
instance. On the other hand, AsmPrinter in Justin's patch will generate DT,
LI and finally BFI on the fly.
(I am of course wondering now if the simplicity of this approach is even
preferable in IR. I will do some experiments.)
Testing is provided by an updated version of D29837 which requires Justin's
patch to bring ORE to the AsmPrinter.
Differential Revision: https://reviews.llvm.org/D30128
llvm-svn: 295996
This just adds the basic skeleton for supporting a new object file format.
All of the actual encoding will be implemented in followup patches.
Differential Revision: https://reviews.llvm.org/D26722
llvm-svn: 295803
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used. One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, mcrosier, sebpop, llvm-commits
Differential Revision: https://reviews.llvm.org/D28813
llvm-svn: 295746
- Fix doxygen comments (do not repeat documented name, remove definition
comment if there is already one at the declaration, add \p, ...)
- Add some const modifiers
- Use range based for
llvm-svn: 295688
Summary:
Currently, BranchFolder drops DebugLoc for branch instructions in some places. For example, for the test code attached, the branch instruction of 'entry' block has a DILocation of
```
!12 = !DILocation(line: 6, column: 3, scope: !11)
```
, but this information is gone when the block is lowered because BranchFolder misses it. This patch is a fix for this issue.
Reviewers: qcolombet, aprantl, craig.topper, MatzeB
Reviewed By: aprantl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D29902
llvm-svn: 295684
- Adapt MachineBasicBlock::getName() to have the same behavior as the IR
BasicBlock (Value::getName()).
- Add it to lib/CodeGen/CodeGen.cpp::initializeCodeGen so that it is linked in
the CodeGen library.
- MachineRegionInfoPass's name conflicts with RegionInfoPass's name ("region").
- MachineRegionInfo should depend on MachineDominatorTree,
MachinePostDominatorTree and MachineDominanceFrontier instead of their
respective IR versions.
- Since there were no tests for this, add a X86 MIR test.
Patch by Francis Visoiu Mistrih <fvisoiumistrih@apple.com>
llvm-svn: 295518
This fixes PR31381, which caused an assertion and/or invalid debug info.
This affects debug variables that have multiple fragments in the
MMI-side (i.e., stack frame) table.
rdar://problem/30571676
llvm-svn: 295486
During legalization we are often creating shuffles (via a build_vector scalarization stage) that are "any_extend_vector_inreg" style masks, and also other masks that are the equivalent of "truncate_vector_inreg" (if we had such a thing).
This patch is an attempt to match these cases to help undo the effects of just leaving shuffle lowering to handle it - which typically means we lose track of the undefined elements of the shuffles resulting in an unnecessary extension+truncation stage for widened illegal types.
The 2011-10-21-widen-cmp.ll regression will be fixed by making SIGN_EXTEND_VECTOR_IN_REG legal in SSE instead of lowering them to X86ISD::VSEXT (PR31712).
Differential Revision: https://reviews.llvm.org/D29454
llvm-svn: 295451
Summary:
This is an issue both with regular and Thin LTO. When we link together
a DICompileUnit that is marked NoDebug (e.g. when compiling with -g0
but applying an AutoFDO profile, which requires location tracking
in the compiler) and a DICompileUnit with debug emission enabled,
we can have failures during dwarf debug generation. Specifically,
when we have inlined from the NoDebug compile unit into the debug
compile unit, we can fail during construction of the abstract and
inlined scope DIEs. This is because the SPMap does not include NoDebug
CUs (they are skipped in the debug_compile_units_iterator).
This patch fixes the failures by skipping locations from NoDebug CUs
when extracting lexical scopes.
Reviewers: dblaikie, aprantl
Subscribers: mehdi_amini, llvm-commits
Differential Revision: https://reviews.llvm.org/D29765
llvm-svn: 295384
Resubmit -r295314 with PowerPC and AMDGPU tests updated.
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patterns.
Reviewed By: filcab
Differential Revision: https://reviews.llvm.org/D29591
llvm-svn: 295336
Support {a|s}ext, {a|z|s}ext load nodes as a part of load combine patterns.
Reviewed By: filcab
Differential Revision: https://reviews.llvm.org/D29591
llvm-svn: 295314
For the hard float calling convention, we just use the D registers.
For the soft-fp calling convention, we use the R registers and move values
to/from the D registers by means of G_SEQUENCE/G_EXTRACT. While doing so, we
make sure to honor the endianness of the target, since the CCAssignFn doesn't do
that for us.
For pure soft float targets, we still bail out because we don't support the
libcalls yet.
llvm-svn: 295295
This reverts r294348, which removed support for conditional tail calls
due to the PR above. It fixes the PR by marking live registers as
implicitly used and defined by the now predicated tailcall. This is
similar to how IfConversion predicates instructions.
Differential Revision: https://reviews.llvm.org/D29856
llvm-svn: 295262
Uses a Custom implementation because the slot sizes being a multiple of the
pointer size isn't really universal, even for the architectures that do have a
simple "void *" va_list.
llvm-svn: 295255
Lay out trellis-shaped CFGs optimally.
A trellis of the shape below:
A B
|\ /|
| \ / |
| X |
| / \ |
|/ \|
C D
would be laid out A; B->C ; D by the current layout algorithm. Now we identify
trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an
increasing number of predecessors. A trellis is a group of 2 or more
predecessor blocks that all have the same successors.
Because of this, we can tail duplicate to extend existing trellises.
As an example consider the following CFG:
B D F H
/ \ / \ / \ / \
A---C---E---G---Ret
Where A,C,E,G are all small (currently 2 instructions).
The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret.
The current code will copy C into B, E into D and G into F and yield the layout
A,C,B(C),E,D(E),F(G),G,H,ret
define void @straight_test(i32 %tag) {
entry:
br label %test1
test1: ; A
%tagbit1 = and i32 %tag, 1
%tagbit1eq0 = icmp eq i32 %tagbit1, 0
br i1 %tagbit1eq0, label %test2, label %optional1
optional1: ; B
call void @a()
br label %test2
test2: ; C
%tagbit2 = and i32 %tag, 2
%tagbit2eq0 = icmp eq i32 %tagbit2, 0
br i1 %tagbit2eq0, label %test3, label %optional2
optional2: ; D
call void @b()
br label %test3
test3: ; E
%tagbit3 = and i32 %tag, 4
%tagbit3eq0 = icmp eq i32 %tagbit3, 0
br i1 %tagbit3eq0, label %test4, label %optional3
optional3: ; F
call void @c()
br label %test4
test4: ; G
%tagbit4 = and i32 %tag, 8
%tagbit4eq0 = icmp eq i32 %tagbit4, 0
br i1 %tagbit4eq0, label %exit, label %optional4
optional4: ; H
call void @d()
br label %exit
exit:
ret void
}
here is the layout after D27742:
straight_test: # @straight_test
; ... Prologue elided
; BB#0: # %entry ; A (merged with test1)
; ... More prologue elided
mr 30, 3
andi. 3, 30, 1
bc 12, 1, .LBB0_2
; BB#1: # %test2 ; C
rlwinm. 3, 30, 0, 30, 30
beq 0, .LBB0_3
b .LBB0_4
.LBB0_2: # %optional1 ; B (copy of C)
bl a
nop
rlwinm. 3, 30, 0, 30, 30
bne 0, .LBB0_4
.LBB0_3: # %test3 ; E
rlwinm. 3, 30, 0, 29, 29
beq 0, .LBB0_5
b .LBB0_6
.LBB0_4: # %optional2 ; D (copy of E)
bl b
nop
rlwinm. 3, 30, 0, 29, 29
bne 0, .LBB0_6
.LBB0_5: # %test4 ; G
rlwinm. 3, 30, 0, 28, 28
beq 0, .LBB0_8
b .LBB0_7
.LBB0_6: # %optional3 ; F (copy of G)
bl c
nop
rlwinm. 3, 30, 0, 28, 28
beq 0, .LBB0_8
.LBB0_7: # %optional4 ; H
bl d
nop
.LBB0_8: # %exit ; Ret
ld 30, 96(1) # 8-byte Folded Reload
addi 1, 1, 112
ld 0, 16(1)
mtlr 0
blr
The tail-duplication has produced some benefit, but it has also produced a
trellis which is not laid out optimally. With this patch, we improve the layouts
of such trellises, and decrease the cost calculation for tail-duplication
accordingly.
This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have
back edges, which is a negative, but it has a bigger compensating
positive, which is that it handles the case where there are long strings
of skipped blocks much better than the original layout. Both layouts
handle runs of executed blocks equally well. Branch prediction also
improves if there is any correlation between subsequent optional blocks.
Here is the resulting concrete layout:
straight_test: # @straight_test
; BB#0: # %entry ; A (merged with test1)
mr 30, 3
andi. 3, 30, 1
bc 12, 1, .LBB0_4
; BB#1: # %test2 ; C
rlwinm. 3, 30, 0, 30, 30
bne 0, .LBB0_5
.LBB0_2: # %test3 ; E
rlwinm. 3, 30, 0, 29, 29
bne 0, .LBB0_6
.LBB0_3: # %test4 ; G
rlwinm. 3, 30, 0, 28, 28
bne 0, .LBB0_7
b .LBB0_8
.LBB0_4: # %optional1 ; B (Copy of C)
bl a
nop
rlwinm. 3, 30, 0, 30, 30
beq 0, .LBB0_2
.LBB0_5: # %optional2 ; D (Copy of E)
bl b
nop
rlwinm. 3, 30, 0, 29, 29
beq 0, .LBB0_3
.LBB0_6: # %optional3 ; F (Copy of G)
bl c
nop
rlwinm. 3, 30, 0, 28, 28
beq 0, .LBB0_8
.LBB0_7: # %optional4 ; H
bl d
nop
.LBB0_8: # %exit
Differential Revision: https://reviews.llvm.org/D28522
llvm-svn: 295223