analysis passes, support pre-registering analyses, and use that to
implement parsing and pre-registering a custom alias analysis pipeline.
With this its possible to configure the particular alias analysis
pipeline used by the AAManager from the commandline of opt. I've updated
the test to show this effectively in use to build a pipeline including
basic-aa as part of it.
My big question for reviewers are around the APIs that are used to
expose this functionality. Are folks happy with pass-by-lambda to do
pass registration? Are folks happy with pre-registering analyses as
a way to inject customized instances of an analysis while still using
the registry for the general case?
Other thoughts of course welcome. The next round of patches will be to
add the rest of the alias analyses into the new pass manager and wire
them up here so that they can be used from opt. This will require
extending the (somewhate limited) functionality of AAManager w.r.t.
module passes.
Differential Revision: http://reviews.llvm.org/D17259
llvm-svn: 261197
While we still do want reducible control flow, the RequiresStructuredCFG
flag imposes more strict structure constraints than WebAssembly wants.
Unsetting this flag enables critical edge splitting and tail merging.
Also, disable TailDuplication explicitly, as it doesn't support virtual
registers, and was previously only disabled by the RequiresStructuredCFG
flag.
llvm-svn: 261190
Changes:
- Added disassembler project
- Fixed all decoding conflicts in .td files
- Added DecoderMethod=“NONE” option to Target.td that allows to
disable decoder generation for an instruction.
- Created decoding functions for VS_32 and VReg_32 register classes.
- Added stubs for decoding all register classes.
- Added several tests for disassembler
Disassembler only supports:
- VI subtarget
- VOP1 instruction encoding
- 32-bit register operands and inline constants
[Valery]
One of the point that requires to pay attention to is how decoder
conflicts were resolved:
- Groups of target instructions were separated by using different
DecoderNamespace (SICI, VI, CI) using similar to AssemblerPredicate
approach.
- There were conflicts in IMAGE_<> instructions caused by two
different reasons:
1. dmask wasn’t specified for the output (fixed)
2. There are image instructions that differ only by the number of
the address components but have the same encoding by the HW spec. The
actual number of address components is determined by the HW at runtime
using image resource descriptor starting from the VGPR encoded in an
IMAGE instruction. This means that we should choose only one instruction
from conflicting group to be the rule for decoder. I didn’t find the way
to disable decoder generation for an arbitrary instruction and therefore
made a onelinear fix to tablegen generator that would suppress decoder
generation when DecoderMethod is set to “NONE”. This is a change that
should be reviewed and submitted first. Otherwise I would need to
specify different DecoderNamespace for every instruction in the
conflicting group. I haven’t checked yet if DecoderMethod=“NONE” is not
used in other targets.
3. IMAGE_GATHER decoder generation is for now disabled and to be
done later.
[/Valery]
Patch By: Sam Kolton
Differential Revision: http://reviews.llvm.org/D16723
llvm-svn: 261185
After r261154, we were only clearing flags if the known-zero register was
originally live-in to the basic block, but we have to do it even if not when
more than one COPY has been eliminated, otherwise the user of the first COPY
may still have <kill> marked.
E.g.
BB#N:
%X0 = COPY %XZR
STRXui %X0<kill>, <fi#0>
%X0 = COPY %XZR
STRXui %X0<kill>, <fi#1>
We can eliminate both copies, X0 is not live-in, but we must clear the kill on
the first store.
Unfortunately, I've been unable to come up with a non-fragile test for this.
I've only seen it in the wild with regalloc-created spills, and attempts to
reproduce that in a reasonable way run afoul of COPY coalescing. Even volatile
asm clobbers were moved around. Should fix the aarch64 bot though.
llvm-svn: 261175
described by an immediate.
Found via http://reviews.llvm.org/D16867
Thanks to Paul Robinson for pointing this out.
<rdar://problem/24456528>
llvm-svn: 261168
Mostly, this fixes the bug that if the CBZ guaranteed Xn but Wn was used, we
didn't sort out the use-def chain properly.
I've also made it check more than just the last instruction for a compatible
CBZ (so it can cope without fallthroughs). I'd have liked to do that
separately, but it's helps writing the test.
Finally, I removed some custom loops in favour of MachineInstr helpers and
refactored the control flow to flatten it and avoid possibly quadratic
iterations in blocks with many copies. NFC for these, just a general tidy-up.
llvm-svn: 261154
This function is used to check whether a dbg.value intrinsic has already
been inserted, but without comparing the DIExpression, it would erroneously
fire on split aggregates and only the first scalar would survive.
Found via http://reviews.llvm.org/D16867.
<rdar://problem/24456528>
llvm-svn: 261145
Loop vectorizer now knows to vectorize GEP and create masked gather and scatter intrinsics for random memory access.
The feature is enabled on AVX-512 target.
Differential Revision: http://reviews.llvm.org/D15690
llvm-svn: 261140
Summary: Store and loads unpacked by instcombine do not always have the right alignement. This explicitely compute the alignement and set it.
Reviewers: dblaikie, majnemer, reames, hfinkel, joker.eph
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17326
llvm-svn: 261139
When support for objc_unsafeClaimAutoreleasedReturnValue has been added to the
ARC optimizer in r258970, one case was missed which would lead the optimizer
to execute an llvm_unreachable. In this case, just handle ClaimRV in the same
way we handle RetainRV.
llvm-svn: 261134
32-bit x86 Windows targets use a linked-list of nodes allocated on the
stack, referenced to via thread-local storage. The personality routine
interprets one of the fields in the node as a 'state number' which
indicates where the personality routine should transfer control.
State transitions are possible only before call-sites which may throw
exceptions. Our previous scheme had us update the state number before
all call-sites which may throw.
Instead, we can try to minimize the number of times we need to store by
reasoning about the nearest store which dominates the current call-site.
If the last store agrees with the current call-site, then we know that
the state-update is redundant and can be elided.
This is largely straightforward: an RPO walk of the blocks allows us to
correctly forward propagate the information when the function is a DAG.
Currently, loops are not handled optimally and may trigger superfluous
state stores.
Differential Revision: http://reviews.llvm.org/D16763
llvm-svn: 261122
Summary:
The syncthreads MI is modeled as mayread/maywrite -- convergence doesn't
even come into play here. Nonetheless this property is highly implicit
in the tablegen files, so a test seems appropriate.
Reviewers: jingyue
Subscribers: llvm-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D17319
llvm-svn: 261114
Summary:
Otherwise we'll try to do unsafe optimizations on these MIs, such as
sinking loads below calls.
(I suspect that this is not the only bug in the NVPTX instruction
tablegen files; I need to comb through them.)
Reviewers: jholewinski, tra
Subscribers: jingyue, jhen, llvm-commits
Differential Revision: http://reviews.llvm.org/D17315
llvm-svn: 261113
The dynamic table is also an array of a fixed structure, so it can be
represented with a DynReginoInfo.
No major functionality change. The extra error checking is covered by
existing tests with a broken dynamic program header.
Idea extracted from r260488. I did the extra cleanups.
llvm-svn: 261107
We are getting better at combining constant pshufb masks - use a real input instead of undef.
Add test for decoding multi-use bitcasted masks as well (actual support will come soon).
llvm-svn: 261101
We used to keep both a section and a pointer to the first symbol.
The oddity of keeping a section for dynamic symbols is because there is
a DT_SYMTAB but no DT_SYMTABZ, so to print the table we have to find the
size via a section table.
The reason for still keeping a pointer to the first symbol is because we
want to be able to print relocation tables even if the section table is
missing (it is mandatory only for files used in linking).
With this patch we keep just a DynRegionInfo. This then requires
changing a few places that were asking for a Elf_Shdr but actually just
needed the first symbol.
The test change is to delete the program header pointer.
Now that we use the information of both DT_SYMTAB and .dynsym, we don't
depend on the sh_entsize of .dynsym if we see DT_SYMTAB.
Note: It is questionable if it is worth it putting the effort to report
broken sh_entsize given that in files with no section table we have to
assume it is sizeof(Elf_Sym), but that is for another change.
Extracted from r260488.
llvm-svn: 261099
Bug description:
The bug was discovered when test was compiled with -O0.
In case scatter result is DAG root , VectorLegalizer failed (assert) due to LowerMSCATTER() return kmask as result.
Change LowerMSCATTER() to return chain as original node do.
Differential Revision: http://reviews.llvm.org/D17331
llvm-svn: 261090
This section is used for debug information and has no need to be
in memory at runtime. This patch also fixes an error when compiling
the Linux kernel. The error is that there are relocations within the
.pdr section in a VDSO. SHT_REL was removed as it is a section type
and not a section flag, therefore it does not make sense for it to
be there. With this patch, LLVM now emits the same flags as
the GNU assembler.
llvm-svn: 261083
AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back.
This patch adds the ability to lower using the bit-blend patterns before defaulting to the splitting behaviour.
Part 2 of 2
Differential Revision: http://reviews.llvm.org/D17292
llvm-svn: 261082
AVX1 doesn't support the shuffling of 256-bit integer vectors. For 32/64-bit elements we get around this by shuffling as float/double but for 8/16-bit elements (assuming they can't widen) we currently just split, shuffle as 128-bit vectors and concatenate the results back.
This patch adds the ability to lower using the bit-mask patterns before defaulting to the splitting behaviour. In some cases this ends up matching what AVX2 would do anyhow or what AVX1 does on the split vectors.
Part 1 of 2
Differential Revision: http://reviews.llvm.org/D17292
llvm-svn: 261081
This patch detects vector reductions before instruction selection. Vector
reductions are vectorized reduction operations, and for such operations we have
freedom to reorganize the elements of the result as long as the reduction of them
stay unchanged. This will enable some reduction pattern recognition during
instruction combine such as SAD/dot-product on X86. A flag is added to
SDNodeFlags to mark those vector reduction nodes to be checked during instruction
combine.
To detect those vector reductions, we search def-use chains starting from the
given instruction, and check if all uses fall into two categories:
1. Reduction with another vector.
2. Reduction on all elements.
in which 2 is detected by recognizing the pattern that the loop vectorizer
generates to reduce all elements in the vector outside of the loop, which
includes several ShuffleVector and one ExtractElement instructions.
Differential revision: http://reviews.llvm.org/D15250
llvm-svn: 261070
This fixes very slow compilation on
test/CodeGen/Generic/2010-11-04-BigByval.ll . Note that MaxStoresPerMemcpy
and friends are not yet carefully tuned so the cutoff point is currently
somewhat arbitrary. However, it's important that there be a cutoff point
so that we don't emit unbounded quantities of loads and stores.
llvm-svn: 261050
reference-edge SCCs.
This essentially builds a more normal call graph as a subgraph of the
"reference graph" that was the old model. This allows both to exist and
the different use cases to use the aspect which addresses their needs.
Specifically, the pass manager and other *ordering* constrained logic
can use the reference graph to achieve conservative order of visit,
while analyses reasoning about attributes and other properties derived
from reachability can reason about the direct call graph.
Note that this isn't necessarily complete: it doesn't model edges to
declarations or indirect calls. Those can be found by scanning the
instructions of the function if desirable, and in fact every user
currently does this in order to handle things like calls to instrinsics.
If useful, we could consider caching this information in the call graph
to save the instruction scans, but currently that doesn't seem to be
important.
An important realization for why the representation chosen here works is
that the call graph is a formal subset of the reference graph and thus
both can live within the same data structure. All SCCs of the call graph
are necessarily contained within an SCC of the reference graph, etc.
The design is to build 'RefSCC's to model SCCs of the reference graph,
and then within them more literal SCCs for the call graph.
The formation of actual call edge SCCs is not done lazily, unlike
reference edge 'RefSCC's. Instead, once a reference SCC is formed, it
directly builds the call SCCs within it and stores them in a post-order
sequence. This is used to provide a consistent platform for mutation and
update of the graph. The post-order also allows for very efficient
updates in common cases by bounding the number of nodes (and thus edges)
considered.
There is considerable common code that I'm still looking for the best
way to factor out between the various DFS implementations here. So far,
my attempts have made the code harder to read and understand despite
reducing the duplication, which seems a poor tradeoff. I've not given up
on figuring out the right way to do this, but I wanted to wait until
I at least had the system working and tested to continue attempting to
factor it differently.
This also requires introducing several new algorithms in order to handle
all of the incremental update scenarios for the more complex structure
involving two edge colorings. I've tried to comment the algorithms
sufficiently to make it clear how this is expected to work, but they may
still need more extensive documentation.
I know that there are some changes which are not strictly necessarily
coupled here. The process of developing this started out with a very
focused set of changes for the new structure of the graph and
algorithms, but subsequent changes to bring the APIs and code into
consistent and understandable patterns also ended up touching on other
aspects. There was no good way to separate these out without causing
*massive* merge conflicts. Ultimately, to a large degree this is
a rewrite of most of the core algorithms in the LCG class and so I don't
think it really matters much.
Many thanks to the careful review by Sanjoy Das!
Differential Revision: http://reviews.llvm.org/D16802
llvm-svn: 261040
__chkstk clobbers EAX. If EAX is live across the prologue, then we have
to take extra steps to save it. We already had code to do this if EAX
was a register parameter. This change adapts it to work when shrink
wrapping is used.
llvm-svn: 261039
Currently, we sometimes miscompile this vector pattern:
(c ? -v : v)
We lower it to (because "c" is <4 x i1>, lowered as a vector mask):
(~c & v) | (c & -v)
When we have SSSE3, we incorrectly lower that to PSIGN, which does:
(c < 0 ? -v : c > 0 ? v : 0)
in other words, when c is either all-ones or all-zero:
(c ? -v : 0)
While this is an old bug, it rarely triggers because the PSIGN combine
is too sensitive to operand order. This will be improved separately.
Note that the PSIGN tests are also incorrect. Consider:
%b.lobit = ashr <4 x i32> %b, <i32 31, i32 31, i32 31, i32 31>
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = xor <4 x i32> %b.lobit, <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = and <4 x i32> %a, %0
%2 = and <4 x i32> %b.lobit, %sub
%cond = or <4 x i32> %1, %2
ret <4 x i32> %cond
if %b is zero:
%b.lobit = <4 x i32> zeroinitializer
%sub = sub nsw <4 x i32> zeroinitializer, %a
%0 = <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>
%1 = <4 x i32> %a
%2 = <4 x i32> zeroinitializer
%cond = or <4 x i32> %a, zeroinitializer
ret <4 x i32> %a
whereas we currently generate:
psignd %xmm1, %xmm0
retq
which returns 0, as %xmm1 is 0.
Instead, use a pure logic sequence, as described in:
https://graphics.stanford.edu/~seander/bithacks.html#ConditionalNegate
Fixes PR26110.
Differential Revision: http://reviews.llvm.org/D17181
llvm-svn: 261023
We're going to stop generating PSIGN, so calling a test "psign"
isn't ideal. Instead, call these tests what they really are:
variable blends using logic.
Also add a test to exhibit a case we're currently missing in
the PSIGN combine.
llvm-svn: 261022
The register stackifier currently checks for intervening stores (and
loads that may alias them) but doesn't account for the fact that the
instruction being moved may affect intervening loads.
Differential Revision: http://reviews.llvm.org/D17298
llvm-svn: 261014
Summary:
I thought -Xlinker -mllvm -Xlinker -stats worked at some point but maybe
it never did.
For clang, I believe that stats are printed from cc1_main. This patch
also prints them for LTO, specifically right after codegen happens.
I only looked at the C API for LTO briefly to see if this is a good
place. Probably there are still cases where this wouldn't be printed
but it seems to be working for the common case. I also experimented
putting this in the LTOCodeGenerator destructor but that didn't trigger
for me because ld64 does not destroy the LTOCodeGenerator.
Reviewers: dexonsmith, joker.eph
Subscribers: rafael, joker.eph, llvm-commits
Differential Revision: http://reviews.llvm.org/D17302
llvm-svn: 261013
The usual way to get a 32-bit relocation is to use a constant extender which doubles the size of the instruction, 4 bytes to 8 bytes.
Another way is to put a .word32 and mix code and data within a function. The disadvantage is it's not a valid instruction encoding and jumping over it causes prefetch stalls inside the hardware.
This relocation packs a 23-bit value in to an "r0 = add(rX, #a)" instruction by overwriting the source register bits. Since r0 is the return value register, if this instruction is placed after a function call which return void, r0 will be filled with an undefined value, the prefetch won't be confused, and the callee can access the constant value by way of the link register.
llvm-svn: 261006
Summary:
This change will add a pass to remove unnecessary zero copies in target blocks
of cbz/cbnz instructions. E.g., the copy instruction in the code below can be
removed because the cbz jumps to BB1 when x0 is zero :
BB0:
cbz x0, .BB1
BB1:
mov x0, xzr
Jun
Reviewers: gberry, jmolloy, HaoLiu, MatzeB, mcrosier
Subscribers: mcrosier, mssimpso, haicheng, bmakam, llvm-commits, aemerson, rengolin
Differential Revision: http://reviews.llvm.org/D16203
llvm-svn: 261004
CopyToReg nodes don't support FrameIndex operands. Other targets select
the FI to some LEA-like instruction, but since we don't have that, we
need to insert some kind of instruction that can take an FI operand and
produces a value usable by CopyToReg (i.e. in a vreg). So insert a dummy
copy_local between Op and its FI operand. This results in a redundant
copy which we should optimize away later (maybe in the post-FI-lowering
peephole pass).
Differential Revision: http://reviews.llvm.org/D17213
llvm-svn: 260987
The root issue appears to be a confusion around what makeNoWrapRegion actually does. It seems likely we need two versions of this function with slightly different semantics.
llvm-svn: 260981