r327100 made us stop producing vecreduce-propagate-sd-flags.s, but it's
still sticking around on some bots. This makes the bots unhappy.
I'll revert this tomorrow.
llvm-svn: 327199
This fixes pr36674.
While it is valid for shouldAssumeDSOLocal to return false anytime,
always returning false for intrinsics is not optimal on i386 and also
hits a bug in the backend.
To use a plt, the caller must first setup ebx to handle the case of
that file being linked into a PIE executable or shared library. In
those cases the generated PLT uses ebx.
Currently we can produce "calll expf@plt" without setting ebx. We
could fix that by correctly setting ebx, but this would produce worse
code for the case where the runtime library is statically linked. It
would also required other tools to handle R_386_PLT32.
llvm-svn: 327198
r327171 "Improve Dependency analysis when doing multi-node Instruction Selection"
r328170 "[DAG] Enforce stricter NodeId invariant during Instruction selection"
Reverting patch as NodeId invariant change is causing pathological
increases in compile time on PPC
llvm-svn: 327197
These instructions have 3 operands that can be commuted. The first commute we find may not be the best. So we should keep searching if we performed an aggressive commute. There may still be an operand that is killed or a physical register constraint that might be better.
Differential Revision: https://reviews.llvm.org/D44324
llvm-svn: 327188
Relanding after fixing NodeId Invariant.
Cleanup cycle/validity checks in ISel (IsLegalToFold,
HandleMergeInputChains) and X86 (isFusableLoadOpStore). Now do a full
search for cycles / dependencies pruning the search when topological
property of NodeId allows.
As part of this propogate the NodeId-based cutoffs to narrow
hasPreprocessorHelper searches.
Reviewers: craig.topper, bogner
Subscribers: llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D41293
llvm-svn: 327171
Instruction Selection makes use of the topological ordering of nodes
by node id (a node's operands have smaller node id than it) when doing
cycle detection. During selection we may violate this property as a
selection of multiple nodes may induce a use dependence (and thus a
node id restriction) between two unrelated nodes. If a selected node
has an unselected successor this may allow us to miss a cycle in
detection an invalid selection.
This patch fixes this by marking all unselected successors of a
selected node have negated node id. We avoid pruning on such negative
ids but still can reconstruct the original id for pruning.
In-tree targets have been updated to replace DAG-level replacements
with ISel-level ones which enforce this property.
This preemptively fixes PR36312 before triggering commit r324359 relands
Reviewers: craig.topper, bogner, jyknight
Subscribers: arsenm, nhaehnle, javed.absar, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D43198
llvm-svn: 327170
The retpoline mitigation for variant 2 of CVE-2017-5715 inhibits the
branch predictor, and as a result it can lead to a measurable loss of
performance. We can reduce the performance impact of retpolined virtual
calls by replacing them with a special construct known as a branch
funnel, which is an instruction sequence that implements virtual calls
to a set of known targets using a binary tree of direct branches. This
allows the processor to speculately execute valid implementations of the
virtual function without allowing for speculative execution of of calls
to arbitrary addresses.
This patch extends the whole-program devirtualization pass to replace
certain virtual calls with calls to branch funnels, which are
represented using a new llvm.icall.jumptable intrinsic. It also extends
the LowerTypeTests pass to recognize the new intrinsic, generate code
for the branch funnels (x86_64 only for now) and lay out virtual tables
as required for each branch funnel.
The implementation supports full LTO as well as ThinLTO, and extends the
ThinLTO summary format used for whole-program devirtualization to
support branch funnels.
For more details see RFC:
http://lists.llvm.org/pipermail/llvm-dev/2018-January/120672.html
Differential Revision: https://reviews.llvm.org/D42453
llvm-svn: 327163
Summary: Starting from GCN 2nd generation, ISA supports ds_read_b128 on top of ds_read_b64.
This patch supports ds_read_b128 instruction pattern and generation of this instruction.
In the vectorizer, this patch also widen the vector length so that vectorizer generates
128 bit loads for local address-space which gets translated to ds_read_b128.
Since the performance benefit is not clear; compiler generates ds_read_b128 under -amdgpu-ds128.
Author: FarhanaAleen
Reviewed By: rampitec, arsenm
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44210
llvm-svn: 327153
The code to match and produce more x86 vector blends was enabled for all
architectures even though the transform may pessimize the code for other
architectures that do not provide a vector blend instruction.
Added an aarch64 testcase to check that a VZIP instruction is generated instead
of byte movs.
Differential Revision: https://reviews.llvm.org/D44118
llvm-svn: 327132
Previously we unpacked the even bytes of each input into the high byte of 16-bit elements then did an v8i16 arithmetic shift right by 8 bits to fill the upper bits of each word with sign bits. Then we did the v8i16 multiply and then masked to zero the upper 8-bits of each result. The similar was done for all the odd bytes. The results are then packed together with packuswb
Since we are masking each multiply result element to 8-bits, and those 8-bits are determined only by the lower 8-bits of each of the inputs, we don't need to fill the upper bits with sign bits. So we can just unpack into the low byte of each element and treat the upper bits as garbage. This is what gcc also does.
Differential Revision: https://reviews.llvm.org/D44267
llvm-svn: 327093
This patch is a fix for PR36642.
While legalizing long vector types, make sure the smaller types get the
flags of the wider type.
bugzilla link: https://bugs.llvm.org/show_bug.cgi?id=36642
Change-Id: I0c2829639f094c862c10a6b51b342d4c2563e1fa
llvm-svn: 327079
This instruction can be thought of as reading either the even elements of a vXi32 input or the lower half of each element of a vXi64 input. We currently use the vXi32 interpretation, but vXi64 matches better with its broadcast behavior in EVEX.
I'm looking at moving MULDQ/MULUDQ creation to a DAG combine so we can do it when AVX512DQ is enabled without having to go through Custom lowering. But in some of the test cases we failed to use a broadcast load due to the size difference. This should help with that.
I'm also wondering if we can model these instructions in native IR and remove the intrinsics and I think using a vXi64 type will work better with that.
llvm-svn: 326991
These patterns weren't checking the alignment of the load, but were using the aligned instructions. This will cause a GP fault if the data isn't aligned.
I believe these were introduced in r312450.
llvm-svn: 326967
The attached testcase started failing after the patch to define
isExtractSubvectorCheap with the following pattern mismatch:
ISEL: Starting pattern match
Initial Opcode index to 85068
Match failed at index 85076
LLVM ERROR: Cannot select: t47: v8i16 = insert_subvector undef:v8i16, t43, Constant:i64<0>
The code generated from llvm/lib/Target/AArch64/AArch64InstrInfo.td
def : Pat<(insert_subvector undef, (v4i16 FPR64:$src), (i32 0)),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), FPR64:$src, dsub)>;
is in ninja/lib/Target/AArch64/AArch64GenDAGISel.inc
At the location of the error it is:
/* 85076*/ OPC_CheckChild2Type, MVT::i32,
And it failed to match the type of operand 2.
Adding another def-pat for i64 fixes the failed def-pat error:
def : Pat<(insert_subvector undef, (v4i16 FPR64:$src), (i64 0)),
(INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), FPR64:$src, dsub)>;
llvm-svn: 326949
Summary:
A desired property of the node order in Swing Modulo Scheduling is
that for nodes outside circuits the following holds: none of them is
scheduled after both a successor and a predecessor. We call
node orders that meet this property valid.
Although invalid node orders do not lead to the generation of incorrect
code, they can cause the pipeliner not being able to find a pipelined schedule
for arbitrary II. The reason is that after scheduling the successor and the
predecessor of a node, no room may be left to schedule the node itself.
For data flow graphs with 0-latency edges, the node ordering algorithm
of Swing Modulo Scheduling can generate such undesired invalid node orders.
This patch fixes that.
In the remainder of this commit message, I will give an example
demonstrating the issue, explain the fix, and explain how the the fix is tested.
Consider, as an example, the following data flow graph with all
edge latencies 0 and all edges pointing downward.
```
n0
/ \
n1 n3
\ /
n2
|
n4
```
Consider the implemented node order algorithm in top-down mode. In that mode,
the algorithm orders the nodes based on greatest Height and in case of equal
Height on lowest Movability. Finally, in case of equal Height and
Movability, given two nodes with an edge between them, the algorithm prefers
the source-node.
In the graph, for every node, the Height and Movability are equal to 0.
As will be explained below, the algorithm can generate the order n0, n1, n2, n3, n4.
So, node n3 is scheduled after its predecessor n0 and after its successor n2.
The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3.
To solve this, this patch introduces the ZeroLatencyHeight (ZLH) property
for nodes. It is defined as the maximum unweighted length of a path from the
given node to an arbitrary node in which each edge has latency 0.
So, ZLH(n0)=3, ZLH(n1)=ZLH(n3)=2, ZLH(n2)=1, and ZLH(n4)=0
In this patch, the preference for a greater ZeroLatencyHeight
is added in the top-down mode of the node ordering algorithm, after the
preference for a greater Height, and before the preference for a
lower Movability.
Therefore, the two allowed node-orders are n0, n1, n3, n2, n4 and n0, n3, n1, n2, n4.
Both of them are valid node orders.
In the same way, the bottom-up mode of the node ordering algorithm is adapted
by introducing the ZeroLatencyDepth property for nodes.
The patch is tested by adding extra checks to the following existing
lit-tests:
test/CodeGen/Hexagon/SUnit-boundary-prob.ll
test/CodeGen/Hexagon/frame-offset-overflow.ll
test/CodeGen/Hexagon/vect/vect-shuffle.ll
Before this patch, the pipeliner failed to pipeline the loops in these tests
due to invalid node-orders. After the patch, the pipeliner successfully
pipelines all these loops.
Reviewers: bcahoon
Reviewed By: bcahoon
Subscribers: Ayal, mgrang, llvm-commits
Differential Revision: https://reviews.llvm.org/D43620
llvm-svn: 326925
The v8i32 conversion on AVX1 targets was only working after LowerMUL splits 256-bit vectors.
While I was there I've also made it so we don't have to check for AVX2 and BWI directly and instead just ask if the type is legal.
Differential Revision: https://reviews.llvm.org/D44190
llvm-svn: 326917
This is a follow-up to r325169, this time for all types, not just HVX
vector types.
Disable this by default, since it's not always safe.
llvm-svn: 326915
Summary: GCN ISA supports instructions that can read 16 consecutive dwords from memory through the scalar data cache;
loadstoreVectorizer should take advantage of the wider vector length and pack 16/8 elements of dwords/quadwords.
Author: FarhanaAleen
Reviewed By: rampitec
Subscribers: llvm-commits, AMDGPU
Differential Revision: https://reviews.llvm.org/D44179
llvm-svn: 326910
Fixes the bug found by asan. Also XFAIL the new test for Darwin,
which is stuck on DWARF v2, and fix up other tests so they stop
failing on Windows.
llvm-svn: 326839
Following the ARM-neon backend, define isExtractSubvectorCheap to return true
when extracting low and high part of a neon register.
The patch disables a test in llvm/test/CodeGen/AArch64/arm64-ext.ll This
testcase is fragile in the sense that it requires a BUILD_VECTOR to "survive"
all DAG transforms until ISelLowering. The testcase is supposed to check that
AArch64TargetLowering::ReconstructShuffle() works, and for that we need a
BUILD_VECTOR in ISelLowering. As we now transform the BUILD_VECTOR earlier into
an VEXT + vector_shuffle, we don't have the BUILD_VECTOR pattern when we get to
ISelLowering. As there is no way to disable the combiner to only exercise the
code in ISelLowering, the patch disables the testcase.
Differential revision: https://reviews.llvm.org/D43973
llvm-svn: 326811
One addrspacecast disappeared in clang emitted IR for
block invoke function due to adoption of the new
addr space mapping.
Differential Revision: https://reviews.llvm.org/D43785
llvm-svn: 326806
Before I started maintaining the AVR backend, this instruction
never originally used to have an earlyclobber flag.
Some time afterwards (years ago), I must've added it back in, not realising that it
was left out for a reason.
This pseudo instrction exists solely to work around a long standing bug
in the register allocator.
Before this commit, the LDDWRdYQ pseudo was not actually working around
any bug. With the earlyclobber flag removed again, the LDDWRdYQ pseudo
now correctly works around PR13375 again.
llvm-svn: 326774
EAX can turn out to be alive here, when shrink wrapping is done
(which is allowed when using dwarf exceptions, contrary to the
normal case with WinCFI).
This fixes PR36487.
Differential Revision: https://reviews.llvm.org/D43968
llvm-svn: 326764
DWARF v5 specifies that the root file (also given in the DW_AT_name
attribute of the compilation unit DIE) should be emitted explicitly to
the line table's list of files. This makes the line table more
independent of the .debug_info section.
Differential Revision: https://reviews.llvm.org/D44054
llvm-svn: 326758
Summary:
Fabs is a common floating-point operation, especially for some expansions. This patch adds
a new generic opcode for llvm.fabs.* intrinsic in order to avoid building/matching this intrinsic.
Reviewers: qcolombet, aditya_nandakumar, dsanders, rovka
Reviewed By: aditya_nandakumar
Subscribers: kristof.beyls, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D43864
llvm-svn: 326749