Code mostly copied from AArch64, just tidied up a trifle and plumbed
into the ARM64 way of doing things.
This also enables the AArch64 tests which inspired the previous
untested commits.
llvm-svn: 206574
A vector extract followed by a dup can become a single instruction even if the
types don't match. AArch64 handled this in ISelLowering, but a few reasonably
simple patterns can take care of it in TableGen, so that's where I've put it.
llvm-svn: 206573
ARM64 was scalarizing some vector comparisons which don't quite map to
AArch64's compare and mask instructions. AArch64's approach of sacrificing a
little efficiency to emulate them with the limited set available was better, so
I ported it across.
More "inspired by" than copy/paste since the backend's internal expectations
were a bit different, but the tests were invaluable.
llvm-svn: 206570
I enhanced it a little in the process. The decision shouldn't really be beased
on whether a BUILD_VECTOR is a splat: any set of constants will do the job
provided they're related in the correct way.
Also, the BUILD_VECTOR could be any operand of the incoming AND nodes, so it's
best to check for all 4 possibilities rather than assuming it'll be the RHS.
llvm-svn: 206569
It's not actually used to handle C or C++ ABI rules on ARM64, but could well be
emitted by other language front-ends, so it's as well to have a sensible
implementation.
llvm-svn: 206568
Use scalar BFE with constant shift and offset when possible.
This is complicated by the fact that the scalar version packs
the two operands of the vector version into one.
llvm-svn: 206558
Rewrite the shared implementation of BlockFrequencyInfo and
MachineBlockFrequencyInfo entirely.
The old implementation had a fundamental flaw: precision losses from
nested loops (or very wide branches) compounded past loop exits (and
convergence points).
The @nested_loops testcase at the end of
test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating. This
function has three nested loops, with branch weights in the loop headers
of 1:4000 (exit:continue). The old analysis gives non-sensical results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
---- Block Freqs ----
entry = 1.0
for.cond1.preheader = 1.00103
for.cond4.preheader = 5.5222
for.body6 = 18095.19995
for.inc8 = 4.52264
for.inc11 = 0.00109
for.end13 = 0.0
The new analysis gives correct results:
Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
block-frequency-info: nested_loops
- entry: float = 1.0, int = 8
- for.cond1.preheader: float = 4001.0, int = 32007
- for.cond4.preheader: float = 16008001.0, int = 128064007
- for.body6: float = 64048012001.0, int = 512384096007
- for.inc8: float = 16008001.0, int = 128064007
- for.inc11: float = 4001.0, int = 32007
- for.end13: float = 1.0, int = 8
Most importantly, the frequency leaving each loop matches the frequency
entering it.
The new algorithm leverages BlockMass and PositiveFloat to maintain
precision, separates "probability mass distribution" from "loop
scaling", and uses dithering to eliminate probability mass loss. I have
unit tests for these types out of tree, but it was decided in the review
to make the classes private to BlockFrequencyInfoImpl, and try to shrink
them (or remove them entirely) in follow-up commits.
The new algorithm should generally have a complexity advantage over the
old. The previous algorithm was quadratic in the worst case. The new
algorithm is still worst-case quadratic in the presence of irreducible
control flow, but it's linear without it.
The key difference between the old algorithm and the new is that control
flow within a loop is evaluated separately from control flow outside,
limiting propagation of precision problems and allowing loop scale to be
calculated independently of mass distribution. Loops are visited
bottom-up, their loop scales are calculated, and they are replaced by
pseudo-nodes. Mass is then distributed through the function, which is
now a DAG. Finally, loops are revisited top-down to multiply through
the loop scales and the masses distributed to pseudo nodes.
There are some remaining flaws.
- Irreducible control flow isn't modelled correctly. LoopInfo and
MachineLoopInfo ignore irreducible edges, so this algorithm will
fail to scale accordingly. There's a note in the class
documentation about how to get closer. See also the comments in
test/Analysis/BlockFrequencyInfo/irreducible.ll.
- Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
the 64-bit integer precision used downstream.
- The "bias" calculation proposed on llvmdev is *not* incorporated
here. This will be added in a follow-up commit, once comments from
this review have been handled.
llvm-svn: 206548
Change the command line vector-insertion.ll to explicitly set the neon syntax
to apple so that buildbots that default to other syntaxes won't fail.
llvm-svn: 206502
Having i128 as a legal type complicates the legalization phase. v4i32
is already a legal type, so we will use that instead.
This fixes several piglit tests.
llvm-svn: 206500
This patch improves the performance of vector creation in caseiswhere where
several of the lanes in the vector are a constant floating point value. It
also includes new patterns to fold together some of the instructions when the
value is 0.0f. Test cases included.
rdar://16349427
llvm-svn: 206496
Previously, SSPBufferSize was assigned the value of the "stack-protector-buffer-size"
attribute after all uses of SSPBufferSize. The effect was that the default
SSPBufferSize was always used during analysis. I moved the check for the
attribute before the analysis; now --param ssp-buffer-size= works correctly again.
Differential Revision: http://reviews.llvm.org/D3349
llvm-svn: 206486
The commit of r205855:
Author: Arnold Schwaighofer <aschwaighofer@apple.com>
Date: Wed Apr 9 14:20:47 2014 +0000
SLPVectorizer: Only vectorize intrinsics whose operands are widened equally
The vectorizer only knows how to vectorize intrinics by widening all operands by
the same factor.
Patch by Tyler Nowicki!
exposed a backend bug causing a regression (Cannot select ctpop).
The commit msg is a bit confusing because the patch actually changes the
behavior for the loop-vectorizer as well. As things got refactored into a
helper ctpop got snuck in to the trivially-vectorizable helper which is now
used by both vectorizers. In other words, we started seeing vector-ctpops in
the backend.
This change makes ctpop LegalizeAction::Expand for the types not supported by
the byte-only CNT instruction. We may be able to custom-lower these later to
a single CNT but this is to fix the compiler crash first.
Fixes <rdar://problem/16578951>
llvm-svn: 206433
This is so that EF_MIPS_NAN2008 is set if we are using IEEE 754-2008
NaN encoding (-mnan=2008). This patch also adds support for parsing
'.nan legacy' and '.nan 2008' assembly directives. The handling of
these directives should match GAS' behaviour i.e., the last directive
in use sets the ELF header bit (EF_MIPS_NAN2008).
Differential Revision: http://reviews.llvm.org/D3346
llvm-svn: 206396
These ones used completely different sets of intrinsics, so the only way to do
it is create a separate ARM64 copy and change them all.
Other than that, CodeGen was straightforward, no deficiencies detected here.
llvm-svn: 206392
Summary: This was a case of incorrect usage of hasMips64() vs isABI_N64()
Reviewers: matheusalmeida, dsanders
Reviewed By: dsanders
Differential Revision: http://reviews.llvm.org/D3398
llvm-svn: 206388
This should fix the ninja-x64-msvc-RA-centos6 builder.
I suspect the check in MipsSubtarget.cpp is incorrect and is really trying to
check for a bare-metal target rather and anything other than linux. I'll
investigate this.
llvm-svn: 206385
The most important part here is that we should actuall emit the stubs we refer
to in the exception table, but as a side issue this uses more sensible & GCC
compatible representations for some of the bits of information.
llvm-svn: 206380
If we know that a particular 64-bit constant has all high bits zero, then we
can rely on the fact that 32-bit ARM64 instructions automatically zero out the
high bits of an x-register. This gives the expansion logic less constraints to
satisfy and so sometimes allows it to pick better sequences.
Came up while porting test/CodeGen/AArch64/movw-consts.ll: this will allow a
32-bit MOVN to be used in @test8 soon.
llvm-svn: 206379
if not in micromips mode.
The test (elf_st_other.ll) was renamed as the name and description didn't
make sense as the test wasn't checking any symbol table entry.
Differential Revision: http://reviews.llvm.org/D3346
llvm-svn: 206377
Summary:
I had difficulty finding tests for the N32 and N64 ABI so I've added a
collection of calling convention tests based on the document MIPS ABIs
Described (MD00305), the MIPSpro N32 Handbook, and the SYSV ABI. Where the
documents/implementations disagree, I've used GCC to resolve the conflict.
A few interesting details:
* For N32, LLVM uses 64-bit pointers when saving $ra despite pointers being
32-bit. I've yet to find a supporting statement in the ABI documentation but
the current behaviour matches GCC.
* For O32, the non-variable portion of a varargs argument list is also subject
to the rule that floating-point is passed via GPR's (on N32/N64 only the
variable portion is subject to this rule). This agrees with GCC's behaviour
and the SYSV ABI but contradicts part of the MIPSpro N32 Handbook which talks about O32's behaviour.
* The N32 implementation has the wrong callee-saved register list.
(I already have a fix for this but will commit it as a follow-up).
I've left RUN-TODO lines in for O32 on MIPS64. I don't plan to support this case
for now but we should revisit it.
Reviewers: matheusalmeida, vmedic
Reviewed By: matheusalmeida
Differential Revision: http://reviews.llvm.org/D3339
llvm-svn: 206370
This particular DAG combine is designed to kick in when both ConstantFPs will
end up being loaded via a litpool, however those nodes have a semi-legal
status, dictated by isFPImmLegal so in some cases there wouldn't have been a
litpool in the first place. Don't try to be clever in those circumstances.
Picked up while merging some AArch64 tests.
llvm-svn: 206365
Print in decimal for inline immediates, and hex otherwise. Use hex
always for offsets in addressing offsets.
This approximately matches what the shader compiler does.
llvm-svn: 206335
handles Intrinsic::trap if TargetOptions::TrapFuncName is set.
This fixes a bug in which the trap function was not taken into consideration
when a program was compiled without optimization (at -O0).
<rdar://problem/16291933>
llvm-svn: 206323
This patch teaches the backend how to efficiently lower logical and
arithmetic packed shifts on both SSE and AVX/AVX2 machines.
When possible, instead of scalarizing a vector shift, the backend should try
to expand the shift into a sequence of two packed shifts by immedate count
followed by a MOVSS/MOVSD.
Example
(v4i32 (srl A, (build_vector < X, Y, Y, Y>)))
Can be rewritten as:
(v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>)))
[with X and Y ConstantInt]
The advantage is that the two new shifts from the example would be lowered into
X86ISD::VSRLI nodes. This is always cheaper than scalarizing the vector into
four scalar shifts plus four pairs of vector insert/extract.
llvm-svn: 206316
Sometimes we need emit the bits that would actually be a MOVN when producing a
relocated MOVZ instruction (don't ask). But not always, a check which ARM64 got
wrong until now.
llvm-svn: 206289