40796 Commits

Author SHA1 Message Date
Michael Zuckerman 558a4d8419 [X86][AVX512] Adding missing shuffle lowering to blend mask instructions
Some shuffles can be lowered to blend mask instructions (VPBLENDMB/VPBLENDMW/VPBLENDMD/VPBLENDMQ).
In this patch, I added a new pattern match for this case.
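
For illustration, the per-lane selection that a blend-with-mask performs can be modelled in scalar C++. This is a minimal sketch, assuming one mask bit per lane with a set bit selecting the second source; `blendByMask` is a hypothetical helper, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a masked blend: lane I of the result comes
// from B when bit I of KMask is set, and from A otherwise.
std::vector<uint32_t> blendByMask(const std::vector<uint32_t> &A,
                                  const std::vector<uint32_t> &B,
                                  uint64_t KMask) {
  assert(A.size() == B.size() && A.size() <= 64 && "one mask bit per lane");
  std::vector<uint32_t> Result(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    Result[I] = ((KMask >> I) & 1) ? B[I] : A[I];
  return Result;
}
```

A shuffle that picks every lane from the same position in one of its two inputs fits this model, which is presumably the shape the new pattern looks for.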

Reviewers:
1. craig.topper
2. guyblank
3. RKSimon
4. igorb

Differential Revision: https://reviews.llvm.org/D28483

llvm-svn: 291888
2017-01-13 09:06:00 +00:00
Craig Topper 1ec84c2a18 [AVX-512] Remove unmasked BLENDM instructions from the wrong load folding table. The unmasked versions read memory from operand 2, but were in the operand 3 table.
These aren't the most interesting set of blendm instructions as the unmasked version isn't useful. We were also missing the B and W forms. I'll add the masked versions of all sizes in a future patch.

llvm-svn: 291885
2017-01-13 07:28:56 +00:00
Craig Topper 46b6ecf41e [X86] Move some entries in the load folding tables to more appropriate grouping. NFC
llvm-svn: 291884
2017-01-13 07:28:53 +00:00
Eugene Zelenko 8187c192c6 [PowerPC] Fix some Clang-tidy modernize and Include What You Use warnings; other minor fixes (NFC).
llvm-svn: 291872
2017-01-13 00:58:58 +00:00
Nikolai Bozhenov f02ac0eeb2 [X86] Replace AND+IMM64 with SRL/SHL
Emit SHRQ/SHLQ instead of ANDQ with a 64 bit constant mask if the result
is unused and the mask has only higher/lower bits set. For example, with
this patch LLVM emits

  shrq $41, %rdi
  je

instead of

  movabsq $0xFFFFFE0000000000, %rcx
  testq   %rcx, %rdi
  je

This reduces the number of instructions, code size and register pressure.
The transformation is applied only for cases where the mask cannot be
encoded as an immediate value within TESTQ instruction.
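
The equivalence behind this rewrite can be checked in plain C++. This is a minimal sketch, assuming a mask whose set bits are exactly the bits at and above position `Shift`; both function names are hypothetical and only the zero/non-zero result is compared, since the original value is unused.

```cpp
#include <cassert>
#include <cstdint>

// Reference form: AND with a 64-bit mask that has only its high bits set.
bool isZeroViaMask(uint64_t X, unsigned Shift) {
  assert(Shift < 64 && "shift amount must be smaller than the bit width");
  const uint64_t HighMask = ~uint64_t{0} << Shift;
  return (X & HighMask) == 0;
}

// Rewritten form: shift the low bits out and test what is left.
bool isZeroViaShift(uint64_t X, unsigned Shift) {
  assert(Shift < 64 && "shift amount must be smaller than the bit width");
  return (X >> Shift) == 0;
}
```

With `Shift` equal to 41 the mask is 0xFFFFFE0000000000, the constant from the example above. The low-bits case is symmetric, using a left shift.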

Differential Revision: https://reviews.llvm.org/D28198

llvm-svn: 291806
2017-01-12 19:54:27 +00:00
Nikolai Bozhenov 6bdf92cec7 [X86] Tune bypassing of slow division for Intel CPUs
64-bit integer division in Intel CPUs is extremely slow, much slower
than 32-bit division. On the other hand, 8-bit and 16-bit divisions
aren't any faster. The only important exception is Atom where DIV8
is fastest. Because of that, the patch
1) Enables bypassing of 64-bit division for Atom, Silvermont and
   all big cores.
2) Modifies 64-bit bypassing to use 32-bit division instead of
   16-bit one. This doesn't make the shorter division slower but
   increases chances of taking it. Moreover, it's much more likely
   to prove at compile-time that a value fits 32 bits and doesn't
   require a run-time check (e.g. zext i32 to i64).
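
At source level, the bypass described above amounts to the following. This is a minimal sketch, assuming unsigned division and a run-time check on both operands; `udivBypass` is a hypothetical name, and the real transformation is applied by the compiler rather than written by hand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the bypass: when both operands fit in 32 bits,
// take the cheaper 32-bit division; otherwise fall back to 64-bit division.
uint64_t udivBypass(uint64_t A, uint64_t B) {
  assert(B != 0 && "division by zero");
  if (((A | B) >> 32) == 0)
    return static_cast<uint32_t>(A) / static_cast<uint32_t>(B);
  return A / B;
}
```

When the operands are known at compile time to be zero-extended from 32 bits, the check folds away and only the short division remains.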

Differential Revision: https://reviews.llvm.org/D28196

llvm-svn: 291800
2017-01-12 19:34:15 +00:00
Matt Arsenault 45337df08f AMDGPU: Skip fneg/select combine if it can fold into other
llvm-svn: 291792
2017-01-12 18:58:15 +00:00
Matt Arsenault 31c039ef2e AMDGPU: Fold free fneg into sin
llvm-svn: 291790
2017-01-12 18:48:09 +00:00
Saleem Abdulrasool 555e5980a5 ARM: slightly more table driven libcall setup
Switch some additional library call setup to be table driven.  This
makes it more immediately obvious what the library call looks like.
This is important for ARM since the calling conventions for the builtins
change based on the target/libcall name.  NFC

llvm-svn: 291789
2017-01-12 18:46:11 +00:00
Matt Arsenault a8c325e2f5 AMDGPU: Fold fneg into fmul_legacy
llvm-svn: 291784
2017-01-12 18:26:30 +00:00
Matt Arsenault ff7e5aadf5 AMDGPU: Fold fneg into rcp
llvm-svn: 291779
2017-01-12 17:46:35 +00:00
Matt Arsenault 4242d48c36 AMDGPU: Fold fneg into fp_round
llvm-svn: 291778
2017-01-12 17:46:33 +00:00
Matt Arsenault 98d2bf1024 AMDGPU: Fold fneg into fp_extend
llvm-svn: 291777
2017-01-12 17:46:28 +00:00
Daniel Sanders b7391dd3b4 [globalisel] Move as much RegisterBank initialization to the constructor as possible
Summary:
The register bank is now entirely initialized in the constructor. However,
we still have the hardcoded number of register classes which will be
dealt with in the TableGen patch (D27338) since we do not have access
to this information to resolve this at this stage. The number of register
classes is known to the TRI and to TableGen but the RegisterBank
constructor is too early for the former and too late for the latter.
This will be fixed when the data is tablegen-erated.

Reviewers: t.p.northover, ab, rovka, qcolombet

Subscribers: aditya_nandakumar, kristof.beyls, vkalintiris, llvm-commits, dberris

Differential Revision: https://reviews.llvm.org/D27809

llvm-svn: 291770
2017-01-12 16:11:23 +00:00
Daniel Sanders ae03595bfb [globalisel] Initialize RegisterBanks with static data.
Summary:
Refactor the RegisterBank initialization to use static data. This requires
GlobalISel implementations to rewrite calls to createRegisterBank() and
addRegBankCoverage() into a call to setRegBankData().

Out of tree targets can use diff 4 of D27807
(https://reviews.llvm.org/D27807?id=84117) to have addRegBankCoverage() dump
the register classes and other data that needs to be provided to
setRegBankData(). This is the method that was used to generate the static data
in this patch.

Tablegen-eration of this static data will follow after some refactoring.

Reviewers: t.p.northover, ab, rovka, qcolombet

Subscribers: aditya_nandakumar, kristof.beyls, vkalintiris, llvm-commits, dberris

Differential Revision: https://reviews.llvm.org/D27807
Differential Revision: https://reviews.llvm.org/D27808

llvm-svn: 291768
2017-01-12 15:32:10 +00:00
Matt Arsenault f003198b28 AMDGPU: Fix sub_oneuse being marked commutative
llvm-svn: 291748
2017-01-12 07:17:28 +00:00
Craig Topper 24c3a2395f [AVX-512] Improve lowering of zero_extend of v4i1 to v4i32 and v2i1 to v2i64 with VLX, but no DQ or BW support.
llvm-svn: 291747
2017-01-12 06:49:12 +00:00
Craig Topper 69ab67b279 [AVX-512] Improve lowering of sign_extend of v4i1 to v4i32 and v2i1 to v2i64 when avx512vl is available, but not avx512dq.
llvm-svn: 291746
2017-01-12 06:49:08 +00:00
Elad Cohen c5ba925ef2 [X86][AVX512] Fix PR31515 - Do not flip vselect condition if it's not a vXi1 mask
r289653 added a case where `vselect <cond> <vector1> <all-zeros>`
is transformed to:
`vselect xor(cond, DAG.getConstant(1, DL, CondVT)) <all-zeros> <vector1>`
This was not meant to catch cases where Cond is not a vXi1
mask, but it does. Moreover, when the Cond type is vXiN (N > 1),
xor(cond, DAG.getConstant(1, DL, CondVT)) != NOT(cond).
This patch changes the above to xor with allones, and avoids
entering the case for non-mask Conds.
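
The difference between the two xor forms is easy to see on a single lane. This is a minimal sketch, assuming lanes are modelled as unsigned integers of `Bits` width; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// All-ones value for a lane that is Bits wide.
uint64_t laneAllOnes(unsigned Bits) {
  assert(Bits >= 1 && Bits <= 64 && "unsupported lane width");
  return Bits == 64 ? ~uint64_t{0} : (uint64_t{1} << Bits) - 1;
}

// xor with the constant 1: flips only the lowest bit of the lane.
uint64_t xorWithOne(uint64_t Lane, unsigned Bits) {
  return (Lane ^ 1) & laneAllOnes(Bits);
}

// xor with all-ones: flips every bit, i.e. a true NOT of the lane.
uint64_t xorWithAllOnes(uint64_t Lane, unsigned Bits) {
  return (Lane ^ laneAllOnes(Bits)) & laneAllOnes(Bits);
}
```

For a 1-bit lane both forms agree. For an 8-bit lane holding 0xFF, xor with 1 gives 0xFE, which is still non-zero, while xor with all-ones gives 0x00.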

llvm-svn: 291745
2017-01-12 06:49:03 +00:00
Matt Arsenault 63f953795e AMDGPU: Fold fneg into fma or fmad
Patch mostly by Fiona Glaser

llvm-svn: 291733
2017-01-12 00:32:16 +00:00
Matt Arsenault 4103a81d6d AMDGPU: Fold fneg into fmul
Patch mostly by Fiona Glaser

llvm-svn: 291732
2017-01-12 00:23:20 +00:00
Matt Arsenault 2529fba989 AMDGPU: Fold fneg into fadd
Patch mostly by Fiona Glaser

llvm-svn: 291731
2017-01-12 00:09:34 +00:00
Matt Arsenault 2a04ff97ad AMDGPU: Pull fneg/fabs out of a select
Allows better source modifier usage.

llvm-svn: 291729
2017-01-11 23:57:38 +00:00
Peter Collingbourne 1b5f1cfdb4 X86: Remove dead code. NFC.
llvm-svn: 291721
2017-01-11 23:00:28 +00:00
Matt Arsenault 24a1273ae1 AMDGPU: Fix shrinking of addc/subb.
To shrink to VOP2 the input carry must also be VCC.

llvm-svn: 291720
2017-01-11 22:58:12 +00:00
Matt Arsenault 682eb4396a AMDGPU: Fix sext_inreg for i1 in i16
This produces worse code when i16 is legal, mostly
due to combines getting confused by conversions inserted
for uniform 16-bit operations.

llvm-svn: 291717
2017-01-11 22:35:22 +00:00
Matt Arsenault 28bd4cbeaf AMDGPU: Fix breaking VOP3 v_add_i32s
This was shrinking the instruction even though the carry output
register was a virtual register, not known VCC.

llvm-svn: 291716
2017-01-11 22:35:17 +00:00
Matt Arsenault 69e3001b84 AMDGPU: Fix folding immediates into mac src2
Whether it is legal or not needs to check for the instruction
it will be replaced with.

llvm-svn: 291711
2017-01-11 22:00:02 +00:00
Eli Friedman 3a03742c37 [ARM] More aggressive matching for vpadd and vpaddl.
The new matchers work after legalization to make them simpler, and to avoid
blocking other optimizations.

Differential Revision: https://reviews.llvm.org/D27779

llvm-svn: 291693
2017-01-11 19:33:38 +00:00
Simon Pilgrim 0c1faf432b Remove trailing whitespace. NFCI.
llvm-svn: 291680
2017-01-11 16:38:20 +00:00
Jonas Paulsson c282975604 [SystemZ] Improve isFoldableMemAccessOffset().
A store of an extracted element or a load which gets inserted into a vector,
will be combined into a vector load/store element instruction.

Therefore, isFoldableMemAccessOffset(), which is called by LSR, should
return false in these cases.

Reviewer: Ulrich Weigand
llvm-svn: 291673
2017-01-11 14:40:39 +00:00
Elena Demikhovsky 9d0e7c33d3 X86 CodeGen: Optimized pattern for truncate with unsigned saturation.
DAG pattern optimization: truncate + unsigned saturation is supported by the VPMOVUS* instructions in AVX-512
and by the VPACKUS* instructions on SSE* targets.
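
The scalar operation being matched is a clamp to the destination range followed by a truncate. This is a minimal sketch, assuming a 32-bit source lane and a narrower destination; `truncUSat` is a hypothetical name and not a function from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Truncate X to DstBits with unsigned saturation: values that do not fit
// become the largest value representable in DstBits.
uint32_t truncUSat(uint32_t X, unsigned DstBits) {
  assert(DstBits >= 1 && DstBits < 32 && "destination must be narrower");
  const uint32_t DstMax = (uint32_t{1} << DstBits) - 1;
  return std::min(X, DstMax) & DstMax;
}
```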

Differential Revision: https://reviews.llvm.org/D28216

llvm-svn: 291670
2017-01-11 12:59:32 +00:00
Sam Kolton 9772eb3907 [AMDGPU] Assembler: SDWA/DPP should not accept scalar registers and immediate operands
Reviewers: artem.tamazov, nhaustov, vpykhtin, tstellarAMD

Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye

Differential Revision: https://reviews.llvm.org/D28157

llvm-svn: 291668
2017-01-11 11:46:30 +00:00
Simon Pilgrim 5a81fefad3 [X86][AVX512BW] Vectorize v64i8 vector shifts
Differential Revision: https://reviews.llvm.org/D28447

llvm-svn: 291665
2017-01-11 10:36:51 +00:00
Elad Cohen 0c2601073e [X86] Fix PR30926 - Add patterns for (v)cvtsi2s{s,d} and (v)cvtsd2s{s,d}
The code emitted by Clang's intrinsics for (v)cvtsi2ss, (v)cvtsi2sd,
(v)cvtsd2ss and (v)cvtss2sd is lowered to a code sequence that includes
redundant (v)movss/(v)movsd instructions. This patch adds patterns for
optimizing these sequences.

Differential revision: https://reviews.llvm.org/D28455

llvm-svn: 291660
2017-01-11 09:11:48 +00:00
Mohammed Agabaria 2c96c43388 [X86] updating TTI costs for arithmetic instructions on X86\SLM arch.
updated instructions:
pmulld, pmullw, pmulhw, mulsd, mulps, mulpd, divss, divps, divsd, divpd, addpd and subpd.

Special optimization case which replaces pmulld with a pmullw\pmulhw\pshuf sequence
when the real operand bitwidth is <= 16.
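
Why the narrower multiplies suffice can be shown with a scalar model. This is a minimal sketch, assuming signed operands that fit in 16 bits; the helper names are hypothetical, and the final combine stands in for the shuffle step.

```cpp
#include <cassert>
#include <cstdint>

// Low 16 bits of the signed 16x16 product (the pmullw half).
uint16_t mulLo16(int16_t A, int16_t B) {
  return static_cast<uint16_t>(static_cast<uint32_t>(int32_t{A} * int32_t{B}) & 0xFFFF);
}

// High 16 bits of the signed 16x16 product (the pmulhw half).
uint16_t mulHi16(int16_t A, int16_t B) {
  return static_cast<uint16_t>(static_cast<uint32_t>(int32_t{A} * int32_t{B}) >> 16);
}

// Bit pattern of the 32-bit product, rebuilt from the two 16-bit halves.
uint32_t mul32ViaHalves(int32_t A, int32_t B) {
  assert(A >= INT16_MIN && A <= INT16_MAX && "operand must fit in 16 bits");
  assert(B >= INT16_MIN && B <= INT16_MAX && "operand must fit in 16 bits");
  const int16_t A16 = static_cast<int16_t>(A);
  const int16_t B16 = static_cast<int16_t>(B);
  return (uint32_t{mulHi16(A16, B16)} << 16) | mulLo16(A16, B16);
}
```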

Differential Revision: https://reviews.llvm.org/D28104

llvm-svn: 291657
2017-01-11 08:23:37 +00:00
Eugene Zelenko c4ad1ce068 [Target] Fix some Clang-tidy modernize and Include What You Use warnings; other minor fixes (NFC).
llvm-svn: 291641
2017-01-11 01:45:03 +00:00
Hans Wennborg 6573976f57 Re-commit r289955: [X86] Fold (setcc (cmp (atomic_load_add x, -C) C), COND) to (setcc (LADD x, -C), COND) (PR31367)
This was reverted because it would miscompile code where the cmp had
multiple uses. That was due to a deficiency in the existing code, which
was fixed in r291630 (see the PR for details).

This re-commit includes an extra test for the kind of code that got
miscompiled: @test_sub_1_setcc_jcc.

llvm-svn: 291640
2017-01-11 01:36:57 +00:00
Hans Wennborg 12de693747 [X86] Don't run combineSetCCAtomicArith() when the cmp has multiple uses
We would miscompile the following:

  void g(int);
  int f(volatile long long *p) {
    bool b = __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST) < 0;
    g(b ? 12 : 34);
    return b ? 56 : 78;
  }

into

  pushq   %rax
  lock            incq    (%rdi)
  movl    $12, %eax
  movl    $34, %edi
  cmovlel %eax, %edi
  callq   g(int)
  testq   %rax, %rax   <---- Bad.
  movl    $56, %ecx
  movl    $78, %eax
  cmovsl  %ecx, %eax
  popq    %rcx
  retq

because the code failed to take into account that the cmp has multiple
uses, replaced one of them, and left the other one comparing garbage.

llvm-svn: 291630
2017-01-11 00:49:54 +00:00
Jan Vesely 0d6cb1caaf AMDGPU/EG,CM: Add fp16 conversion instructions
Differential Revision: https://reviews.llvm.org/D28164

llvm-svn: 291622
2017-01-11 00:12:39 +00:00
Justin Lebar 7d81813d76 [TM] Restore default TargetOptions in TargetMachine::resetTargetOptions.
Summary:
Previously if you had

 * a function with the fast-math-enabled attr, followed by
 * a function without the fast-math attr,

the second function would inherit the first function's fast-math-ness.

This means that mixing fast-math and non-fast-math functions in a module
was completely broken unless you explicitly annotated every
non-fast-math function with "unsafe-fp-math"="false".  This appears to
have been broken since r176986 (March 2013), when the resetTargetOptions
function was introduced.

This patch tests the correct behavior as best we can.  I don't think I
can test FPDenormalMode and NoTrappingFPMath, because they aren't used
in any backends during function lowering.  Surprisingly, I also can't
find any uses at all of LessPreciseFPMAD affecting generated code.

The NVPTX/fast-math.ll test changes are an expected result of fixing
this bug.  When FMA is disabled, we emit add as "add.rn.f32", which
prevents fma combining.  Before this patch, fast-math was enabled in all
functions following the one which explicitly enabled it on itself, so we
were emitting plain "add.f32" where we should have generated
"add.rn.f32".

Reviewers: mkuper

Subscribers: hfinkel, majnemer, jholewinski, nemanjai, llvm-commits

Differential Revision: https://reviews.llvm.org/D28507

llvm-svn: 291618
2017-01-10 23:43:04 +00:00
Evandro Menezes 330e1b8945 [AArch64] Consider all vector types for FeatureSlowMisaligned128Store
The original code considered only v2i64 as slow for this feature. This patch
considers all 128-bit vector types as slow candidates.

In internal tests, extending this feature to all 128-bit vector types
resulted in an overall improvement of 1% on Exynos M1.

Differential revision: https://reviews.llvm.org/D27998

llvm-svn: 291616
2017-01-10 23:42:21 +00:00
Matt Arsenault 51818c14b3 AMDGPU: Constant fold when immediate is materialized
In future commits these patterns will appear after moveToVALU changes.

llvm-svn: 291615
2017-01-10 23:32:04 +00:00
Derek Schuff 7acb42a41a [WebAssembly] Only RAUW a constant once in FixFunctionBitcasts
When we collect 2 uses of a function in FindUses and then RAUW when we
visit the first, we end up visiting the wrapper (because the second was
RAUW'd).  We still want to use RAUW instead of just Use->set() because
it has special handling for Constants, so this patch just ensures that
only one use of each constant is added to the work list.

Differential Revision: https://reviews.llvm.org/D28504

llvm-svn: 291603
2017-01-10 21:59:53 +00:00
Chad Rosier d0114fc1dd [ARM] Remove rbit intrinsics and autoupgrade to generic bitreverse.
Testing already covered by CodeGen/ARM/rbit.ll

llvm-svn: 291587
2017-01-10 19:23:51 +00:00
Matt Arsenault 8871683d60 AMDGPU: Add tests for HasMultipleConditionRegisters
This was enabled without many specific tests or the comment.

llvm-svn: 291586
2017-01-10 19:08:15 +00:00
Michael Zuckerman bcd03e7f3b [X86][AVX512]Improving shuffle lowering by using AVX-512 EXPAND* instructions
This patch fixes PR31351: https://llvm.org/bugs/show_bug.cgi?id=31351

1.  This patch adds a new type of shuffle lowering.
2.  We can use the expand instruction when the shuffle pattern is as follows:
    { 0*a[0]0*a[1]...0*a[n], n >= 0, where the a[] elements are in ascending order }.
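
The pattern in item 2 can be recognized with a single pass over the shuffle mask. This is a minimal sketch under two assumptions that are not stated in the commit: zeroed lanes are marked with the hypothetical sentinel `ZeroLane`, and the non-zero lanes must read source elements 0, 1, 2, ... consecutively. `matchExpandMask` is an illustrative name, not the function added by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int ZeroLane = -1; // hypothetical marker for a lane that must be zero

// Returns true if Mask is zeros interleaved with source elements 0, 1, 2, ...
// in ascending order. On success, KMask has bit I set for every lane I that
// receives a source element.
bool matchExpandMask(const std::vector<int> &Mask, uint64_t &KMask) {
  assert(Mask.size() <= 64 && "one mask bit per lane");
  uint64_t Bits = 0;
  int NextSrc = 0;
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    if (Mask[I] == ZeroLane)
      continue;
    if (Mask[I] != NextSrc)
      return false;
    Bits |= uint64_t{1} << I;
    ++NextSrc;
  }
  KMask = Bits;
  return true;
}
```

For the mask { zero, a[0], zero, a[1] } this yields the write mask 0b1010. A mask such as { a[0], zero, a[2], zero } is rejected under the consecutive-elements assumption.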

Reviewers: 1. igorb
           2. guyblank
           3. craig.topper
           4. RKSimon

Differential Revision: https://reviews.llvm.org/D28352

llvm-svn: 291584
2017-01-10 18:57:17 +00:00
Chad Rosier 3daffbf6a8 [AArch64] Add support for lowering bitreverse to the rbit instruction.
Differential Revision: https://reviews.llvm.org/D28379

llvm-svn: 291575
2017-01-10 17:20:33 +00:00
Simon Dardis 548a53f5ee [mips] Fix Mips MSA intrinsics
The usage of some MIPS MSA intrinsics that took immediates could crash LLVM
during lowering. This patch addresses that behaviour. Crucially, this patch
also makes the use of intrinsics with out-of-range immediates produce an
internal error.

The ld,st intrinsics would trigger an assertion failure for MIPS64 as their
lowering would attempt to add an i32 offset to an i64 pointer.

Reviewers: vkalintiris, slthakur

Differential Revision: https://reviews.llvm.org/D25438

llvm-svn: 291571
2017-01-10 16:40:57 +00:00
Simon Dardis 0e9e237310 [mips] Honour -mno-odd-spreg for vector splat (again)
Previously the lowering of FILL_FW would use the MSA128W register class when
performing a vector splat. Instead it should be honouring -mno-odd-spreg and
only use the even registers when performing a splat from word to vector
register.

Logical follow-on from r230235.

This fixes PR/31369.

A previous commit was missing the test case and had another differential
in it.

Reviewers: slthakur

Differential Revision: https://reviews.llvm.org/D28373

llvm-svn: 291566
2017-01-10 15:53:10 +00:00