Commit Graph

38893 Commits

Aakanksha Patil 464e4dc50f [AMDGPU] Add gfx1034 target
Differential Revision: https://reviews.llvm.org/D102306
2021-05-13 14:25:18 -04:00
cynecx 8ec9fd4839 Support unwinding from inline assembly
I've taken the following steps to add unwinding support from inline assembly:

1) Add a new `unwind` "attribute" (like `sideeffect`) to the asm syntax:

```
invoke void asm sideeffect unwind "call thrower", "~{dirflag},~{fpsr},~{flags}"()
    to label %exit unwind label %uexit
```

2) Add Bitcode writing/reading support + LLVM-IR parsing.

3) Emit EHLabels around inline assembly lowering (SelectionDAGBuilder + GlobalISel) when `InlineAsm::canThrow` is enabled.

4) Tweak InstCombineCalls/InlineFunction pass to not mark inline assembly "calls" as nounwind.

5) Add clang support by introducing a new clobber, "unwind", which lowers to `canThrow` being enabled.

6) Don't allow unwinding callbr.

Reviewed By: Amanieu

Differential Revision: https://reviews.llvm.org/D95745
2021-05-13 19:13:03 +01:00
Stefan Pintilie 54310fc176 [PowerPC] Add ROP Protection to prologue and epilogue
Added hashst to the prologue and hashchk to the epilogue.
The hash for the prologue and epilogue must always be stored as the first
element in the local variable space on the stack.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D99377
2021-05-13 12:54:44 -05:00
David Green 1011d4ed60 [ARM] Constrain CMPZ shift combine to a single use
We currently prefer t2CMPrs over t2CMPri when the node contains a shift.
This can introduce extra nodes if the shift has multiple uses, though, as
the value from the shift will be needed anyway, and a t2CMPri compared
with zero is more readily removed entirely.
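
A minimal IR sketch of the multi-use case (not from the patch; names are illustrative):

```
define i32 @multi_use_shift(i32 %x, i32 %y) {
  %s = shl i32 %x, 2            ; the shift has a second use below
  %c = icmp eq i32 %s, 0        ; compare-with-zero of the shift
  %r = select i1 %c, i32 %y, i32 %s
  ret i32 %r
}
```

Here folding the shift into the compare would not remove the shift itself, since %s is still needed by the select.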

Differential Revision: https://reviews.llvm.org/D101688
2021-05-13 18:31:01 +01:00
Stanislav Mekhanoshin 8f98356bb5 [AMDGPU] Only allow global fp atomics with unsafe option
Previously we allowed the use of FP atomics without the
-amdgpu-unsafe-fp-atomics option if the scope was narrower than
system. This is just as unsafe if we have UC memory.

This change only allows global and flat FP atomics with
the unsafe option. Consequently, that makes the check for
denorm mode redundant, since we skip it with the unsafe
option and have no way to produce these instructions
without it anyway.
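
A sketch of the kind of global FP atomic now gated behind the unsafe option (illustrative, not from the patch):

```
define float @fadd_global(float addrspace(1)* %ptr, float %val) {
  %old = atomicrmw fadd float addrspace(1)* %ptr, float %val syncscope("agent") seq_cst
  ret float %old
}
```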

Differential Revision: https://reviews.llvm.org/D102347
2021-05-13 08:52:20 -07:00
Bradley Smith b1a074951f [AArch64][SVE] Fix missed immediate selection due to mishandling of signedness
The complex selection pattern for add/sub shifted immediates is
incorrect in its handling of incoming constant values, in that it
does not properly anticipate the values being sign-extended to
32 bits.

Co-authored-by: Graham Hunter <graham.hunter@arm.com>

Differential Revision: https://reviews.llvm.org/D101833
2021-05-13 16:02:49 +01:00
Juneyoung Lee 395607af3c Reapply [ConstantFold] Fold more operations to poison
This was reverted to mitigate miscompiles caused by the
logical and/or to bitwise and/or fold. Reapply it now that
the underlying issue has been fixed by D101191.

-----

This patch folds more operations to poison.

Alive2 proof: https://alive2.llvm.org/ce/z/mxcb9G (it does not contain tests about div/rem because they fold to poison when raising UB)
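
One of the affected cases, sketched as IR (the full list of folded operations is in the review):

```
define i32 @oversized_shift() {
  ; shift amount >= bit width: previously folded to undef, now poison
  %r = shl i32 1, 33
  ret i32 %r
}
```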

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D92270
2021-05-13 16:04:12 +02:00
Jinsong Ji b1509d067e [AIX] XFAIL CodeGen/Generic/externally_available.ll
Globals with "available_externally" linkage should never be emitted into the
object file corresponding to the LLVM module.

However, the AIX system assembler by default reports an error for undefined
references, so AIX chose to emit the available-externally symbols into the .s
file, so that users won't run into errors in situations like:

clang -target powerpc-ibm-aix -xc -<<<$'extern inline
__attribute__((__gnu_inline__)) void foo() {}\nvoid bar() { foo(); }' -O
-Xclang -disable-llvm-passes

Reviewed By: hubert.reinterpretcast

Differential Revision: https://reviews.llvm.org/D102377
2021-05-13 13:24:48 +00:00
Stefan Pintilie 15051f0b4a [PowerPC] Handle inline assembly clobber of link register
This patch adds the handling of clobbers of the link register LR for inline
assembly.
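
A minimal IR example of such a clobber (illustrative):

```
define void @clobber_lr() {
  call void asm sideeffect "bl foo", "~{lr}"()
  ret void
}
```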

This patch is to fix:
https://bugs.llvm.org/show_bug.cgi?id=50147

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D101657
2021-05-13 07:43:37 -05:00
Fraser Cormack 797e580db9 [RISCV][NFC] Simplify test run lines
Several tests had -verify-machineinstrs twice, and several tests were
explicitly specifying the default FileCheck prefix of CHECK.
2021-05-13 12:41:00 +01:00
Nemanja Ivanovic 39e4676ca7 [PowerPC] Provide doubleword vector predicate form comparisons on Power7
There are two reasons this shouldn't be restricted to Power8 and up:
1. For XL compatibility
2. Because clang will expand comparison operators to these intrinsics*

*Without this patch, the following causes a selection error:

int test(vector signed long a, vector signed long b) {
  return a < b;
}

This patch provides the handling for the intrinsics in the back
end and removes the Power8 guards from the predicate functions
(vec_{all|any}_{eq|ne|gt|ge|lt|le}).
2021-05-13 04:56:56 -05:00
Serge Pavlov 12537ab772 [FPEnv][X86] Implement lowering of llvm.set.rounding
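
The intrinsic being lowered takes the FLT_ROUNDS encoding of the rounding mode (0 = toward zero, 1 = to nearest, 2 = upward, 3 = downward); a minimal usage sketch:

```
declare void @llvm.set.rounding(i32)

define void @round_toward_zero() {
  call void @llvm.set.rounding(i32 0)
  ret void
}
```
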
Differential Revision: https://reviews.llvm.org/D74730
2021-05-13 14:30:38 +07:00
Sam Clegg 3041b16f73 [WebAssembly] Add TLS data segment flag: WASM_SEG_FLAG_TLS
Previously the linker was relying solely on the name of the segment
to imply TLS.

Differential Revision: https://reviews.llvm.org/D102202
2021-05-12 13:31:02 -07:00
Heejin Ahn ba38b72ec2 [WebAssembly] Allow Wasm EH with Emscripten SjLj
We explicitly made it error out in D101403, with the good intention that
the error message would leave people less confused. It turns out we
weren't failing on all cases of Wasm EH + SjLj; only a few cases failed,
and our client was able to get around them by fixing their source code.
But now we had made it fail for all cases, even the cases that previously
succeeded, which we didn't intend. This reverts that change.

Reviewed By: tlively

Differential Revision: https://reviews.llvm.org/D102364
2021-05-12 13:27:04 -07:00
Simon Pilgrim fb1d61b725 [X86][AVX] Fold concat(ps*lq(x,32),ps*lq(y,32)) -> shuffle(concat(x,y),zero) (PR46621)
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.

I was hoping to put this in LowerScalarImmediateShift, but performing it that early causes regressions where other instructions were re-splitting the subvectors.
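
A sketch of the kind of input now folded (illustrative):

```
define <4 x i64> @shl_by_32(<4 x i64> %x) {
  %r = shl <4 x i64> %x, <i64 32, i64 32, i64 32, i64 32>
  ret <4 x i64> %r
}
```

On AVX1 this can now lower as a single v8f32 shuffle interleaving the vector with zero, rather than splitting into two 128-bit shifts.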
2021-05-12 18:04:40 +01:00
Amara Emerson dc8d16c03f [AArch64][GlobalISel] Add MMOs to constant pool loads to allow LICM hoisting.
The lack of MMOs on these loads caused performance regressions vs SDAG on SingleSource/Benchmarks/Adobe-C++.
2021-05-12 09:47:09 -07:00
Baptiste Saleil 5885f1a4cb [AMDGPU] Disable the SIFormMemoryClauses pass at -O1
This patch disables the SIFormMemoryClauses pass at -O1. This pass has a
significant impact on compilation time, so we only want it to be enabled
starting from -O2.

Differential Revision: https://reviews.llvm.org/D101939
2021-05-12 11:51:59 -04:00
Simon Pilgrim 778562ada3 [X86][AVX] Add v4i64 shift-by-32 tests
AVX1 could perform this as a v8f32 shuffle instead of splitting - based off PR46621
2021-05-12 16:42:18 +01:00
Fraser Cormack c5ec00e62b [TargetLowering] Improve legalization of scalable vector types
This patch extends the vector type-conversion and legalization capabilities of
scalable vector types.

Firstly, `vscale x 1` types now behave more like the corresponding `vscale x
2+` types. This enables the integer promotion legalization of extended scalable
types, such as the promotion of `<vscale x 1 x i5>` to `<vscale x 1 x i8>`.

These `vscale x 1` types are also now better handled by
`getVectorTypeBreakdown`, where what looks like older handling for 1-element
fixed-length vector types was spuriously updated to include scalable types.

Widening of scalable types is now better supported, by using `INSERT_SUBVECTOR`
to insert the smaller scalable vector "value" type into the wider scalable
vector "part" type. This allows AArch64 to pass and return `vscale x 1` types
by value by widening.

There are still cases where we are unable to legalize `vscale x 1` types, such
as where expansion would require splitting the vector in two.
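
For instance, a function whose `<vscale x 1 x i5>` argument can now be legalized by promoting the type to `<vscale x 1 x i8>` (illustrative sketch):

```
define <vscale x 1 x i8> @promote_i5(<vscale x 1 x i5> %v) {
  %e = zext <vscale x 1 x i5> %v to <vscale x 1 x i8>
  ret <vscale x 1 x i8> %e
}
```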

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D102073
2021-05-12 16:33:07 +01:00
Stefan Pintilie 8d37411e48 Revert "[SelectionDAG][Mips][PowerPC][RISCV][WebAssembly] Teach computeKnownBits/ComputeNumSignBits about atomics"
This reverts commit 6c80361b84.
Breaks PowerPC Big Endian buildbots.
2021-05-12 09:46:18 -05:00
Hendrik Greving 762ac725bf [DAGCombiner] Fix DAG combine store elimination, different address space.
Fixes a bug in the DAG combiner that eliminated stores because it
failed to inspect the address space of the pointers:

%v = load i32, i32 addrspace(1)* %ptr_as1
; no chain side effect
store i32 %v, i32 addrspace(2)* %ptr_as2

as well as:

store i32 %v, i32 addrspace(1)* %ptr_as1
store i32 %v, i32 addrspace(2)* %ptr_as2

Fixes the X86 test for the above.

Differential Revision: https://reviews.llvm.org/D102096
2021-05-12 07:14:22 -07:00
Hendrik Greving 4b00ffa767 [DAGCombiner] Add test exposing bug in DAG combine.
Adds a test in X86 exposing a bug in DAG combine that eliminates stores
which store the same value but not to the same address space.

Differential Revision: https://reviews.llvm.org/D102243
2021-05-12 07:14:21 -07:00
Peter Waller 3fa6510f6e [CodeGen][AArch64][SVE] Fold [rdffr, ptest] => rdffrs; bugfix for optimizePTestInstr
When a ptest is used to set flags from the output of rdffr, the ptest
can be eliminated, using a flags-setting rdffrs instead.

Additionally, check that nothing consumes flags between rdffr and ptest;
this case appears to have been missed previously.

* There is no unpredicated RDFFRS instruction.
* If substituting RDFFR_PP, require that the mask argument of the
  PTEST matches that of the RDFFR_PP.
* Move some precondition code up inside optimizePTestInstr, so that it
  covers the new code paths for RDFFR which return earlier.
  * Only consider RDFFR, PTEST in same basic block.
  * Check for other flag setting instructions between the two, abort if
    found.
  * Drop an old TODO comment about removing dead PTEST instructions.

RDFFR_P to follow in later patch.

Differential Revision: https://reviews.llvm.org/D101357
2021-05-12 15:06:22 +01:00
Julien Pagès 46adccc5cc [AMDGPU] Improve Codegen for build_vector
Improve the code generation of build_vector.
Use the v_pack_b32_f16 instruction instead of
v_and_b32 + v_lshl_or_b32

Differential Revision: https://reviews.llvm.org/D98081

Patch by Julien Pagès!
2021-05-12 14:17:44 +01:00
Sanjay Patel f58e0513dd [x86] try harder to lower to PCMPGT instead of not-of-PCMPEQ
This is motivated by the example in https://llvm.org/PR50055 ,
but it doesn't do anything for that bug currently because we
don't actually have a zero-extended setcc there.

Proof for the generic transform (inverse of what we would
try to do in combining):
https://alive2.llvm.org/ce/z/aBL-Mg

Differential Revision: https://reviews.llvm.org/D102275
2021-05-12 08:25:29 -04:00
Sanjay Patel 24d06fff55 [x86] add test for pcmpeq with 0; NFC 2021-05-12 08:25:29 -04:00
Simon Pilgrim 72e242a286 [X86][AVX] canonicalizeShuffleMaskWithHorizOp - improve support for 256/512-bit vectors
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.

This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.

This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
2021-05-12 12:13:24 +01:00
Peter Waller 6e6f9a636b [AArch64][SVE] Improve sve.convert.to.svbool lowering
The sve.convert.to.svbool lowering has the effect of widening a logical
<M x i1> vector representing lanes into a physical <16 x i1> vector
representing bits in a predicate register.

In general, if converting to svbool, the contents of lanes in the
physical register might not be known. For sve.convert.to.svbool the new
lanes are specified to be zeroed, requiring 'and' instructions to mask
off the new lanes. For lanes coming from a ptrue or a comparison,
however, they are known to be zero.

CodeGen Before:
  ptrue p0.s, vl16
  ptrue p1.s
  ptrue p2.b
  and   p0.b, p2/z, p0.b, p1.b
  ret

After:
  ptrue p0.s, vl16
  ret

Differential Revision: https://reviews.llvm.org/D101544
2021-05-12 10:57:25 +01:00
Piotr Sobczak 68137ef568 [AMDGPU] Skip invariant loads when avoiding WAR conflicts
No need to handle invariant loads when avoiding WAR conflicts, as
there cannot be a vector store to the same memory location.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D101177
2021-05-12 10:57:05 +02:00
Tomas Matheson 34c098b780 [ARM] Prevent spilling between ldrex/strex pairs
Based on the same for AArch64: 4751cadcca

At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.
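
For reference, the kind of operation affected, as IR (illustrative):

```
define i32 @rmw_add(i32* %p, i32 %v) {
  %old = atomicrmw add i32* %p, i32 %v seq_cst
  ret i32 %old
}
```

At -O0 this was previously expanded to an IR-level ldrex/strex loop, into which the fast register allocator could insert spills.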

Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support. However, the 4-byte encodings
are not present for ARM v8-M Baseline. To enable this, two new pseudos are
added for Thumb which are only valid for v8mbase, tCMP_SWAP_8 and
tCMP_SWAP_16.

The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (adding another implementation of atomic operations and more
divergence between backends) I have chosen to follow the approach taken in
D101163.

Differential Revision: https://reviews.llvm.org/D101898

Depends on D101912
2021-05-12 09:43:21 +01:00
Tomas Matheson edf9d88266 [ARM] Precommit test for D101898
Differential Revision: https://reviews.llvm.org/D101912
2021-05-12 09:43:21 +01:00
Matt Arsenault cc79aaced0 AMDGPU: Fix SILoadStoreOptimizer for gfx90a
This was hardcoding the register class to use for the newly created
pointer registers, violating the aligned VGPR requirement.
2021-05-11 21:26:43 -04:00
Matt Arsenault a15ed701ab AMDGPU: Fix assert on constant load from addrspacecasted pointer
This was trying to create a bitcast between different address spaces.
2021-05-11 20:12:20 -04:00
Matt Arsenault 24e2e5df0e GlobalISel: Split ValueHandler into assignment and emission classes
Currently the ValueHandler handles both selecting the type and
location for arguments, as well as inserting instructions needed to
handle them. Split this so that the determination of the argument
handling is independent of the function state. Currently the checks
for tail call compatibility do not follow the full assignment logic,
so it misses cases where arguments require nontrivial legalization.

This should help avoid targets ending up in a buggy state where the
argument evaluation may change in different contexts.
2021-05-11 19:50:12 -04:00
Austin Kerbow 4433f4601e [AMDGPU] Fix extra waitcnt being added with BUFFER_INVL2
The waitcnt pass would increment the number of vmem events for some buffer
invalidates that were not handled by the pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D102252
2021-05-11 13:17:33 -07:00
Craig Topper d092dd56ae [RISCV] Regenerate stepvector.ll. NFC
It looks like the RV32 and RV64 prefixes were removed from the
RUN lines while another patch was in review that added check
lines that used them.
2021-05-11 13:04:57 -07:00
Albion Fung ffbffaf6b6 [PowerPC] Improve codegen for int-to-fp conversion of subword vector extract
When an integer is converted into floating point in a subword vector extract,
it can be done in 2 instructions instead of the 3+ instructions it generates
right now. This patch removes the unnecessary generation.

Differential Revision: https://reviews.llvm.org/D100604
2021-05-11 15:00:11 -05:00
Amara Emerson 69069509b2 [AArch64][GlobalISel] Mark target generic instructions as HasNoSideEffects.
One test needed updating because the newly side-effect-free instructions were
now being DCE'd.
2021-05-11 12:38:53 -07:00
Amara Emerson ae2b36e8bd [AArch64][GlobalISel] Support truncstorei8/i16 w/ combine to form truncating G_STOREs.
This needs some tablegen changes so that we can actually import the patterns properly.

Differential Revision: https://reviews.llvm.org/D102204
2021-05-11 11:33:03 -07:00
Fangrui Song ec27c5f170 [RISCV] Prefer to lower MC_GlobalAddress operands to .Lfoo$local
Similar to X86 D73230 and AArch64 D101872

With this change, we can set dso_local in clang's -fpic -fno-semantic-interposition mode,
for default visibility external linkage non-ifunc-non-COMDAT definitions.

For such dso_local definitions, variable access/taking the address of a
function/calling a function will go through a local alias to avoid GOT/PLT.
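
A sketch of a definition this applies to (illustrative):

```
; default visibility, external linkage, non-ifunc, non-COMDAT:
; with -fpic -fno-semantic-interposition this is marked dso_local,
; and references go through a .Lfoo$local alias instead of the GOT/PLT.
define dso_local i32 @foo() {
  ret i32 42
}
```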

Reviewed By: jrtc27, luismarques

Differential Revision: https://reviews.llvm.org/D101875
2021-05-11 11:29:45 -07:00
Simon Pilgrim 4f80340fb6 [X86][SSE] Add tests for permute(phaddw(phaddw(x,y),phaddw(z,w))) -> phaddw(phaddw(),phaddw()) folds.
We currently only fold if NumEltsPerLane == 4
2021-05-11 17:47:10 +01:00
Craig Topper ce6e4f27dd [RISCV] Use fractional LMULs for fixed length types smaller than riscv-v-vector-bits-min.
My thought process is that if v2i64 is an LMUL=1 type then v2i32
should be an LMUL=1/2 type. We limit the fractional LMUL so that
SEW=64 clips to LMUL=1, SEW=32 clips to LMUL=1/2, etc. This
ensures there's always a fractional LMUL available to truncate a type.
This does reduce the number of vsetvlis in some cases.

Some tests increase vsetvlis because the best container type for a
mask type is dependent on the LMUL+SEW that the mask was produced
from, but you can't tell that from the type. I think this is
something we need to solve in the machine IR when optimizing
vsetvlis.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D101215
2021-05-11 09:42:48 -07:00
Roman Lebedev 5f78ba001c [X86][Codegen] Shift amount mod: sh? i64 x, (32-y) --> sh? i64 x, -(y+32)
I've seen this in RawSpeed's BitPumpMSB*::push() hotpath,
after fixing the buffer abstraction to a saner one,
when looking into a +5% runtime regression.
I was hoping that this would fix it, but it does not look like it does.

This seems to be at least no worse than the original pattern.
But I'm actually mainly interested in the case where we already
compute `(y+32)` (see the last test):
https://alive2.llvm.org/ce/z/ZCzJio
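
Since x86 masks 64-bit shift amounts mod 64, (32 - y) and -(y + 32) are congruent mod 64, so the negated form can reuse an already-computed (y + 32). A sketch of the input pattern (illustrative):

```
define i64 @shl_32_minus_y(i64 %x, i64 %y) {
  %amt = sub i64 32, %y
  %r = shl i64 %x, %amt
  ret i64 %r
}
```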

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D101944
2021-05-11 19:39:41 +03:00
Craig Topper dc00cbb505 [RISCV] Match trunc_vector_vl+sra_vl/srl_vl with splat shift amount to vnsra/vnsrl.
Limited to splats because we would need to truncate the shift
amount vector otherwise.
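
Roughly the kind of fixed-vector input that can now select a narrowing shift (illustrative sketch, not from the patch):

```
define <4 x i32> @narrow_sra(<4 x i64> %x, i64 %y) {
  %head = insertelement <4 x i64> undef, i64 %y, i32 0
  %splat = shufflevector <4 x i64> %head, <4 x i64> undef, <4 x i32> zeroinitializer
  %sh = ashr <4 x i64> %x, %splat
  %r = trunc <4 x i64> %sh to <4 x i32>
  ret <4 x i32> %r
}
```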

I tried to do this with new ISD nodes and a DAG combine to
avoid such a large pattern, but we don't form the splat until
LegalizeDAG and need DAG combine to remove a scalable->fixed->scalable
cast before it becomes visible to the shift node. By the time that
happens we've already visited the truncate node and won't revisit it.

I think I have an idea how to improve i64 on RV32 that I'll save for a
follow-up.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D102019
2021-05-11 09:29:31 -07:00
Roman Lebedev 2c1f9f390b [NFC][X86] Precommit another testcase for D101944 2021-05-11 18:34:43 +03:00
Simon Pilgrim 9acc03ad92 [X86][SSE] Replace foldShuffleOfHorizOp with generalized version in canonicalizeShuffleMaskWithHorizOp
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffles masks, float/integer domains and support shuffle-of-packs as well.

The next step will be to support 256/512-bit vector cases.
2021-05-11 14:18:45 +01:00
Piotr Sobczak 09fe84abb4 [AMDGPU] Move code sinking before structurizer
Moving the code sinking pass before the structurizer creates more sinking
opportunities.

The extra flow edges introduced by the structurizer can have adverse
effects on sinking, because the sinking pass prefers moving instructions
to blocks with unique predecessors and the structurizer destroys that
property in some cases.

A notable example is moving high-latency image instructions across kills.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D101115
2021-05-11 14:07:23 +02:00
Stefan Pintilie c79bc5942d [PowerPC][Bug] Fix Bug in Stack Frame Update Code
The stack frame update code does not take into consideration spilling
to registers for callee saved registers. The option -ppc-enable-pe-vector-spills
turns on spilling to registers for callee saved registers and may expose a bug
in the code that moves a stack frame pointer update instruction.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D101366
2021-05-11 05:54:07 -05:00
Denis Antrushin df47368d40 [RegAllocFast] properly handle STATEPOINT instruction.
STATEPOINT is a fancy and complex pseudo instruction which
has both tied defs and a regmask operand.

The basic FastRA algorithm is as follows:

1. Mark registers used by defs as free
2. If instruction has regmask operand displace clobbered registers
   according to regmask.
3. Assign registers for use operands.

In the case of tied defs, step 1 is replaced with allocating registers
for them. But the regmask is still processed, which may displace already
allocated registers. As a result, a tied use and def will get assigned
to different registers.

This patch makes FastRA process an instruction's RegMask (if any) when
checking for physical register interference.
That way tied operands won't get registers clobbered by the regmask.

Reviewed By: arsenm, skatkov
Differential Revision: https://reviews.llvm.org/D99284
2021-05-11 17:27:00 +07:00
Jay Foad 3b873831c4 [AMDGPU] Add some GFX10.3 testing. NFC. 2021-05-11 11:21:19 +01:00