Commit Graph

64968 Commits

Author SHA1 Message Date
Bradley Smith cfe22cd4ef [AArch64][SVE] Add new ld<n> intrinsics that return a struct of vscale types
This will allow us to reuse existing interleaved load logic in
lowerInterleavedLoad that exists for neon types, but for SVE fixed
types.

The goal eventually will be to replace the existing ld<n> intriniscs
with these, once a migration path has been sorted out.

Differential Revision: https://reviews.llvm.org/D112078
2021-10-22 14:13:17 +00:00
Roman Lebedev 8fac9e95ad
[X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members
By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does, is naively scales the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may be over/under -estimate.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307
2021-10-22 16:33:58 +03:00
Jay Foad 74cd4dee20 [AMDGPU] Preserve deadness of vcc when shrinking instructions
This doesn't have any effect on codegen now, but it might do in the
future if we shrink instructions before post-RA scheduling, which is
sensitive to live vs dead defs.

Differential Revision: https://reviews.llvm.org/D112305
2021-10-22 14:22:24 +01:00
Simon Pilgrim a750332d77 AMDGPULibCalls - constify some FuncInfo& arguments. NFCI. 2021-10-22 12:10:58 +01:00
Simon Pilgrim 99a64cc9da AMDGPULibCalls::parseFunctionName - use reference instead of pointer. NFCI.
parseFunctionName allowed a default null pointer, despite it being dereferenced immediately to be used as a reference and that all callers were taking the address of an existing reference.

Fixes static analyzer warning about potential dereferenced nulls
2021-10-22 11:45:25 +01:00
Fraser Cormack 74c6895b39 [RISCV] Fix missing cross-block VSETVLI insertion
This patch fixes a codegen bug, the test for which was introduced in
D112223.

When merging VSETVLIInfo across blocks, if the 'exit' VSETVLIInfo
produced by a block is found to be compatible with the VSETVLIInfo
computed as the intersection of the 'exit' VSETVLIInfo produced by the
block's predecessors, that blocks' 'exit' info is discarded and the
intersected value is taken in its place.

However, we have one authority on what constitutes VSETVLIInfo
compatibility and we are using it in two different contexts.

Compatibility is used in one context to elide VSETVLIs between
straight-line vector instructions. But compatibility when evaluated
between two blocks' exit infos ignores any info produced *inside* each
respective block before the exit points. As such it does not guarantee
that a block will not produce a VSETVLI which is incompatible with the
'previous' block.

As such, we must ensure that any merging of VSETVLIInfo is performed
using some notion of "strict" compatibility. I've defined this as a full
vtype match, but this is perhaps too pessimistic. Given that test
coverage in this regard is lacking -- the only change is in the failing
test -- I think this is a good starting point.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D112228
2021-10-22 10:45:10 +01:00
Chen Zheng 86a5c32616 [PowerPC] iterate on the SmallSet directly; NFC 2021-10-22 06:18:07 +00:00
Chen Zheng 13755436bb [PowerPC] return early if there is no preparing candidate in the loop; NFC
This is to improve compiling time.

Differential Revision: https://reviews.llvm.org/D112196

Reviewed By: jsji
2021-10-22 05:39:51 +00:00
Stanislav Mekhanoshin ca0c92d6a1 [AMDGPU] Allow to use a whole register file on gfx90a for VGPRs
In a kernel which does not have calls or AGPR usage we can allocate
the whole vector register budget for VGPRs and have no AGPRs as
long as VGPRs stay addressable (i.e. below 256).

Differential Revision: https://reviews.llvm.org/D111764
2021-10-21 18:24:34 -07:00
Jessica Paquette 5dc339d982 [AArch64][GlobalISel] Fold 64-bit cmps with 64-bit adds
G_ICMP is selected to an arithmetic overflow op (ADDS/SUBS/etc) with a dead
destination + a CSINC instruction.

We have a fold which allows us to combine 32-bit adds with G_ICMP.

The problem with G_ICMP is that we model it as always having a 32-bit
destination even though it can be a 64-bit operation. So, we were missing some
opportunities for 64-bit folds.

This patch teaches the fold to recognize 64-bit G_ICMPs + refactors some of
the code surrounding CSINC accordingly.

(Later down the line, I think we should probably change the way we handle G_ICMP
in general.)

Differential Revision: https://reviews.llvm.org/D111088
2021-10-21 13:51:44 -07:00
Yonghong Song 0472e83ffc BPF: emit BTF_KIND_DECL_TAG for typedef types
If a typedef type has __attribute__((btf_decl_tag("str"))) with
bpf target, emit BTF_KIND_DECL_TAG for that type in the BTF.

Differential Revision: https://reviews.llvm.org/D112259
2021-10-21 12:09:42 -07:00
Craig Topper d55be79d75 [RISCV] Expand scalable vector CTTZ/CTLZ/CTPOP.
Differential Revision: https://reviews.llvm.org/D112233
2021-10-21 10:50:04 -07:00
Anirudh Prasad aa3519f178 [SystemZ][z/OS] Initial implementation for lowerCall on z/OS
- This patch provides the initial implementation for lowering a call on z/OS according to the XPLINK64 calling convention
- A series of changes have been made to SystemZCallingConv.td to account for these additional XPLINK64 changes including adding a new helper function to shadow the stack along with allocation of a register wherever appropriate
- For the cases of copying a f64 to a gr64 and a f128 / 128-bit vector type to a gr64, a `CCBitConvertToType` has been added and has been bitcasted appropriately in the lowering phase
- Support for the ADA register (R5) will be provided in a later patch.

Reviewed By: uweigand

Differential Revision: https://reviews.llvm.org/D111662
2021-10-21 09:48:59 -04:00
David Sherwood 9448cdc900 [SVE][Analysis] Tune the cost model according to the tune-cpu attribute
This patch introduces a new function:

  AArch64Subtarget::getVScaleForTuning

that returns a value for vscale that can be used for tuning the cost
model when using scalable vectors. The VScaleForTuning option in
AArch64Subtarget is initialised according to the following rules:

1. If the user has specified the CPU to tune for we use that, else
2. If the target CPU was specified we use that, else
3. The tuning is set to "generic".

For CPUs of type "generic" I have assumed that vscale=2.

New tests added here:

  Analysis/CostModel/AArch64/sve-gather.ll
  Analysis/CostModel/AArch64/sve-scatter.ll
  Transforms/LoopVectorize/AArch64/sve-strict-fadd-cost.ll

Differential Revision: https://reviews.llvm.org/D110259
2021-10-21 09:33:50 +01:00
Sanjay Patel 40163f1df8 [x86] add special-case lowering for usubsat for AVX512
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.

That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.

Differential Revision: https://reviews.llvm.org/D112138
2021-10-20 16:41:13 -04:00
Stanislav Mekhanoshin 6185835656 [AMDGPU] Allow rematerialization of SOP with virtual registers
D106408 was doing this for all targets although it was
reverted due to couple performance regressions on some targets.
The difference for AMDGPU is the ability to rematerialize SOP
instructions with virtual register uses like we already do for VOP.

Differential Revision: https://reviews.llvm.org/D110743
2021-10-20 11:46:50 -07:00
Zhi An Ng e1fb13401e [WebAssembly] Add prototype relaxed float min max instructions
Add relaxed. f32x4.min, f32x4.max, f64x2.min, f64x2.max. These are only
exposed as builtins, and require user opt-in.

Differential Revision: https://reviews.llvm.org/D112146
2021-10-20 09:41:51 -07:00
Craig Topper fe1f0de003 [RISCV][WebAssembly][TargetLowering] Allow expandCTLZ/expandCTTZ to rely on CTPOP expansion for vectors.
Our fallback expansion for CTLZ/CTTZ relies on CTPOP. If CTPOP
isn't legal or custom for a vector type we would scalarize the
CTLZ/CTTZ. This is different than CTPOP itself which would use a
vector expansion.

This patch teaches expandCTLZ/CTTZ to rely on the vector CTPOP
expansion instead of scalarizing. To do this I had to add additional
checks to make sure the operations used by CTPOP expansions are all
supported. Some of the operations were already needed for the CTLZ/CTTZ
expansion.

This is a huge improvement to the RISCV which doesn't have a scalar
ctlz or cttz in the base ISA.

For WebAssembly, I've added Custom lowering to keep the scalarizing
behavior. I've also extended the scalarizing to CTPOP.

Differential Revision: https://reviews.llvm.org/D111919
2021-10-20 07:46:41 -07:00
Sanjay Patel 3efd2a0bec [x86] make helper for useVPTERNLOG; NFC
See D112085 for another use case.
2021-10-20 10:26:53 -04:00
Simon Pilgrim 5b395bd633 [CostModel][X86] Add costs for multiply-by-pow2 constants
These are folded to left shifts in the backend.

We should be able to extend this for multiply-by-negpow2 after D111968 has landed to resolve PR51436
2021-10-20 13:11:21 +01:00
Simon Pilgrim 9fc523d114 [X86] Remove X86ProcFamilyEnum::IntelSLM
Replace X86ProcFamilyEnum::IntelSLM enum with a TuningUseSLMArithCosts flag instead, matching what we already do for Goldmont.

This just leaves X86ProcFamilyEnum::IntelAtom to replace with general Tuning/Feature flags and we can finally get rid of the old X86ProcFamilyEnum enum.

Differential Revision: https://reviews.llvm.org/D112079
2021-10-20 11:58:39 +01:00
Daniel Kiss f903c85055 [AArch64] Emit .cfi_negate_ra_state for PAC-auth instructions.
autiasp, autibsp instructions are the counterpart of paciasp/pacibsp instructions
therefore let's emit .cfi_negate_ra_state for these too.
In case of Armv8.3 instruction set the retaa/retbb will do the return and authentication
in one step here we can't emit the . cfi_negate_ra_state because that would be point after
the ret* instruction.

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D111780
2021-10-20 11:03:52 +02:00
Joerg Sonnenberger ec428f7b78 [SPARC] Recognize the prefetch instruction
Reviewed By: LemonBoy

Differential Revision: https://reviews.llvm.org/D96311
2021-10-20 10:59:01 +02:00
Paulo Matos 6d0c7bc17d [WebAssembly] Implementation of table.get/set for reftypes in LLVM IR
This change implements new DAG nodes TABLE_GET/TABLE_SET, and lowering
methods for load and stores of reference types from IR arrays. These
global LLVM IR arrays represent tables at the Wasm level.

Differential Revision: https://reviews.llvm.org/D111154
2021-10-20 10:31:31 +02:00
Zi Xuan Wu de10a02fc0 [CSKY] Complete to add basic integer instruction set
Complete the basic integer instruction set and add related predictor in CSKY.td.
And it includes the instruction definition and asm parser support.

Differential Revision: https://reviews.llvm.org/D111701
2021-10-20 15:50:44 +08:00
Zhi An Ng 2542bfa43a [WebAssembly] Add prototype relaxed swizzle instructions
Add i8x16 relaxed_swizzle instructions. These are only
exposed as builtins, and require user opt-in.

Differential Revision: https://reviews.llvm.org/D112022
2021-10-19 17:53:04 -07:00
Yonghong Song cd40b5a712 BPF: set .BTF and .BTF.ext section alignment to 4
Currently, .BTF and .BTF.ext has default alignment of 1.
For example,
  $ cat t.c
    int foo() { return 0; }
  $ clang -target bpf -O2 -c -g t.c
  $ llvm-readelf -S t.o
    ...
    Section Headers:
    [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
    ...
    [ 7] .BTF              PROGBITS        0000000000000000 000167 00008b 00      0   0  1
    [ 8] .BTF.ext          PROGBITS        0000000000000000 0001f2 000050 00      0   0  1

But to have no misaligned data access, .BTF and .BTF.ext
actually requires alignment of 4. Misalignment is not an issue
for architecture like x64/arm64 as it can handle it well. But
some architectures like mips may incur a trap if .BTF/.BTF.ext
is not properly aligned.

This patch explicitly forced .BTF and .BTF.ext alignment to be 4.
For the above example, we will have
    [ 7] .BTF              PROGBITS        0000000000000000 000168 00008b 00      0   0  4
    [ 8] .BTF.ext          PROGBITS        0000000000000000 0001f4 000050 00      0   0  4

Differential Revision: https://reviews.llvm.org/D112106
2021-10-19 16:26:01 -07:00
Artem Belevich b6b7fe60a4 [NVPTX] Add a late SROA pass which allows optimizing away more allocas.
Fixes performance regression https://bugs.llvm.org/show_bug.cgi?id=52037

Differential Revision: https://reviews.llvm.org/D111471
2021-10-19 16:18:28 -07:00
Sanjay Patel 92a0389b04 [x86] add special-case lowering for usubsat for pre-SSE4
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)

This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD

Differential Revision: https://reviews.llvm.org/D112095
2021-10-19 17:13:16 -04:00
David Sherwood 5ea35791e6 [AArch64] Split out processor/tuning features
Following on from an earlier patch that introduced support for -mtune
for AArch64 backends, this patch splits out the tuning features
from the processor features. This gives us the ability to enable
architectural feature set A for a given processor with "-mcpu=A"
and define the set of tuning features B with "-mtune=B".

It's quite difficult to write a test that proves we select the
right features according to the tuning attribute because most
of these relate to scheduling. I have created a test here:

  CodeGen/AArch64/misched-fusion-addr-tune.ll

that demonstrates the different scheduling choices based upon
the tuning.

Differential Revision: https://reviews.llvm.org/D111551
2021-10-19 15:18:55 +01:00
David Sherwood 607fb1bb8c [AArch64] Always add -tune-cpu argument to -cc1 driver
This patch ensures that we always tune for a given CPU on AArch64
targets when the user specifies the "-mtune=xyz" flag. In the
AArch64Subtarget if the tune flag is unset we use the CPU value
instead.

I've updated the release notes here:

  llvm/docs/ReleaseNotes.rst

and added tests here:

  clang/test/Driver/aarch64-mtune.c

Differential Revision: https://reviews.llvm.org/D110258
2021-10-19 14:57:51 +01:00
Simon Pilgrim 71e39e3f18 [ADT] Add APInt::isNegatedPowerOf2() helper
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....

Differential Revision: https://reviews.llvm.org/D111998
2021-10-19 14:38:21 +01:00
Hsiangkai Wang facff468b6 [RISCV] Reorder the vector register allocation order.
GPR uses argument registers as the first group of registers to allocate.
This patch uses vector argument registers, v8 to v23, as the first group
to allocate.

Differential Revision: https://reviews.llvm.org/D111304
2021-10-19 09:30:13 +08:00
Anshil Gandhi 0567f03331 [HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.

Reviewed By: yaxunl, #amdgpu

Differential Revision: https://reviews.llvm.org/D109707
2021-10-18 16:53:15 -06:00
Simon Pilgrim a83384498b [X86] combineMulToPMADDWD - replace ASHR(X,16) -> LSHR(X,16)
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.

The lshr also has a better chance of folding with shuffles etc.
2021-10-18 22:12:56 +01:00
Matt Morehouse 431a5d8411 [x86] Implement a tagged-globals backend feature.
The feature tells the backend to allow tags in the upper bits of global
variable addresses.  These tags will be ignored by upcoming CPUs with
the Intel LAM feature but may be used in instrumentation passes (e.g.,
HWASan).

This patch implements the feature by using @GOTPCREL relocations instead
of direct references to the locally defined global.  Thus the full
tagged address can be loaded by a single instruction:
  movq global@GOTPCREL(%rip), %rax

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D111343
2021-10-18 13:31:10 -07:00
Alexandros Lamprineas 04dc68710a [DebugInfo][ARM] Fix incorrect debug information for RWPI accessed globals
When compiling for the RWPI relocation model the debug information is wrong:

* the debug location is described as { DW_OP_addr Var }
  instead of { DW_OP_constNu Var DW_OP_bregX 0 DW_OP_plus }
* the relocation type is R_ARM_ABS32 instead of R_ARM_SBREL32

Differential Revision: https://reviews.llvm.org/D111404
2021-10-18 21:29:46 +01:00
Yonghong Song f4a8526cc4 [NFC][BPF] fix comments and rename functions related to BTF_KIND_DECL_TAG
There are no functionality change.
Fix some comments and rename processAnnotations() to
processDeclAnnotations() to avoid confusion when later
BTF_KIND_TYPE_TAG is introduced (https://reviews.llvm.org/D111199).
2021-10-18 10:43:45 -07:00
Yonghong Song e9e4fc0fd3 BPF: fix a bug in IRPeephole pass
Commit 009f3a89d8 ("BPF: remove intrindics @llvm.stacksave()
and @llvm.stackrestore()") implemented IRPeephole pass to remove
llvm.stacksave()/stackrestore() instrinsics.
Buildbot reported a failure:
  UNREACHABLE executed at ../lib/IR/LegacyPassManager.cpp:1445!
which is:
  llvm_unreachable("Pass modifies its input and doesn't report it");

The code has changed but the implementation didn't return true
for changing. This patch fixed this problem.
2021-10-18 10:18:24 -07:00
Craig Topper 84d9bc51a3 [RISCV] Rewrite forwardCopyWillClobberTuple to not assume that there are exactly 32 registers. NFC
This function was copied from ARM where register pairs/triples/quads can wrap around the 32 encoding space. So register 31 can pair with register 0. This is not true for RISCV vectors. The spec specifically mentions the possibility of a future encoding that has more than 32 registers.

This patch removes the modulo from the code and directly checks that destination register is in the source register range and not the beginning of the range. Though I don't expect an identity copy will occur.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D111467
2021-10-18 09:57:38 -07:00
Yonghong Song 009f3a89d8 BPF: remove intrindics @llvm.stacksave() and @llvm.stackrestore()
Paul Chaignon reported a bpf verifier failure ([1]) due to using
non-ABI register R11. For the test case, llvm11 is okay while
llvm12 and later generates verifier unfriendly code.

The failure is related to variable length array size.
The following mimics the variable length array definition
in the test case:

struct t { char a[20]; };
void foo(void *);
int test() {
   const int a = 8;
   char tmp[AA + sizeof(struct t) + a];
   foo(tmp);
   ...
}

Paul helped bisect that the following llvm commit is
responsible:

552c6c2328 ("PR44406: Follow behavior of array bound constant
              folding in more recent versions of GCC.")

Basically, before the above commit, clang frontend did constant
folding for array size "AA + sizeof(struct t) + a" to be 68,
so used alloca for stack allocation. After the above commit,
clang frontend didn't do constant folding for array size
any more, which results in a VLA and llvm.stacksave/llvm.stackrestore
is generated.

BPF architecture API does not support stack pointer (sp) register.
The LLVM internally used R11 to indicate sp register but it should
not be in the final code. Otherwise, kernel verifier will reject it.

The early patch ([2]) tried to fix the issue in clang frontend.
But the upstream discussion considered frontend fix is really a
hack and the backend should properly undo llvm.stacksave/llvm.stackrestore.
This patch implemented a bpf IR phase to remove these intrinsics
unconditionally. If eventually the alloca can be resolved with
constant size, r11 will not be generated. If alloca cannot be
resolved with constant size, SelectionDag will complain, the same
as without this patch.

 [1] https://lore.kernel.org/bpf/20210809151202.GB1012999@Mem/
 [2] https://reviews.llvm.org/D107882

Differential Revision: https://reviews.llvm.org/D111897
2021-10-18 09:51:19 -07:00
Kazu Hirata 8568ca789e Use llvm::erase_if (NFC) 2021-10-18 09:33:42 -07:00
Jessica Clarke f5755c0849 [Mips] Add glue between CopyFromReg, CopyToReg and RDHWR nodes for TLS
The MIPS ABI requires the thread pointer be accessed via rdhwr $3, $r29.
This is currently represented by (CopyToReg $3, (RDHWR $29)) followed by
a (CopyFromReg $3). However, there is no glue between these, meaning
scheduling can break those apart. In particular, PR51691 is a report
where PseudoSELECT_I was moved to between the CopyToReg and CopyFromReg,
and since its expansion uses branches, it split the def and use of the
physical register between two basic blocks, resulting in the def being
eliminated and the use having no def. It also seems possible that a
similar situation could arise splitting up the CopyToReg from the RDHWR,
causing the RDHWR to use a destination register other than $3, violating
the ABI requirement.

Thus, add glue between all three nodes to ensure they aren't split up
during instruction selection. No regression test is added since any test
would be implictly relying on specific scheduling behaviour, so whilst
it might be testing that glue is preventing reordering today, changes to
scheduling behaviour could result in the test no longer being able to
catch a regression here, as the reordering might no longer happen for
other unrelated reasons.

Fixes PR51691.

Reviewed By: atanasyan, dim

Differential Revision: https://reviews.llvm.org/D111967
2021-10-18 15:10:20 +01:00
Kerry McLaughlin ac4e01ea0e [SVE][CodeGen] Fix predicate for add/sub + element count patterns
The patterns added in D111441 should use the HasSVEorStreamingSVE
predicate. This changes one incorrect use of HasSVE with the new
patterns.
2021-10-18 14:42:29 +01:00
Andrew Wei f5056c8c16 [AArch64] Improve shuffle vector by using wider types
Try to widen element type to get a new mask value for a better permutation
sequence, so that we can use NEON shuffle instructions, such as zip1/2,
UZP1/2, TRN1/2, REV, INS, etc.
For example:
  shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 6, i32 7, i32 2, i32 3>
is equivalent to:
  shufflevector <2 x i64> %a, <2 x i64> %b, <2 x i32> <i32 3, i32 1>
Finally, we can get:
  mov     v0.d[0], v1.d[1]

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D111619
2021-10-18 21:24:45 +08:00
Simon Pilgrim f041338153 [X86][Costmodel] Add SSE2 sub-128bit vXi32/f32 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:46:10 +01:00
Simon Pilgrim c850d5c5c8 [X86][Costmodel] Add SSE2 sub-128bit vXi8/16 stride 2 interleaved store costs
Differential Revision: https://reviews.llvm.org/D111941
2021-10-18 13:15:14 +01:00
Jay Foad d55db4b033 [AMDGPU] Remove unused VirtRegMap analysis. NFC. 2021-10-18 11:55:40 +01:00
Jay Foad a129932b0d [AMDGPU] Add link to bug 2021-10-18 10:33:42 +01:00
Jay Foad 012248b0bc Remove the verifyAfter mechanism that was replaced by D111397
Differential Revision: https://reviews.llvm.org/D111872
2021-10-18 10:26:46 +01:00
Jay Foad 36deb9a670 Add new MachineFunction property FailsVerification
TargetPassConfig::addPass takes a "bool verifyAfter" argument which lets
you skip machine verification after a particular pass. Unfortunately
this is used in generic code in TargetPassConfig itself to skip
verification after a generic pass, only because some previous target-
specific pass damaged the MIR on that specific target. This is bad
because problems in one target cause lack of verification for all
targets.

This patch replaces that mechanism with a new MachineFunction property
called "FailsVerification" which can be set by (usually target-specific)
passes that are known to introduce problems. Later passes can reset it
again if they are known to clean up the previous problems.

Differential Revision: https://reviews.llvm.org/D111397
2021-10-18 10:26:46 +01:00
Piotr Sobczak d869921004 [AMDGPU] Add patterns for i8/i16 local atomic load/store
Add patterns for i8/i16 local atomic load/store.

Added tests for new patterns.

Copied atomic_[store/load]_local.ll to GlobalISel directory.

Differential Revision: https://reviews.llvm.org/D111869
2021-10-18 11:23:10 +02:00
Luo, Yuanke 942536ac08 [X86] Prefer VEX encoding in X86 assembler.
This patch is to order the AVX instructions ahead of AVX512 instructions
in the matching table so that the AVX instructions can be matched first.
Thanks Craig and Shengchen for the idea.

Differential Revision: https://reviews.llvm.org/D111538
2021-10-18 16:54:11 +08:00
Stanislav Mekhanoshin 7cdb1df8c7 [AMDGPU] Divergence driven selection for fused bitlogic
The change adds divergence predicates for fused logical operations.
The problem with selecting a scalar fused op such as S_NOR_B32 is
that it does not have a VALU counterpart and will be split in
moveToVALU. At the same time it prevents selection of a better
opcode on the VALU side (such as V_OR3_B32) which does not have a
counterpart on SALU side.

XNOR opcodes are left as is and selected as scalar to get advantage
of the SIInstrInfo::lowerScalarXnor() code which can commute
operations to keep one of two opcodes on SALU if possible. See
xnor.ll test for this.

Differential Revision: https://reviews.llvm.org/D111907
2021-10-18 01:44:25 -07:00
Jingu Kang 3f0b178de2 [AArch64] Fixed a bug on AArch64MIPeepholeOpt
Create new virtual register for the definition of new AND instruction and
replace old register by the new one to keep SSA form.

Differential Revision: https://reviews.llvm.org/D109963
2021-10-18 08:55:42 +01:00
Qiu Chaofan 67c64d8337 [PowerPC] Implement scheduling model for Power10
Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D110855
2021-10-18 15:27:49 +08:00
Simon Pilgrim 0bb32b1b21 [X86][SLM] Fix BitTest+Set uops + port usage
Both ports are required for BitTest ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures and what Intel AoM / Agner reports as well.
2021-10-17 18:13:15 +01:00
Simon Pilgrim 5ed5df4802 [X86][SLM] Fix uops for PCMPISTR/PCMPISTR instructions
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 680afaaa5d [X86][SLM] Fix uops for PCLMULQDQ
Based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Simon Pilgrim 498c7236bc [X86][SLM] +1uop for PSHUFBrm xmm
Extra 1uop for folded pshufb ops, based off a recent llvm-exegesis capture and what Intel AoM / Agner reports as well.
2021-10-17 18:13:14 +01:00
Roman Lebedev 91373bf12e
[X86][Costmodel] Load/store i64 Stride=4 VF=16 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So could pick cost of `40`

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick cost of `40`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
2021-10-17 17:28:10 +03:00
Roman Lebedev 3274ce3a28
[X86][Costmodel] Load/store i64 Stride=2 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
2021-10-17 17:28:10 +03:00
Roman Lebedev 3a6a9f74d3
[X86][Costmodel] Load/store i32 Stride=4 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So could pick cost of `68`

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
2021-10-17 17:28:09 +03:00
Roman Lebedev 4b76a74b42
[X86][Costmodel] Load/store i32 Stride=3 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
2021-10-17 17:28:09 +03:00
Roman Lebedev 887acf6842
[X86][Costmodel] Load/store i16 Stride=6 VF=32 interleaving costs
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So could pick cost of `212`

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
2021-10-17 17:28:09 +03:00
David Truby 2e0fb007d6 [llvm][AArch64][SVE] Fold literals into math instructions
SVE has predicated literal forms of some instructions for specific
literals, which currently are generated correctly when using ACLE
but not when those instructions are generated directly.

This adds the patterns to generate those instructions when
generating from standard LLVM IR instructions.

Differential Revision: https://reviews.llvm.org/D99074
2021-10-17 10:57:04 +00:00
Kito Cheng ff13189c5d [RISCV] Unify the arch string parsing logic to to RISCVISAInfo.
How many place you need to modify when implementing a new extension for RISC-V?

At least 7 places as I know:

- Add new SubtargetFeature at RISCV.td
- -march parser in RISCV.cpp
- RISCVTargetInfo::initFeatureMap@RISCV.cpp for handling feature vector.
- RISCVTargetInfo::getTargetDefines@RISCV.cpp for pre-define marco.
- Arch string parser for ELF attribute in RISCVAsmParser.cpp
- ELF attribute emittion in RISCVAsmParser.cpp, and make sure it's in
  canonical order...
- ELF attribute emittion in RISCVTargetStreamer.cpp, and again, must in
  canonical order...

And now, this patch provide an unified infrastructure for handling (almost)
everything of RISC-V arch string.

After this patch, you only need to update 2 places for implement an extension
for RISC-V:
- Add new SubtargetFeature at RISCV.td, hmmm, it's hard to avoid.
- Add new entry to RISCVSupportedExtension@RISCVISAInfo.cpp or
  SupportedExperimentalExtensions@RISCVISAInfo.cpp .

Most codes are come from existing -march parser, but with few new feature/bug
fixes:
- Accept version for -march, e.g. -march=rv32i2p0.
- Reject version info with `p` but without minor version number like `rv32i2p`.

Differential Revision: https://reviews.llvm.org/D105168
2021-10-17 16:25:23 +08:00
Ben Shi d0dbc991c0 Revert "[AArch64] Optimize add/sub with immediate"
This reverts commit 9bf6bef995.
2021-10-16 22:17:18 +00:00
Craig Topper beb7862db5 [X86] Add DAG combine for negation of CMOV absolute value pattern.
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.

There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.

Fixes PR50991 for X86.

Reviewed By: RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D111858
2021-10-16 13:31:43 -07:00
Simon Pilgrim 85b87179f4 [TTI][X86] Add v8i16 -> 2 x v4i16 stride 2 interleaved load costs
Split SSE2 and SSSE3 costs to correctly handle PSHUFB lowering - as was noted on D111938
2021-10-16 17:28:07 +01:00
Simon Pilgrim 6ec644e215 [TTI][X86] Add SSE2 sub-128bit vXi16/32 and v2i64 stride 2 interleaved load costs
These cases use the same codegen as AVX2 (pshuflw/pshufd) for the sub-128bit vector deinterleaving, and unpcklqdq for v2i64.

It's going to take a while to add full interleaved cost coverage, but since these are the same for SSE2 -> AVX2 it should be an easy win.

Fixes PR47437

Differential Revision: https://reviews.llvm.org/D111938
2021-10-16 16:21:45 +01:00
Roman Lebedev d137f1288e
[X86][LV] X86 does *not* prefer vectorized addressing
And another attempt to start untangling this ball of threads around gather.
There's `TTI::prefersVectorizedAddressing()`hoop, which confusingly defaults to `true`,
which tells LV to try to vectorize the addresses that lead to loads,
but X86 generally can not deal with vectors of addresses,
the only instructions that support that are GATHER/SCATTER,
but even those aren't available until AVX2, and aren't really usable until AVX512.

This specializes the hook for X86, to return true only if we have AVX512 or AVX2 w/ fast gather.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111546
2021-10-16 12:32:18 +03:00
Ben Shi 9bf6bef995 [AArch64] Optimize add/sub with immediate
Optimize ([add|sub] r, imm) -> ([ADD|SUB] ([ADD|SUB] r, #imm0, lsl #12), #imm1),
if imm == (imm0<<12)+imm1. and both imm0 and imm1 are non-zero 12-bit unsigned
integers.

Optimize ([add|sub] r, imm) -> ([SUB|ADD] ([SUB|ADD] r, #imm0, lsl #12), #imm1),
if imm == -(imm0<<12)-imm1, and both imm0 and imm1 are non-zero 12-bit unsigned
integers.

Reviewed By: jaykang10, dmgreen

Differential Revision: https://reviews.llvm.org/D111034
2021-10-16 08:50:39 +00:00
Zhi An Ng da07942834 [WebAssembly] Add prototype relaxed laneselect instructions
Add i8x16, i16x8, i32x4, i64x2 laneselect instructions. These are only
exposed as builtins, and require user opt-in.
2021-10-15 17:45:09 -07:00
Anshil Gandhi 1830ec94ac Revert "[HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols"
This reverts commit 03375a3fb3.
2021-10-15 16:16:18 -06:00
Anshil Gandhi 03375a3fb3 [HIP] [AlwaysInliner] Disable AlwaysInliner to eliminate undefined symbols
By default clang emits complete contructors as alias of base constructors if they are the same.
The backend is supposed to emit symbols for the alias, otherwise it causes undefined symbols.
@yaxunl observed that this issue is related to the llvm options `-amdgpu-early-inline-all=true`
and `-amdgpu-function-calls=false`. This issue is resolved by only inlining global values
with internal linkage. The `getCalleeFunction()` in AMDGPUResourceUsageAnalysis also had
to be extended to support aliases to functions. inline-calls.ll was corrected appropriately.

Reviewed By: yaxunl, #amdgpu

Differential Revision: https://reviews.llvm.org/D109707
2021-10-15 11:39:15 -06:00
Michael Liao bacddf47a8 [amdgpu] Fix a crash case when preserving MDT in SILowerControlFlow
- When a redundant MBB is being erased from MDT, check whether its
  single successor is dominiated by it. If yes, update that successor's
  idom before erasing MBB; otherwise, it implies MBB is a leaf node and
  could be erased directly.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D111831
2021-10-15 13:21:53 -04:00
Jonas Paulsson ccbfcfda1e [SystemZ] Handle huge immediates in SystemZInstrInfo::loadImmediate().
This is needed during isel pseudo expansion in order not to crash on huge
immediates.

Review: Ulrich Weigand
2021-10-15 19:08:45 +02:00
Jessica Paquette 59b94c4a60 NFC: Remove wayward MIR tests from lib/Target
These were put in lib/Target instead of tests.

Thankfully dupes of them already existed in the tests directory.

So, just delete them.
2021-10-15 09:59:00 -07:00
Abinav Puthan Purayil de3038400b [AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare::replaceMulWithMul24().
The isU24() and isI24() calls numBits to make its decision. This change
replaces them with the internal numBits call so that we can use its
result for the > 32 bit width cases.

Differential Revision: https://reviews.llvm.org/D111864
2021-10-15 19:49:44 +05:30
Dávid Bolvanský 6678db00e6 [X86] Enable promotion of i16 popcnt (PR52056)
Solves https://bugs.llvm.org/show_bug.cgi?id=52056

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111507
2021-10-15 15:41:37 +02:00
Mubashar Ahmad 97809c828f [AArch64]Enabling Cortex-A510 Support
This patch enables support for Cortex-A510 CPUs.

Reviewed By: MarkMurrayARM, dmgreen

Differential Revision: https://reviews.llvm.org/D109825
2021-10-15 14:31:18 +01:00
Abinav Puthan Purayil 0379263f23 [AMDGPU] Fix width check for signed mul24 generation.
This changes fixes a case in which the highest set bit of the original
result is at bit 31 and sign-extending the mul24 for it would make the
result negative.

Differential Revision: https://reviews.llvm.org/D111823
2021-10-15 18:53:41 +05:30
David Green fa1a68285e [AArch64] Improve fptosi.sat vector lowering
Similar to D111236, this improves the lowering of vector fptosi.sat and
fptoui.sat, using legal converts and further saturating from there with
min/max. f64 are excluded for the moment due to producing worse code in
places compared to the unrolling.

Differential Revision: https://reviews.llvm.org/D111787
2021-10-15 11:37:53 +01:00
David Green a4f42a33be [AArch64] Improve fptosi.sat lowering
Improve the lowering of scalar fptosi.sat and fptoui.sat for saturating
widths smaller than legal types by using the fact that the legal type
will saturate under aarch64, and saturating the result further using
min/max.

Differential Revision: https://reviews.llvm.org/D111236
2021-10-15 11:12:23 +01:00
John Brawn 082fa56819 [ARM] Fix MOVCC peephole to not use an incorrect register class
The MOVCC peephole eliminates a MOVCC by making one of its inputs a
conditional instruction, but when doing this it should be using both
inputs of the MOVCC to decide on the register class to use as
otherwise we can get an error when using -verify-machineinstrs.

Differential Revision: https://reviews.llvm.org/D111714
2021-10-15 10:54:26 +01:00
guopeilin b092dc0bb9 [AArch64ISelLowering] Avoid duplane in some cases when sve enabled
Reviewed By: david-arm, sdesmalen

Differential Revision: https://reviews.llvm.org/D110524
2021-10-15 15:33:24 +08:00
Jim Lin 2ccdc7315e [RISCV] Add invalid match case for uimm2, uimm3 and uimm7
Reviewed By: craig.topper, jrtc27

Differential Revision: https://reviews.llvm.org/D110308
2021-10-15 14:54:48 +08:00
Ben Shi 4fe5ab4b00 [RISCV] Optimize immediate materialisation with SH*ADD
Use SH1ADD/SH2ADD/SH3ADD along with LUI+ADDI to compose int32*3,
int32*5 and int32*9.

Reviewed By: craig.topper, luismarques

Differential Revision: https://reviews.llvm.org/D111484
2021-10-15 06:46:41 +00:00
Kazu Hirata 81e9c90686 [llvm] Use llvm::is_contained (NFC) 2021-10-14 22:44:09 -07:00
Qiu Chaofan 9e9b0f4621 [PowerPC] Support ppc-asm-full-reg-names for AIX
Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D94282
2021-10-15 12:22:44 +08:00
Roman Lebedev 3d7bf6625a
[X86][Costmodel] Improve cost modelling for not-fully-interleaved load
While i've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

As we have established over the course of last ~70 patches, (wow)
`BaseT::getInterleavedMemoryOpCos()` is absolutely bogus,
it is usually almost an order of magnitude overestimation,
so i would claim that we should at least use the hardcoded costs
of fully interleaved load groups.

We could go further and adjust them e.g. by the number of demanded indices,
but then i'm somewhat fearful of underestimating the cost.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174
2021-10-14 23:14:36 +03:00
Craig Topper 79ae9562cc [RISCV] Remove unused member variable. NFC 2021-10-14 12:56:47 -07:00
Craig Topper 3ff9cc01f2 [X86] Use CMOVNS for abs instead of CMOVGE.
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with
other recent changes to use a single flag where possible. It also
matches gcc codegen.

I believe this technically changes whether the conditioanl move happens
on INT_MIN, but for INT_MIN both registers are the same so it doesn't
matter.

Differential Revision: https://reviews.llvm.org/D111826
2021-10-14 12:28:28 -07:00
Simon Pilgrim 871f773986 [TTI][X86] Merge getInterleavedMemoryOpCostAVX2 into getInterleavedMemoryOpCost. NFC
This a NFC refactor patch to merge the AVX2 interleaved cost handling back into the getInterleavedMemoryOpCost base method - while getInterleavedMemoryOpCostAVX512 uses instruction and patterns very specific to AVX512+, much of the costs analysis for AVX2 can be reused for all SSE targets.

This is the first step towards improving SSE and AVX1 costs that will reuse the relevant AVX2 costs by splitting some of the tables - for instance AVX1 has very similar costs for most vXi64/vXf64 interleave patterns and many sub-128bit vector costs are the same all the way down to SSE2 (or at least SSSE3).

Differential Revision: https://reviews.llvm.org/D111822
2021-10-14 18:46:25 +01:00
Simon Pilgrim fcbec7e668 [TTI][X86] Swap getInterleavedMemoryOpCostAVX2/getInterleavedMemoryOpCostAVX512 implementations. NFC.
I have some upcoming refactoring for SSE/AVX1 interleaving cost support, and the diff is a lot nicer if the (unaltered) AVX512 implementation isn't stuck between getInterleavedMemoryOpCost and getInterleavedMemoryOpCostAVX2
2021-10-14 18:10:03 +01:00
Craig Topper f7ba572483 [RISCV] Update Zba, Zbb, Zbc, and Zbs version from 0.93 to 1.0.
I've removed the Zbs W instructions that are not part of the frozen spec.

References to B as an extension name have been removed. Tests are updated or split accordingly.

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D110669
2021-10-14 09:25:03 -07:00
Brian Cain 743e263e08 [hexagon] Add system register, transfer support
This commit adds the system reg/regpair definitions and the corresponding
register transfer instructions.
2021-10-14 06:37:04 -07:00
Andrew Savonichev 51eefa8164 [NVPTX] Add VRFrame and VRFrameLocal to integer register classes
These registers are used as operands for instructions that expect an
integer register, so they should be added to Int32Regs or Int64Regs
register classes. Otherwise the machine verifier emits an error for
the following LIT tests when LLVM_ENABLE_MACHINE_VERIFIER=1
environment variable is set:

*** Bad machine code: Illegal physical register for instruction ***
- function:    kernel_func
- basic block: %bb.0 entry (0x55c8903d5438)
- instruction: %3:int64regs = LEA_ADDRi64 $vrframelocal, 0
- operand 1:   $vrframelocal
$vrframelocal is not a Int64Regs register.

    CodeGen/NVPTX/call-with-alloca-buffer.ll
    CodeGen/NVPTX/disable-opt.ll
    CodeGen/NVPTX/lower-alloca.ll
    CodeGen/NVPTX/lower-args.ll
    CodeGen/NVPTX/param-align.ll
    CodeGen/NVPTX/reg-types.ll
    DebugInfo/NVPTX/dbg-declare-alloca.ll
    DebugInfo/NVPTX/dbg-value-const-byref.ll

Differential Revision: https://reviews.llvm.org/D110164
2021-10-14 16:19:03 +03:00
Jonas Paulsson c0d88613f2 [SystemZ] Remove some now unused ISD XXX_LOOP opcodes. 2021-10-14 14:55:44 +02:00
Andrew Savonichev dc8a41de34 [ARM] Simplify address calculation for NEON load/store
The patch attempts to optimize a sequence of SIMD loads from the same
base pointer:

    %0 = gep float*, float* base, i32 4
    %1 = bitcast float* %0 to <4 x float>*
    %2 = load <4 x float>, <4 x float>* %1
    ...
    %n1 = gep float*, float* base, i32 N
    %n2 = bitcast float* %n1 to <4 x float>*
    %n3 = load <4 x float>, <4 x float>* %n2

For AArch64 the compiler generates a sequence of LDR Qt, [Xn, #16].
However, 32-bit NEON VLD1/VST1 lack the [Wn, #imm] addressing mode, so
the address is computed before every ld/st instruction:

    add r2, r0, #32
    add r0, r0, #16
    vld1.32 {d18, d19}, [r2]
    vld1.32 {d22, d23}, [r0]

This can be improved by computing address for the first load, and then
using a post-indexed form of VLD1/VST1 to load the rest:

    add r0, r0, #16
    vld1.32 {d18, d19}, [r0]!
    vld1.32 {d22, d23}, [r0]

In order to do that, the patch adds more patterns to DAGCombine:

  - (load (add ptr inc1)) and (add ptr inc2) are now folded if inc1
    and inc2 are constants.

  - (or ptr inc) is now recognized as a pointer increment if ptr is
    sufficiently aligned.

In addition to that, we now search for all possible base updates and
then pick the best one.

Differential Revision: https://reviews.llvm.org/D108988
2021-10-14 15:23:10 +03:00
Simon Pilgrim 77dcdc2f50 [CostModel][X86] Pre-SSE41 targets can use PMADDWD for sext sub-i16 -> i32
Without SSE41 sext/zext instructions the extensions will be split, meaning that the MUL->PMADDWD fold will split the sext_i32(x) into zext_i32(sext_i16(x))
2021-10-14 12:17:40 +01:00
Jonas Paulsson a33e4c8ae9 [SystemZ] Reapply memcmp and memcpy patches.
This reverts 3562076 and includes some refactoring as well.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D111733
2021-10-14 10:37:33 +02:00
Jonas Paulsson 00baad35b2 [SystemZ] Bugfix and refactorization of mem-mem operations
This patch fixes the bug that consisted of treating variable / immediate
length mem operations (such as memcpy, memset, ...) differently. The variable
length case needs to have the length minus 1 passed due to the use of EXRL
target instructions. However, the DAGCombiner can convert a register length
argument into a constant one, and whenever that happened one byte too little
would end up being performed.

This is also a refactorization by reducing the number of opcodes and variants
involved. For any opcode (variable or constant length), only the length minus
one is passed on to the ISD node. The rest of the logic is now instead
handled during isel pseudo expansion.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D111729
2021-10-14 10:37:33 +02:00
Ben Shi 7e81526126 [RISCV] Optimize immediate materialisation with BSETI/BCLRI
Opitimize immediate materialisation in the following way if profitable:
1. Use BCLRI for upper 32 bits if the lower 32 bits are negative int32.
2. Use BSETI for upper 32 bits if the lower 32 bits are positive int32.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D111508
2021-10-14 04:56:47 +00:00
Abinav Puthan Purayil b3c9d84e5a [AMDGPU] Fix 24-bit mul intrinsic generation for > 32-bit result.
The 24-bit mul intrinsics yields the low-order 32 bits. We should only
do the transformation if the operands are known to be not wider than 24
bits and the result is known to be not wider than 32 bits.

Differential Revision: https://reviews.llvm.org/D111523
2021-10-14 09:00:19 +05:30
Ben Shi 481db13fec [RISCV] Optimize immediate materialisation with SLLI.UW
Use LUI+SLLI.UW to compose the upper bits instead of LUI+SLLI.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D111705
2021-10-14 02:24:50 +00:00
Roman Lebedev 18eef13dad
[X86][Costmodel] Fix `X86TTIImpl::getGSScalarCost()`
`X86TTIImpl::getGSScalarCost()` has (at least) two issues:
* it naively computes the cost of sequence of `insertelement`/`extractelement`.
  If we are operating not on the XMM (but YMM/ZMM),
  this widely overestimates the cost of subvector insertions/extractions.
* Gather/scatter takes a vector of pointers, and scalarization results in us performing
  scalar memory operation for each of these pointers, but we never account for the cost
  of extracting these pointers out of the vector of pointers.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111222
2021-10-13 22:35:39 +03:00
Joe Nash b44eac1b85 [AMDGPU] Remove unneeded emit literal check
NFC. This check does not verify any functional property since size 8
was added. Remove it for simplicity.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D111737

Change-Id: Ifd7cbd324a137f939d8dc04acb8fbd54c9527a42
2021-10-13 12:46:22 -04:00
Kai Nacke 0a950a2e94 [SystemZ/z/OS] Implement save of non-volatile registers on z/OS XPLINK
This PR implements the save of the XPLINK callee-saved registers
on z/OS.

Reviewed By: uweigand, Kai

Differential Revision: https://reviews.llvm.org/D111653
2021-10-13 12:57:57 -04:00
Jay Foad c885857e9d [AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
   allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
   pressure, which in some cases gives it more freedom to reorder
   instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
  increased by about 1%.
- The number of runs of two or more load instructions increased by
  about 4%.

Differential Revision: https://reviews.llvm.org/D111646
2021-10-13 17:12:26 +01:00
Kerry McLaughlin 1a2e90199f [SVE][CodeGen] Add patterns for ADD/SUB + element count
This patch adds patterns to match the following with INC/DEC:
 - @llvm.aarch64.sve.cnt[b|h|w|d] intrinsics + ADD/SUB
 - vscale + ADD/SUB

For some implementations of SVE, INC/DEC VL is not as cheap as ADD/SUB and
so this behaviour is guarded by the "use-scalar-inc-vl" feature flag, which for SVE
is off by default. There are no known issues with SVE2, so this feature is
enabled by default when targeting SVE2.

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D111441
2021-10-13 11:36:15 +01:00
Simon Pilgrim fb2539b9d8 [X86][SSE] Add X86ISD::AVG to isCommutativeBinOp to support folding shuffles through the binop 2021-10-13 10:54:37 +01:00
Heejin Ahn 9261ee32dc [WebAssembly] Make EH work with dynamic linking
This makes Wasm EH work with dynamic linking. So far we were only able
to handle destructors, which do not use any tags or LSDA info.

1. This uses `TargetExternalSymbol` for `GCC_except_tableN` symbols,
   which points to the address of per-function LSDA info. It is more
   convenient to use than `MCSymbol` because it can take additional
   target flags.

2. When lowering `wasm_lsda` intrinsic, if PIC is enabled, make the
   symbol relative to `__memory_base` and generate the `add` node. If
   PIC is disabled, continue to use the absolute address.

3. Make tag symbols (`__cpp_exception` and `__c_longjmp`) undefined in
   the backend, because it is hard to make it work with dynamic
   linking's loading order. Instead, we make all tag symbols undefined
   in the LLVM backend and import it from JS.

4. Add support for undefined tags to the linker.

Companion patches:
- https://github.com/WebAssembly/binaryen/pull/4223
- https://github.com/emscripten-core/emscripten/pull/15266

Reviewed By: sbc100

Differential Revision: https://reviews.llvm.org/D111388
2021-10-12 23:28:27 -07:00
Ben Shi 787eeb8597 [RISCV] Optimize immediate materialisation with BCLRI
Do the following optimization for immediate materialisation:

1. For values in range 0xffffffff 7fffffff ~ 0xffffffff 00000000, first
   generate the lower 32-bit with Val|0x80000000 (which is expected be an
   int32), then emit (BCLRI r, 31).

2. For values in range 0x80000000 ~ 0xffffffff, first generate the lower
   32-bit with Val&~0x80000000 (which is expected to be an int32), then
   emit (BSETI r, 31).

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D111532
2021-10-13 00:59:23 +00:00
Fangrui Song c2d4fe51bb [X86] Remove little support we had for MPX
GCC 9.1 removed Intel MPX support. Linux kernel removed MPX in 2019.
glibc 2.35 will remove MPX.

Our support is limited: we support assembling of bndmov but not bnd.
Just remove it.

Reviewed By: pengfei, skan

Differential Revision: https://reviews.llvm.org/D111517
2021-10-12 16:18:51 -07:00
Albion Fung b4b9f9b4b3 [PowerPC] Emit dcbt and dcbtst in place of their extended mnemonics on AIX
On AIX, the system assembler does not support the extended mnemonics
dcbtt and dcbtstt. This patch stops them from being emitted on
AIX and emits the base mnemonics instead, dcbt X, X, 16 and
dcbtstt X, X, 16 respectively.

Differential revision: https://reviews.llvm.org/D111258
2021-10-12 15:47:57 -05:00
Roman Lebedev 958da6598f
[X86] `detectAVGPattern()`: don't require zext in the with-constant case 2021-10-12 20:24:17 +03:00
Roman Lebedev a1d57f75d1
[NFC][X86] `detectAVGPattern()`: rely on `AVGSplitter()` to perform truncation 2021-10-12 20:12:09 +03:00
Stanislav Mekhanoshin 9cf995be6b [AMDGPU] Promote generic pointer kernel arguments into global
The new pass walks kernel's pointer arguments, then loads from them.
If a loaded value is a pointer and loaded pointer is unmodified in
the kernel before the load, then promote loaded pointer to global.
Then recursively continue.

Differential Revision: https://reviews.llvm.org/D111464
2021-10-12 10:07:33 -07:00
Roman Lebedev 5f4f5da634
[X86] `detectAVGPattern()`: support basic case of PAVG chaining (PR52131)
As noted in https://github.com/halide/Halide/pull/6302,
we hilariously fail to match PAVG if we even as much
as look at it the wrong way.

In this particular case, the problem stems from the fact that
`PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s,
and InstCombine really loves to get rid of both of these,
for example replace them with a bit mask. So we may not have
said `zext`.

Instead of checking for that + type match,
i think we should rely on the actual active type,
as per the knownbits.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111571
2021-10-12 19:51:43 +03:00
Roman Lebedev 7964c3ed82
[X86] `detectAVGPattern()`: small preparatory NFC refactor 2021-10-12 19:51:43 +03:00
Jay Foad 66ce1015af Revert "[AMDGPU] Enable load clustering in the post-RA scheduler"
This reverts commit 66e13c7f43.

It was committed by accident.
2021-10-12 16:19:35 +01:00
Jay Foad 66e13c7f43 [AMDGPU] Enable load clustering in the post-RA scheduler
This has a couple of benefits:
1. It can sometimes fix clusters that got broken apart when the register
   allocator inserted a copy.
2. Post-RA scheduling does not have to worry about increasing register
   pressure, which in some cases gives it more freedom to reorder
   instructions.

Testing on a collection of 10,000 graphics shaders compiled for gfx1010
showed:
- The average length of each run of one or more load instructions
  increased by about 1%.
- The number of runs of two or more load instructions increased by
  about 4%.
2021-10-12 16:09:04 +01:00
Bradley Smith 2eb42e3d2a [AArch64][SVE] Add fixed type lowering for EXTRACT_SUBVECTOR
Depends on D111135

Differential Revision: https://reviews.llvm.org/D111165
2021-10-12 14:56:15 +00:00
Simon Pilgrim 61d124f7a2 [X86] Fix implicit MathsExtras.h header dependency 2021-10-12 13:37:16 +01:00
jacquesguan 0608bbd4e8 [RISCV] Rename assembler mnemonic of unordered floating-point reductions for v1.0-rc change
Rename vfredsum and vfwredsum to vfredusum and vfwredusum. Add aliases for vfredsum and vfwredsum.

Reviewed By: luismarques, HsiangKai, khchen, frasercrmck, kito-cheng, craig.topper

Differential Revision: https://reviews.llvm.org/D105690
2021-10-12 06:46:46 +00:00
Yonghong Song 1321e47298 BPF: rename BTF_KIND_TAG to BTF_KIND_DECL_TAG
Per discussion in https://reviews.llvm.org/D111199,
the existing btf_tag attribute will be renamed to
btf_decl_tag. This patch updated BTF backend to
use btf_decl_tag attribute name and also
renamed BTF_KIND_TAG to BTF_KIND_DECL_TAG.

Differential Revision: https://reviews.llvm.org/D111592
2021-10-11 21:33:39 -07:00
hsmahesha 52cb3af08c [AMDGPU] Remove dead frame indices after sgpr spill.
All those frame indices which are dead after sgpr spill should be removed from
the function frame. Othewise, there is a side effect such as re-mapping of free
frame index ids by the later pass(es) like "stack slot coloring" which in turn
could mess-up with the book keeping of "frame index to VGPR lane".

Reviewed By: cdevadas

Differential Revision: https://reviews.llvm.org/D111150
2021-10-12 09:58:49 +05:30
Freddy Ye d57a87ea89 [X86][ISel] Lowering llvm.thread.pointer
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D110681
2021-10-12 11:01:18 +08:00
David Green 860b4479dc [ARM] Be more explicit about disabling CombineBaseUpdate for MVE.
This shouldn't be called for non-neon targets at the moment in either
case, but it is good to be expliit about the CombineBaseUpdate being a
NEON function, not expecting to be run under MVE.
2021-10-11 21:51:45 +01:00
Roman Lebedev 684cbae89a
[KnownBits] Introduce `countMaxActiveBits()` and use it in a few places 2021-10-11 23:36:06 +03:00
Jay Foad 2e1ad93201 [AMDGPU] Fix copying a machine operand
Without this I get:

*** Bad machine code: Instruction has operand with wrong parent set ***
- function:    available_externally_test
- basic block: %bb.0  (0x7dad598)
- instruction: %0:r600_treg32_x = MOV 1, 0, 0, 0, $alu_literal_x, 0, 0, 0, -1, 1, $pred_sel_off, @available_externally, 0

Differential Revision: https://reviews.llvm.org/D111549
2021-10-11 20:22:47 +01:00
Joe Nash b4b7e605a6 [AMDGPU] Support shared literals in FMAMK/FMAAK
These instructions should allow src0 to be a literal with the same
value as the mandatory other literal. Enable it by introducing an
operand that defers adding its value to the MI when decoding till
the mandatory literal is parsed.

Reviewed By: dp, foad

Differential Revision: https://reviews.llvm.org/D111067

Change-Id: I22b0ae0d35bad17b6f976808e48bffe9a6af70b7
2021-10-11 13:09:54 -04:00
Victor Campos 3550e242fa [Clang][ARM][AArch64] Add support for Armv9-A, Armv9.1-A and Armv9.2-A
armv9-a, armv9.1-a and armv9.2-a can be targeted using the -march option
both in ARM and AArch64.

 - Armv9-A maps to Armv8.5-A.
 - Armv9.1-A maps to Armv8.6-A.
 - Armv9.2-A maps to Armv8.7-A.
 - The SVE2 extension is enabled by default on these architectures.
 - The cryptographic extensions are disabled by default on these
 architectures.

The Armv9-A architecture is described in the Arm® Architecture Reference
Manual Supplement Armv9, for Armv9-A architecture profile
(https://developer.arm.com/documentation/ddi0608/latest).

Reviewed By: SjoerdMeijer

Differential Revision: https://reviews.llvm.org/D109517
2021-10-11 17:44:09 +01:00
Simon Pilgrim ad16c6e52f [X86][AVX] Ensure we retain zero elements in select(pshufb,pshufb) -> or(pshufb,pshufb) fold (PR52122)
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.

PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
2021-10-11 14:50:24 +01:00
Bradley Smith 03065ecd85 [AArch64][SVE] Ensure LowerEXTRACT_SUBVECTOR is not called for illegal types
The lowering for EXTRACT_SUBVECTOR should not be called during type
legalization, only as part of lowering, hence return SDValue() when
called on illegal types.

This also adds missing tests for extracting fixed types from illegal
scalable types.

Differential Revision: https://reviews.llvm.org/D111412
2021-10-11 11:20:50 +00:00
Andrew Savonichev 7ae8f392a1 [AArch64] Emit AssertZExt for i1 arguments
AAPCS requires i1 argument to be zero-extended to 8-bits by the
caller. Emit a new AArch64ISD::ASSERT_ZEXT_BOOL hint (or AssertZExt
for GlobalISel) to enable some optimization opportunities. In
particular, when the argument is forwarded to the callee, we can avoid
zero-extension and use it as-is.

Differential Revision: https://reviews.llvm.org/D107160
2021-10-11 11:55:11 +03:00
David Sherwood 26b7d9d622 [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-11 09:41:38 +01:00
Amara Emerson f1e9ecea44 [AArch64][GlobalISel] Legalize G_VECREDUCE_XOR. Treated same as other bitwise reductions. 2021-10-10 17:01:21 -07:00
David Green adec922361 [AArch64] Make -mcpu=generic schedule for an in-order core
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.

The idea is that in-order cpu's require the most help in instruction
scheduling, whereas out-of-order cpus can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.

On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.

When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.

A lot of existing tests have updated. This is a summary of the important
differences:
 - Most changes are the same instructions in a different order.
 - Sometimes this leads to very minor inefficiencies, such as requiring
   an extra mov to move variables into r0/v0 for the return value of a test
   function.
 - misched-fusion.ll was no longer fusing the pairs of instructions it
   should, as per D110561. I've changed the schedule used in the test
   for now.
 - neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
   the different latencies. This seems fine to me.
 - Some SVE tests do not always remove movprfx where they did before due
   to different register allocation giving different destructive forms.
 - The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
   produce two LDR where they previously produced an LDP due to
   store-pair-suppress kicking in.
 - arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LPD.
 - Some tests such as arm64-neon-mul-div.ll and
   ragreedy-local-interval-cost.ll have more, less or just different
   spilling.
 - In aarch64_generated_funcs.ll.generated.expected one part of the
   function is no longer outlined. Interestingly if I switch this to use
   any other scheduled even less is outlined.

Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.

Differential Revision: https://reviews.llvm.org/D110830
2021-10-09 15:58:31 +01:00
Arthur Eubanks a0a4935182 Make more places that use alignment use uint64_t
Followup to D110451.
2021-10-08 16:35:19 -07:00
Reid Kleckner b3a6d096d7 Fix shlib builds for all lib/Target/*/TargetInfo libs
They all must depend on MC now that the target registry is in MC.
Also fix llvm-cxxdump
2021-10-08 15:21:13 -07:00
Reid Kleckner 2827b1b89d Fix shared library build after TargetRegistry move 2021-10-08 15:06:03 -07:00
Reid Kleckner 89b57061f7 Move TargetRegistry.(h|cpp) from Support to MC
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.

This allows us to ensure that Support doesn't have includes from MC/*.

Differential Revision: https://reviews.llvm.org/D111454
2021-10-08 14:51:48 -07:00
Leonard Chan 4dc462b589 [AArch64] Emit CFI instruction for updating x18 when using ShadowCallStack with exception unwinding
PR45875 notes an instance where exception handling crashes on aarch64-fuchsia
where SCS is enabled by default. The underlying issue seems to be that within libunwind,
various _Unwind_* functions, the x18 register is not updated if a function is marked
with nounwind. This removes the check for nounwind and emits the CFI instruction that updates x18.

Differential Revision: https://reviews.llvm.org/D79822
2021-10-08 14:20:26 -07:00
David Stuttard 69f7d81d0a [AMDGPU] Set number vgprs used in PS shaders based on input registers actually used
For PS shaders we can use the input SPI_PS_INPUT_ENA and SPI_PS_INPUT_ADDR
registers

Calculate the number of VGPR registers used as input VGPRs based on these
registers rather than the arguments passed in (this conservatively always
allocates the maximum).

Differential Revision: https://reviews.llvm.org/D101633

Change-Id: Idf7c060cbbd5f7e3300102c55ecee3c07f209de6
2021-10-08 14:24:35 +01:00
Jingu Kang 30caca39f4 Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit fc36fb4d23 with
bug fixes.

Differential Revision: https://reviews.llvm.org/D109963
2021-10-08 11:28:49 +01:00
Craig Topper f2ad8c9dc6 [RISCV] Remove experimental-b extension that includes all Zb* extensions
At this point it looks like a B extension will never exist. Instead
Zba, Zbb, Zbc, and Zbs are individual extensions being ratified
together as a package. Unknown at this time when or if the other
Zb* extensions will be ratified.

This patch removes references to the B extension. I've updated and
split tests accordingly.

This has been split from D110669 to make review a little easier.

Differential Revision: https://reviews.llvm.org/D111338
2021-10-07 20:47:17 -07:00
Wang, Pengfei c236883b6b [X86] Optimize fdiv with reciprocal instructions for half type
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110557
2021-10-08 09:41:13 +08:00
Jay Foad e996cf7dce [AMDGPU] Preserve MachineDominatorTree in SILowerControlFlow
Updating the MachineDominatorTree is easy since SILowerControlFlow only
splits and removes basic blocks. This should save a bit of compile time
because previously we would recompute the dominator tree from scratch
after this pass.

Another reason for doing this is that SILowerControlFlow preserves
LiveIntervals which transitively requires MachineDominatorTree. I think
that means that SILowerControlFlow is obliged to preserve
MachineDominatorTree too as explained here:
https://lists.llvm.org/pipermail/llvm-dev/2020-November/146923.html
although it does not seem to have caused any problems in practice yet.

Differential Revision: https://reviews.llvm.org/D111313
2021-10-07 21:30:26 +01:00
Mark Schimmel 417f8ea4ba [ARC] ARCRegisterInfo cleanup prior to adding core register pairs (ARC32) and 64-bit core registers (ARC64)
Differential Revision: https://reviews.llvm.org/D11108
2021-10-07 13:01:49 -07:00
Jay Foad 5b8befdd02 [X86] Special-case ADD of two identical registers in convertToThreeAddress
X86InstrInfo::convertToThreeAddress would convert this:

  %1:gr32 = ADD32rr killed %0:gr32(tied-def 0), %0:gr32, implicit-def dead $eflags

to this:

  undef %2.sub_32bit:gr64 = COPY killed %0:gr32
  undef %3.sub_32bit:gr64_nosp = COPY %0:gr32
  %1:gr32 = LEA64_32r killed %2:gr64, 1, killed %3:gr64_nosp, 0, $noreg

Note that in the ADD32rr, %0 was used twice and the first use had a kill
flag, which is what MachineInstr::addRegisterKilled does.

In the converted code, each use of %0 is copied to a new reg, and the
first COPY inherits the kill flag from the ADD32rr. This causes
machine verification to fail (if you force it to run after
TwoAddressInstructionPass) because the second COPY uses %0 after it is
killed. Note that machine verification is currently disabled after
TwoAddressInstructionPass but this is a step towards being able to
enable it.

Fix this by not inserting more than one COPY from the same source
register.

Differential Revision: https://reviews.llvm.org/D110829
2021-10-07 19:50:27 +01:00
Craig Topper c4803bd416 [RISCV] Handle vector of pointer in getTgtMemIntrinsic for strided load/store.
getScalarSizeInBits() doesn't work if the scalar type is a pointer.
For that we need to go through DataLayout.
2021-10-07 10:11:56 -07:00
Mark Schimmel 20c074ee96 C] Add option to ARCOptAddrMode to disable the pass and diagnose errors
Fixed formatting issues reported by clang-format

Differential Revision: https://reviews.llvm.org/D111255
2021-10-07 09:07:07 -07:00
Bradley Smith 5be266db7a [AArch64][SVE] Improve VECTOR_SPLICE codegen for VL > 128-bit
Differential Revision: https://reviews.llvm.org/D111135
2021-10-07 15:28:55 +00:00
Jack Andersen bd4dad87f4 [MachineInstr] Move MIParser's DBG_VALUE RegState::Debug invariant into MachineInstr::addOperand
Based on the reasoning of D53903, register operands of DBG_VALUE are
invariably treated as RegState::Debug operands. This change enforces
this invariant as part of MachineInstr::addOperand so that all passes
emit this flag consistently.

RegState::Debug is inconsistently set on DBG_VALUE registers throughout
LLVM. This runs the risk of a filtering iterator like
MachineRegisterInfo::reg_nodbg_iterator to process these operands
erroneously when not parsed from MIR sources.

This issue was observed in the development of the llvm-mos fork which
adds a backend that relies on physical register operands much more than
existing targets. Physical RegUnit 0 has the same numeric encoding as
$noreg (indicating an undef for DBG_VALUE). Allowing debug operands into
the machine scheduler correlates $noreg with RegUnit 0 (i.e. a collision
of register numbers with different zero semantics). Eventually, this
causes an assert where DBG_VALUE instructions are prohibited from
participating in live register ranges.

Reviewed By: MatzeB, StephenTozer

Differential Revision: https://reviews.llvm.org/D110105
2021-10-07 16:08:52 +01:00
Michael Forster 53801a59eb Fix two unused-variable warnings. 2021-10-07 15:17:58 +02:00
Chen Zheng 1bf05fbc98 [PowerPC] refactor rewriteLoadStores for reusing; nfc
This is split from https://reviews.llvm.org/D108750.
Refactor rewriteLoadStores() so that we can reuse the outlined
functions.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D110314
2021-10-07 12:59:20 +00:00
David Green 73346f5848 [ARM] Introduce a MQPRCopy
Currently when creating tail predicated loops, we need to validate that
all the live-outs of a loop will be equivalent with and without tail
predication, and if they are not we cannot legally create a
tail-predicated loop, leaving expensive vctp and vpst instructions in
the loop. These notably can include register-allocation instructions
like stack loads and stores, and copys lowered from COPYs to MVE_VORRs.

Instead of trying to prove this is valid late in the pipeline, this
patch introduces a MQPRCopy pseudo instruction that COPY is lowered to.
This can then either be converted to a MVE_VORR where possible, or to a
couple of VMOVD instructions if not. This way they do not behave
differently within and outside of tail-predications regions, and we can
know by construction that they are always valid. The idea is that we can
do the same with stack load and stores, converting them to VLDR/VSTR or
VLDM/VSTM where required to prove tail predication is always valid.

This does unfortunately mean inserting multiple VMOVD instructions,
instead of a single MVE_VORR, but my experiments show it to be an
improvement in general.

Differential Revision: https://reviews.llvm.org/D111048
2021-10-07 12:52:12 +01:00
Cullen Rhodes 14cb138b15 [AArch64][SME] Update DUP (predicate) instruction
Changes in architecture revision 00eac1:
  * Renamed to PSEL.
  * Copies whole source register.
  * Element type suffix removed from destination.
  * Element index no longer optional and '#' prefix has been removed.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-09

Depends on D111212.

Reviewed By: kmclaughlin

Differential Revision: https://reviews.llvm.org/D111213
2021-10-07 08:55:11 +00:00
Cullen Rhodes 42ba79b7b0 [AArch64][SME] Update tile slice index offset
Changes in architecture revision 00eac1:
  * Tile slice index offset no longer prefixed with '#'.
  * The syntax for 128-bit (.Q) ZA tile slice accesses must now include
    an explicit zero index.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-09

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D111212
2021-10-07 08:55:10 +00:00
Itay Bookstein 40ec1c0f16 [IR][NFC] Rename getBaseObject to getAliaseeObject
To better reflect the meaning of the now-disambiguated {GlobalValue,
GlobalAlias}::getBaseObject after breaking off GlobalIFunc::getResolverFunction
(D109792), the function is renamed to getAliaseeObject.
2021-10-06 19:33:10 -07:00
Craig Topper 58b68e70eb [X86] Don't use popcnt for parity if only bits 7:0 of the input can be non-zero.
Without popcnt we had a special case for using the parity flag from a single test i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.

To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.

Fixes PR52093

Differential Revision: https://reviews.llvm.org/D111249
2021-10-06 14:39:57 -07:00
Wolfgang Pieb f53d05135e [UBSAN][PS4] For the PS4 target, emit the ud2 ocpode for ubsan traps.
Reviewed By: probinson

Differential Revision: https://reviews.llvm.org/D111172
2021-10-06 11:49:36 -07:00
Stefan Pintilie 740086596c [PowerPC] Fix issue with lowering byval parameters.
Lowering of byval parameters with sizes that are not represented by a single
store require multiple stores to properly address the correct size of the
parameter.

Sizes that cannot be done with a single store are 3 bytes, 5 bytes, 6 bytes,
7 bytes. It is not correct to simply perform an 8 byte store and for these
elements because then the store would be larger than the element and alias
analysis would assume that this is undefined behaivour and return NoAlias
for them.

This patch adds the correct stores so that the size of the store is not larger
than the size of the element.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D108795
2021-10-06 13:19:15 -05:00
Shivam Gupta 7a189333ed [NFC] Add doxygen comment for hasFp in RISCVFrameLowering.cpp 2021-10-06 22:35:28 +05:30
Pengxuan Zheng b0045f5595 [ARM] Fix a bug in finding a pair of extracts to create VMOVRRD
D100244 missed a check on the ResNo of the extract's operand 0 when finding a
pair of extracts to combine into a VMOVRRD (extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2)). As a result, it can incorrectly pair an extract(x, n)
with another extract(x:3, n+1) for example. This patch fixes the bug by adding
the proper check on ResNo.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D111188
2021-10-06 10:03:32 -07:00
Simon Pilgrim 21661607ca [llvm] Replace report_fatal_error(std::string) uses with report_fatal_error(Twine)
As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared.
2021-10-06 12:04:30 +01:00
Nathan Sidwell c11e7b59d2 [X86][NFC] structure-return simplificiation
The X86 backend only needs to know whether structure return is via an
sret pointer.  This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.

Differential Revision: https://reviews.llvm.org/D109966
2021-10-06 03:12:48 -07:00
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Jonas Paulsson 3562076dfc [SystemZ] Temporarily revert memcmp and memcpy patches
Seem to cause test failures in compiler-rt.

Revert "[SystemZ] Implement memcmp of variable length with CLC."
This reverts commit 7a4e9a0c73.

Revert "[SystemZ] Implement memcpy of variable length with MVC."
This reverts commit c6c13c58ee.
2021-10-06 11:05:18 +02:00
David Spickett fc36fb4d23 Revert "Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation""
This reverts commit 13f3c39f36.

Due to test failures in stage 2 clang tests on AArch64 bots.
2021-10-06 08:39:48 +00:00
Paulo Matos 0c7495848a [WebAssembly] Fix call_indirect on funcrefs
The currently implementation of funcrefs is broken since it is putting
the funcref itself on the stack before the call_indirect. Instead what
should be on the stack is the constant 0, which is the index at which
we store the funcref in __funcref_call_table.

Reviewed By: tlively

Differential Revision: https://reviews.llvm.org/D111152
2021-10-06 10:11:53 +02:00
Paulo Matos 91fe069c35 [WebAssembly] De-duplicate WasmAddressSpace and refactor reftype predicates
This is a non-functional change to remove the duplicate
WasmAddressSpace enum and refactor reftype predicates by moving them
to the Utilities source file.

Reviewed By: tlively

Differential Revision: https://reviews.llvm.org/D111144
2021-10-06 09:56:23 +02:00
Carl Ritson adf7043a9f [AMDGPU] Only remove branches in SIInstrInfo::removeBranch
Without this change _term instructions can be removed during
critical edge splitting.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D111126
2021-10-06 10:34:26 +09:00
Heejin Ahn 3ec1760d91 [WebAssembly] Remove WasmTagType
This removes `WasmTagType`. `WasmTagType` contained an attribute and a
signature index:
```
struct WasmTagType {
  uint8_t Attribute;
  uint32_t SigIndex;
};
```

Currently the attribute field is not used and reserved for future use,
and always 0. And that this class contains `SigIndex` as its property is
a little weird in the place, because the tag type's signature index is
not an inherent property of a tag but rather a reference to another
section that changes after linking. This makes tag handling in the
linker also weird that tag-related methods are taking both `WasmTagType`
and `WasmSignature` even though `WasmTagType` contains a signature
index. This is because the signature index changes in linking so it
doesn't have any info at this point. This instead moves `SigIndex` to
`struct WasmTag` itself, as we did for `struct WasmFunction` in D111104.

In this CL, in lib/MC and lib/Object, this now treats tag types in the
same way as function types. Also in YAML, this removes `struct Tag`,
because now it only contains the tag index. Also tags set `SigIndex` in
`WasmImport` union, as functions do.

I think this makes things simpler and makes tag handling more in line
with function handling. These two shares similar properties in that both
of them have signatures, but they are kind of nominal so having the same
signature doesn't mean they are the same element.

Also a drive-by fix: the reserved 'attirubute' part's encoding changed
from uleb32 to uint8 a while ago. This was fixed in lib/MC and
lib/Object but not in YAML. This doesn't change object files because the
field's value is always 0 and its encoding is the same for the both
encoding.

This is effectively NFC; I didn't mark it as such just because it
changed YAML test results.

Reviewed By: sbc100, tlively

Differential Revision: https://reviews.llvm.org/D111086
2021-10-05 17:11:22 -07:00
Simon Pilgrim 2e5daac217 [llvm] Update report_fatal_error calls from raw_string_ostream to use Twine(OS.str())
As described on D111049, we're trying to remove the <string> dependency from error handling and replace uses of report_fatal_error(const std::string&) with the Twine() variant which can be forward declared.

We can use the raw_string_ostream::str() method to perform the implicit flush() and return a reference to the std::string container that we can then wrap inside Twine().
2021-10-05 18:42:12 +01:00
Jonas Paulsson 7a4e9a0c73 [SystemZ] Implement memcmp of variable length with CLC.
Following the same pattern of memset/memcpy, this patch implements a variable
length memcmp with a CLC loop followed by an EXRL instruction.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D107380
2021-10-05 18:20:36 +02:00
Jonas Paulsson c6c13c58ee [SystemZ] Implement memcpy of variable length with MVC.
Instead of making a memcpy libcall, emit an MVC loop and an EXRL instruction
the same way as is already done for memset 0.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D106874
2021-10-05 17:14:41 +02:00
Peter Waller be26e6ff73 [AArch64][SVE] Remove redundant PTEST following PNEXT/PFIRST
PNEXT and PFIRST set the NZCV flags, so the subsequent PTEST can be
optimized away in AArch64InstrInfo::optimizePTestInstr.

See-also: https://reviews.llvm.org/D93292

Differential Revision: https://reviews.llvm.org/D110177
2021-10-05 15:10:48 +00:00
Matthew Devereau 2ac1999937 [AArch64][SVE] Propagate math flags from intrinsics to instructions
Retain floating-point math flags inside instCombineSVEVectorBinOp
2021-10-05 15:39:13 +01:00
Roman Lebedev 3f9b235482
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1jfGddcre - for intels `Block RThroughput: =36.0`; for ryzens, `Block RThroughput: =12.0`
So could pick cost of `36`

For store we have:
https://godbolt.org/z/ao9srMT8r - for intels `Block RThroughput: =30.0`; for ryzens, `Block RThroughput: =12.0`
So we could pick cost of `30`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111094
2021-10-05 16:58:58 +03:00
Roman Lebedev e2784c5d8c
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/rc8jYxW6M - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0`
So could pick cost of `18`.

For store we have:
https://godbolt.org/z/9PhPEr65G - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: =6.0`
So we could pick cost of `15`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111093
2021-10-05 16:58:58 +03:00
Roman Lebedev 3960693048
[X86][Costmodel] Load/store i64/f64 Stride=6 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/onese7rec - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =3.0`
So could pick cost of `6`.

For store we have:
https://godbolt.org/z/bMd7dddnT - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111092
2021-10-05 16:58:58 +03:00
Roman Lebedev 79d6d12d95
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=16 interleaving costs
This one required quite a bit of an assembly surgery, but i think it's in the right ballpark..

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/na97Kb96o - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So could pick cost of `64`.

For store we have:
https://godbolt.org/z/GG1WeoKar - for intels `Block RThroughput: =66.0`; for ryzens, `Block RThroughput: <=27.5`
So we could pick cost of `66`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111091
2021-10-05 16:58:58 +03:00
Roman Lebedev 2996a2b50f
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/jK85GWKaK - for intels `Block RThroughput: =31.0`; for ryzens, `Block RThroughput: <=17.0`
So could pick cost of `31`.

For store we have:
https://godbolt.org/z/hPWWhEEf9 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=13.8`
So we could pick cost of `33`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111089
2021-10-05 16:58:57 +03:00
Roman Lebedev d51532d8aa
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/szEj1ceee - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=8.8`
So could pick cost of `15`.

For store we have:
https://godbolt.org/z/81bq4fTo1 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=10.0`
So we could pick cost of `12`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111087
2021-10-05 16:58:57 +03:00
Roman Lebedev 764fd5f463
[X86][Costmodel] Load/store i32/f32 Stride=6 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.3`
So could pick cost of `6`.

For store we have:
https://godbolt.org/z/aec96Thee - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0`
So we could pick cost of `9`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111083
2021-10-05 16:58:57 +03:00
Roman Lebedev c800119c46
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/3M3hbq7n8 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So could pick cost of `20`.

For store we have:
https://godbolt.org/z/zvnPYWTx7 - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: =8.0`
So we could pick cost of `20`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111076
2021-10-05 16:58:57 +03:00
Roman Lebedev 000ce0bfd5
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTKdzjvnr - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So could pick cost of `8`.

For store we have:
https://godbolt.org/z/cMYEvqoah - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111075
2021-10-05 16:58:57 +03:00
Roman Lebedev dcc2b0d933
[X86][Costmodel] Load/store i64/f64 Stride=4 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/z197317d1 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So could pick cost of `6`.

For store we have:
https://godbolt.org/z/8dzszjf9q - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0`
So we could pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111073
2021-10-05 16:58:57 +03:00
Roman Lebedev 7d91037fd2
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=16 interleaving costs
This one required quite a bit of assembly surgery, but the trend continues, so i think this is right.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/EKWdj8cKT - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`.

For store we have:
https://godbolt.org/z/zj4bb9P75 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111064
2021-10-05 16:58:57 +03:00
Roman Lebedev 4aee1e5b93
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/a6rxMG6ec - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=12.0`
So could pick cost of `16`.

For store we have:
https://godbolt.org/z/ced1bdqc9 - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0`
So we could pick cost of `16`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111063
2021-10-05 16:58:57 +03:00
Roman Lebedev 3c2e22b795
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/avq1oz98W - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0`
So could pick cost of `8`.

For store we have:
https://godbolt.org/z/89PGMc1qs - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=6.0`
So we could pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111061
2021-10-05 16:58:57 +03:00
Roman Lebedev b6234c1edf
[X86][Costmodel] Load/store i32/f32 Stride=4 VF=2 interleaving costs
Finally, we are getting to the heavy-hitter stuff!

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/7crGWoar6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So could pick cost of `4`.

For store we have:
https://godbolt.org/z/T8aq3MszM - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.0`
So we could pick cost of `5`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111060
2021-10-05 16:58:56 +03:00
kpyzhov 095c48fdf3 [AMDGPU] Use "hostcall" module flag instead of searching for ockl_hostcall_internal() declaration.
The current way to detect hostcalls by looking for "ockl_hostcall_internal()" function in the module seems to be not reliable enough. The LTO may rename the "ockl_hostcall_internal()" function when an application is compiled with "-fgpu-rdc", and MetadataStreamer pass to fail to detect hostcalls, therefore it does not set the "hidden_hostcall_buffer" kernel argument.
This change adds a new module flag: hostcall that can be used to detect whether GPU functions use host calls for printf.

Differential revision: https://reviews.llvm.org/D110337
2021-10-05 09:56:04 -04:00
Hsiangkai Wang 80a6456306 [RISCV] Update to vlm.v and vsm.v according to v1.0-rc1.
vle1.v  -> vlm.v
vse1.v  -> vsm.v

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D106044
2021-10-05 21:49:54 +08:00
Jay Foad 9ce4f37206 [AMDGPU][GlobalISel] Fix legalization of G_UMULH
Scalarize before narrowing because the narrowing implementation does not
work on vectors. This matches what we do for regular G_MUL.

Differential Revision: https://reviews.llvm.org/D111129
2021-10-05 10:56:02 +01:00
Tim Northover 5f65ee260d AArch64+GISel: legalize vector remainder operations. 2021-10-05 10:20:10 +01:00
Amara Emerson 3fe475367c [AArch64][GlobalISel] Legalize G_VECREDUCE_AND.
These are handled identically to the already handled G_VECREDUCE_OR instructions.
2021-10-05 00:39:29 -07:00
Amara Emerson 8bde5e58c0 Delay outgoing register assignments to last.
The delayed stack protector feature which is currently used for SDAG (and thus
allows for more commonly generating tail calls) depends on being able to extract
the tail call into a separate return block. To do this it also has to extract
the vreg->physreg copies that set up the call's arguments, since if it doesn't
then the call inst ends up using undefined physregs in it's new spliced block.

SelectionDAG implementations can do this because they delay emitting register
copies until  *after* the stack arguments are set up. GISel however just
processes and emits the arguments in IR order, so stack arguments always end up
last, and thus this breaks the code that looks for any register arg copies that
precede the call instruction.

This patch adds a thunk argument to the assignValueToReg() and custom assignment
hooks. For outgoing arguments, register assignments use this return param to
return a thunk that does the actual generating of the copies. We collect these
until all the outgoing stack assignments have been done and then execute them,
so that the copies (and perhaps some artifacts like G_SEXTs) are placed after
any stores.

Differential Revision: https://reviews.llvm.org/D110610
2021-10-04 12:33:20 -07:00
Simon Pilgrim e8477045f6 [X86][SLM] Fix BSR/BSF port usage
Both ports are required for BitScan ops. Update the port usage and distributed throughput based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-10-04 19:52:53 +01:00
David Green 30dc53db36 [AArch64] Disable AArch64StorePairSuppress under optsize
AArch64StorePairSuppress will prevent the creation of LDP's based on
scheduling info. This shouldn't apply when optimizing for size though,
where the size decrease should be considered more important.

Differential Revision: https://reviews.llvm.org/D110809
2021-10-04 18:28:15 +01:00
Simon Pilgrim bf30c48419 [X86] SimplifyDemandedVectorEltsForTargetNode - simplify PMADDWD for known zero elements
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.

Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
2021-10-04 14:36:45 +01:00
Roman Lebedev cef0a693b6
[X86][Costmodel] Load/store i64/f64 Stride=3 VF=16 interleaving costs
This required huge amount of assembly surgery, but i think this is about right.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/z11crMEcj - for intels `Block RThroughput: =20.0`; for ryzens, `Block RThroughput: <=18.0`
So could pick cost of `25`.

For store we have:
https://godbolt.org/z/eqT4ze3j4 - for intels `Block RThroughput: =24.0`; for ryzens, `Block RThroughput: <=16.0`
So we could pick cost of `24`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111031
2021-10-04 14:35:17 +03:00
Roman Lebedev ede0611e79
[X86][Costmodel] Load/store i64/f64 Stride=3 VF=8 interleaving costs
This one required quite a bit of assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/oYWv4cTnK - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `10`.

For store we have:
https://godbolt.org/z/33GMhrsG9 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `12`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111027
2021-10-04 14:35:01 +03:00
Roman Lebedev eb9a694c17
[X86][Costmodel] Load/store i64/f64 Stride=3 VF=4 interleaving costs
This one required quite a bit of assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Tce3osvcz - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `5`.

For store we have:
https://godbolt.org/z/oc3arEcnE - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111026
2021-10-04 14:34:47 +03:00
Roman Lebedev d3bbe781ea
[X86][Costmodel] Load/store i64/f64 Stride=3 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/sz5qdKnr4 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `1`.

For store we have:
https://godbolt.org/z/Kzdjff63v - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111025
2021-10-04 14:34:33 +03:00
Roman Lebedev 4ca5bc07af
[X86][Costmodel] Load/store i32/f32 Stride=3 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/5fqrh4qqo - for intels `Block RThroughput: =14.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `14`.

For store we have:
https://godbolt.org/z/5fqrh4qqo - for intels `Block RThroughput: =22.0`; for ryzens, `Block RThroughput: <=16.0`
So pick cost of `22`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111022
2021-10-04 14:34:19 +03:00
Roman Lebedev 198aa84973
[X86][Costmodel] Load/store i32/f32 Stride=3 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/zdz5Ga6fs - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=6.0`
So pick cost of `7`.

For store we have:
https://godbolt.org/z/qn71513ac - for intels `Block RThroughput: =11.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `11`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111021
2021-10-04 14:34:05 +03:00
Roman Lebedev a93411c3af
[X86][Costmodel] Load/store i32/f32 Stride=3 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/d8PdhEszo - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `3`.

For store we have:
https://godbolt.org/z/WojonfG5n - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `5`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111020
2021-10-04 14:34:03 +03:00
Roman Lebedev 3e93fcdfc8
[X86][Costmodel] Load/store i32/f32 Stride=3 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/z8qa14bs3 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: =1.5`
So pick cost of `3`.

For store we have:
https://godbolt.org/z/GYGajoc4K - for intels `Block RThroughput: <=4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111019
2021-10-04 14:31:50 +03:00
Stefan Pintilie 4fc2f4979c [PowerPC] Fix __builtin_ppc_load2r to return short instead of int.
This patch fixes the return value of the builtin __builtin_ppc_load2r to
correctly return short instead of int.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D110771
2021-10-04 06:17:02 -05:00
Jay Foad 566690b067 [APFloat] Remove BitWidth argument from getAllOnesValue
There's no need to pass this in explicitly because it is
trivially available from the semantics.
2021-10-04 11:32:16 +01:00
Jay Foad a9bceb2b05 [APInt] Stop using soft-deprecated constructors and methods in llvm. NFC.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.

Differential Revision: https://reviews.llvm.org/D110807
2021-10-04 08:57:44 +01:00
Roman Lebedev 67f1ee2e38
[X86][Costmodel] Load/store i16 Stride=3 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/rMaYr67hz - for intels `Block RThroughput: =56.0`; for ryzens, `Block RThroughput: <=17.8`
So pick cost of `56`.

For store we have:
https://godbolt.org/z/eMsbKqnvv - for intels `Block RThroughput: <=54.0`; for ryzens, `Block RThroughput: <=15.0`
So pick cost of `54`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111018
2021-10-03 23:40:35 +03:00
Roman Lebedev 3cbc0a07f9
[X86][Costmodel] Load/store i16 Stride=3 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1T6MMzeh3 - for intels `Block RThroughput: =28.0`; for ryzens, `Block RThroughput: <=8.5`
So pick cost of `28`.

For store we have:
https://godbolt.org/z/1T6MMzeh3 - for intels `Block RThroughput: <=27.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `27`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111017
2021-10-03 23:40:21 +03:00
Roman Lebedev 72f8a9244a
[X86][Costmodel] Load/store i16 Stride=3 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Mh9MnnT8W - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=2.3`
So pick cost of `9`.

For store we have:
https://godbolt.org/z/Mh9MnnT8W - for intels `Block RThroughput: <=12.0`; for ryzens, `Block RThroughput: <=3.3`
So pick cost of `12`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111016
2021-10-03 23:40:05 +03:00
Roman Lebedev 04f1469cb4
[X86][Costmodel] Load/store i16 Stride=3 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/sP4j1173f - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `7`.

For store we have:
https://godbolt.org/z/sP4j1173f - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111015
2021-10-03 23:39:51 +03:00
Roman Lebedev 8e8fb77aa4
[X86][Costmodel] Load/store i16 Stride=3 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/xnE988aej - for intels `Block RThroughput: =5.0`; for ryzens, `Block RThroughput: <=2.5`
So pick cost of `5`.

For store we have:
https://godbolt.org/z/rMGT31Tnh - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111014
2021-10-03 23:39:36 +03:00
Roman Lebedev a5e5883ef5
[X86][Costmodel] Load/store i8 Stride=6 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/c1jjKqP7b - for intels `Block RThroughput: <=82.0`; for ryzens, `Block RThroughput: <=26.0`
So pick cost of `82`.

For store we have:
https://godbolt.org/z/YM4ErY8x7 - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=25.5`
So pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111013
2021-10-03 23:39:22 +03:00
Roman Lebedev bd5ba437fd
[X86][Costmodel] Load/store i8 Stride=6 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Gz8hhqfTM - for intels `Block RThroughput: <=43.0`; for ryzens, `Block RThroughput: <=14.0`
So pick cost of `43`.

For store we have:
https://godbolt.org/z/9vrdssYa8 - for intels `Block RThroughput: <=27.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `27`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111012
2021-10-03 23:39:08 +03:00
Roman Lebedev 0b27f9c088
[X86][Costmodel] Load/store i8 Stride=6 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/v98qPTTf6 - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: =6.0`
So pick cost of `18`.

For store we have:
https://godbolt.org/z/rn5T9E8q6 - for intels `Block RThroughput: <=16.0`; for ryzens, `Block RThroughput: <=4.5`
So pick cost of `16`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111011
2021-10-03 23:38:54 +03:00
Roman Lebedev 6fe4cce558
[X86][Costmodel] Load/store i8 Stride=6 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/4sWhs396o - for intels `Block RThroughput: =14.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `14`.

For store we have:
https://godbolt.org/z/4sWhs396o - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `9`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111010
2021-10-03 23:38:40 +03:00
Roman Lebedev 396b95e5c9
[X86][Costmodel] Load/store i8 Stride=6 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/jvj6jzns5 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/ros7eebMP - for intels `Block RThroughput: =7.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `7`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111008
2021-10-03 23:38:10 +03:00
David Green 20b1a16a69 [ARM] Mark <= -1 immediate constant as cheap
A <= -1 constant on a compare can be converted to a < 0 operation, which
is usually cheap. If we mark the constant as cheap, preventing hoisting,
we allow that fold to happen even across different blocks.

Differential Revision: https://reviews.llvm.org/D109360
2021-10-03 19:30:08 +01:00
Simon Pilgrim 164cc2781f [X86] Split Cannonlake + Icelake Tuning. NFC
The Ice/Tiger/RocketLake specs were inheriting the tuning settings from CannonLake, a previous architecture. We shouldn't have this dependency, so I've copied the current tuning settings so we can make future adjustments to both CNL + ICL etc. more easily.
2021-10-03 18:38:47 +01:00
Simon Pilgrim b85bf520dc [CostModel][X86] X86TTIImpl::getCmpSelInstrCost - try to use Predicate argument directly first (PR48337)
There's still a lot of cases where getCmpSelInstrCost fails to specify a predicate, once those are in place we should be able to remove the fallback to the Instruction argument entirely.
2021-10-03 17:16:51 +01:00
Dávid Bolvanský fb84aa2a8f Fixed warnings in target/parser codes produced by -Wbitwise-instead-of-logicala 2021-10-03 15:04:01 +02:00
Kazu Hirata c1e32b3fc0 [Target] Migrate from getNumArgOperands to arg_size (NFC)
Note that getNumArgOperands is considered a legacy name.  See
llvm/include/llvm/IR/InstrTypes.h for details.
2021-10-02 12:06:29 -07:00
Simon Pilgrim 7cae0daee6 [X86][Atom] Fix BSR/BSF uops + port usage
Both ports are required for BitScan ops. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner reports as well.
2021-10-02 19:09:44 +01:00
Craig Topper 33d20977b7 Revert "[RISCV] Add an GPR def to the Zvlseg SPILL/RELOAD pseudos"
This reverts commit 1f16191906.

We're seeing some issues with this internally. It seems that when
the spill is created by register allocation, the GPR doesn't get
allocated and an assertion fires during virtual register rewriting.

The .mir test case contains the spill before register allocation so
register allocation sees it as any other instruction.
2021-10-02 10:44:11 -07:00
Simon Pilgrim 9452ec722c [X86][SSE] Fix typo + infinite-loop in HOP(HOP'(X,X),HOP'(Y,Y)) fold (PR52040)
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.

Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one use checks.

Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
2021-10-02 15:31:12 +01:00
Simon Pilgrim bb42cc2090 [X86] decomposeMulByConstant - decompose legal vXi32 multiplies on SlowPMULLD targets and all vXi64 multiplies
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.

Unfortunately this isn't great for SSE41+ targets which have PMULLD for vXi32 multiplies, but is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even latest IceLake has really poor latencies for PMULLQ.

Differential Revision: https://reviews.llvm.org/D110588
2021-10-02 12:35:25 +01:00
Simon Pilgrim 8e7f6039fa [X86] Atom SSE shift-by-variable take 2uops/3uops not 1uop
Based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.
2021-10-02 12:28:41 +01:00
Roman Lebedev acb459574a
[X86][Costmodel] Load/store i8 Stride=4 VF=32 interleaving costs
While we already model this tuple, the load cost is divergent from reality, so fix it.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/zWMhhnPYa - for intels `Block RThroughput: =56.0`; for ryzens, `Block RThroughput: <=24.0`
So pick cost of `56`.

For store we have:
https://godbolt.org/z/vnqqjWx51 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `12`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110971
2021-10-02 13:40:21 +03:00
Roman Lebedev 0e71ae6da8
[X86][Costmodel] Load/store i8 Stride=4 VF=16 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/TrGW7cKsE - for intels `Block RThroughput: =24.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `24`.

For store we have:
https://godbolt.org/z/Mh7qaqEfe - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110970
2021-10-02 13:40:21 +03:00
Roman Lebedev 74e4a0e327
[X86][Costmodel] Load/store i8 Stride=4 VF=8 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/v7746Wcf7 - for intels `Block RThroughput: =12.0`; for ryzens, `Block RThroughput: <=6.0`
So pick cost of `12`.

For store we have:
https://godbolt.org/z/aEeEohEbP - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110969
2021-10-02 13:40:20 +03:00
Roman Lebedev ae08362cb8
[X86][Costmodel] Load/store i8 Stride=4 VF=4 interleaving costs
While we already model this tuple, the store cost is divergent from reality, so fix it.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1n4bPh7Tn - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/r8K9sveqo - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110968
2021-10-02 13:40:20 +03:00
Roman Lebedev 935b9693ae
[X86][Costmodel] Load/store i8 Stride=4 VF=2 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/KP6nn36zs - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/ov95zhrq6 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110966
2021-10-02 13:40:20 +03:00
Roman Lebedev 448c939839
[X86][Costmodel] Load/store i8 Stride=3 VF=32 interleaving costs
For VF=16, costs are correct.
For VF=32, load cost is divergent.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/qKjevqf4W - for intels `Block RThroughput: <=14.0`; for ryzens, `Block RThroughput: <=4.5`
So pick cost of `14`.

For store we have:
https://godbolt.org/z/xTssTq319 - for intels `Block RThroughput: =13.0`; for ryzens, `Block RThroughput: <=5.5`
So pick cost of `13`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110961
2021-10-02 13:39:15 +03:00
Roman Lebedev d1460c88a6
[X86][Costmodel] Load/store i8 Stride=3 VF=8 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1jeocxj55 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/fr7xfa3K5 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110960
2021-10-02 13:39:15 +03:00
Roman Lebedev f1df2d8eaf
[X86][Costmodel] Load/store i8 Stride=3 VF=4 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/obWz3PrfK - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=1.5`
So pick cost of `3`.

For store we have:
https://godbolt.org/z/orjPshn3h - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110958
2021-10-02 13:39:10 +03:00
Roman Lebedev 8a3c64c3a2
[X86][Costmodel] Load/store i8 Stride=3 VF=2 interleaving costs
While we already model this tuple, the values are divergent from reality, so fix them.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/WYscYMcW4 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=1.5`
So pick cost of `3`.

For store we have:
https://godbolt.org/z/e9qvYdbbs - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110956
2021-10-02 13:39:05 +03:00
Amara Emerson f41a9cf859 [AArch64][GlobalISel] Lower G_SMULH/G_UMULH unless its one of the supported types.
s32 was also incorrectly marked as a supported type, and was causing fallbacks
because we don't support it.
2021-10-01 22:15:23 -07:00
Jessica Paquette 96843d220d [AArch64][GlobalISel] Change G_ANYEXT fed by scalar G_ICMP to G_ZEXT
This is a common pattern:

```
    %icmp:_(s32) = G_ICMP intpred(eq), ...
    %ext:_(s64) = G_ANYEXT %icmp(s32)
    %and:_(s64) = G_AND %ext, 1
```

Here's an example: https://godbolt.org/z/T13f6o8zE

This pattern appears because of the following combine in the
LegalizationArtifactCombiner:

```
// zext(trunc x) - > and (aext/copy/trunc x), mask
```

Which kicks in when we widen the result of G_ICMP from 1 bit to 32 bits.

We know that, on AArch64, a scalar G_ICMP will produce 0 or 1. So the result
of `%ext` will always be 0 or 1 as well.

We have some KnownBits combines which eliminate redundant G_ANDs with masks.
These combines don't kick in with G_ANYEXT.

So, if we replace the G_ANYEXT with G_ZEXT in this situation, the KnownBits
based combines can remove the redundant G_AND.

I wasn't sure if it woud be more appropriate to

* Take this route
* Put this in the LegalizationArtifactCombiner.
* Allow 64 bit G_ICMP destinations

I decided on this route because

1) It's simple

2) I'm not sure if philosophically-speaking, we should be handling non-artifact
instructions + target-specific details like TargetBooleanContents in the
LegalizationArtifactCombiner

3) There is a lot of existing code which assumes we only have 32 bit G_ICMP
destinations. So, adding support for 64-bit destinations seems rather invasive
right now. I think that adding support for 64-bit destinations, or modelling
G_ICMP as ADDS/SUBS/etc is probably cleaner long term though.

This gives minor code size savings on all CTMark benchmarks.

Differential Revision: https://reviews.llvm.org/D110959
2021-10-01 15:01:20 -07:00
Daniil Fukalov 47d6274d4c [NFC][AMDGPU] Reduce includes dependencies, part 2
1. Splitted out some parts of R600 target to separate modules/headers.
2. Reduced some include lists in headers.
3. Minor forward declarations, redundant includes and flags in GCNSubtarget
   cleanup.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D109351
2021-10-01 17:50:20 +03:00
Roman Lebedev 3e260efdfc
[X86][Costmodel] Load/store i64/f64 Stride=2 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1WMTojvfW - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `16`.

For store we have:
https://godbolt.org/z/1WMTojvfW - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=16.0`
So pick cost of `16`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110840
2021-10-01 17:48:14 +03:00
Roman Lebedev abd37de63e
[X86][Costmodel] Load/store i64/f64 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/PGYbYKPq8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.

For store we have:
https://godbolt.org/z/PGYbYKPq8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110838
2021-10-01 17:48:14 +03:00
Roman Lebedev 71bc31b907
[X86][Costmodel] Load/store i64/f64 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/j5co1qWEW - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/j5co1qWEW - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110837
2021-10-01 17:48:14 +03:00
Roman Lebedev 612e5b05a2
[X86][Costmodel] Load/store i64/f64 Stride=2 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/8a1cfGeMn - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/jMdcM47bx - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110835
2021-10-01 17:48:14 +03:00
Roman Lebedev ea76cb87ee
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

Here for `store` pattern we are starting to have spilling,
so accurate modelling may be problematic,
although if i drop the spilling, the measurements don't change.

For load we have:
https://godbolt.org/z/1oTTnncbx - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: <=8.0`
So pick cost of `16`.

For store we have:
https://godbolt.org/z/1oTTnncbx - for intels `Block RThroughput: =16.0`; for ryzens, `Block RThroughput: =8.0`
So pick cost of `16`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110761
2021-10-01 17:48:14 +03:00
Roman Lebedev 80cd8da78d
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/M9eev3xe8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.

For store we have:
https://godbolt.org/z/M9eev3xe8 - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: =4.0`
So pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110756
2021-10-01 17:48:14 +03:00
Roman Lebedev 3a0643e9c2
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/n8aMKeo4E - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110755
2021-10-01 17:48:13 +03:00
Roman Lebedev b12aeaec9a
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/EM5Ean7bd - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110754
2021-10-01 17:48:13 +03:00
Roman Lebedev f44d9009c2
[X86][Costmodel] Load/store i32/f32 Stride=2 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/4rY96hnGT - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: =1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/vbo37Y3r9 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: =0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110753
2021-10-01 17:48:13 +03:00
Fraser Cormack 52c60459f5 [RISCV][NFC] Reformat a line of frame lowering code 2021-10-01 14:25:47 +01:00
Fraser Cormack fcaa64d947 [RISCV][NFC] Add closing parentheses to frame layout comments 2021-10-01 11:58:55 +01:00
Matthew Devereau f085a9db8b [AArch64][SVE] Replace fmul, fadd and fsub LLVM IR instrinsics with LLVM IR binary ops
Replacing fmul and fadd instrinsics with their binary ops results
more succinct AArch64 SVE output, e.g.:

4:   65428041 	fmul	z1.h, p0/m, z1.h, z2.h
8:   65408020 	fadd	z0.h, p0/m, z0.h, z1.h
->
4:   65620020   fmla    z0.h, p0/m, z1.h, z2.h
2021-10-01 11:24:46 +01:00
Krasimir Georgiev 685f1bfd0a Revert "[LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns"
It appears to cause stage2 clang build failures, e.g.,
https://lab.llvm.org/buildbot/#/builders/74/builds/7145.

This reverts commit 1fb37334bd.
2021-10-01 11:39:43 +02:00
David Sherwood 1fb37334bd [LoopVectorize] Permit vectorisation of more select(cmp(), X, Y) reduction patterns
This patch adds further support for vectorisation of loops that involve
selecting an integer value based on a previous comparison. Consider the
following C++ loop:

  int r = a;
  for (int i = 0; i < n; i++) {
    if (src[i] > 3) {
      r = b;
    }
    src[i] += 2;
  }

We should be able to vectorise this loop because all we are doing is
selecting between two states - 'a' and 'b' - both of which are loop
invariant. This just involves building a vector of values that contain
either 'a' or 'b', where the final reduced value will be 'b' if any lane
contains 'b'.

The IR generated by clang typically looks like this:

  %phi = phi i32 [ %a, %entry ], [ %phi.update, %for.body ]
  ...
  %pred = icmp ugt i32 %val, i32 3
  %phi.update = select i1 %pred, i32 %b, i32 %phi

We already detect min/max patterns, which also involve a select + cmp.
However, with the min/max patterns we are selecting loaded values (and
hence loop variant) in the loop. In addition we only support certain
cmp predicates. This patch adds a new pattern matching function
(isSelectCmpPattern) and new RecurKind enums - SelectICmp & SelectFCmp.
We only support selecting values that are integer and loop invariant,
however we can support any kind of compare - integer or float.

Tests have been added here:

  Transforms/LoopVectorize/AArch64/sve-select-cmp.ll
  Transforms/LoopVectorize/select-cmp-predicated.ll
  Transforms/LoopVectorize/select-cmp.ll

Differential Revision: https://reviews.llvm.org/D108136
2021-10-01 08:41:03 +01:00
Yonghong Song 3562ad3ebe BPF: implement isLegalAddressingMode() properly
Latest upstream llvm caused the kernel bpf selftest emitting the
following warnings:

  In file included from progs/profiler3.c:6:
  progs/profiler.inc.h:489:2: warning: loop not unrolled:
    the optimizer was unable to perform the requested transformation;
    the transformation might be disabled or specified as part of an unsupported
    transformation ordering [-Wpass-failed=transform-warning]
          for (int i = 0; i < MAX_PATH_DEPTH; i++) {
          ^

Further bisecting shows this SimplifyCFG patch ([1]) changed
the condition on how to fold branch to common dest. This caused
some unroll pragma is not honored in selftests/bpf.

The patch [1] test getUserCost() as the condition to
perform the certain basic block folding transformation.
For the above example, before the loop unroll pass, the control flow
looks like:
    cond_block:
       branch target: body_block, cleanup_block
    body_block:
       branch target: cleanup_block, end_block
    end_block:
       branch target: cleanup_block, end10_block
    end10_block:
       %add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
       %inc = add nuw nsw i32 %i.0, 1
       branch target: cond_block

In the above, %call2 is an unknown scalar.

Before patch [1], end10_block will be folded into end_block, forming
the code like
    cond_block:
       branch target: body_block, cleanup_block
    body_block:
       branch target: cleanup_block, end_block
    end_block:
       branch target: cleanup_block, cond_block
and the compiler is happy to perform unrolling.

With patch [1], getUserCost(), which calls getGEPCost(), which calls
isLegalAddressingMode() in TargetLoweringBase.cpp, considers IR
  %add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
is free, so the above basic block folding transformation is not performed
and unrolling does not happen.

For BPF target, the IR
  %add.ptr = getelementptr i8, i8* %payload.addr.0, i64 %call2
is not free and we don't have ld/st instruction address with 'r+r' mode.

This patch implemented a BPF hook for isLegalAddressingMode(), which is
identical to Mips isLegalAddressingMode() implementation where
the address pattern like 'r+r', 'r+r+i' or '2*r' are not allowed.
With testing kernel bpf selftests, all loop not unrolled warnings
are gone and all selftests run successfully.

  [1] https://reviews.llvm.org/D108837

Differential Revision: https://reviews.llvm.org/D110789
2021-09-30 16:41:15 -07:00
Craig Topper a21c557955 [RISCV] Remove Zbproposedc extension
This consists of 3 compressed instructions, c.not, c.neg, and c.zext.w.
I believe these have been picked up by the Zce effort using different
encodings. I don't think it makes sense to keep them in bitmanip. It
will eventually cause a conflict if/when Zce is implemented in llvm.

Differential Revision: https://reviews.llvm.org/D110871
2021-09-30 14:23:05 -07:00
Albion Fung 4195ed9959 [PowerPC] Improved codegen related to xscvdpsxws/xscvdpuxws
This patch removes the uneccessary mf/mtvsr generated in conjunction
with xscvdpsxws/xscvdpuxws.

Differential revision: https://reviews.llvm.org/D109902
2021-09-30 14:31:00 -05:00
Stanislav Mekhanoshin 244aa7f735 [AMDGPU] move hasAGPRs/hasVGPRs into header
It is now very simple and can go right into the header
allowing optimizer to combine callers, such as isVGPRClass
and similar.

It does not need anything from the TRI itself anymore, so
make it static class member along with the callers.

Differential Revision: https://reviews.llvm.org/D110762
2021-09-30 10:02:02 -07:00
Kazu Hirata f631173d80 [llvm] Migrate from arg_operands to args (NFC)
Note that arg_operands is considered a legacy name.  See
llvm/include/llvm/IR/InstrTypes.h for details.
2021-09-30 08:51:21 -07:00
David Green f9aa8623fe [ARM] Add more MVE intrinsics to sink splats to
This adds a few more unpredicated intrinsics to sink splats to, in order
to create more qr instruction variants. Notably this includes
saddsat/uaddsat but also some of the unpredicated mve intrinsics.

Differential Revision: https://reviews.llvm.org/D110333
2021-09-30 14:41:23 +01:00
Jingu Kang 13f3c39f36 Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit c07f709969 with
bug fixes.

Differential Revision: https://reviews.llvm.org/D109963
2021-09-30 09:27:08 +01:00
Ruiling Song 52785989e9 AMDGPU: Broadcast scalar boolean to vector boolean explicitly
This is used to fix wrong code generation of s_add_co_select_user in
test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll

  s_addc_u32 s4, s6, 0
  s_cselect_b64 vcc, 1, 0    <-- vcc set as 0x1 if SCC==1
  v_mov_b32_e32 v1, s4
  s_cmp_gt_u32 s6, 31
  v_cndmask_b32_e32 v1, 0, v1, vcc

If the s_addc_u32 set SCC, then we will get value 0x1 in VCC.
The v_cndmask will do per thread selection with VCC as condition
register. As VCC only gets the first bit being set, only the first
thread/lane in destination register can get correct result if the
very first lane is active. In fact, we should broadcast the value to all
active lanes of the final register.

The idea here is doing this broadcast to vector boolean explicitly
instead of lowering it into a COPY from SCC which would be interpreted as
selecting between 0/1.

This is used to replace D109754.

Reviewed-by: foad, alex-t

Differential Revision: https://reviews.llvm.org/D109889
2021-09-30 10:15:01 +08:00
Amara Emerson 1c0e8a98e4 [AArch64][GlobalISel] Widen G_BUILD_VECTOR source & dest element types to s8. 2021-09-29 15:11:30 -07:00
Ricky Taylor e1e3b6ee72 [M68k] Avoid UB in disassembler
When reading 32 bits a 32-bit shift would be executed.

This is undefined behaviour, but in this case we can just replace the
entire scratch value to avoid it.

Differential Revision: https://reviews.llvm.org/D110769
2021-09-29 22:07:14 +01:00
Stefan Pintilie fb4e44c4e7 [PowerPC] The builtins load8r and store8r are Power 7 plus.
This patch makes sure that the builtins __builtin_ppc_load8r and
__ builtin_ppc_store8r are only available for Power 7 and up.
Currently the builtins seem to produce incorrect code if used for
Power 6 or before.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D110653
2021-09-29 14:34:40 -05:00
Roman Lebedev 2d42a192e0
[X86][Costmodel] Load/store i8 Stride=2 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.5`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110709
2021-09-29 21:52:45 +03:00
Roman Lebedev bac60c55e0
[X86][Costmodel] Load/store i8 Stride=2 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/a9hv4z47v - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/6GfPn1b79 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `3`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110708
2021-09-29 21:52:45 +03:00
Roman Lebedev 1962185671
[X86][Costmodel] Load/store i8 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

Identical to VF=2.

For load we have:
https://godbolt.org/z/4TEbdzbMM - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/MYfzGPf3Y - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110705
2021-09-29 21:52:45 +03:00
Roman Lebedev 08face1f9a
[X86][Costmodel] Load/store i8 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

Identical to VF=2.

For load we have:
https://godbolt.org/z/sGE41GYo7 - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/ba5r3s9xa - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110704
2021-09-29 21:52:45 +03:00
Roman Lebedev 7d52628eb0
[X86][Costmodel] Load/store i8 Stride=2 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/caKqjr9hb - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/6TTn3eKj8 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110702
2021-09-29 21:52:44 +03:00
Jay Foad f9b68304a2 [AMDGPU] Enable machine verification after AMDGPUISelDAGToDAG
This was introduced in D32628 but it does not seem to be required any
more. At least it does not show any problems in check-llvm in an
LLVM_ENABLE_EXPENSIVE_CHECKS build.

Differential Revision: https://reviews.llvm.org/D110692
2021-09-29 18:47:19 +01:00
Kazu Hirata 9a640a1cb8 [AArch64] Remove redundant declaration createAArch64ObjectTargetStreamer (NFC)
Note that createAArch64ObjectTargetStreamer is declared in
AArch64TargetStreamer.h and defined in AArch64TargetStreamer.cpp.

Identified with readability-redundant-declaration.
2021-09-29 09:08:41 -07:00
David Green e9adcbde31 [AArch64] Model Cortex-A55 Q register NEON instructions
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that read/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units, and starting a Schedule Group and WriteVd as taking
one as before.

I believe this is more correct, even if it does not lead to much better
performance.

Differential Revision: https://reviews.llvm.org/D108766
2021-09-29 16:55:31 +01:00
Jay Foad 9886f21bc1 [MSP430] Recognize Bi as an indirect branch in analyzeBranch. NFC.
Recognize Bi as an unconditional branch, just like JMP. This allows
machine verification to run after MSP430BranchSelector without failing
this assertion:

virtual bool llvm::MSP430InstrInfo::analyzeBranch(llvm::MachineBasicBlock &, llvm::MachineBasicBlock *&, llvm::MachineBasicBlock *&, SmallVectorImpl<llvm::MachineOperand> &, bool) const: Assertion `I->getOpcode() == MSP430::JCC && "Invalid conditional branch"' failed.

Note that machine verification is currently disabled after
addPreEmitPass passes because of problems on other targets, so this is
currently NFC.

Differential Revision: https://reviews.llvm.org/D110691
2021-09-29 16:43:11 +01:00
Simon Pilgrim 676f2809b5 [CostModel][AArch64] Don't dereference CostTblEntry before null check.
Fix static analysis warning that we check for null Entry after dereferencing it.

I don't think this can actually happen as i8/i16 should legalize to use the i32 path which should return a cost - but I'd rather play it safe that rely on an implicit type legalization.
2021-09-29 16:35:29 +01:00
David Green 8a645fc44b [AArch64] Enable type promotion for AArch64
This enables the type promotion pass for AArch64, which acts as a
CodeGenPrepare pass to promote illegal integers to legal ones,
especially useful for removing extends that would otherwise require
cross-basic-block analysis.

I have enabled this generally, for both ISel and GlobalISel. In some
quick experiments it appeared to help GlobalISel remove extra extends in
places too, but that might just be missing optimizations that are better
left for later. We can disable it again if required.

In my experiments, this can improvement performance in some cases, and
codesize was a small improvement. SPEC was a very small improvement,
within the noise. Some of the test cases show extends being moved out of
loops, often when the extend would be part of a cmp operand, but that
should reduce the latency of the instruction in the loop on many cpus.
The signed-truncation-check tests are increasing as they are no longer
matching specific DAG combines.

We also hope to add some additional improvements to the pass in the near
future, to capture more cases of promoting extends through phis that
have come up in a few places lately.

Differential Revision: https://reviews.llvm.org/D110239
2021-09-29 15:13:12 +01:00
Nemanja Ivanovic 09b67aa1c3 [PowerPC] Implement builtin for vbpermd
The instruction has similar semantics to vbpermq but for doublewords.
It was added in Power9 and the ABI documents the builtin.

Differential revision: https://reviews.llvm.org/D107899
2021-09-29 06:34:31 -05:00
Sander de Smalen 87bcbd61b5 [AArch64][SVE] Fix extract_subvector patterns for unpacked fp types.
The patterns added in D110163 were incorrect, since it used the wrong
element widths for its shuffles.

Example for nxv2f16 extract_subvector(nxv8f16 %in, 6):
  <a|b|c|d|e|f|g|h>
               ^^^
           extract g and h.

  => UUNPKHI .h -> .s results in:
  <e  |f  |g  |h  >

  => UUNPKHI .s -> .d results in:
  <g      |h      >

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D110523
2021-09-29 11:14:49 +01:00
Martin Storsjö d6216e2cd1 [X86] Fix handling of i128<->fp on Windows
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.

This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.

Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).

X86CallingConv.td doesn't seem to handle i128 at all, otherwise
the windows specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.

This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.

Differential Revision: https://reviews.llvm.org/D110413
2021-09-29 13:05:59 +03:00
Amara Emerson e6ed880e47 [AArch64][GlobalISel] Make some vector G_SMULH/G_UMULH legal. 2021-09-29 02:35:29 -07:00
hsmahesha c0735cb9f1 [AMDGPU] Do not internalize ASan device library functions.
ASan device library functions (those starts with the prefix __asan_)
are at the moment undergoing through undesired optimizations due to
internalization. Hence, in order to avoid such undesired optimizations
on ASan device library functions, do not internalize them in the first
place.

Reviewed By: yaxunl

Differential Revision: https://reviews.llvm.org/D110468
2021-09-29 07:19:02 +05:30
Sterling Augustine c07f709969 Revert "Recommit "[AArch64] Split bitmask immediate of bitwise AND operation""
This reverts commit 73a196a11c.

Causes crashes as reported in https://reviews.llvm.org/D109963
2021-09-28 18:02:06 -07:00
Jessica Paquette 241c7b1473 [AArch64][GlobalISel] Run overlapping_and after legalization
When we have code with truncates, those truncates may be changed into G_ANDs
with constants. These may, in turn, feed into other G_AND instructions.

Running this combine post-legalize allows us to optimize examples like this one:

https://godbolt.org/z/zrGY4dfEW

SDAG currently optimizes the example above so that there is only one `and`.
GISel doesn't optimize it, because the G_AND we'd optimize here is translated
as a G_TRUNC. Later, that G_TRUNC is turned into a G_AND during legalization.

Differential Revision: https://reviews.llvm.org/D110667
2021-09-28 17:13:34 -07:00
Roman Lebedev b6b7860954
[X86][Costmodel] Load/store i16 Stride=6 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For this tuple, measuring becomes problematic since there's a lot of spilling going on,
but apparently all these memory ops do not affect worst-case estimate at all here.

For load we have:
https://godbolt.org/z/5qGb9odP6 - for intels `Block RThroughput: <=106.0`; for ryzens, `Block RThroughput: <=34.8`
So pick cost of `106`.

For store we have:
https://godbolt.org/z/KrWcv4Ph7 - for intels `Block RThroughput: =58.0`; for ryzens, `Block RThroughput: <=20.5`
So pick cost of `58`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110593
2021-09-28 19:15:08 +03:00
Roman Lebedev 24e42f7d28
[X86][Costmodel] Load/store i16 Stride=6 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/3Tc5s897j - for intels `Block RThroughput: =39.0`; for ryzens, `Block RThroughput: <=13.5`
So pick cost of `39`.

For store we have:
https://godbolt.org/z/fo1h9E67e - for intels `Block RThroughput: =21.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `21`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110592
2021-09-28 19:15:07 +03:00
Roman Lebedev b3011bcc78
[X86][Costmodel] Load/store i16 Stride=6 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=4.5`
So pick cost of `9`.

For store we have:
https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=6.0`
So pick cost of `15`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110591
2021-09-28 19:15:01 +03:00
Roman Lebedev aa93c55889
[X86][Costmodel] Load/store i16 Stride=6 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/bhscej4WM - for intels `Block RThroughput: =13.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `13`.

For store we have:
https://godbolt.org/z/Yf4Pfnxbq - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=3.5`
So pick cost of `10`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110590
2021-09-28 19:14:56 +03:00
Quinn Pham 70391b3468 [PowerPC] FP compare and test XL compat builtins.
This patch is in a series of patches to provide builtins for
compatability with the XL compiler. This patch adds builtins for compare
exponent and test data class operations on floating point values.

Reviewed By: #powerpc, lei

Differential Revision: https://reviews.llvm.org/D109437
2021-09-28 11:01:51 -05:00
Kazu Hirata 9e4f1f9265 [SystemZ] Remove redundant declaration SystemZMnemonicSpellCheck (NFC)
Note that SystemZMnemonicSpellCheck is defined in
SystemZGenAsmMatcher.inc, which SystemZAsmParser.cpp includes.

Identified with readability-redundant-declaration.
2021-09-28 08:38:05 -07:00
David Green fdd8c10959 [ARM] Delay reverting WLS in arm-block-placement
As we have to split blocks, we may be left in an invalid loop state
after a WLS is reverted to a DLS. Instead remember the WLS that could
not be fixed and revert them after finishing processing all other loops.

Differential Revision: https://reviews.llvm.org/D110567
2021-09-28 15:38:29 +01:00
Jingu Kang 73a196a11c Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit f85d8a5bed
with bug fixes.

Original message:

    MOVi32imm + ANDWrr ==> ANDWri + ANDWri
    MOVi64imm + ANDXrr ==> ANDXri + ANDXri

    The mov pseudo instruction could be expanded to multiple mov instructions later.
    In this case, try to split the constant operand of mov instruction into two
    bitmask immediates. It makes only two AND instructions intead of multiple
    mov + and instructions.

    Added a peephole optimization pass on MIR level to implement it.

    Differential Revision: https://reviews.llvm.org/D109963
2021-09-28 15:26:29 +01:00
David Green 2c53215e99 [ARM] Skip debug info in recomputeVPTBlockMask
The ARMLowOverheadLoops pass recalculates VPT block masks when it
converts VCMP's inside VPT blocks into VPT's. The function to do so
doesn't seem to handle debug info though, leading to invalid block
creation or asserts at compile time. Make sure the function skips any
debug info between the MVE instructions it inspects.

Differential Revision: https://reviews.llvm.org/D110564
2021-09-28 14:58:13 +01:00
hyeongyu kim 86bf234d0b [IR] Change the default value of InstertElement to poison (1/4)
This patch is for fixing potential insertElement-related bugs like D93818.
```
V = UndefValue::get(VecTy);
for(...)
  V = Builder.CreateInsertElementy(V, Elt, Idx);
=>
V = PoisonValue::get(VecTy);
for(...)
  V = Builder.CreateInsertElementy(V, Elt, Idx);
```
Like above, this patch changes the placeholder V to poison.
The patch will be separated into several commits.

Reviewed By: aqjune

Differential Revision: https://reviews.llvm.org/D110311
2021-09-28 22:29:16 +09:00
Jingu Kang f85d8a5bed Revert "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts commit 864b206796.

Reverting due to error on buildbots.
2021-09-28 13:28:09 +01:00
Jingu Kang 864b206796 [AArch64] Split bitmask immediate of bitwise AND operation
MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri

The mov pseudo instruction could be expanded to multiple mov instructions later.
In this case, try to split the constant operand of mov instruction into two
bitmask immediates. It makes only two AND instructions intead of multiple
mov + and instructions.

Added a peephole optimization pass on MIR level to implement it.

Differential Revision: https://reviews.llvm.org/D109963
2021-09-28 11:57:43 +01:00
Liu, Chen3 57e8f840b6 [X86][FP16] Fix a bug when Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A).
This bug was introduced by D109953. The operand order of generated FMA
is wrong.

Differential Revision: https://reviews.llvm.org/D110606
2021-09-28 11:38:53 +08:00
Jozef Lawrynowicz 6cfb4d46ba [llvm-readobj] Support dumping of MSP430 ELF attributes
The MSP430 ABI supports build attributes for specifying
the ISA, code model, data model and enum size in ELF object files.

Differential Revision: https://reviews.llvm.org/D107969
2021-09-28 00:56:11 +03:00
Roman Lebedev 2a7a768dad
[X86][Costmodel] Load/store i16 Stride=4 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For this tuple, measuring becomes problematic since there's a lot of spilling going on,
but apparently all these memory ops do not affect worst-case estimate at all here.

For load we have:
https://godbolt.org/z/zP4hd8MT6 - for intels `Block RThroughput: =150.0`; for ryzens, `Block RThroughput: <=59`
So pick cost of `150`.

For store we have:
https://godbolt.org/z/vKb8zTK8E - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=24.0`
So pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110548
2021-09-27 22:20:01 +03:00
Roman Lebedev ee5a050e2e
[X86][Costmodel] Load/store i16 Stride=4 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =75.0`; for ryzens, `Block RThroughput: <=29.5`
So pick cost of `75`. (note that `# 32-byte Reload` does not affect throughput there.)

For store we have:
https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110543
2021-09-27 22:20:01 +03:00
Roman Lebedev 5615d6a6dd
[X86][Costmodel] Load/store i16 Stride=4 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/dd8T5P471 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=14.5`
So pick cost of `33`.

For store we have:
https://godbolt.org/z/zPxcKWhn4 - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=6.0`
So pick cost of `10`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110541
2021-09-27 22:20:01 +03:00
Roman Lebedev df2b42d12e
[X86][Costmodel] Load/store i16 Stride=4 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/rnsf639Wh - for intels `Block RThroughput: =17.0`; for ryzens, `Block RThroughput: <=7.5`
So pick cost of `17`.

For store we have:
https://godbolt.org/z/565KKrcY6 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110537
2021-09-27 22:20:01 +03:00
Roman Lebedev 45caac91c4
[X86][Costmodel] Load/store i16 Stride=4 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/5EYc6r9nh - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/z61e5d6GE - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110536
2021-09-27 22:20:01 +03:00
Praveen Velliengiri e90b512c4d [AMDGPU] Change ASAN init/fini kernels linkage to external.
HSA runtime fails to find the symbols for Init and Fini kernels as
they mark with internal linkage, changing the linkage to external
to fix those errors.

Differential Revision: https://reviews.llvm.org/D110054
2021-09-27 11:50:37 -06:00
Quinn Pham 682e15f371 [PowerPC] Fix td pattern for P10 VSLDBI and VSRDBI
This patch fixes the pattern for the P10 instructions Vector Shift Left
Double by Bit Immediate VN-form and Vector Shift Right Double by Bit
Immediate VN-form. The third argument should be a target constant (`timm`)
instead of an `i32` because an immediate is expected.

Reviewed By: lei

Differential Revision: https://reviews.llvm.org/D109920
2021-09-27 12:36:18 -05:00
Craig Topper a2a07e8db3 [RISCV] Fold store of vmv.x.s to a vse with VL=1.
This can avoid a loss of decoupling with the scalar unit on cores
with decoupled scalar and vector units.

We should support FP too, but those use extract_element and not a
custom ISD node so it is a little different. I also left a FIXME
in the test for i64 extract and store on RV32.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D109482
2021-09-27 09:54:46 -07:00
Craig Topper 933182e948 [RISCV] Improve support for forming widening multiplies when one input is a scalar splat.
If one input of a fixed vector multiply is a sign/zero extend and
the other operand is a splat of a scalar, we can use a widening
multiply if the scalar value has sufficient sign/zero bits.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D110028
2021-09-27 09:37:07 -07:00
Kazu Hirata b68a62b3a9 [Lanai] Remove redundant declaration getTheLanaiTarget (NFC)
Note that getTheLanaiTarget is declared in
TargetInfo/LanaiTargetInfo.h, which LanaiDisassembler.cpp includes.

Identified with readability-redundant-declaration.
2021-09-27 08:58:27 -07:00
Sebastian Neubauer bf980930e5 [AMDGPU] Ignore KILLs when forming clauses
KILL instructions are sometimes present and prevented hard
clauses from being formed.

Fix this by ignoring all meta instructions in clauses.

Differential Revision: https://reviews.llvm.org/D106042
2021-09-27 16:33:52 +02:00
Roman Lebedev 7424deb743
[X86][Costmodel] Load/store i16 Stride=2 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/q6GbK89br - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `18`.

For store we have:
https://godbolt.org/z/Yzfoo5TnW - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110507
2021-09-27 14:21:12 +03:00
Roman Lebedev a5113e9445
[X86][Costmodel] Load/store i16 Stride=2 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.5`
So pick cost of `9`.

For store we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110506
2021-09-27 14:20:11 +03:00
Roman Lebedev 70c90cc5bd
[X86][Costmodel] Load/store i16 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/e5YE99a4P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/3vM4KsE1n - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `3`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110505
2021-09-27 14:18:29 +03:00
Roman Lebedev 49e532aa52
[X86][Costmodel] Load/store i16 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1j3nf3dro - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/4n1zvP37j - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110504
2021-09-27 14:15:25 +03:00
David Green bb2d23dcd4 [ARM] Improve detection of fallthough when aligning blocks
We align non-fallthrough branches under Cortex-M at O3 to lead to fewer
instruction fetches. This improves that for the block after a LE or
LETP. These blocks will still have terminating branches until the
LowOverheadLoops pass is run (as they are not handled by analyzeBranch,
the branch is not removed until later), so canFallThrough will return
false. These extra branches will eventually be removed, leaving a
fallthrough, so treat them as such and don't add unnecessary alignments.

Differential Revision: https://reviews.llvm.org/D107810
2021-09-27 11:21:21 +01:00
Simon Pilgrim 468ff703e1 [X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974)
This intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.

Fixes PR51974
2021-09-27 10:41:22 +01:00
Fraser Cormack d48f6df1f8 [RISCV] Create the correct mask type when lowering EXTRACT_VECTOR_ELT
This particular case was creating a `VMSET_VL` using the old
fixed-length type in order to pass a mask to other custom nodes
operating on the scalable container type. This kind of thing wasn't
caught for us; I only noticed when experimenting with odd-length
vectors, where it was trying to generate an invalid `v3i1` MVT.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110420
2021-09-27 09:43:40 +01:00
Freddy Ye 902ec6142a [X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16)
When AVX512FP16 is enabled, FROUND(f16) cannot be dealt with
TypeLegalize, and no libcall in libm is ready for fround(f16) now.
FROUNDEVEN(f16) has related instruction in AVX512FP16.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110312
2021-09-27 13:35:03 +08:00
Simon Pilgrim c0eff50fc5 [X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
2021-09-26 19:37:23 +01:00
Simon Pilgrim ed3e4917b3 [X86] Fold PACK(*_EXTEND_VECTOR_INREG, UNDEF) -> *_EXTEND_VECTOR_INREG
For 128-bit vectors, we can remove a PACK of a EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.
2021-09-26 19:37:22 +01:00
Simon Pilgrim 3fe9767204 [X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W))
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. its zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.

There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
2021-09-26 18:08:29 +01:00
Kazu Hirata c4ae4a745d [RISCV] Remove redundant declaration RISCVMnemonicSpellCheck (NFC)
Note that RISCVMnemonicSpellCheck is defined in
RISCVGenAsmMatcher.inc, which RISCVAsmParser.cpp includes.

Identified with readability-redundant-declaration.
2021-09-26 09:26:57 -07:00
Roman Lebedev d9413f46b3
[X86][Costmodel] Load/store i16 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/M8vEKs5jY - for intels `Block RThroughput: =2.0`;
                                  for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/Kx1nKz7je - for intels `Block RThroughput: =1.0`;
                                  for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103144
2021-09-26 19:13:23 +03:00
Simon Pilgrim 3538ee763d [CostModel][X86] Improve AVX1/AVX2 v16i32->v16i16/v16i8 truncation costs (PR51972)
Based off worst case btver2 (AVX1) and haswell (AVX2) llvm-mca reports
2021-09-26 13:43:46 +01:00
Simon Pilgrim 8c83bd3bd4 [CostModel][X86] Adjust vXi32 multiply costs if it can be performed using PMADDWD
Update the costs to match the codegen from combineMulToPMADDWD - not only can we use PMADDWD is its zero-extended, but also if its a constant or sign-extended from a vXi16 (which can be replaced with a zero-extension).
2021-09-25 16:28:48 +01:00
Simon Pilgrim eb7c78c2c5 [X86][SSE] combineMulToPMADDWD - mask off upper bits of sign-extended vXi32 constants
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
2021-09-25 15:50:45 +01:00
Simon Pilgrim 2a4fa0c27c [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on sub-128 bit vectors 2021-09-25 15:50:45 +01:00
Kazu Hirata 44c401bdc3 [Mips] Remove redundant declarations (NFC)
Note that identical declarations immediately precede what's being
removed in this patch.

Identified with readability-redundant-declaration.
2021-09-25 07:41:11 -07:00
Simon Pilgrim f5a26ccae2 [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on pre-SSE41 targets
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
2021-09-25 14:35:31 +01:00
Simon Pilgrim 4c72b10f0a [X86] X86FastISel::fastMaterializeConstant - break if-else chain to fix llvm-else-after-return warning. NFCI
All previous if-else cases return
2021-09-25 14:31:14 +01:00
Simon Pilgrim a25f25c3b7 [X86] combineShiftToPMULH - relax from ISA from SSE41 to SSE2
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.

We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
2021-09-25 14:08:03 +01:00
David Green 883758ed48 [ARM] Fix Arm block placement creating branches after jump tables.
Given:
 - A jump table
 - Which jumps to the next block
 - The next block ends in a WLS
 - Where the WLS conditionally jumps to block earlier in the program.

The Arm block placement pass would attempt to move the block containing
the WLS earlier, as the WLS instruction can only branch forward. In
doing so it would add a branch from the jumptable block to the WLS
block, thinking it previously fell-through.

This in itself would be fine, if a little inefficient, but the constant
island pass expects all instructions after a jump-table branch to have
been removed by analyzeBranch. So it gets confused and can assign the
same labels to multiple jump table blocks.

I've changed the condition to the same as used in analyzeBranch.
2021-09-25 11:32:25 +01:00
Jim Lin ed687c0211 [RISCV] Fix incorrect operand type of inst alias for InstR4
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110381
2021-09-25 11:25:12 +08:00
Craig Topper 715cf6ffb9 [RISCV] Add another isel optimization for (and (shl X, c2), c1).
Where c1 is a shifted mask with 32-c2 leading zeros and c3 trailing
zeros and c3>c2. We can select it as (slli (srliw X, c3-c2), c3).
2021-09-24 15:10:25 -07:00
Stanislav Mekhanoshin cf74ef134c [AMDGPU] Limit promote alloca max size in functions
Non-entry functions have 32 caller saved VGPRs available. If we
promote alloca to consume more registers we will have to spill
CSRs. There is no reason to eliminate scratch access to get
another scratch access instead.

Differential Revision: https://reviews.llvm.org/D110372
2021-09-24 13:38:39 -07:00
Anirudh Prasad a9ae2436fc [SystemZ][z/OS] Introduce the GOFFMCAsmInfo Interface for z/OS
- This patch adds in the GOFFMCAsmInfo interfaces for the z/OS target.
- This patch decouples the previously existing SystemZMCAsmInfo interface for the ELF target and the z/OS target.
- This patch also removes a small test in the SystemZAsmLexerTest.cpp. The reason for this is because, the test is set up for the s390x-ibm-linux (SystemZ ELF triple), and the test checks a function which is overridden only for the z/OS target. The reason we can't change the test to use a z/OS triple outright is because there is still missing support which prevents the successful running of a test (assert in AsmParser.cpp due to missing GOFFAsmParser support)

Reviewed By: uweigand, abhina.sreeskantharajan

Differential Revision: https://reviews.llvm.org/D110077
2021-09-24 16:25:41 -04:00
Anirudh Prasad ebe06910ce [NFC] Replace hard-coded usages of SystemZ::R15D with SpecialRegisters API
This patch changes hard-coded usages of SystemZ::R15D with calls to the getStackPointerRegister function. Uses in the LowerCall function are avoided to avoid merge conflicts with an expected upcoming patch.

Reviewed By: uweigand

Differential Revision: https://reviews.llvm.org/D109702
2021-09-24 15:20:57 -04:00
Anirudh Prasad e09a1dc475 [SystemZ][z/OS] Add GOFF Support to the DataLayout
- This patch adds in the GOFF mangling support to the LLVM data layout string. A corresponding additional line has been added into the data layout section in the language reference documentation.
- Furthermore, this patch also sets the right data layout string for the z/OS target in the SystemZ backend.

Reviewed By: uweigand, Kai, abhina.sreeskantharajan, MaskRay

Differential Revision: https://reviews.llvm.org/D109362
2021-09-24 14:09:01 -04:00
Victor Huang 6e1aaf18af [PowerPC] Mark splat immediate instructions as rematerializable
This patch marks splat immediate instructions XXSPLTIW and XXSPLTIDP as
rematerializable to prevent MachineLICM from moving them out of loops.

Reviewed By: lei, amy

Differential revision: https://reviews.llvm.org/D108823
2021-09-24 12:03:34 -05:00
Stanislav Mekhanoshin 082e22f3d7 [AMDGPU] Always reserve flat scratch SGPR for architected flat scratch
With architected flat scratch it becomes readonly. We must always
reserve SGPR pair for it even if we do not use scratch at all since
an attempt to write to SGPRs mapped to FLAT_SCRATCH results in
memory violation.

This is not needed since GFX10 with architected flat scratch though
since special SGPRs are not carving space from normal SGPRs.

Differential Revision: https://reviews.llvm.org/D110376
2021-09-24 09:46:31 -07:00
Simon Pilgrim d8fc9f8727 [X86][SSE] combineMulToPMADDWD - replace sext(v8i16) -> zext(v8i16)
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
2021-09-24 16:42:01 +01:00
Sanjay Patel 09e71c367a [x86] convert logic-of-FP-compares to FP logic-of-vector-compares
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.

By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.

Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.

Differential Revision: https://reviews.llvm.org/D110342
2021-09-24 11:38:19 -04:00
Hsiangkai Wang 7d39a8a921 [RISCV] (1/2) Add the tail policy argument to builtins/intrinsics.
Add the tail policy argument to LLVM IR intrinsics. There are two policies for tail elements. Tail agnostic means users do not care about the values in the tail elements and tail undisturbed means the values in the tail elements need to be kept after the operation. In order to let users control the tail policy, we add an additional argument at the end of the argument list.

For unmasked operations, we have no maskedoff and the tail policy is always tail agnostic. If users want to keep tail elements under unmasked operations, they could use all one mask in the masked operations to do it. So, we only add the additional argument for masked operations for most cases. There are exceptions listed below.

In this patch, we do not handle the following cases to reduce the complexity of the patch. There could be two separate patches for them.

* Use dest argument to control tail policy
vmerge.vvm/vmerge.vxm/vmerge.vim (add _t builtins with additional dest argument)
vfmerge.vfm (add _t builtins with additional dest argument)
vmv.v.v (add _t builtins with additional dest argument)
vmv.v.x (add _t builtins with additional dest argument)
vmv.v.i (add _t builtins with additional dest argument)
vfmv.v.f (add _t builtins with additional dest argument)
vadc.vvm/vadc.vxm/vadc.vim (add _t builtins with additional dest argument)
vsbc.vvm/vsbc.vxm (add _t builtins with additional dest argument)

* Always has tail argument for masked/unmasked intrinsics
Vector Single-Width Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Single-Width Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Reduction Operations (add _t and _mt builtins)
Vector Slideup Instructions (add _t and _mt builtins)
Vector Slidedown Instructions (add _t and _mt builtins)

Discussion: https://github.com/riscv/rvv-intrinsic-doc/pull/101

Differential Revision: https://reviews.llvm.org/D105092
2021-09-24 17:09:50 +08:00
Simon Pilgrim dade83c02a [X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port.
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-24 10:06:14 +01:00
Jonas Paulsson ea92283449 [SystemZ] Implement ISD::BITCAST for fp128 -> i128.
The type legalizer has by default no method of doing this bitcast other than
storing and reloading the value from stack.

This patch implements a custom lowering of this operation using extractions
of subregs (z13 and earlier using FP128 register pairs), or of vector
elements (with 'vector enhancements 1' using VR128 FP registers).

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D110346
2021-09-24 10:26:45 +02:00
Amara Emerson 661ab70314 [AArch64][GlobalISel] Fix crash in the extend(extract_vector_elt) optimization.
It was assuming that GPR extends could only have destination sizes of 32 or 64
bits, but for AArch64 we allow < 32 bits even without matching size physregs.
2021-09-23 23:07:16 -07:00
Christudasan Devadasan 7a62a5b56d [AMDGPU] Legalize initialized LDS variables
We don't allow an initializer for LDS variables
and there is an early abort during instruction
selection. This patch legalizes them by ignoring
the init values. During assembly emission, proper
error reporting already exists for such instances.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109901
2021-09-23 22:53:20 -04:00
Craig Topper 40b230f685 [RISCV] Limit transformAddImmMulImm to prevent an infinite loop.
This fixes an issue reported in D108607.
2021-09-23 15:53:11 -07:00
Vang Thao 1443ba6163 [AMDGPU] Propagate defining src reg for AGPR to AGPR Copys
On targets that do not support AGPR to AGPR copying directly, try to find the
defining accvgpr_write and propagate its source vgpr register to the copies
before register allocation so the source vgpr register does not get clobbered.

The postrapseudos pass also attempt to propagate the defining accvgpr_write but
if the register to propagate is clobbered, it will give up and create new
temporary vgpr registers instead.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D108830
2021-09-23 15:17:53 -07:00
Craig Topper 70f50114f3 [RISCV] Add another isel optimization for (and (shl x, c2), c1)
Turn (and (shl x, c2), c1) -> (slli (srli x, c3-c2), c3) if c1 is a
shifted mask with no leading zeros and c3 trailing zeros where c3
is greater than c2.
2021-09-23 14:18:07 -07:00
Nico Weber 3fa43da7a3 [llvm] Fix a copy-pasto
We should use IMAGE_REL_I386_SECREL in the i386 section of this file.

IMAGE_REL_I386_SECREL and IMAGE_REL_AMD64_SECREL have the same
numeric value 0xB, so this doesn't change behavior.
2021-09-23 15:34:01 -04:00
Craig Topper 4a69551d66 [RISCV] Add more isel optimizations for (and (shr x, c2), c1).
Turn (and (shr x, c2), c1) -> (slli (srli x, c2+c3), c3) if c1 is a
shifted mask with c2 leading zeros and c3 trailing zeros.

When the leading zeros is C2+32 we can use SRLIW in place of SRLI.
2021-09-23 11:29:04 -07:00
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Thomas Lively 2f519825ba [WebAssembly] Add prototype relaxed SIMD fma/fms instructions
Add experimental clang builtins, LLVM intrinsics, and backend definitions for
the new {f32x4,f64x2}.{fma,fms} instructions in the relaxed SIMD proposal:
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md.
Do not allow these instructions to be selected without explicit user opt-in.

Differential Revision: https://reviews.llvm.org/D110295
2021-09-23 11:01:36 -07:00
Piotr Sobczak 2ac53fffae [AMDGPU] Avoid processing functions in amdgpu-propagate-attributes pass for shaders
The pass amdgpu-propagate-attributes ("Early/Late propagate attributes
from kernels to functions") is currently run also for shaders, where
it does nothing. Modify the check so the pass only processes functions
for kernels.

Differential Revision: https://reviews.llvm.org/D109961
2021-09-23 16:46:56 +02:00
Simon Pilgrim c931d35216 [CostModel][X86] Increase i64 mul cost from 1 to 2
Only the most recent cpus support really 1cy 64-bit multiplies, and the X64 cost table represents a realistic worst case. The 1cy value was also discouraging vectorization when most vXi64 PMULDQ expansions aren't actually slower than scalarization.

Noticed while investigating PR51436.
2021-09-23 14:48:21 +01:00
Jim Lin fbacf5ad38 [RISCV] Add missing op type OPERAND_UIMM2, OPERAND_UIMM3 and OPERAND_UIMM7 for verifyInstruction
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110307
2021-09-23 19:30:46 +08:00
Fraser Cormack e7c879a69d [RISCV][VP] Add support for VP_REDUCE_* operations
This patch adds codegen support for lowering the vector-predicated
reduction intrinsics to RVV instructions. The process is similar to that
of the other reduction intrinsics, save for the fact that every VP
reduction has a start value. We reuse the existing custom "VL" nodes,
adding extra patterns where required to handle non-true masks.

To support these nodes, the `RISCVISD::VECREDUCE_*_VL` nodes have been
given an explicit "merge" operand. This is to faciliate the VP
reductions, where we must be careful to ensure that even if no operation
is performed (when VL=0) we still produce the start value. The RVV
reductions don't update the destination register under these conditions,
so we tie the splatted start value to the output register.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D107657
2021-09-23 11:11:05 +01:00
Jay Foad 6cef28ed2d [TII] Remove the MFI argument to convertToThreeAddress. NFC.
This simplifies the API and addresses a FIXME in
TwoAddressInstructionPass::convertInstTo3Addr.

Differential Revision: https://reviews.llvm.org/D110229
2021-09-23 08:58:46 +01:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch is to support transform something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Mikael Holmen e7b169a8ae [AMDGPU] Fix gcc warnings about unused variables [NFC] 2021-09-23 08:08:00 +02:00
Usman Nadeem 3b12282b0e [AArch64][SVE][InstCombine] Eliminate redundant chains of tuple get/set
Differential Revision: https://reviews.llvm.org/D109667

Change-Id: I06a3c28e3658ecda109a3a1b73265828274ab2ea
2021-09-22 20:59:46 -07:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Zhi An Ng 1552179ac0 [WebAssembly] Add relaxed-simd feature
This currently only defines a constant, but it the future will be used
to gate builtins for experimenting and prototyping relaxed-simd proposal
(https://github.com/WebAssembly/relaxed-simd/).

Differential Revision: https://reviews.llvm.org/D110111
2021-09-22 14:52:50 -07:00
Craig Topper f0a422f935 [RISCV] Add fcvt.s.w(u)/fcvt.d.w(u)/fcvt.h.w(u) to hasAllNBitUsers
These instructions only read the lower 32 bits of their input.
2021-09-22 14:24:26 -07:00
Craig Topper b33a1cc05b [RISCV] Optimize vp.store with an all ones mask to avoid a vmset.
We can use riscv_vse intrinsic instead of riscv_vse_mask. The code here
is based on similar code for handling masked.scatter and vp.scatter.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D110206
2021-09-22 09:12:47 -07:00
Hongtao Yu d9b511d8e8 [CSSPGO] Set PseudoProbeInserter as a default pass.
Currenlty PseudoProbeInserter is a pass conditioned on a target switch. It works well with a single clang invocation. It doesn't work so well when the backend is called separately (i.e, through the linker or llc), where user has always to pass -pseudo-probe-for-profiling explictly. I'm making the pass a default pass that requires no command line arg to trigger, but will be actually run depending on whether the CU comes with `llvm.pseudo_probe_desc` metadata.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D110209
2021-09-22 09:09:48 -07:00
Simon Pilgrim b1f38a27f0 [Target][CodeGen] Remove default CostKind arguments on inner/impl TTI overrides
Based off a discussion on D110100, we should be avoiding default CostKinds whenever possible.

This initial patch removes them from the 'inner' target implementation callbacks - these should only be used by the main TTI calls, so this should guarantee that we don't cause changes in CostKind by missing it in an inner call. This exposed a few missing arguments in getGEPCost and reduction cost calls that I've cleaned up.

Differential Revision: https://reviews.llvm.org/D110242
2021-09-22 15:28:08 +01:00
Sander de Smalen 6375ca4059 [AArch64][SVE] Add extract_subvector patterns for unpacked fp16 and bfloat types.
Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D110163
2021-09-22 14:25:17 +01:00
Tim Northover 3a00e58c2f AArch64: use indivisible cmpxchg for 128-bit atomic loads at O0
Like normal atomicrmw operations, at -O0 the simple register-allocator can
insert spills into the LL/SC loop if it's expanded and visible when regalloc
runs. This can cause the operation to never succeed by repeatedly clearing the
monitor. Instead expand to a cmpxchg, which has a pseudo-instruction for -O0.
2021-09-22 14:20:43 +01:00
David Green 02cd8a6b91 [ARM] Allow smaller VMOVL in tail predicated loops
This allows VMOVL in tail predicated loops so long as the the vector
size the VMOVL is extending into is less than or equal to the size of
the VCTP in the tail predicated loop. These cases represent a
sign-extend-inreg (or zero-extend-inreg), which needn't block tail
predication as in https://godbolt.org/z/hdTsEbx8Y.

For this a vecsize has been added to the TSFlag bits of MVE
instructions, which stores the size of the elements that the MVE
instruction operates on. In the case of multiple size (such as a
MVE_VMOVLs8bh that extends from i8 to i16, the largest size was be
chosen). The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 =
i64, which often (but not always) comes from the instruction encoding
directly. A unit test was added, and although only a subset of the
vecsizes are currently used, the rest should be useful for other cases.

Differential Revision: https://reviews.llvm.org/D109706
2021-09-22 12:07:52 +01:00
Sander de Smalen ab3607c0ed [AArch64][SVE] Add missing load/store patterns for unpacked bfloat vectors.
Reviewed By: c-rhodes

Differential Revision: https://reviews.llvm.org/D110063
2021-09-22 09:45:33 +01:00
Jay Foad 0205806d0f [AMDGPU] Convert mac/fmac to mad/fma when folding output modifiers
Use of output modifiers forces VOP3 encoding for a VOP2 mac/fmac
instruction, so we might as well convert it to the more flexible VOP3-
only mad/fma form.

With this change, the only way we should emit VOP3-encoded mac/fmac is
if regalloc chooses registers that require the VOP3 encoding, e.g. sgprs
for both src0 and src1. In all other cases the mac/fmac should either be
converted to mad/fma or shrunk to VOP2 encoding.

Differential Revision: https://reviews.llvm.org/D110156
2021-09-22 09:36:34 +01:00
Jay Foad 3828ea6181 [AMDGPU] Divergence-driven instruction selection for mul i32
Differential Revision: https://reviews.llvm.org/D109881
2021-09-22 09:36:34 +01:00
Matt Arsenault ec55dcedce AMDGPU: Refactor getWavesPerEU to separate flat workgroup size query
Add an overload to pass the flat workgroup range in separately. This
will allow the attributor to use the assumed value for
amdgpu-flat-workgroup-sizes when inferring amdgpu-waves-per-eu.
2021-09-21 22:57:17 -04:00
Chen Zheng ffa9fa9ed2 [PowerPC] prepare for udpate form with non-const increment.
This is a follow-up of D105872. Now we are able to prepare for update
form with non-const increment.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D106032
2021-09-22 02:54:28 +00:00
Usman Nadeem 645b8f5365 [AArch64][SVE] Add patterns to generate ADR instruction
Differential Revision: https://reviews.llvm.org/D109665

Change-Id: I9d2928688b80b804a16f52928e2057749ec2c0b2
2021-09-21 15:50:49 -07:00
Arthur Eubanks e42234383e Make DiagnosticInfoResourceLimit's limit param required
And always print it.

This makes some LLVM diagnostics match up better with Clang's diagnostics.

Updated some AMDGPU uses of DiagnosticInfoResourceLimit and now we print
better diagnostics for those.

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D110204
2021-09-21 15:27:58 -07:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list whihc are setting hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Craig Topper b81e26c7f4 Recommit "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This time with the right bug number.

When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacment.

Fixes PR51903.
2021-09-21 14:59:25 -07:00
Craig Topper 51a82e051e Revert "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This reverts commit 7550f146ff.

I botched the bug number.
2021-09-21 14:33:44 -07:00
Craig Topper 7550f146ff [X86] Clear kill flags when rewriting SETCC uses in flag copy lowering.
When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacment.

Fixes PR51908.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110046
2021-09-21 14:29:46 -07:00
alex-t 1a33294652 [AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC
Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* but divergent to V_CMP_*.  Sometimes, for the sake of efficiency,  SSA subgraphs may be converted to VALU to avoid repeatedly copying data back and forth. Hence we have to be able to sustain the correctness passing the i1 from VALU to SALU context and vice versa.

VALU operations only process the active lanes of the VGPR and ignore inactive ones.
Active lanes correspond to 1 bit in the EXEC mask register.
SALU represents i1 as just one bit but VALU as 64bits: 0/1 and 0/(0xffffffffffffffff & EXEC) respectively.
SALU uses one-bit conditional flag SCC but VALU - VCC that is a pair of 32-bit SGPRs

To expose SCC to the VALU context we need to convert the one-bit boolean value to the appropriate 64bit.
To return back to the SALU context we need to do the opposite.

To correctly convert 64bit VALU boolean to either 0 or 1 we need to filter out the bits corresponding to the inactive lanes.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D109900
2021-09-21 21:19:31 +03:00
Brendon Cahoon cbdf624bb8 [AMDGPU] Correctly merge alias.scope and noalias metadata for memops
When adding alias.scope and noalias metadata to a memcpy function,
the alias.scope and noalias metadata from the operands are merged.
The rule for merging alias.scope is to take the intersection of
the domains and the union of the scopes within those domains.
The rule for merging noalias is to take the intersection.

The bug is that AMDGPULowerModuleLDS was using concatenation for
both alias.scope and noalias. For example, when f1 and f2 are added
to the LDS structure and there is a memcpy(f2, f1, sizeof(f1)).
Then, concatenation creates noalias metadata for the memcpy that
includes both {f1, f2}. That means that the memcpy is assumed
not to alias a prior load of f2, which enables the optimizer to
remove a load of f2 that occurs after mempcy.

The function MDNode::getmostGenericAliasScope defines the semantics
for alias.scope. There is a function, combineMetadata in Local.cpp,
that uses intersect for noalias.

Differential Revision: https://reviews.llvm.org/D110049
2021-09-21 13:02:01 -05:00
Craig Topper 7c975665b4 [RISCV] Make some arrays of constants 'static const'. NFC
This helps the compiler generate better code.
2021-09-21 10:52:47 -07:00
Amy Kwan 2af57b6099 [PowerPC] Add prefix load pattern for fpext to v2f64
This patch adds a prefixed load pattern involving v2f32 fpext v2f64, where we
are dealing with a value with an offset that fits into a 34-bit signed immediate.
A reduced test case is also added to patch that tests the pattern, in which the
pattern is tested in the big endian CHECKs of the newly added test.

Differential Revision: https://reviews.llvm.org/D109887
2021-09-21 12:45:24 -05:00
Craig Topper aeb63d464f [RISCV] Teach RISCVTargetLowering::shouldSinkOperands to sink splats for and/or/xor.
This requires a minor change to CodeGenPrepare to ensure that
shouldSinkOperands will be called for And.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D110106
2021-09-21 10:07:29 -07:00
Dmitry Preobrazhensky 3500e7d2b0 [AMDGPU][MC][GFX7][GFX10] Corrected image_atomic_fcmpswap
Differential Revision: https://reviews.llvm.org/D109616
2021-09-21 18:06:02 +03:00
Ben Shi b3052013b4 [RISCV] Optimize (add (mul x, c0), c1)
Optimize (add (mul x, c0), c1) -> (ADDI (MUL (ADDI, c1/c0), c0), c1%c0),
if c1/c0 and c1%c0 are simm12, while c1 is not.

Optimize (add (mul x, c0), c1) -> (MUL (ADDI, c1/c0), c0),
if c1%c0 is zero, and c1/c0 is simm12 while c1 is not.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D108607
2021-09-21 14:13:14 +00:00
Dmitry Preobrazhensky b8e7f53208 [AMDGPU][MC][GFX10] Enabled dlc for FLAT and GLOBAL atomics
Differential Revision: https://reviews.llvm.org/D109614
2021-09-21 16:23:20 +03:00
Jonas Paulsson a48b43f981 [SystemZ] Emit EXRL target instructions before text section is ended.
SystemZ adds the EXRL target instructions in the end of each file. This must
be done before debug info emission since that may end the text section, and
therefore this is now done in emitConstantPools() (instead of in
emitEndOfAsmFile).

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D109513
2021-09-21 14:32:28 +02:00
Nicholas Guy 9e4d72675f [AArch64] Improve schedule modelling on the Cortex-A55
Enables the FuseAddress feature in the Cortex-A55 scheduling model

Differential Revision: https://reviews.llvm.org/D109323
2021-09-21 13:03:34 +01:00
Jay Foad 598bebeaa6 [AMDGPU] Prefer fmac over fma when selecting FMA_W_CHAIN
FMA_W_CHAIN is used when lowering fdiv f32. Prefer to select it to fmac
if there are no source modifiers, just like we do for other mad/mac and
fma/fmac cases.

Differential Revision: https://reviews.llvm.org/D110074
2021-09-21 11:57:45 +01:00
Jay Foad 86dcb59206 [AMDGPU] Prefer v_fmac over v_fma only when no source modifiers are used
v_fmac with source modifiers forces VOP3 encoding, but it is strictly
better to use the VOP3-only v_fma instead, because $dst and $src2 are
not tied so it gives the register allocator more freedom and avoids a
copy in some cases.

This is the same strategy we already use for v_mad vs v_mac and
v_fma_legacy vs v_fmac_legacy.

Differential Revision: https://reviews.llvm.org/D110070
2021-09-21 11:57:45 +01:00