Commit Graph

4930 Commits

Author SHA1 Message Date
Jessica Paquette 3e8223b165 [AArch64][GlobalISel] NFC: Remove dead G_BUILD_VECTOR legalization rule
Remove a rule which allows larger scalar types than the destination vector
element type.

This appears to be irrelevant now that we have G_BUILD_VECTOR_TRUNC. Plus,
making a G_BUILD_VECTOR which satisfies this introduces a verifier failure
anyway.

Differential Revision: https://reviews.llvm.org/D97727
2021-03-01 14:04:40 -08:00
Amara Emerson b783aa8979 [AArch64] Fix emitting an AdrpAddLdr LOH when there's a potential clobber of the
def of the adrp before the ldr.

Apparently this pass used to have liveness analysis but it was removed for
compile time reasons. This workaround prevents the LOH from being emitted
unless the ADD and LDR are adjacent.

Fixes https://github.com/JuliaLang/julia/issues/39820

Differential Revision: https://reviews.llvm.org/D97571
2021-03-01 13:52:57 -08:00
Matt Arsenault 6c260d3bc0 GlobalISel: Move splitToValueTypes to generic code
I copied the nearly identical function from AArch64 into AMDGPU, so
fix this duplication.

Mips and X86 have their own more exotic versions which should be
removed. However replacing those is better left for a separate patch
since it requires other changes to avoid regressions.
2021-03-01 08:58:18 -05:00
Matt Arsenault b4bfe29415 AArch64/GlobalISel: Fix using wrong calling convention for calls
This was reusing the parent function's calling convention instead of the
callee's. I'm not sure if there's a case where there's an observable
difference.

I previously missed this in b72a23650f
2021-03-01 08:46:33 -05:00
David Green 7abf7dd5ef [AArch64] Add combine for add(udot(0, x, y), z) -> udot(z, x, y).
Given a zero input for a udot, an add can be folded in to take the place
of the input, using the addition that the instruction naturally
performs.
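
For illustration, the same identity written with ACLE NEON intrinsics (a
sketch assuming a +dotprod target; the wrapper function names are mine):

```
#include <arm_neon.h>

// add(udot(0, x, y), z): dot product into a zero accumulator, then an add.
uint32x4_t fold_before(uint32x4_t z, uint8x16_t x, uint8x16_t y) {
  return vaddq_u32(vdotq_u32(vdupq_n_u32(0), x, y), z);
}

// udot(z, x, y): the accumulator operand absorbs the add.
uint32x4_t fold_after(uint32x4_t z, uint8x16_t x, uint8x16_t y) {
  return vdotq_u32(z, x, y);
}
```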

Differential Revision: https://reviews.llvm.org/D97188
2021-03-01 12:53:34 +00:00
Fraser Cormack 6718fda6ad [CodeGen] Fix issues with subvector intrinsic index types
This patch addresses issues arising from the fact that the index type
used for subvector insertion/extraction is inconsistent between the
intrinsics and SDNodes. The intrinsic forms require i64 whereas the
SDNodes use the type returned by SelectionDAG::getVectorIdxTy.

Rather than update the intrinsic definitions to use an overloaded index
type, this patch fixes the issue by transforming the index to the
correct type as required. Any loss of index bits going from i64 to a
smaller type is unexpected, and will be caught by an assertion in
SelectionDAG::getVectorIdxConstant.

The patch also updates the documentation for INSERT_SUBVECTOR and adds
an assertion to its creation to bring it in line with EXTRACT_SUBVECTOR.
This necessitated changes to AArch64 which was using i64 for
EXTRACT_SUBVECTOR but i32 for INSERT_SUBVECTOR. Only one test changed
its codegen after updating the backend accordingly.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D97459
2021-03-01 10:28:21 +00:00
Jessica Paquette f5d5a7d7ea [AArch64][GlobalISel] Import FMOV patterns rather than manually selecting it
There are existing patterns for FMOVHi, FMOVSi, and FMOVDi in
AArch64InstrFormats.td.

Importing these allows us to remove the manual selection code for FMOV.

It also allows us to select FMOVHi for non-zero constants when we have full
fp16 support.

Refactor some of the code in AArch64InstrFormats.td so that we can create
equivalent custom renderers in GlobalISel.

Differential Revision: https://reviews.llvm.org/D97511
2021-02-26 16:27:39 -08:00
Tim Northover 201ada80ee AArch64: relax address-space assertion in FastISel.
Some people are using alternative address spaces to track GC data, but
otherwise they behave exactly the same. This is the only place in the backend
where we even try to care about it, so it's really not achieving anything.
2021-02-25 10:15:55 +00:00
Stelios Ioannou 30cb9c03b5 [AArch64] Add abs intrinsic costs
This patch adds cost-modelling for the abs vector intrinsic.

Change-Id: I89007971bfb15f5b4a02a2eadfd43018e9a73976
2021-02-25 09:31:52 +00:00
Jessica Paquette e339bba637 [AArch64][GlobalISel] Fix manual selection for v4s16 and v8s8 G_DUP
The manual G_DUP selection code would produce DUPv16i8 for v8s8s and DUPv8i16
for v4s16.

This adds the missing cases to the manual selection code, and makes it return
false when there is an unexpected size.

Update select-dup.mir to reflect the change.

Differential Revision: https://reviews.llvm.org/D97240
2021-02-24 10:23:06 -08:00
Amara Emerson 0146d20631 [AArch64] Do not fold SP adjustments into pre-increment addr modes if it overflows the redzone.
Instead of disabling this outright with the noredzone attribute,
we only avoid doing the optimization if there are memory operations between
the adjustment and the load/store that the adjustment would be folded into.
This avoids the case of something like a stack cookie being corrupted if an
exception happens before the pre-increment to the SP occurs.

This also prevents the folding happening if we have a redzone, but the offset
being folded is above the redzone amount (128 bytes in this case).

rdar://73269336

Differential Revision: https://reviews.llvm.org/D95179
2021-02-24 09:55:48 -08:00
Florian Hahn 5c74c6be3c [AArch64] Use CMTST for != 0 vector compares (vnot (CMEQz A)).
(CMTST A, A) will only set elements to 0 if the element is 0 in A. Use
it for != 0 compares, which currently use (vnot (CMEQz A)). This saves an
mvn instruction.
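
The same difference expressed with NEON intrinsics (a sketch; the wrapper
names are mine, and the noted instructions are the typical selections):

```
#include <arm_neon.h>

// Before: compare-equal-to-zero, then invert: CMEQz followed by MVN.
uint32x4_t ne_zero_before(int32x4_t a) {
  return vmvnq_u32(vceqzq_s32(a));
}

// After: CMTST a, a sets a lane to all-ones iff (a & a) != 0, i.e. a != 0.
uint32x4_t ne_zero_after(int32x4_t a) {
  return vtstq_s32(a, a);
}
```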

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97303
2021-02-24 09:39:27 +00:00
Amara Emerson eb55203e00 [AArch64][GlobalISel][PostSelectOpt] Constrain reg operands after mutating instructions.
The non-flag setting variants of instructions may have different regclass
requirements. If so, we need to constrain them.

Differential Revision: https://reviews.llvm.org/D97343
2021-02-23 19:32:18 -08:00
Jessica Paquette daf7d7f0dc [AArch64][GlobalISel] Correct function evaluation order in applyINS
The order in which the nested calls to Builder.buildWhatever are
evaluated differs between GCC and Clang.

This caused a bot failure because the MIR in the testcase was
coming out in a different order than expected.

Rather than using nested calls, pull them out in order to fix the
order of evaluation.
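
A minimal standalone sketch of the underlying C++ pitfall (the names here are
invented, not from the patch): the evaluation order of function-call arguments
is unspecified, so nested builder calls may run in different orders under
different compilers.

```
#include <cstdio>

static int counter = 0;
static int nextId() { return ++counter; } // stands in for a nested build call

static void build(int a, int b) { std::printf("%d %d\n", a, b); }

int main() {
  // Unspecified evaluation order: may print "1 2" under one compiler
  // and "2 1" under another.
  build(nextId(), nextId());

  // The fix pattern: name the intermediate results to force the order.
  int first = nextId();
  int second = nextId();
  build(first, second); // now deterministic
}
```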
2021-02-23 16:21:11 -08:00
David Green f51b3de4e8 [AArch64] Introduce UDOT/SDOT DAG nodes
This is used to lower UDOT/SDOT instructions, as opposed to relying on
the intrinsic. Subsequent optimizations will be able to optimize them
more cleanly based on these nodes.
2021-02-23 20:31:01 +00:00
Jessica Paquette ef1f7f1d7d Recommit "[AArch64][GlobalISel] Match G_SHUFFLE_VECTOR -> insert elt + extract elt"
Attempted fix for the added test failing.

https://lab.llvm.org/buildbot/#/builders/104/builds/2355/steps/5/logs/stdio

I can't reproduce the failure anywhere, so I'm going to guess that passing a
std::function as MatchInfo is sketchy in this context.

Switch it to a std::tuple and hope for the best.
2021-02-23 11:55:16 -08:00
Amara Emerson 939b5ce734 [AArch64][GlobalISel] Lower G_USUBSAT and G_UADDSAT for scalars.
We are still missing some of the optimization counterparts to LowerXALUO, but it's a start.
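
For reference, a sketch of the scalar semantics being lowered (plain C++, not
the legalizer code itself):

```
#include <cstdint>
#include <limits>

// G_UADDSAT: unsigned add that clamps to the maximum instead of wrapping.
uint32_t uaddsat(uint32_t a, uint32_t b) {
  uint32_t s = a + b; // wraps on overflow
  return s < a ? std::numeric_limits<uint32_t>::max() : s;
}

// G_USUBSAT: unsigned subtract that clamps to zero instead of wrapping.
uint32_t usubsat(uint32_t a, uint32_t b) {
  return a > b ? a - b : 0;
}
```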
2021-02-23 11:54:52 -08:00
Jessica Paquette 662402a8b3 Revert "[AArch64][GlobalISel] Match G_SHUFFLE_VECTOR -> insert elt + extract elt"
This reverts commit 867e379c0e.

For some reason this is upsetting Linux/Windows bots. Reverting while I try to
reproduce.
2021-02-22 17:36:17 -08:00
Cassie Jones 8b10aa67ad [AArch64][GlobalISel] Make overflow legalization use clampScalar
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96674
2021-02-22 19:59:36 -05:00
Jessica Paquette 867e379c0e [AArch64][GlobalISel] Match G_SHUFFLE_VECTOR -> insert elt + extract elt
Match a G_SHUFFLE_VECTOR with a mask that allows it to be represented as a
G_INSERT_VECTOR_ELT and a G_EXTRACT_VECTOR_ELT.

This ports `isINSMask` from AArch64ISelLowering and the portion of
`AArch64TargetLowering::LowerVECTOR_SHUFFLE` which handles the equivalent
transformation.

This provides more opportunities for matching DUP. We don't have all of the
necessary combines to actually make DUP out of these yet, but this is better for
size than the full TBL expansion for G_SHUFFLE_VECTOR.

This is a -0.1% code size improvement on CTMark/Bullet at -Os.

IR example: https://godbolt.org/z/sdcevT

Differential Revision: https://reviews.llvm.org/D97214
2021-02-22 14:44:09 -08:00
Jessica Paquette 95d13c01ec [AArch64][GlobalISel] Emit G_ASSERT_SEXT for SExt parameters in CallLowering
Similar to how we emit G_ASSERT_ZEXT when we have CCValAssign::LocInfo::ZExt.

This will allow us to combine away some redundant sign extends.

Example: https://godbolt.org/z/cTbKvr

Differential Revision: https://reviews.llvm.org/D96915
2021-02-22 10:14:43 -08:00
Ryan Santhiraraja 2c25efcbd3 [AArch64] Adding SHA3 Intrinsics support
This patch adds the following SHA3 Intrinsics:
        vsha512hq_u64
        vsha512h2q_u64
        vsha512su0q_u64
        vsha512su1q_u64
        veor3q_u8
        veor3q_u16
        veor3q_u32
        veor3q_u64
        veor3q_s8
        veor3q_s16
        veor3q_s32
        veor3q_s64
        vrax1q_u64
        vxarq_u64
        vbcaxq_u8
        vbcaxq_u16
        vbcaxq_u32
        vbcaxq_u64
        vbcaxq_s8
        vbcaxq_s16
        vbcaxq_s32
        vbcaxq_s64

    Note: you need to include +sha3 and +crypto when building from the front-end.
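
A usage sketch for one of these (assuming a compiler invoked with e.g.
-march=armv8.2-a+sha3+crypto; the wrapper name is mine):

```
#include <arm_neon.h>

// Three-way XOR: a single EOR3 instruction instead of two chained EORs.
uint64x2_t xor3(uint64x2_t a, uint64x2_t b, uint64x2_t c) {
  return veor3q_u64(a, b, c);
}
```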

Reviewed By: DavidSpickett

Differential Revision: https://reviews.llvm.org/D96381
2021-02-22 12:09:20 +00:00
Amara Emerson 6ff09ce061 [AArch64][GlobalISel] Fix <16 x s8> G_DUP regbankselect to assign source to gpr.
We can only select this type if the source is on GPR, not FPR.
2021-02-21 21:17:29 -08:00
Amara Emerson 067ec53df1 [AArch64][GlobalISel] Add selection support for G_VECREDUCE of <2 x i32>
This selects to a pairwise add and a subreg copy.
2021-02-20 00:39:38 -08:00
Amara Emerson 27566e9c3e [AArch64][GlobalISel] Make G_VECREDUCE_ADD of <2 x s32> legal.
2021-02-19 14:28:21 -08:00
Jessica Paquette 8d3442eddb [AArch64][GlobalISel] Run redundant_sext_inreg in the post-legalizer combiner
This is to ensure that we can eliminate G_ASSERT_SEXT.

In a follow-up patch, I'm going to make CallLowering emit G_ASSERT_SEXT for
signext parameters.

Differential Revision: https://reviews.llvm.org/D96913
2021-02-19 09:34:47 -08:00
Sjoerd Meijer 260f90bb3d [AArch64] Add some missing Neoverse features
This enables AES fusion and the post RA scheduler for the Neoverse cores.
And while we are at it, also for the A55, which we had missed earlier.

Differential Revision: https://reviews.llvm.org/D96866
2021-02-19 09:18:35 +00:00
Serge Pavlov 2c4f60e45b [FPEnv][AArch64] Implement lowering of llvm.set.rounding
Differential Revision: https://reviews.llvm.org/D96836
2021-02-19 13:16:51 +07:00
Bradley Smith 8bad8a43c3 [AArch64][SVE] Add patterns to generate FMLA/FMLS/FNMLA/FNMLS/FMAD
Adjust generateFMAsInMachineCombiner to return false if SVE is present
in order to combine fmul+fadd into fma. Also add new pseudo instructions
so as to select the most appropriate of FMLA/FMAD depending on register
allocation.

Depends on D96599

Differential Revision: https://reviews.llvm.org/D96424
2021-02-18 16:55:16 +00:00
Bradley Smith 5b094bfeb3 [AArch64] Allow folding FMUL/FADD into FMA for FP16 types
isFMAFasterThanFMulAndFAdd should return true for FP16 types when
HasFullFP16 is present, since we have the instructions to handle it for
both SVE and NEON. (SVE patterns and tests will follow).
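
A sketch of what the change amounts to (the hook and its signature come from
TargetLowering; the body is illustrative, not the committed code):

```
bool AArch64TargetLowering::isFMAFasterThanFMulAndFAdd(
    const MachineFunction &MF, EVT VT) const {
  if (!VT.isSimple())
    return false;
  switch (VT.getSimpleVT().SimpleTy) {
  case MVT::f16:
    return Subtarget->hasFullFP16(); // newly allowed for FP16
  case MVT::f32:
  case MVT::f64:
    return true;
  default:
    return false;
  }
}
```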

Differential Revision: https://reviews.llvm.org/D96599
2021-02-18 16:51:22 +00:00
Fraser Cormack 0176fecfbc [SVE][CodeGen] Expand SVE MULH[SU] and [SU]MUL_LOHI nodes
This patch fixes a codegen crash introduced in fde2466171, where the
DAGCombiner started generating optimized MULH[SU] or [SU]MUL_LOHI nodes
unless the target opted out. The AArch64 backend cannot currently select
any of these nodes, so ensure that they are not generated in the first
place.

This issue was raised by @huihuiz in D94501.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D96849
2021-02-18 10:06:24 +00:00
Petr Hosek 16af973933 [MC][ELF] Support for zero flag section groups
This change introduces support for zero flag ELF section groups to LLVM.
LLVM already supports COMDAT sections, which in ELF are a special type
of ELF section groups. These are generally useful to enable linker GC
where you want a group of sections to always travel together, that is to
be either retained or discarded as a whole, but without the COMDAT
semantics. Other ELF assemblers already support zero flag ELF section
groups and this change helps us reach feature parity.

Differential Revision: https://reviews.llvm.org/D95851
2021-02-16 14:23:40 -08:00
Jessica Paquette 962b73dd0f Revert "[AArch64][GlobalISel] Fold constants into G_GLOBAL_VALUE"
This reverts commit 61b4702a40.

We were seeing some test failures in SPECINT2006 due to this change. Reverting
to investigate.
2021-02-16 10:50:12 -08:00
Florian Hahn 211147c5ba [AArch64] Convert CMP/SELECT sign patterns to OR & ASR.
ICMP & SELECT patterns extracting the sign of a value can be simplified
to OR & ASR (see https://alive2.llvm.org/ce/z/Xx4iZ0).

This does not save any instructions in IR, but it is profitable on
AArch64, because we need at least 2 extra instructions to materialize 1
and -1 for the SELECT.

The improvements result in ~5% speedups on loops of the form

    static int sign_of(int x) {
      if (x < 0) return -1;
      return 1;
    }

    void foo(const int *x, int *res, int cnt) {
      for (int i=0;i<cnt;i++)
        res[i] = sign_of(x[i]);
    }
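
The combined form corresponds to this branchless variant (assuming 32-bit int;
right-shifting a negative value yields the arithmetic shift AArch64 provides,
though C++ only guarantees that behaviour from C++20):

```
static int sign_of_branchless(int x) {
  // ASR smears the sign bit across the word (0 or -1); OR forces bit 0,
  // turning 0 into 1 and leaving -1 unchanged.
  return (x >> 31) | 1;
}
```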

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D96596
2021-02-16 17:17:34 +00:00
David Truby e86f9ba15c [llvm][Aarch64][SVE] Remove extra fmov instruction with certain literals
When a literal that cannot fit in the immediate form of the fmov instruction
is used to initialise an SVE vector, an extra unnecessary fmov is currently
generated. This patch adds an extra codegen pattern preventing the extra
instruction from being generated.

Differential Revision: https://reviews.llvm.org/D96700

Co-Authored-By: Paul Walker <paul.walker@arm.com>
2021-02-16 14:16:33 +00:00
Kerry McLaughlin ba1e150d03 [SVE] Add support for scalable vectorization of loops with int/fast FP reductions
This patch enables scalable vectorization of loops with integer/fast reductions, e.g.:

```
unsigned sum = 0;
for (int i = 0; i < n; ++i) {
  sum += a[i];
}
```

A new TTI interface, isLegalToVectorizeReduction, has been added to prevent
reductions which are not supported for scalable types from vectorizing.
If the reduction is not supported for a given scalable VF,
computeFeasibleMaxVF will fall back to using fixed-width vectorization.

Reviewed By: david-arm, fhahn, dmgreen

Differential Revision: https://reviews.llvm.org/D95245
2021-02-16 13:50:06 +00:00
Matt Arsenault 392e0fcfd1 GlobalISel: Handle arguments partially passed on the stack
The API is a bit awkward since you need to index into an array in the
passed struct. I guess an alternative would be to pass all of the
individual fields.
2021-02-15 17:06:14 -05:00
Florian Hahn ca23b2c8ed [AArch64] Move machine bundle unpacking to PreEmit2 phase.
This patch adjusts the placement of the bundle unpacking to just before
code emission. In particular, this means bundle unpacking happens AFTER
the machine outliner. With the previous position, the machine outliner
may outline parts of a bundle, which breaks them up.

This is an issue for BLR_RVMARKER handling, as illustrated by the
rvmarker-pseudo-expansion-and-outlining.mir test case. The machine
outliner should not break up the bundles created during pseudo
expansion.

This should fix PR49082.

Reviewed By: SjoerdMeijer

Differential Revision: https://reviews.llvm.org/D96294
2021-02-15 16:10:43 +00:00
Caroline Concatto b52e6c5891 [CostModel]Add cost model for experimental.vector.reverse
This patch uses the function getShuffleCost with SK_Reverse to compute the cost
for experimental.vector.reverse.
For scalable vector types, it adds a table with the legal types in
AArch64TTIImpl::getShuffleCost so as not to assert in BasicTTIImpl::getShuffleCost,
and for fixed vectors, it relies on the existing cost model in BasicTTIImpl.

Depends on D94883

Differential Revision: https://reviews.llvm.org/D95603
2021-02-15 14:23:57 +00:00
Caroline Concatto 2d728bbff5 [CodeGen][SelectionDAG]Add new intrinsic experimental.vector.reverse
This patch adds a new intrinsic, experimental.vector.reverse, that takes a single
vector and returns a vector of matching type but with the original lane order
reversed. For example:

```
vector.reverse(<A,B,C,D>) ==> <D,C,B,A>
```

The new intrinsic supports fixed and scalable vector types.
For fixed-width vectors it relies on shufflevector to maintain existing behaviour;
for scalable vectors it uses the new ISD node VECTOR_REVERSE.
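
For fixed-width types the shuffle form can be written directly, e.g. with
Clang's vector extensions (illustrative only, not part of the patch):

```
typedef int v4si __attribute__((vector_size(16)));

// Reverse the lanes of a 4 x i32 vector: <A,B,C,D> ==> <D,C,B,A>.
v4si reverse4(v4si v) {
  return __builtin_shufflevector(v, v, 3, 2, 1, 0);
}
```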

This new intrinsic is one of the named shufflevector intrinsics proposed on the
mailing-list in the RFC at [1].

Patch by Paul Walker (@paulwalker-arm).

[1] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146864.html

Differential Revision: https://reviews.llvm.org/D94883
2021-02-15 13:39:43 +00:00
Arlo Siemsen 080866470d Add ehcont section support
In the future, Windows will enable Control-flow Enforcement Technology (CET, aka shadow stacks). To protect the path where the context is updated during exception handling, the binary is required to enumerate valid unwind entrypoints in a dedicated section, which is validated when the context is set during exception handling.

This change allows LLVM to generate the section that contains the appropriate symbol references in the form expected by the MSVC linker.

This feature is enabled through a new module flag, ehcontguard, which was modelled on the cfguard flag.

The change includes a test verifying that the section is correctly generated when the module flag is enabled.

The set of exception continuation information includes returns from exceptional control flow (catchret in llvm).

In order to collect catchret we:
1) Include an additional flag on machine basic blocks to indicate that the given block is the target of a catchret operation,
2) Introduce a new machine function pass to insert and collect symbols at the start of each block, and
3) Combine these targets with the other EHCont targets that were already being collected.

Change originally authored by Daniel Frampton <dframpto@microsoft.com>

For more details, see MSVC documentation for `/guard:ehcont`
  https://docs.microsoft.com/en-us/cpp/build/reference/guard-enable-eh-continuation-metadata

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D94835
2021-02-15 14:27:12 +08:00
Jessica Paquette 61b4702a40 [AArch64][GlobalISel] Fold constants into G_GLOBAL_VALUE
This pretty much just ports `performGlobalAddressCombine` from
AArch64ISelLowering. (AArch64 doesn't use the generic DAG combine for this.)

This adds a pre-legalize combine which looks for this pattern:

```
  %g = G_GLOBAL_VALUE @x
  %ptr1 = G_PTR_ADD %g, cst1
  %ptr2 = G_PTR_ADD %g, cst2
  ...
  %ptrN = G_PTR_ADD %g, cstN
```

And then, if possible, transforms it like so:

```
  %g = G_GLOBAL_VALUE @x
  %offset_g = G_PTR_ADD %g, -min(cst)
  %ptr1 = G_PTR_ADD %offset_g, cst1
  %ptr2 = G_PTR_ADD %offset_g, cst2
  ...
  %ptrN = G_PTR_ADD %offset_g, cstN
```

Where min(cst) is the smallest out of the G_PTR_ADD constants.

This means we should save at least one G_PTR_ADD.

This also updates code in the legalizer + selector which assumes that
G_GLOBAL_VALUE will never have an offset and adds/updates relevant tests.

Differential Revision: https://reviews.llvm.org/D96624
2021-02-12 14:55:15 -08:00
Amara Emerson 5d6d9b63a3 [GlobalISel] Propagate extends through G_PHIs into the incoming value blocks.
This combine tries to do inter-block hoisting of extends of G_PHIs, into the
originating blocks of the phi's incoming value. The idea is to expose further
optimization opportunities that are normally obscured by the PHI.

Some basic heuristics and a target hook for AArch64 are added to allow tuning.
E.g. if the extend is used by a G_PTR_ADD, it doesn't perform this combine
since it may be folded into the addressing mode during selection.

There are very minor code size improvements on AArch64 -Os, but the real benefit
is that it unlocks optimizations like AArch64 conditional compares on some
benchmarks.

Differential Revision: https://reviews.llvm.org/D95703
2021-02-12 11:52:52 -08:00
Akira Hatanaka ed4718eccb [ObjC][ARC] Use operand bundle 'clang.arc.attachedcall' instead of
explicitly emitting retainRV or claimRV calls in the IR

Background:

This fixes a longstanding problem where LLVM breaks ARC's autorelease
optimization (see the link below) by separating calls from the marker
instructions or retainRV/claimRV calls. The backend changes are in
https://reviews.llvm.org/D92569.

https://clang.llvm.org/docs/AutomaticReferenceCounting.html#arc-runtime-objc-autoreleasereturnvalue

What this patch does to fix the problem:

- The front-end adds operand bundle "clang.arc.attachedcall" to calls,
  which indicates the call is implicitly followed by a marker
  instruction and an implicit retainRV/claimRV call that consumes the
  call result. In addition, it emits a call to
  @llvm.objc.clang.arc.noop.use, which consumes the call result, to
  prevent the middle-end passes from changing the return type of the
  called function. This is currently done only when the target is arm64
  and the optimization level is higher than -O0.

- ARC optimizer temporarily emits retainRV/claimRV calls after the calls
  with the operand bundle in the IR and removes the inserted calls after
  processing the function.

- ARC contract pass emits retainRV/claimRV calls after the call with the
  operand bundle. It doesn't remove the operand bundle on the call since
  the backend needs it to emit the marker instruction. The retainRV and
  claimRV calls are emitted late in the pipeline to prevent optimization
  passes from transforming the IR in a way that makes it harder for the
  ARC middle-end passes to figure out the def-use relationship between
  the call and the retainRV/claimRV calls (which is the cause of
  PR31925).

- The function inliner removes an autoreleaseRV call in the callee if
  nothing in the callee prevents it from being paired up with the
  retainRV/claimRV call in the caller. It then inserts a release call if
  claimRV is attached to the call since autoreleaseRV+claimRV is
  equivalent to a release. If it cannot find an autoreleaseRV call, it
  tries to transfer the operand bundle to a function call in the callee.
  This is important since the ARC optimizer can remove the autoreleaseRV
  returning the callee result, which makes it impossible to pair it up
  with the retainRV/claimRV call in the caller. If that fails, it simply
  emits a retain call in the IR if retainRV is attached to the call and
  does nothing if claimRV is attached to it.

- SCCP refrains from replacing the return value of a call with a
  constant value if the call has the operand bundle. This ensures the
  call always has at least one user (the call to
  @llvm.objc.clang.arc.noop.use).

- This patch also fixes a bug in replaceUsesOfNonProtoConstant where
  multiple operand bundles of the same kind were being added to a call.

Future work:

- Use the operand bundle on x86-64.

- Fix the auto upgrader to convert call+retainRV/claimRV pairs into
  calls with the operand bundles.

rdar://71443534

Differential Revision: https://reviews.llvm.org/D92808
2021-02-12 09:51:57 -08:00
Sanjay Patel 79b1b4a581 [Vectorizers][TTI] remove option to bypass creation of vector reduction intrinsics
The vector reduction intrinsics started life as experimental ops, so backend support
was lacking. As part of promoting them to 1st-class intrinsics, however, codegen
support was added/improved:
D58015
D90247

So I think it is safe to now remove this complication from IR.

Note that we still have an IR-level codegen expansion pass for these as discussed
in D95690. Removing that is another step in simplifying the logic. Also note that
x86 was already unconditionally forming reductions in IR, so there should be no
difference for x86.

I spot checked a couple of the tests here by running them through opt+llc and did
not see any asm diffs.

If we do find functional differences for other targets, it should be possible
to (at least temporarily) restore the shuffle IR with the ExpandReductions IR
pass.

Differential Revision: https://reviews.llvm.org/D96552
2021-02-12 08:13:50 -05:00
Pengxuan Zheng 61cca0f2e5 [AArch64] Adding Neon Sm3 & Sm4 Intrinsics
This adds SM3 and SM4 Intrinsics support for AArch64, specifically:
        vsm3ss1q_u32
        vsm3tt1aq_u32
        vsm3tt1bq_u32
        vsm3tt2aq_u32
        vsm3tt2bq_u32
        vsm3partw1q_u32
        vsm3partw2q_u32
        vsm4eq_u32
        vsm4ekeyq_u32

Reviewed By: labrinea

Differential Revision: https://reviews.llvm.org/D95655
2021-02-11 14:20:20 -08:00
Sander de Smalen 3b4f706ae1 [AArch64][SVE] Asm: Fix supported immediates for DUP/CPY
This patch fixes an issue in the implementation of DUP/CPY where certain
immediates were not accepted. Immediates should be interpreted as a two's
complement encoding of a value that fits the number of bits of the element
type.

          mov z0.b, p0/z, #127
     <=>  mov z0.b, p0/z, #-129
     <=>  mov z0.b, p0/z, #0xffffffffffffff7f

This behaviour is in line with the GNU assembler.
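
One plausible way to express the acceptance check (a sketch, not the committed
code): accept any 64-bit immediate whose bits above the element width are all
zeros or all ones, which covers the three equivalent spellings above.

```
#include <cstdint>

bool isLegalDupImm(int64_t Imm, unsigned ElemBits) {
  if (ElemBits >= 64)
    return true;
  // Right shift of a negative value is the arithmetic shift on common
  // implementations (guaranteed only from C++20 onwards).
  int64_t Hi = Imm >> ElemBits;
  return Hi == 0 || Hi == -1; // pure zero- or sign-extension of the payload
}
// isLegalDupImm(127, 8) and isLegalDupImm(-129, 8) both accept: the two
// spellings carry the same 8-bit payload, 0x7f.
```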

Reviewed By: c-rhodes

Differential Revision: https://reviews.llvm.org/D94776
2021-02-11 08:14:15 +00:00
Jessica Paquette 1514f3b2c8 [AArch64][GlobalISel] Don't perform the mul const combine with G_PTR_ADD
A G_MUL + G_PTR_ADD can also be folded into a madd. So, conservatively, we
shouldn't combine when the G_MUL is used by a G_PTR_ADD either.
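
The C-level analogue of why this is conservative (illustrative; the noted
instruction is typical AArch64 output, not guaranteed):

```
// A multiply feeding an add, including an address computation, can be
// selected as a single madd on AArch64, so combining away the constant
// multiply earlier could lose that opportunity.
long long madd_shape(long long a, long long b, long long c) {
  return a * b + c; // typically: madd x0, x0, x1, x2
}
```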

Differential Revision: https://reviews.llvm.org/D96457
2021-02-10 15:30:45 -08:00
Jessica Paquette 5f7a4d8d05 [AArch64][GlobalISel] Perform load/store extended reg folding with optsize
GlobalISel was only doing this with minsize. SDAG does this with optsize.

(See: `SelectionDAG::shouldOptForSize()`)

This is a 0.3% code size improvement for CTMark at -Os.

(Best: 1.1% improvements on lencod + pairlocalalign)

Differential Revision: https://reviews.llvm.org/D96451
2021-02-10 14:42:25 -08:00
Jessica Paquette 9283058abb [AArch64][GlobalISel] Fold G_ADD into the cset for G_ICMP
When we have a G_ADD which is fed by a G_ICMP on one side, we can fold it into
the cset for the G_ICMP.

e.g. Given

```
%cmp = G_ICMP ... %x, %y
%add = G_ADD %cmp, %z
```

We would normally emit a cmp, cset, and add.

However, `%add` is either `%z` or `%z + 1`. So, we can just use `%z` as the
source of the cset rather than wzr, saving an instruction.
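
The C-level shape of the win (illustrative; the instruction choice is typical
AArch64 output, not guaranteed):

```
int cmp_plus(int x, int y, int z) {
  // (x == y) + z: rather than cset into a fresh register followed by an
  // add, the compare result can increment z directly:
  //   cmp  w0, w1
  //   cinc w0, w2, eq
  return (x == y) + z;
}
```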

This would probably be cleaner in AArch64PostLegalizerLowering, but we'd need
to change the way we represent G_ICMP to do that, I think. For now, it's
easiest to implement in selection.

This is a 0.1% code size improvement on CTMark/pairlocalalign at -Os.

Example: https://godbolt.org/z/7KdrP8

Differential Revision: https://reviews.llvm.org/D96388
2021-02-10 13:28:01 -08:00