- Do not emit following assembler directives:
- .hsa_code_object_version
- .hsa_code_object_isa
- .amd_amdgpu_isa
- .amd_amdgpu_hsa_metadata
- .amd_amdgpu_pal_metadata
- Do not emit .note entries
- Cleanup and bring in sync kernel descriptor header file
- Emit kernel descriptor into .rodata with appropriate relocations and
alignments
llvm-svn: 334519
Summary: This patch originated from D46562 and is a proper subset, with some issues addressed for fmul.
Reviewers: spatel, hfinkel, wristow, arsenm
Reviewed By: spatel
Subscribers: nhaehnle, wdng
Differential Revision: https://reviews.llvm.org/D47911
llvm-svn: 334514
Implement default legalization of rotates: either in terms of the rotation
in the opposite direction (if legal), or in terms of shifts and ors.
Implement generating of rotate instructions for Hexagon. Hexagon only
supports rotates by an immediate value, so implement custom lowering of
ROTL/ROTR on Hexagon. If a rotate is not legal, use the default expansion.
Differential Revision: https://reviews.llvm.org/D47725
llvm-svn: 334497
Extend LONG_BRANCH_LUi and LONG_BRANCH_ADDiu pseudo instructions with
additional flag, so instead of always lowering to lui %hi(...),
addiu %lo(...) or addiu %hi(...), now they can lower to either %lo, %hi,
%higher or %highest depending on the added flag.
Differential Revision: https://reviews.llvm.org/D47941
llvm-svn: 334490
We were missing packed isel folding patterns for all of sse41, avx, and avx512.
For some reason avx512 had scalar load folding patterns under optsize(due to partial/undef reg update), but we didn't have the equivalent sse41 and avx patterns.
Sometimes we would get load folding due to peephole pass anyway, but we're also missing avx512 instructions from the load folding table. I'll try to fix that in another patch.
Some of this was spotted in the review for D47993.
This patch adds all the folds to isel, adds a few spot tests, and disables the peephole pass on a few tests to ensure we're testing some of these patterns.
llvm-svn: 334460
The use iterator, used within findMaskOperands(), can return anything which is
not a def. isUse() requires a register, so check isReg() before calling isUse().
Differential Revision: https://reviews.llvm.org/D48047
llvm-svn: 334459
This would fail before because 1x vectors aren't legal,
so instead just use the scalar type.
Avoids regressions in a future AMDGPU commit to add
v4i16/v4f16 as legal types.
Test update is just the one test that this triggers
on in tree now. It wasn't checking anything before.
The result is completely changed since the selects
are eliminated. Not sure if it's considered better
or not.
llvm-svn: 334440
Rational: if there is indirect access that is usually an issue
because load is not ready by the use. However, if use is inside a
loop and load is outside that is potentially an issue for a first
iteration only.
Differential Revision: https://reviews.llvm.org/D47740
llvm-svn: 334420
When program is compiled for mips3 with n64 abi, wrong register class
is used for creating an emergency spill slot. This patch fixes the
correct register class to be chosen.
This patch resolves PR35859.
Thanks to John Baldwin for reporting the issue!
Differential Revision: https://reviews.llvm.org/D47938
llvm-svn: 334419
This sets trackLivenessAfterRegAlloc on AVRRegisterInfo.
Most existing targets set this flag. Without it, specific IR inputs
cause LLVM to fail with:
Assertion failed: (getParent()->getProperties().hasProperty( MachineFunctionProperties::Property::TracksLiveness) &&
"Liveness information is accurate"), function livein_begin
file MachineBasicBlock.cpp, line 1354.
With this commit, this no longer happens.
Patch by Peter Nimmervoll.
llvm-svn: 334409
Summary:
This fixes most of the scheduling info for SKX vector operations.
I had to split a lot of the YMM/ZMM classes into separate classes for YMM and ZMM.
The before/after llvm-exegesis analysis are in the phabricator diff.
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D47721
llvm-svn: 334407
Summary:
The idiom recognition seems rather poor.
Only the `@bzhi32_d0` produces `v_bfe_u32`.
But they all should.
This needs to be fixed before D47980 can be re-landed.
Reviewers: mareko, bogner, rampitec, arsenm, tstellar, nhaehnle
Reviewed By: nhaehnle
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Tags: #amdgpu
Differential Revision: https://reviews.llvm.org/D48005
llvm-svn: 334398
Summary: When compiling with -fpic, in contrast to -fPIC, use only the
immediate field to index into the GOT. This saves space if the GOT is
known to be small. The linker will warn if the GOT is too large for
this method.
Reviewers: jyknight, venkatra
Reviewed By: jyknight
Subscribers: brad, fedor.sergeev, jrtc27, llvm-commits
Differential Revision: https://reviews.llvm.org/D47136
llvm-svn: 334383
This patch started off much more general and ambitious, but it's been a nightmare
seeing all the ways x86 vector codegen can go wrong.
So the code is still structured to allow extending easily, but it's currently
limited in several ways:
1. Only handle cases with an extending load.
2. Only handle cases with a zero constant compare.
3. Ignore setcc with vector bitmask (SetCCWidth != 1) - so AVX512 should be unaffected.
The motivating case from PR37427:
https://bugs.llvm.org/show_bug.cgi?id=37427
...is the 1st test, and that shows the expected win - we eliminated the unnecessary
intermediate cast.
There's a clear regression in the last test (sgt_zero_fp_select) because we longer
recognize a 'SHRUNKBLEND' opportunity. I think that general problem is also present
in sgt_zero, so I'll try to fix that in a follow-up. We need to match a sign-bit
setcc from a sign-extended operand and remove it.
Differential Revision: https://reviews.llvm.org/D47330
llvm-svn: 334378
We currently support them only in AArch64. The NEON Reference,
however, says they are 'ARMv7, ARMv8' intrinsics.
Differential Revision: https://reviews.llvm.org/D47447
llvm-svn: 334361
It looks like this got left in by accident in r289794; I can't think of
any reason this check would be necessary. (Maybe it was meant to be a
check that the AND has one use? But we check that a few lines earlier.)
Differential Revision: https://reviews.llvm.org/D47921
llvm-svn: 334322
Extension to D46954 (PR37426), this patch adds support for v8i16/v16i16 rotations in a similar manner - the conversion of the shift/rotate amount to a multiplication factor and the use of PMULLW to shift left and PMULHUW (ISD::MULHU) to shift the wrapped bits back around to be ORd together.
Differential Revision: https://reviews.llvm.org/D47822
llvm-svn: 334309
Summary: This patch originated from D46562 and is a proper subset, with some issues addressed for fsub.
Reviewers: spatel, hfinkel, wristow, arsenm
Reviewed By: spatel
Subscribers: wdng
Differential Revision: https://reviews.llvm.org/D47910
llvm-svn: 334306
As detailed on Agner's Microarchitecture doc (21.8 AMD Bobcat and Jaguar pipeline - Dependency-breaking instructions), these instructions are dependency breaking and fast-path zero the destination register (and appropriate EFLAGS bits).
llvm-svn: 334303
Same fix as rL334110: I noticed while working on zero-idiom + dependency-breaking support (PR36671) that most of our binary instruction schedule tests were reusing the same src registers, which would cause the tests to fail once we enable scalar zero-idiom support on btver2.
llvm-svn: 334302
AMDGPU inline assembler support i16, half and i128 typed variables in constraints, but they were reported as error.
Needed to fix https://github.com/RadeonOpenCompute/ROCm/issues/341,
e.g. to be able to load with global_load_dwordx4 to a 128bit integer variable
Differential Revision: https://reviews.llvm.org/D44920
llvm-svn: 334301
Simplify combineVectorTruncationWithPACKUS to mask the upper bits followed by calling truncateVectorWithPACK instead of duplicating with similar code.
This results in the codegen using (V)PACKUSDW on SSE41+ targets for vXi64/vXi32 inputs where before it always used PACKUSWB (along with a lot more bitcasting).
I've raised PR37749 as until we avoid unnecessary concats back to 256-bit for bitwise ops, we can't avoid splitting the input value into 128-bit subvectors for masking.
llvm-svn: 334289
We have some combines/lowerings that attempt to use PACKSS-then-PACKUS and others that use PACKUS-then-PACKSS.
PACKUS is much easier to combine with if we know the upper bits are zero as ComputeKnownBits can easily see through BITCASTs etc. especially now that rL333995 and rL334007 have landed. It also effectively works at byte level which further simplifies shuffle combines.
The only (minor) annoyances are that ComputeKnownBits can sometimes take longer as it doesn't fail as quickly as ComputeNumSignBits (but I'm not seeing any actual regressions in tests) and PACKUSDW only became available after SSE41 so we have more codegen diffs between targets.
llvm-svn: 334276
While trying to propagate AND masks back to loads, we currently allow
one non-load node to be included as a leaf in chain. This fix now
limits that node to produce only a single data value.
Differential Revision: https://reviews.llvm.org/D47878
llvm-svn: 334268
Summary: This change uses fmf subflags to guard fma optimizations as well as unsafe. These changes originated from D46483 and have been simplified via getNode.
Reviewers: spatel, arsenm, hfinkel, javed.absar
Reviewed By: spatel
Subscribers: nemanjai, wdng
Differential Revision: https://reviews.llvm.org/D47388
llvm-svn: 334242
- Make code easier to maintain.
- Avoid generating waitcnts for VMEM if the address sppace does not involve VMEM.
- Add support to generate waitcnts for LDS and GDS memory.
Differential Revision: https://reviews.llvm.org/D47504
llvm-svn: 334241
BitPermutationSelector sets Repl32 flag for bit groups which can be (potentially) benefit from 32-bit rotate-and-mask instructions with bit replication, i.e. rlwinm/rlwimi copies lower 32 bits into upper 32 bits on 64-bit PowerPC before rotation.
However, enforcing 32-bit instruction sometimes results in redundant generated code.
For example, the following simple code is compiled into rotldi + rlwimi while it can be compiled into only rldimi instruction if Repl32 flag is not set on the bit group for (a & 0xFFFFFFFF).
uint64_t func(uint64_t a, uint64_t b) {
return (a & 0xFFFFFFFF) | (b << 32) ;
}
To avoid such problem, this patch checks the potential benefit of Repl32 flag before setting it. If a bit group does not require rotation (i.e. RLAmt == 0) and won't be merged into another group, we do not benefit from Repl32 flag on this group.
Differential Revision: https://reviews.llvm.org/D47867
llvm-svn: 334195
Simplify combineVectorTruncationWithPACKSS to just a SIGN_EXTEND_INREG followed by using the existing truncateVectorWithPACK instead of duplicating code.
llvm-svn: 334193
The existing trunc_shl_17_v8i16_v8i32 test case should (but doesn't) fold to zero, I've added 2 new test cases:
- trunc_shl_16_v8i16_v8i32 which folds to zero (this is actually testing the target faux shuffle combine)
- trunc_shl_15_v8i16_v8i32 which should perform the full shl + truncate
llvm-svn: 334188
This has two main components. First, widen
widen short constant loads in DAG when they have
the correct alignment. This is already done a bit in
AMDGPUCodeGenPrepare, since that has access to
DivergenceAnalysis. This can't help kernarg loads
created in the DAG. Start to use DAG divergence analysis
to help this case.
The second part is to avoid kernel argument lowering
breaking the alignment of short vector elements because
calling convention lowering wants to split everything
into legal register types.
When loading a split type, load the nearest 4-byte aligned
segment and shift to get the desired bits. This extra
load of the earlier argument piece ends up merging,
and the bit extract hopefully folds out.
There are a number of improvements and regressions with
this, but I think as-is this is a better compromise between
several of the worst parts of SelectionDAG.
Particularly when i16 is legal, this produces worse code
for i8 and i16 element vector kernel arguments. This is
partially due to the very weak load merging the DAG does.
It only looks for fairly specific combines between pairs
of loads which no longer appear. In particular this
causes v4i16 loads to be split into 2 components when
previously the two halves were merged.
Worse, because of the newly introduced shifts, there
is a lot more unnecessary vector packing and unpacking code
emitted. At least some of this is due to reporting
false for isTypeDesirableForOp for i16 as a workaround for
the lack of divergence information in the DAG. The cases
where this happens it doesn't actually matter, but the
relevant code in SimplifyDemandedBits doens't have the context
to know to ignore this.
The use of the scalar cache is probably more important
than the mess of mostly scalar instructions doing this packing
and unpacking. Future work can fix this, possibly by making better
use of the new DAG divergence information for controlling promotion
decisions, or adding another version of shift + trunc + shift
combines that doesn't only know about the used types.
llvm-svn: 334180
Summary: Prevent folding of operations with memory loads when one of the sources has undefined register update.
Reviewers: craig.topper
Subscribers: llvm-commits, mike.dvoretsky, ashlykov
Differential Revision: https://reviews.llvm.org/D47621
llvm-svn: 334175
Summary:
When the branch folder hoist code into a predecessor it adjust live-in's
in the blocks it hoist code from. However it fail to handle hoisted code
that contain a defed register that originally is live-in in the block
through a super register.
This is fixed by replacing the live-in handling code with calls to
utility functions in LivePhysRegs.
Reviewers: kparzysz, gberry, MatzeB, uweigand, aprantl
Reviewed By: kparzysz
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D47529
llvm-svn: 334163
When denormals are supported we are producing a full division for
1.0f / x. That still can be replaced by the faster version:
bool c = fabs(x) > 0x1.0p+96f;
float s = c ? 0x1.0p-32f : 1.0f;
x *= s;
return s * v_rcp_f32(x)
in case if requested accuracy is 2.5ulp or less. The same version
is used if denormals are not supported for non 1.0 numerators, where
just v_rcp_f32 is then used for 1.0 numerator.
The optimization of 1/x is extended to the case -1/x, which is the
same except for the resulting sign bit.
OpenCL conformance passed with both enabled and disabled denorms.
Differential Revision: https://reviews.llvm.org/D47805
llvm-svn: 334142
Fixes terrible code on targets without f16 support. The
legalization creates a mess that is difficult to recover
from. Also should avoid randomly breaking these tests
multiple times in sequence in future commits.
Some regressions in cases where it happens to be better
to pull the source modifier after the conversion.
llvm-svn: 334132
Summary:
In D47428, i propose to choose the `~(-(1 << nbits))` as the canonical form of low-bit-mask formation.
As it is seen from these tests, there is a reason for that.
AArch64 currently better handles `~(-(1 << nbits))`, but not the more traditional `(1 << nbits) - 1` (sic!).
The other way around for X86.
It would be much better to canonicalize.
This patch is completely monkey-typing.
I don't really understand how this works :)
I have based it on `// x & (-1 >> (32 - y))` pattern.
Also, when we only have `BMI`, i wonder if we could use `BEXTR` with `start=0` ?
Related links:
https://bugs.llvm.org/show_bug.cgi?id=36419https://bugs.llvm.org/show_bug.cgi?id=37603https://bugs.llvm.org/show_bug.cgi?id=37610https://rise4fun.com/Alive/idM
Reviewers: craig.topper, spatel, RKSimon, javed.absar
Reviewed By: craig.topper
Subscribers: kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D47453
llvm-svn: 334125
Summary:
In D47428, i propose to choose the `~(-(1 << nbits))` as the canonical form of low-bit-mask formation.
As it is seen from these tests, there is a reason for that.
AArch64 currently better handles `~(-(1 << nbits))`, but not the more traditional `(1 << nbits) - 1` (sic!).
The other way around for X86.
It would be much better to canonicalize.
It would seem that there is too much tests, but this is most of all the auto-generated possible variants
of C code that one would expect for BZHI to be formed, and then manually cleaned up a bit.
So this should be pretty representable, which somewhat good coverage...
Related links:
https://bugs.llvm.org/show_bug.cgi?id=36419https://bugs.llvm.org/show_bug.cgi?id=37603https://bugs.llvm.org/show_bug.cgi?id=37610https://rise4fun.com/Alive/idM
Reviewers: javed.absar, craig.topper, RKSimon, spatel
Reviewed By: RKSimon
Subscribers: kristof.beyls, llvm-commits, RKSimon, craig.topper, spatel
Differential Revision: https://reviews.llvm.org/D47452
llvm-svn: 334124
Summary:
This change uses fmf subflags to guard optimizations as well as unsafe. These changes originated from D46483.
It contains only context for fsqrt.
Reviewers: spatel, hfinkel, arsenm
Reviewed By: spatel
Subscribers: hfinkel, wdng, andrew.w.kaylor, wristow, efriedma, nemanjai
Differential Revision: https://reviews.llvm.org/D47749
llvm-svn: 334113
If no alignment is set, the abi/preferred alignment of structs will be
used which may be higher than required. This can lead to extra padding
and in the end an increase in data size.
Differential Revision: https://reviews.llvm.org/D47633
llvm-svn: 334099
Only the bottom 16-bits of BEXTR's control op are required (0:8 INDEX, 15:8 LENGTH).
Differential Revision: https://reviews.llvm.org/D47690
llvm-svn: 334083
On targets like Arm some relaxations may only be performed when certain
architectural features are available. As functions can be compiled with
differing levels of architectural support we must make a judgement on
whether we can relax based on the MCSubtargetInfo for the function. This
change passes through the MCSubtargetInfo for the function to
fixupNeedsRelaxation so that the decision on whether to relax can be made
per function. In this patch, only the ARM backend makes use of this
information. We must also pass the MCSubtargetInfo to applyFixup because
some fixups skip error checking on the assumption that relaxation has
occurred, to prevent code-generation errors applyFixup must see the same
MCSubtargetInfo as fixupNeedsRelaxation.
Differential Revision: https://reviews.llvm.org/D44928
llvm-svn: 334078
Add minimal support to lower function calls.
Support only functions with arguments/return that go through registers
and have type i32.
Patch by Petar Avramovic.
Differential Revision: https://reviews.llvm.org/D45627
llvm-svn: 334071
This is a fix for the problem arising in D47374 (PR37678):
https://bugs.llvm.org/show_bug.cgi?id=37678
We may not have throughput info because it's not specified in the model
or it's not available with variant scheduling, so assume that those
instructions can execute/complete at max-issue-width.
Differential Revision: https://reviews.llvm.org/D47723
llvm-svn: 334055
CodeGenPrepare pass move extension instructions close to load instructions in different BB, so they can be combined later. But the extension instructions can't move through logical and shift instructions in current implementation. This patch enables this enhancement, so we can eliminate more extension instructions.
Differential Revision: https://reviews.llvm.org/D45537
This is re-commit of r331783, which was reverted by r333305. The performance regression was caused by some unlucky alignment, not a code generation problem.
llvm-svn: 334049
Preserves the low bound of the !range. I don't think
it's legal to do anything with the top half since it's
theoretically reading garbage.
llvm-svn: 334045
Summary: This change uses fmf subflags to guard optimizations as well as unsafe. These changes originated from D46483.
Reviewers: spatel, hfinkel
Reviewed By: spatel
Subscribers: nemanjai
Differential Revision: https://reviews.llvm.org/D47389
llvm-svn: 334037
Similar to v4i32 SHL, convert v8i16 shift amounts to scale factors instead to improve performance and reduce instruction count. We were already doing this for constant shifts, this adds variable shift support.
Reduces the serial nature of the codegen, which relies on chains of plendvb/pand+pandn+por shifts.
This is a step towards adding support for vXi16 vector rotates.
Differential Revision: https://reviews.llvm.org/D47546
llvm-svn: 334023
When legalizing illegal FP load results, this was
for some reason dropping the invariant and dereferencable
memory flags. There doesn't seem to be any reason for this,
and the equivalent isn't done for integer loads.
Fixes an issue in a future AMDGPU commit where some identical
loads fail to merge because one of the loads ends up
dropping the flags.
llvm-svn: 334020
BitPermutationSelector builds the output value by repeating rotate-and-mask instructions with input registers.
Here, we may avoid one rotate instruction if we start building from an input register that does not require rotation.
For example of the test case bitfieldinsert.ll, it first rotates left r4 by 8 bits and then inserts some bits from r5 without rotation.
This can be executed by one rlwimi instruction, which rotates r4 by 8 bits and inserts its bits into r5.
This patch adds a check for rotation amounts in the comparator used in sorting to process the input without rotation first.
Differential Revision: https://reviews.llvm.org/D47765
llvm-svn: 334011
Ideally we'd use resolveTargetShuffleInputs to handle faux shuffles as well but:
(a) that code path doesn't handle general/pre-legalized ops/types very well.
(b) I'm concerned about the compute time as they recurse to calls to computeKnownBits/ComputeNumSignBits which would need depth limiting somehow.
llvm-svn: 334007
This is the new version of D46181, allowing setjmp/longjmp
to work correctly with the Intel CET shadow stack by storing
SSP on setjmp and fixing it on longjmp. The patch has been
updated to use the cf-protection-return module flag instead
of HasSHSTK, and the bug that caused D46181 to be reverted
has been fixed with the test expanded to track that fix.
patch by mike.dvoretsky
Differential Revision: https://reviews.llvm.org/D47311
llvm-svn: 333990
Passing -mattr=-mmx needs to disable these instructions since the MMX register class won't have been set up. But we don't want -mattr=-mmx to disable SSE so we have to do it separately.
llvm-svn: 333984
Start by emitting remarks for very basic unsupported cases such as
irreducible CFGs and EHFunclets. The end goal is to be able to cover all
the cases where we give up with an explanation.
llvm-svn: 333972
We already output true and false in the printer, but the parser isn't able to
read it.
Differential Revision: https://reviews.llvm.org/D47424
llvm-svn: 333970
Summary: This include variant for add, uaddo and addcarry. usubo and subcarry require the carry to be flipped to preserve semantic, but we chose to do the transform anyway in that case as to push the transform down the carry chain.
Reviewers: efriedma, spatel, RKSimon, zvi, bkramer
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D46505
llvm-svn: 333943
On GFX9 and earlier, flat memory ops may decrement VMCNT out-of-order as well as LGKMCNT out-of-order.
Differential Revision: https://reviews.llvm.org/D46616
llvm-svn: 333926
This patch is the last of a sequence of three patches related to LLVM-dev RFC
"MC support for variant scheduling classes".
http://lists.llvm.org/pipermail/llvm-dev/2018-May/123181.html
This fixes PR36672.
The main goal of this patch is to teach llvm-mca how to solve variant scheduling
classes. This patch does that, plus it adds new variant scheduling classes to
the BtVer2 scheduling model to identify so-called zero-idioms (i.e. so-called
dependency breaking instructions that are known to generate zero, and that are
optimized out in hardware at register renaming stage).
Without the BtVer2 change, this patch would not have had any meaningful tests.
This patch is effectively the union of two changes:
1) a change that teaches llvm-mca how to resolve variant scheduling classes.
2) a change to the BtVer2 scheduling model that allows us to special-case
packed XOR zero-idioms (this partially fixes PR36671).
Differential Revision: https://reviews.llvm.org/D47374
llvm-svn: 333909
Ensure we test on 32-bit and 64-bit targets, and strip -mcpu usage.
Part of ongoing work to ensure we test all intrinsic style tests on 32 and 64 bit targets where possible.
llvm-svn: 333843
Further refactoring will wait until D47452 has landed.
Part of ongoing work to ensure we test all intrinsic style tests on 32 and 64 bit targets where possible.
llvm-svn: 333841
Summary: This creates a small perf regression, but after talking with Jacques Pienaar, he was good with it to get things moving toward removng SETCCE.
Reviewers: jpienaar, bryant
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D47626
llvm-svn: 333838
I had to tweak the i32 tests so we check both reg-reg and reg-mem cases.
I also added i64 load tests.
Part of ongoing work to ensure we test all intrinsic style tests on 32 and 64 bit targets where possible.
llvm-svn: 333827
We currently support them only in AArch64. The NEON Reference,
however, says they are 'ARMv7, ARMv8' intrinsics.
Differential Revision: https://reviews.llvm.org/D47120
llvm-svn: 333825
We currently support them only in AArch64. The NEON Reference,
however, says they are 'ARMv7, ARMv8' intrinsics.
Differential Revision: https://reviews.llvm.org/D47121
llvm-svn: 333819
Previously we just returned undef, but really we should be returning the pass thru input. We also need to make sure we preserve the chain output that the original intrinsic node had to maintain connectivity in the DAG. So we should just return the incoming chain as the output chain.
llvm-svn: 333804
The avx512f intrinsic tests were in the avx512vl file. We were also missing some combinations of masking.
This does show that we fail to use the zero masking form of expand loads when the passthru is zero. I'll try to get that fixed shortly.
llvm-svn: 333795
This patch replaces the --x86_extra_scrub command line argument to automatically support a second level of regex-scrubbing if it improves the matching of nearly-identical code patterns. The argument '--extra_scrub' is there now to force extra matching if required.
This is mostly useful to help us share 32-bit/64-bit x86 vector tests which only differs by retl/retq instructions, but any scrubber can now technically support this, meaning test checks don't have to be needlessly obfuscated.
I've updated some of the existing checks that had been manually run with --x86_extra_scrub, to demonstrate the extra "ret{{[l|q]}}" scrub now only happens when useful, and re-run the sse42-intrinsics file to show extra matches - most sse/avx intrinsics files should be able to now share 32/64 checks.
Tested with the opt/analysis scripts as well which share common code - AFAICT the other update scripts use their own versions.
Differential Revision: https://reviews.llvm.org/D47485
llvm-svn: 333749
Before we were relying on the any extend of the s1 to s32, but
for AAPCS we need to zero-extend it to at least s8.
Fixes PR36719
Differential Revision: https://reviews.llvm.org/D47425
llvm-svn: 333747
The WebAssembly committee has decided on the names `memory.size` and
`memory.grow` for the memory intrinsics, so update the LLVM intrinsics to
follow those names, keeping both sets of old names in place for
compatibility.
llvm-svn: 333708
Summary:
This lowers exception catching-related instructions:
1. Lowers `wasm.catch` intrinsic to `catch` instruction
2. Removes `catchpad` and `cleanuppad` instructions; they are not
necessary after isel phase. (`MachineBasicBlock::isEHFuncletEntry()` or
`MachineBasicBlock::isEHPad()` can be used instead.)
3. Lowers `catchret` and `cleanupret` instructions to pseudo `catchret`
and `cleanupret` instructions in isel, which will be replaced with other
instructions in `WebAssemblyExceptionPrepare` pass.
4. Adds 'WebAssemblyExceptionPrepare` pass, which is for running various
transformation for EH. Currently this pass only replaces `catchret` and
`cleanupret` instructions into appropriate wasm instructions to make
this patch successfully run until the end.
Currently this does not handle lowering of intrinsics related to LSDA
info generation (`wasm.landingpad.index` and `wasm.lsda`), because they
cannot be tested without implementing `EHStreamer`'s wasm-specific
handlers. They are marked as TODO, which is needed to make isel pass.
Also this does not generate `try` and `end_try` markers yet, which will
be handled in later patches.
This patch is based on the first wasm EH proposal.
(https://github.com/WebAssembly/exception-handling/blob/master/proposals/Exceptions.md)
Reviewers: dschuff, majnemer
Subscribers: jfb, sbc100, jgravelle-google, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D44090
llvm-svn: 333705
Memory clauses are formed into bundles in presence of xnack.
Their source operands are marked as early-clobber.
This allows to allocate distinct source and destination registers
within a clause and prevent breaking the clause with s_nop in the
hazard recognizer.
Clauses are undone before post-RA scheduler to allow some rescheduling,
which will not break the clause since artificial edges are created in
the dag to keep memory operations together. Yet this allows a better
ILP in some cases.
Differential Revision: https://reviews.llvm.org/D47511
llvm-svn: 333691
Noticed while fixing PR37426, for splat rotations (rotation by an uniform value) its better to just expand back to shift ops than performing as a general non-uniform rotation.
llvm-svn: 333661
Summary:
{FLDL2E, FLDL2T, FLDLG2, FLDLN2, FLDPI} were using WriteMicrocoded.
- I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
- For ZnVer1 and Atom, values were transferred form InstRWs.
- For SLM and BtVer2, I've guessed some values :(
Reviewers: RKSimon, craig.topper, andreadb
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D47585
llvm-svn: 333656
Summary:
- I've measured the values for Broadwell, Haswell, SandyBridge, Skylake.
- For ZnVer1 and Atom, values were transferred form `InstRW`s.
- For SLM and BtVer2, values are from Agner.
This is split off from https://reviews.llvm.org/D47377
Reviewers: RKSimon, andreadb
Subscribers: gbedwell, llvm-commits
Differential Revision: https://reviews.llvm.org/D47523
llvm-svn: 333642
This improves splat rotations (rotation by an uniform value), to avoid having to use the generic non-uniform shift code (extension to PR37426).
llvm-svn: 333641
v2: use "ensureAlignment"
make functions cache line aligned
Fixes GPU hangs since r333219:
"AMDGPU: Split R600 AsmPrinter code into its own class"
Differential Revision: https://reviews.llvm.org/D47516
llvm-svn: 333622
This is to make it clear what kind of bugs the LegalizerInfo::verifier
is able to catch and test its output
Reviewers: aemerson, qcolombet
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D46338
llvm-svn: 333597
This was just emitting loads with the ABI alignment
for the raw type. The true alignment is often better,
especially when an illegal vector type was scalarized.
The better alignment allows using a scalar load
more often.
llvm-svn: 333558
In terms of waitcnt insertion/if necessary, the waitcnt pass forces convergence
for a loop. Previously, that kicked if greater than 2 passes over a loop, which
doesn't account for loop with many bottom blocks. So, increase the threshold to
(n+1), where n is the number of bottom blocks. This gives the pass an
opportunity to consider the contribution of each bottom block, to the overall
loop, before the forced convergence potentially kicks in.
Differential Revision: https://reviews.llvm.org/D47488
llvm-svn: 333556
Support for Clang lowering of fused intrinsics. This patch:
1. Removes bindings to clang fma intrinsics.
2. Introduces new LLVM unmasked intrinsics with rounding mode:
int_x86_avx512_vfmadd_pd_512
int_x86_avx512_vfmadd_ps_512
int_x86_avx512_vfmaddsub_pd_512
int_x86_avx512_vfmaddsub_ps_512
supported with a new intrinsic type (INTR_TYPE_3OP_RM).
3. Introduces new x86 fmaddsub/fmsubadd folding.
4. Introduces new tests for code emitted by sequentions introduced in Clang part.
Patch by tkrupa
Reviewers: craig.topper, sroland, spatel, RKSimon
Reviewed By: craig.topper, RKSimon
Differential Revision: https://reviews.llvm.org/D47443
llvm-svn: 333554
It was noticed on D47377 that these tests were being unnecessarily affected by scheduler changes.
This adds vzeroupper at the end of some tests as we lose the 'FeatureFastPartialYMMorZMMWrite' feature from KNL, since Skylake+ don't support this its probably better.
llvm-svn: 333549