Commit Graph

436174 Commits

Florian Hahn 81a11da762
[CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl.
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle that creates a v64i8 vector, selecting groups of 3
zero elements and an element from the input.

This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.

This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no other convenient way.

This improves the generated code for loops like the one below in
combination with D96522.

    #include <stdint.h>

    /* Sums v for bytes below 127 and 256 - v for the rest. */
    int foo(uint8_t *p, int N) {
      unsigned long long sum = 0;
      for (int i = 0; i < N; i++, p++) {
        unsigned int v = *p;
        sum += (v < 127) ? v : 256 - v;
      }
      return sum;
    }

https://clang.godbolt.org/z/Wco866MjY
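A minimal, self-contained illustration of the byte layout the transform relies on (assuming a little-endian target; this models the shuffle's selection, not the actual CGP code):

```
#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  // One source byte followed by three zero bytes per 32-bit lane is what the
  // v64i8 mask selects from the pair {input, zeroinitializer}.
  uint8_t src[16], wide[64] = {0};
  for (int i = 0; i < 16; i++)
    src[i] = (uint8_t)(i * 17 + 3);
  for (int i = 0; i < 16; i++)
    wide[4 * i] = src[i];
  // Reinterpreting the 64 bytes as 16 x i32 gives the zero-extended values.
  uint32_t lanes[16];
  std::memcpy(lanes, wide, sizeof(lanes));
  for (int i = 0; i < 16; i++)
    assert(lanes[i] == src[i]);
  return 0;
}
```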

Reviewed By: t.p.northover

Differential Revision: https://reviews.llvm.org/D120571
2022-09-15 19:18:13 +01:00
Haojian Wu 6c9d2ee490 Revert "Fix bazel build after 84d07d021333f7b5716f0444d5c09105557272e0."
This reverts commit 10250c5a2a as the
related patch is being reverted.
2022-09-15 20:02:27 +02:00
Sergei Barannikov c6acb4eb0f [SDAG] Add `getCALLSEQ_END` overload taking `uint64_t`s
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. It
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
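A hedged sketch of what such an overload can look like, modeled on the existing getCALLSEQ_START form (the exact in-tree signature may differ):

```
// Wrap the raw sizes in pointer-sized target constants and defer to the
// SDValue-based overload that targets previously had to call by hand.
SDValue SelectionDAG::getCALLSEQ_END(SDValue Chain, uint64_t Size1,
                                     uint64_t Size2, SDValue Glue,
                                     const SDLoc &DL) {
  return getCALLSEQ_END(Chain,
                        getIntPtrConstant(Size1, DL, /*isTarget=*/true),
                        getIntPtrConstant(Size2, DL, /*isTarget=*/true),
                        Glue, DL);
}
```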
2022-09-15 14:02:12 -04:00
Sanjay Patel aafaa2f4fc [SCCP] convert ashr to lshr for non-negative shift value
This is similar to the existing signed instruction folds.
We get the obvious minimal patterns in other passes, but
this avoids potential missed folds when the multi-block
tests are converted to selects.
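A quick self-contained check of the premise (exhaustive over a small range; not the SCCP code itself):

```
#include <cassert>
#include <cstdint>

int main() {
  // For a known-non-negative value, arithmetic and logical right shifts
  // agree, so ashr can be relaxed to lshr.
  for (int32_t x = 0; x <= 4096; ++x)
    for (int s = 0; s < 32; ++s)
      assert((x >> s) == (int32_t)((uint32_t)x >> s));
  return 0;
}
```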
2022-09-15 13:54:52 -04:00
Sanjay Patel 204a2fff1f [SCCP] add tests for ashr range transforms; NFC 2022-09-15 13:54:52 -04:00
Amy Huang fda44bedd6 Add Clang driver flags equivalent to cl's /MD, /MT, /MDd, /MTd.
This will allow selecting the MS C runtime library without having to use
cc1 flags.

Differential Revision: https://reviews.llvm.org/D133457
2022-09-15 17:45:41 +00:00
Groverkss 644dfbac64 Revert "[MLIR][Presburger] Improve unittest parsing"
This reverts commit 84d07d0213.

Reverted to fix a compilation issue on gcc8.
2022-09-15 18:42:47 +01:00
Groverkss a53b56e4c4 Revert "[mlir] Remove the unused source file."
This reverts commit e488ce29ec.

Reverted to fix a compilation issue on gcc8.
2022-09-15 18:42:47 +01:00
Aart Bik 3986c86986 [mlir][sparse] partially implement codegen for sparse_tensor.compress
Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D133912
2022-09-15 10:32:33 -07:00
Siva Chandra Reddy d23d858d04 [libc] Add the implementation of the "remove" function.
Reviewed By: lntue

Differential Revision: https://reviews.llvm.org/D133922
2022-09-15 17:32:02 +00:00
Craig Topper ace05124f5 [IntegerDivision][AMDGPU] Use CreateLogicalOr to block poison propagation.
There are two ctlz intrinsics here with the zero_is_poison flag
set. There are also two comparisons that check if either of the
inputs to the ctlzs is zero. We need to use a logical or to block
the poison from the ctlz if either of the inputs is zero.
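A small model of the difference, with poison represented as an empty optional (an illustration of the IR semantics, not the patch itself):

```
#include <cassert>
#include <optional>

using Bit = std::optional<bool>;  // nullopt models poison

// Bitwise `or %a, %b` is poison if either operand is poison.
Bit bitwiseOr(Bit a, Bit b) {
  if (!a || !b)
    return std::nullopt;
  return *a || *b;
}

// Logical or, i.e. `select i1 %a, i1 true, i1 %b`, ignores %b when %a is
// true, so poison in %b is blocked.
Bit logicalOr(Bit a, Bit b) {
  if (!a)
    return std::nullopt;
  return *a ? Bit(true) : b;
}

int main() {
  Bit poison = std::nullopt;
  assert(!bitwiseOr(Bit(true), poison).has_value());
  assert(logicalOr(Bit(true), poison) == Bit(true));
  return 0;
}
```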

Reviewed By: arsenm, aqjune

Differential Revision: https://reviews.llvm.org/D130680
2022-09-15 09:38:02 -07:00
Sanjay Patel 02a27b3890 [InstCombine] fold X*X == 0 --> X == 0
This is safe when the mul does not overflow:
https://alive2.llvm.org/ce/z/LedVVP

This could be extended to handle non-zero compare constants
and non-squared multiplies.
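A brute-force sanity check of the claim over a range where the square cannot overflow (the alive2 link above is the actual proof):

```
#include <cassert>
#include <cstdint>

int main() {
  // With no overflow, x*x == 0 holds exactly when x == 0.
  for (int64_t x = -46340; x <= 46340; ++x)  // x*x stays within 32 bits
    assert((x * x == 0) == (x == 0));
  return 0;
}
```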
2022-09-15 12:02:50 -04:00
Sanjay Patel 5fd883377c [InstCombine] add tests for X*X == 0; NFC 2022-09-15 12:02:50 -04:00
Simon Pilgrim 94620e4fc3 [CostModel][X86] Add CostKinds handling for vector shift by generic/non-uniform shift amounts
These are the worst-case generic vector shift costs, where nothing is known about the shift amounts. In particular, this should stop us using the default size/latency cost of 1 for so many pre-AVX2 vector shifts that can often expand during lowering to 20+ uops, just for 128-bit vectors, resulting in some horrible inline/unroll decisions.

This was achieved with an updated version of the 'cost-tables vs llvm-mca' script from D103695 (I'll update that patch soon for reference).
2022-09-15 16:51:58 +01:00
Jay Foad 3822a01e0b [AMDGPU] Add GFX11 ds_bvh_stack_rtn_b32 instruction
Differential Revision: https://reviews.llvm.org/D133928
2022-09-15 16:46:14 +01:00
Joe Loser f97cc6b712 [libc++] Clean up `_LIBCPP_HAS_NO_PLATFORM_WAIT` macro
As the comment suggests, `_LIBCPP_HAS_NO_PLATFORM_WAIT` is not documented or
defined anywhere internally in the build system. It's a direct define in terms
of `_LIBCPP_HAS_NO_THREADS`. So, remove `_LIBCPP_HAS_NO_PLATFORM_WAIT` and use
`_LIBCPP_HAS_NO_THREADS` instead to control the desired behavior.

Differential Revision: https://reviews.llvm.org/D132715
2022-09-15 09:45:21 -06:00
Matt Arsenault 69153d6c0a AMDGPU: Use GlobalPriority for largest register tuples
Only do this for 16 and 32 register tuples, although we might want to
extend to 8 tuples.

It's incredibly expensive to spill these, and doing so majorly
interferes with the ability to allocate anything else in the function.

The lit tests show mostly sizeable improvements with a handful of tiny
regressions with large vectors.
2022-09-15 11:45:02 -04:00
Jakub Kuderski 3afd351b5f [mlir][arith] Support wide int cast emulation
Add support for `arith.extsi`, `arith.extui`, and `arith.trunci` ops.

Tested by checking the results for all 16-bit inputs when emulating i16 with i8.
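A hedged model of the emulation in plain C++ (illustrative helper names; the patch itself implements these as MLIR rewrite patterns):

```
#include <cassert>
#include <cstdint>

// An emulated i16 as a (low, high) pair of i8 halves.
struct EmulatedI16 {
  uint8_t lo, hi;
};

EmulatedI16 extui(uint8_t x) { return {x, 0}; }  // arith.extui: zero-fill hi
EmulatedI16 extsi(int8_t x) {                    // arith.extsi: sign-fill hi
  return {(uint8_t)x, (uint8_t)(x < 0 ? 0xFF : 0x00)};
}
uint8_t trunci(EmulatedI16 x) { return x.lo; }   // arith.trunci: keep low half

int main() {
  for (int v = -128; v <= 127; ++v) {
    EmulatedI16 e = extsi((int8_t)v);
    assert((int16_t)(e.lo | (e.hi << 8)) == v);  // sign is preserved
    assert((int8_t)trunci(e) == (int8_t)v);      // truncation keeps low byte
  }
  assert(extui(200).lo == 200 && extui(200).hi == 0);
  return 0;
}
```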

Reviewed By: antiagainst, Mogball

Differential Revision: https://reviews.llvm.org/D133612
2022-09-15 11:34:58 -04:00
Dmitry Preobrazhensky 405b19bb67 [AMDGPU][MC][GFX11] Add disassembler tests for v_readfirstlane_b32
Differential Revision: https://reviews.llvm.org/D133437
2022-09-15 18:18:33 +03:00
Nico Weber c28f4e3f04 Revert "[lld-macho] Add support for N_INDR symbols"
This reverts commit 5b8da10b87.
Breaks tests, see https://reviews.llvm.org/D133825
2022-09-15 11:17:48 -04:00
Sander de Smalen 45d28779c5 [AArch64][SME] Fix lowering of llvm.aarch64.get.pstatesm()
A thread may not have access to SME or TPIDR2_EL0, so in order to
safely query PSTATE.SM in a streaming-compatible function, the
code should call `__arm_sme_state()`, as described in the ABI:

  c2bb09c4d4

This means that the value of pstate.sm is:
* 0 if the function is non-streaming.
* 1 if the function has `arm_streaming` or `arm_locally_streaming`.
* evaluated at runtime by a call to __arm_sme_state() otherwise.
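A tiny model of that three-way decision (hypothetical names; the real lowering happens in the AArch64 backend):

```
#include <cassert>

enum class FnMode { NonStreaming, Streaming, StreamingCompatible };

// Returns 0 or 1 when PSTATE.SM is known statically; -1 means the value
// must be obtained at runtime via a call to __arm_sme_state().
int pstateSM(FnMode M) {
  switch (M) {
  case FnMode::NonStreaming:
    return 0;
  case FnMode::Streaming:  // arm_streaming or arm_locally_streaming
    return 1;
  case FnMode::StreamingCompatible:
    return -1;
  }
  return -1;  // unreachable for valid enum values
}

int main() {
  assert(pstateSM(FnMode::NonStreaming) == 0);
  assert(pstateSM(FnMode::StreamingCompatible) == -1);
  return 0;
}
```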

This patch also adds a calling convention for calls to SME support routines.

At some point we can remove the need for the llvm.aarch64.get.pstatesm() intrinsic
and use function calls (with the corresponding cc) directly instead.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D131571
2022-09-15 15:14:13 +00:00
Dmitry Preobrazhensky b0eea8f440 [AMDGPU][MC][GFX11][NFC] Update disassembler tests for MIMG instructions
Differential Revision: https://reviews.llvm.org/D133411
2022-09-15 18:04:34 +03:00
Simon Pilgrim 5182839e1b [CostModel][X86] Remove redundant SSSE3 checks from div/rem costs
These all match the default SSE2 costs, so use those instead
2022-09-15 15:56:11 +01:00
Matt Arsenault 63d1d37d35 RegAllocGreedy: Avoid overflowing priority bitfields
The class priority is expected to be at most 5 bits before it starts
clobbering bits used for other fields. Also clamp the instruction
distance in case we have millions of instructions.

AMDGPU was accidentally overflowing into the global priority bit in
some cases. I think in principle we would have wanted this, but in the
cases I've looked at, it had the counterintuitive effect and
de-prioritized the large register tuple.

Avoid the weird bit hack PPC uses for global priority. The
AllocationPriority field is really 5 bits, and PPC was relying on
overflowing this to 6 bits to forcibly set the global priority
bit. Split this out as a separate flag to avoid having magic behavior
for values above 31.
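A self-contained illustration of the overflow (hypothetical packing code, not the allocator's actual key layout):

```
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr unsigned PriorityBits = 5;
constexpr unsigned GlobalBit = PriorityBits;  // the bit just above the field

// Pack a class priority and the global-priority flag into one key.
uint32_t packKey(uint32_t Prio, bool Global) {
  return Prio | ((uint32_t)Global << GlobalBit);
}

int main() {
  // A priority of 32 needs 6 bits and leaks into the global-priority flag.
  assert(packKey(32, false) & (1u << GlobalBit));
  // Clamping to the 5-bit maximum keeps the value inside its own field.
  uint32_t Clamped = std::min(32u, (1u << PriorityBits) - 1);
  assert(!(packKey(Clamped, false) & (1u << GlobalBit)));
  return 0;
}
```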
2022-09-15 10:38:40 -04:00
Simon Pilgrim 7435ae25ab [CostModel][X86] Remove redundant BTVER2 checks from arithmetic costs
These all match the default AVX/AVX1 costs, so use those instead
2022-09-15 15:29:00 +01:00
Simon Pilgrim c119828302 [CostModel][X86] Remove redundant BTVER2 checks from shift costs
These all match the default AVX/AVX1 costs, so use those instead
2022-09-15 15:28:59 +01:00
Florian Hahn 791a7ae1ba
[AArch64] Add big-endian tests for trunc-to-tbl.ll
Extra tests for D133495.
2022-09-15 15:12:34 +01:00
Dmitry Preobrazhensky 0e868aff43 [AMDGPU][MC][GFX11] Add validation of constant bus limitations for VOPD
Differential Revision: https://reviews.llvm.org/D133881
2022-09-15 16:36:19 +03:00
Dmitry Preobrazhensky c89e60bf1f [AMDGPU][MC][GFX11] Add VOPD literals validation
Differential Revision: https://reviews.llvm.org/D133864
2022-09-15 16:29:53 +03:00
Dmitry Preobrazhensky 8bb5c89205 [AMDGPU][MC][NFC] Refactor AMDGPUAsmParser::validateVOPLiteral
Differential Revision: https://reviews.llvm.org/D133861
2022-09-15 16:26:14 +03:00
Mehdi Amini 0b7f05e1a1 Apply clang-tidy fixes for llvm-include-order in TypeTest.cpp (NFC) 2022-09-15 13:23:35 +00:00
Mehdi Amini 41630c6d3f Apply clang-tidy fixes for bugprone-argument-comment in LLVMTypeTest.cpp (NFC) 2022-09-15 13:23:34 +00:00
Mehdi Amini 7637f3b850 Apply clang-tidy fixes for readability-simplify-boolean-expr in OpFormatGen.cpp (NFC) 2022-09-15 13:23:34 +00:00
Tue Ly 1c89ae71ea [libc][math] Improve sinhf and coshf performance.
Optimize `sinhf` and `coshf` by computing exp(x) and exp(-x) simultaneously.

Currently `sinhf` and `coshf` are implemented using the following formulas:
```
  sinh(x) = 0.5 * (exp(x) - 1) - 0.5 * (exp(-x) - 1)
  cosh(x) = 0.5 * exp(x) + 0.5 * exp(-x)
```
where `exp(x)` and `exp(-x)` are calculated separately using the formula:
```
  exp(x) ~ 2^hi * 2^mid * exp(dx)
         ~ 2^hi * 2^mid * P(dx)
```
By expanding the polynomial `P(dx)` into even and odd parts
```
  P(dx) = P_even(dx) + dx * P_odd(dx)
```
we can see that the computations of `exp(x)` and `exp(-x)` have many things in common,
namely:
```
  exp(x)  ~ 2^(hi + mid) * (P_even(dx) + dx * P_odd(dx))
  exp(-x) ~ 2^(-(hi + mid)) * (P_even(dx) - dx * P_odd(dx))
```
Expanding `sinh(x)` and `cosh(x)` using the above formulas, we can compute
these two functions as follows in order to maximize the shared parts:
```
  sinh(x) = (e^x - e^(-x)) / 2
          ~ 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
                  dx * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
  cosh(x) = (e^x + e^(-x)) / 2
          ~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
                  dx * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
```
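A numerical sanity check of those shared-term formulas, using a toy range reduction and a truncated Taylor polynomial rather than libc's actual constants:

```
#include <cassert>
#include <cmath>

int main() {
  double x = 0.7;
  // Toy range reduction: t plays the role of hi + mid, dx is the remainder.
  double t = std::round(x / std::log(2.0) * 64.0) / 64.0;
  double dx = x - t * std::log(2.0);
  // Even and odd halves of a truncated Taylor polynomial for exp(dx).
  double d2 = dx * dx;
  double P_even = 1.0 + d2 / 2.0 + d2 * d2 / 24.0;
  double P_odd = 1.0 + d2 / 6.0 + d2 * d2 / 120.0;
  double A = std::exp2(t), iA = 1.0 / A;  // 2^(hi + mid) and its reciprocal
  // One evaluation of P_even, P_odd, A, and 1/A yields both functions.
  double sinh_x = 0.5 * (P_even * (A - iA) + dx * P_odd * (A + iA));
  double cosh_x = 0.5 * (P_even * (A + iA) + dx * P_odd * (A - iA));
  assert(std::fabs(sinh_x - std::sinh(x)) < 1e-12);
  assert(std::fabs(cosh_x - std::cosh(x)) < 1e-12);
  return 0;
}
```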
So in this patch, we perform the following optimizations for `sinhf` and `coshf`:
  1. Use the above formulas to maximize sharing of intermediate results.
  2. Apply similar optimizations from https://reviews.llvm.org/D133870.

Performance benchmark using `perf` tool from the CORE-MATH project on Ryzen 1700:
For `sinhf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.718
System LIBC reciprocal throughput : 63.151

BEFORE:
LIBC reciprocal throughput        : 90.116
LIBC reciprocal throughput        : 28.554    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.577    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 36.482
LIBC reciprocal throughput        : 16.955    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 13.943    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.821
System LIBC latency : 137.019

BEFORE
LIBC latency        : 97.122
LIBC latency        : 84.214    (with `-msse4.2` flag)
LIBC latency        : 71.611    (with `-mfma` flag)

AFTER
LIBC latency        : 54.555
LIBC latency        : 50.865    (with `-msse4.2` flag)
LIBC latency        : 48.700    (with `-mfma` flag)
```
For `coshf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.939
System LIBC reciprocal throughput : 19.695

BEFORE:
LIBC reciprocal throughput        : 52.845
LIBC reciprocal throughput        : 29.174    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.553    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 37.169
LIBC reciprocal throughput        : 17.805    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.691    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.478
System LIBC latency : 48.044

BEFORE
LIBC latency        : 99.123
LIBC latency        : 85.595    (with `-msse4.2` flag)
LIBC latency        : 72.776    (with `-mfma` flag)

AFTER
LIBC latency        : 57.760
LIBC latency        : 53.967    (with `-msse4.2` flag)
LIBC latency        : 50.987    (with `-mfma` flag)
```

Reviewed By: orex, zimmermann6

Differential Revision: https://reviews.llvm.org/D133913
2022-09-15 09:20:39 -04:00
Alexey Lapshin adabfb5e32 [DWARFLinker][NFC] Set the target DWARF version explicitly.
Currently, DWARFLinker determines the target DWARF version internally:
it examines the incoming object files, detects the maximal DWARF version,
and uses that version for the output file.
This patch allows the consumer of DWARFLinker to set the output DWARF
version explicitly, so that DWARFLinker uses the specified version instead
of an autodetected one. This lets consumers use different logic for
setting the target DWARF version. For example, instead of the maximally
used version, someone could set a higher version to convert from DWARFv4
to DWARFv5 (this possibility is not supported yet, but it would be good
if the interface supported it). Another option is to set the target
version through the command line. In this patch, the autodetection is
moved into the consumers (DwarfLinkerForBinary.cpp, DebugInfoLinker.cpp).
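A hedged sketch of what the moved autodetection can look like on the consumer side (hypothetical names, not the DWARFLinker API):

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Pick the maximal DWARF version across the input units; fall back to v4
// when the inputs carry no version information.
uint16_t detectTargetDWARFVersion(const std::vector<uint16_t> &UnitVersions) {
  uint16_t MaxVersion = 0;
  for (uint16_t V : UnitVersions)
    MaxVersion = std::max(MaxVersion, V);
  return MaxVersion ? MaxVersion : 4;
}
```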

Differential Revision: https://reviews.llvm.org/D132755
2022-09-15 16:06:10 +03:00
Simon Pilgrim 0ec028fe10 [CostModel][X86] Add CostKinds handling for vector shift by uniform/constuniform ops
Vector shift by a constant uniform amount is the cheapest shift instruction we have; non-constant uniform amounts have a marginally higher cost. Some targets 'splat' the amount internally to use the shift-per-element instruction, while others see a higher cost for the explicit zeroing of the upper bits of the (64-bit) shift amount.

This was achieved with an updated version of the 'cost-tables vs llvm-mca' script from D103695 (I'll update that patch soon for reference).
2022-09-15 14:05:30 +01:00
Florian Hahn 8f19de848b
[AArch64] Add big-endian tests for zext-to-tbl.ll
Extra tests for D120571.
2022-09-15 14:01:27 +01:00
wanglei a65557d4b3 [LoongArch] Fixup value adjustment in applyFixup
A complete implementation of `applyFixup` for D132323.

Makes `LoongArchAsmBackend::shouldForceRelocation` determine
whether the relocation types must be forced.

This patch also adds range and alignment checks for `b*` instructions'
operands, at which point the offset to a label is known.

Differential Revision: https://reviews.llvm.org/D132818
2022-09-15 21:00:22 +08:00
Aleksandr Platonov 3ce7d256f2 [clang][RecoveryExpr] Don't perform alignment check if parameter type is dependent
This patch fixes a crash caused by calling getTypeAlignInChars() with a dependent type.

Reviewed By: hokein

Differential Revision: https://reviews.llvm.org/D133886
2022-09-15 15:51:43 +03:00
Ivan Kosarev 693f816288 [AMDGPU][SILoadStoreOptimizer] Merge SGPR_IMM scalar buffer loads.
Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D133787
2022-09-15 13:48:51 +01:00
Jez Ng 5b8da10b87 [lld-macho] Add support for N_INDR symbols
This is similar to the `-alias` CLI option, but it gives finer-grained
control in that it allows the aliased symbols to be treated as private
externs.

While working on this, I realized that our `-alias` handling did not
cover the cases where the aliased symbol is a common or dylib symbol,
nor the case where we have an undefined that gets treated specially and
converted to a defined later on. My N_INDR handling neglects this too
for now; I've added checks and TODO messages for these.

`N_INDR` symbols cropped up as part of our attempt to link swift-stdlib.

Reviewed By: #lld-macho, thakis, thevinster

Differential Revision: https://reviews.llvm.org/D133825
2022-09-15 08:35:24 -04:00
Haojian Wu e488ce29ec [mlir] Remove the unused source file.
It appears to be a leftover, unused file from 84d07d0213

Differential Revision: https://reviews.llvm.org/D133937
2022-09-15 14:30:40 +02:00
Arjun P edd173201c [MLIR][Presburger] clarify why -0 is used instead of 0 (NFC) 2022-09-15 13:23:48 +01:00
Ilia Diachkov 3544d200d9 [SPIRV] add IR regularization pass
The patch adds the regularization pass that prepares LLVM IR for
the IR translation. It also contains the following changes:
- reduce indentation, make getNonParametrizedType, getSamplerType,
getPipeType, getImageType, getSampledImageType static in SPIRVBuiltins,
- rename mayBeOclOrSpirvBuiltin to getOclOrSpirvBuiltinDemangledName,
- move isOpenCLBuiltinType, isSPIRVBuiltinType, isSpecialType from
SPIRVGlobalRegistry.cpp to SPIRVUtils.cpp, renaming isSpecialType to
isSpecialOpaqueType,
- implement getTgtMemIntrinsic() in SPIRVISelLowering,
- add hasSideEffects = 0 in Pseudo (SPIRVInstrFormats.td),
- add legalization rule for G_MEMSET, correct G_BRCOND rule,
- add capability processing for OpBuildNDRange in SPIRVModuleAnalysis,
- don't correct types of registers holding constants and used in
G_ADDRSPACE_CAST (SPIRVPreLegalizer.cpp),
- lower memset/bswap intrinsics to functions in SPIRVPrepareFunctions,
- change TargetLoweringObjectFileELF to SPIRVTargetObjectFile
in SPIRVTargetMachine.cpp,
- correct comments.
5 LIT tests are added to show the improvement.

Differential Revision: https://reviews.llvm.org/D133253

Co-authored-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
Co-authored-by: Michal Paszkowski <michal.paszkowski@outlook.com>
Co-authored-by: Andrey Tretyakov <andrey1.tretyakov@intel.com>
Co-authored-by: Konrad Trifunovic <konrad.trifunovic@intel.com>
2022-09-15 15:53:44 +03:00
Michael Platings f0c234d2a6 [NFC] Don't assume llvm directory is CMake root
This makes the file consistent with ARM/CMakeLists.txt
2022-09-15 13:06:54 +01:00
Haojian Wu 10250c5a2a Fix bazel build after 84d07d0213. 2022-09-15 13:52:46 +02:00
Nicolas Vasilache a8645a3c2d [mlir][Linalg] Post submit addressed comments missed in f0cdc5bcd3f25192f12bfaff072ce02497b59c3c
Differential Revision: https://reviews.llvm.org/D133936
2022-09-15 04:47:41 -07:00
Tobias Hieta 24c10abe83
[NFC] Fix exception in version-check.py script 2022-09-15 13:34:29 +02:00
Aaron Ballman e076680bd5 Add a "Potentially Breaking Changes" section to the Clang release notes
Sometimes we make changes to the compiler that we expect may cause
disruption for users. For example, we may strengthen a warning to
be an error by default, or fix an accepts-invalid bug that's been
around for a long time, etc., which may cause previously accepted code to
now be rejected. Rather than hope users discover that information by
reading all of the release notes, it's better that we call these out in
one location at the top of the release notes.

Based on feedback collected in the discussion at:
https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213/

Differential Revision: https://reviews.llvm.org/D133771
2022-09-15 07:29:49 -04:00
Groverkss 84d07d0213 [MLIR][Presburger] Improve unittest parsing
This patch adds better functions for parsing MultiAffineFunctions and
PWMAFunctions in Presburger unittests.

A PWMAFunction can now be parsed as:

```
PWMAFunction result = parsePWMAF({
    {"(x, y) : (x >= 10, x <= 20, y >= 1)", "(x, y) -> (x + y)"},
    {"(x, y) : (x >= 21)", "(x, y) -> (x + y)"},
    {"(x, y) : (x <= 9)", "(x, y) -> (x - y)"},
    {"(x, y) : (x >= 10, x <= 20, y <= 0)", "(x, y) -> (x - y)"},
});
```

which is much more readable than the old format since the output can be
described as an AffineMap, instead of coefficients.

This patch also adds support for parsing divisions in MultiAffineFunctions
and PWMAFunctions which was previously not possible.

Reviewed By: arjunp

Differential Revision: https://reviews.llvm.org/D133654
2022-09-15 12:09:00 +01:00