Commit Graph

13549 Commits

Author SHA1 Message Date
Nikita Popov 9a4453592b [DAGCombine] Fold (x & ~y) | y patterns
Fold (x & ~y) | y and it's four commuted variants to x | y. This pattern
can in particular appear when a vselect c, x, -1 is expanded to
(x & ~c) | (-1 & c) and combined to (x & ~c) | c.

This change has some overlap with D59066, which avoids creating a
vselect of this form in the first place during uaddsat expansion.

Differential Revision: https://reviews.llvm.org/D59174

llvm-svn: 356333
2019-03-17 15:45:38 +00:00
Simon Pilgrim 3b0a6c69ee [DAGCombine] combineShuffleOfScalars - handle non-zero SCALAR_TO_VECTOR indices (PR41097)
rL356292 reduces the size of scalar_to_vector if we know the upper bits are undef - which means that shuffles may find they are suddenly referencing scalar_to_vector elements other than zero - so make sure we handle this as undef.

llvm-svn: 356327
2019-03-16 17:36:26 +00:00
Simon Pilgrim f2c53b5d6c [X86][SSE] Constant fold PEXTRB/PEXTRW/EXTRACT_VECTOR_ELT nodes.
Replaces existing i1-only fold.

llvm-svn: 356325
2019-03-16 15:02:00 +00:00
Simon Pilgrim 0f472e1d01 [X86] Add SimplifyDemandedBitsForTargetNode support for PEXTRB/PEXTRW
Improved constant folding for PEXTRB/PEXTRW will be added in a future commit

llvm-svn: 356324
2019-03-16 14:29:50 +00:00
Roman Lebedev 9f37790608 [X86] X86ISelLowering::combineSextInRegCmov(): also handle i8 CMOV's
Summary:
As noted by @andreadb in https://reviews.llvm.org/D59035#inline-525780

If we have `sext (trunc (cmov C0, C1) to i8)`,
we can instead do `cmov (sext (trunc C0 to i8)), (sext (trunc C1 to i8))`

Reviewers: craig.topper, andreadb, RKSimon

Reviewed By: craig.topper

Subscribers: llvm-commits, andreadb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59412

llvm-svn: 356301
2019-03-15 21:18:05 +00:00
Roman Lebedev b6e376ddfa [X86] Promote i8 CMOV's (PR40965)
Summary:
@mclow.lists brought up this issue up in IRC, it came up during
implementation of libc++ `std::midpoint()` implementation (D59099)
https://godbolt.org/z/oLrHBP

Currently LLVM X86 backend only promotes i8 CMOV if it came from 2x`trunc`.
This differential proposes to always promote i8 CMOV.

There are several concerns here:
* Is this actually more performant, or is it just the ASM that looks cuter?
* Does this result in partial register stalls?
* What about branch predictor?

# Indeed, performance should be the main point here.
Let's look at a simple microbenchmark: {F8412076}
```
#include "benchmark/benchmark.h"

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

// Future preliminary libc++ code, from Marshall Clow.
namespace std {
template <class _Tp>
__inline _Tp midpoint(_Tp __a, _Tp __b) noexcept {
  using _Up = typename std::make_unsigned<typename remove_cv<_Tp>::type>::type;

  int __sign = 1;
  _Up __m = __a;
  _Up __M = __b;
  if (__a > __b) {
    __sign = -1;
    __m = __b;
    __M = __a;
  }
  return __a + __sign * _Tp(_Up(__M - __m) >> 1);
}
}  // namespace std

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct RandRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(getVectorOfRandomNumbers<T>(count),
                          getVectorOfRandomNumbers<T>(count));
  }
};
struct ZeroRand {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    return std::make_pair(std::vector<T>(count, T(0)),
                          getVectorOfRandomNumbers<T>(count));
  }
};

template <class T, class Gen>
void BM_StdMidpoint(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    for (size_t i = 0; i < Length; i++) {
      const auto calculated = std::midpoint(a[i], b[i]);
      benchmark::DoNotOptimize(calculated);
    }
  }
  state.SetComplexityN(Length);
  state.counters["midpoints"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["midpoints/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = 2 * 1024 * 1024;
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

// Both of the values are random.
// The comparison is unpredictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, int32_t, RandRand)
    ->Apply(CustomArguments<int32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, RandRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int64_t, RandRand)
    ->Apply(CustomArguments<int64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, RandRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int16_t, RandRand)
    ->Apply(CustomArguments<int16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, RandRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, int8_t, RandRand)
    ->Apply(CustomArguments<int8_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, RandRand)
    ->Apply(CustomArguments<uint8_t>);

// One value is always zero, and another is bigger or equal than zero.
// The comparison is predictable.
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint32_t, ZeroRand)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint64_t, ZeroRand)
    ->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint16_t, ZeroRand)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_StdMidpoint, uint8_t, ZeroRand)
    ->Apply(CustomArguments<uint8_t>);
```

```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks ./llvm-cmov-bench-OLD ./llvm-cmov-bench-NEW
RUNNING: ./llvm-cmov-bench-OLD --benchmark_out=/tmp/tmp5a5qjm
2019-03-06 21:53:31
Running ./llvm-cmov-bench-OLD
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.78, 1.81, 1.36
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300398 ns       300404 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25083G/s midpoints=305.398M midpoints/sec=436.319M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300433 ns       300433 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25052G/s midpoints=305.398M midpoints/sec=436.278M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       169857 ns       169858 ns         4121 bytes_read/iteration=1024k bytes_read/sec=5.74929G/s midpoints=270.074M midpoints/sec=385.828M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.59 N          2.59 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      169770 ns       169771 ns         4125 bytes_read/iteration=1024k bytes_read/sec=5.75223G/s midpoints=270.336M midpoints/sec=386.026M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      591169 ns       591179 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.65189G/s midpoints=309.854M midpoints/sec=443.426M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.25 N          2.25 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     591264 ns       591274 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65162G/s midpoints=310.378M midpoints/sec=443.354M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.25 N          2.25 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      2983669 ns      2983689 ns          235 bytes_read/iteration=1024k bytes_read/sec=335.156M/s midpoints=123.208M midpoints/sec=175.718M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           5.69 N          5.69 N
BM_StdMidpoint<int8_t, RandRand>_RMS               0 %             0 %
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288     2668398 ns      2668419 ns          262 bytes_read/iteration=1024k bytes_read/sec=374.754M/s midpoints=137.363M midpoints/sec=196.479M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          5.09 N          5.09 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              0 %             0 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300887 ns       300887 ns         2331 bytes_read/iteration=1024k bytes_read/sec=3.24561G/s midpoints=305.529M midpoints/sec=435.619M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169634 ns       169634 ns         4102 bytes_read/iteration=1024k bytes_read/sec=5.75688G/s midpoints=268.829M midpoints/sec=386.338M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     592252 ns       592255 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64889G/s midpoints=309.854M midpoints/sec=442.62M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288      987295 ns       987309 ns          711 bytes_read/iteration=1024k bytes_read/sec=1012.85M/s midpoints=372.769M midpoints/sec=531.028M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          1.88 N          1.88 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              1 %             1 %
RUNNING: ./llvm-cmov-bench-NEW --benchmark_out=/tmp/tmpPvwpfW
2019-03-06 21:56:58
Running ./llvm-cmov-bench-NEW
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.17, 1.46, 1.30
----------------------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations UserCounters<...>
----------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072      300878 ns       300880 ns         2324 bytes_read/iteration=1024k bytes_read/sec=3.24569G/s midpoints=304.611M midpoints/sec=435.629M/s
BM_StdMidpoint<int32_t, RandRand>_BigO          2.29 N          2.29 N
BM_StdMidpoint<int32_t, RandRand>_RMS              2 %             2 %
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072     300231 ns       300226 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.25276G/s midpoints=305.398M midpoints/sec=436.578M/s
BM_StdMidpoint<uint32_t, RandRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, RandRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<int64_t, RandRand>/65536       170819 ns       170777 ns         4115 bytes_read/iteration=1024k bytes_read/sec=5.71835G/s midpoints=269.681M midpoints/sec=383.752M/s
BM_StdMidpoint<int64_t, RandRand>_BigO          2.60 N          2.60 N
BM_StdMidpoint<int64_t, RandRand>_RMS              3 %             3 %
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536      171705 ns       171708 ns         4106 bytes_read/iteration=1024k bytes_read/sec=5.68733G/s midpoints=269.091M midpoints/sec=381.671M/s
BM_StdMidpoint<uint64_t, RandRand>_BigO         2.62 N          2.62 N
BM_StdMidpoint<uint64_t, RandRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<int16_t, RandRand>/262144      592510 ns       592516 ns         1182 bytes_read/iteration=1024k bytes_read/sec=1.64816G/s midpoints=309.854M midpoints/sec=442.425M/s
BM_StdMidpoint<int16_t, RandRand>_BigO          2.26 N          2.26 N
BM_StdMidpoint<int16_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144     614823 ns       614823 ns         1180 bytes_read/iteration=1024k bytes_read/sec=1.58836G/s midpoints=309.33M midpoints/sec=426.373M/s
BM_StdMidpoint<uint16_t, RandRand>_BigO         2.33 N          2.33 N
BM_StdMidpoint<uint16_t, RandRand>_RMS             4 %             4 %
<...>
BM_StdMidpoint<int8_t, RandRand>/524288      1073181 ns      1073201 ns          650 bytes_read/iteration=1024k bytes_read/sec=931.791M/s midpoints=340.787M midpoints/sec=488.527M/s
BM_StdMidpoint<int8_t, RandRand>_BigO           2.05 N          2.05 N
BM_StdMidpoint<int8_t, RandRand>_RMS               1 %             1 %
BM_StdMidpoint<uint8_t, RandRand>/524288     1071010 ns      1071020 ns          653 bytes_read/iteration=1024k bytes_read/sec=933.689M/s midpoints=342.36M midpoints/sec=489.522M/s
BM_StdMidpoint<uint8_t, RandRand>_BigO          2.05 N          2.05 N
BM_StdMidpoint<uint8_t, RandRand>_RMS              1 %             1 %
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072     300413 ns       300416 ns         2330 bytes_read/iteration=1024k bytes_read/sec=3.2507G/s midpoints=305.398M midpoints/sec=436.302M/s
BM_StdMidpoint<uint32_t, ZeroRand>_BigO         2.29 N          2.29 N
BM_StdMidpoint<uint32_t, ZeroRand>_RMS             2 %             2 %
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536      169667 ns       169669 ns         4123 bytes_read/iteration=1024k bytes_read/sec=5.75568G/s midpoints=270.205M midpoints/sec=386.257M/s
BM_StdMidpoint<uint64_t, ZeroRand>_BigO         2.59 N          2.59 N
BM_StdMidpoint<uint64_t, ZeroRand>_RMS             3 %             3 %
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144     591396 ns       591404 ns         1184 bytes_read/iteration=1024k bytes_read/sec=1.65126G/s midpoints=310.378M midpoints/sec=443.257M/s
BM_StdMidpoint<uint16_t, ZeroRand>_BigO         2.26 N          2.26 N
BM_StdMidpoint<uint16_t, ZeroRand>_RMS             1 %             1 %
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288     1069421 ns      1069413 ns          655 bytes_read/iteration=1024k bytes_read/sec=935.092M/s midpoints=343.409M midpoints/sec=490.258M/s
BM_StdMidpoint<uint8_t, ZeroRand>_BigO          2.04 N          2.04 N
BM_StdMidpoint<uint8_t, ZeroRand>_RMS              0 %             0 %
Comparing ./llvm-cmov-bench-OLD to ./llvm-cmov-bench-NEW
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_StdMidpoint<int32_t, RandRand>/131072                 +0.0016         +0.0016        300398        300878        300404        300880
<...>
BM_StdMidpoint<uint32_t, RandRand>/131072                -0.0007         -0.0007        300433        300231        300433        300226
<...>
BM_StdMidpoint<int64_t, RandRand>/65536                  +0.0057         +0.0054        169857        170819        169858        170777
<...>
BM_StdMidpoint<uint64_t, RandRand>/65536                 +0.0114         +0.0114        169770        171705        169771        171708
<...>
BM_StdMidpoint<int16_t, RandRand>/262144                 +0.0023         +0.0023        591169        592510        591179        592516
<...>
BM_StdMidpoint<uint16_t, RandRand>/262144                +0.0398         +0.0398        591264        614823        591274        614823
<...>
BM_StdMidpoint<int8_t, RandRand>/524288                  -0.6403         -0.6403       2983669       1073181       2983689       1073201
<...>
BM_StdMidpoint<uint8_t, RandRand>/524288                 -0.5986         -0.5986       2668398       1071010       2668419       1071020
<...>
BM_StdMidpoint<uint32_t, ZeroRand>/131072                -0.0016         -0.0016        300887        300413        300887        300416
<...>
BM_StdMidpoint<uint64_t, ZeroRand>/65536                 +0.0002         +0.0002        169634        169667        169634        169669
<...>
BM_StdMidpoint<uint16_t, ZeroRand>/262144                -0.0014         -0.0014        592252        591396        592255        591404
<...>
BM_StdMidpoint<uint8_t, ZeroRand>/524288                 +0.0832         +0.0832        987295       1069421        987309       1069413
```

What can we tell from the benchmark?
* `BM_StdMidpoint<[u]int8_t, RandRand>` indeed has the worst performance.
* All `BM_StdMidpoint<uint{8,16,32}_t, ZeroRand>` are all performant, even the 8-bit case.
  That is because there we are computing mid point between zero and some random number,
  thus if the branch predictor is in use, it is in optimal situation.
* Promoting 8-bit CMOV did improve performance of `BM_StdMidpoint<[u]int8_t, RandRand>`, by -59%..-64%.

# What about branch predictor?
* `BM_StdMidpoint<uint8_t, ZeroRand>` was faster than `BM_StdMidpoint<uint{16,32,64}_t, ZeroRand>`,
  which may mean that well-predicted branch is better than `cmov`.
* Promoting 8-bit CMOV degraded performance of `BM_StdMidpoint<uint8_t, ZeroRand>`,
  `cmov` is up to +10% worse than well-predicted branch.
* However, i do not believe this is a concern. If the branch is well predicted,  then the PGO
  will also say that it is well predicted, and LLVM will happily expand cmov back into branch:
  https://godbolt.org/z/P5ufig

# What about partial register stalls?
I'm not really able to answer that.
What i can say is that if the branch is unpredictable (if it is predictable, then use PGO and you'll have branch)
in ~50% of cases you will have to pay branch misprediction penalty.
```
$ grep -i MispredictPenalty X86Sched*.td
X86SchedBroadwell.td:  let MispredictPenalty = 16;
X86SchedHaswell.td:  let MispredictPenalty = 16;
X86SchedSandyBridge.td:  let MispredictPenalty = 16;
X86SchedSkylakeClient.td:  let MispredictPenalty = 14;
X86SchedSkylakeServer.td:  let MispredictPenalty = 14;
X86ScheduleBdVer2.td:  let MispredictPenalty = 20; // Minimum branch misdirection penalty.
X86ScheduleBtVer2.td:  let MispredictPenalty = 14; // Minimum branch misdirection penalty
X86ScheduleSLM.td:  let MispredictPenalty = 10;
X86ScheduleZnver1.td:  let MispredictPenalty = 17;
```
.. which it can be as small as 10 cycles and as large as 20 cycles.
Partial register stalls do not seem to be an issue for AMD CPU's.
For intel CPU's, they should be around ~5 cycles?
Is that actually an issue here? I'm not sure.

In short, i'd say this is an improvement, at least on this microbenchmark.

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40965 | PR40965 ]].

Reviewers: craig.topper, RKSimon, spatel, andreadb, nikic

Reviewed By: craig.topper, andreadb

Subscribers: jfb, jdoerfert, llvm-commits, mclow.lists

Tags: #llvm, #libc

Differential Revision: https://reviews.llvm.org/D59035

llvm-svn: 356300
2019-03-15 21:17:53 +00:00
Simon Pilgrim d33e62c826 [X86][SSE] Fold scalar_to_vector(i64 anyext(x)) -> bitcast(scalar_to_vector(i32 anyext(x)))
Reduce the size of an any-extended i64 scalar_to_vector source to i32 - the any_extend nodes are often introduced by SimplifyDemandedBits.

llvm-svn: 356292
2019-03-15 19:14:28 +00:00
Philip Reames d238bf7855 [X86][GlobalISEL] Support lowering aligned unordered atomics
The existing lowering code is accidentally correct for unordered atomics as far as I can tell. An unordered atomic has no memory ordering, and simply requires the actual load or store to be done as a single well aligned instruction. As such, relax the restriction while adding tests to ensure the lowering remains correct in the future.

Differential Revision: https://reviews.llvm.org/D57803

llvm-svn: 356280
2019-03-15 17:50:30 +00:00
Simon Pilgrim 8fbe439345 [SelectionDAG] Add SimplifyDemandedBits handling for ISD::SCALAR_TO_VECTOR
Fixes a lot of constant folding mismatches between i686 and x86_64

llvm-svn: 356273
2019-03-15 17:00:55 +00:00
Simon Pilgrim 65165d54bb [X86] Add SimplifyDemandedBitsForTargetNode support for PINSRB/PINSRW
llvm-svn: 356270
2019-03-15 16:16:49 +00:00
Mikael Holmen 339daae806 [CodeGenPrepare] avoid crashing from replacing a phi twice
Summary:
This is a fix to bug 41052:
https://bugs.llvm.org/show_bug.cgi?id=41052

While trying to optimize a memory instruction in a dead basic block, we end up registering the same phi for replacement twice. This patch avoids registering more than the first replacement candidate for a phi.

Patch by: JesperAntonsson

Reviewers: skatkov, aprantl

Reviewed By: aprantl

Subscribers: jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59358

llvm-svn: 356260
2019-03-15 13:51:05 +00:00
Simon Pilgrim 0ad17402a9 [X86][SSE] Attempt to convert SSE shift-by-var to shift-by-imm.
Prep work for PR40203

llvm-svn: 356249
2019-03-15 11:05:42 +00:00
Philip Reames 81abc7fb0c [Tests] Add tests to demonstrate hoisting of unordered invariant loads
llvm-svn: 356184
2019-03-14 18:06:15 +00:00
Philip Reames 9616cf0510 [Tests] Revert an accident change to a test
llvm-svn: 356183
2019-03-14 18:02:19 +00:00
Philip Reames c53f02a32a Auto-generate an existing test to make it easier to update
llvm-svn: 356181
2019-03-14 17:59:59 +00:00
Philip Reames af41b282c5 [Tests] Add tests for reordering of unordered atomics on invariant locations
llvm-svn: 356172
2019-03-14 17:36:58 +00:00
Philip Reames 70d156991c Allow code motion (and thus folding) for atomic (but unordered) memory operands
Building on the work done in D57601, now that we can distinguish between atomic and volatile memory accesses, go ahead and allow code motion of unordered atomics. As seen in the diffs, this allows much better folding of memory operations into using instructions. (Mostly done by the PeepholeOpt pass.)

Note: I have not reviewed all callers of hasOrderedMemoryRef since one of them - isSafeToMove - is very widely used. I'm relying on the documented semantics of each method to judge correctness.

Differential Revision: https://reviews.llvm.org/D59345

llvm-svn: 356170
2019-03-14 17:20:59 +00:00
Philip Reames 8dd9b54d9b [Tests] Add negative folding tests w/fences as requested in D59345
llvm-svn: 356165
2019-03-14 17:05:18 +00:00
Craig Topper c747ac3f93 [X86] Fix the pattern changes from r356121 so that the ROR*r1/ROR*m1 pattern use the rotr opcode.
These instructions used to use rotl with a bitwidth-1 immediate. I changed the immediate to 1,
but failed to change the opcode.

Thankfully this seems to have not caused a functional issue because we now had two rotl by 1 patterns,
but the correct ones were earlier and took priority. So we just missed some optimization.

llvm-svn: 356164
2019-03-14 16:53:24 +00:00
Sanjay Patel 5d1df114e8 [x86] prevent infinite looping from vselect commutation (PR41066)
This is an immediate fix for:
https://bugs.llvm.org/show_bug.cgi?id=41066
...but as noted there and the code comments, we should do better
by stubbing this out sooner.

llvm-svn: 356158
2019-03-14 15:32:34 +00:00
Craig Topper 54a0b53308 [X86] Add patterns for rotr by immediate to fix PR41057.
Prior to the introduction of funnel shift intrinsics we could count on rotate
by immediates prefering to use rotl since that's what MatchRotate would check
first. The or+shift pattern doesn't have a direction so one must be chosen
arbitrarily.

With funnel shift, there is a direction and fshr will try to use rotr first.
While fshl will try to use rotl first.

This patch adds the isel patterns for rotr to complement the rotl patterns. I've
put the rotr by 1 patterns in the instruction patterns. And moved the rotl by
bitwidth-1 patterns to separate Pat patterns.

Fixes PR41057.

llvm-svn: 356121
2019-03-14 07:07:26 +00:00
Craig Topper c867847016 [X86] Add various test cases for PR41057. NFC
llvm-svn: 356120
2019-03-14 07:07:24 +00:00
Quentin Colombet e77e5f44b8 [GlobalISel][Utils] Add a getConstantVRegVal variant that looks through instrs
getConstantVRegVal used to only look for G_CONSTANT when looking at
unboxing the value of a vreg. However, constants are sometimes not
directly used and are hidden behind trunc, s|zext or copy chain of
computation.

In particular this may be introduced by the legalization process that
doesn't want to simplify these patterns because it can lead to infine
loop when legalizing a constant.

To circumvent that problem, add a new variant of getConstantVRegVal,
named getConstantVRegValWithLookThrough, that allow to look through
extensions.

Differential Revision: https://reviews.llvm.org/D59227

llvm-svn: 356116
2019-03-14 01:37:13 +00:00
Craig Topper fad96a1588 [X86] Add 64-bit mode command lines to rot32.ll so that it will demonstrate PR41055 for 32 bit. NFC
llvm-svn: 356112
2019-03-14 00:23:31 +00:00
Simon Pilgrim e15cd7909b [X86] Remove icmp undef in more reduced tests
llvm-svn: 356084
2019-03-13 19:07:54 +00:00
Simon Pilgrim 8f1b825068 [X86] Regenerate tail call tests
llvm-svn: 356083
2019-03-13 19:04:45 +00:00
Craig Topper 84abec2855 [X86] Check for 64-bit mode in X86Subtarget::hasCmpxchg16b()
The feature flag alone can't be trusted since it can be passed via -mattr. Need to ensure 64-bit mode as well.

We had a 64 bit mode check on the instruction to make the assembler work correctly. But we weren't guarding any of our lowering code or the hooks for the AtomicExpandPass.

I've added 32-bit command lines to atomic128.ll with and without cx16. The tests there would all previously fail if -mattr=cx16 was passed to them. I had to move one test case for f128 to a new file as it seems to have a different 32-bit mode or possibly sse issue.

Differential Revision: https://reviews.llvm.org/D59308

llvm-svn: 356078
2019-03-13 18:48:50 +00:00
Simon Pilgrim e1be3403ff [X86] Avoid icmp undef in reduced tests
Because we don't currently simplify icmp with undef in DAG, bugpoint loves to introduce them during reduction.

This is a small step towards re-adding non-undef values into some of the simpler tests so that they should still test correctly and emit similar/same codegen.

Prep work for PR40800 ([SelectionDAG] Add UNDEF handling to SelectionDAG::FoldSetCC).

llvm-svn: 356076
2019-03-13 18:36:59 +00:00
Simon Pilgrim 510f26dca8 Regenerate test
llvm-svn: 356071
2019-03-13 18:18:24 +00:00
Nirav Dave d6351340bb [DAGCombiner] If a TokenFactor would be merged into its user, consider the user later.
Summary:
A number of optimizations are inhibited by single-use TokenFactors not
being merged into the TokenFactor using it. This makes we consider if
we can do the merge immediately.

Most tests changes here are due to the change in visitation causing
minor reorderings and associated reassociation of paired memory
operations.

CodeGen tests with non-reordering changes:

  X86/aligned-variadic.ll -- memory-based add folded into stored leaq
  value.

  X86/constant-combiners.ll -- Optimizes out overlap between stores.

  X86/pr40631_deadstore_elision -- folds constant byte store into
  preceding quad word constant store.

Reviewers: RKSimon, craig.topper, spatel, efriedma, courbet

Reviewed By: courbet

Subscribers: dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, eraman, hiraditya, kbarton, jrtc27, atanasyan, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59260

llvm-svn: 356068
2019-03-13 17:07:09 +00:00
Simon Pilgrim bef4fe056d [X86][AVX] Add X86ISD::VTRUNC handling to SimplifyDemandedVectorEltsForTargetNode
llvm-svn: 356067
2019-03-13 17:00:18 +00:00
Simon Pilgrim d9aa879b67 [X86][AVX] Add combineConcatVectors support to improve subvector handling
Attempt to combine CONCAT_VECTORS nodes, which we only really have pre-legalization.

This encourages a lot of X86ISD::SUBV_BROADCAST generation, so I've added SimplifyDemandedVectorEltsForTargetNode handling for this at the same time.

The X86ISD::VTRUNC regression in shuffle-vs-trunc-256-widen.ll will be handled in a future commit.

llvm-svn: 356064
2019-03-13 16:37:30 +00:00
Sanjay Patel 0a251e4076 [x86] limit extractelement of setcc to pre-legalization
A fuzzer found the crasher:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=13700

The bug was introduced recently here:
rL355741

This is the quick fix. If we need to do this transform
later, then we'd have to extend/truncate the vector setcc
element type to the scalar setcc type (i8). 

llvm-svn: 356053
2019-03-13 14:49:52 +00:00
Clement Courbet 3bb5d0bb9b Re-land r354244 "[DAGCombiner] Eliminate dead stores to stack."
Always check candidates for hasOtherUses(), not only stores.

llvm-svn: 356050
2019-03-13 13:56:23 +00:00
Simon Pilgrim 7abbd70300 [X86][AVX] lowerShuffleAsBroadcast - improve load folding by avoiding bitcasts
AVX1 broadcasts were failing as we were adding bitcasts that caused MayFoldLoad's hasOneUse to return false.

This patch stops introducing bitcasts so early and also replaces the broadcast index scaling through bitcasts (which can't succeed in some cases) to instead just keep track of the bitoffset which can be converted back to the broadcast index later on.

Differential Revision: https://reviews.llvm.org/D58888

llvm-svn: 356043
2019-03-13 12:20:39 +00:00
Simon Pilgrim 360ce82db2 [DAG] Move integer setcc %x, %x folding into FoldSetCC
First step towards PR40800 - I intend to move the float case in a separate future patch.

I had to tweak the (overly reduced) thumb2 test and the x86 widening test change is annoying (no longer rematerializable) but we should address this separately.

Differential Revision: https://reviews.llvm.org/D59244

llvm-svn: 356040
2019-03-13 11:08:57 +00:00
Philip Reames 21a50ccf9c [ImplicitNullChecks] Support unordered atomic accesses
Update the INC pass to allow folding unordered atomics.  This is the first optimization unblocked by the changes landed from D57601.

llvm-svn: 356006
2019-03-13 03:25:20 +00:00
Philip Reames 80ccc88869 [Tests] Expand implicit null check coverage
llvm-svn: 356004
2019-03-13 03:17:58 +00:00
Craig Topper 9bae5ba076 [X86] Add ImmArg markings to intrinsics.
Remove test cases that checked for not crashing when immediate operands were passed not an immediate. These are now considered ill-formed in IR.

This was done by manually scanning the intrinsic file for llvm_i32_ty and llvm_i8_ty which are the predominant types we use for immediates. Most of them are on vector intrinsics. I might have missed some other intrinsics.

Differential Revision: https://reviews.llvm.org/D58302

llvm-svn: 355993
2019-03-12 23:48:07 +00:00
Philip Reames b760558517 [Test] Add tests for implicit null checks on atomic/volatile instructions
llvm-svn: 355983
2019-03-12 21:09:58 +00:00
Philip Reames 9134f84ba4 For faulting ops, include a comment w/the fault destination
A faulting_op is one that has specified behavior when a fault occurs, generally redirecting control flow to another location.  This change just adds a comment to the assembly output which makes it both human readable, and machine checkable w/o having to parse the FaultMap section.  This is used to split a test file into two parts, so that I can (in a near future commit) easily extend the test file to demonstrate another case.

llvm-svn: 355982
2019-03-12 21:05:31 +00:00
Sanjay Patel 737c27a9cd [x86] scalarize extractelement 0 of FP vselect
llvm-svn: 355955
2019-03-12 19:20:45 +00:00
Nikita Popov 149bc099f6 [SDAG] Expand pow2 mulo using shifts
Expand MULO with constant power of two operand into a shift. The
overflow is checked with (x << shift) >> shift == x, where the right
shift will be logical for umulo and arithmetic for smulo (with
exception for multiplications by signed_min).

Differential Revision: https://reviews.llvm.org/D59041

llvm-svn: 355937
2019-03-12 16:57:25 +00:00
Jonas Paulsson 8b8dc50e79 [RegAlloc] Avoid compile time regression with multiple copy hints.
As a fix for https://bugs.llvm.org/show_bug.cgi?id=40986 ("excessive compile
time building opencollada"), this patch makes sure that no phys reg is hinted
more than once from getRegAllocationHints().

This handles the case were many virtual registers are assigned to the same
physreg. The previous compile time fix (r343686) in weightCalcHelper() only
made sure that physical/virtual registers are passed no more than once to
addRegAllocationHint().

Review: Dimitry Andric, Quentin Colombet
https://reviews.llvm.org/D59201

llvm-svn: 355854
2019-03-11 19:00:37 +00:00
Simon Pilgrim 06ae025345 [X86] Extend widening comparison test.
Ensure we test both v2i16 unary and binary comparisons.

llvm-svn: 355849
2019-03-11 18:08:20 +00:00
Craig Topper 00afa193f1 [X86] Enable sse2_cvtsd2ss intrinsic to use an EVEX encoded instruction.
llvm-svn: 355810
2019-03-11 06:01:04 +00:00
Amaury Sechet a5820cbd20 Add test case for add to sub post legalization. NFC
llvm-svn: 355797
2019-03-11 01:25:48 +00:00
Sanjay Patel 26e06e859e [x86] add x86-specific opcodes to extractelement scalarization list
llvm-svn: 355792
2019-03-10 18:56:21 +00:00
Craig Topper 93e15dfacc [X86] Make lowering of intrinsics with rounding mode stricter so that only valid rounding modes are lowered. Update tests accordingly
Many of our tests were not using valid rounding mode immediates. Clang verifies this in the frontend when it creates the intrinsics from builtins, but the backend would still lower invalid immediates.

With this change we will now leave them as intrinsics if the immediate is invalid. This will cause an isel selection failure.

llvm-svn: 355789
2019-03-10 17:20:45 +00:00
Sanjay Patel 40bcc3de7d [x86] add tests for extract of FP select; NFC
llvm-svn: 355768
2019-03-09 02:11:05 +00:00