Commit Graph

Alex Bradbury 38c4ec31cb [RISCV] Add tests to demonstrate bitcasted fneg/fabs dagcombines
This target-independent code won't trigger for cases such as RV32FD where
custom SelectionDAG nodes are generated. These new tests demonstrate such
cases. Additionally, float-arith.ll was updated so that fneg.s, fsgnjn.s, and
fabs.s selection patterns are actually exercised.
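
A minimal sketch (not taken from the new tests) of the kind of
bitcasted fneg the generic combine looks for:

  ; fneg expressed as a sign-bit flip through an integer bitcast
  define double @fneg_via_int(double %a) {
    %bc  = bitcast double %a to i64
    %neg = xor i64 %bc, -9223372036854775808
    %res = bitcast i64 %neg to double
    ret double %res
  }

On RV32FD the f64 bitcasts are custom-lowered, so the generic combine
never sees this pattern.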

llvm-svn: 352199
2019-01-25 14:33:08 +00:00
Simon Pilgrim d41ccddda9 [X86] Add addcarry/subborrow combine tests
Show failure to simplify cases with zero op/flags

llvm-svn: 352196
2019-01-25 12:26:27 +00:00
Diana Picus 8976ad12a9 [ARM GlobalISel] Support shifts for Thumb2
Same as ARM.

On this occasion we split some of the instruction select tests for more
complicated instructions into their own files, so we can reuse them for
ARM and Thumb mode. Likewise for the legalizer tests.

llvm-svn: 352188
2019-01-25 10:48:42 +00:00
Anton Korobeynikov 509d5c4a7d [MSP430] Fix absolute addressing mode printing in AsmPrinter
Align checks for absolute addressing mode with its current
implementation (SR is used as a base register).

This fixes https://bugs.llvm.org/show_bug.cgi?id=39993

Patch by Kristina Bessonova!

Differential Revision: https://reviews.llvm.org/D56785

llvm-svn: 352178
2019-01-25 09:14:05 +00:00
Zi Xuan Wu 308a609c6e [PowerPC] Enhance the fast selection of cmp instruction and clean up related asserts
Fast selection of LLVM icmp and fcmp instructions did not handle VSX
instruction support well.

We should use a VSX float comparison instruction instead of a non-VSX
float comparison instruction if the operand register class is VSSRC or
VSFRC, because f32 and f64 are mapped to VSSRC and VSFRC respectively
when the VSX feature is enabled.

If the target has no corresponding VSX comparison instruction for some
type, just copy the VSX-assigned register to the common float register
class and use a non-VSX comparison instruction.
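
For illustration (a hypothetical case, not taken from the patch's
tests), a simple fcmp that fast selection must handle at -O0:

  define i1 @cmp_f64(double %a, double %b) {
    ; With VSX enabled, %a and %b live in VSFRC and the VSX comparison
    ; is used; for types without a VSX comparison, the operands are
    ; first copied to the common float register class.
    %c = fcmp olt double %a, %b
    ret i1 %c
  }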

Differential Revision: https://reviews.llvm.org/D57078

llvm-svn: 352174
2019-01-25 07:24:59 +00:00
Alex Bradbury 456d3798d6 [RISCV] Custom-legalise i32 SDIV/UDIV/UREM on RV64M
Follow the same custom legalisation strategy as used in D57085 for
variable-length shifts (see that patch summary for more discussion). Although
we may lose out on some late-stage DAG combines, I think this custom
legalisation strategy is ultimately easier to reason about.

There are some codegen changes in rv64m-exhaustive-w-insts.ll but they are all
neutral in terms of the number of instructions.

Differential Revision: https://reviews.llvm.org/D57096

llvm-svn: 352171
2019-01-25 05:11:34 +00:00
Alex Bradbury 299d690a50 [RISCV] Custom-legalise 32-bit variable shifts on RV64
The previous DAG combiner-based approach had an issue with infinite loops
between the target-dependent and target-independent combiner logic (see
PR40333). Although this was worked around in rL351806, the combiner-based
approach is still potentially brittle and can fail to select the 32-bit shift
variant when profitable to do so, as demonstrated in the pr40333.ll test case.

This patch instead introduces target-specific SelectionDAG nodes for
SLLW/SRLW/SRAW and custom-lowers variable i32 shifts to them. pr40333.ll is a
good example of how this approach can improve codegen.
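
A small example (hypothetical, in the spirit of pr40333.ll) where the
custom lowering should pick the 32-bit shift variant on RV64:

  define signext i32 @sll(i32 signext %a, i32 signext %b) {
    ; Custom-lowered to the new SLLW node and selected as sllw, rather
    ; than a 64-bit sll plus sign extension of the result.
    %r = shl i32 %a, %b
    ret i32 %r
  }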

This adds a DAG combine that performs SimplifyDemandedBits on the operands
(only the lower 32 bits of the first operand and the lower 5 bits of the
second operand are read). This seems better than implementing
SimplifyDemandedBitsForTargetNode, as there is no guarantee that it would be
called (it isn't for, e.g., the anyext return test cases). This also
implements ComputeNumSignBitsForTargetNode.

There are codegen changes in atomic-rmw.ll and atomic-cmpxchg.ll but the new
instruction sequences are semantically equivalent.

Differential Revision: https://reviews.llvm.org/D57085

llvm-svn: 352169
2019-01-25 05:04:00 +00:00
Matt Arsenault 3e08b772b3 AMDGPU/GlobalISel: Scalarize add/sub
llvm-svn: 352167
2019-01-25 04:53:57 +00:00
Matt Arsenault e6cebd0d69 GlobalISel: fewerElementsVector for more cast types
llvm-svn: 352166
2019-01-25 04:37:33 +00:00
Matt Arsenault 95fd95cfe0 GlobalISel: fewerElementsVector for a few more trivial ops
llvm-svn: 352165
2019-01-25 04:03:38 +00:00
Matt Arsenault 5d622fbcc1 AMDGPU/GlobalISel: Legalize smulh/umulh and scalarize mul
llvm-svn: 352162
2019-01-25 03:23:04 +00:00
Matt Arsenault 1b1e685f10 GlobalISel: Support fewerElementsVector for icmp/fcmp
Also legalize 64-bit compares for AMDGPU

llvm-svn: 352157
2019-01-25 02:59:34 +00:00
Matt Arsenault ca676343a9 GlobalISel: Implement fewerElementsVector for extensions
llvm-svn: 352155
2019-01-25 02:36:32 +00:00
Nemanja Ivanovic b9b75de0ae [PowerPC] Exploit store instructions that store a single vector element
This patch exploits the instructions that store a single element from a vector
to perform a (store (extract_elt)). We already have code that does this with
ISA 3.0 instructions that were added to handle i8/i16 types. However, we had
never exploited the existing ones that handle f32/f64/i32/i64 types.
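
A minimal sketch of the pattern being exploited (an assumed shape, not
one of the patch's own tests):

  define void @store_elt(<4 x float> %v, float* %p) {
    ; (store (extract_elt)) - can now use a single-element vector
    ; store instead of first moving the lane to a scalar register.
    %e = extractelement <4 x float> %v, i32 2
    store float %e, float* %p
    ret void
  }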

Differential revision: https://reviews.llvm.org/D56175

llvm-svn: 352131
2019-01-24 23:44:28 +00:00
Aditya Nandakumar 3ba0d94bce [GISel]: Change how CSE is enabled by default for each pass
https://reviews.llvm.org/D57178

This adds a hook in TargetPassConfig to query whether CSE should be
enabled. By default the hook returns false only at the O0 opt level, but
targets can override this.
As a consequence of CSE defaulting to enabled above O0, a few tests
needed their run lines updated to opt out of CSE (by passing in -O0).
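
A hypothetical RUN line of the kind those tests now use to opt out of
CSE (the exact flags here are assumed, not copied from an actual test):

  ; RUN: llc -mtriple=aarch64-- -global-isel -O0 -stop-after=irtranslator %s -o - | FileCheck %s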

reviewed by: arsenm

llvm-svn: 352126
2019-01-24 23:11:25 +00:00
Matt Arsenault baa5d2e69c RegBankSelect: Support some more complex part mappings
llvm-svn: 352123
2019-01-24 22:47:04 +00:00
Jessica Paquette 245047dfe8 [GlobalISel][AArch64] Add isel support for FP16 vector @llvm.ceil
This patch adds support for vector @llvm.ceil intrinsics when full 16-bit
floating point support isn't available.

To do this, this patch...

- Implements basic isel for G_UNMERGE_VALUES
- Teaches the legalizer about 16-bit floats
- Teaches AArch64RegisterBankInfo to respect floating point registers on
  G_BUILD_VECTOR and G_UNMERGE_VALUES
- Teaches selectCopy about 16-bit floating point vectors

It also adds

- A legalizer test for the 16-bit vector ceil which verifies that we create a
  G_UNMERGE_VALUES and G_BUILD_VECTOR when full fp16 isn't supported
- An instruction selection test which makes sure we lower to G_FCEIL when
  full fp16 is supported
- A test for selecting G_UNMERGE_VALUES

And also updates arm64-vfloatintrinsics.ll to show that the new ceiling types
work as expected.
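
A minimal sketch of the new ceiling types (hypothetical, in the spirit
of arm64-vfloatintrinsics.ll):

  declare <8 x half> @llvm.ceil.v8f16(<8 x half>)

  define <8 x half> @ceil_v8f16(<8 x half> %a) {
    ; Without full fp16, legalized via G_UNMERGE_VALUES/G_BUILD_VECTOR;
    ; with full fp16, selected directly from G_FCEIL.
    %r = call <8 x half> @llvm.ceil.v8f16(<8 x half> %a)
    ret <8 x half> %r
  }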

https://reviews.llvm.org/D56682

llvm-svn: 352113
2019-01-24 22:00:41 +00:00
Simon Pilgrim c12a634326 [X86] Regenerate SBB test to fix buildbots.
Some local WIP code unexpectedly managed to get in the way.

llvm-svn: 352081
2019-01-24 18:57:48 +00:00
James Y Knight 2c36240a82 Fix emission of _fltused for MSVC.
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.

The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.

This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)

The latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float and do nothing with it!), and is also consistent with
MSVC.
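
A minimal sketch of a module that must now emit _fltused on MSVC
targets (hypothetical, not one of the patch's tests): any
floating-point operation suffices, with no printf/scanf involved.

  define float @add(float %a, float %b) {
    %r = fadd float %a, %b
    ret float %r
  }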

Differential Revision: https://reviews.llvm.org/D56548

llvm-svn: 352076
2019-01-24 18:34:00 +00:00
Simon Pilgrim f4a1b54097 [X86] Add PR25858 test cases
llvm-svn: 352075
2019-01-24 18:30:45 +00:00
Sanjay Patel 55787a7e77 [x86] add tests for unpack shuffle lowering; NFC
https://bugs.llvm.org/show_bug.cgi?id=40434

llvm-svn: 352048
2019-01-24 14:12:34 +00:00
Petar Avramovic 79df859685 [MIPS GlobalISel] Select zero extending and sign extending load
Select zero extending and sign extending loads for MIPS32.
Use the size from the MachineMemOperand to determine the number of bytes to load.
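
A small example (hypothetical) of IR that should now select to an
extending load:

  define i32 @load_sext_i8(i8* %p) {
    ; A 1-byte MachineMemOperand plus sign extension selects to lb on
    ; MIPS32.
    %b = load i8, i8* %p
    %e = sext i8 %b to i32
    ret i32 %e
  }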

Differential Revision: https://reviews.llvm.org/D57099

llvm-svn: 352038
2019-01-24 10:27:21 +00:00
Petar Avramovic b5a939d246 [MIPS GlobalISel] Combine extending loads
Use CombinerHelper to combine extending load instructions.
G_LOAD combined with G_ZEXT, G_SEXT or G_ANYEXT gives G_ZEXTLOAD,
G_SEXTLOAD or G_LOAD respectively, with the same type as the def of the
extending instruction.
Similarly, G_ZEXTLOAD combined with G_ZEXT gives G_ZEXTLOAD, and
G_SEXTLOAD combined with G_SEXT gives G_SEXTLOAD, with the same type
as the def of the extending instruction.

Differential Revision: https://reviews.llvm.org/D56914

llvm-svn: 352037
2019-01-24 10:09:52 +00:00
Jonas Paulsson 5916dea338 [SystemZ] Remember to reset the NoPHIs property on MF in createPHIsForSelects()
After creating new PHI instructions during isel pseudo expansion, the NoPHIs
property of MF should be reset in case it was previously set.

Review: Ulrich Weigand
llvm-svn: 352030
2019-01-24 07:54:41 +00:00
Craig Topper e79b779fbb [X86] Add test cases for opportunities to fold a truncate and a masked store into a truncating masked store.
llvm-svn: 352027
2019-01-24 06:15:03 +00:00
Reid Kleckner f9ebacfd29 Revert r351938 "[ARM] Alter the register allocation order for minsize on Thumb2"
This change caused fatal backend errors when compiling a file in libvpx
for Android.

llvm-svn: 351979
2019-01-23 21:10:48 +00:00
Craig Topper aa0e74c1fc [X86] Autogenerate complete checks. NFC
llvm-svn: 351970
2019-01-23 18:25:49 +00:00
Andrea Di Biagio d768d35515 [MC][X86] Correctly model additional operand latency caused by transfer delays from the integer to the floating point unit.
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means this patch is a functional change for
the Jaguar cpu model only.

Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by the expression f+1, where f is defined as a forwarding
delay from the integer unit to the fpu.
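
As a worked example of the numbers above: 1cy of opcode latency plus
the 6cy int-to-fpu forwarding delay gives a 7cy total, matching the
SOG's f+1 formula with f = 6.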

When printing instruction latency from MCA (see InstructionInfoView.cpp) and llc
(only when the flag -print-schedule is specified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.

Differential Revision: https://reviews.llvm.org/D57056

llvm-svn: 351965
2019-01-23 16:35:07 +00:00
Tim Renouf f64f8efe13 [AMDGPU] With XNACK, cannot clause a load with result coalesced with operand
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.

The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.

Differential Revision: https://reviews.llvm.org/D57008

Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
2019-01-23 13:38:06 +00:00
Jonas Paulsson 6046d087c5 [SystemZ] Fix test case for buildbot.
llvm-clang-x86_64-expensive-checks-win triggered this assert:

"llvm.dbg.value intrinsic requires a !dbg attachment"

Hopefully, adding reasonable !dbg operands solves this.

llvm-svn: 351939
2019-01-23 10:29:12 +00:00
David Green 6a858a9425 [ARM] Alter the register allocation order for minsize on Thumb2
Currently in Arm code, we allocate LR first, under the assumption that
it needs to be saved anyway. Unfortunately this has the disadvantage
that it will require any instructions using it to be the longer thumb2
instructions, not the shorter thumb1 ones.

This switches the order when we are optimising for minsize, returning to
the default order so that more lower registers can be used. It can end
up requiring more pushed registers, but on average produces smaller code.

Differential Revision: https://reviews.llvm.org/D56008

llvm-svn: 351938
2019-01-23 10:18:30 +00:00
Sam Parker 31bef63bb4 [ARM][CGP] Check trunc type before replacing
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This was okay when we only
allowed one type to be optimised, but now it may be the case that the
trunc is needed to produce a narrower type than the one we were
optimising for. So we need to check this before doing the replacement.
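
A hypothetical shape of the problematic pattern:

  define i32 @narrow(i16 zeroext %a) {
    %add = add i16 %a, 1
    ; The trunc narrows to i8, not the type being optimised (i16), so
    ; the zext must not simply be rewritten to use %add directly.
    %t = trunc i16 %add to i8
    %z = zext i8 %t to i32
    ret i32 %z
  }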

Differential Revision: https://reviews.llvm.org/D57041

llvm-svn: 351935
2019-01-23 09:18:44 +00:00
Sam Parker 9a2a89d58f [DAGCombine] Enable more pre-indexed stores
The current check in CombineToPreIndexedLoadStore is too
conservative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.
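
A sketch (hypothetical) of a case the old check rejected: the stored
value derives from the base pointer, but the store's own pointer
operand does not depend on the stored value, so pre-indexing is fine.

  define i64* @store_base(i64* %p) {
    %addr = getelementptr inbounds i64, i64* %p, i64 1
    %v = ptrtoint i64* %p to i64
    store i64 %v, i64* %addr
    ret i64* %addr
  }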

Differential Revision: https://reviews.llvm.org/D56719

llvm-svn: 351933
2019-01-23 09:11:49 +00:00
Kristof Beyls e70ba0a1bc [SLH][AArch64] Remove accidentally retained -debug-only line from test.
llvm-svn: 351932
2019-01-23 09:10:12 +00:00
Kristof Beyls 3ff5dfd735 [SLH] AArch64: correctly pick temporary register to mask SP
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:

  mov x17, sp
  and x17, x17, x16
  mov sp, x17

However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).
is an indirect call, using x17 as the target address (blr x17).

To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.

In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.

Differential Revision: https://reviews.llvm.org/D56717

llvm-svn: 351930
2019-01-23 08:18:39 +00:00
Jonas Paulsson 961c47ec98 [SystemZ] Handle DBG_VALUE instructions in two places in backend.
Two backend optimizations failed to handle cases when compiled with -g, due
to failing to consider DBG_VALUE instructions. This was in
SystemZTargetLowering::emitSelect() and
SystemZElimCompare::getRegReferences().

This patch makes sure that DBG_VALUEs are recognized so that they do not
affect these optimizations.

Tests for branch-on-count, load-and-trap and consecutive selects.

Review: Ulrich Weigand
https://reviews.llvm.org/D57048

llvm-svn: 351928
2019-01-23 07:42:26 +00:00
Brendon Cahoon 59d9973146 [Pipeliner] Add two pragmas to control software pipelining optimization
#pragma clang loop pipeline(disable)

    Disable SWP optimization for the next loop.
    "disable" is the only possible value.

#pragma clang loop pipeline_initiation_interval(number)

    Set the value of the initiation interval for the SWP
    optimization of the next loop to the specified number.
    The number must be a positive value greater than 0.

These pragmas can be used for debugging or for reducing compile time.
It is possible to disable SWP for specific loops to save compilation
time or to find bugs by not pipelining certain loops, and to set the
initiation interval to a specific value to save compilation time by
avoiding extra pipeliner passes or to check the created schedule for a
specific initiation interval.
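
At the IR level the pragmas arrive as loop metadata, roughly like this
(the metadata names are assumed from this patch, not copied from its
tests):

  define void @f(i32* %a, i32 %n) {
  entry:
    br label %loop
  loop:
    %i = phi i32 [ 0, %entry ], [ %inc, %loop ]
    %p = getelementptr i32, i32* %a, i32 %i
    store i32 %i, i32* %p
    %inc = add i32 %i, 1
    %cmp = icmp slt i32 %inc, %n
    br i1 %cmp, label %loop, label %exit, !llvm.loop !0
  exit:
    ret void
  }
  !0 = distinct !{!0, !1}
  !1 = !{!"llvm.loop.pipeline.initiationinterval", i32 10}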

This is the LLVM part of the fix.

Clang part of fix: https://reviews.llvm.org/D55710

Patch by Alexey Lapshin!

Differential Revision: https://reviews.llvm.org/D56403

llvm-svn: 351923
2019-01-23 03:26:10 +00:00
Peter Collingbourne 73078ecd38 hwasan: Move memory access checks into small outlined functions on aarch64.
Each hwasan check requires emitting a small piece of code like this:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#memory-accesses

The problem with this is that these code blocks typically bloat code
size significantly.

An obvious solution is to outline these blocks of code. In fact, this
has already been implemented under the -hwasan-instrument-with-calls
flag. However, as currently implemented this has a number of problems:
- The functions use the same calling convention as regular C functions.
  This means that the backend must spill all temporary registers as
  required by the platform's C calling convention, even though the
  check only needs two registers on the hot path.
- The functions take the address to be checked in a fixed register,
  which increases register pressure.
Both of these factors can diminish the code size effect and increase
the performance hit of -hwasan-instrument-with-calls.

The solution that this patch implements is to involve the aarch64
backend in outlining the checks. An intrinsic and pseudo-instruction
are created to represent a hwasan check. The pseudo-instruction
is register allocated like any other instruction, and we allow the
register allocator to select almost any register for the address to
check. A particular combination of (register selection, type of check)
triggers the creation in the backend of a function to handle the check
for specifically that pair. The resulting functions are deduplicated by
the linker. The pseudo-instruction (really the function) is specified
to preserve all registers except for the registers that the AAPCS
specifies may be clobbered by a call.
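
The intrinsic looks roughly like this (its shape is assumed from the
description above, not copied from the patch): it takes the shadow
base, the address to check, and an immediate encoding the kind of
check (argument order assumed), and instrumentation emits one call per
memory access for the backend to turn into the check pseudo.

  declare void @llvm.hwasan.check.memaccess(i8*, i8*, i32)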

To measure the code size and performance effect of this change, I
took a number of measurements using Chromium for Android on aarch64,
comparing a browser with inlined checks (the baseline) against a
browser with outlined checks.

Code size: Size of .text decreases from 243897420 to 171619972 bytes,
or a 30% decrease.

Performance: Using Chromium's blink_perf.layout microbenchmarks I
measured a median performance regression of 6.24%.

The fact that a perf/size tradeoff is evident here suggests that
we might want to make the new behaviour conditional on -Os/-Oz.
But for now I've enabled it unconditionally, my reasoning being that
hwasan users typically expect a relatively large perf hit, and ~6%
isn't really adding much. We may want to revisit this decision in
the future, though.

I also tried experimenting with varying the number of registers
selectable by the hwasan check pseudo-instruction (which would result
in fewer variants being created), on the hypothesis that creating
fewer variants of the function would expose another perf/size tradeoff
by reducing icache pressure from the check functions at the cost of
register pressure. Although I did observe a code size increase with
fewer registers, I did not observe a strong correlation between the
number of registers and the performance of the resulting browser on the
microbenchmarks, so I conclude that we might as well use ~all registers
to get the maximum code size improvement. My results are below:

Regs | .text size | Perf hit
-----+------------+---------
~all | 171619972  | 6.24%
  16 | 171765192  | 7.03%
   8 | 172917788  | 5.82%
   4 | 177054016  | 6.89%

Differential Revision: https://reviews.llvm.org/D56954

llvm-svn: 351920
2019-01-23 02:20:10 +00:00
Matt Arsenault 4c5e8f51e7 AMDGPU/GlobalISel: Start selectively legalizing 16-bit operations
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this required more boilerplate and disabled the coverage
assertions.

llvm-svn: 351886
2019-01-22 22:00:19 +00:00
Matt Arsenault 736cfa9ffb AMDGPU/GlobalISel: Handle legality/regbanks for 32/64-bit shifts
llvm-svn: 351884
2019-01-22 21:51:38 +00:00
Matt Arsenault 30989e492b GlobalISel: Allow shift amount to be a different type
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.

X86 uses i8, but seemed to be hacking around this before.

llvm-svn: 351882
2019-01-22 21:42:11 +00:00
Matt Arsenault 6378629609 GlobalISel: Implement widen for extract_vector_elt elt type
llvm-svn: 351871
2019-01-22 20:38:15 +00:00
Matt Arsenault aebb2ee036 GlobalISel: Implement fewerElementsVector for basic FP ops
llvm-svn: 351866
2019-01-22 20:14:29 +00:00
Matt Arsenault 6614f852b6 GlobalISel: Support narrowing zextload/sextload
llvm-svn: 351856
2019-01-22 19:02:10 +00:00
Matt Arsenault a7cd83bc88 GlobalISel: Disallow vectors for G_CONSTANT/G_FCONSTANT
llvm-svn: 351853
2019-01-22 18:53:41 +00:00
Matt Arsenault a5840c3c39 Codegen support for atomicrmw fadd/fsub
llvm-svn: 351851
2019-01-22 18:36:06 +00:00
Sanjay Patel 58256802ce [x86] add partial undef 'and' test; NFC
llvm-svn: 351840
2019-01-22 17:01:06 +00:00
Sanjay Patel 6019e6f866 [x86] add another partial undef vector binop test; NFC
The existing test unintentionally shows that we have prematurely
optimized the shuffle into a vector concat and lost the undef info, 
so it is not affected by a basic improvement to 
SimplifyDemandedVectorElts.

llvm-svn: 351834
2019-01-22 16:26:09 +00:00
Sanjay Patel effee52c59 [DAGCombiner] narrow vector binop with 2 insert subvector operands
vecbo (insertsubv undef, X, Z), (insertsubv undef, Y, Z) --> insertsubv VecC, (vecbo X, Y), Z

This is another step in generic vector narrowing. It's also a step towards more horizontal op 
formation specifically for x86 (although we still failed to match those in the affected tests).

The scalarization cases are also not optimal (we should be scalarizing those), but it's still 
an improvement to use a narrower vector op when we know part of the result must be constant 
because both inputs are undef in some vector lanes.
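
An IR-level sketch of the pattern (hypothetical; the insertsubv nodes
arise in the DAG when widening shuffles like these are built):

  define <4 x double> @narrow_binop(<2 x double> %x, <2 x double> %y) {
    %wx = shufflevector <2 x double> %x, <2 x double> undef,
                        <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
    %wy = shufflevector <2 x double> %y, <2 x double> undef,
                        <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
    ; The fadd can be performed as <2 x double> and re-inserted.
    %r = fadd <4 x double> %wx, %wy
    ret <4 x double> %r
  }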

I think a similar match but checking for a constant operand might help some of the cases in 
D51553.

Differential Revision: https://reviews.llvm.org/D56875

llvm-svn: 351825
2019-01-22 14:24:13 +00:00
Simon Pilgrim 933673d878 [X86][SSE] Canonicalize OR(AND(X,C),AND(Y,~C)) -> OR(AND(X,C),ANDNP(C,Y))
For constant bit select patterns, replace one AND with an ANDNP, allowing us to
reuse the constant mask. Only do this if the mask has multiple uses (to avoid
losing load folding) or if we have XOP, as its VPCMOV can handle most folding
commutations.
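
An illustrative bit select (hypothetical constants), where the second
mask is the complement of the first:

  define <2 x i64> @bitselect(<2 x i64> %x, <2 x i64> %y) {
    %a = and <2 x i64> %x, <i64 255, i64 255>
    ; Rewriting this AND as ANDNP(C, %y) lets both sides share the
    ; constant mask C.
    %b = and <2 x i64> %y, <i64 -256, i64 -256>
    %o = or <2 x i64> %a, %b
    ret <2 x i64> %o
  }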

This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.

Differential Revision: https://reviews.llvm.org/D55935

llvm-svn: 351819
2019-01-22 13:44:49 +00:00