This is a resubmittion of 263158 change after fixing the existing problem with intrinsics mangling (see LTO and intrinsics mangling llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
llvm-svn: 273892
AVX1 can only broadcast vectors as floats/doubles, so for 256-bit vectors we insert bitcasts if we are shuffling v8i32/v4i64 types. Unfortunately the presence of these bitcasts prevents the current broadcast lowering code from peeking through cases where we have concatenated / extracted vectors to create the 256-bit vectors.
This patch allows us to peek through bitcasts as long as the number of elements doesn't change (i.e. element bitwidth is the same) so the broadcast index is not affected.
Note this bitcast peek is different from the stage later on which doesn't care about the type and is just trying to find a load node.
Differential Revision: http://reviews.llvm.org/D21660
llvm-svn: 273848
Tail merge was making the assumption that a layout successor or
predecessor was always a cfg successor/predecessor. Remove that
assumption. Changes to tests are necessary because the errant cfg edges
were preventing optimizations.
llvm-svn: 273700
Memory references were not being propagated for this folded load. This
prevented optimizations like LICM from hoisting the load.
Added test to verify that this allows LICM to proceed.
llvm-svn: 273617
When considering whether to split an instruction with a memory operand
into an explicit load and a register-based instruction, we currently
check that the resulting instruction has exactly 1 def. This prevents 2
important LICM optimizations: compares with memory operands, and double
indirect calls. All the tests and the test-suite pass without the check.
My guess as to original intent is to limit the additional register pressure
created by the new instruction, but given that we only split out a single
register, it is already limited.
The licm-dominance test now checks actual memory loads for hoisting instead of
undef, and it tests compares.
hoist-invariant-load.ll now checks for 2 hoists, the intended hoist, and a bonus
from calling a got-relative function in a loop.
llvm-svn: 273616
X86FrameLowering::adjustForHiPEPrologue() contains a hard-coded offset
into an Erlang Runtime System-internal data structure (the PCB). As the
layout of this data structure is prone to change, this poses problems
for maintaining compatibility.
To address this problem, the compiler can produce this information as
module-level named metadata. For example (where P_NSP_LIMIT is the
offending offset):
!hipe.literals = !{ !2, !3, !4 }
!2 = !{ !"P_NSP_LIMIT", i32 152 }
!3 = !{ !"X86_LEAF_WORDS", i32 24 }
!4 = !{ !"AMD64_LEAF_WORDS", i32 24 }
Patch by Magnus Lang
Differential Revision: http://reviews.llvm.org/D20363
llvm-svn: 273593
When trying to convert a loading instruction into a FAULTING_LOAD, we
sometimes face code like this:
if %R10 is not null:
%R9<def> = MOV32ri Immediate
%R9<def, tied> = AND32rm %R9, 0x20(%R10)
else:
goto TRAP
In these cases we would like to use the AND32rm instruction as the
faulting operation by hoisting the "depedency" def-ing %R9 also above
the control flow, transforming the program into:
%R9<def> = MOV32ri Immediate
%R9<def, tied> = FAULTING_LOAD_OP(AND32rm %R9, 0x20(%R10), FailPath: TRAP)
This change teaches ImplicitNullChecks to do the above, when safe.
llvm-svn: 273501
We no longer have corresponding code in autoupgrade and the vast majority of the tests were fixed long time ago. Fix the remaining few. One of the verifier test cases is marked as XFAIL because it was passing only because the signature was incorrect.
llvm-svn: 273428
Summary:
Fix the computation of the offsets present in the scopetable when using the
SEH (__except_handler4).
This patch added an intrinsic to track the position of the allocation on the
stack of the EHGuard. This position is needed when producing the ScopeTable.
```
struct _EH4_SCOPETABLE {
DWORD GSCookieOffset;
DWORD GSCookieXOROffset;
DWORD EHCookieOffset;
DWORD EHCookieXOROffset;
_EH4_SCOPETABLE_RECORD ScopeRecord[1];
};
struct _EH4_SCOPETABLE_RECORD {
DWORD EnclosingLevel;
long (*FilterFunc)();
union {
void (*HandlerAddress)();
void (*FinallyFunc)();
};
};
```
The code to generate the EHCookie is added in `X86WinEHState.cpp`.
Which is adding these instructions when using SEH4.
```
Lfunc_begin0:
# BB#0: # %entry
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $28, %esp
movl %ebp, %eax <<-- Loading FramePtr
movl %esp, -36(%ebp)
movl $-2, -16(%ebp)
movl $L__ehtable$use_except_handler4_ssp, %ecx
xorl ___security_cookie, %ecx
movl %ecx, -20(%ebp)
xorl ___security_cookie, %eax <<-- XOR FramePtr and Cookie
movl %eax, -40(%ebp) <<-- Storing EHGuard
leal -28(%ebp), %eax
movl $__except_handler4, -24(%ebp)
movl %fs:0, %ecx
movl %ecx, -28(%ebp)
movl %eax, %fs:0
movl $0, -16(%ebp)
calll _may_throw_or_crash
LBB1_1: # %cont
movl -28(%ebp), %eax
movl %eax, %fs:0
addl $28, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
retl
```
And the corresponding offset is computed:
```
Luse_except_handler4_ssp$parent_frame_offset = -36
.p2align 2
L__ehtable$use_except_handler4_ssp:
.long -2 # GSCookieOffset
.long 0 # GSCookieXOROffset
.long -40 # EHCookieOffset <<----
.long 0 # EHCookieXOROffset
.long -2 # ToState
.long _catchall_filt # FilterFunction
.long LBB1_2 # ExceptionHandler
```
Clang is not yet producing function using SEH4, but it's a work in progress.
This patch is a step toward having a valid implementation of SEH4.
Unfortunately, it is not yet fully working. The EH registration block is not
allocated at the right offset on the stack.
Reviewers: rnk, majnemer
Subscribers: llvm-commits, chrisha
Differential Revision: http://reviews.llvm.org/D21231
llvm-svn: 273281
Summary:
canCombineSinCosLibcall() would previously combine sin+cos into sincos for
GNUX32/GNUEABI/GNUEABIHF regardless of whether UnsafeFPMath were set or not.
However, GNU would only combine them for UnsafeFPMath because sincos does not
set errno like sin and cos do. It seems likely that this is an oversight.
Reviewers: t.p.northover
Subscribers: t.p.northover, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D21431
llvm-svn: 273259
This reverts commit r273019.
From email I sent to list:
> I don't think this makes sense. Either the linker you're using supports
> this feature, or it doesn't. Having it enabled for llc if your linker
> doesn't support it is not fun.
>
> Further note that this also affects basically all other code using llvm
> libraries -- other than Clang, which explicitly sets it back to false by
> default, unless you set the ENABLE_X86_RELAX_RELOCATIONS cmake flag to
> true.
>
> If you want to enable the relax mode across all llvm tools in some
> circumstances, I think it should be via moving the cmake flag from clang
> down into llvm.
>
> I'm going to revert this commit, since I both think it intrinsically
> doesn't make sense to do this, and because it's breaking some of our
> tools.
llvm-svn: 273245
Fix for PR27726 - sitofp i64 to fp128 was loading the merged load i64 to a x87 register preventing legalization for conversion to fp128.
Added 32-bit tests for fp128 cast/conversions.
llvm-svn: 273210
We currently only allow exact matches of shuffle mask patterns during target shuffle combining.
This patch relaxes this to permit SM_SentinelUndef in the combined shuffle to always be accepted as well as allowing exact matching of the SM_SentinelZero value.
I've adjusted some tests that were requiring exact shuffle masks to now include undef values.
Differential Revision: http://reviews.llvm.org/D21495
llvm-svn: 273119
When calculating a square root using Newton-Raphson with two constants,
a naive implementation is to use five multiplications (four muls to calculate
reciprocal square root and another one to calculate the square root itself).
However, after some reassociation and CSE the same result can be obtained
with only four multiplications. Unfortunately, there's no reliable way to do
such a reassociation in the back-end. So, the patch modifies NR code itself
so that it directly builds optimal code for SQRT and doesn't rely on any
further reassociation.
Patch by Nikolai Bozhenov!
Differential Revision: http://reviews.llvm.org/D21127
llvm-svn: 272920
This allows us to emit native IR in Clang (next commit).
Also, update the intrinsic tests to show that codegen already knows how to handle
the IR that Clang will soon produce.
llvm-svn: 272806