Was there really no other way to splat a byte in SSE2?
punpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
pshuflw {{.*#+}} xmm0 = xmm0[0,0,0,0,4,5,6,7]
pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
llvm-svn: 265172
Summary:
Implement BUFFER_ATOMIC_CMPSWAP{,_X2} instructions on all GCN targets, and FLAT_ATOMIC_CMPSWAP{,_X2} on CI+.
32-bit instruction variants tested manually on Kabini and Bonaire. Tests and parts of code provided by Jan Veselý.
Patch by: Vedran Miletić
Reviewers: arsenm, tstellarAMD, nhaehnle
Subscribers: jvesely, scchan, kanarayan, arsenm
Differential Revision: http://reviews.llvm.org/D17280
llvm-svn: 265170
Note however that this is identical to the existing SSE2 run.
What we really want is yet another run for an SSE2 machine that
also has fast unaligned 16-byte accesses.
llvm-svn: 265167
Follow-up to http://reviews.llvm.org/D18566 and http://reviews.llvm.org/D18676 -
where we noticed that an intermediate splat was being generated for memsets of
non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The 16-byte test that was added in D18566 is now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
Note that the SSE1 path is not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
llvm-svn: 265161
A catchswitch cannot be preceded by another instruction in the same
basic block (other than a PHI node).
Instead, insert the extract element right after the materialization of
the vectorized value. This isn't optimal but is a reasonable compromise
given the constraints of WinEH.
This fixes PR27163.
llvm-svn: 265157
Follow-up to D18566 - where we noticed that an intermediate splat was being
generated for memsets of non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The tests that were added in the last patch are now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
In the new tests, the splat via shuffling looks ok to me, but there might be some
room for improvement depending on uarch there.
Note that the SSE1/2 paths are not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
Differential Revision: http://reviews.llvm.org/D18676
llvm-svn: 265148
Summary: The assembler was picking the wrong JR variant because the pre-R6 one was still enabled at R6.
Author: nitesh.jain
Reviewers: vkalintiris, dsanders
Subscribers: dsanders, llvm-commits, mohit.bhakkad, sagar, bhushan, jaydeep
Differential: D18387
llvm-svn: 265134
Some ARM instructions encode 32-bit immediates as a 8-bit integer (0-255)
and a 4-bit rotation (0-30, even) in its least significant 12 bits. The
original fixup, FK_Data_4, patches the instruction by the value bit-to-bit,
regardless of the encoding. For example, assuming the label L1 and L2 are
0x0 and 0x104 respectively, the following instruction:
add r0, r0, #(L2 - L1) ; expects 0x104, i.e., 260
would be assembled to the following, which adds 1 to r0, instead of 260:
e2800104 add r0, r0, #4, 2 ; equivalently 1
The new fixup kind fixup_arm_mod_imm takes care of the encoding:
e2800f41 add r0, r0, #260
Patch by Ting-Yuan Huang!
llvm-svn: 265122
When a fixup that can be resolved by the assembler is out of range, we should
report an error in the source, rather than crashing.
Differential Revision: http://reviews.llvm.org/D18402
llvm-svn: 265120
Expose LLVMCreateTargetMachineData as data_layout.
As r263530 did for go. From that commit: "LLVMGetTargetDataLayout was
removed from the C API, and then TargetMachine.TargetData was removed.
Later, LLVMCreateTargetMachineData was added to the C API"
Differential Revision: http://reviews.llvm.org/D18677
llvm-svn: 265115
This is intended to be used for ThinLTO incremental build.
Differential Revision: http://reviews.llvm.org/D18213
This is a recommit of r265095 after fixing the Windows issues.
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 265111
This changes some dllexport tests, to verify that some symbols that
should not be exported are not, in a way that improves the robustness
of CHECK-SAME interaction with CHECK-NOT.
We plan to enable dllimport/dllexport support for the PS4, and these
changes are for points we noticed in our internal testing.
Patch by Warren Ristow!
llvm-svn: 265106
This reverts commit r265096, r265095, and r265094.
Windows build is broken, and the validation does not pass.
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 265102
They're not necessary (since the stack pointer is trivially restored on
return), and the way LLVM inserts the stackrestore calls breaks the
IR (we get a stackrestore between the deoptimize call and the return).
llvm-svn: 265101
They're not necessary (since the lifetime of the alloca is trivially
over due to the return), and the way LLVM inserts the lifetime.end
markers breaks the IR (we get a lifetime end marker between the
deoptimize call and the return).
llvm-svn: 265100
Re-enable an assertion enabled by Justin Lebar in rL265092. rL265092
was breaking test/CodeGen/X86/deopt-intrinsic.ll because webkit_jscc
does not like non-i64 return types. Change the test case to not do
that.
llvm-svn: 265099
Previously, HandleLastUse would delete RegRef information for sub-registers
if they were dead even if their corresponding super-register were still live.
If the super-register were later renamed, then the definitions of the
sub-register would not be updated appropriately. This patch alters the
behavior so that RegInfo information for sub-registers is only deleted when
the sub-register and super-register are both dead.
This resolves PR26775. This is the mirror image of Hal's r227311 commit.
Author: Tom Jablin (tjablin)
Reviewers: kbarton uweigand nemanjai hfinkel
http://reviews.llvm.org/D18448
llvm-svn: 265097
This is intended to be used for ThinLTO incremental build.
Differential Revision: http://reviews.llvm.org/D18213
From: Mehdi Amini <mehdi.amini@apple.com>
llvm-svn: 265095
Summary:
Previously the NVVMReflect pass would read its configuration from
command-line flags or a static configuration given to the pass at
instantiation time.
This doesn't quite work for clang's use-case. It needs to pass a value
for __CUDA_FTZ down on a per-module basis. We use a module flag for
this, so the NVVMReflect pass needs to be updated to read said flag.
Reviewers: tra, rnk
Subscribers: cfe-commits, jholewinski
Differential Revision: http://reviews.llvm.org/D18672
llvm-svn: 265090
when compiling with LTO.
r244523 a new class DiagnosticInfoOptimizationRemarkAnalysisAliasing for
optimization analysis remarks related to pointer aliasing without
guarding it in isDiagnosticEnabled in LLVMContext.cpp. This caused the
diagnostic message to be printed unconditionally when compiling with
LTO.
This commit cleans up isDiagnosticEnabled and makes sure all the
vectorization optimization remarks are guarded.
rdar://problem/25382153
llvm-svn: 265084
This mostly cosmetic patch moves the DebugEmissionKind enum from DIBuilder
into DICompileUnit. DIBuilder is not the right place for this enum to live
in — a metadata consumer should not have to include DIBuilder.h.
I also added a Verifier check that checks that the emission kind of a
DICompileUnit is actually legal.
http://reviews.llvm.org/D18612
<rdar://problem/25427165>
llvm-svn: 265077
Print aliases in topological order, that is, for any alias a = b,
b must be printed before a. This is because on some targets (e.g. PowerPC)
linker expects aliases in such an order to generate correct TOC information.
GCC also prints aliases in topological order.
llvm-svn: 265064
"blockaddress" can not apply to an external function. All
blockaddress constant uses must belong to the same module as the
definition of the target function.
llvm-svn: 265061
This patch simply mirrors the attributes we give to @llvm.nvvm.reflect
to the __nvvm_reflect libdevice call. This shaves about 30% of the code
in libdevice away because of CSE opportunities. It's also helps us
figure out that libdevice implementations of transcendental functions
don't have side-effects.
llvm-svn: 265060
Summary:
This change will allow loads with imp-def to be clustered in machine-scheduler pass.
areMemAccessesTriviallyDisjoint() can also handle loads with imp-def.
Reviewers: mcrosier, jmolloy, t.p.northover
Subscribers: aemerson, rengolin, mcrosier, llvm-commits
Differential Revision: http://reviews.llvm.org/D18665
llvm-svn: 265051
Chapter 3 of the QPX manual states that, "Scalar floating-point load
instructions, defined in the Power ISA, cause a replication of the source data
across all elements of the target register." Thus, if we have a load followed
by a QPX splat (from the first lane), the splat is redundant. This adds a late
MI-level pass to remove the redundant splats in some of these cases
(specifically when both occur in the same basic block).
This optimization is scheduled just prior to post-RA scheduling. It can't happen
before anything that might replace the load with some already-computed quantity
(i.e. store-to-load forwarding).
llvm-svn: 265047