This is the fourth patch to apply the BLIS matmul optimization pattern on matmul
kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
BLIS implements gemm as three nested loops around a macro-kernel, plus two
packing routines. The macro-kernel is implemented in terms of two additional
loops around a micro-kernel. The micro-kernel is a loop around a rank-1
(i.e., outer product) update. In this change we perform copying to created
arrays, which is the last step to implement the packing transformation.
Reviewed-by: Tobias Grosser <tobias@grosser.es>
Differential Revision: https://reviews.llvm.org/D23260
llvm-svn: 281441
This patch adds an entry for "-rtlib" in the output of `man clang` and `clang -help`.
Patch by Lei Zhang!
Differential Revision: https://reviews.llvm.org/D24069
llvm-svn: 281440
value is a pointer.
This patch is to fix PR30213. When expanding an expr based on ValueOffsetPair,
if the value is of pointer type, we can only create a getelementptr instead
of sub expr.
Differential Revision: https://reviews.llvm.org/D24088
llvm-svn: 281439
This change ensures all necessary symbols are resolved correctly. Before this
change on some systems, the linker may have eliminated some symbols not directly
used in bugpoint, but used in Polly.
Suggested-by: Michael Kruse <lvm@meinersbur.de>
llvm-svn: 281438
Previously, all input files were owned by the symbol table.
Files were created at various places, such as the Driver, the lazy
symbols, or the bitcode compiler, and the ownership of new files
was transferred to the symbol table using std::unique_ptr.
All input files were then free'd when the symbol table is freed
which is on program exit.
I think we don't have to transfer ownership just to free all
instance at once on exit.
In this patch, all instances are automatically collected to a
vector and freed on exit. In this way, we no longer have to
use std::unique_ptr.
Differential Revision: https://reviews.llvm.org/D24493
llvm-svn: 281425
Summary:
We were packing global device memory handles in
`PackedKernelArgumentArray`, but as I was implementing the CUDA
platform, I realized that CUDA wants the address of the handle, not the
handle itself. So this patch switches to packing the address of the
handle.
Reviewers: jlebar
Subscribers: jprice, jlebar, parallel_libs-commits
Differential Revision: https://reviews.llvm.org/D24528
llvm-svn: 281424
Summary:
Platforms were returning Device pointers, but a Device is now basically
just a pointer to an underlying PlatformDevice, so we will now just pass
it around as a value.
Reviewers: jlebar
Subscribers: jprice, jlebar, parallel_libs-commits
Differential Revision: https://reviews.llvm.org/D24537
llvm-svn: 281422
VS 2015 and higher begin making use of c++14 in their standard
library headers. As such, -std=c++11 makes it so you can't compile
trivial programs. Bump this to -std=c++14 when this situation is
detected.
llvm-svn: 281420
ObjC library call with call return.
ARC contraction tries to replace uses of an argument passed to an
objective-c library call with the call return value. For example, in the
following IR, it replaces uses of argument %9 and uses of the values
discovered traversing the chain upwards (%7 and %8) with the call return
%10, if they are dominated by the call to @objc_autoreleaseReturnValue.
This transformation enables code-gen to tail-call the call to
@objc_autoreleaseReturnValue, which is necessary to enable auto release
return value optimization.
%7 = tail call i8* @objc_loadWeakRetained(i8** %6)
%8 = bitcast i8* %7 to %0*
%9 = bitcast %0* %8 to i8*
%10 = tail call i8* @objc_autoreleaseReturnValue(i8* %9)
ret %0* %8
Since r276727, llvm started removing redundant bitcasts and as a result
started feeding the following IR to ARC contraction:
%7 = tail call i8* @objc_loadWeakRetained(i8** %6)
%8 = bitcast i8* %7 to %0*
%9 = tail call i8* @objc_autoreleaseReturnValue(i8* %7)
ret %0* %8
ARC contraction no longer does the optimization described above since it
only traverses the chain upwards and fails to recognize that the
function return can be replaced by the call return. This commit changes
ARC contraction to traverse the chain downwards too and replace uses of
bitcasts with the call return.
rdar://problem/28011339
Differential Revision: https://reviews.llvm.org/D24523
llvm-svn: 281419
using to enqueue all the jobs wasn't enough time on a slow/overloaded
system. Instead use a global to indicate when all the work has
been enqueued, let's see if this makes the CIs work more reliably.
llvm-svn: 281418
Summary:
Before, the kernel spec would only return PTX for exactly the requested
compute capability. With this patch it will now return the PTX with the
largest compute capability that does not exceed that requested compute
capability.
Reviewers: jlebar
Subscribers: jprice, jlebar, parallel_libs-commits
Differential Revision: https://reviews.llvm.org/D24531
llvm-svn: 281417
-Wdiv-by-zero may as well be an alias for -Wdivision-by-zero rather than a GCC-compatibility no-op.
-Wno-shadow should disable -Wshadow-ivar.
-Weffc++ may as well enable -Wnon-virtual-dtor like it does in GCC.
llvm-svn: 281412
Something broked BBots:
281318 failed on step 9:
http://lab.llvm.org:8011/builders/clang-with-lto-ubuntu/builds/413
r281317 built step 9 green:
http://lab.llvm.org:8011/builders/clang-with-lto-ubuntu/builds/415
Initial revision commits were:
This is PR30312. Info from bug page:
Both of these symbols demangle to abc::abc():
_ZN3abcC1Ev
_ZN3abcC2Ev
(These would be abc's complete object constructor and base object constructor, respectively.)
however with "abc::abc()" in the version script only one of the two receives the symbol version.
Patch fixes that.
It uses testcase created by Ed Maste (D24306).
Differential revision: https://reviews.llvm.org/D24336
llvm-svn: 281411
It makes the tests extremely slow due to high latency of the test launcher.
The main reason for -j5 was high memory usage with handle_abort=1, which
is now disabled in the test runner.
llvm-svn: 281409
CUDA target attributes are used for function overloading and must not be merged.
This fixes a bug where attributes were inherited during function template
specialization in CUDA and made it impossible for specialized function
to provide its own target attributes.
Differential Revision: https://reviews.llvm.org/D24522
llvm-svn: 281406
r235815 changed CGRecordLowering::accumulateBases to ignore non-virtual
bases of size 0, which prevented adding those non-virtual bases to
CGRecordLayout's NonVirtualBases. This caused clang to assert when
CGRecordLayout::getNonVirtualBaseLLVMFieldNo was called in
EmitNullConstant. This commit fixes the bug by ignoring zero-sized
non-virtual bases in EmitNullConstant.
rdar://problem/28100139
Differential Revision: https://reviews.llvm.org/D24312
llvm-svn: 281405
Summary: When expanding mul in type legalization make sure the type for shift amount can actually fit the value. This fixes PR30354 https://llvm.org/bugs/show_bug.cgi?id=30354.
Reviewers: hfinkel, majnemer, RKSimon
Subscribers: RKSimon, llvm-commits
Differential Revision: https://reviews.llvm.org/D24478
llvm-svn: 281403
This allows us to, in some cases, create a vector_shuffle out of a build_vector, when
the inputs to the build are extract_elements from two different vectors, at least one
of which is wider than the output. (E.g. a <8 x i16> being constructed out of
elements from a <16 x i16> and a <8 x i16>).
Differential Revision: https://reviews.llvm.org/D24491
llvm-svn: 281402
that use the Mach::dyld_info_command type for the load commands that are
currently use in the MachOObjectFile constructor.
This contains the missing checks for LC_DYLD_INFO and
LC_DYLD_INFO_ONLY load commands and the fields for the
Mach::dyld_info_command type.
llvm-svn: 281400
This line makes BUILD_SHARED_LIBS=ON work for Polly-ACC. Without it, ld
complains about missing isl symbols when constructing the shared library.
llvm-svn: 281396