llvm-project

Commit Graph

Author	SHA1	Message	Date
Michael Kruse	32bf468420	[Polly] Fix -polly-opt-isl -analyze The member LastSchedule was never set, such that printScop would always print "n/a" instead of the last schedule. To ensure that the isl_ctx lives as least as long as the stored schedule, also store a shared_ptr. Also set the schedule tree output style to ISL_YAML_STYLE_BLOCK to avoid printing everything on a single line. `opt -polly-opt-isl -analyze` will be used in the next commit.	2020-07-10 16:42:03 -05:00
Roman Gareev	0956a606ff	Disable the Loop Vectorizer in case of GEMM Currently, in case of GEMM and the pattern matching based optimizations, we use only the SLP Vectorizer out of two LLVM vectorizers. Since the Loop Vectorizer can get in the way of optimal code generation, we disable the Loop Vectorizer for the innermost loop using mark nodes and emitting the corresponding metadata. Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D36928 llvm-svn: 311473	2017-08-22 17:38:46 +00:00
Roman Gareev	cdfb57dc46	Introduce another level of metadata to distinguish non-aliasing accesses Introduce another level of alias metadata to distinguish the individual non-aliasing accesses that have inter iteration alias-free base pointers marked with "Inter iteration alias-free" mark nodes. It can be used to, for example, distinguish different stores (loads) produced by unrolling of the innermost loops and, subsequently, sink (hoist) them by LICM. Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D30606 llvm-svn: 298510	2017-03-22 14:25:24 +00:00
Roman Gareev	772498dc68	[NFC] Make ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern This patch makes ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern. Otherwise, it could not use the isolate option, because standardBandOpts could try to tile a band node with anchored subtree and get the error, since the use of the isolate option causes any tree containing the node to be considered anchored. Furthermore, it is not intended to apply standard optimizations, when the matrix multiplication has been detected. llvm-svn: 294444	2017-02-08 13:29:06 +00:00
Roman Gareev	98075fe181	A new algorithm for identification of a SCoP statement that implement a matrix multiplication The current identification of a SCoP statement that implement a matrix multiplication does not help to identify different permutations of loops that contain it and check for dependencies, which can prevent it from being optimized. It also requires external determination of the operands of the matrix multiplication. This patch contains the implementation of a new algorithm that helps to avoid these issues. It also modifies the test cases that generate matrix multiplications with linearized accesses, because the new algorithm does not support them. Reviewed-by: Michael Kruse <llvm@meinersbur.de>, Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D28357 llvm-svn: 293890	2017-02-02 14:23:14 +00:00
Tobias Grosser	67e94fb435	ScheduleOptimizer: Allow to set register width in command line We use this option to set a fixed register width in our test cases to make sure the results are identical accross platforms. llvm-svn: 292002	2017-01-14 07:14:54 +00:00
Roman Gareev	1c2927b209	Specify the default values of the cache parameters If the parameters of the target cache (i.e., cache level sizes, cache level associativities) are not specified or have wrong values, we use ones for parameters of the macro-kernel and do not perform data-layout optimizations of the matrix multiplication. In this patch we specify the default values of the cache parameters to be able to apply the pattern matching optimizations even in this case. Since there is no typical values of this parameters, we use the parameters of Intel Core i7-3820 SandyBridge that also help to attain the high-performance on IBM POWER System S822 and IBM Power 730 Express server. Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D28090 llvm-svn: 290518	2016-12-25 16:32:28 +00:00
Tobias Grosser	0791d5f5aa	ScheduleOptimizer: Fix spelling of option '-polly-target-throughput-vector-fma' througput -> throughput llvm-svn: 290418	2016-12-23 07:33:39 +00:00
Roman Gareev	be5299af0b	Change the determination of parameters of macro-kernel Typically processor architectures do not include an L3 cache, which means that Nc, the parameter of the micro-kernel, is, for all practical purposes, redundant ([1]). However, its small values can cause the redundant packing of the same elements of the matrix A, the first operand of the matrix multiplication. At the same time, big values of the parameter Nc can cause segmentation faults in case the available stack is exceeded. This patch adds an option to specify the parameter Nc as a multiple of the parameter of the micro-kernel Nr. In case of Intel Core i7-3820 SandyBridge and the following options, clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -march=native -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB -mllvm -polly-target-cache-level-associativity=8,8 -mllvm -polly-target-cache-level-sizes=32768,262144 -mllvm -polly-target-latency-vector-fma=8 it helps to improve the performance from 11.303 GFlops/sec (39,247% of theoretical peak) to 17.896 GFlops/sec (62,14% of theoretical peak). Refs.: [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D28019 llvm-svn: 290256	2016-12-21 12:51:12 +00:00
Roman Gareev	8babe1a216	The order of the loops defines the data reused in the BLIS implementation of gemm ([1]). In particular, elements of the matrix B, the second operand of matrix multiplication, are reused between iterations of the innermost loop. To keep the reused data in cache, only elements of matrix A, the first operand of matrix multiplication, should be evicted during an iteration of the innermost loop. To provide such a cache replacement policy, elements of the matrix A can, in particular, be loaded first and, consequently, be least-recently-used. In our case matrices are stored in row-major order instead of column-major order used in the BLIS implementation ([1]). One of the ways to address it is to accordingly change the order of the loops of the loop nest. However, it makes elements of the matrix A to be reused in the innermost loop and, consequently, requires to load elements of the matrix B first. Since the LLVM vectorizer always generates loads from the matrix A before loads from the matrix B and we can not provide it. Consequently, we only change the BLIS micro kernel and the computation of its parameters instead. In particular, reused elements of the matrix B are successively multiplied by specific elements of the matrix A . Refs.: [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: https://reviews.llvm.org/D25653 llvm-svn: 289806	2016-12-15 11:47:38 +00:00
Roman Gareev	3a18a931a8	Apply all necessary tilings and interchangings to get a macro-kernel This is the second patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS macro-kernel by applying a combination of tiling and interchanging. In subsequent changes we will implement the packing transformation. Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: http://reviews.llvm.org/D21491 llvm-svn: 276627	2016-07-25 09:42:53 +00:00
Roman Gareev	42402c9e89	Apply all necessary tilings and unrollings to get a micro-kernel This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will add the extraction of the BLIS macro-kernel and implement the packing transformation. Contributed-by: Roman Gareev <gareevroman@gmail.com> Reviewed-by: Tobias Grosser <tobias@grosser.es> Differential Revision: http://reviews.llvm.org/D21140 llvm-svn: 273397	2016-06-22 09:52:37 +00:00

12 Commits