Commit Graph

110 Commits

Author SHA1 Message Date
Uday Bondhugula 294687ef59 Fix affine expr flattener bug introduced by cl/225452174.
- inconsistent local var constraint size when repeatedly using the same
  flattener for all expressions in a map.

PiperOrigin-RevId: 227067836
2019-03-29 14:40:37 -07:00
Alex Zinenko eb0f9f37af SuperVectorization: fix 'isa' assertion
Supervectorization uses null pointers to SSA values as a means of communicating
the failure to vectorize.  In operation vectorization, all operations producing
the values of operation arguments must be vectorized for the given operation to
be vectorized.  The existing check verified if any of the value "def"
statements was vectorized instead, sometimes leading to assertions inside `isa`
called on a null pointer.  Fix this to check that all "def" statements were
vectorized.

PiperOrigin-RevId: 226941552
2019-03-29 14:37:20 -07:00
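
As a rough illustration of the fix above, the check goes from an existential to a universal condition over operand definitions; a minimal standalone C++ sketch with hypothetical types, not the actual SuperVectorization code:

```
#include <algorithm>
#include <vector>

struct Value; // stand-in for an SSA value; nullptr encodes "failed to vectorize"

// Buggy check: the operation was treated as vectorizable if *any* operand
// definition had been vectorized, so later code could call isa<> on a null pointer.
bool anyDefVectorized(const std::vector<Value *> &vectorizedDefs) {
  return std::any_of(vectorizedDefs.begin(), vectorizedDefs.end(),
                     [](Value *v) { return v != nullptr; });
}

// Fixed check: every operand definition must have been vectorized before the
// operation itself is vectorized.
bool allDefsVectorized(const std::vector<Value *> &vectorizedDefs) {
  return std::all_of(vectorizedDefs.begin(), vectorizedDefs.end(),
                     [](Value *v) { return v != nullptr; });
}
```
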
MLIR Team 4eef795a1d Computation slice update: adds parameters to insertBackwardComputationSlice which specify the source loop nest depth at which to perform iteration space slicing, and the destination loop nest depth at which to insert the computation slice.
Updates LoopFusion pass to take these parameters as command line flags for experimentation.

PiperOrigin-RevId: 226514297
2019-03-29 14:35:03 -07:00
MLIR Team bcb7c4742d Do proper indexing for local variables when building access function equality constraints (working on test cases).
PiperOrigin-RevId: 226399089
2019-03-29 14:34:02 -07:00
MLIR Team 2570fb5bb7 Address some issues from memref dependence check bug (b/121216762); adds test cases.
PiperOrigin-RevId: 226277453
2019-03-29 14:33:17 -07:00
MLIR Team 6892ffb896 Improve loop fusion algorithm by using a memref dependence graph.
Fixed TODO for reduction fusion unit test.

PiperOrigin-RevId: 226277226
2019-03-29 14:33:02 -07:00
Uday Bondhugula 14d2618f63 Simplify memref-dependence-check's meta data structures / drop duplication and
reuse existing ones.

- drop IterationDomainContext, redundant since FlatAffineConstraints has
  MLValue information associated with its dimensions.
- refactor to use existing support
- leads to a reduction in LOC
- as a result of these changes, non-constant loop bounds get naturally
  supported for dep analysis.
- update test cases to include a couple with non-constant loop bounds
- rename addBoundsFromForStmt -> addForStmtDomain
- complete TODO for getLoopIVs (handle 'if' statements)

PiperOrigin-RevId: 226082008
2019-03-29 14:32:46 -07:00
Uday Bondhugula 1d72f2e47e Update / complete a TODO for addBoundsForForStmt
- when adding constraints from a 'for' stmt into FlatAffineConstraints,
  correctly add bound operands of the 'for' stmt as a dimensional identifier or
  a symbolic identifier depending on whether the bound operand is a valid
  MLFunction symbol
- update test case to exercise this.

PiperOrigin-RevId: 225988511
2019-03-29 14:32:31 -07:00
Uday Bondhugula 20531932f4 Refactor/update memref-dep-check's addMemRefAccessConstraints and
addDomainConstraints; add support for mod/div for dependence testing.

- add support for mod/div expressions in dependence analysis
- refactor addMemRefAccessConstraints to use getFlattenedAffineExprs (instead
  of getFlattenedAffineExpr); update addDomainConstraints.
- rename AffineExprFlattener::cst -> localVarCst

PiperOrigin-RevId: 225933306
2019-03-29 14:31:58 -07:00
Alex Zinenko 51c8a095a3 Materialize vector_type_cast operation in the SuperVector dialect
This operation is produced and used by the super-vectorization passes and has
been emitted as an abstract unregistered operation until now.  For end-to-end
testing purposes, it has to be eventually lowered to LLVM IR.  Matching
abstract operation by name goes into the opposite direction of the generic
lowering approach that is expected to be used for LLVM IR lowering in the
future.  Register vector_type_cast operation as a part of the SuperVector
dialect.

Arguably, this operation is a special case of the `view` operation from the
Standard dialect.  The semantics of `view` is not fully specified at this point
so it is safer to rely on a custom operation.  Additionally, using a custom
operation may help to achieve clear dialect separation.

PiperOrigin-RevId: 225887305
2019-03-29 14:31:13 -07:00
MLIR Team 3b69230b3a Loop Fusion pass update: introduce utilities to perform generalized loop fusion based on slicing; encompasses standard loop fusion.
*) Adds simple greedy fusion algorithm to drive experimentation. This algorithm greedily fuses loop nests with single-writer/single-reader memref dependences to improve locality.
*) Adds support for fusing slices of a loop nest computation: fusing one loop nest into another by adjusting the source loop nest's iteration bounds (after it is fused into the destination loop nest). This is accomplished by solving for the source loop nest's IVs in terms of the destination loop nest's IVs and symbols using the dependence polyhedron, then creating AffineMaps of these functions for the loop bounds of the fused source loop.
*) Adds utility function 'insertMemRefComputationSlice' which computes and inserts a computation slice from the loop nest surrounding a source memref access into the loop nest surrounding the destination memref access.
*) Adds FlatAffineConstraints::toAffineMap function which returns an AffineMap that represents an equality constraint where one dimension identifier is expressed as a function of all others in the equality constraint.
*) Adds multiple fusion unit tests.

PiperOrigin-RevId: 225842944
2019-03-29 14:30:13 -07:00
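
A hedged sketch of the FlatAffineConstraints::toAffineMap idea from the commit above: given a single equality constraint, solve for one dimension identifier in terms of the others (standalone code with an assumed coefficient layout, not the actual API):

```
#include <cstddef>
#include <optional>
#include <vector>

// Given an equality  sum_i coeffs[i] * x_i + coeffs.back() == 0, expresses x_pos
// as an affine function of the remaining identifiers when its coefficient is
// +1 or -1 (otherwise the result would not be an integral affine function).
// Returns that function's coefficients, constant term last. Illustrative only;
// the real FlatAffineConstraints::toAffineMap builds an AffineMap instead.
std::optional<std::vector<long>> solveEqualityForDim(
    const std::vector<long> &coeffs, std::size_t pos) {
  long c = coeffs[pos];
  if (c != 1 && c != -1)
    return std::nullopt;
  std::vector<long> fn;
  for (std::size_t i = 0; i + 1 < coeffs.size(); ++i)
    if (i != pos)
      fn.push_back(-coeffs[i] / c); // move every other term to the other side
  fn.push_back(-coeffs.back() / c); // constant term
  return fn;
}
```

For example, the equality x0 - 2*x1 + 3 == 0 solved for x0 yields the coefficients {2, -3}, i.e. x0 = 2*x1 - 3.
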
Uday Bondhugula c41ee60647 'memref-bound-check': extend to store op's as well
- extend memref-bound-check to store op's
- make the bound check an analysis util and move to lib/Analysis/Utils.cpp (so that
  one doesn't need to always create a pass to use it)

PiperOrigin-RevId: 225564830
2019-03-29 14:29:13 -07:00
Uday Bondhugula 45a0f52519 Expression flattening improvement - reuse local expressions.
- if a local id already exists for a specific mod/div expression, just reuse it if
  the expression repeats (instead of adding a new one).
- drastically reduces the number of local variables added during flattening for
  real use cases - since the same div's and mod expressions often repeat.
- add getFlattenedAffineExprs for AffineMap, IntegerSet based on the above

As a natural result of the above:

- FlatAffineConstraints(IntegerSet) ctor now deals with integer sets that have mod
  and div constraints as well, and these now get simplified by -simplify-affine-structures

PiperOrigin-RevId: 225452174
2019-03-29 14:28:13 -07:00
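
The reuse described in the commit above can be sketched as a small cache keyed by the mod/div expression (hypothetical, simplified key; not the actual AffineExprFlattener code):

```
#include <map>
#include <tuple>

enum class LocalKind { Mod, FloorDiv, CeilDiv };

// Sketch of the reuse idea: remember which local variable was introduced for a
// given mod/div expression and hand it back on repeats instead of adding a new
// one. The key here is a hypothetical (kind, dividend id, divisor) triple; the
// real flattener would key on the full flattened dividend expression.
struct LocalVarCache {
  std::map<std::tuple<LocalKind, int, long>, int> seen;
  int numLocals = 0;

  int getOrCreateLocal(LocalKind kind, int dividendId, long divisor) {
    auto key = std::make_tuple(kind, dividendId, divisor);
    auto it = seen.find(key);
    if (it != seen.end())
      return it->second; // same mod/div expression seen before: reuse its local
    int id = numLocals++;
    seen.emplace(key, id);
    return id;
  }
};
```
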
Uday Bondhugula 8365bdc17f FlatAffineConstraints - complete TODOs: add method to remove duplicate /
trivially redundant constraints. Update projectOut to eliminate identifiers in
a more efficient order. Fix b/120801118.

- add method to remove duplicate / trivially redundant constraints from
  FlatAffineConstraints (use a hashing-based approach with DenseSet)
- update projectOut to eliminate identifiers in a more efficient order

A sequence of affine_apply's like the following (from a real use case) finally
exposed the lack of the above trivial, low-hanging simplifications:

  for %ii = 0 to 64 {
    for %jj = 0 to 9 {
      %a0 = affine_apply (d0, d1) -> (d0 * (9 * 1024) + d1 * 128) (%ii, %jj)
      %a1 = affine_apply (d0) ->
        (d0 floordiv (2 * 3 * 3 * 128 * 128),
        (d0 mod 294912) floordiv (3 * 3 * 128 * 128),
        (((d0 mod 294912) mod 147456) floordiv 1152) floordiv 8,
        (((d0 mod 294912) mod 147456) mod 1152) floordiv 384,
        ((((d0 mod 294912) mod 147456) mod 1152) mod 384) floordiv 128,
        (((((d0 mod 294912) mod 147456) mod 1152) mod 384) mod 128)
          floordiv 128) (%a0)
      %v0 = load %in[%a1#0, %a1#1, %a1#3, %a1#4, %a1#2, %a1#5]
        : memref<2x2x3x3x16x1xi32>
    }
  }

- update FlatAffineConstraints::print to print number of constraints.

PiperOrigin-RevId: 225397480
2019-03-29 14:27:29 -07:00
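
A minimal standalone sketch of the duplicate-removal idea from the commit above, using std::set in place of a hashing-based DenseSet (assumed representation: one coefficient row per constraint; not the FlatAffineConstraints code):

```
#include <set>
#include <vector>

// Each constraint is a row of integer coefficients (constant term included).
using Constraint = std::vector<long>;

// Removes exact duplicates, keeping the first occurrence of each row.
// Trivially redundant rows that differ only by a positive scaling would need
// normalization (e.g. dividing by the row GCD) before this step.
std::vector<Constraint>
removeDuplicateConstraints(const std::vector<Constraint> &rows) {
  std::set<Constraint> seen;
  std::vector<Constraint> result;
  for (const Constraint &row : rows)
    if (seen.insert(row).second) // insert succeeds only for unseen rows
      result.push_back(row);
  return result;
}
```
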
Uday Bondhugula 4860f0e8fd Fix loop unrolling test cases
- These test cases had to be updated post the switch to exclusive upper bound;
  however, the test cases hadn't originally been written to check correctly; as
  a result, they didn't fail and weren't updated. Update test case and fix
  upper bound.

PiperOrigin-RevId: 225194016
2019-03-29 14:26:56 -07:00
Alex Zinenko 97d2f3cd3d ConvertToCFG: use affine_apply to implement loop steps
Originally, loop steps were implemented using `addi` and `constant` operations
because `affine_apply` was not handled in the first implementation.  The
support for `affine_apply` has been added, use it to implement the update of
the loop induction variable.  This is more consistent with the lower and upper
bounds of the loop that are also implemented as `affine_apply`, removes the
dependence of the converted function on the StandardOps dialect and makes it
clear from the CFG function that all operations on the loop induction variable
are purely affine.

PiperOrigin-RevId: 225165337
2019-03-29 14:26:22 -07:00
Uday Bondhugula b9f53dc0bd Update/Fix LoopUtils::stmtBodySkew to handle loop step.
- loop step wasn't handled and there wasn't a TODO or an assertion; fix this.
- rename 'delay' to shift for consistency/readability.
- other readability changes.
- remove duplicate attribute print for DmaStartOp; fix misplaced attribute
  print for DmaWaitOp
- add build method for AddFOp (unrelated to this CL, but add it anyway)

PiperOrigin-RevId: 224892958
2019-03-29 14:25:07 -07:00
Uday Bondhugula d59a95a05c Fix missing check for dependent DMAs in pipeline-data-transfer
- adding a conservative check for now (TODO: use the dependence analysis pass
  once the latter is extended to deal with DMA ops). resolve an existing bug on
  a test case.

- update test cases

PiperOrigin-RevId: 224869526
2019-03-29 14:24:53 -07:00
Uday Bondhugula 6757fb151d FlatAffineConstraints API cleanup; add normalizeConstraintsByGCD().
- add method normalizeConstraintsByGCD
- call normalizeConstraintsByGCD() and GCDTightenInequalities() at the end of
  projectOut.
- remove call to GCDTightenInequalities() from getMemRefRegion
- change isEmpty() to check isEmptyByGCDTest() / hasInvalidConstraint() each
  time an identifier is eliminated (to detect emptiness early).
- make FourierMotzkinEliminate, gaussianEliminateId(s),
  GCDTightenInequalities() private
- improve / update stale comments

PiperOrigin-RevId: 224866741
2019-03-29 14:24:37 -07:00
Uday Bondhugula 2ef57806ba Update/fix -pipeline-data-transfer; fix b/120770946
- fix replaceAllMemRefUsesWith call to replace only inside loop body.
- handle the case where DMA buffers are dynamic; extend doubleBuffer() method
  to handle dynamically shaped DMA buffers (pass the right operands to AllocOp)
- place alloc's for DMA buffers at the depth at which pipelining is being done
  (instead of at top-level)
- add more test cases

PiperOrigin-RevId: 224852231
2019-03-29 14:24:22 -07:00
Uday Bondhugula 2d6478fa92 Extend loop tiling utility to handle non-constant loop bounds and bounds that
are a max/min of several expressions.

- Extend loop tiling to handle non-constant loop bounds and bounds that
  are a max/min of several expressions, i.e., bounds using multi-result affine
  maps

- also fix b/120630124 as a result (the IR was in an invalid state when tiled
  loop generation failed; SSA uses were created that weren't plugged into the IR).

PiperOrigin-RevId: 224604460
2019-03-29 14:23:34 -07:00
Uday Bondhugula dfc752e42b Generate strided DMAs from -dma-generate
- generate DMAs correctly now using strided DMAs where needed
- add support for multi-level/nested strides; op still supports one level of
  stride for now.

Other things
- add test case for symbolic lower/upper bound; cases where the DMA buffer
  size can't be bounded by a known constant
- add test case for dynamic shapes where the DMA buffers are however bounded by
  constants
- refactor some of the '-dma-generate' code

PiperOrigin-RevId: 224584529
2019-03-29 14:23:19 -07:00
Nicolas Vasilache d9b6420fc9 [MLIR] Add LowerVectorTransfersPass
This CL adds a pass that lowers VectorTransferReadOp and VectorTransferWriteOp
to a simple loop nest via local buffer allocations.

This is an MLIR->MLIR lowering based on builders.

A few TODOs are left to address, in particular:
1. invert the permutation map so the accesses to the remote memref are coalesced;
2. pad the alloc for bank conflicts in local memory (e.g. GPUs shared_memory);
3. support broadcast / avoid copies when the permutation_map is not of full column rank;
4. add a proper "element_cast" op.

One notable limitation is that this does not plan on supporting boundary conditions.
It should be significantly easier to use pre-baked MLIR functions to handle such paddings.
This is left for future consideration.
Therefore the current CL only works properly for full-tile cases for now.

This CL also adds 2 simple tests:

```mlir
  for %i0 = 0 to %M step 3 {
    for %i1 = 0 to %N step 4 {
      for %i2 = 0 to %O {
        for %i3 = 0 to %P step 5 {
          vector_transfer_write %f1, %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d0)} : vector<5x4x3xf32>, memref<?x?x?x?xf32, 0>, index, index, index, index
```

lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
  for %i1 = 0 to %arg1 step 4 {
    for %i2 = 0 to %arg2 {
      for %i3 = 0 to %arg3 step 5 {
        %1 = alloc() : memref<5x4x3xf32>
        %2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
        store %cst, %2[%c0] : memref<1xvector<5x4x3xf32>>
        for %i4 = 0 to 5 {
          %3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
          for %i5 = 0 to 4 {
            %4 = affine_apply (d0, d1) -> (d0 + d1) (%i1, %i5)
            for %i6 = 0 to 3 {
              %5 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
              %6 = load %1[%i4, %i5, %i6] : memref<5x4x3xf32>
              store %6, %0[%5, %4, %i2, %3] : memref<?x?x?x?xf32>
       dealloc %1 : memref<5x4x3xf32>
```

and
```mlir
  for %i0 = 0 to %M step 3 {
    for %i1 = 0 to %N {
      for %i2 = 0 to %O {
        for %i3 = 0 to %P step 5 {
          %f = vector_transfer_read %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, 0, d0)} : (memref<?x?x?x?xf32, 0>, index, index, index, index) -> vector<5x4x3xf32>

```

lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
  for %i1 = 0 to %arg1 {
    for %i2 = 0 to %arg2 {
      for %i3 = 0 to %arg3 step 5 {
        %1 = alloc() : memref<5x4x3xf32>
        %2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
        for %i4 = 0 to 5 {
          %3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
          for %i5 = 0 to 4 {
            for %i6 = 0 to 3 {
              %4 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
              %5 = load %0[%4, %i1, %i2, %3] : memref<?x?x?x?xf32>
              store %5, %1[%i4, %i5, %i6] : memref<5x4x3xf32>
        %6 = load %2[%c0] : memref<1xvector<5x4x3xf32>>
        dealloc %1 : memref<5x4x3xf32>
```

PiperOrigin-RevId: 224552717
2019-03-29 14:23:05 -07:00
Nicolas Vasilache 4adc169bd0 [MLIR] Add AffineMap composition and use it in Materialization
This CL adds the following free functions:
```
/// Returns the AffineExpr e o m.
AffineExpr compose(AffineExpr e, AffineMap m);
/// Returns the AffineMap f o g.
AffineMap compose(AffineMap f, AffineMap g);
```

This addresses the issue that AffineMap composition is only available at a
distance via AffineValueMap and is thus unusable on Attributes.
This CL thus implements AffineMap composition in a more modular and composable
way.

This CL does not claim to be a good replacement for the implementation in
AffineValueMap; in particular, it does not support bounded maps at the moment.

Standalone tests are added that replicate some of the logic of the AffineMap
composition pass.

Lastly, affine map composition is used properly inside MaterializeVectors and
a standalone test is added that requires permutation_map composition with a
projection map.

PiperOrigin-RevId: 224376870
2019-03-29 14:20:22 -07:00
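
To make the `f o g` composition order concrete, here is a tiny standalone C++ sketch that models maps as plain functions over index vectors (an illustration of the intended semantics only, not the AffineMap API added by the commit above):

```
#include <cassert>
#include <functional>
#include <vector>

using IndexVec = std::vector<long>;
using Map = std::function<IndexVec(const IndexVec &)>;

// Returns the map (f o g), i.e. the map that applies g first and then f.
Map compose(Map f, Map g) {
  return [f, g](const IndexVec &x) { return f(g(x)); };
}

int main() {
  // g : (d0) -> (2*d0, d0 + 1),  f : (d0, d1) -> (d0 + d1)
  Map g = [](const IndexVec &x) { return IndexVec{2 * x[0], x[0] + 1}; };
  Map f = [](const IndexVec &x) { return IndexVec{x[0] + x[1]}; };
  Map fg = compose(f, g);
  assert(fg({5})[0] == 16); // (f o g)(5) = f(10, 6) = 16
  return 0;
}
```
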
Nicolas Vasilache df0a25efee [MLIR] Add support for permutation_map
This CL hooks up and uses permutation_map in vector_transfer ops.
In particular, when going into the nuts and bolts of the implementation, it
became clear that cases arose that required supporting broadcast semantics.
Broadcast semantics are thus added to the general permutation_map.
The verify methods and tests are updated accordingly.

Examples of interest include:

Example 1:
The following MLIR snippet:
```mlir
   for %i3 = 0 to %M {
     for %i4 = 0 to %N {
       for %i5 = 0 to %P {
         %a5 = load %A[%i4, %i5, %i3] : memref<?x?x?xf32>
   }}}
```
may vectorize with {permutation_map: (d0, d1, d2) -> (d2, d1)} into:
```mlir
   for %i3 = 0 to %0 step 32 {
     for %i4 = 0 to %1 {
       for %i5 = 0 to %2 step 256 {
         %4 = vector_transfer_read %arg0, %i4, %i5, %i3
              {permutation_map: (d0, d1, d2) -> (d2, d1)} :
              (memref<?x?x?xf32>, index, index) -> vector<32x256xf32>
   }}}
```
Meaning that vector_transfer_read will be responsible for reading the 2-D slice:
`%arg0[%i4, %i5:%i5+256, %i3:%i3+32]` into vector<32x256xf32>. This will
require a transposition when vector_transfer_read is further lowered.

Example 2:
The following MLIR snippet:
```mlir
   %cst0 = constant 0 : index
   for %i0 = 0 to %M {
     %a0 = load %A[%cst0, %cst0] : memref<?x?xf32>
   }
```
may vectorize with {permutation_map: (d0) -> (0)} into:
```mlir
   for %i0 = 0 to %0 step 128 {
     %3 = vector_transfer_read %arg0, %c0_0, %c0_0
          {permutation_map: (d0, d1) -> (0)} :
          (memref<?x?xf32>, index, index) -> vector<128xf32>
   }
```
Meaning that vector_transfer_read will be responsible for reading the 0-D slice
`%arg0[%c0, %c0]` into vector<128xf32>. This will require a 1-D vector
broadcast when vector_transfer_read is further lowered.

Additionally, some minor cleanups and refactorings are performed.

One notable thing missing here is the composition with a projection map during
materialization. This is because I could not find an AffineMap composition
that operates on AffineMap directly: everything related to composition seems
to require going through SSAValue and only operates on AffineMap at a distance
via AffineValueMap. I have raised this concern a bunch of times already; the
followup CL will actually do something about it.

In the meantime, the projection is hacked at a minimum to pass verification
and materialization tests are temporarily incorrect.

PiperOrigin-RevId: 224376828
2019-03-29 14:20:07 -07:00
Alex Zinenko 7c89a225cf ConvertToCFG: support min/max in loop bounds.
The recently introduced `select` operation enables ConvertToCFG to support
min(max) in loop bounds.  Individual min(max) is implemented as
`cmpi "lt"`(`cmpi "gt"`) followed by a `select` between the compared values.
Multiple results of an `affine_apply` operation extracted from the loop bounds
are reduced using min(max) in a sequential manner.  While this may decrease the
potential for instruction-level parallelism, it is easier to recognize for the
following passes, in particular for the vectorizer.

PiperOrigin-RevId: 224376233
2019-03-29 14:19:52 -07:00
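
A standalone sketch of the sequential reduction described above, where each step corresponds to one `cmpi` followed by one `select` (hypothetical helper, not the ConvertToCFG code):

```
#include <cstddef>
#include <vector>

// Sequentially reduces the results of a multi-result bound map to a single
// value: for an upper bound we take the minimum, for a lower bound the maximum.
// Assumes `values` holds at least one element.
long reduceBound(const std::vector<long> &values, bool isMin) {
  long acc = values.front();
  for (std::size_t i = 1; i < values.size(); ++i) {
    bool cond = isMin ? (values[i] < acc) : (values[i] > acc); // cmpi "lt"/"gt"
    acc = cond ? values[i] : acc;                              // select
  }
  return acc;
}
```
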
MLIR Team a53ed1b767 Fix bug in GCD calculation when flattening AffineExpr (adds unit test which triggers the bug and tests the fix).
PiperOrigin-RevId: 224246657
2019-03-29 14:19:07 -07:00
Uday Bondhugula a92130880e Complete multiple unhandled cases for DmaGeneration / getMemRefRegion;
update/improve/clean up API.

- update FlatAffineConstraints::getConstBoundDifference; return constant
  differences between symbolic affine expressions, look at equalities as well.
- fix buffer size computation when generating DMAs symbolic in outer loops,
  correctly handle symbols at various places (affine access maps, loop bounds,
  loop IVs outer to the depth at which DMA generation is being done)
- bug fixes / complete some TODOs for getMemRefRegion
- refactor common code b/w memref dependence check and getMemRefRegion
- FlatAffineConstraints API update; added methods employ trivial checks /
  detection - sufficient to handle hyper-rectangular cases in a precise way
  while being fast / low complexity. Hyper-rectangular cases fall out as
  trivial cases for these methods while other cases still do not cause failure
  (either return conservative or return failure that is handled by the caller).

PiperOrigin-RevId: 224229879
2019-03-29 14:18:22 -07:00
MLIR Team 753109547d During forward substitution, merge symbols from input AffineMap with the symbol list of the target AffineMap.
Symbols can be used as dim identifiers and symbolic identifiers, and so we must preserve the symbolic identifiers from the input AffineMap during forward substitution, even if that same identifier is used as a dimension identifier in the target AffineMap.
Test case added.

Going forward, we may want to explore solutions where we do not maintain this split between dimensions and symbols, and instead verify the validity of each AffineMap operand use in the context where that operand is required to be a symbol: in the denominator of floordiv/ceildiv/mod for semi-affine maps, and in instructions that can capture symbols (i.e. alloc).

PiperOrigin-RevId: 224017364
2019-03-29 14:16:40 -07:00
Alex Zinenko 7868abd9d8 ConvertToCFG: convert "if" statements.
The condition of the "if" statement is an integer set, defined as a conjunction
of affine constraints.  An affine constraint consists of an affine expression
and a flag indicating whether the expression is strictly equal to zero or is
also allowed to be greater than zero.  Affine maps, accepted by `affine_apply`
are also formed from affine expressions.  Leverage this fact to implement the
checking of "if" conditions.  Each affine expression from the integer set is
converted into an affine map.  This map is applied to the arguments of the "if"
statement.  The result of the application is compared with zero given the
equality flag to obtain the final boolean value.  The conjunction of conditions
is tested sequentially with short-circuit branching to the "else" branch if any
of the conditions evaluates to false.

Create an SESE region for the if statement (including its "then" and optional
"else" statement blocks) and append it to the end of the current region.  The
conditional region consists of a sequence of condition-checking blocks that
implement the short-circuit scheme, followed by a "then" SESE region and an
"else" SESE region, and the continuation block that post-dominates all blocks
of the "if" statement.  The flow of blocks that correspond to the "then" and
"else" clauses are constructed recursively, enabling easy nesting of "if"
statements and if-then-else-if chains.

Note that MLIR semantics neither requires nor prohibits short-circuit
evaluation.  Since affine expressions do not have side effects, there is no
observable difference in the program behavior.  We may trade off extra
operations for operation-level parallelism opportunity by first performing all
`affine_apply` and comparison operations independently, and then performing a
tree pattern reduction of the resulting boolean values with the `muli i1`
operations (in absence of the dedicated bit operations).  The pros and cons are
not clear, and since MLIR does not include parallel semantics, we prefer to
minimize the number of sequentially executed operations.

PiperOrigin-RevId: 223970248
2019-03-29 14:16:10 -07:00
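
The condition-checking scheme above can be sketched as follows, with each integer-set constraint modeled as an affine function plus an equality flag, and short-circuiting on the first failed constraint (illustrative C++ only; the actual pass emits basic blocks rather than a loop):

```
#include <functional>
#include <vector>

struct AffineConstraint {
  std::function<long(const std::vector<long> &)> apply; // the affine expression
  bool isEquality; // true: expr == 0, false: expr >= 0
};

// Evaluates the integer-set condition of an "if" with short-circuiting, the
// way the generated blocks branch to the "else" region as soon as one
// constraint fails.
bool evalCondition(const std::vector<AffineConstraint> &set,
                   const std::vector<long> &operands) {
  for (const AffineConstraint &c : set) {
    long v = c.apply(operands);                   // affine_apply on the operands
    bool ok = c.isEquality ? (v == 0) : (v >= 0); // compare against zero
    if (!ok)
      return false;                               // short-circuit to "else"
  }
  return true;                                    // fall through to "then"
}
```
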
Nicolas Vasilache ebb3d38471 [MLIR] Separate and split vectorization tests
These tests have become too bulky and unwieldy.
Splitting simplifies modifications that will occur in the next CL.

PiperOrigin-RevId: 223874321
2019-03-29 14:15:40 -07:00
Nicolas Vasilache b39d1f0bdb [MLIR] Add VectorTransferOps
This CL implements and uses VectorTransferOps in lieu of the former custom
call op. Tests are updated accordingly.

VectorTransferOps come in 2 flavors: VectorTransferReadOp and
VectorTransferWriteOp.

VectorTransferOps can be thought of as a backend-independent
pseudo op/library call that needs to be legalized to MLIR (whiteboxed) before
it can be lowered to backend-dependent IR.

Note that the current implementation does not yet support a real permutation
map. Proper support will come in a followup CL.

VectorTransferReadOp
====================
VectorTransferReadOp performs a blocking read from a scalar memref
location into a super-vector of the same elemental type. This operation is
called 'read' as opposed to 'load' because the super-vector granularity
is generally not representable with a single hardware register. As a
consequence, memory transfers will generally be required when lowering
VectorTransferReadOp. A VectorTransferReadOp is thus a mid-level abstraction
that supports super-vectorization with non-effecting padding for full-tile
only code.

A vector transfer read has semantics similar to a vector load, with additional
support for:
  1. an optional value of the elemental type of the MemRef. This value
     supports non-effecting padding and is inserted in places where the
     vector read exceeds the MemRef bounds. If the value is not specified,
     the access is statically guaranteed to be within bounds;
  2. an attribute of type AffineMap to specify a slice of the original
     MemRef access and its transposition into the super-vector shape. The
     permutation_map is an unbounded AffineMap that must represent a
     permutation from the MemRef dim space projected onto the vector dim
     space.

Example:
```mlir
  %A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>
  ...
  %val = `ssa-value` : f32
  // let %i, %j, %k, %l be ssa-values of type index
  %v0 = vector_transfer_read %src, %i, %j, %k, %l
        {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
          (memref<?x?x?x?xf32>, index, index, index, index) ->
            vector<16x32x64xf32>
  %v1 = vector_transfer_read %src, %i, %j, %k, %l, %val
        {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
          (memref<?x?x?x?xf32>, index, index, index, index, f32) ->
            vector<16x32x64xf32>
```

VectorTransferWriteOp
=====================
VectorTransferWriteOp performs a blocking write from a super-vector to
a scalar memref of the same elemental type. This operation is
called 'write' as opposed to 'store' because the super-vector
granularity is generally not representable with a single hardware register. As
a consequence, memory transfers will generally be required when lowering
VectorTransferWriteOp. A VectorTransferWriteOp is thus a mid-level
abstraction that supports super-vectorization with non-effecting padding
for full-tile only code.
A vector transfer write has semantics similar to a vector store, with
additional support for handling out-of-bounds situations.

Example:
```mlir
  %A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>.
  %val = `ssa-value` : vector<16x32x64xf32>
  // let %i, %j, %k, %l be ssa-values of type index
  vector_transfer_write %val, %src, %i, %j, %k, %l
    {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
  (vector<16x32x64xf32>, memref<?x?x?x?xf32>, index, index, index, index)
```
PiperOrigin-RevId: 223873234
2019-03-29 14:15:25 -07:00
Uday Bondhugula 89c41fdca1 FlatAffineConstraints::composeMap: return failure instead of asserting on semi-affine maps
FlatAffineConstraints::composeMap: should return false instead of asserting on
a semi-affine map. Make getMemRefRegion just propagate false when encountering
semi-affine maps (instead of crashing!)
PiperOrigin-RevId: 223828743
2019-03-29 14:14:56 -07:00
Uday Bondhugula 5f76245cfe Minor fix for replaceAllMemRefUsesWith.
The check for whether the memref was used in a non-dereferencing context had to
be done inside, i.e., only for the op stmt's that the replacement was specified
to be performed on (by the domStmtFilter arg if provided). As such, it is
completely fine for example for a function to return a memref while the replacement
is being performed only on a specific loop's body (as in the case of DMA
generation).

PiperOrigin-RevId: 223827753
2019-03-29 14:14:43 -07:00
River Riddle 7669a259c4 Add a simple common sub expression elimination pass.
The algorithm collects defining operations within a scoped hash table. The scopes within the hash table correspond to nodes within the dominance tree for a function. This CL only adds support for simple, non-side-effecting operations; operations with side effects, e.g. load/store/call, will be handled in later patches.

PiperOrigin-RevId: 223811328
2019-03-29 14:14:28 -07:00
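
A deliberately simplified, single-scope sketch of the idea above, using standard containers instead of a scoped hash table over the dominance tree (hypothetical types; side effects ignored):

```
#include <map>
#include <string>
#include <utility>
#include <vector>

// A grossly simplified "operation": name plus operand ids, producing one result.
struct SimpleOp {
  std::string name;
  std::vector<int> operands;
  int result;
};

// Single-block CSE: operations with the same name and operands are redundant,
// so later ones are mapped onto the first occurrence. The real pass scopes the
// table to dominance-tree nodes and only considers side-effect-free ops.
std::map<int, int> cse(const std::vector<SimpleOp> &ops) {
  std::map<std::pair<std::string, std::vector<int>>, int> known;
  std::map<int, int> replacements; // redundant result id -> existing result id
  for (const SimpleOp &op : ops) {
    auto key = std::make_pair(op.name, op.operands);
    auto it = known.find(key);
    if (it != known.end())
      replacements[op.result] = it->second;
    else
      known.emplace(key, op.result);
  }
  return replacements;
}
```
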
Nicolas Vasilache 1ae66f6520 [MLIR] Reenable materialize_vectors test
Fixes one of the FileCheck'ed tests which was mistakenly disabled.

PiperOrigin-RevId: 223401978
2019-03-29 14:12:40 -07:00
Alex Zinenko 68e9721aa8 Rename Deaffinator to LowerAffineApply and patch it.
Several things were suggested in post-submission reviews.  In particular, use
pointers in function interfaces instead of references (still use references
internally).  Clarify the behavior of the pass in presence of MLFunctions.

PiperOrigin-RevId: 222556851
2019-03-29 14:08:59 -07:00
Nicolas Vasilache a5782f0d40 [MLIR][MaterializeVectors] Add a MaterializeVector pass via unrolling.
This CL adds an MLIR-MLIR pass which materializes super-vectors to
hardware-dependent sized vectors.

While the physical vector size is target-dependent, the pass is written in
a target-independent way: the target vector size is specified as a parameter
to the pass. This pass is thus a partial lowering that opens the "greybox"
that is the super-vector abstraction.

This first CL adds a materialization pass that iterates over vector_transfer_write operations and:
1. computes the program slice including the current vector_transfer_write;
2. computes the multi-dimensional ratio of super-vector shape to hardware
vector shape;
3. for each possible multi-dimensional value within the bounds of ratio, a new slice is
instantiated (i.e. cloned and rewritten) so that all operations in this instance operate on
the hardware vector type.

As a simple example, given:
```mlir
mlfunc @vector_add_2d(%M : index, %N : index) -> memref<?x?xf32> {
  %A = alloc (%M, %N) : memref<?x?xf32>
  %B = alloc (%M, %N) : memref<?x?xf32>
  %C = alloc (%M, %N) : memref<?x?xf32>
  for %i0 = 0 to %M {
    for %i1 = 0 to %N {
      %a1 = load %A[%i0, %i1] : memref<?x?xf32>
      %b1 = load %B[%i0, %i1] : memref<?x?xf32>
      %s1 = addf %a1, %b1 : f32
      store %s1, %C[%i0, %i1] : memref<?x?xf32>
    }
  }
  return %C : memref<?x?xf32>
}
```

and the following options:
```
-vectorize -virtual-vector-size 32 --test-fastest-varying=0 -materialize-vectors -vector-size=8
```

materialization emits:
```mlir
#map0 = (d0, d1) -> (d0, d1)
#map1 = (d0, d1) -> (d0, d1 + 8)
#map2 = (d0, d1) -> (d0, d1 + 16)
#map3 = (d0, d1) -> (d0, d1 + 24)
mlfunc @vector_add_2d(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %1 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %2 = alloc(%arg0, %arg1) : memref<?x?xf32>
  for %i0 = 0 to %arg0 {
    for %i1 = 0 to %arg1 step 32 {
      %3 = affine_apply #map0(%i0, %i1)
      %4 = "vector_transfer_read"(%0, %3tensorflow/mlir#0, %3tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %5 = affine_apply #map1(%i0, %i1)
      %6 = "vector_transfer_read"(%0, %5tensorflow/mlir#0, %5tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %7 = affine_apply #map2(%i0, %i1)
      %8 = "vector_transfer_read"(%0, %7tensorflow/mlir#0, %7tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %9 = affine_apply #map3(%i0, %i1)
      %10 = "vector_transfer_read"(%0, %9tensorflow/mlir#0, %9tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %11 = affine_apply #map0(%i0, %i1)
      %12 = "vector_transfer_read"(%1, %11tensorflow/mlir#0, %11tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %13 = affine_apply #map1(%i0, %i1)
      %14 = "vector_transfer_read"(%1, %13tensorflow/mlir#0, %13tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %15 = affine_apply #map2(%i0, %i1)
      %16 = "vector_transfer_read"(%1, %15tensorflow/mlir#0, %15tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %17 = affine_apply #map3(%i0, %i1)
      %18 = "vector_transfer_read"(%1, %17tensorflow/mlir#0, %17tensorflow/mlir#1) : (memref<?x?xf32>, index, index) -> vector<8xf32>
      %19 = addf %4, %12 : vector<8xf32>
      %20 = addf %6, %14 : vector<8xf32>
      %21 = addf %8, %16 : vector<8xf32>
      %22 = addf %10, %18 : vector<8xf32>
      %23 = affine_apply #map0(%i0, %i1)
      "vector_transfer_write"(%19, %2, %23tensorflow/mlir#0, %23tensorflow/mlir#1) : (vector<8xf32>, memref<?x?xf32>, index, index) -> ()
      %24 = affine_apply #map1(%i0, %i1)
      "vector_transfer_write"(%20, %2, %24tensorflow/mlir#0, %24tensorflow/mlir#1) : (vector<8xf32>, memref<?x?xf32>, index, index) -> ()
      %25 = affine_apply #map2(%i0, %i1)
      "vector_transfer_write"(%21, %2, %25tensorflow/mlir#0, %25tensorflow/mlir#1) : (vector<8xf32>, memref<?x?xf32>, index, index) -> ()
      %26 = affine_apply #map3(%i0, %i1)
      "vector_transfer_write"(%22, %2, %26tensorflow/mlir#0, %26tensorflow/mlir#1) : (vector<8xf32>, memref<?x?xf32>, index, index) -> ()
    }
  }
  return %2 : memref<?x?xf32>
}
```

PiperOrigin-RevId: 222455351
2019-03-29 14:08:31 -07:00
Nicolas Vasilache 5c16564bca [MLIR][Slicing] Add utils for computing slices.
This CL adds tooling for computing slices as an independent CL.
The first consumer of this analysis will be super-vector materialization in a
followup CL.

In particular, this adds:
1. a getForwardStaticSlice function with documentation, example and a
standalone unit test;
2. a getBackwardStaticSlice function with documentation, example and a
standalone unit test;
3. a getStaticSlice function with documentation, example and a standalone unit
test;
4. a topologicalSort function that is exercised through the getStaticSlice
unit test.

The getXXXStaticSlice functions take an additional root (resp. terminators)
parameter which acts as a boundary that the transitive propagation algorithm
is not allowed to cross.

PiperOrigin-RevId: 222446208
2019-03-29 14:08:02 -07:00
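
A rough standalone sketch of a backward-slice computation in the spirit of getBackwardStaticSlice, with the boundary acting as the barrier the transitive propagation may not cross (hypothetical node type, not the MLIR statement classes):

```
#include <set>
#include <vector>

struct Node {
  std::vector<Node *> definingOps; // producers of this node's operands
};

// Collects the transitive backward slice reachable from `root`, stopping at
// nodes in `boundary` and at nodes already in `slice`.
void getBackwardSlice(Node *root, const std::set<Node *> &boundary,
                      std::set<Node *> &slice) {
  std::vector<Node *> worklist{root};
  while (!worklist.empty()) {
    Node *n = worklist.back();
    worklist.pop_back();
    if (boundary.count(n) || !slice.insert(n).second)
      continue; // do not cross the boundary; skip already-visited nodes
    for (Node *def : n->definingOps)
      worklist.push_back(def);
  }
}
```
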
Uday Bondhugula 2631b155a9 Fix bugs in DMA generation and FlatAffineConstraints; add more test
cases.

- fix bug in calculating index expressions for DMA buffers in certain cases
  (affected tiled loop nests); add more test cases for better coverage.
- introduce an additional optional argument to replaceAllMemRefUsesWith;
  additional operands to the index remap AffineMap can now be supplied by the
  client.
- FlatAffineConstraints::addBoundsForStmt - fix off by one upper bound,
  ::composeMap - fix position bug.
- Some clean up and more comments

PiperOrigin-RevId: 222434628
2019-03-29 14:07:31 -07:00
Alex Zinenko 615c41c788 Introduce Deaffinator pass.
This function pass replaces affine_apply operations in CFG functions with
sequences of primitive arithmetic instructions that form the affine map.

The actual replacement functionality is located in LoweringUtils as a
standalone function operating on an individual affine_apply operation and
inserting the result at the location of the original operation.  It is expected
to be useful for other, target-specific lowering passes that may start at the
MLFunction level, which Deaffinator does not support.

PiperOrigin-RevId: 222406692
2019-03-29 14:07:16 -07:00
Uday Bondhugula b6c03917ad Remove allocations for memref's that become dead as a result of double
buffering in the auto DMA overlap pass.

This is done online in the pass.

PiperOrigin-RevId: 222313640
2019-03-29 14:05:19 -07:00
Uday Bondhugula 0328217eb8 Automated rollback of changelist 221863955.
PiperOrigin-RevId: 222299120
2019-03-29 14:04:05 -07:00
Nicolas Vasilache 87d46aaf4b [MLIR][Vectorize] Refactor Vectorize use-def propagation.
This CL refactors a few things in Vectorize.cpp:
1. a clear distinction is made between:
  a. the LoadOp are the roots of vectorization and must be vectorized
  eagerly and propagate their value; and
  b. the StoreOp which are the terminals of vectorization and must be
  vectorized late (i.e. they do not produce values that need to be
  propagated).
2. the StoreOp must be vectorized late because in general it can store a value
that is not reachable from the subset of loads defined in the
current pattern. One trivial such case is storing a constant defined at the
top-level of the MLFunction, which then needs to be turned into a splat.
3. a description of the algorithm is given;
4. the implementation matches the algorithm;
5. the last example is made parametric, in practice it will fully rely on the
implementation of vector_transfer_read/write which will handle boundary
conditions and padding. This will happen by lowering to a lower-level
abstraction either:
  a. directly in MLIR (whether DMA or just loops or any async tasks in the
     future) (whiteboxing);
  b. in LLO/LLVM-IR/whatever blackbox library call/ search + swizzle inventor
  one may want to use;
  c. a partial mix of a. and b. (grey-boxing)
6. minor cleanups are applied;
7. mistakenly disabled unit tests are re-enabled (oopsie).

With this CL, this MLIR snippet:
```
mlfunc @vector_add_2d(%M : index, %N : index) -> memref<?x?xf32> {
  %A = alloc (%M, %N) : memref<?x?xf32>
  %B = alloc (%M, %N) : memref<?x?xf32>
  %C = alloc (%M, %N) : memref<?x?xf32>
  %f1 = constant 1.0 : f32
  %f2 = constant 2.0 : f32
  for %i0 = 0 to %M {
    for %i1 = 0 to %N {
      // non-scoped %f1
      store %f1, %A[%i0, %i1] : memref<?x?xf32>
    }
  }
  for %i4 = 0 to %M {
    for %i5 = 0 to %N {
      %a5 = load %A[%i4, %i5] : memref<?x?xf32>
      %b5 = load %B[%i4, %i5] : memref<?x?xf32>
      %s5 = addf %a5, %b5 : f32
      // non-scoped %f1
      %s6 = addf %s5, %f1 : f32
      store %s6, %C[%i4, %i5] : memref<?x?xf32>
    }
  }
  return %C : memref<?x?xf32>
}
```

vectorized with these arguments:
```
-vectorize -virtual-vector-size 256 --test-fastest-varying=0
```

vectorization produces this standard innermost-loop vectorized code:
```
mlfunc @vector_add_2d(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %1 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %2 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %cst = constant 1.000000e+00 : f32
  %cst_0 = constant 2.000000e+00 : f32
  for %i0 = 0 to %arg0 {
    for %i1 = 0 to %arg1 step 256 {
      %cst_1 = constant splat<vector<256xf32>, 1.000000e+00> : vector<256xf32>
      "vector_transfer_write"(%cst_1, %0, %i0, %i1) : (vector<256xf32>, memref<?x?xf32>, index, index) -> ()
    }
  }
  for %i2 = 0 to %arg0 {
    for %i3 = 0 to %arg1 step 256 {
      %3 = "vector_transfer_read"(%0, %i2, %i3) : (memref<?x?xf32>, index, index) -> vector<256xf32>
      %4 = "vector_transfer_read"(%1, %i2, %i3) : (memref<?x?xf32>, index, index) -> vector<256xf32>
      %5 = addf %3, %4 : vector<256xf32>
      %cst_2 = constant splat<vector<256xf32>, 1.000000e+00> : vector<256xf32>
      %6 = addf %5, %cst_2 : vector<256xf32>
      "vector_transfer_write"(%6, %2, %i2, %i3) : (vector<256xf32>, memref<?x?xf32>, index, index) -> ()
    }
  }
  return %2 : memref<?x?xf32>
}
```

Of course, much more intricate n-D imperfectly-nested patterns can be emitted too in a fully declarative fashion, but this is enough for now.

PiperOrigin-RevId: 222280209
2019-03-29 14:03:50 -07:00
Alex Zinenko f986d5920b ConvertToCFG: handle 1D affine loop bounds.
In the general case, loop bounds can be expressed as affine maps of the outer
loop iterators and function arguments.  Relax the check for loop bounds to be
known integer constants and also accept one-dimensional affine bounds in
ConvertToCFG ForStmt lowering.  Emit affine_apply operations for both the upper
and the lower bound.  The semantics of MLFunctions guarantees that both bounds
can be computed before the loop starts iterating.  Constant bounds are merely a
short-hand notation for zero-dimensional affine maps and get supported
transparently.

Multidimensional affine bounds are not yet supported because the target IR
dialect lacks min/max operations necessary to implement the corresponding
semantics.

PiperOrigin-RevId: 222275801
2019-03-29 14:03:20 -07:00
Nicolas Vasilache 89d9913a20 [MLIR][VectorAnalysis] Add a VectorAnalysis and standalone tests
This CL adds some vector support in prevision of the upcoming vector
materialization pass. In particular this CL adds 2 functions to:
1. compute the multiplicity of a subvector shape in a supervector shape;
2. help match operations on strict super-vectors. This is defined for a given
subvector shape as an operation that manipulates a vector type that is an
integral multiple of the subtype, with multiplicity at least 2.

This CL also adds a TestUtil pass where we can dump arbitrary testing of
functions and analysis that operate at a much smaller granularity than a pass
(e.g. an analysis for which it is convenient to write a bit of artificial MLIR
and write some custom test). This is in order to keep using Filecheck for
things that essentially look and feel like C++ unit tests.

PiperOrigin-RevId: 222250910
2019-03-29 14:02:17 -07:00
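
The multiplicity of point 1 can be sketched as a per-dimension ratio of the supervector shape to the subvector shape with a divisibility check (assumed right-alignment of trailing dimensions; not the actual VectorAnalysis code):

```
#include <cstddef>
#include <optional>
#include <vector>

// Returns the per-dimension ratio of `superShape` to `subShape` when the
// subvector shape divides the trailing dimensions of the supervector shape,
// and std::nullopt otherwise. E.g. super 32x64x256, sub 8x128 -> {32, 8, 2}.
std::optional<std::vector<int>> shapeRatio(const std::vector<int> &superShape,
                                           const std::vector<int> &subShape) {
  if (subShape.size() > superShape.size())
    return std::nullopt;
  std::vector<int> ratio(superShape.begin(), superShape.end());
  std::size_t offset = superShape.size() - subShape.size();
  for (std::size_t i = 0; i < subShape.size(); ++i) {
    int superDim = superShape[offset + i];
    if (superDim % subShape[i] != 0)
      return std::nullopt; // not an integral multiple of the subvector shape
    ratio[offset + i] = superDim / subShape[i];
  }
  return ratio;
}
```
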
Uday Bondhugula fff1efbaf5 Updates to transformation/analysis passes/utilities. Update DMA generation pass
and getMemRefRegion() to work with specified loop depths; add support for
outgoing DMAs, store op's.

- add support for getMemRefRegion symbolic in outer loops - hence support for
  DMAs symbolic in outer surrounding loops.

- add DMA generation support for outgoing DMAs (store op's to lower memory
  space); extend getMemoryRegion to store op's. -memref-bound-check now works
  with store op's as well.

- fix dma-generate (references to the old memref in the dma_start op were also
  being replaced with the new buffer); we need replaceAllMemRefUsesWith to work
  on only a subset of the uses. Update replaceAllMemRefUsesWith to take an optional
  'operation' argument to serve as a filter - if provided, only those uses that
  are dominated by the filter are replaced.

- Add missing print for attributes for dma_start, dma_wait op's.

- update the FlatAffineConstraints API

PiperOrigin-RevId: 221889223
2019-03-29 14:00:51 -07:00
Uday Bondhugula 6b52ac3aa6 Mark AllocOp as being free of side effects
PiperOrigin-RevId: 221863955
2019-03-29 14:00:37 -07:00
Alex Zinenko d030433443 ConvertToCFG: properly remap nested function attributes.
Array attributes can be nested and function attributes can appear anywhere at that
level.  They should be remapped to point to the generated CFGFunction after
ML-to-CFG conversion, similarly to plain function attributes.  Extract the
nested attribute remapping functionality from the Parser to Utils.  Extract out
the remapping function for individual Functions from the module remapping
function.  Use these new functions in the ML-to-CFG conversion pass and in the
parser.

PiperOrigin-RevId: 221510997
2019-03-29 13:57:58 -07:00
Nicolas Vasilache fefbf91314 [MLIR] Support for vectorizing operations.
This CL adds support for and a vectorization test to perform scalar 2-D addf.

The support extension notably comprises:
1. extend vectorizable test to exclude vector_transfer operations and
expose them to LoopAnalysis where they are needed. This is a temporary
solution until a concrete MLIR Op exists;
2. add some more functional sugar mapKeys, apply and ScopeGuard (which became
relevant again);
3. fix improper shifting during coarsening;
4. rename unaligned load/store to vector_transfer_read/write and simplify the
design, removing the unnecessary AllocOps that were introduced prematurely:
vector_transfer_read currently has the form:
  (memref<?x?x?xf32>, index, index, index) -> vector<32x64x256xf32>
vector_transfer_write currently has the form:
  (vector<32x64x256xf32>, memref<?x?x?xf32>, index, index, index) -> ()
5. adds vectorizeOperations which traverses the operations in a ForStmt and
rewrites them to their vector form;
6. add support for vector splat from a constant.

The relevant tests are also updated.

PiperOrigin-RevId: 221421426
2019-03-29 13:56:47 -07:00