Commit Graph

341 Commits

Author SHA1 Message Date
River Riddle a32f0dcb5d Add support to GreedyPatternRewriter for erasing unreachable blocks.
Rewrite patterns may make modifications to the CFG, including dropping edges between blocks. This change adds a simple unreachable block elimination run at the end of each iteration to ensure that the CFG remains valid.

PiperOrigin-RevId: 277545805
2019-10-30 11:19:24 -07:00
River Riddle 2f4d0c085a Add support for marking an operation as recursively legal.
In some cases, it may be desirable to mark entire regions of operations as legal. This provides an additional granularity of context to the concept of "legal". The `ConversionTarget` supports marking operations, that were previously added as `Legal` or `Dynamic`, as `recursively` legal. Recursive legality means that if an operation instance is legal, either statically or dynamically, all of the operations nested within are also considered legal. An operation can be marked via `markOpRecursivelyLegal<>`:

```c++
ConversionTarget &target = ...;

/// The operation must first be marked as `Legal` or `Dynamic`.
target.addLegalOp<MyOp>(...);
target.addDynamicallyLegalOp<MySecondOp>(...);

/// Mark the operation as always recursively legal.
target.markOpRecursivelyLegal<MyOp>();
/// Mark optionally with a callback to allow selective marking.
target.markOpRecursivelyLegal<MyOp, MySecondOp>([](Operation *op) { ... });
/// Mark optionally with a callback to allow selective marking.
target.markOpRecursivelyLegal<MyOp>([](MyOp op) { ... });
```

PiperOrigin-RevId: 277086382
2019-10-28 10:04:34 -07:00
River Riddle 2b61b7979e Convert the Canonicalize and CSE passes to generic Operation Passes.
This allows for them to be used on other non-function, or even other function-like, operations. The algorithms are already generic, so this is simply changing the derived pass type. The majority of this change is just ensuring that the nesting of these passes remains the same, as the pass manager won't auto-nest them anymore.

PiperOrigin-RevId: 276573038
2019-10-24 15:01:09 -07:00
River Riddle 21ee4e987f Add @below and @above directives to verify-diagnostics.
This simplifies defining expected-* directives when there are multiple that apply to the next or previous line. @below applies the directive to the next non-designator line, i.e. the next line that does not contain an expected-* designator. @above applies to the previous non designator line.

Examples:

// Expect an error on the next line that does not contain a designator.
// expected-remark@below {{remark on function below}}
// expected-remark@below {{another remark on function below}}
func @bar(%a : f32)

// Expect an error on the previous line that does not contain a designator.
func @baz(%a : f32)
// expected-remark@above {{remark on function above}}
// expected-remark@above {{another remark on function above}}

PiperOrigin-RevId: 276369085
2019-10-23 15:56:29 -07:00
Kazuaki Ishizaki f28c5aca17 Fix minor spelling tweaks (NFC)
Closes tensorflow/mlir#175

PiperOrigin-RevId: 275726876
2019-10-20 09:44:36 -07:00
Nicolas Vasilache 9e7e297da3 Lower vector transfer ops to loop.for operations.
This allows mixing linalg operations with vector transfer operations (with additional modifications to affine ops) and is a step towards solving tensorflow/mlir#189.

PiperOrigin-RevId: 275543361
2019-10-18 14:10:10 -07:00
Stephan Herhut b843cc5d5a Implement simple loop-invariant-code-motion based on dialect interfaces.
PiperOrigin-RevId: 275004258
2019-10-16 04:28:38 -07:00
River Riddle 96de7091bc Allowing replacing non-root operations in DialectConversion.
When dealing with regions, or other patterns that need to generate temporary operations, it is useful to be able to replace other operations than the root op being matched. Before this PR, these operations would still be considered for legalization meaning that the conversion would either fail, erroneously need to mark these ops as legal, or add unnecessary patterns.

PiperOrigin-RevId: 274598513
2019-10-14 10:01:59 -07:00
River Riddle 6b1cc3c6ea Add support for canonicalizing callable regions during inlining.
This will allow for inlining newly devirtualized calls, as well as give a more accurate cost model(when we have one). Currently canonicalization will only run for nodes that have no child edges, as the child nodes may be erased during canonicalization. We can support this in the future, but it requires more intricate deletion tracking.

PiperOrigin-RevId: 274011386
2019-10-10 17:06:33 -07:00
River Riddle 438dc176b1 Remove the need to convert operations in regions of operations that have been replaced.
When an operation with regions gets replaced, we currently require that all of the remaining nested operations are still converted even though they are going to be replaced when the rewrite is finished. This cl adds a tracking for a minimal set of operations that are known to be "dead". This allows for ignoring the legalization of operations that are won't survive after conversion.

PiperOrigin-RevId: 274009003
2019-10-10 17:06:25 -07:00
Parker Schuh 309b4556d0 Add test for fix to tablegen for custom folders for ops that return a single
variadic result.

Add missing test for single line fix to `void OpEmitter::genFolderDecls()`
entitled "Fold away reduction over 0 dimensions."

PiperOrigin-RevId: 273880337
2019-10-09 20:44:30 -07:00
Diego Caballero 3451055614 Add support for some multi-store cases in affine fusion
This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
 1. have only one store that writes to a function-local live out, and
 2. the remaining stores are involved in loop nest self dependences
    or no dependences within the function.

Closes tensorflow/mlir#162

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45
PiperOrigin-RevId: 273773907
2019-10-09 10:37:30 -07:00
MLIR Team 7446151236 Add Instance Specific Pass Options.
This allows individual passes to define options structs and for these options to be parsed per instance of the pass while building the pass pipeline from the command line provided textual specification.

The user can specify these per-instance pipeline options like so:
```
struct MyPassOptions : public PassOptions<MyPassOptions> {
  Option<int> exampleOption{*this, "flag-name", llvm:🆑:desc("...")};
  List<int> exampleListOption{*this, "list-flag-name", llvm:🆑:desc("...")};
};

static PassRegistration<MyPass, MyPassOptions> pass("my-pass", "description");
```

PiperOrigin-RevId: 273650140
2019-10-08 18:23:43 -07:00
River Riddle 49b29dd186 Add a PatternRewriter hook for cloning a region into another.
This is similar to the `inlineRegionBefore` hook, except the original blocks are unchanged. The region to be cloned *must* not have been modified during the conversion process at the point of cloning, i.e. it must belong an operation that has yet to be converted, or the operation that is currently being converted.

PiperOrigin-RevId: 273622533
2019-10-08 15:45:08 -07:00
Uday Bondhugula 6136f33d59 unroll and jam: fix order of jammed bodies
- bodies would earlier appear in the order (i, i+3, i+2, i+1) instead of
  (i, i+1, i+2, i+3) for example for factor 4.

- clean up hardcoded test cases

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#170

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/170 from bondhugula:ujam b66b405b2b1894a03b376952e32a9d0292042665
PiperOrigin-RevId: 273613131
2019-10-08 15:13:11 -07:00
Uday Bondhugula 89e7a76a1c fix simplify-affine-structures bug
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#157

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/157 from bondhugula:quickfix bd1fcd79825fc0bd5b4a3e688153fa0993ab703d
PiperOrigin-RevId: 273316498
2019-10-07 10:04:50 -07:00
River Riddle 5830f71a45 Add support for inlining calls with different arg/result types from the callable.
Some dialects have implicit conversions inherent in their modeling, meaning that a call may have a different type that the type that the callable expects. To support this, a hook is added to the dialect interface that allows for materializing conversion operations during inlining when there is a mismatch. A hook is also added to the callable interface to allow for introspecting the expected result types.

PiperOrigin-RevId: 272814379
2019-10-03 23:10:51 -07:00
River Riddle a20d96e436 Update the Inliner pass to work on SCCs of the CallGraph.
This allows for the inliner to work on arbitrary call operations. The updated inliner will also work bottom-up through the callgraph enabling support for multiple levels of inlining.

PiperOrigin-RevId: 272813876
2019-10-03 23:05:21 -07:00
Uday Bondhugula 458ede8775 Introduce splat op + provide its LLVM lowering
- introduce splat op in standard dialect (currently for int/float/index input
  type, output type can be vector or statically shaped tensor)
- implement LLVM lowering (when result type is 1-d vector)
- add constant folding hook for it
- while on Ops.cpp, fix some stale names

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#141

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/141 from bondhugula:splat 48976a6aa0a75be6d91187db6418de989e03eb51
PiperOrigin-RevId: 270965304
2019-09-24 12:44:58 -07:00
Uday Bondhugula f559c38c28 Upgrade/fix/simplify store to load forwarding
- fix store to load forwarding for a certain set of cases (where
  forwarding shouldn't have happened); use AffineValueMap difference
  based MemRefAccess equality checking; utility logic is also greatly
  simplified

- add missing equality/inequality operators for AffineExpr ==/!= ints

- add == != operators on MemRefAccess

Closes tensorflow/mlir#136

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/136 from bondhugula:store-load-forwarding d79fd1add8bcfbd9fa71d841a6a9905340dcd792
PiperOrigin-RevId: 270457011
2019-09-21 10:08:56 -07:00
Uday Bondhugula 727a50ae2d Support symbolic operands for memref replacement; fix memrefNormalize
- allow symbols in index remapping provided for memref replacement
- fix memref normalize crash on cases with layout maps with symbols

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Reported by: Alex Zinenko

Closes tensorflow/mlir#139

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/139 from bondhugula:memref-rep-symbols 2f48c1fdb5d4c58915bbddbd9f07b18541819233
PiperOrigin-RevId: 269851182
2019-09-18 11:26:11 -07:00
Uday Bondhugula bd7de6d4df Add rewrite pattern to compose maps into affine load/stores
- add canonicalization pattern to compose maps into affine loads/stores;
  templatize the pattern and reuse it for affine.apply as well

- rename getIndices -> getMapOperands() (getIndices is confusing since
  these are no longer the indices themselves but operands to the map
  whose results are the indices). This also makes the accessor uniform
  across affine.apply/load/store. Change arg names on the affine
  load/store builder to avoid confusion. Drop an unused confusing build
  method on AffineStoreOp.

- update incomplete doc comment for canonicalizeMapAndOperands (this was
  missed from a previous update).

Addresses issue tensorflow/mlir#121

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#122

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/122 from bondhugula:compose-load-store e71de1771e56a85c4282c10cb43f30cef0701c4f
PiperOrigin-RevId: 269619540
2019-09-17 11:49:45 -07:00
River Riddle 9619ba10d4 Add support for multi-level value mapping to DialectConversion.
When performing A->B->C conversion, an operation may still refer to an operand of A. This makes it necessary to unmap through multiple levels of replacement for a specific value.

PiperOrigin-RevId: 269367859
2019-09-16 10:38:19 -07:00
Uday Bondhugula 1366467a3b update normalizeMemRef utility; handle missing failure check + add more tests
- take care of symbolic operands with alloc
- add missing check for compose map failure and a test case
- add test cases on strides
- drop incorrect check for one-to-one'ness

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#132

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/132 from bondhugula:normalize-memrefs 8aebf285fb0d7c19269d85255aed644657e327b7
PiperOrigin-RevId: 269105947
2019-09-14 13:21:35 -07:00
Uday Bondhugula 018cfa94d9 Clean up build trip count analysis method - avoid mutating IR
- NFC - on any pass/utility logic/output.

- Resolve TODO; the method building loop trip count maps was
  creating and deleting affine.apply ops (transforming IR from under
  analysis!, strictly speaking). Introduce AffineValueMap::difference to
  do this correctly (without the need to create any IR).

- Move AffineApplyNormalizer out so that its methods are reusable from
  AffineStructures.cpp; add a helper method 'normalize' to it. Fix
  AffineApplyNormalize::renumberOneDim (Issue tensorflow/mlir#89).

- Trim includes on files touched.

- add test case on a scenario previously not covered

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#133

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/133 from bondhugula:trip-count-build 7fc34d857f7788f98b641792cafad6f5bd50e47b
PiperOrigin-RevId: 269101118
2019-09-14 12:10:55 -07:00
Uday Bondhugula 1e6a93b7ca add missing memref cast fold pattern for dim op
- add missing canonicalization pattern to fold memref_cast + dim to
  dim (needed to propagate constant when folding a dynamic shape to
  a static one)

- also fix an outdated/inconsistent comment in StandardOps/Ops.td

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#126

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/126 from bondhugula:quickfix 4566e75e49685c532faffff91d64c5d83d4da524
PiperOrigin-RevId: 269020058
2019-09-13 18:18:48 -07:00
Geoffrey Martin-Noble 2ccbb3f1ce Cmpf constant folding for nan and inf
PiperOrigin-RevId: 268783645
2019-09-12 15:43:59 -07:00
Geoffrey Martin-Noble f39a599e46 NFC: Clean up constant fold tests
Use variable captures to make constant folding tests less sensitive to printer/parser implementation details.

See guidelines at https://github.com/tensorflow/mlir/blob/master/g3doc/TestingGuide.md

PiperOrigin-RevId: 268780812
2019-09-12 15:30:58 -07:00
River Riddle 0ba0087887 Add the initial inlining infrastructure.
This defines a set of initial utilities for inlining a region(or a FuncOp), and defines a simple inliner pass for testing purposes.
A new dialect interface is defined, DialectInlinerInterface, that allows for dialects to override hooks controlling inlining legality. The interface currently provides the following hooks, but these are just premilinary and should be changed/added to/modified as necessary:

* isLegalToInline
  - Determine if a region can be inlined into one of this dialect, *or* if an operation of this dialect can be inlined into a given region.

* shouldAnalyzeRecursively
  - Determine if an operation with regions should be analyzed recursively for legality. This allows for child operations to be closed off from the legality checks for operations like lambdas.

* handleTerminator
  - Process a terminator that has been inlined.

This cl adds support for inlining StandardOps, but other dialects will be added in followups as necessary.

PiperOrigin-RevId: 267426759
2019-09-05 12:24:13 -07:00
Uday Bondhugula 8c9dc690eb pipeline-data-transfer: remove dead tag alloc's and improve test coverage for replaceMemRefUsesWith / pipeline-data-transfer
- address remaining comments from PR tensorflow/mlir#87 for better test coverage for
  pipeline-data-transfer/replaceAllMemRefUsesWith
- remove dead tag allocs the same way they are removed for the replaced buffers

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#106

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/106 from bondhugula:followup 9e868666d047e8d43e5f82f43e4093b838c710fa
PiperOrigin-RevId: 267144774
2019-09-04 06:59:09 -07:00
Uday Bondhugula 54d674f51e Utility to normalize memrefs with non-identity layout maps
- introduce utility to convert memrefs with non-identity layout maps to
  ones with identity layout maps: convert the type and rewrite/remap all
  its uses

- add this utility to -simplify-affine-structures pass for testing
  purposes

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#104

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/104 from bondhugula:memref-normalize f2c914aa1890e8860326c9e33f9aa160b3d65e6d
PiperOrigin-RevId: 266985317
2019-09-03 12:14:28 -07:00
Uday Bondhugula b1ef9dc22c Fix affine data copy generation corner cases/bugs
- the [begin, end) range identified for copying could end in between the
  block, which makes hoisting invalid in some cases. Change the range
  identification to always end with end of block.

- add test case to exercise these (with fast mem capacity set to minimal so
  that single element memref buffers are generated at the innermost loop)

- the location of begin/end of the block range for data copying was
  being confused with the insert points for copy in and copy out code.
  In cases, where we choose to hoist transfers, these are separate.

- when copy loops are single iteration ones, promote their bodies at
  the end of the pass.

- change default fast mem space to 1 (setting it to zero made it
  generate DMA op's that won't verify in the default case - since the
  DMA ops have a check for src/dest memref spaces being different).

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Co-Authored-By: Mehdi Amini <joker.eph@gmail.com>

Closes tensorflow/mlir#88

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/88 from bondhugula:datacopy 88697267c45e850c3ced87671e16e4a930c02a42
PiperOrigin-RevId: 266980911
2019-09-03 11:53:16 -07:00
River Riddle 6563b1c446 Add a new dialect interface for the OperationFolder `OpFolderDialectInterface`.
This interface will allow for providing hooks to interrop with operation folding. The first hook, 'shouldMaterializeInto', will allow for controlling which region to insert materialized constants into. The folder will generally materialize constants into the top-level isolated region, this allows for materializing into a lower level ancestor region if it is more profitable/correct.

PiperOrigin-RevId: 266702972
2019-09-01 20:07:08 -07:00
River Riddle 9c8a8a7d0d Add a canonicalization to erase empty AffineForOps.
AffineForOp themselves are pure and can be removed if there are no internal operations.

PiperOrigin-RevId: 266481293
2019-08-30 16:49:32 -07:00
Uday Bondhugula 4bb6f8ecdb Extend map canonicalization to propagate constant operands
- extend canonicalizeMapAndOperands to propagate constant operands into
  the map's expressions (and thus drop those operands).
- canonicalizeMapAndOperands previously only dropped duplicate and
  unused operands; however, operands that were constants were
  retained.

This change makes IR maps/expressions generated by various
utilities/passes even simpler; also makes some of the test checks more
accurate and simpler -- for eg., 0' instead of symbol(%{{.*}}).

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#107

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/107 from bondhugula:canonicalize-maps c889a51486d14fbf7db489f224f881e7e1ff7d72
PiperOrigin-RevId: 266085289
2019-08-29 01:13:29 -07:00
Uday Bondhugula bc2a543225 fix loop unroll and jam - operand mapping - imperfect nest case
- fix operand mapping while cloning sub-blocks to jam - was incorrect
  for imperfect nests where def/use was across sub-blocks
- strengthen/generalize the first test case to cover the previously
  missed scenario
- clean up the other cases while on this.

Previously, unroll-jamming the following nest
```
    affine.for %arg0 = 0 to 2048 {
      %0 = alloc() : memref<512x10xf32>
      affine.for %arg1 = 0 to 10 {
        %1 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
      }
      dealloc %0 : memref<512x10xf32>
    }
```

would yield

```
      %0 = alloc() : memref<512x10xf32>
      %1 = affine.apply #map0(%arg0)
      %2 = alloc() : memref<512x10xf32>
      affine.for %arg1 = 0 to 10 {
        %4 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
        %5 = affine.apply #map0(%arg0)
        %6 = affine.load %0[%5, %arg1] : memref<512x10xf32>
      }
      dealloc %0 : memref<512x10xf32>
      %3 = affine.apply #map0(%arg0)
      dealloc %0 : memref<512x10xf32>

```

instead of

```

module {
    affine.for %arg0 = 0 to 2048 step 2 {
      %0 = alloc() : memref<512x10xf32>
      %1 = affine.apply #map0(%arg0)
      %2 = alloc() : memref<512x10xf32>
      affine.for %arg1 = 0 to 10 {
        %4 = affine.load %0[%arg0, %arg1] : memref<512x10xf32>
        %5 = affine.apply #map0(%arg0)
        %6 = affine.load %2[%5, %arg1] : memref<512x10xf32>
      }
      dealloc %0 : memref<512x10xf32>
      %3 = affine.apply #map0(%arg0)
      dealloc %2 : memref<512x10xf32>
    }
```

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#98

COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/98 from bondhugula:ujam ddbc853f69b5608b3e8ff9b5ac1f6a5a0bb315a4
PiperOrigin-RevId: 266073460
2019-08-28 23:42:50 -07:00
Uday Bondhugula aa2cee9cf5 Refactor / improve replaceAllMemRefUsesWith
Refactor replaceAllMemRefUsesWith to split it into two methods: the new
method does the replacement on a single op, and is used by the existing
one.

- make the methods return LogicalResult instead of bool

- Earlier, when replacement failed (due to non-deferencing uses of the
  memref), the set of ops that had already been processed would have
  been replaced leaving the IR in an inconsistent state. Now, a
  pass is made over all ops to first check for non-deferencing
  uses, and then replacement is performed. No test cases were affected
  because all clients of this method were first checking for
  non-deferencing uses before calling this method (for other reasons).
  This isn't true for a use case in another upcoming PR (scalar
  replacement); clients can now bail out with consistent IR on failure
  of replaceAllMemRefUsesWith. Add test case.

- multiple deferencing uses of the same memref in a single op is
  possible (we have no such use cases/scenarios), and this has always
  remained unsupported. Add an assertion for this.

- minor fix to another test pipeline-data-transfer case.

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#87

PiperOrigin-RevId: 265808183
2019-08-27 17:56:56 -07:00
Andy Ly 6a501e3d1b Support folding of ops with inner ops in GreedyPatternRewriteDriver.
This fixes a bug when folding ops with inner ops and inner ops are still being visited.

PiperOrigin-RevId: 265475780
2019-08-26 09:44:39 -07:00
River Riddle 305516fcd3 Allow isolated regions to form isolated SSA name scopes in the printer.
This will allow for naming values the same as existing SSA values for regions attached to operations that are isolated from above. This fits in with how the system already allows separate name scopes for sibling regions. This name shadowing can be enabled in the custom parser of operations by setting the 'enableNameShadowing' flag to true when calling 'parseRegion'.

%arg = constant 10 : i32
foo.op {
  %arg = constant 10 : i32
}

PiperOrigin-RevId: 264255999
2019-08-19 15:27:10 -07:00
River Riddle a481032a33 Refactor ElementsAttr::getValue and DenseElementsAttr::getSplatValue.
All 'getValue' variants now require that the index is valid, queryable via 'isValidIndex'. 'getSplatValue' now requires that the attribute is a proper splat. This allows for querying these methods on DenseElementAttr with all possible value types; e.g. float, int, APInt, etc. This also allows for removing unnecessary conversions to Attribute that really want the underlying value.

PiperOrigin-RevId: 263437337
2019-08-14 15:03:53 -07:00
Andy Ly 55f2e24ab3 Remove ops in regions/blocks from worklist when parent op is being removed via GreedyPatternRewriteDriver::replaceOp.
This fixes a bug where ops inside the parent op are visited even though the parent op has been removed.

PiperOrigin-RevId: 261953580
2019-08-06 11:08:54 -07:00
Uday Bondhugula 18b8d4352b Introduce explicit copying optimization by generalizing the DMA generation pass
Explicit copying to contiguous buffers is a standard technique to avoid
conflict misses and TLB misses, and improve hardware prefetching
performance. When done in conjunction with cache tiling, it nearly
eliminates all cache conflict and TLB misses, and a single hardware
prefetch stream is needed per data tile.

- generalize/extend DMA generation pass (renamed data copying pass) to
  perform either point-wise explicit copies to fast memory buffers or
  DMAs (depending on a cmd line option). All logic is the same as
  erstwhile -dma-generate.

- -affine-dma-generate is now renamed -affine-data-copy; when -dma flag is
  provided, DMAs are generated, or else explicit copy loops are generated
  (point-wise) by default.

- point-wise copying could be used for CPUs (or GPUs); some indicative
  performance numbers with a "C" version of the MLIR when compiled with
  and without this optimization (about 2x improvement here).

  With a matmul on 4096^2 matrices on a single core of an Intel Core i7
  Skylake i7-8700K with clang 8.0.0:

  clang -O3:                       518s
  clang -O3 with MLIR tiling (128x128):      24.5s
  clang -O3 with MLIR tiling + data copying  12.4s
  (code equivalent to test/Transforms/data-copy.mlir func @matmul)

- fix some misleading comments.

- change default fast-mem space to 0 (more intuitive now with the
  default copy generation using point-wise copies instead of DMAs)

On a simple 3-d matmul loop nest, code generated with -affine-data-copy:

```
  affine.for %arg3 = 0 to 4096 step 128 {
    affine.for %arg4 = 0 to 4096 step 128 {
      %0 = affine.apply #map0(%arg3, %arg4)
      %1 = affine.apply #map1(%arg3, %arg4)
      %2 = alloc() : memref<128x128xf32, 2>
      // Copy-in Out matrix.
      affine.for %arg5 = 0 to 128 {
        %5 = affine.apply #map2(%arg3, %arg5)
        affine.for %arg6 = 0 to 128 {
          %6 = affine.apply #map2(%arg4, %arg6)
          %7 = load %arg2[%5, %6] : memref<4096x4096xf32>
          affine.store %7, %2[%arg5, %arg6] : memref<128x128xf32, 2>
        }
      }
      affine.for %arg5 = 0 to 4096 step 128 {
        %5 = affine.apply #map0(%arg3, %arg5)
        %6 = affine.apply #map1(%arg3, %arg5)
        %7 = alloc() : memref<128x128xf32, 2>
        // Copy-in LHS.
        affine.for %arg6 = 0 to 128 {
          %11 = affine.apply #map2(%arg3, %arg6)
          affine.for %arg7 = 0 to 128 {
            %12 = affine.apply #map2(%arg5, %arg7)
            %13 = load %arg0[%11, %12] : memref<4096x4096xf32>
            affine.store %13, %7[%arg6, %arg7] : memref<128x128xf32, 2>
          }
        }
        %8 = affine.apply #map0(%arg5, %arg4)
        %9 = affine.apply #map1(%arg5, %arg4)
        %10 = alloc() : memref<128x128xf32, 2>
        // Copy-in RHS.
        affine.for %arg6 = 0 to 128 {
          %11 = affine.apply #map2(%arg5, %arg6)
          affine.for %arg7 = 0 to 128 {
            %12 = affine.apply #map2(%arg4, %arg7)
            %13 = load %arg1[%11, %12] : memref<4096x4096xf32>
            affine.store %13, %10[%arg6, %arg7] : memref<128x128xf32, 2>
          }
        }
        // Compute.
        affine.for %arg6 = #map7(%arg3) to #map8(%arg3) {
          affine.for %arg7 = #map7(%arg4) to #map8(%arg4) {
            affine.for %arg8 = #map7(%arg5) to #map8(%arg5) {
              %11 = affine.load %7[-%arg3 + %arg6, -%arg5 + %arg8] : memref<128x128xf32, 2>
              %12 = affine.load %10[-%arg5 + %arg8, -%arg4 + %arg7] : memref<128x128xf32, 2>
              %13 = affine.load %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
              %14 = mulf %11, %12 : f32
              %15 = addf %13, %14 : f32
              affine.store %15, %2[-%arg3 + %arg6, -%arg4 + %arg7] : memref<128x128xf32, 2>
            }
          }
        }
        dealloc %10 : memref<128x128xf32, 2>
        dealloc %7 : memref<128x128xf32, 2>
      }
      %3 = affine.apply #map0(%arg3, %arg4)
      %4 = affine.apply #map1(%arg3, %arg4)
      // Copy out result matrix.
      affine.for %arg5 = 0 to 128 {
        %5 = affine.apply #map2(%arg3, %arg5)
        affine.for %arg6 = 0 to 128 {
          %6 = affine.apply #map2(%arg4, %arg6)
          %7 = affine.load %2[%arg5, %arg6] : memref<128x128xf32, 2>
          store %7, %arg2[%5, %6] : memref<4096x4096xf32>
        }
      }
      dealloc %2 : memref<128x128xf32, 2>
    }
  }
```

With -affine-data-copy -dma:

```
  affine.for %arg3 = 0 to 4096 step 128 {
    %0 = affine.apply #map3(%arg3)
    %1 = alloc() : memref<128xf32, 2>
    %2 = alloc() : memref<1xi32>
    affine.dma_start %arg2[%arg3], %1[%c0], %2[%c0], %c128_0 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
    affine.dma_wait %2[%c0], %c128_0 : memref<1xi32>
    %3 = alloc() : memref<1xi32>
    affine.for %arg4 = 0 to 4096 step 128 {
      %5 = affine.apply #map0(%arg3, %arg4)
      %6 = affine.apply #map1(%arg3, %arg4)
      %7 = alloc() : memref<128x128xf32, 2>
      %8 = alloc() : memref<1xi32>
      affine.dma_start %arg0[%arg3, %arg4], %7[%c0, %c0], %8[%c0], %c16384, %c4096, %c128_2 : memref<4096x4096xf32>, memref<128x128xf32, 2>, memref<1xi32>
      affine.dma_wait %8[%c0], %c16384 : memref<1xi32>
      %9 = affine.apply #map3(%arg4)
      %10 = alloc() : memref<128xf32, 2>
      %11 = alloc() : memref<1xi32>
      affine.dma_start %arg1[%arg4], %10[%c0], %11[%c0], %c128_1 : memref<4096xf32>, memref<128xf32, 2>, memref<1xi32>
      affine.dma_wait %11[%c0], %c128_1 : memref<1xi32>
      affine.for %arg5 = #map3(%arg3) to #map5(%arg3) {
        affine.for %arg6 = #map3(%arg4) to #map5(%arg4) {
          %12 = affine.load %7[-%arg3 + %arg5, -%arg4 + %arg6] : memref<128x128xf32, 2>
          %13 = affine.load %10[-%arg4 + %arg6] : memref<128xf32, 2>
          %14 = affine.load %1[-%arg3 + %arg5] : memref<128xf32, 2>
          %15 = mulf %12, %13 : f32
          %16 = addf %14, %15 : f32
          affine.store %16, %1[-%arg3 + %arg5] : memref<128xf32, 2>
        }
      }
      dealloc %11 : memref<1xi32>
      dealloc %10 : memref<128xf32, 2>
      dealloc %8 : memref<1xi32>
      dealloc %7 : memref<128x128xf32, 2>
    }
    %4 = affine.apply #map3(%arg3)
    affine.dma_start %1[%c0], %arg2[%arg3], %3[%c0], %c128 : memref<128xf32, 2>, memref<4096xf32>, memref<1xi32>
    affine.dma_wait %3[%c0], %c128 : memref<1xi32>
    dealloc %3 : memref<1xi32>
    dealloc %2 : memref<1xi32>
    dealloc %1 : memref<128xf32, 2>
  }
```

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#50

PiperOrigin-RevId: 261221903
2019-08-01 16:31:58 -07:00
Nicolas Vasilache 54175c240a Fix backward slice corner case
In the backward slice computation, BlockArgument coming from function arguments represent a natural boundary for the traversal and should not trigger llvm_unreachable.
This CL also improves the error message and adds a relevant test.

PiperOrigin-RevId: 260118630
2019-07-26 03:49:17 -07:00
Nicolas Vasilache fae4d94990 Use "standard" load and stores in LowerVectorTransfers
Clipping creates non-affine memory accesses, use std_load and std_store instead of affine_load and affine_store.
In the future we may also want a fill with the neutral element rather than clip, this would make the accesses affine if we wanted more analyses and transformations to happen post lowering to pointwise copies.

PiperOrigin-RevId: 260110503
2019-07-26 02:34:24 -07:00
River Riddle 1293708473 Add support for an analysis mode to DialectConversion.
This mode analyzes which operations are legalizable to the given target if a conversion were to be applied, i.e. no rewrites are ever performed even on success. This mode is useful for device partitioning or other utilities that may want to analyze the effect of conversion to different targets before performing it.

The analysis method currently just fills a provided set with the operations that were found to be legalizable. This can be extended in the future to capture more information as necessary.

PiperOrigin-RevId: 259987105
2019-07-25 11:31:07 -07:00
Nicolas Vasilache dd652ce9cc Fix backward slice computation to iterate through known control flow
This CL fixes an oversight with dealing with loops in slicing analysis.
The forward slice computation properly propagates through loops but not the backward slice.

Add relevant unit tests.

PiperOrigin-RevId: 259903396
2019-07-25 01:33:35 -07:00
Nicolas Vasilache 8ebb4281aa Cleanup slicing test.
Remove hardcoded SSA names and make use of CHECK-LABEL directives.

PiperOrigin-RevId: 259767803
2019-07-24 10:28:33 -07:00
Alex Zinenko 480d68f8de Affine loop parallelism detection: conservatively handle unknown ops
The loop parallelism detection utility only collects the affine.load and
affine.store operations appearing inside the loop to analyze the access
patterns for the absence of dependences.  However, any operation, including
unregistered operations, can appear in a body of an affine loop.  If such
operation has side effects, the result of parallelism analysis is incorrect.
Conservatively assume affine loops are not parallel in presence of operations
other than affine.load, affine.store, affine.for, affine.terminator that may
have side effects.

This required to update the loop-fusion unit test that relies on parallelism
analysis and was exercising loop fusion in presence of an unregistered
operation.

PiperOrigin-RevId: 259560935
2019-07-23 10:18:46 -07:00
Uday Bondhugula b5f8a4be27 Introduce parser library method to parse list of region arguments
- introduce parseRegionArgumentList (similar to parseOperandList) to parse a
  list of region arguments with a delimiter
- allows defining custom parse for op's with multiple/variadic number of
  region arguments
- use this on the gpu.launch op (although the latter has a fixed number
  of region arguments)
- add a test dialect op to test region argument list parsing (with the
  no delimiter case)

Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>

Closes tensorflow/mlir#40

PiperOrigin-RevId: 259442536
2019-07-22 17:42:08 -07:00
Nicolas Vasilache 48a1baeb8a Refactor LoopParametricTiling as a test pass - NFC
This CL moves LoopParametricTiling into test/lib as a pass for purely testing purposes.

PiperOrigin-RevId: 259300264
2019-07-22 04:31:17 -07:00