[MLIR] Fix affine fusion bug/efficiency issue / enable more fusion
The list of destination load ops was not being maintained as a set while evaluating producer-consumer fusion, so duplicate load ops were being added to it. Although this is harmless correctness-wise, it is a killer efficiency-wise, and it prevents interesting/useful fusions (including, e.g., fusing reshapes into a matmul). The latter fusions were missed because a slice union would be unnecessarily attempted due to the duplicate load ops on a memref in the 'dst loads' list. Since slice union is unimplemented for the local-variable case, a single destination load op whose slice leads to local vars (e.g., a fusion producing floordiv/mod), a common case, would not get fused: a needless union of the slice with itself would be attempted, and we would bail out even though the result would be the same slice.

Besides the above, this also significantly speeds up fusion, since all the unnecessary slice computations, unions, and checks caused by the duplicates go away.

Differential Revision: https://reviews.llvm.org/D79547
This commit is contained in:
parent f058d397ff
commit 2affcd664e
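For reference, the guard added below is the usual "small vector treated as a set" pattern. The following is a minimal standalone sketch of that pattern, not code from the patch itself: 'LoadOp' and 'addLoadOnce' are placeholder names made up for illustration, while llvm::SmallVectorImpl and llvm::is_contained are the real LLVM ADT utilities the fix relies on.

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/ADT/SmallVector.h"

    // Placeholder stand-in for an affine load operation (hypothetical type,
    // not MLIR's AffineLoadOp).
    struct LoadOp {};

    // Append 'load' only if it is not already in the worklist, so the vector
    // stays duplicate-free like a set. llvm::is_contained does a linear scan,
    // which is acceptable because the per-nest load list is expected to be
    // small.
    void addLoadOnce(llvm::SmallVectorImpl<LoadOp *> &loads, LoadOp *load) {
      if (!llvm::is_contained(loads, load))
        loads.push_back(load);
    }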
@@ -1625,7 +1625,10 @@ public:
         // continue fusing based on new operands.
         for (auto *loadOpInst : dstLoopCollector.loadOpInsts) {
           auto loadMemRef = cast<AffineLoadOp>(loadOpInst).getMemRef();
-          if (visitedMemrefs.count(loadMemRef) == 0)
+          // NOTE: Change 'loads' to a hash set in case efficiency is an
+          // issue. We still use a vector since it's expected to be small.
+          if (visitedMemrefs.count(loadMemRef) == 0 &&
+              !llvm::is_contained(loads, loadOpInst))
             loads.push_back(loadOpInst);
         }
@@ -2422,5 +2422,45 @@ func @should_fuse_producer_with_multi_outgoing_edges(%a : memref<1xf32>, %b : me
 // CHECK-NEXT: affine.store %{{.*}}, %[[A]]
 // CHECK-NEXT: affine.load %[[B]]
 // CHECK-NOT: affine.for %{{.*}}
 // CHECK: return
   return
 }
+
+// -----
+
+// MAXIMAL-LABEL: func @reshape_into_matmul
+func @reshape_into_matmul(%lhs : memref<1024x1024xf32>,
+          %R: memref<16x64x1024xf32>, %out: memref<1024x1024xf32>) {
+  %rhs = alloc() : memref<1024x1024xf32>
+
+  // Reshape from 3-d to 2-d.
+  affine.for %i0 = 0 to 16 {
+    affine.for %i1 = 0 to 64 {
+      affine.for %k = 0 to 1024 {
+        %v = affine.load %R[%i0, %i1, %k] : memref<16x64x1024xf32>
+        affine.store %v, %rhs[64*%i0 + %i1, %k] : memref<1024x1024xf32>
+      }
+    }
+  }
+
+  // Matmul.
+  affine.for %i = 0 to 1024 {
+    affine.for %j = 0 to 1024 {
+      affine.for %k = 0 to 1024 {
+        %0 = affine.load %rhs[%k, %j] : memref<1024x1024xf32>
+        %1 = affine.load %lhs[%i, %k] : memref<1024x1024xf32>
+        %2 = mulf %1, %0 : f32
+        %3 = affine.load %out[%i, %j] : memref<1024x1024xf32>
+        %4 = addf %3, %2 : f32
+        affine.store %4, %out[%i, %j] : memref<1024x1024xf32>
+      }
+    }
+  }
+  return
+}
+// MAXIMAL-NEXT: alloc
+// MAXIMAL-NEXT: affine.for
+// MAXIMAL-NEXT: affine.for
+// MAXIMAL-NEXT: affine.for
+// MAXIMAL-NOT: affine.for
+// MAXIMAL: return