[Arc] Improve LowerState to never produce read-after-write conflicts (#7703)

This is a complete rewrite of the `LowerState` pass that makes the
`LegalizeStateUpdate` pass obsolete.

The old implementation of `LowerState` produces `arc.model`s that still
contain read-after-write conflicts. This primarily happens because the
pass simply emits `arc.state_write` operations that write updated values
to simulation memory for each `arc.state`, while every user of an
`arc.state` retrieves its value through an `arc.state_read` operation
and therefore observes the original value from before any writes
occurred. Memories behave similarly. The Arc
dialect considers `arc.state_write` and `arc.memory_write` operations to
be _deferred_ writes until the `LegalizeStateUpdate` pass runs, at which
point they become _immediate_ since the legalization inserts the
necessary temporaries to resolve the read-after-write conflicts.
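
To make the hazard concrete, the old pass could produce IR roughly like
the following (a hand-written sketch with approximated Arc syntax and
made-up names, not actual pass output):

```mlir
// Deferred write of the register's next value into its storage slot.
%next = arc.call @ComputeNext(%in) : (i32) -> i32
arc.state_write %slot = %next : <i32>
// A later read of the same slot must still observe the *old* value, so the
// write above cannot take effect immediately; LegalizeStateUpdate had to
// insert temporaries to break such read-after-write conflicts.
%cur = arc.state_read %slot : <i32>
```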

The previous implementation would also not handle state-to-output and
state-to-side-effecting-op propagation paths correctly. When a model's
eval function is called, registers are updated to their new value, and
any outputs that combinatorially depend on those new values should also
immediately update. Similarly, operations such as asserts or debug
trackers should observe new values for states immediately after they
have been written. However, since all writes are considered deferred,
there is no way for `LowerState` to produce a mixture of operations,
some depending on a register's _old_ state (because they compute the
register's new state) and some on its _new_ state (because they are
combinatorially derived from it).

This new implementation of `LowerState` avoids read-after-write
conflicts entirely. It does so by changing how modules are lowered in
two ways:

**Phases:** The pass tracks in which _phase_ of the simulation lifecycle
a value is needed and allows for operations to have different lowerings
in different phases. An `arc.state` operation, for example, requires its
inputs, enable, and reset to be computed based on the _old_ values they
had, i.e. the values at the end of the previous call to the model's eval
function. The clock, however, has to be computed based on the _new_ value
it has in the current call to eval. Therefore, the ops defining the
inputs, enable, and reset are lowered in the _old_ phase, while the ops
defining the clock are lowered in the _new_ phase. The `arc.state` op
lowering will then write its _new_ value to simulation storage.
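
For example, given a state op roughly like the one below (approximated
syntax; `@Next`, `%d`, `%en`, and `%clk` are made-up names), the ops
defining `%d` and `%en` (and a reset, if present) are lowered in the
_old_ phase, while the ops defining `%clk` are lowered in the _new_
phase:

```mlir
// %d and %en are computed from values as they were at the end of the
// previous eval call; %clk is computed from values in the current call.
%q = arc.state @Next(%d) clock %clk enable %en latency 1 : (i32) -> i32
```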

This phase tracking allows registers to be used as clocks for other
registers: since clocks require _new_ values, registers serving as
clocks for others are lowered first, such that the dependent registers
can immediately react to the updated clock. It also allows module
outputs and side-effecting ops based on `arc.state`s to be scheduled
after the states have been updated, since they depend on the state's
_new_ value.

The pass also covers the situation where an operation depends on a
module input and a state, and feeds into a module output as well as
another state. In that case the operation has to be lowered twice:
once in the _old_ phase to serve as input to the subsequent state, and
once in the _new_ phase to compute the new module output.
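
A minimal sketch of the result (approximated syntax; `%in`, `%q_state`,
and `%out_storage` are made-up names for a module input value, a state
storage slot, and an output storage slot):

```mlir
// "Old" phase: compute the state's next value from the pre-update value.
%0 = arc.state_read %q_state : <i32>
%1 = comb.xor %in, %0 : i32
arc.state_write %q_state = %1 : <i32>
// "New" phase: the same expression is lowered a second time, now reading
// the freshly written state, so the module output sees the updated value.
%2 = arc.state_read %q_state : <i32>
%3 = comb.xor %in, %2 : i32
arc.state_write %out_storage = %3 : <i32>
```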

In addition to the _old_ and _new_ phases representing the previous and
current call to eval, the pass also models an _initial_ and _final_
phase. These are used for `seq.initial` and `llhd.final` ops, and to
compute the initial values of states. If an `arc.state` op has an
initial value operand, that operand is lowered in the _initial_ phase;
ops inside `llhd.final` are lowered in the _final_ phase. The pass
places all ops lowered in the
initial and final phases into corresponding `arc.initial` and
`arc.final` ops. At a later point we may want to generate the
`*_initial`, `*_eval`, and `*_final` functions directly.
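
Conceptually, the initial-phase ops end up grouped roughly like this
(approximated syntax; the constant and `%q_state` are made up):

```mlir
arc.initial {
  %c42_i32 = hw.constant 42 : i32
  arc.state_write %q_state = %c42_i32 : <i32>
}
```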

**No more clock trees:** The new implementation also no longer generates
`arc.clock_tree` and `arc.passthrough` operations. These were a holdover
from the early days of the Arc dialect, where no eval function would be
generated. Instead, the user was required to directly call clock
functions. That approach could never model clocks changing at the exact
same moment, or clocks derived from registers and other combinatorial
operations. Since Arc now generates an eval function that can accurately
interleave the effects of different clocks, grouping ops by clock tree
is no longer needed. In fact, removing the clock tree ops allows the
pass to interleave the operations from different clock domains more
efficiently.

The Rocket core in the circt/arc-tests repository still works with this
new implementation of LowerState. In combination with the MergeIfs pass,
performance stays the same.

I have renamed the implementation and test files to make the git diffs
easier to read. The names will be changed back in a follow-up commit.

Fixes #6390.
Fabian Schuiki, 2024-10-28 14:57:03 -07:00, committed by GitHub
commit 3181b0317d (parent 6f5e0a8744)
17 changed files with 1822 additions and 2330 deletions


@@ -36,11 +36,9 @@ createInferMemoriesPass(const InferMemoriesOptions &options = {});
std::unique_ptr<mlir::Pass> createInlineArcsPass();
std::unique_ptr<mlir::Pass> createIsolateClocksPass();
std::unique_ptr<mlir::Pass> createLatencyRetimingPass();
std::unique_ptr<mlir::Pass> createLegalizeStateUpdatePass();
std::unique_ptr<mlir::Pass> createLowerArcsToFuncsPass();
std::unique_ptr<mlir::Pass> createLowerClocksToFuncsPass();
std::unique_ptr<mlir::Pass> createLowerLUTPass();
std::unique_ptr<mlir::Pass> createLowerStatePass();
std::unique_ptr<mlir::Pass> createLowerVectorizationsPass(
LowerVectorizationsModeEnum mode = LowerVectorizationsModeEnum::Full);
std::unique_ptr<mlir::Pass> createMakeTablesPass();


@@ -163,12 +163,6 @@ def LatencyRetiming : Pass<"arc-latency-retiming", "mlir::ModuleOp"> {
];
}
def LegalizeStateUpdate : Pass<"arc-legalize-state-update", "mlir::ModuleOp"> {
let summary = "Insert temporaries such that state reads don't see writes";
let constructor = "circt::arc::createLegalizeStateUpdatePass()";
let dependentDialects = ["arc::ArcDialect"];
}
def LowerArcsToFuncs : Pass<"arc-lower-arcs-to-funcs", "mlir::ModuleOp"> {
let summary = "Lower arc definitions into functions";
let constructor = "circt::arc::createLowerArcsToFuncsPass()";
@@ -187,12 +181,15 @@ def LowerLUT : Pass<"arc-lower-lut", "arc::DefineOp"> {
let dependentDialects = ["hw::HWDialect", "comb::CombDialect"];
}
def LowerState : Pass<"arc-lower-state", "mlir::ModuleOp"> {
def LowerStatePass : Pass<"arc-lower-state", "mlir::ModuleOp"> {
let summary = "Split state into read and write ops grouped by clock tree";
let constructor = "circt::arc::createLowerStatePass()";
let dependentDialects = [
"arc::ArcDialect", "mlir::scf::SCFDialect", "mlir::func::FuncDialect",
"mlir::LLVM::LLVMDialect", "comb::CombDialect", "seq::SeqDialect"
"arc::ArcDialect",
"comb::CombDialect",
"mlir::LLVM::LLVMDialect",
"mlir::func::FuncDialect",
"mlir::scf::SCFDialect",
"seq::SeqDialect",
];
}


@@ -19,13 +19,14 @@ func.func @add_mlir_impl(%arg0: i32, %arg1: i32, %arg2: !llvm.ptr) {
llvm.store %0, %arg2 : i32, !llvm.ptr
return
}
hw.module @arith(in %clock : i1, in %a : i32, in %b : i32, out c : i32, out d : i32) {
%seq_clk = seq.to_clock %clock
%0 = sim.func.dpi.call @add_mlir(%a, %b) clock %seq_clk : (i32, i32) -> i32
%1 = sim.func.dpi.call @mul_shared(%a, %b) clock %seq_clk : (i32, i32) -> i32
hw.output %0, %1 : i32, i32
}
func.func @main() {
%c2_i32 = arith.constant 2 : i32
%c3_i32 = arith.constant 3 : i32
@@ -34,18 +35,16 @@ func.func @main() {
arc.sim.instantiate @arith as %arg0 {
arc.sim.set_input %arg0, "a" = %c2_i32 : i32, !arc.sim.instance<@arith>
arc.sim.set_input %arg0, "b" = %c3_i32 : i32, !arc.sim.instance<@arith>
arc.sim.set_input %arg0, "clock" = %one : i1, !arc.sim.instance<@arith>
arc.sim.step %arg0 : !arc.sim.instance<@arith>
arc.sim.set_input %arg0, "clock" = %zero : i1, !arc.sim.instance<@arith>
arc.sim.step %arg0 : !arc.sim.instance<@arith>
%0 = arc.sim.get_port %arg0, "c" : i32, !arc.sim.instance<@arith>
%1 = arc.sim.get_port %arg0, "d" : i32, !arc.sim.instance<@arith>
arc.sim.emit "c", %0 : i32
arc.sim.emit "d", %1 : i32
arc.sim.step %arg0 : !arc.sim.instance<@arith>
arc.sim.set_input %arg0, "clock" = %one : i1, !arc.sim.instance<@arith>
arc.sim.step %arg0 : !arc.sim.instance<@arith>
%2 = arc.sim.get_port %arg0, "c" : i32, !arc.sim.instance<@arith>
%3 = arc.sim.get_port %arg0, "d" : i32, !arc.sim.instance<@arith>
arc.sim.emit "c", %2 : i32


@@ -26,6 +26,7 @@ module {
%true = arith.constant 1 : i1
arc.sim.instantiate @shiftreg as %model {
arc.sim.step %model : !arc.sim.instance<@shiftreg>
arc.sim.set_input %model, "en" = %false : i1, !arc.sim.instance<@shiftreg>
arc.sim.set_input %model, "reset" = %false : i1, !arc.sim.instance<@shiftreg>
arc.sim.set_input %model, "din" = %ff : i8, !arc.sim.instance<@shiftreg>


@@ -1,7 +1,7 @@
// RUN: arcilator %s --run --jit-entry=main | FileCheck %s
// REQUIRES: arcilator-jit
// CHECK: o1 = 2
// CHECK: o1 = 2
// CHECK-NEXT: o2 = 5
// CHECK-NEXT: o1 = 3
// CHECK-NEXT: o2 = 6
@@ -41,6 +41,7 @@ func.func @main() {
%step = arith.constant 1 : index
arc.sim.instantiate @counter as %model {
arc.sim.step %model : !arc.sim.instance<@counter>
%init_val1 = arc.sim.get_port %model, "o1" : i8, !arc.sim.instance<@counter>
%init_val2 = arc.sim.get_port %model, "o2" : i8, !arc.sim.instance<@counter>


@@ -25,8 +25,9 @@ using llvm::MapVector;
static bool isArcBreakingOp(Operation *op) {
return op->hasTrait<OpTrait::ConstantLike>() ||
isa<hw::InstanceOp, seq::CompRegOp, MemoryOp, ClockedOpInterface,
seq::InitialOp, seq::ClockGateOp, sim::DPICallOp>(op) ||
isa<hw::InstanceOp, seq::CompRegOp, MemoryOp, MemoryReadPortOp,
ClockedOpInterface, seq::InitialOp, seq::ClockGateOp,
sim::DPICallOp>(op) ||
op->getNumResults() > 1;
}


@@ -21,11 +21,17 @@ using namespace mlir;
#define GET_TYPEDEF_CLASSES
#include "circt/Dialect/Arc/ArcTypes.cpp.inc"
unsigned StateType::getBitWidth() { return hw::getBitWidth(getType()); }
unsigned StateType::getBitWidth() {
if (llvm::isa<seq::ClockType>(getType()))
return 1;
return hw::getBitWidth(getType());
}
LogicalResult
StateType::verify(llvm::function_ref<InFlightDiagnostic()> emitError,
Type innerType) {
if (llvm::isa<seq::ClockType>(innerType))
return success();
if (hw::getBitWidth(innerType) < 0)
return emitError() << "state type must have a known bit width; got "
<< innerType;


@@ -9,11 +9,10 @@ add_circt_dialect_library(CIRCTArcTransforms
InlineArcs.cpp
IsolateClocks.cpp
LatencyRetiming.cpp
LegalizeStateUpdate.cpp
LowerArcsToFuncs.cpp
LowerClocksToFuncs.cpp
LowerLUT.cpp
LowerState.cpp
LowerStateRewrite.cpp
LowerVectorizations.cpp
MakeTables.cpp
MergeIfs.cpp
@@ -33,6 +32,7 @@ add_circt_dialect_library(CIRCTArcTransforms
CIRCTComb
CIRCTEmit
CIRCTHW
CIRCTLLHD
CIRCTOM
CIRCTSV
CIRCTSeq


@@ -1,597 +0,0 @@
//===- LegalizeStateUpdate.cpp --------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#include "circt/Dialect/Arc/ArcOps.h"
#include "circt/Dialect/Arc/ArcPasses.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/Dominance.h"
#include "mlir/IR/ImplicitLocOpBuilder.h"
#include "llvm/ADT/PointerIntPair.h"
#include "llvm/ADT/TypeSwitch.h"
#include "llvm/Support/Debug.h"
#define DEBUG_TYPE "arc-legalize-state-update"
namespace circt {
namespace arc {
#define GEN_PASS_DEF_LEGALIZESTATEUPDATE
#include "circt/Dialect/Arc/ArcPasses.h.inc"
} // namespace arc
} // namespace circt
using namespace mlir;
using namespace circt;
using namespace arc;
/// Check if an operation partakes in state accesses.
static bool isOpInteresting(Operation *op) {
if (isa<InitialOp>(op))
return false;
if (isa<StateReadOp, StateWriteOp, CallOpInterface, CallableOpInterface>(op))
return true;
if (op->getNumRegions() > 0)
return true;
return false;
}
//===----------------------------------------------------------------------===//
// Access Analysis
//===----------------------------------------------------------------------===//
namespace {
enum class AccessType { Read = 0, Write = 1 };
/// A read or write access to a state value.
using Access = llvm::PointerIntPair<Value, 1, AccessType>;
struct BlockAccesses;
struct OpAccesses;
/// A block's access analysis information and graph edges.
struct BlockAccesses {
BlockAccesses(Block *block) : block(block) {}
/// The block.
Block *const block;
/// The parent op lattice node.
OpAccesses *parent = nullptr;
/// The accesses from ops within this block to the block arguments.
SmallPtrSet<Access, 1> argAccesses;
/// The accesses from ops within this block to values defined outside the
/// block.
SmallPtrSet<Access, 1> aboveAccesses;
};
/// An operation's access analysis information and graph edges.
struct OpAccesses {
OpAccesses(Operation *op) : op(op) {}
/// The operation.
Operation *const op;
/// The parent block lattice node.
BlockAccesses *parent = nullptr;
/// If this is a callable op, `callers` is the set of ops calling it.
SmallPtrSet<OpAccesses *, 1> callers;
/// The accesses performed by this op.
SmallPtrSet<Access, 1> accesses;
};
/// An analysis that determines states read and written by operations and
/// blocks. Looks through calls and handles nested operations properly. Does not
/// follow state values returned from functions and modified by operations.
struct AccessAnalysis {
LogicalResult analyze(Operation *op);
OpAccesses *lookup(Operation *op);
BlockAccesses *lookup(Block *block);
/// A global order assigned to state values. These allow us to not care about
/// ordering during the access analysis and only establish a deterministic
/// order once we insert additional operations later on.
DenseMap<Value, unsigned> stateOrder;
/// A symbol table cache.
SymbolTableCollection symbolTable;
private:
llvm::SpecificBumpPtrAllocator<OpAccesses> opAlloc;
llvm::SpecificBumpPtrAllocator<BlockAccesses> blockAlloc;
DenseMap<Operation *, OpAccesses *> opAccesses;
DenseMap<Block *, BlockAccesses *> blockAccesses;
SetVector<OpAccesses *> opWorklist;
bool anyInvalidStateAccesses = false;
// Get the node for an operation, creating one if necessary.
OpAccesses &get(Operation *op) {
auto &slot = opAccesses[op];
if (!slot)
slot = new (opAlloc.Allocate()) OpAccesses(op);
return *slot;
}
// Get the node for a block, creating one if necessary.
BlockAccesses &get(Block *block) {
auto &slot = blockAccesses[block];
if (!slot)
slot = new (blockAlloc.Allocate()) BlockAccesses(block);
return *slot;
}
// NOLINTBEGIN(misc-no-recursion)
void addOpAccess(OpAccesses &op, Access access);
void addBlockAccess(BlockAccesses &block, Access access);
// NOLINTEND(misc-no-recursion)
};
} // namespace
LogicalResult AccessAnalysis::analyze(Operation *op) {
LLVM_DEBUG(llvm::dbgs() << "Analyzing accesses in " << op->getName() << "\n");
// Create the lattice nodes for all blocks and operations.
llvm::SmallSetVector<OpAccesses *, 16> initWorklist;
initWorklist.insert(&get(op));
while (!initWorklist.empty()) {
OpAccesses &opNode = *initWorklist.pop_back_val();
// First create lattice nodes for all nested blocks and operations.
for (auto &region : opNode.op->getRegions()) {
for (auto &block : region) {
BlockAccesses &blockNode = get(&block);
blockNode.parent = &opNode;
for (auto &subOp : block) {
if (!isOpInteresting(&subOp))
continue;
OpAccesses &subOpNode = get(&subOp);
if (!subOp.hasTrait<OpTrait::IsIsolatedFromAbove>()) {
subOpNode.parent = &blockNode;
}
initWorklist.insert(&subOpNode);
}
}
}
// Track the relationship between callers and callees.
if (auto callOp = dyn_cast<CallOpInterface>(opNode.op))
if (auto *calleeOp = callOp.resolveCallableInTable(&symbolTable))
get(calleeOp).callers.insert(&opNode);
// Create the seed accesses.
if (auto readOp = dyn_cast<StateReadOp>(opNode.op))
addOpAccess(opNode, Access(readOp.getState(), AccessType::Read));
else if (auto writeOp = dyn_cast<StateWriteOp>(opNode.op))
addOpAccess(opNode, Access(writeOp.getState(), AccessType::Write));
}
LLVM_DEBUG(llvm::dbgs() << "- Prepared " << blockAccesses.size()
<< " block and " << opAccesses.size()
<< " op lattice nodes\n");
LLVM_DEBUG(llvm::dbgs() << "- Worklist has " << opWorklist.size()
<< " initial ops\n");
// Propagate accesses through calls.
while (!opWorklist.empty()) {
if (anyInvalidStateAccesses)
return failure();
auto &opNode = *opWorklist.pop_back_val();
if (opNode.callers.empty())
continue;
auto calleeOp = dyn_cast<CallableOpInterface>(opNode.op);
if (!calleeOp)
return opNode.op->emitOpError(
"does not implement CallableOpInterface but has callers");
LLVM_DEBUG(llvm::dbgs() << "- Updating callable " << opNode.op->getName()
<< " " << opNode.op->getAttr("sym_name") << "\n");
auto &calleeRegion = *calleeOp.getCallableRegion();
auto *blockNode = lookup(&calleeRegion.front());
if (!blockNode)
continue;
auto calleeArgs = blockNode->block->getArguments();
for (auto *callOpNode : opNode.callers) {
LLVM_DEBUG(llvm::dbgs() << " - Updating " << *callOpNode->op << "\n");
auto callArgs = cast<CallOpInterface>(callOpNode->op).getArgOperands();
for (auto [calleeArg, callArg] : llvm::zip(calleeArgs, callArgs)) {
if (blockNode->argAccesses.contains({calleeArg, AccessType::Read}))
addOpAccess(*callOpNode, {callArg, AccessType::Read});
if (blockNode->argAccesses.contains({calleeArg, AccessType::Write}))
addOpAccess(*callOpNode, {callArg, AccessType::Write});
}
}
}
return failure(anyInvalidStateAccesses);
}
OpAccesses *AccessAnalysis::lookup(Operation *op) {
return opAccesses.lookup(op);
}
BlockAccesses *AccessAnalysis::lookup(Block *block) {
return blockAccesses.lookup(block);
}
// NOLINTBEGIN(misc-no-recursion)
void AccessAnalysis::addOpAccess(OpAccesses &op, Access access) {
// We don't support state pointers flowing among ops and blocks. Check that
// the accessed state is either directly passed down through a block argument
// (no defining op), or is trivially a local state allocation.
auto *defOp = access.getPointer().getDefiningOp();
if (defOp && !isa<AllocStateOp, RootInputOp, RootOutputOp>(defOp)) {
auto d = op.op->emitOpError("accesses non-trivial state value defined by `")
<< defOp->getName()
<< "`; only block arguments and `arc.alloc_state` results are "
"supported";
d.attachNote(defOp->getLoc()) << "state defined here";
anyInvalidStateAccesses = true;
}
// HACK: Do not propagate accesses outside of `arc.passthrough` to prevent
// reads from being legalized. Ideally we'd be able to more precisely specify
// on read ops whether they should read the initial or the final value.
if (isa<PassThroughOp>(op.op))
return;
// Propagate to the parent block and operation if the access escapes the block
// or targets a block argument.
if (op.accesses.insert(access).second && op.parent) {
stateOrder.insert({access.getPointer(), stateOrder.size()});
addBlockAccess(*op.parent, access);
}
}
void AccessAnalysis::addBlockAccess(BlockAccesses &block, Access access) {
Value value = access.getPointer();
// If the accessed value is defined outside the block, add it to the set of
// outside accesses.
if (value.getParentBlock() != block.block) {
if (block.aboveAccesses.insert(access).second)
addOpAccess(*block.parent, access);
return;
}
// If the accessed value is defined within the block, and it is a block
// argument, add it to the list of block argument accesses.
if (auto blockArg = dyn_cast<BlockArgument>(value)) {
assert(blockArg.getOwner() == block.block);
if (!block.argAccesses.insert(access).second)
return;
// Adding block argument accesses affects calls to the surrounding ops. Add
// the op to the worklist such that the access can propagate to callers.
opWorklist.insert(block.parent);
}
}
// NOLINTEND(misc-no-recursion)
//===----------------------------------------------------------------------===//
// Legalization
//===----------------------------------------------------------------------===//
namespace {
struct Legalizer {
Legalizer(AccessAnalysis &analysis) : analysis(analysis) {}
LogicalResult run(MutableArrayRef<Region> regions);
LogicalResult visitBlock(Block *block);
AccessAnalysis &analysis;
unsigned numLegalizedWrites = 0;
unsigned numUpdatedReads = 0;
/// A mapping from pre-existing states to temporary states for read
/// operations, created during legalization to remove read-after-write
/// hazards.
DenseMap<Value, Value> legalizedStates;
};
} // namespace
LogicalResult Legalizer::run(MutableArrayRef<Region> regions) {
for (auto &region : regions)
for (auto &block : region)
if (failed(visitBlock(&block)))
return failure();
assert(legalizedStates.empty() && "should be balanced within block");
return success();
}
LogicalResult Legalizer::visitBlock(Block *block) {
// In a first reverse pass over the block, find the first write that occurs
// before the last read of a state, if any.
SmallPtrSet<Value, 4> readStates;
DenseMap<Value, Operation *> illegallyWrittenStates;
for (Operation &op : llvm::reverse(*block)) {
const auto *accesses = analysis.lookup(&op);
if (!accesses)
continue;
// Determine the states written by this op for which we have already seen a
// read earlier. These writes need to be legalized.
SmallVector<Value, 1> affectedStates;
for (auto access : accesses->accesses)
if (access.getInt() == AccessType::Write)
if (readStates.contains(access.getPointer()))
illegallyWrittenStates[access.getPointer()] = &op;
// Determine the states read by this op. This comes after handling of the
// writes, such that a block that contains both reads and writes to a state
// doesn't mark itself as illegal. Instead, we will descend into that block
// further down and do a more fine-grained legalization.
for (auto access : accesses->accesses)
if (access.getInt() == AccessType::Read)
readStates.insert(access.getPointer());
}
// Create a mapping from operations that create a read-after-write hazard to
// the states that they modify. Don't consider states that have already been
// legalized. This is important since we may have already created a temporary
// in a parent block which we can just reuse.
DenseMap<Operation *, SmallVector<Value, 1>> illegalWrites;
for (auto [state, op] : illegallyWrittenStates)
if (!legalizedStates.count(state))
illegalWrites[op].push_back(state);
// In a second forward pass over the block, insert the necessary temporary
// state to legalize the writes and recur into subblocks while providing the
// necessary rewrites.
SmallVector<Value> locallyLegalizedStates;
auto handleIllegalWrites =
[&](Operation *op, SmallVector<Value, 1> &states) -> LogicalResult {
LLVM_DEBUG(llvm::dbgs() << "Visiting illegal " << op->getName() << "\n");
// Sort the states we need to legalize by a deterministic order established
// during the access analysis. Without this the exact order in which states
// were moved into a temporary would be non-deterministic.
llvm::sort(states, [&](Value a, Value b) {
return analysis.stateOrder.lookup(a) < analysis.stateOrder.lookup(b);
});
// Legalize each state individually.
for (auto state : states) {
LLVM_DEBUG(llvm::dbgs() << "- Legalizing " << state << "\n");
// HACK: This is ugly, but we need a storage reference to allocate a state
// into. Ideally we'd materialize this later on, but the current impl of
// the alloc op requires a storage immediately. So try to find one.
auto storage = TypeSwitch<Operation *, Value>(state.getDefiningOp())
.Case<AllocStateOp, RootInputOp, RootOutputOp>(
[&](auto allocOp) { return allocOp.getStorage(); })
.Default([](auto) { return Value{}; });
if (!storage) {
mlir::emitError(
state.getLoc(),
"cannot find storage pointer to allocate temporary into");
return failure();
}
// Allocate a temporary state, read the current value of the state we are
// legalizing, and write it to the temporary.
++numLegalizedWrites;
ImplicitLocOpBuilder builder(state.getLoc(), op);
auto tmpState =
builder.create<AllocStateOp>(state.getType(), storage, nullptr);
auto stateValue = builder.create<StateReadOp>(state);
builder.create<StateWriteOp>(tmpState, stateValue, Value{});
locallyLegalizedStates.push_back(state);
legalizedStates.insert({state, tmpState});
}
return success();
};
for (Operation &op : *block) {
if (isOpInteresting(&op)) {
if (auto it = illegalWrites.find(&op); it != illegalWrites.end())
if (failed(handleIllegalWrites(&op, it->second)))
return failure();
}
// BUG: This is insufficient. Actually only reads should have their state
// updated, since we want writes to still affect the original state. This
// works for `state_read`, but in the case of a function that both reads and
// writes a state we only have a single operand to change but we would need
// one for reads and one for writes instead.
// HACKY FIX: Assume that there is ever only a single write to a state. In
// that case it is safe to assume that when an op is marked as writing a
// state it wants the original state, not the temporary one for reads.
const auto *accesses = analysis.lookup(&op);
for (auto &operand : op.getOpOperands()) {
if (accesses &&
accesses->accesses.contains({operand.get(), AccessType::Read}) &&
accesses->accesses.contains({operand.get(), AccessType::Write})) {
auto d = op.emitWarning("operation reads and writes state; "
"legalization may be insufficient");
d.attachNote()
<< "state update legalization does not properly handle operations "
"that both read and write states at the same time; runtime data "
"races between the read and write behavior are possible";
d.attachNote(operand.get().getLoc()) << "state defined here:";
}
if (!accesses ||
!accesses->accesses.contains({operand.get(), AccessType::Write})) {
if (auto tmpState = legalizedStates.lookup(operand.get())) {
operand.set(tmpState);
++numUpdatedReads;
}
}
}
for (auto &region : op.getRegions())
for (auto &block : region)
if (failed(visitBlock(&block)))
return failure();
}
// Since we're leaving this block's scope, remove all the locally-legalized
// states which are no longer accessible outside.
for (auto state : locallyLegalizedStates)
legalizedStates.erase(state);
return success();
}
static LogicalResult getAncestorOpsInCommonDominatorBlock(
Operation *write, Operation **writeAncestor, Operation *read,
Operation **readAncestor, DominanceInfo *domInfo) {
Block *commonDominator =
domInfo->findNearestCommonDominator(write->getBlock(), read->getBlock());
if (!commonDominator)
return write->emitOpError(
"cannot find a common dominator block with all read operations");
// Path from writeOp to common dominator must only contain IfOps with no
// return values
Operation *writeParent = write;
while (writeParent->getBlock() != commonDominator) {
if (!isa<scf::IfOp, ClockTreeOp>(writeParent->getParentOp()))
return write->emitOpError("memory write operations in arbitrarily nested "
"regions not supported");
writeParent = writeParent->getParentOp();
}
Operation *readParent = read;
while (readParent->getBlock() != commonDominator)
readParent = readParent->getParentOp();
*writeAncestor = writeParent;
*readAncestor = readParent;
return success();
}
static LogicalResult
moveMemoryWritesAfterLastRead(Region &region, const DenseSet<Value> &memories,
DominanceInfo *domInfo) {
// Collect memory values and their reads
DenseMap<Value, SetVector<Operation *>> readOps;
auto result = region.walk([&](Operation *op) {
if (isa<MemoryWriteOp>(op))
return WalkResult::advance();
SmallVector<Value> memoriesReadFrom;
if (auto readOp = dyn_cast<MemoryReadOp>(op)) {
memoriesReadFrom.push_back(readOp.getMemory());
} else {
for (auto operand : op->getOperands())
if (isa<MemoryType>(operand.getType()))
memoriesReadFrom.push_back(operand);
}
for (auto memVal : memoriesReadFrom) {
if (!memories.contains(memVal))
return op->emitOpError("uses memory value not directly defined by a "
"arc.alloc_memory operation"),
WalkResult::interrupt();
readOps[memVal].insert(op);
}
return WalkResult::advance();
});
if (result.wasInterrupted())
return failure();
// Collect all writes
SmallVector<MemoryWriteOp> writes;
region.walk([&](MemoryWriteOp writeOp) { writes.push_back(writeOp); });
// Move the writes
for (auto writeOp : writes) {
if (!memories.contains(writeOp.getMemory()))
return writeOp->emitOpError("uses memory value not directly defined by a "
"arc.alloc_memory operation");
for (auto *readOp : readOps[writeOp.getMemory()]) {
// (1) If the last read and the write are in the same block, just move the
// write after the read.
// (2) If the write is directly in the clock tree region and the last read
// in some nested region, move the write after the operation with the
// nested region. (3) If the write is nested in if-statements (arbitrarily
// deep) without return value, move the whole if operation after the last
// read or the operation that defines the region if the read is inside a
// nested region. (4) Number (3) may move more memory operations with the
// write op, thus messing up the order of previously moved memory writes,
// we check in a second walk-through if that is the case and just emit an
// error for now. We could instead move reads in a parent region, split if
// operations such that the memory write has its own, etc. Alternatively,
// rewrite this to insert temporaries which is more difficult for memories
// than simple states because the memory addresses have to be considered
// (we cannot just copy the whole memory each time).
Operation *readAncestor, *writeAncestor;
if (failed(getAncestorOpsInCommonDominatorBlock(
writeOp, &writeAncestor, readOp, &readAncestor, domInfo)))
return failure();
// FIXME: the `isBeforeInBlock` + `moveAfter` combination can be
// computationally very expensive.
if (writeAncestor->isBeforeInBlock(readAncestor))
writeAncestor->moveAfter(readAncestor);
}
}
// Double check that all writes happen after all reads to the same memory.
for (auto writeOp : writes) {
for (auto *readOp : readOps[writeOp.getMemory()]) {
Operation *readAncestor, *writeAncestor;
if (failed(getAncestorOpsInCommonDominatorBlock(
writeOp, &writeAncestor, readOp, &readAncestor, domInfo)))
return failure();
if (writeAncestor->isBeforeInBlock(readAncestor))
return writeOp
->emitOpError("could not be moved to be after all reads to "
"the same memory")
.attachNote(readOp->getLoc())
<< "could not be moved after this read";
}
}
return success();
}
//===----------------------------------------------------------------------===//
// Pass Infrastructure
//===----------------------------------------------------------------------===//
namespace {
struct LegalizeStateUpdatePass
: public arc::impl::LegalizeStateUpdateBase<LegalizeStateUpdatePass> {
LegalizeStateUpdatePass() = default;
LegalizeStateUpdatePass(const LegalizeStateUpdatePass &pass)
: LegalizeStateUpdatePass() {}
void runOnOperation() override;
Statistic numLegalizedWrites{
this, "legalized-writes",
"Writes that required temporary state for later reads"};
Statistic numUpdatedReads{this, "updated-reads", "Reads that were updated"};
};
} // namespace
void LegalizeStateUpdatePass::runOnOperation() {
auto module = getOperation();
auto *domInfo = &getAnalysis<DominanceInfo>();
for (auto model : module.getOps<ModelOp>()) {
DenseSet<Value> memories;
for (auto memOp : model.getOps<AllocMemoryOp>())
memories.insert(memOp.getResult());
for (auto ct : model.getOps<ClockTreeOp>())
if (failed(
moveMemoryWritesAfterLastRead(ct.getBody(), memories, domInfo)))
return signalPassFailure();
}
AccessAnalysis analysis;
if (failed(analysis.analyze(module)))
return signalPassFailure();
Legalizer legalizer(analysis);
if (failed(legalizer.run(module->getRegions())))
return signalPassFailure();
numLegalizedWrites += legalizer.numLegalizedWrites;
numUpdatedReads += legalizer.numUpdatedReads;
}
std::unique_ptr<Pass> arc::createLegalizeStateUpdatePass() {
return std::make_unique<LegalizeStateUpdatePass>();
}


@@ -1,959 +0,0 @@
//===- LowerState.cpp ---------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#include "circt/Dialect/Arc/ArcOps.h"
#include "circt/Dialect/Arc/ArcPasses.h"
#include "circt/Dialect/Comb/CombDialect.h"
#include "circt/Dialect/Comb/CombOps.h"
#include "circt/Dialect/HW/HWOps.h"
#include "circt/Dialect/Seq/SeqOps.h"
#include "circt/Dialect/Sim/SimOps.h"
#include "circt/Support/BackedgeBuilder.h"
#include "mlir/Analysis/TopologicalSortUtils.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/LLVMIR/LLVMAttrs.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/IRMapping.h"
#include "mlir/IR/ImplicitLocOpBuilder.h"
#include "mlir/IR/SymbolTable.h"
#include "mlir/Pass/Pass.h"
#include "llvm/ADT/TypeSwitch.h"
#include "llvm/Support/Debug.h"
#define DEBUG_TYPE "arc-lower-state"
namespace circt {
namespace arc {
#define GEN_PASS_DEF_LOWERSTATE
#include "circt/Dialect/Arc/ArcPasses.h.inc"
} // namespace arc
} // namespace circt
using namespace circt;
using namespace arc;
using namespace hw;
using namespace mlir;
using llvm::SmallDenseSet;
//===----------------------------------------------------------------------===//
// Data Structures
//===----------------------------------------------------------------------===//
namespace {
/// Statistics gathered throughout the execution of this pass.
struct Statistics {
Pass *parent;
Statistics(Pass *parent) : parent(parent) {}
using Statistic = Pass::Statistic;
Statistic matOpsMoved{parent, "mat-ops-moved",
"Ops moved during value materialization"};
Statistic matOpsCloned{parent, "mat-ops-cloned",
"Ops cloned during value materialization"};
Statistic opsPruned{parent, "ops-pruned", "Ops removed as dead code"};
};
/// Lowering info associated with a single primary clock.
struct ClockLowering {
/// The root clock this lowering is for.
Value clock;
/// A `ClockTreeOp` or `PassThroughOp` or `InitialOp`.
Operation *treeOp;
/// Pass statistics.
Statistics &stats;
OpBuilder builder;
/// A mapping from values outside the clock tree to their materialized form
/// inside the clock tree.
IRMapping materializedValues;
/// A cache of AND gates created for aggregating enable conditions.
DenseMap<std::pair<Value, Value>, Value> andCache;
/// A cache of OR gates created for aggregating enable conditions.
DenseMap<std::pair<Value, Value>, Value> orCache;
// Prevent accidental construction and copying
ClockLowering() = delete;
ClockLowering(const ClockLowering &other) = delete;
ClockLowering(Value clock, Operation *treeOp, Statistics &stats)
: clock(clock), treeOp(treeOp), stats(stats), builder(treeOp) {
assert((isa<ClockTreeOp, PassThroughOp, InitialOp>(treeOp)));
builder.setInsertionPointToStart(&treeOp->getRegion(0).front());
}
Value materializeValue(Value value);
Value getOrCreateAnd(Value lhs, Value rhs, Location loc);
Value getOrCreateOr(Value lhs, Value rhs, Location loc);
bool isInitialTree() const { return isa<InitialOp>(treeOp); }
};
struct GatedClockLowering {
/// Lowering info of the primary clock.
ClockLowering &clock;
/// An optional enable condition of the primary clock. May be null.
Value enable;
};
/// State lowering for a single `HWModuleOp`.
struct ModuleLowering {
HWModuleOp moduleOp;
/// Pass statistics.
Statistics &stats;
MLIRContext *context;
DenseMap<Value, std::unique_ptr<ClockLowering>> clockLowerings;
DenseMap<Value, GatedClockLowering> gatedClockLowerings;
std::unique_ptr<ClockLowering> initialLowering;
Value storageArg;
OpBuilder clockBuilder;
OpBuilder stateBuilder;
ModuleLowering(HWModuleOp moduleOp, Statistics &stats)
: moduleOp(moduleOp), stats(stats), context(moduleOp.getContext()),
clockBuilder(moduleOp), stateBuilder(moduleOp) {}
GatedClockLowering getOrCreateClockLowering(Value clock);
ClockLowering &getOrCreatePassThrough();
ClockLowering &getInitial();
Value replaceValueWithStateRead(Value value, Value state);
void addStorageArg();
LogicalResult lowerPrimaryInputs();
LogicalResult lowerPrimaryOutputs();
LogicalResult lowerStates();
LogicalResult lowerInitials();
template <typename CallTy>
LogicalResult lowerStateLike(Operation *op, Value clock, Value enable,
Value reset, ArrayRef<Value> inputs,
FlatSymbolRefAttr callee,
ArrayRef<Value> initialValues = {});
LogicalResult lowerState(StateOp stateOp);
LogicalResult lowerState(sim::DPICallOp dpiCallOp);
LogicalResult lowerState(MemoryOp memOp);
LogicalResult lowerState(MemoryWritePortOp memWriteOp);
LogicalResult lowerState(TapOp tapOp);
LogicalResult lowerExtModules(SymbolTable &symtbl);
LogicalResult lowerExtModule(InstanceOp instOp);
LogicalResult cleanup();
};
} // namespace
//===----------------------------------------------------------------------===//
// Clock Lowering
//===----------------------------------------------------------------------===//
static bool shouldMaterialize(Operation *op) {
// Don't materialize arc uses with latency >0, since we handle these in a
// second pass once all other operations have been moved to their respective
// clock trees.
return !isa<MemoryOp, AllocStateOp, AllocMemoryOp, AllocStorageOp,
ClockTreeOp, PassThroughOp, RootInputOp, RootOutputOp,
StateWriteOp, MemoryWritePortOp, igraph::InstanceOpInterface,
StateOp, sim::DPICallOp>(op);
}
static bool shouldMaterialize(Value value) {
assert(value);
// Block arguments are just used as they are.
auto *op = value.getDefiningOp();
if (!op)
return false;
return shouldMaterialize(op);
}
static bool canBeMaterializedInInitializer(Operation *op) {
if (!op)
return false;
if (op->hasTrait<OpTrait::ConstantLike>())
return true;
if (isa<comb::CombDialect>(op->getDialect()))
return true;
if (isa<mlir::UnrealizedConversionCastOp>(op))
return true;
// TODO: There are some other ops we probably want to allow
return false;
}
/// Materialize a value within this clock tree. This clones or moves all
/// operations required to produce this value inside the clock tree.
Value ClockLowering::materializeValue(Value value) {
if (!value)
return {};
if (auto mapped = materializedValues.lookupOrNull(value))
return mapped;
if (auto fromImmutable = value.getDefiningOp<seq::FromImmutableOp>())
// Immutable value is pre-materialized so directly lookup the input.
return materializedValues.lookup(fromImmutable.getInput());
if (!shouldMaterialize(value))
return value;
struct WorkItem {
Operation *op;
SmallVector<Value, 2> operands;
WorkItem(Operation *op) : op(op) {}
};
SmallPtrSet<Operation *, 8> seen;
SmallVector<WorkItem> worklist;
auto addToWorklist = [&](Operation *outerOp) {
SmallDenseSet<Value> seenOperands;
auto &workItem = worklist.emplace_back(outerOp);
outerOp->walk([&](Operation *innerOp) {
for (auto operand : innerOp->getOperands()) {
// Skip operands that are defined within the operation itself.
if (!operand.getParentBlock()->getParentOp()->isProperAncestor(outerOp))
continue;
// Skip operands that we have already seen.
if (!seenOperands.insert(operand).second)
continue;
// Skip operands that we have already materialized or that should not
// be materialized at all.
if (materializedValues.contains(operand) || !shouldMaterialize(operand))
continue;
workItem.operands.push_back(operand);
}
});
};
seen.insert(value.getDefiningOp());
addToWorklist(value.getDefiningOp());
while (!worklist.empty()) {
auto &workItem = worklist.back();
if (isInitialTree() && !canBeMaterializedInInitializer(workItem.op)) {
workItem.op->emitError("Value cannot be used in initializer.");
return {};
}
if (!workItem.operands.empty()) {
auto operand = workItem.operands.pop_back_val();
if (materializedValues.contains(operand) || !shouldMaterialize(operand))
continue;
auto *defOp = operand.getDefiningOp();
if (!seen.insert(defOp).second) {
defOp->emitError("combinational loop detected");
return {};
}
addToWorklist(defOp);
} else {
builder.clone(*workItem.op, materializedValues);
seen.erase(workItem.op);
worklist.pop_back();
}
}
return materializedValues.lookup(value);
}
/// Create an AND gate if none with the given operands already exists. Note that
/// the operands may be null, in which case the function will return the
/// non-null operand, or null if both operands are null.
Value ClockLowering::getOrCreateAnd(Value lhs, Value rhs, Location loc) {
if (!lhs)
return rhs;
if (!rhs)
return lhs;
auto &slot = andCache[std::make_pair(lhs, rhs)];
if (!slot)
slot = builder.create<comb::AndOp>(loc, lhs, rhs);
return slot;
}
/// Create an OR gate if none with the given operands already exists. Note that
/// the operands may be null, in which case the function will return the
/// non-null operand, or null if both operands are null.
Value ClockLowering::getOrCreateOr(Value lhs, Value rhs, Location loc) {
if (!lhs)
return rhs;
if (!rhs)
return lhs;
auto &slot = orCache[std::make_pair(lhs, rhs)];
if (!slot)
slot = builder.create<comb::OrOp>(loc, lhs, rhs);
return slot;
}
//===----------------------------------------------------------------------===//
// Module Lowering
//===----------------------------------------------------------------------===//
GatedClockLowering ModuleLowering::getOrCreateClockLowering(Value clock) {
// Look through clock gates.
if (auto ckgOp = clock.getDefiningOp<seq::ClockGateOp>()) {
// Reuse the existing lowering for this clock gate if possible.
if (auto it = gatedClockLowerings.find(clock);
it != gatedClockLowerings.end())
return it->second;
// Get the lowering for the parent clock gate's input clock. This will give
// us the clock tree to emit things into, alongside the compound enable
// condition of all the clock gates along the way to the primary clock. All
// we have to do is to add this clock gate's condition to that list.
auto info = getOrCreateClockLowering(ckgOp.getInput());
auto ckgEnable = info.clock.materializeValue(ckgOp.getEnable());
auto ckgTestEnable = info.clock.materializeValue(ckgOp.getTestEnable());
info.enable = info.clock.getOrCreateAnd(
info.enable,
info.clock.getOrCreateOr(ckgEnable, ckgTestEnable, ckgOp.getLoc()),
ckgOp.getLoc());
gatedClockLowerings.insert({clock, info});
return info;
}
// Create the `ClockTreeOp` that corresponds to this ungated clock.
auto &slot = clockLowerings[clock];
if (!slot) {
auto newClock =
clockBuilder.createOrFold<seq::FromClockOp>(clock.getLoc(), clock);
// Detect a rising edge on the clock, as `(old != new) & new`.
auto oldClockStorage = stateBuilder.create<AllocStateOp>(
clock.getLoc(), StateType::get(stateBuilder.getI1Type()), storageArg);
auto oldClock =
clockBuilder.create<StateReadOp>(clock.getLoc(), oldClockStorage);
clockBuilder.create<StateWriteOp>(clock.getLoc(), oldClockStorage, newClock,
Value{});
Value trigger = clockBuilder.create<comb::ICmpOp>(
clock.getLoc(), comb::ICmpPredicate::ne, oldClock, newClock);
trigger =
clockBuilder.create<comb::AndOp>(clock.getLoc(), trigger, newClock);
// Create the tree op.
auto treeOp = clockBuilder.create<ClockTreeOp>(clock.getLoc(), trigger);
treeOp.getBody().emplaceBlock();
slot = std::make_unique<ClockLowering>(clock, treeOp, stats);
}
return GatedClockLowering{*slot, Value{}};
}
ClockLowering &ModuleLowering::getOrCreatePassThrough() {
auto &slot = clockLowerings[Value{}];
if (!slot) {
auto treeOp = clockBuilder.create<PassThroughOp>(moduleOp.getLoc());
treeOp.getBody().emplaceBlock();
slot = std::make_unique<ClockLowering>(Value{}, treeOp, stats);
}
return *slot;
}
ClockLowering &ModuleLowering::getInitial() {
assert(!!initialLowering && "Initial tree op should have been constructed");
return *initialLowering;
}
/// Replace all uses of a value with a `StateReadOp` on a state.
Value ModuleLowering::replaceValueWithStateRead(Value value, Value state) {
OpBuilder builder(state.getContext());
builder.setInsertionPointAfterValue(state);
Value readOp = builder.create<StateReadOp>(value.getLoc(), state);
if (isa<seq::ClockType>(value.getType()))
readOp = builder.createOrFold<seq::ToClockOp>(value.getLoc(), readOp);
value.replaceAllUsesWith(readOp);
return readOp;
}
/// Add the global state as an argument to the module's body block.
void ModuleLowering::addStorageArg() {
assert(!storageArg);
storageArg = moduleOp.getBodyBlock()->addArgument(
StorageType::get(context, {}), moduleOp.getLoc());
}
/// Lower the primary inputs of the module to dedicated ops that allocate the
/// inputs in the model's storage.
LogicalResult ModuleLowering::lowerPrimaryInputs() {
for (auto blockArg : moduleOp.getBodyBlock()->getArguments()) {
if (blockArg == storageArg)
continue;
auto name = moduleOp.getArgName(blockArg.getArgNumber());
auto argTy = blockArg.getType();
IntegerType innerTy;
if (isa<seq::ClockType>(argTy)) {
innerTy = IntegerType::get(context, 1);
} else if (auto intType = dyn_cast<IntegerType>(argTy)) {
innerTy = intType;
} else {
return mlir::emitError(blockArg.getLoc(), "input ")
<< name << " is of non-integer type " << blockArg.getType();
}
auto state = stateBuilder.create<RootInputOp>(
blockArg.getLoc(), StateType::get(innerTy), name, storageArg);
replaceValueWithStateRead(blockArg, state);
}
return success();
}
/// Lower the primary outputs of the module to dedicated ops that allocate the
/// outputs in the model's storage.
LogicalResult ModuleLowering::lowerPrimaryOutputs() {
auto outputOp = cast<hw::OutputOp>(moduleOp.getBodyBlock()->getTerminator());
if (outputOp.getNumOperands() > 0) {
auto outputOperands = SmallVector<Value>(outputOp.getOperands());
outputOp->dropAllReferences();
auto &passThrough = getOrCreatePassThrough();
for (auto [outputArg, name] :
llvm::zip(outputOperands, moduleOp.getOutputNames())) {
IntegerType innerTy;
if (isa<seq::ClockType>(outputArg.getType())) {
innerTy = IntegerType::get(context, 1);
} else if (auto intType = dyn_cast<IntegerType>(outputArg.getType())) {
innerTy = intType;
} else {
return mlir::emitError(outputOp.getLoc(), "output ")
<< name << " is of non-integer type " << outputArg.getType();
}
auto value = passThrough.materializeValue(outputArg);
auto state = stateBuilder.create<RootOutputOp>(
outputOp.getLoc(), StateType::get(innerTy), cast<StringAttr>(name),
storageArg);
if (isa<seq::ClockType>(value.getType()))
value = passThrough.builder.createOrFold<seq::FromClockOp>(
outputOp.getLoc(), value);
passThrough.builder.create<StateWriteOp>(outputOp.getLoc(), state, value,
Value{});
}
}
outputOp.erase();
return success();
}
LogicalResult ModuleLowering::lowerInitials() {
// Merge all seq.initial ops into a single seq.initial op.
auto result = circt::seq::mergeInitialOps(moduleOp.getBodyBlock());
if (failed(result))
return moduleOp.emitError() << "initial ops cannot be topologically sorted";
auto initialOp = *result;
if (!initialOp) // There is no seq.initial op.
return success();
// Move the operations of the merged initial op into the builder's block.
auto terminator =
cast<seq::YieldOp>(initialOp.getBodyBlock()->getTerminator());
getInitial().builder.getBlock()->getOperations().splice(
getInitial().builder.getBlock()->begin(),
initialOp.getBodyBlock()->getOperations());
// Map seq.initial results to their corresponding operands.
for (auto [result, operand] :
llvm::zip(initialOp.getResults(), terminator.getOperands()))
getInitial().materializedValues.map(result, operand);
terminator.erase();
return success();
}
LogicalResult ModuleLowering::lowerStates() {
SmallVector<Operation *> opsToLower;
for (auto &op : *moduleOp.getBodyBlock())
if (isa<StateOp, MemoryOp, MemoryWritePortOp, TapOp, sim::DPICallOp>(&op))
opsToLower.push_back(&op);
for (auto *op : opsToLower) {
LLVM_DEBUG(llvm::dbgs() << "- Lowering " << *op << "\n");
auto result =
TypeSwitch<Operation *, LogicalResult>(op)
.Case<StateOp, MemoryOp, MemoryWritePortOp, TapOp, sim::DPICallOp>(
[&](auto op) { return lowerState(op); })
.Default(success());
if (failed(result))
return failure();
}
return success();
}
template <typename CallOpTy>
LogicalResult ModuleLowering::lowerStateLike(
Operation *stateOp, Value stateClock, Value stateEnable, Value stateReset,
ArrayRef<Value> stateInputs, FlatSymbolRefAttr callee,
ArrayRef<Value> initialValues) {
// Grab all operands from the state op at the callsite and make it drop all
// its references. This allows `materializeValue` to move an operation if this
// state was the last user.
// Get the clock tree and enable condition for this state's clock. If this arc
// carries an explicit enable condition, fold that into the enable provided by
// the clock gates in the arc's clock tree.
auto info = getOrCreateClockLowering(stateClock);
info.enable = info.clock.getOrCreateAnd(
info.enable, info.clock.materializeValue(stateEnable), stateOp->getLoc());
// Allocate the necessary state within the model.
SmallVector<Value> allocatedStates;
for (unsigned stateIdx = 0; stateIdx < stateOp->getNumResults(); ++stateIdx) {
auto type = stateOp->getResult(stateIdx).getType();
auto intType = dyn_cast<IntegerType>(type);
if (!intType)
return stateOp->emitOpError("result ")
<< stateIdx << " has non-integer type " << type
<< "; only integer types are supported";
auto stateType = StateType::get(intType);
auto state = stateBuilder.create<AllocStateOp>(stateOp->getLoc(), stateType,
storageArg);
if (auto names = stateOp->getAttrOfType<ArrayAttr>("names"))
state->setAttr("name", names[stateIdx]);
allocatedStates.push_back(state);
}
// Create a copy of the arc use with latency zero. This will effectively be
// the computation of the arc's transfer function, while the latency is
// implemented through read and write functions.
SmallVector<Value> materializedOperands;
materializedOperands.reserve(stateInputs.size());
for (auto input : stateInputs)
materializedOperands.push_back(info.clock.materializeValue(input));
OpBuilder nonResetBuilder = info.clock.builder;
if (stateReset) {
auto materializedReset = info.clock.materializeValue(stateReset);
auto ifOp = info.clock.builder.create<scf::IfOp>(stateOp->getLoc(),
materializedReset, true);
for (auto [alloc, resTy] :
llvm::zip(allocatedStates, stateOp->getResultTypes())) {
if (!isa<IntegerType>(resTy))
stateOp->emitOpError("Non-integer result not supported yet!");
auto thenBuilder = ifOp.getThenBodyBuilder();
Value constZero =
thenBuilder.create<hw::ConstantOp>(stateOp->getLoc(), resTy, 0);
thenBuilder.create<StateWriteOp>(stateOp->getLoc(), alloc, constZero,
Value());
}
nonResetBuilder = ifOp.getElseBodyBuilder();
}
if (!initialValues.empty()) {
assert(initialValues.size() == allocatedStates.size() &&
"Unexpected number of initializers");
auto &initialTree = getInitial();
for (auto [alloc, init] : llvm::zip(allocatedStates, initialValues)) {
// TODO: Can we get away without materialization?
auto matierializedInit = initialTree.materializeValue(init);
if (!matierializedInit)
return failure();
initialTree.builder.create<StateWriteOp>(stateOp->getLoc(), alloc,
matierializedInit, Value());
}
}
stateOp->dropAllReferences();
auto newStateOp = nonResetBuilder.create<CallOpTy>(
stateOp->getLoc(), stateOp->getResultTypes(), callee,
materializedOperands);
// Create the write ops that write the result of the transfer function to the
// allocated state storage.
for (auto [alloc, result] :
llvm::zip(allocatedStates, newStateOp.getResults()))
nonResetBuilder.create<StateWriteOp>(stateOp->getLoc(), alloc, result,
info.enable);
// Replace all uses of the arc with reads from the allocated state.
for (auto [alloc, result] : llvm::zip(allocatedStates, stateOp->getResults()))
replaceValueWithStateRead(result, alloc);
stateOp->erase();
return success();
}
LogicalResult ModuleLowering::lowerState(StateOp stateOp) {
// We don't support arcs beyond latency 1 yet. These should be easy to add in
// the future though.
if (stateOp.getLatency() > 1)
return stateOp.emitError("state with latency > 1 not supported");
auto stateInputs = SmallVector<Value>(stateOp.getInputs());
auto stateInitializers = SmallVector<Value>(stateOp.getInitials());
return lowerStateLike<arc::CallOp>(
stateOp, stateOp.getClock(), stateOp.getEnable(), stateOp.getReset(),
stateInputs, stateOp.getArcAttr(), stateInitializers);
}
LogicalResult ModuleLowering::lowerState(sim::DPICallOp callOp) {
// Clocked call op can be considered as arc state with single latency.
auto stateClock = callOp.getClock();
if (!stateClock)
return callOp.emitError("unclocked DPI call not implemented yet");
auto stateInputs = SmallVector<Value>(callOp.getInputs());
return lowerStateLike<func::CallOp>(callOp, stateClock, callOp.getEnable(),
Value(), stateInputs,
callOp.getCalleeAttr());
}
LogicalResult ModuleLowering::lowerState(MemoryOp memOp) {
auto allocMemOp = stateBuilder.create<AllocMemoryOp>(
memOp.getLoc(), memOp.getType(), storageArg, memOp->getAttrs());
memOp.replaceAllUsesWith(allocMemOp.getResult());
memOp.erase();
return success();
}
LogicalResult ModuleLowering::lowerState(MemoryWritePortOp memWriteOp) {
if (memWriteOp.getLatency() > 1)
return memWriteOp->emitOpError("latencies > 1 not supported yet");
// Get the clock tree and enable condition for this write port's clock. If the
// port carries an explicit enable condition, fold that into the enable
// provided by the clock gates in the port's clock tree.
auto info = getOrCreateClockLowering(memWriteOp.getClock());
// Grab all operands from the op and make it drop all its references. This
// allows `materializeValue` to move an operation if this op was the last
// user.
auto writeMemory = memWriteOp.getMemory();
auto writeInputs = SmallVector<Value>(memWriteOp.getInputs());
auto arcResultTypes = memWriteOp.getArcResultTypes();
memWriteOp->dropAllReferences();
SmallVector<Value> materializedInputs;
for (auto input : writeInputs)
materializedInputs.push_back(info.clock.materializeValue(input));
ValueRange results =
info.clock.builder
.create<CallOp>(memWriteOp.getLoc(), arcResultTypes,
memWriteOp.getArc(), materializedInputs)
->getResults();
auto enable =
memWriteOp.getEnable() ? results[memWriteOp.getEnableIdx()] : Value();
info.enable =
info.clock.getOrCreateAnd(info.enable, enable, memWriteOp.getLoc());
// Materialize the operands for the write op within the surrounding clock
// tree.
auto address = results[memWriteOp.getAddressIdx()];
auto data = results[memWriteOp.getDataIdx()];
if (memWriteOp.getMask()) {
Value mask = results[memWriteOp.getMaskIdx(static_cast<bool>(enable))];
Value oldData = info.clock.builder.create<arc::MemoryReadOp>(
mask.getLoc(), data.getType(), writeMemory, address);
Value allOnes = info.clock.builder.create<hw::ConstantOp>(
mask.getLoc(), oldData.getType(), -1);
Value negatedMask = info.clock.builder.create<comb::XorOp>(
mask.getLoc(), mask, allOnes, true);
Value maskedOldData = info.clock.builder.create<comb::AndOp>(
mask.getLoc(), negatedMask, oldData, true);
Value maskedNewData =
info.clock.builder.create<comb::AndOp>(mask.getLoc(), mask, data, true);
data = info.clock.builder.create<comb::OrOp>(mask.getLoc(), maskedOldData,
maskedNewData, true);
}
info.clock.builder.create<MemoryWriteOp>(memWriteOp.getLoc(), writeMemory,
address, info.enable, data);
memWriteOp.erase();
return success();
}
// Add state for taps into the passthrough block.
LogicalResult ModuleLowering::lowerState(TapOp tapOp) {
auto intType = dyn_cast<IntegerType>(tapOp.getValue().getType());
if (!intType)
return mlir::emitError(tapOp.getLoc(), "tapped value ")
<< tapOp.getNameAttr() << " is of non-integer type "
<< tapOp.getValue().getType();
// Grab what we need from the tap op and then make it drop all its references.
// This will allow `materializeValue` to move ops instead of cloning them.
auto tapValue = tapOp.getValue();
tapOp->dropAllReferences();
auto &passThrough = getOrCreatePassThrough();
auto materializedValue = passThrough.materializeValue(tapValue);
auto state = stateBuilder.create<AllocStateOp>(
tapOp.getLoc(), StateType::get(intType), storageArg, true);
state->setAttr("name", tapOp.getNameAttr());
passThrough.builder.create<StateWriteOp>(tapOp.getLoc(), state,
materializedValue, Value{});
tapOp.erase();
return success();
}
/// Lower all instances of external modules to internal inputs/outputs to be
/// driven from outside of the design.
LogicalResult ModuleLowering::lowerExtModules(SymbolTable &symtbl) {
auto instOps = SmallVector<InstanceOp>(moduleOp.getOps<InstanceOp>());
for (auto op : instOps)
if (isa<HWModuleExternOp>(symtbl.lookup(op.getModuleNameAttr().getAttr())))
if (failed(lowerExtModule(op)))
return failure();
return success();
}
LogicalResult ModuleLowering::lowerExtModule(InstanceOp instOp) {
LLVM_DEBUG(llvm::dbgs() << "- Lowering extmodule "
<< instOp.getInstanceNameAttr() << "\n");
SmallString<32> baseName(instOp.getInstanceName());
auto baseNameLen = baseName.size();
// Lower the inputs of the extmodule as state that is only written.
for (auto [operand, name] :
llvm::zip(instOp.getOperands(), instOp.getArgNames())) {
LLVM_DEBUG(llvm::dbgs()
<< " - Input " << name << " : " << operand.getType() << "\n");
auto intType = dyn_cast<IntegerType>(operand.getType());
if (!intType)
return mlir::emitError(operand.getLoc(), "input ")
<< name << " of extern module " << instOp.getModuleNameAttr()
<< " instance " << instOp.getInstanceNameAttr()
<< " is of non-integer type " << operand.getType();
baseName.resize(baseNameLen);
baseName += '/';
baseName += cast<StringAttr>(name).getValue();
auto &passThrough = getOrCreatePassThrough();
auto state = stateBuilder.create<AllocStateOp>(
instOp.getLoc(), StateType::get(intType), storageArg);
state->setAttr("name", stateBuilder.getStringAttr(baseName));
passThrough.builder.create<StateWriteOp>(
instOp.getLoc(), state, passThrough.materializeValue(operand), Value{});
}
// Lower the outputs of the extmodule as state that is only read.
for (auto [result, name] :
llvm::zip(instOp.getResults(), instOp.getResultNames())) {
LLVM_DEBUG(llvm::dbgs()
<< " - Output " << name << " : " << result.getType() << "\n");
auto intType = dyn_cast<IntegerType>(result.getType());
if (!intType)
return mlir::emitError(result.getLoc(), "output ")
<< name << " of extern module " << instOp.getModuleNameAttr()
<< " instance " << instOp.getInstanceNameAttr()
<< " is of non-integer type " << result.getType();
baseName.resize(baseNameLen);
baseName += '/';
baseName += cast<StringAttr>(name).getValue();
auto state = stateBuilder.create<AllocStateOp>(
result.getLoc(), StateType::get(intType), storageArg);
state->setAttr("name", stateBuilder.getStringAttr(baseName));
replaceValueWithStateRead(result, state);
}
instOp.erase();
return success();
}
LogicalResult ModuleLowering::cleanup() {
// Clean up dead ops in the model.
SetVector<Operation *> erasureWorklist;
auto isDead = [](Operation *op) {
if (isOpTriviallyDead(op))
return true;
if (!op->use_empty())
return false;
// Unused ops may still have side effects; only treat them as dead if
// removing their uses would render them trivially dead.
return wouldOpBeTriviallyDead(op);
};
for (auto &op : *moduleOp.getBodyBlock())
if (isDead(&op))
erasureWorklist.insert(&op);
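// Iteratively erase dead ops. Erasing an op may render the ops defining its
// operands dead as well, so requeue those for another look.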
while (!erasureWorklist.empty()) {
auto *op = erasureWorklist.pop_back_val();
if (!isDead(op))
continue;
op->walk([&](Operation *innerOp) {
for (auto operand : innerOp->getOperands())
if (auto *defOp = operand.getDefiningOp())
if (!op->isProperAncestor(defOp))
erasureWorklist.insert(defOp);
});
op->erase();
}
// Establish an order among all operations (to avoid an O(n²) pathological
// pattern with `moveBefore`) and replicate read operations into the blocks
// where they have uses. The established order is used to create the read
// operation as late in the block as possible, just before the first use.
DenseMap<Operation *, unsigned> opOrder;
SmallVector<StateReadOp, 0> readsToSink;
moduleOp.walk([&](Operation *op) {
opOrder.insert({op, opOrder.size()});
if (auto readOp = dyn_cast<StateReadOp>(op))
readsToSink.push_back(readOp);
});
for (auto readToSink : readsToSink) {
SmallDenseMap<Block *, std::pair<StateReadOp, unsigned>> readsByBlock;
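// Keep at most one read per block and make sure it sits before the earliest
// of its users, according to the order established above.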
for (auto &use : llvm::make_early_inc_range(readToSink->getUses())) {
auto *user = use.getOwner();
auto userOrder = opOrder.lookup(user);
auto &localRead = readsByBlock[user->getBlock()];
if (!localRead.first) {
if (user->getBlock() == readToSink->getBlock()) {
localRead.first = readToSink;
readToSink->moveBefore(user);
} else {
localRead.first = OpBuilder(user).cloneWithoutRegions(readToSink);
}
localRead.second = userOrder;
} else if (userOrder < localRead.second) {
localRead.first->moveBefore(user);
localRead.second = userOrder;
}
use.set(localRead.first);
}
if (readToSink.use_empty())
readToSink.erase();
}
return success();
}

//===----------------------------------------------------------------------===//
// Pass Infrastructure
//===----------------------------------------------------------------------===//

namespace {
struct LowerStatePass : public arc::impl::LowerStateBase<LowerStatePass> {
LowerStatePass() = default;
LowerStatePass(const LowerStatePass &pass) : LowerStatePass() {}
void runOnOperation() override;
LogicalResult runOnModule(HWModuleOp moduleOp, SymbolTable &symtbl);
Statistics stats{this};
};
} // namespace
void LowerStatePass::runOnOperation() {
auto &symtbl = getAnalysis<SymbolTable>();
SmallVector<HWModuleExternOp> extModules;
for (auto &op : llvm::make_early_inc_range(getOperation().getOps())) {
if (auto moduleOp = dyn_cast<HWModuleOp>(&op)) {
if (failed(runOnModule(moduleOp, symtbl)))
return signalPassFailure();
} else if (auto extModuleOp = dyn_cast<HWModuleExternOp>(&op)) {
extModules.push_back(extModuleOp);
}
}
for (auto op : extModules)
op.erase();
// Lower the remaining MemoryReadPort ops to MemoryRead ops. Such leftovers can
// occur when the fan-in of a MemoryReadPortOp contains another such op and the
// dependent op is materialized before the one in its fan-in, because
// MemoryReadPortOp is not marked as a fan-in blocking/termination operation in
// `shouldMaterialize`. Marking it as such could instead lead to dominance
// issues that would then have to be resolved.
SetVector<DefineOp> arcsToLower;
OpBuilder builder(getOperation());
getOperation()->walk([&](MemoryReadPortOp memReadOp) {
if (auto defOp = memReadOp->getParentOfType<DefineOp>())
arcsToLower.insert(defOp);
builder.setInsertionPoint(memReadOp);
Value newRead = builder.create<MemoryReadOp>(
memReadOp.getLoc(), memReadOp.getMemory(), memReadOp.getAddress());
memReadOp.replaceAllUsesWith(newRead);
memReadOp.erase();
});
SymbolTableCollection symbolTable;
mlir::SymbolUserMap userMap(symbolTable, getOperation());
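// Arcs whose bodies now contain `arc.memory_read` ops can no longer remain
// `arc.define` ops; convert each of them into an internal `func.func` and
// rewrite its call sites into `func.call` ops.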
for (auto defOp : arcsToLower) {
auto *terminator = defOp.getBodyBlock().getTerminator();
builder.setInsertionPoint(terminator);
builder.create<func::ReturnOp>(terminator->getLoc(),
terminator->getOperands());
terminator->erase();
builder.setInsertionPoint(defOp);
auto funcOp = builder.create<func::FuncOp>(defOp.getLoc(), defOp.getName(),
defOp.getFunctionType());
funcOp->setAttr("llvm.linkage",
LLVM::LinkageAttr::get(builder.getContext(),
LLVM::linkage::Linkage::Internal));
funcOp.getBody().takeBody(defOp.getBody());
for (auto *user : userMap.getUsers(defOp)) {
builder.setInsertionPoint(user);
ValueRange results = builder
.create<func::CallOp>(
user->getLoc(), funcOp,
cast<CallOpInterface>(user).getArgOperands())
->getResults();
user->replaceAllUsesWith(results);
user->erase();
}
defOp.erase();
}
}
LogicalResult LowerStatePass::runOnModule(HWModuleOp moduleOp,
SymbolTable &symtbl) {
LLVM_DEBUG(llvm::dbgs() << "Lowering state in `" << moduleOp.getModuleName()
<< "`\n");
ModuleLowering lowering(moduleOp, stats);
// Add sentinel ops to separate state allocations from clock trees.
lowering.stateBuilder.setInsertionPointToStart(moduleOp.getBodyBlock());
Operation *stateSentinel =
lowering.stateBuilder.create<hw::OutputOp>(moduleOp.getLoc());
Operation *clockSentinel =
lowering.stateBuilder.create<hw::OutputOp>(moduleOp.getLoc());
// Create the 'initial' pseudo clock tree.
auto initialTreeOp =
lowering.stateBuilder.create<InitialOp>(moduleOp.getLoc());
initialTreeOp.getBody().emplaceBlock();
lowering.initialLowering =
std::make_unique<ClockLowering>(Value{}, initialTreeOp, stats);
lowering.stateBuilder.setInsertionPoint(stateSentinel);
lowering.clockBuilder.setInsertionPoint(clockSentinel);
lowering.addStorageArg();
if (failed(lowering.lowerInitials()))
return failure();
if (failed(lowering.lowerPrimaryInputs()))
return failure();
if (failed(lowering.lowerPrimaryOutputs()))
return failure();
if (failed(lowering.lowerStates()))
return failure();
if (failed(lowering.lowerExtModules(symtbl)))
return failure();
// Clean up the module body which contains a lot of operations that the
// pessimistic value materialization has left behind because it couldn't
// reliably determine that the ops were no longer needed.
if (failed(lowering.cleanup()))
return failure();
// Erase the sentinel ops.
stateSentinel->erase();
clockSentinel->erase();
// Replace the `HWModuleOp` with a `ModelOp`.
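// Drop all block arguments except the storage argument; the module's ports
// have already been lowered to storage accesses at this point.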
moduleOp.getBodyBlock()->eraseArguments(
[&](auto arg) { return arg != lowering.storageArg; });
ImplicitLocOpBuilder builder(moduleOp.getLoc(), moduleOp);
auto modelOp =
builder.create<ModelOp>(moduleOp.getLoc(), moduleOp.getModuleNameAttr(),
TypeAttr::get(moduleOp.getModuleType()),
FlatSymbolRefAttr{}, FlatSymbolRefAttr{});
modelOp.getBody().takeBody(moduleOp.getBody());
moduleOp->erase();
sortTopologically(&modelOp.getBodyBlock());
return success();
}
std::unique_ptr<Pass> arc::createLowerStatePass() {
return std::make_unique<LowerStatePass>();
}

File diff suppressed because it is too large.


@@ -1,22 +0,0 @@
// RUN: circt-opt %s --arc-legalize-state-update --split-input-file --verify-diagnostics
arc.model @Memory io !hw.modty<> {
^bb0(%arg0: !arc.storage):
%false = hw.constant false
%mem1 = arc.alloc_memory %arg0 : (!arc.storage) -> !arc.memory<2 x i32, i1>
%mem2 = arc.alloc_memory %arg0 : (!arc.storage) -> !arc.memory<2 x i32, i1>
%s1 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i32>
arc.clock_tree %false attributes {ct4} {
%r1 = arc.state_read %s1 : <i32>
scf.if %false {
// expected-error @+1 {{could not be moved to be after all reads to the same memory}}
arc.memory_write %mem2[%false], %r1 : <2 x i32, i1>
%mr1 = arc.memory_read %mem1[%false] : <2 x i32, i1>
}
scf.if %false {
arc.memory_write %mem1[%false], %r1 : <2 x i32, i1>
// expected-note @+1 {{could not be moved after this read}}
%mr1 = arc.memory_read %mem2[%false] : <2 x i32, i1>
}
}
}


@@ -1,253 +0,0 @@
// RUN: circt-opt %s --arc-legalize-state-update | FileCheck %s
// CHECK-LABEL: func.func @Unaffected
func.func @Unaffected(%arg0: !arc.storage, %arg1: i4) -> i4 {
%0 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
%1 = arc.state_read %0 : <i4>
arc.state_write %0 = %arg1 : <i4>
return %1 : i4
// CHECK-NEXT: arc.alloc_state
// CHECK-NEXT: arc.state_read
// CHECK-NEXT: arc.state_write
// CHECK-NEXT: return
}
// CHECK-NEXT: }
// CHECK-LABEL: func.func @SameBlock
func.func @SameBlock(%arg0: !arc.storage, %arg1: i4) -> i4 {
%0 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
%1 = arc.state_read %0 : <i4>
// CHECK-NEXT: [[STATE:%.+]] = arc.alloc_state
// CHECK-NEXT: arc.state_read [[STATE]]
arc.state_write %0 = %arg1 : <i4>
// CHECK-NEXT: [[TMP:%.+]] = arc.alloc_state
// CHECK-NEXT: [[CURRENT:%.+]] = arc.state_read [[STATE]]
// CHECK-NEXT: arc.state_write [[TMP]] = [[CURRENT]]
// CHECK-NEXT: arc.state_write [[STATE]] = %arg1
%2 = arc.state_read %0 : <i4>
%3 = arc.state_read %0 : <i4>
%4 = comb.xor %1, %2, %3 : i4
return %4 : i4
// CHECK-NEXT: arc.state_read [[TMP]]
// CHECK-NEXT: arc.state_read [[TMP]]
// CHECK-NEXT: comb.xor
// CHECK-NEXT: return
}
// CHECK-NEXT: }
// CHECK-LABEL: func.func @FuncLegal
func.func @FuncLegal(%arg0: !arc.storage, %arg1: i4) -> i4 {
%0 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
%1 = call @ReadFunc(%0) : (!arc.state<i4>) -> i4
call @WriteFunc(%0, %arg1) : (!arc.state<i4>, i4) -> ()
return %1 : i4
// CHECK-NEXT: arc.alloc_state
// CHECK-NEXT: call @ReadFunc
// CHECK-NEXT: call @WriteFunc
// CHECK-NEXT: return
}
// CHECK-NEXT: }
// CHECK-LABEL: func.func @FuncIllegal
func.func @FuncIllegal(%arg0: !arc.storage, %arg1: i4) -> i4 {
%0 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
%1 = call @ReadFunc(%0) : (!arc.state<i4>) -> i4
// CHECK-NEXT: [[STATE:%.+]] = arc.alloc_state
// CHECK-NEXT: call @ReadFunc
call @WriteFunc(%0, %arg1) : (!arc.state<i4>, i4) -> ()
// CHECK-NEXT: [[TMP:%.+]] = arc.alloc_state
// CHECK-NEXT: [[CURRENT:%.+]] = arc.state_read [[STATE]]
// CHECK-NEXT: arc.state_write [[TMP]] = [[CURRENT]]
// CHECK-NEXT: call @WriteFunc
%2 = call @ReadFunc(%0) : (!arc.state<i4>) -> i4
%3 = call @ReadFunc(%0) : (!arc.state<i4>) -> i4
%4 = comb.xor %1, %2, %3 : i4
return %4 : i4
// CHECK-NEXT: call @ReadFunc([[TMP]])
// CHECK-NEXT: call @ReadFunc([[TMP]])
// CHECK-NEXT: comb.xor
// CHECK-NEXT: return
}
// CHECK-NEXT: }
// CHECK-LABEL: func.func @NestedBlocks
func.func @NestedBlocks(%arg0: !arc.storage, %arg1: i4) -> i4 {
%0 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
%11 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i4>
// CHECK-NEXT: [[S0:%.+]] = arc.alloc_state
// CHECK-NEXT: [[S1:%.+]] = arc.alloc_state
// CHECK-NEXT: scf.execute_region
%10 = scf.execute_region -> i4 {
// CHECK-NEXT: [[TMP1:%.+]] = arc.alloc_state
// CHECK-NEXT: [[CURRENT:%.+]] = arc.state_read [[S1]]
// CHECK-NEXT: arc.state_write [[TMP1]] = [[CURRENT]]
// CHECK-NEXT: [[TMP0:%.+]] = arc.alloc_state
// CHECK-NEXT: [[CURRENT:%.+]] = arc.state_read [[S0]]
// CHECK-NEXT: arc.state_write [[TMP0]] = [[CURRENT]]
// CHECK-NEXT: scf.execute_region
%3 = scf.execute_region -> i4 {
// CHECK-NEXT: scf.execute_region
%1 = scf.execute_region -> i4 {
%2 = arc.state_read %0 : <i4>
scf.yield %2 : i4
// CHECK-NEXT: arc.state_read [[TMP0]]
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
// CHECK-NEXT: scf.execute_region
scf.execute_region {
arc.state_write %0 = %arg1 : <i4>
arc.state_write %11 = %arg1 : <i4>
scf.yield
// CHECK-NEXT: arc.state_write [[S0]]
// CHECK-NEXT: arc.state_write [[S1]]
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
scf.yield %1 : i4
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
func.call @WriteFunc(%0, %arg1) : (!arc.state<i4>, i4) -> ()
// CHECK-NEXT: func.call @WriteFunc([[S0]], %arg1)
// CHECK-NEXT: scf.execute_region
%7, %8 = scf.execute_region -> (i4, i4) {
// CHECK-NEXT: scf.execute_region
%4 = scf.execute_region -> i4 {
%5 = func.call @ReadFunc(%0) : (!arc.state<i4>) -> i4
scf.yield %5 : i4
// CHECK-NEXT: func.call @ReadFunc([[TMP0]])
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
%6 = arc.state_read %0 : <i4>
%12 = arc.state_read %11 : <i4>
scf.yield %4, %6 : i4, i4
// CHECK-NEXT: arc.state_read [[TMP0]]
// CHECK-NEXT: arc.state_read [[TMP1]]
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
%9 = comb.xor %3, %7, %8 : i4
scf.yield %9 : i4
// CHECK-NEXT: comb.xor
// CHECK-NEXT: scf.yield
}
// CHECK-NEXT: }
return %10 : i4
// CHECK-NEXT: return
}
func.func @ReadFunc(%arg0: !arc.state<i4>) -> i4 {
%0 = func.call @InnerReadFunc(%arg0) : (!arc.state<i4>) -> i4
return %0 : i4
}
func.func @WriteFunc(%arg0: !arc.state<i4>, %arg1: i4) {
func.call @InnerWriteFunc(%arg0, %arg1) : (!arc.state<i4>, i4) -> ()
return
}
func.func @InnerReadFunc(%arg0: !arc.state<i4>) -> i4 {
%0 = arc.state_read %arg0 : <i4>
return %0 : i4
}
func.func @InnerWriteFunc(%arg0: !arc.state<i4>, %arg1: i4) {
arc.state_write %arg0 = %arg1 : <i4>
return
}
// State legalization should not happen across clock trees and passthrough ops.
// CHECK-LABEL: arc.model @DontLeakThroughClockTreeOrPassthrough
arc.model @DontLeakThroughClockTreeOrPassthrough io !hw.modty<input a : i1, output b : i1> {
^bb0(%arg0: !arc.storage):
%false = hw.constant false
%in_a = arc.root_input "a", %arg0 : (!arc.storage) -> !arc.state<i1>
%out_b = arc.root_output "b", %arg0 : (!arc.storage) -> !arc.state<i1>
// CHECK: arc.alloc_state %arg0 {foo}
%0 = arc.alloc_state %arg0 {foo} : (!arc.storage) -> !arc.state<i1>
// CHECK-NOT: arc.alloc_state
// CHECK-NOT: arc.state_read
// CHECK-NOT: arc.state_write
// CHECK: arc.clock_tree
arc.clock_tree %false {
%1 = arc.state_read %in_a : <i1>
arc.state_write %0 = %1 : <i1>
}
// CHECK: arc.passthrough
arc.passthrough {
%1 = arc.state_read %0 : <i1>
arc.state_write %out_b = %1 : <i1>
}
}
// CHECK-LABEL: arc.model @Memory
arc.model @Memory io !hw.modty<> {
^bb0(%arg0: !arc.storage):
%false = hw.constant false
// CHECK: [[MEM1:%.+]] = arc.alloc_memory %arg0 :
// CHECK: [[MEM2:%.+]] = arc.alloc_memory %arg0 :
%mem1 = arc.alloc_memory %arg0 : (!arc.storage) -> !arc.memory<2 x i32, i1>
%mem2 = arc.alloc_memory %arg0 : (!arc.storage) -> !arc.memory<2 x i32, i1>
%s1 = arc.alloc_state %arg0 : (!arc.storage) -> !arc.state<i32>
// CHECK: arc.clock_tree %false attributes {ct1}
arc.clock_tree %false attributes {ct1} {
// CHECK-NEXT: arc.state_read
// CHECK-NEXT: arc.memory_read [[MEM1]][%false]
// CHECK-NEXT: arc.memory_write [[MEM1]]
// CHECK-NEXT: arc.memory_read [[MEM2]][%false]
// CHECK-NEXT: arc.memory_write [[MEM2]]
%r1 = arc.state_read %s1 : <i32>
arc.memory_write %mem2[%false], %r1 : <2 x i32, i1>
arc.memory_write %mem1[%false], %r1 : <2 x i32, i1>
%mr1 = arc.memory_read %mem1[%false] : <2 x i32, i1>
%mr2 = arc.memory_read %mem2[%false] : <2 x i32, i1>
// CHECK-NEXT: }
}
// CHECK: arc.clock_tree %false attributes {ct2}
arc.clock_tree %false attributes {ct2} {
// CHECK-NEXT: arc.state_read
// CHECK-NEXT: arc.memory_read
// CHECK-NEXT: scf.if %false {
// CHECK-NEXT: arc.memory_read
// CHECK-NEXT: }
// CHECK-NEXT: arc.memory_write
%r1 = arc.state_read %s1 : <i32>
arc.memory_write %mem1[%false], %r1 : <2 x i32, i1>
%mr1 = arc.memory_read %mem1[%false] : <2 x i32, i1>
scf.if %false {
%mr2 = arc.memory_read %mem1[%false] : <2 x i32, i1>
}
// CHECK-NEXT: }
}
// CHECK: arc.clock_tree %false attributes {ct3}
arc.clock_tree %false attributes {ct3} {
// CHECK-NEXT: arc.memory_read [[MEM1]]
// CHECK-NEXT: arc.memory_read [[MEM2]]
// CHECK-NEXT: scf.if %false {
// CHECK-NEXT: arc.state_read
// CHECK-NEXT: scf.if %false {
// CHECK-NEXT: arc.memory_write [[MEM2]]
// CHECK-NEXT: arc.memory_read [[MEM1]]
// CHECK-NEXT: }
// CHECK-NEXT: arc.memory_write [[MEM1]]
// CHECK-NEXT: }
scf.if %false {
%r1 = arc.state_read %s1 : <i32>
arc.memory_write %mem1[%false], %r1 : <2 x i32, i1>
scf.if %false {
arc.memory_write %mem2[%false], %r1 : <2 x i32, i1>
%mr3 = arc.memory_read %mem1[%false] : <2 x i32, i1>
}
}
%mr1 = arc.memory_read %mem1[%false] : <2 x i32, i1>
%mr2 = arc.memory_read %mem2[%false] : <2 x i32, i1>
// CHECK-NEXT: }
}
}


@@ -1,39 +1,11 @@
// RUN: circt-opt %s --arc-lower-state --split-input-file --verify-diagnostics
// RUN: circt-opt %s --arc-lower-state --verify-diagnostics --split-input-file
arc.define @DummyArc(%arg0: i42) -> i42 {
arc.output %arg0 : i42
}
// expected-error @+1 {{Value cannot be used in initializer.}}
hw.module @argInit(in %clk: !seq.clock, in %input: i42) {
%0 = arc.state @DummyArc(%0) clock %clk initial (%input : i42) latency 1 : (i42) -> i42
}
// -----
arc.define @DummyArc(%arg0: i42) -> i42 {
arc.output %arg0 : i42
}
hw.module @argInit(in %clk: !seq.clock, in %input: i42) {
// expected-error @+1 {{Value cannot be used in initializer.}}
%0 = arc.state @DummyArc(%0) clock %clk latency 1 : (i42) -> i42
%1 = arc.state @DummyArc(%1) clock %clk initial (%0 : i42) latency 1 : (i42) -> i42
}
// -----
// expected-error @+1 {{initial ops cannot be topologically sorted}}
hw.module @toposort_failure(in %clk: !seq.clock, in %rst: i1, in %i: i32) {
%init = seq.initial (%add) {
^bb0(%arg0: i32):
seq.yield %arg0 : i32
} : (!seq.immutable<i32>) -> !seq.immutable<i32>
%add = seq.initial (%init) {
^bb0(%arg0 : i32):
seq.yield %arg0 : i32
} : (!seq.immutable<i32>) -> !seq.immutable<i32>
hw.module @CombLoop(in %a: i42, out z: i42) {
// expected-error @below {{'comb.add' op is on a combinational loop}}
// expected-remark @below {{computing new phase here}}
%0 = comb.add %a, %1 : i42
// expected-remark @below {{computing new phase here}}
%1 = comb.mul %a, %0 : i42
// expected-remark @below {{computing new phase here}}
hw.output %0 : i42
}

File diff suppressed because it is too large.


@@ -1,33 +1,7 @@
// RUN: arcilator %s --inline=0 --until-before=llvm-lowering | FileCheck %s
// RUN: arcilator %s | FileCheck %s --check-prefix=LLVM
// RUN: arcilator --print-debug-info %s | FileCheck %s --check-prefix=LLVM-DEBUG
// CHECK: func.func @[[XOR_ARC:.+]](
// CHECK-NEXT: comb.xor
// CHECK-NEXT: return
// CHECK-NEXT: }
// CHECK: func.func @[[ADD_ARC:.+]](
// CHECK-NEXT: comb.add
// CHECK-NEXT: return
// CHECK-NEXT: }
// CHECK: func.func @[[MUL_ARC:.+]](
// CHECK-NEXT: comb.mul
// CHECK-NEXT: return
// CHECK-NEXT: }
// CHECK: func.func @Top_passthrough
// CHECK: func.func @Top_clock
// CHECK-NOT: hw.module @Top
// CHECK-LABEL: arc.model @Top io !hw.modty<input clock : !seq.clock, input i0 : i4, input i1 : i4, output out : i4>
// CHECK-NEXT: ^bb0(%arg0: !arc.storage<8>):
hw.module @Top(in %clock : !seq.clock, in %i0 : i4, in %i1 : i4, out out : i4) {
// CHECK: func.call @Top_passthrough(%arg0)
// CHECK: scf.if {{%.+}} {
// CHECK: func.call @Top_clock(%arg0)
// CHECK: }
%0 = comb.add %i0, %i1 : i4
%1 = comb.xor %0, %i0 : i4
%2 = comb.xor %0, %i1 : i4
@@ -37,22 +11,14 @@ hw.module @Top(in %clock : !seq.clock, in %i0 : i4, in %i1 : i4, out out : i4) {
hw.output %3 : i4
}
// LLVM: define void @Top_passthrough(ptr %0)
// LLVM: mul i4
// LLVM: define void @Top_clock(ptr %0)
// LLVM: define void @Top_eval(ptr %0)
// LLVM: add i4
// LLVM: xor i4
// LLVM: xor i4
// LLVM: define void @Top_eval(ptr %0)
// LLVM: call void @Top_passthrough(ptr %0)
// LLVM: call void @Top_clock(ptr %0)
// LLVM: mul i4
// LLVM-DEBUG: define void @Top_passthrough(ptr %0){{.*}}!dbg
// LLVM-DEBUG: mul i4{{.*}}!dbg
// LLVM-DEBUG: define void @Top_clock(ptr %0){{.*}}!dbg
// LLVM-DEBUG: define void @Top_eval(ptr %0){{.*}}!dbg
// LLVM-DEBUG: add i4{{.*}}!dbg
// LLVM-DEBUG: xor i4{{.*}}!dbg
// LLVM-DEBUG: xor i4{{.*}}!dbg
// LLVM-DEBUG: define void @Top_eval(ptr %0){{.*}}!dbg
// LLVM-DEBUG: call void @Top_passthrough(ptr %0){{.*}}!dbg
// LLVM-DEBUG: call void @Top_clock(ptr %0){{.*}}!dbg
// LLVM-DEBUG: mul i4{{.*}}!dbg


@@ -321,8 +321,6 @@ static void populateHwModuleToArcPipeline(PassManager &pm) {
if (untilReached(UntilStateLowering))
return;
pm.addPass(arc::createLowerStatePass());
pm.addPass(createCSEPass());
pm.addPass(arc::createArcCanonicalizerPass());
// TODO: LowerClocksToFuncsPass might not properly consider scf.if operations
// (or nested regions in general) and thus errors out when muxes are also
@@ -330,15 +328,10 @@ static void populateHwModuleToArcPipeline(PassManager &pm) {
// TODO: InlineArcs seems to not properly handle scf.if operations, thus the
// following is commented out
// pm.addPass(arc::createMuxToControlFlowPass());
if (shouldInline) {
if (shouldInline)
pm.addPass(arc::createInlineArcsPass());
pm.addPass(arc::createArcCanonicalizerPass());
pm.addPass(createCSEPass());
}
pm.addPass(arc::createMergeIfsPass());
pm.addPass(arc::createLegalizeStateUpdatePass());
pm.addPass(createCSEPass());
pm.addPass(arc::createArcCanonicalizerPass());