llvm-project/llvm/lib/CodeGen/MachineBlockPlacement.cpp

993 lines
39 KiB
C++
Raw Normal View History

Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
//===-- MachineBlockPlacement.cpp - Basic Block Code Layout optimization --===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
// This file implements basic block placement transformations using the CFG
// structure and branch probability estimates.
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
//
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
// The pass strives to preserve the structure of the CFG (that is, retain
// a topological ordering of basic blocks) in the absense of a *strong* signal
// to the contrary from probabilities. However, within the CFG structure, it
// attempts to choose an ordering which favors placing more likely sequences of
// blocks adjacent to each other.
//
// The algorithm works from the inner-most loop within a function outward, and
// at each stage walks through the basic blocks, trying to coalesce them into
// sequential chains where allowed by the CFG (or demanded by heavy
// probabilities). Finally, it walks the blocks in topological order, and the
// first time it reaches a chain of basic blocks, it schedules them in the
// function in-order.
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
//
//===----------------------------------------------------------------------===//
#define DEBUG_TYPE "block-placement2"
#include "llvm/CodeGen/MachineBasicBlock.h"
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/CodeGen/MachineModuleInfo.h"
#include "llvm/CodeGen/Passes.h"
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
#include "llvm/Support/Allocator.h"
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
#include "llvm/Support/Debug.h"
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/Target/TargetInstrInfo.h"
#include "llvm/Target/TargetLowering.h"
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
#include <algorithm>
using namespace llvm;
STATISTIC(NumCondBranches, "Number of conditional branches");
STATISTIC(NumUncondBranches, "Number of uncondittional branches");
STATISTIC(CondBranchTakenFreq,
"Potential frequency of taking conditional branches");
STATISTIC(UncondBranchTakenFreq,
"Potential frequency of taking unconditional branches");
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
namespace {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
class BlockChain;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief Type for our function-wide basic block -> block chain mapping.
typedef DenseMap<MachineBasicBlock *, BlockChain *> BlockToChainMapType;
}
namespace {
/// \brief A chain of blocks which will be laid out contiguously.
///
/// This is the datastructure representing a chain of consecutive blocks that
/// are profitable to layout together in order to maximize fallthrough
/// probabilities. We also can use a block chain to represent a sequence of
/// basic blocks which have some external (correctness) requirement for
/// sequential layout.
///
/// Eventually, the block chains will form a directed graph over the function.
/// We provide an SCC-supporting-iterator in order to quicky build and walk the
/// SCCs of block chains within a function.
///
/// The block chains also have support for calculating and caching probability
/// information related to the chain itself versus other chains. This is used
/// for ranking during the final layout of block chains.
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
class BlockChain {
/// \brief The sequence of blocks belonging to this chain.
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
///
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// This is the sequence of blocks for a particular chain. These will be laid
/// out in-order within the function.
SmallVector<MachineBasicBlock *, 4> Blocks;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief A handle to the function-wide basic block to block chain mapping.
///
/// This is retained in each block chain to simplify the computation of child
/// block chains for SCC-formation and iteration. We store the edges to child
/// basic blocks, and map them back to their associated chains using this
/// structure.
BlockToChainMapType &BlockToChain;
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
public:
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief Construct a new BlockChain.
///
/// This builds a new block chain representing a single basic block in the
/// function. It also registers itself as the chain that block participates
/// in with the BlockToChain mapping.
BlockChain(BlockToChainMapType &BlockToChain, MachineBasicBlock *BB)
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
: Blocks(1, BB), BlockToChain(BlockToChain), LoopPredecessors(0) {
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
assert(BB && "Cannot create a chain with a null basic block");
BlockToChain[BB] = this;
}
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief Iterator over blocks within the chain.
typedef SmallVectorImpl<MachineBasicBlock *>::const_iterator iterator;
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief Beginning of blocks within the chain.
iterator begin() const { return Blocks.begin(); }
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief End of blocks within the chain.
iterator end() const { return Blocks.end(); }
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief Merge a block chain into this one.
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
///
/// This routine merges a block chain into this one. It takes care of forming
/// a contiguous sequence of basic blocks, updating the edge list, and
/// updating the block -> chain mapping. It does not free or tear down the
/// old chain, but the old chain's block list is no longer valid.
void merge(MachineBasicBlock *BB, BlockChain *Chain) {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
assert(BB);
assert(!Blocks.empty());
// Fast path in case we don't have a chain already.
if (!Chain) {
assert(!BlockToChain[BB]);
Blocks.push_back(BB);
BlockToChain[BB] = this;
return;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
assert(BB == *Chain->begin());
assert(Chain->begin() != Chain->end());
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
// Update the incoming blocks to point to this chain, and add them to the
// chain structure.
for (BlockChain::iterator BI = Chain->begin(), BE = Chain->end();
BI != BE; ++BI) {
Blocks.push_back(*BI);
assert(BlockToChain[*BI] == Chain && "Incoming blocks not in chain");
BlockToChain[*BI] = this;
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
/// \brief Count of predecessors within the loop currently being processed.
///
/// This count is updated at each loop we process to represent the number of
/// in-loop predecessors of this chain.
unsigned LoopPredecessors;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
};
}
namespace {
class MachineBlockPlacement : public MachineFunctionPass {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief A typedef for a block filter set.
typedef SmallPtrSet<MachineBasicBlock *, 16> BlockFilterSet;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief A handle to the branch probability pass.
const MachineBranchProbabilityInfo *MBPI;
/// \brief A handle to the function-wide block frequency pass.
const MachineBlockFrequencyInfo *MBFI;
/// \brief A handle to the loop info.
const MachineLoopInfo *MLI;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief A handle to the target's instruction info.
const TargetInstrInfo *TII;
/// \brief A handle to the target's lowering info.
const TargetLowering *TLI;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief Allocator and owner of BlockChain structures.
///
/// We build BlockChains lazily by merging together high probability BB
/// sequences acording to the "Algo2" in the paper mentioned at the top of
/// the file. To reduce malloc traffic, we allocate them using this slab-like
/// allocator, and destroy them after the pass completes.
SpecificBumpPtrAllocator<BlockChain> ChainAllocator;
/// \brief Function wide BasicBlock to BlockChain mapping.
///
/// This mapping allows efficiently moving from any given basic block to the
/// BlockChain it participates in, if any. We use it to, among other things,
/// allow implicitly defining edges between chains as the existing edges
/// between basic blocks.
DenseMap<MachineBasicBlock *, BlockChain *> BlockToChain;
void markChainSuccessors(BlockChain &Chain,
MachineBasicBlock *LoopHeaderBB,
SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
const BlockFilterSet *BlockFilter = 0);
MachineBasicBlock *selectBestSuccessor(MachineBasicBlock *BB,
BlockChain &Chain,
const BlockFilterSet *BlockFilter);
MachineBasicBlock *selectBestCandidateBlock(
BlockChain &Chain, SmallVectorImpl<MachineBasicBlock *> &WorkList,
const BlockFilterSet *BlockFilter);
MachineBasicBlock *getFirstUnplacedBlock(
MachineFunction &F,
const BlockChain &PlacedChain,
MachineFunction::iterator &PrevUnplacedBlockIt,
const BlockFilterSet *BlockFilter);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
void buildChain(MachineBasicBlock *BB, BlockChain &Chain,
SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
const BlockFilterSet *BlockFilter = 0);
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
MachineBasicBlock *findBestLoopTop(MachineFunction &F,
MachineLoop &L,
const BlockFilterSet &LoopBlockSet);
void buildLoopChains(MachineFunction &F, MachineLoop &L);
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
void buildCFGChains(MachineFunction &F);
void AlignLoops(MachineFunction &F);
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
public:
static char ID; // Pass identification, replacement for typeid
MachineBlockPlacement() : MachineFunctionPass(ID) {
initializeMachineBlockPlacementPass(*PassRegistry::getPassRegistry());
}
bool runOnMachineFunction(MachineFunction &F);
void getAnalysisUsage(AnalysisUsage &AU) const {
AU.addRequired<MachineBranchProbabilityInfo>();
AU.addRequired<MachineBlockFrequencyInfo>();
AU.addRequired<MachineLoopInfo>();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
MachineFunctionPass::getAnalysisUsage(AU);
}
const char *getPassName() const { return "Block Placement"; }
};
}
char MachineBlockPlacement::ID = 0;
INITIALIZE_PASS_BEGIN(MachineBlockPlacement, "block-placement2",
"Branch Probability Basic Block Placement", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineBranchProbabilityInfo)
INITIALIZE_PASS_DEPENDENCY(MachineBlockFrequencyInfo)
INITIALIZE_PASS_DEPENDENCY(MachineLoopInfo)
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
INITIALIZE_PASS_END(MachineBlockPlacement, "block-placement2",
"Branch Probability Basic Block Placement", false, false)
FunctionPass *llvm::createMachineBlockPlacementPass() {
return new MachineBlockPlacement();
}
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
#ifndef NDEBUG
/// \brief Helper to print the name of a MBB.
///
/// Only used by debug logging.
static std::string getBlockName(MachineBasicBlock *BB) {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
std::string Result;
raw_string_ostream OS(Result);
OS << "BB#" << BB->getNumber()
<< " (derived from LLVM BB '" << BB->getName() << "')";
OS.flush();
return Result;
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief Helper to print the number of a MBB.
///
/// Only used by debug logging.
static std::string getBlockNum(MachineBasicBlock *BB) {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
std::string Result;
raw_string_ostream OS(Result);
OS << "BB#" << BB->getNumber();
OS.flush();
return Result;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
#endif
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
/// \brief Mark a chain's successors as having one fewer preds.
///
/// When a chain is being merged into the "placed" chain, this routine will
/// quickly walk the successors of each block in the chain and mark them as
/// having one fewer active predecessor. It also adds any successors of this
/// chain which reach the zero-predecessor state to the worklist passed in.
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
void MachineBlockPlacement::markChainSuccessors(
BlockChain &Chain,
MachineBasicBlock *LoopHeaderBB,
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
const BlockFilterSet *BlockFilter) {
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Walk all the blocks in this chain, marking their successors as having
// a predecessor placed.
for (BlockChain::iterator CBI = Chain.begin(), CBE = Chain.end();
CBI != CBE; ++CBI) {
// Add any successors for which this is the only un-placed in-loop
// predecessor to the worklist as a viable candidate for CFG-neutral
// placement. No subsequent placement of this block will violate the CFG
// shape, so we get to use heuristics to choose a favorable placement.
for (MachineBasicBlock::succ_iterator SI = (*CBI)->succ_begin(),
SE = (*CBI)->succ_end();
SI != SE; ++SI) {
if (BlockFilter && !BlockFilter->count(*SI))
continue;
BlockChain &SuccChain = *BlockToChain[*SI];
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Disregard edges within a fixed chain, or edges to the loop header.
if (&Chain == &SuccChain || *SI == LoopHeaderBB)
continue;
// This is a cross-chain edge that is within the loop, so decrement the
// loop predecessor count of the destination chain.
if (SuccChain.LoopPredecessors > 0 && --SuccChain.LoopPredecessors == 0)
BlockWorkList.push_back(*SuccChain.begin());
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
}
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
/// \brief Select the best successor for a block.
///
/// This looks across all successors of a particular block and attempts to
/// select the "best" one to be the layout successor. It only considers direct
/// successors which also pass the block filter. It will attempt to avoid
/// breaking CFG structure, but cave and break such structures in the case of
/// very hot successor edges.
///
/// \returns The best successor block found, or null if none are viable.
MachineBasicBlock *MachineBlockPlacement::selectBestSuccessor(
MachineBasicBlock *BB, BlockChain &Chain,
const BlockFilterSet *BlockFilter) {
const BranchProbability HotProb(4, 5); // 80%
MachineBasicBlock *BestSucc = 0;
// FIXME: Due to the performance of the probability and weight routines in
// the MBPI analysis, we manually compute probabilities using the edge
// weights. This is suboptimal as it means that the somewhat subtle
// definition of edge weight semantics is encoded here as well. We should
// improve the MBPI interface to effeciently support query patterns such as
// this.
uint32_t BestWeight = 0;
uint32_t WeightScale = 0;
uint32_t SumWeight = MBPI->getSumForBlock(BB, WeightScale);
DEBUG(dbgs() << "Attempting merge from: " << getBlockName(BB) << "\n");
for (MachineBasicBlock::succ_iterator SI = BB->succ_begin(),
SE = BB->succ_end();
SI != SE; ++SI) {
if (BlockFilter && !BlockFilter->count(*SI))
continue;
BlockChain &SuccChain = *BlockToChain[*SI];
if (&SuccChain == &Chain) {
DEBUG(dbgs() << " " << getBlockName(*SI) << " -> Already merged!\n");
continue;
}
if (*SI != *SuccChain.begin()) {
DEBUG(dbgs() << " " << getBlockName(*SI) << " -> Mid chain!\n");
continue;
}
uint32_t SuccWeight = MBPI->getEdgeWeight(BB, *SI);
BranchProbability SuccProb(SuccWeight / WeightScale, SumWeight);
// Only consider successors which are either "hot", or wouldn't violate
// any CFG constraints.
if (SuccChain.LoopPredecessors != 0) {
if (SuccProb < HotProb) {
DEBUG(dbgs() << " " << getBlockName(*SI) << " -> CFG conflict\n");
continue;
}
// Make sure that a hot successor doesn't have a globally more important
// predecessor.
BlockFrequency CandidateEdgeFreq
= MBFI->getBlockFreq(BB) * SuccProb * HotProb.getCompl();
bool BadCFGConflict = false;
for (MachineBasicBlock::pred_iterator PI = (*SI)->pred_begin(),
PE = (*SI)->pred_end();
PI != PE; ++PI) {
if (*PI == *SI || (BlockFilter && !BlockFilter->count(*PI)) ||
BlockToChain[*PI] == &Chain)
continue;
BlockFrequency PredEdgeFreq
= MBFI->getBlockFreq(*PI) * MBPI->getEdgeProbability(*PI, *SI);
if (PredEdgeFreq >= CandidateEdgeFreq) {
BadCFGConflict = true;
break;
}
}
if (BadCFGConflict) {
DEBUG(dbgs() << " " << getBlockName(*SI)
<< " -> non-cold CFG conflict\n");
continue;
}
}
DEBUG(dbgs() << " " << getBlockName(*SI) << " -> " << SuccProb
<< " (prob)"
<< (SuccChain.LoopPredecessors != 0 ? " (CFG break)" : "")
<< "\n");
if (BestSucc && BestWeight >= SuccWeight)
continue;
BestSucc = *SI;
BestWeight = SuccWeight;
}
return BestSucc;
}
namespace {
/// \brief Predicate struct to detect blocks already placed.
class IsBlockPlaced {
const BlockChain &PlacedChain;
const BlockToChainMapType &BlockToChain;
public:
IsBlockPlaced(const BlockChain &PlacedChain,
const BlockToChainMapType &BlockToChain)
: PlacedChain(PlacedChain), BlockToChain(BlockToChain) {}
bool operator()(MachineBasicBlock *BB) const {
return BlockToChain.lookup(BB) == &PlacedChain;
}
};
}
/// \brief Select the best block from a worklist.
///
/// This looks through the provided worklist as a list of candidate basic
/// blocks and select the most profitable one to place. The definition of
/// profitable only really makes sense in the context of a loop. This returns
/// the most frequently visited block in the worklist, which in the case of
/// a loop, is the one most desirable to be physically close to the rest of the
/// loop body in order to improve icache behavior.
///
/// \returns The best block found, or null if none are viable.
MachineBasicBlock *MachineBlockPlacement::selectBestCandidateBlock(
BlockChain &Chain, SmallVectorImpl<MachineBasicBlock *> &WorkList,
const BlockFilterSet *BlockFilter) {
// Once we need to walk the worklist looking for a candidate, cleanup the
// worklist of already placed entries.
// FIXME: If this shows up on profiles, it could be folded (at the cost of
// some code complexity) into the loop below.
WorkList.erase(std::remove_if(WorkList.begin(), WorkList.end(),
IsBlockPlaced(Chain, BlockToChain)),
WorkList.end());
MachineBasicBlock *BestBlock = 0;
BlockFrequency BestFreq;
for (SmallVectorImpl<MachineBasicBlock *>::iterator WBI = WorkList.begin(),
WBE = WorkList.end();
WBI != WBE; ++WBI) {
assert(!BlockFilter || BlockFilter->count(*WBI));
BlockChain &SuccChain = *BlockToChain[*WBI];
if (&SuccChain == &Chain) {
DEBUG(dbgs() << " " << getBlockName(*WBI)
<< " -> Already merged!\n");
continue;
}
assert(SuccChain.LoopPredecessors == 0 && "Found CFG-violating block");
BlockFrequency CandidateFreq = MBFI->getBlockFreq(*WBI);
DEBUG(dbgs() << " " << getBlockName(*WBI) << " -> " << CandidateFreq
<< " (freq)\n");
if (BestBlock && BestFreq >= CandidateFreq)
continue;
BestBlock = *WBI;
BestFreq = CandidateFreq;
}
return BestBlock;
}
/// \brief Retrieve the first unplaced basic block.
///
/// This routine is called when we are unable to use the CFG to walk through
/// all of the basic blocks and form a chain due to unnatural loops in the CFG.
/// We walk through the function's blocks in order, starting from the
/// LastUnplacedBlockIt. We update this iterator on each call to avoid
/// re-scanning the entire sequence on repeated calls to this routine.
MachineBasicBlock *MachineBlockPlacement::getFirstUnplacedBlock(
MachineFunction &F, const BlockChain &PlacedChain,
MachineFunction::iterator &PrevUnplacedBlockIt,
const BlockFilterSet *BlockFilter) {
for (MachineFunction::iterator I = PrevUnplacedBlockIt, E = F.end(); I != E;
++I) {
if (BlockFilter && !BlockFilter->count(I))
continue;
if (BlockToChain[I] != &PlacedChain) {
PrevUnplacedBlockIt = I;
// Now select the head of the chain to which the unplaced block belongs
// as the block to place. This will force the entire chain to be placed,
// and satisfies the requirements of merging chains.
return *BlockToChain[I]->begin();
}
}
return 0;
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
void MachineBlockPlacement::buildChain(
MachineBasicBlock *BB,
BlockChain &Chain,
SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
const BlockFilterSet *BlockFilter) {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
assert(BB);
assert(BlockToChain[BB] == &Chain);
MachineFunction &F = *BB->getParent();
MachineFunction::iterator PrevUnplacedBlockIt = F.begin();
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
MachineBasicBlock *LoopHeaderBB = BB;
markChainSuccessors(Chain, LoopHeaderBB, BlockWorkList, BlockFilter);
BB = *llvm::prior(Chain.end());
for (;;) {
assert(BB);
assert(BlockToChain[BB] == &Chain);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
assert(*llvm::prior(Chain.end()) == BB);
MachineBasicBlock *BestSucc = 0;
// Look for the best viable successor if there is one to place immediately
// after this block.
BestSucc = selectBestSuccessor(BB, Chain, BlockFilter);
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// If an immediate successor isn't available, look for the best viable
// block among those we've identified as not violating the loop's CFG at
// this point. This won't be a fallthrough, but it will increase locality.
if (!BestSucc)
BestSucc = selectBestCandidateBlock(Chain, BlockWorkList, BlockFilter);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
if (!BestSucc) {
BestSucc = getFirstUnplacedBlock(F, Chain, PrevUnplacedBlockIt,
BlockFilter);
if (!BestSucc)
break;
DEBUG(dbgs() << "Unnatural loop CFG detected, forcibly merging the "
"layout successor until the CFG reduces\n");
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Place this block, updating the datastructures to reflect its placement.
BlockChain &SuccChain = *BlockToChain[BestSucc];
// Zero out LoopPredecessors for the successor we're about to merge in case
// we selected a successor that didn't fit naturally into the CFG.
SuccChain.LoopPredecessors = 0;
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
DEBUG(dbgs() << "Merging from " << getBlockNum(BB)
<< " to " << getBlockNum(BestSucc) << "\n");
markChainSuccessors(SuccChain, LoopHeaderBB, BlockWorkList, BlockFilter);
Chain.merge(BestSucc, &SuccChain);
BB = *llvm::prior(Chain.end());
}
DEBUG(dbgs() << "Finished forming chain for header block "
<< getBlockNum(*Chain.begin()) << "\n");
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
/// \brief Find the best loop top block for layout.
///
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
/// This routine implements the logic to analyze the loop looking for the best
/// block to layout at the top of the loop. Typically this is done to maximize
/// fallthrough opportunities.
MachineBasicBlock *
MachineBlockPlacement::findBestLoopTop(MachineFunction &F,
MachineLoop &L,
const BlockFilterSet &LoopBlockSet) {
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
BlockFrequency BestExitEdgeFreq;
MachineBasicBlock *ExitingBB = 0;
MachineBasicBlock *LoopingBB = 0;
// If there are exits to outer loops, loop rotation can severely limit
// fallthrough opportunites unless it selects such an exit. Keep a set of
// blocks where rotating to exit with that block will reach an outer loop.
SmallPtrSet<MachineBasicBlock *, 4> BlocksExitingToOuterLoop;
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
DEBUG(dbgs() << "Finding best loop exit for: "
<< getBlockName(L.getHeader()) << "\n");
for (MachineLoop::block_iterator I = L.block_begin(),
E = L.block_end();
I != E; ++I) {
BlockChain &Chain = *BlockToChain[*I];
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// Ensure that this block is at the end of a chain; otherwise it could be
// mid-way through an inner loop or a successor of an analyzable branch.
if (*I != *llvm::prior(Chain.end()))
continue;
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// Now walk the successors. We need to establish whether this has a viable
// exiting successor and whether it has a viable non-exiting successor.
// We store the old exiting state and restore it if a viable looping
// successor isn't found.
MachineBasicBlock *OldExitingBB = ExitingBB;
BlockFrequency OldBestExitEdgeFreq = BestExitEdgeFreq;
// We also compute and store the best looping successor for use in layout.
MachineBasicBlock *BestLoopSucc = 0;
// FIXME: Due to the performance of the probability and weight routines in
// the MBPI analysis, we use the internal weights. This is only valid
// because it is purely a ranking function, we don't care about anything
// but the relative values.
uint32_t BestLoopSuccWeight = 0;
// FIXME: We also manually compute the probabilities to avoid quadratic
// behavior.
uint32_t WeightScale = 0;
uint32_t SumWeight = MBPI->getSumForBlock(*I, WeightScale);
for (MachineBasicBlock::succ_iterator SI = (*I)->succ_begin(),
SE = (*I)->succ_end();
SI != SE; ++SI) {
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
if ((*SI)->isLandingPad())
continue;
if (*SI == *I)
continue;
BlockChain &SuccChain = *BlockToChain[*SI];
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// Don't split chains, either this chain or the successor's chain.
if (&Chain == &SuccChain || *SI != *SuccChain.begin()) {
DEBUG(dbgs() << " " << (LoopBlockSet.count(*SI) ? "looping: "
: "exiting: ")
<< getBlockName(*I) << " -> "
<< getBlockName(*SI) << " (chain conflict)\n");
continue;
}
uint32_t SuccWeight = MBPI->getEdgeWeight(*I, *SI);
if (LoopBlockSet.count(*SI)) {
DEBUG(dbgs() << " looping: " << getBlockName(*I) << " -> "
<< getBlockName(*SI) << " (" << SuccWeight << ")\n");
if (BestLoopSucc && BestLoopSuccWeight >= SuccWeight)
continue;
BestLoopSucc = *SI;
BestLoopSuccWeight = SuccWeight;
continue;
}
BranchProbability SuccProb(SuccWeight / WeightScale, SumWeight);
BlockFrequency ExitEdgeFreq = MBFI->getBlockFreq(*I) * SuccProb;
DEBUG(dbgs() << " exiting: " << getBlockName(*I) << " -> "
<< getBlockName(*SI) << " (" << ExitEdgeFreq << ")\n");
// Note that we slightly bias this toward an existing layout successor to
// retain incoming order in the absence of better information.
// FIXME: Should we bias this more strongly? It's pretty weak.
if (!ExitingBB || ExitEdgeFreq > BestExitEdgeFreq ||
((*I)->isLayoutSuccessor(*SI) &&
!(ExitEdgeFreq < BestExitEdgeFreq))) {
BestExitEdgeFreq = ExitEdgeFreq;
ExitingBB = *I;
}
if (MachineLoop *ExitLoop = MLI->getLoopFor(*SI))
if (ExitLoop->contains(&L))
BlocksExitingToOuterLoop.insert(*I);
}
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// Restore the old exiting state, no viable looping successor was found.
if (!BestLoopSucc) {
ExitingBB = OldExitingBB;
BestExitEdgeFreq = OldBestExitEdgeFreq;
continue;
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
}
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// If this was best exiting block thus far, also record the looping block.
if (ExitingBB == *I)
LoopingBB = BestLoopSucc;
}
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
// Without a candidate exitting block or with only a single block in the
// loop, just use the loop header to layout the loop.
if (!ExitingBB || L.getNumBlocks() == 1)
return L.getHeader();
// Also, if we have exit blocks which lead to outer loops but didn't select
// one of them as the exiting block we are rotating toward, disable loop
// rotation altogether.
if (!BlocksExitingToOuterLoop.empty() &&
!BlocksExitingToOuterLoop.count(ExitingBB))
return L.getHeader();
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
assert(LoopingBB && "All successors of a loop block are exit blocks!");
DEBUG(dbgs() << " Best exiting block: " << getBlockName(ExitingBB) << "\n");
DEBUG(dbgs() << " Best top block: " << getBlockName(LoopingBB) << "\n");
return LoopingBB;
}
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
/// \brief Forms basic block chains from the natural loop structures.
///
/// These chains are designed to preserve the existing *structure* of the code
/// as much as possible. We can then stitch the chains together in a way which
/// both preserves the topological structure and minimizes taken conditional
/// branches.
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
void MachineBlockPlacement::buildLoopChains(MachineFunction &F,
MachineLoop &L) {
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
// First recurse through any nested loops, building chains for those inner
// loops.
for (MachineLoop::iterator LI = L.begin(), LE = L.end(); LI != LE; ++LI)
buildLoopChains(F, **LI);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
SmallVector<MachineBasicBlock *, 16> BlockWorkList;
BlockFilterSet LoopBlockSet(L.block_begin(), L.block_end());
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
MachineBasicBlock *LayoutTop = findBestLoopTop(F, L, LoopBlockSet);
BlockChain &LoopChain = *BlockToChain[LayoutTop];
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// FIXME: This is a really lame way of walking the chains in the loop: we
// walk the blocks, and use a set to prevent visiting a particular chain
// twice.
SmallPtrSet<BlockChain *, 4> UpdatedPreds;
assert(LoopChain.LoopPredecessors == 0);
UpdatedPreds.insert(&LoopChain);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
for (MachineLoop::block_iterator BI = L.block_begin(),
BE = L.block_end();
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
BI != BE; ++BI) {
BlockChain &Chain = *BlockToChain[*BI];
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
if (!UpdatedPreds.insert(&Chain))
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
continue;
assert(Chain.LoopPredecessors == 0);
for (BlockChain::iterator BCI = Chain.begin(), BCE = Chain.end();
BCI != BCE; ++BCI) {
assert(BlockToChain[*BCI] == &Chain);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
for (MachineBasicBlock::pred_iterator PI = (*BCI)->pred_begin(),
PE = (*BCI)->pred_end();
PI != PE; ++PI) {
if (BlockToChain[*PI] == &Chain || !LoopBlockSet.count(*PI))
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
continue;
++Chain.LoopPredecessors;
}
}
if (Chain.LoopPredecessors == 0)
BlockWorkList.push_back(*Chain.begin());
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
Take two on rotating the block ordering of loops. My previous attempt was centered around the premise of laying out a loop in a chain, and then rotating that chain. This is good for preserving contiguous layout, but bad for actually making sane rotations. In order to keep it safe, I had to essentially make it impossible to rotate deeply nested loops. The information needed to correctly reason about a deeply nested loop is actually available -- *before* we layout the loop. We know the inner loops are already fused into chains, etc. We lose information the moment we actually lay out the loop. The solution was the other alternative for this algorithm I discussed with Benjamin and some others: rather than rotating the loop after-the-fact, try to pick a profitable starting block for the loop's layout, and then use our existing layout logic. I was worried about the complexity of this "pick" step, but it turns out such complexity is needed to handle all the important cases I keep teasing out of benchmarks. This is, I'm afraid, a bit of a work-in-progress. It is still misbehaving on some likely important cases I'm investigating in Olden. It also isn't really tested. I'm going to try to craft some interesting nested-loop test cases, but it's likely to be extremely time consuming and I don't want to go there until I'm sure I'm testing the correct behavior. Sadly I can't come up with a way of getting simple, fine grained test cases for this logic. We need complex loop structures to even trigger much of it. llvm-svn: 145183
2011-11-27 21:34:33 +08:00
buildChain(LayoutTop, LoopChain, BlockWorkList, &LoopBlockSet);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
DEBUG({
// Crash at the end so we get all of the debugging output first.
bool BadLoop = false;
if (LoopChain.LoopPredecessors) {
BadLoop = true;
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
dbgs() << "Loop chain contains a block without its preds placed!\n"
<< " Loop header: " << getBlockName(*L.block_begin()) << "\n"
<< " Chain header: " << getBlockName(*LoopChain.begin()) << "\n";
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
for (BlockChain::iterator BCI = LoopChain.begin(), BCE = LoopChain.end();
BCI != BCE; ++BCI)
if (!LoopBlockSet.erase(*BCI)) {
// We don't mark the loop as bad here because there are real situations
// where this can occur. For example, with an unanalyzable fallthrough
// from a loop block to a non-loop block or vice versa.
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
dbgs() << "Loop chain contains a block not contained by the loop!\n"
<< " Loop header: " << getBlockName(*L.block_begin()) << "\n"
<< " Chain header: " << getBlockName(*LoopChain.begin()) << "\n"
<< " Bad block: " << getBlockName(*BCI) << "\n";
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
if (!LoopBlockSet.empty()) {
BadLoop = true;
for (BlockFilterSet::iterator LBI = LoopBlockSet.begin(),
LBE = LoopBlockSet.end();
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
LBI != LBE; ++LBI)
dbgs() << "Loop contains blocks never placed into a chain!\n"
<< " Loop header: " << getBlockName(*L.block_begin()) << "\n"
<< " Chain header: " << getBlockName(*LoopChain.begin()) << "\n"
<< " Bad block: " << getBlockName(*LBI) << "\n";
}
assert(!BadLoop && "Detected problems with the placement of this loop.");
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
});
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
void MachineBlockPlacement::buildCFGChains(MachineFunction &F) {
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Ensure that every BB in the function has an associated chain to simplify
// the assumptions of the remaining algorithm.
SmallVector<MachineOperand, 4> Cond; // For AnalyzeBranch.
for (MachineFunction::iterator FI = F.begin(), FE = F.end(); FI != FE; ++FI) {
MachineBasicBlock *BB = FI;
BlockChain *Chain
= new (ChainAllocator.Allocate()) BlockChain(BlockToChain, BB);
// Also, merge any blocks which we cannot reason about and must preserve
// the exact fallthrough behavior for.
for (;;) {
Cond.clear();
MachineBasicBlock *TBB = 0, *FBB = 0; // For AnalyzeBranch.
if (!TII->AnalyzeBranch(*BB, TBB, FBB, Cond) || !FI->canFallThrough())
break;
MachineFunction::iterator NextFI(llvm::next(FI));
MachineBasicBlock *NextBB = NextFI;
// Ensure that the layout successor is a viable block, as we know that
// fallthrough is a possibility.
assert(NextFI != FE && "Can't fallthrough past the last block.");
DEBUG(dbgs() << "Pre-merging due to unanalyzable fallthrough: "
<< getBlockName(BB) << " -> " << getBlockName(NextBB)
<< "\n");
Chain->merge(NextBB, 0);
FI = NextFI;
BB = NextBB;
}
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Build any loop-based chains.
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
for (MachineLoopInfo::iterator LI = MLI->begin(), LE = MLI->end(); LI != LE;
++LI)
buildLoopChains(F, **LI);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
SmallVector<MachineBasicBlock *, 16> BlockWorkList;
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
SmallPtrSet<BlockChain *, 4> UpdatedPreds;
for (MachineFunction::iterator FI = F.begin(), FE = F.end(); FI != FE; ++FI) {
MachineBasicBlock *BB = &*FI;
BlockChain &Chain = *BlockToChain[BB];
if (!UpdatedPreds.insert(&Chain))
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
continue;
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
assert(Chain.LoopPredecessors == 0);
for (BlockChain::iterator BCI = Chain.begin(), BCE = Chain.end();
BCI != BCE; ++BCI) {
assert(BlockToChain[*BCI] == &Chain);
for (MachineBasicBlock::pred_iterator PI = (*BCI)->pred_begin(),
PE = (*BCI)->pred_end();
PI != PE; ++PI) {
if (BlockToChain[*PI] == &Chain)
continue;
++Chain.LoopPredecessors;
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
if (Chain.LoopPredecessors == 0)
BlockWorkList.push_back(*Chain.begin());
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
BlockChain &FunctionChain = *BlockToChain[&F.front()];
buildChain(&F.front(), FunctionChain, BlockWorkList);
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
typedef SmallPtrSet<MachineBasicBlock *, 16> FunctionBlockSetType;
DEBUG({
// Crash at the end so we get all of the debugging output first.
bool BadFunc = false;
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
FunctionBlockSetType FunctionBlockSet;
for (MachineFunction::iterator FI = F.begin(), FE = F.end(); FI != FE; ++FI)
FunctionBlockSet.insert(FI);
for (BlockChain::iterator BCI = FunctionChain.begin(),
BCE = FunctionChain.end();
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
BCI != BCE; ++BCI)
if (!FunctionBlockSet.erase(*BCI)) {
BadFunc = true;
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
dbgs() << "Function chain contains a block not in the function!\n"
<< " Bad block: " << getBlockName(*BCI) << "\n";
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
if (!FunctionBlockSet.empty()) {
BadFunc = true;
for (FunctionBlockSetType::iterator FBI = FunctionBlockSet.begin(),
FBE = FunctionBlockSet.end();
FBI != FBE; ++FBI)
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
dbgs() << "Function contains blocks never placed into a chain!\n"
<< " Bad block: " << getBlockName(*FBI) << "\n";
}
assert(!BadFunc && "Detected problems with the block placement.");
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
});
// Splice the blocks into place.
MachineFunction::iterator InsertPos = F.begin();
for (BlockChain::iterator BI = FunctionChain.begin(),
BE = FunctionChain.end();
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
BI != BE; ++BI) {
DEBUG(dbgs() << (BI == FunctionChain.begin() ? "Placing chain "
: " ... ")
<< getBlockName(*BI) << "\n");
if (InsertPos != MachineFunction::iterator(*BI))
F.splice(InsertPos, *BI);
else
++InsertPos;
// Update the terminator of the previous block.
if (BI == FunctionChain.begin())
continue;
MachineBasicBlock *PrevBB = llvm::prior(MachineFunction::iterator(*BI));
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
// FIXME: It would be awesome of updateTerminator would just return rather
// than assert when the branch cannot be analyzed in order to remove this
// boiler plate.
Cond.clear();
MachineBasicBlock *TBB = 0, *FBB = 0; // For AnalyzeBranch.
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
if (!TII->AnalyzeBranch(*PrevBB, TBB, FBB, Cond))
PrevBB->updateTerminator();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
Rewrite #3 of machine block placement. This is based somewhat on the second algorithm, but only loosely. It is more heavily based on the last discussion I had with Andy. It continues to walk from the inner-most loop outward, but there is a key difference. With this algorithm we ensure that as we visit each loop, the entire loop is merged into a single chain. At the end, the entire function is treated as a "loop", and merged into a single chain. This chain forms the desired sequence of blocks within the function. Switching to a single algorithm removes my biggest problem with the previous approaches -- they had different behavior depending on which system triggered the layout. Now there is exactly one algorithm and one basis for the decision making. The other key difference is how the chain is formed. This is based heavily on the idea Andy mentioned of keeping a worklist of blocks that are viable layout successors based on the CFG. Having this set allows us to consistently select the best layout successor for each block. It is expensive though. The code here remains very rough. There is a lot that needs to be done to clean up the code, and to make the runtime cost of this pass much lower. Very much WIP, but this was a giant chunk of code and I'd rather folks see it sooner than later. Everything remains behind a flag of course. I've added a couple of tests to exercise the issues that this iteration was motivated by: loop structure preservation. I've also fixed one test that was exhibiting the broken behavior of the previous version. llvm-svn: 144495
2011-11-13 19:20:44 +08:00
// Fixup the last block.
Cond.clear();
MachineBasicBlock *TBB = 0, *FBB = 0; // For AnalyzeBranch.
if (!TII->AnalyzeBranch(F.back(), TBB, FBB, Cond))
F.back().updateTerminator();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
}
/// \brief Recursive helper to align a loop and any nested loops.
static void AlignLoop(MachineFunction &F, MachineLoop *L, unsigned Align) {
// Recurse through nested loops.
for (MachineLoop::iterator I = L->begin(), E = L->end(); I != E; ++I)
AlignLoop(F, *I, Align);
L->getTopBlock()->setAlignment(Align);
}
/// \brief Align loop headers to target preferred alignments.
void MachineBlockPlacement::AlignLoops(MachineFunction &F) {
if (F.getFunction()->hasFnAttr(Attribute::OptimizeForSize))
return;
unsigned Align = TLI->getPrefLoopAlignment();
if (!Align)
return; // Don't care about loop alignment.
for (MachineLoopInfo::iterator I = MLI->begin(), E = MLI->end(); I != E; ++I)
AlignLoop(F, *I, Align);
}
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
bool MachineBlockPlacement::runOnMachineFunction(MachineFunction &F) {
// Check for single-block functions and skip them.
if (llvm::next(F.begin()) == F.end())
return false;
MBPI = &getAnalysis<MachineBranchProbabilityInfo>();
MBFI = &getAnalysis<MachineBlockFrequencyInfo>();
MLI = &getAnalysis<MachineLoopInfo>();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
TII = F.getTarget().getInstrInfo();
TLI = F.getTarget().getTargetLowering();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
assert(BlockToChain.empty());
Completely re-write the algorithm behind MachineBlockPlacement based on discussions with Andy. Fundamentally, the previous algorithm is both counter productive on several fronts and prioritizing things which aren't necessarily the most important: static branch prediction. The new algorithm uses the existing loop CFG structure information to walk through the CFG itself to layout blocks. It coalesces adjacent blocks within the loop where the CFG allows based on the most likely path taken. Finally, it topologically orders the block chains that have been formed. This allows it to choose a (mostly) topologically valid ordering which still priorizes fallthrough within the structural constraints. As a final twist in the algorithm, it does violate the CFG when it discovers a "hot" edge, that is an edge that is more than 4x hotter than the competing edges in the CFG. These are forcibly merged into a fallthrough chain. Future transformations that need te be added are rotation of loop exit conditions to be fallthrough, and better isolation of cold block chains. I'm also planning on adding statistics to model how well the algorithm does at laying out blocks based on the probabilities it receives. The old tests mostly still pass, and I have some new tests to add, but the nested loops are still behaving very strangely. This almost seems like working-as-intended as it rotated the exit branch to be fallthrough, but I'm not convinced this is actually the best layout. It is well supported by the probabilities for loops we currently get, but those are pretty broken for nested loops, so this may change later. llvm-svn: 142743
2011-10-23 17:18:45 +08:00
buildCFGChains(F);
AlignLoops(F);
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
BlockToChain.clear();
ChainAllocator.DestroyAll();
Implement a block placement pass based on the branch probability and block frequency analyses. This differs substantially from the existing block-placement pass in LLVM: 1) It operates on the Machine-IR in the CodeGen layer. This exposes much more (and more precise) information and opportunities. Also, the results are more stable due to fewer transforms ocurring after the pass runs. 2) It uses the generalized probability and frequency analyses. These can model static heuristics, code annotation derived heuristics as well as eventual profile loading. By basing the optimization on the analysis interface it can work from any (or a combination) of these inputs. 3) It uses a more aggressive algorithm, both building chains from tho bottom up to maximize benefit, and using an SCC-based walk to layout chains of blocks in a profitable ordering without O(N^2) iterations which the old pass involves. The pass is currently gated behind a flag, and not enabled by default because it still needs to grow some important features. Most notably, it needs to support loop aligning and careful layout of loop structures much as done by hand currently in CodePlacementOpt. Once it supports these, and has sufficient testing and quality tuning, it should replace both of these passes. Thanks to Nick Lewycky and Richard Smith for help authoring & debugging this, and to Jakob, Andy, Eric, Jim, and probably a few others I'm forgetting for reviewing and answering all my questions. Writing a backend pass is *sooo* much better now than it used to be. =D llvm-svn: 142641
2011-10-21 14:46:38 +08:00
// We always return true as we have no way to track whether the final order
// differs from the original order.
return true;
}
namespace {
/// \brief A pass to compute block placement statistics.
///
/// A separate pass to compute interesting statistics for evaluating block
/// placement. This is separate from the actual placement pass so that they can
/// be computed in the absense of any placement transformations or when using
/// alternative placement strategies.
class MachineBlockPlacementStats : public MachineFunctionPass {
/// \brief A handle to the branch probability pass.
const MachineBranchProbabilityInfo *MBPI;
/// \brief A handle to the function-wide block frequency pass.
const MachineBlockFrequencyInfo *MBFI;
public:
static char ID; // Pass identification, replacement for typeid
MachineBlockPlacementStats() : MachineFunctionPass(ID) {
initializeMachineBlockPlacementStatsPass(*PassRegistry::getPassRegistry());
}
bool runOnMachineFunction(MachineFunction &F);
void getAnalysisUsage(AnalysisUsage &AU) const {
AU.addRequired<MachineBranchProbabilityInfo>();
AU.addRequired<MachineBlockFrequencyInfo>();
AU.setPreservesAll();
MachineFunctionPass::getAnalysisUsage(AU);
}
const char *getPassName() const { return "Block Placement Stats"; }
};
}
char MachineBlockPlacementStats::ID = 0;
INITIALIZE_PASS_BEGIN(MachineBlockPlacementStats, "block-placement-stats",
"Basic Block Placement Stats", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineBranchProbabilityInfo)
INITIALIZE_PASS_DEPENDENCY(MachineBlockFrequencyInfo)
INITIALIZE_PASS_END(MachineBlockPlacementStats, "block-placement-stats",
"Basic Block Placement Stats", false, false)
FunctionPass *llvm::createMachineBlockPlacementStatsPass() {
return new MachineBlockPlacementStats();
}
bool MachineBlockPlacementStats::runOnMachineFunction(MachineFunction &F) {
// Check for single-block functions and skip them.
if (llvm::next(F.begin()) == F.end())
return false;
MBPI = &getAnalysis<MachineBranchProbabilityInfo>();
MBFI = &getAnalysis<MachineBlockFrequencyInfo>();
for (MachineFunction::iterator I = F.begin(), E = F.end(); I != E; ++I) {
BlockFrequency BlockFreq = MBFI->getBlockFreq(I);
Statistic &NumBranches = (I->succ_size() > 1) ? NumCondBranches
: NumUncondBranches;
Statistic &BranchTakenFreq = (I->succ_size() > 1) ? CondBranchTakenFreq
: UncondBranchTakenFreq;
for (MachineBasicBlock::succ_iterator SI = I->succ_begin(),
SE = I->succ_end();
SI != SE; ++SI) {
// Skip if this successor is a fallthrough.
if (I->isLayoutSuccessor(*SI))
continue;
BlockFrequency EdgeFreq = BlockFreq * MBPI->getEdgeProbability(I, *SI);
++NumBranches;
BranchTakenFreq += EdgeFreq.getFrequency();
}
}
return false;
}