# Data flow analysis: an informal introduction

## Abstract

This document introduces data flow analysis in an informal way. The goal is to
give the reader an intuitive understanding of how it works, and show how it
applies to a range of refactoring and bug finding problems.

Data flow analysis is a well-established technique; it is described in many
papers, books, and videos. If you would like a more formal or more thorough
explanation of the concepts mentioned in this document, please refer to the
following resources:

* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
* Videos on the PacketPrep YouTube channel that introduce lattices and the
  necessary background information:
  [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
  [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
  [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
  [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
  [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
  [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
  (a paper that successfully applies symbolic execution to real-world
  software).
## Data flow analysis

### The purpose of data flow analysis

Data flow analysis is a static analysis technique that proves facts about a
program or its fragment. It can make conclusions about all paths through the
program, while taking control flow into account and scaling to large programs.
The basic idea is propagating facts about the program through the edges of the
control flow graph (CFG) until a fixpoint is reached.

### Sample problem and an ad-hoc solution

We would like to explain data flow analysis while discussing an example. Let's
imagine that we want to track possible values of an integer variable in our
program. Here is how a human could annotate the code:

```c++
void Example(int n) {
  int x = 0;
  // x is {0}
  if (n > 0) {
    x = 5;
    // x is {5}
  } else {
    x = 42;
    // x is {42}
  }
  // x is {5; 42}
  print(x);
}
```

We use sets of integers to represent possible values of `x`. Local variables
have unambiguous values between statements, so we annotate program points
between statements with sets of possible values.

Here is how we arrived at these annotations. Assigning a constant to `x` allows
us to conclude that `x` can only have one value. When control flow from the
"then" and "else" branches joins, `x` can have either value.

Abstract algebra provides a nice formalism that models this kind of structure,
namely, a lattice. A join-semilattice is a partially ordered set in which every
two elements have a least upper bound (called a *join*).

```
join(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x
```
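As a quick sanity check of these properties: the integers ordered by `⩽` form a join-semilattice with `max` as the join. A minimal illustrative sketch:

```cpp
#include <algorithm>

// The integers ordered by <= form a join-semilattice: max(a, b) is the least
// upper bound of a and b, so it satisfies the three properties quoted above.
int join(int a, int b) { return std::max(a, b); }
```

For example, `join(3, 5)` is `5`, which is `⩾` both arguments, and `join(4, 4)` is `4`, demonstrating idempotence.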
For this problem we will use the lattice of subsets of integers, with the set
inclusion relation as ordering and set union as a join.

Lattices are often represented visually as Hasse diagrams. Here is a Hasse
diagram for our lattice that tracks subsets of integers:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)

Computing the join in the lattice corresponds to finding the lowest common
ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of
literature on efficiently implementing LCA queries for a DAG; however, Efficient
Implementation of Lattice Operations (1989)
([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that is
particularly well-suited for programmatic implementation.

### Too much information and "top" values

Let's try to find the possible sets of values of `x` in a function that modifies
`x` in a loop:

```c++
void ExampleOfInfiniteSets() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1; // x is {0; 1; 2; …}
  }
  print(x); // x is {0; 1; 2; …}
}
```

We have an issue: `x` can take any non-negative value; that's an infinite set of
values if the program operated on mathematical integers. In C++ `int` is limited
by `INT_MAX`, so technically we have a set `{0; 1; …; INT_MAX}`, which is still
really big.

To make our analysis practical to compute, we have to limit the amount of
information that we track. In this case, we can, for example, arbitrarily limit
the size of sets to 3 elements. If at a certain program point `x` has more than
3 possible values, we stop tracking specific values at that program point.
Instead, we denote the possible values of `x` with the symbol `⊤` (pronounced
"top" according to a convention in abstract algebra).

```c++
void ExampleOfTopWithALoop() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1; // x is ⊤
  }
  print(x); // x is ⊤
}
```

The statement "at this program point, `x`'s possible values are `⊤`" is
understood as "at this program point `x` can have any value because we have too
much information, or the information is conflicting".

Note that we can get more than 3 possible values even without a loop:

```c++
void ExampleOfTopWithoutLoops(int n) {
  int x = 0; // x is {0}
  switch (n) {
    case 0: x = 1; break; // x is {1}
    case 1: x = 9; break; // x is {9}
    case 2: x = 7; break; // x is {7}
    default: x = 3; break; // x is {3}
  }
  // x is ⊤
}
```
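The bounded abstraction described in this section can be sketched in code. This is purely illustrative (not the API of any real analysis framework): a lattice element is either a set of at most 3 concrete values, or a "top" state that the join produces once the bound is exceeded.

```cpp
#include <set>

// Illustrative sketch of a bounded value-set lattice element: either "top"
// (too many values to track) or a set of at most 3 concrete integer values.
struct PossibleValues {
  bool is_top = false;
  std::set<int> values; // meaningful only when !is_top

  static PossibleValues Top() { return {true, {}}; }

  // join = set union, widened to top when the result exceeds 3 elements.
  static PossibleValues Join(const PossibleValues &a, const PossibleValues &b) {
    if (a.is_top || b.is_top)
      return Top();
    PossibleValues result = a;
    result.values.insert(b.values.begin(), b.values.end());
    if (result.values.size() > 3)
      return Top();
    return result;
  }
};
```

Joining `{5}` and `{42}` yields `{5; 42}`, while joining `{1; 9; 7}` with `{3}` exceeds the bound and yields `⊤`, matching the switch example above.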
### Uninitialized variables and "bottom" values

When `x` is declared but not initialized, it has no possible values. We
represent this fact symbolically as `⊥` (pronounced "bottom").

```c++
void ExampleOfBottom() {
  int x; // x is ⊥
  x = 42; // x is {42}
  print(x);
}
```

Note that using values read from uninitialized variables is undefined behaviour
in C++. Generally, compilers and static analysis tools can assume undefined
behavior does not happen. We must model uninitialized variables only when we are
implementing a checker that is specifically trying to find uninitialized reads.
In this example we show how to model uninitialized variables only to demonstrate
the concept of "bottom", and how it applies to possible value analysis. We
describe an analysis that finds uninitialized reads in a section below.

### A practical lattice that tracks sets of concrete values

Taking into account all corner cases covered above, we can put together a
lattice that we can use in practice to track possible values of integer
variables. This lattice represents sets of integers with 1, 2, or 3 elements, as
well as top and bottom. Here is a Hasse diagram for it:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)

### Formalization

Let's consider a slightly more complex example, and think about how we can
compute the sets of possible values algorithmically.

```c++
void Example(int n) {
  int x; // x is ⊥
  if (n > 0) {
    if (n == 42) {
      x = 44; // x is {44}
    } else {
      x = 5; // x is {5}
    }
    print(x); // x is {44; 5}
  } else {
    x = n; // x is ⊤
  }
  print(x); // x is ⊤
}
```

As humans, we understand the control flow from the program text. We used our
understanding of control flow to find program points where two flows join.
Formally, control flow is represented by a CFG (control flow graph):

![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)

We can compute sets of possible values by propagating them through the CFG of
the function:

* When `x` is declared but not initialized, its possible values are `{}`. The
  empty set plays the role of `⊥` in this lattice.

* When `x` is assigned a concrete value, its possible set of values contains
  just that specific value.

* When `x` is assigned some unknown value, it can have any value. We represent
  this fact as `⊤`.

* When two control flow paths join, we compute the set union of incoming
  values (limiting the number of elements to 3, representing larger sets as
  `⊤`).

The sets of possible values are influenced by:

* Statements, for example, assignments.

* Joins in control flow, for example, ones that appear at the end of "if"
  statements.

**Effects of statements** are modeled by what is formally known as a transfer
function. A transfer function takes two arguments: the statement, and the state
of `x` at the previous program point. It produces the state of `x` at the next
program point. For example, the transfer function for assignment ignores the
state at the previous program point:

```c++
// GIVEN: x is {42; 44}
x = 0;
// CONCLUSION: x is {0}
```

The transfer function for `+` performs arithmetic on every set member:

```c++
// GIVEN: x is {42; 44}
x = x + 100;
// CONCLUSION: x is {142; 144}
```

**Effects of control flow** are modeled by joining the knowledge from all
possible previous program points.

```c++
if (...) {
  ...
  // GIVEN: x is {42}
} else {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: x is {42; 44}
```

```c++
// GIVEN: x is {42}
while (...) {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: x is {42; 44}
```

The predicate that we marked "given" is usually called a precondition, and the
conclusion is called a postcondition.

In terms of the CFG, we join the information from all predecessor basic blocks.

![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)

Putting it all together, to model the effects of a basic block we compute:

```
out = transfer(basic_block, join(in_1, in_2, ..., in_n))
```

(Note that there are other ways to write this equation that produce higher
precision analysis results. The trick is to keep exploring the execution paths
separately and delay joining until later. However, we won't discuss those
variations here.)

To make a conclusion about all paths through the program, we repeat this
computation on all basic blocks until we reach a fixpoint. In other words, we
keep propagating information through the CFG until the computed sets of values
stop changing.

If the lattice has a finite height and transfer functions are monotonic, the
algorithm is guaranteed to terminate. Each iteration of the algorithm can
change computed values only to larger values from the lattice. In the worst
case, all computed values become `⊤`, which is not very useful, but at least the
analysis terminates at that point, because it can't change any of the values.

Fixpoint iteration can be optimised by only reprocessing basic blocks which had
one of their inputs changed on the previous iteration. This is typically
implemented using a worklist queue. With this optimisation the time complexity
becomes `O(m * |L|)`, where `m` is the number of basic blocks in the CFG and
`|L|` is the size of the lattice used by the analysis.
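The worklist scheme just described can be sketched as follows. This is a minimal illustration rather than a real framework: lattice elements are bitmasks over a small universe of facts (join is bitwise OR, which is monotonic), `preds` encodes the CFG, and `gen[b]` is a stand-in for block `b`'s transfer function.

```cpp
#include <queue>
#include <vector>

// Sketch of worklist-driven fixpoint iteration. Each block's output state is
// a bitmask of facts; the join over predecessors is bitwise OR, and the
// transfer function simply ORs in the block's own facts (gen[b]).
std::vector<unsigned> RunToFixpoint(const std::vector<std::vector<int>> &preds,
                                    const std::vector<unsigned> &gen) {
  int n = static_cast<int>(gen.size());
  // Derive successor lists so we can re-enqueue affected blocks.
  std::vector<std::vector<int>> succs(n);
  for (int b = 0; b < n; ++b)
    for (int p : preds[b])
      succs[p].push_back(b);

  std::vector<unsigned> out(n, 0);
  std::queue<int> worklist;
  for (int b = 0; b < n; ++b)
    worklist.push(b);
  while (!worklist.empty()) {
    int b = worklist.front();
    worklist.pop();
    unsigned in = 0;
    for (int p : preds[b])
      in |= out[p];                   // join the states of all predecessors
    unsigned new_out = in | gen[b];   // apply the block's transfer function
    if (new_out != out[b]) {
      out[b] = new_out;               // the state grew: reprocess successors
      for (int s : succs[b])
        worklist.push(s);
    }
  }
  return out;
}
```

For a diamond CFG `0 → {1, 2} → 3` with `gen = {1, 2, 4, 0}`, the exit block converges to `1 | 2 | 4 = 7`; because OR is monotonic over a finite bitmask, the loop is guaranteed to terminate.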
## Symbolic execution: a very short informal introduction

### Symbolic values

In the previous example where we tried to figure out what values a variable can
have, the analysis had to be seeded with a concrete value. What if there are no
assignments of concrete values in the program? We can still deduce some
interesting information by representing unknown input values symbolically, and
computing results as symbolic expressions:

```c++
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    result = x; // result is {x}
  } else {
    result = -x; // result is {-x}
  }
  print(result); // result is {x; -x}
}
```

We can't say what specific value gets printed, but we know that it is either `x`
or `-x`.

Data flow analysis is an instance of abstract interpretation, and does not
dictate how exactly the lattice and transfer functions should be designed,
beyond the necessary conditions for the analysis to converge. Nevertheless, we
can use symbolic execution ideas to guide our design of the lattice and transfer
functions: lattice values can be symbolic expressions, and transfer functions
can construct more complex symbolic expressions from symbolic expressions that
represent arguments. See [this StackOverflow
discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
for a further comparison of abstract interpretation and symbolic execution.

### Flow condition

A human can say about the previous example that the function returns `x` when
`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by
tracking a flow condition. A flow condition is a predicate written in terms of
the program state that is true at a specific program point regardless of the
execution path that led to this statement. For example, the flow condition for
the program point right before evaluating `result = x` is `x >= 0`.

If we enhance the lattice to be a set of pairs of values and predicates, the
data flow analysis computes the following values:

```c++
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    // Flow condition: x >= 0.
    result = x; // result is {x if x >= 0}
  } else {
    // Flow condition: x < 0.
    result = -x; // result is {-x if x < 0}
  }
  print(result); // result is {x if x >= 0; -x if x < 0}
}
```

Of course, in a program with loops, symbolic expressions for flow conditions can
grow unbounded. A practical static analysis system must control this growth to
keep the symbolic representations manageable and ensure that the data flow
analysis terminates. For example, it can use a constraint solver to prune
impossible flow conditions, and/or it can abstract them, losing precision, after
their symbolic representations grow beyond some threshold. This is similar to
how we had to limit the sizes of computed sets of possible values to 3 elements.
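One illustrative way to encode such lattice values is as a set of (value, condition) pairs. This is purely a sketch (real implementations use structured symbolic expressions, not strings), but it shows the join: the alternatives from each incoming path are simply accumulated.

```cpp
#include <string>
#include <vector>

// Sketch: a lattice value as a set of (symbolic expression, flow condition)
// pairs, both kept as strings purely for illustration.
struct GuardedValue {
  std::string expr;
  std::string condition;
};

// Join concatenates the guarded alternatives from two incoming paths.
std::vector<GuardedValue> Join(std::vector<GuardedValue> a,
                               const std::vector<GuardedValue> &b) {
  a.insert(a.end(), b.begin(), b.end());
  return a;
}
```

Joining `{x if x >= 0}` with `{-x if x < 0}` produces the two-element value annotated at the final `print(result)` above.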
### Symbolic pointers

This approach proves to be particularly useful for modeling pointer values,
since we don't care about specific addresses but just want to give a unique
identifier to a memory location.

```c++
void ExampleOfSymbolicPointers(bool b) {
  int x = 0;     // x is {0}
  int* ptr = &x; // x is {0}     ptr is {&x}
  if (b) {
    *ptr = 42;   // x is {42}    ptr is {&x}
  }
  print(x);      // x is {0; 42} ptr is {&x}
}
```

## Example: finding output parameters

Let's explore how data flow analysis can help with a problem that is hard to
solve with other tools in Clang.

### Problem description

Output parameters are function parameters of pointer or reference type whose
pointee is completely overwritten by the function, and not read before it is
overwritten. They are common in pre-C++11 code due to the absence of move
semantics. In modern C++ output parameters are non-idiomatic, and return values
are used instead.

Imagine that we would like to refactor output parameters to return values to
modernize old code. The first step is to identify refactoring candidates through
static analysis.

For example, in the following code snippet the pointer `c` is an output
parameter:

```c++
struct Customer {
  int account_id;
  std::string name;
};

void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  } else {
    c->name = ...;
  }
}
```

We would like to refactor this code into:

```c++
Customer GetCustomer() {
  Customer c;
  c.account_id = ...;
  if (...) {
    c.name = ...;
  } else {
    c.name = ...;
  }
  return c;
}
```

However, in the function below the parameter `c` is not an output parameter
because its field `name` is not overwritten on every path through the function.

```c++
void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  }
}
```

The code also must not read the value of the parameter before overwriting it:

```c++
void GetCustomer(Customer *c) {
  use(c->account_id);
  c->name = ...;
  c->account_id = ...;
}
```

Functions that escape the pointer also block the refactoring:

```c++
Customer* kGlobalCustomer;

void GetCustomer(Customer *c) {
  c->name = ...;
  c->account_id = ...;
  kGlobalCustomer = c;
}
```

To identify a candidate function for refactoring, we need to do the following:

* Find a function with a non-const pointer or reference parameter.

* Find the definition of that function.

* Prove that the function completely overwrites the pointee on all paths
  before returning.

* Prove that the function reads the pointee only after overwriting it.

* Prove that the function does not persist the pointer in a data structure
  that is live after the function returns.

There are also requirements that all usage sites of the candidate function must
satisfy, for example, that function arguments do not alias, that users are not
taking the address of the function, and so on. Let's consider verifying usage
site conditions to be a separate static analysis problem.

### Lattice design

To analyze the function body we can use a lattice which consists of normal
states and failure states. A normal state describes program points where we are
sure that no behaviors that block the refactoring have occurred. Normal states
keep track of all of the parameter's member fields that are known to be
overwritten on every path from function entry to the corresponding program
point. Failure states accumulate observed violations (unsafe reads and pointer
escapes) that block the refactoring.

In the partial order of the lattice, failure states compare greater than normal
states, which guarantees that they "win" when joined with normal states. The
order between failure states is determined by the inclusion relation on the set
of accumulated violations (the lattice's `⩽` is `⊆` on the set of violations).
The order between normal states is determined by the reversed inclusion relation
on the set of overwritten parameter's member fields (the lattice's `⩽` is `⊇` on
the set of overwritten fields).

![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)

To determine whether a statement reads or writes a field, we can implement
symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, the pointer
dereference operator, and `MemberExpr`s.
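The join for this lattice can be sketched in code. This is an illustrative sketch under the ordering described above, not the framework's actual implementation; field and violation names are hypothetical.

```cpp
#include <set>
#include <string>

// Sketch of the output-parameter lattice element: either a normal state
// tracking fields proven overwritten on every path, or a failure state
// accumulating observed violations.
struct State {
  bool failed = false;
  std::set<std::string> overwritten; // normal state: fields overwritten so far
  std::set<std::string> violations;  // failure state: accumulated violations

  static State Join(const State &a, const State &b) {
    State r;
    if (a.failed || b.failed) {
      // Failure states win over normal states; since `<=` is set inclusion
      // on violations, their join is set union.
      r.failed = true;
      r.violations = a.violations;
      r.violations.insert(b.violations.begin(), b.violations.end());
    } else {
      // For normal states `<=` is *reversed* inclusion on overwritten fields,
      // so the join keeps only fields overwritten on both incoming paths.
      for (const std::string &f : a.overwritten)
        if (b.overwritten.count(f))
          r.overwritten.insert(f);
    }
    return r;
  }
};
```

Joining two branches that overwrite `{account_id, name}` and `{account_id}` keeps only `{account_id}`, which is exactly why a field skipped on one path blocks the refactoring.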
### Using data flow results to identify output parameters

Let's take a look at how we use data flow analysis to identify an output
parameter. The refactoring can be safely done when the data flow algorithm
computes a normal state with all of the fields proven to be overwritten in the
exit basic block of the function.

```c++
struct Customer {
  int account_id;
  std::string name;
};

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...; // Overwritten: {c->account_id}
  if (...) {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  } else {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  }
  // Overwritten: {c->account_id, c->name}
}
```

When the data flow algorithm computes a normal state, but not all fields are
proven to be overwritten, we can't perform the refactoring.

```c++
void target(bool b, Customer* c) {
  // Overwritten: {}
  if (b) {
    c->account_id = 42; // Overwritten: {c->account_id}
  } else {
    c->name = "Konrad"; // Overwritten: {c->name}
  }
  // Overwritten: {}
}
```

Similarly, when the data flow algorithm computes a failure state, we also can't
perform the refactoring.

```c++
Customer* kGlobalCustomer;

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...;   // Overwritten: {c->account_id}
  if (...) {
    print(c->name);      // Unsafe read
  } else {
    kGlobalCustomer = c; // Pointer escape
  }
  // Unsafe read, Pointer escape
}
```

## Example: finding dead stores

Let's say we want to find redundant stores, because they indicate potential
bugs.

```c++
x = GetX();
x = GetY();
```

The first store to `x` is never read; this probably indicates a bug.

The implementation of dead store analysis is very similar to output parameter
analysis: we need to track stores and loads, and find stores that were never
read.

[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
generalization of this idea, which is often used to answer many related
questions, for example:

* finding dead stores,
* finding uninitialized variables,
* finding a good point to deallocate memory,
* finding out if it would be safe to move an object.
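The core idea of tracking stores and loads can be sketched on straight-line code. This uses a hypothetical mini-IR rather than Clang's AST: scanning backwards, a store is dead if the same variable is stored again before any load reads it.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical mini-IR: a straight-line program as a list of operations,
// where each operation either stores to or loads from a named variable.
struct Op {
  bool is_store;
  std::string var;
};

// Returns the indices of dead stores. Scanning backwards, `overwritten`
// holds variables that are stored later with no intervening load.
std::vector<int> FindDeadStores(const std::vector<Op> &ops) {
  std::vector<int> dead;
  std::set<std::string> overwritten;
  for (int i = static_cast<int>(ops.size()) - 1; i >= 0; --i) {
    if (ops[i].is_store) {
      if (overwritten.count(ops[i].var))
        dead.push_back(i); // a later store shadows this one before any read
      overwritten.insert(ops[i].var);
    } else {
      overwritten.erase(ops[i].var); // a load keeps earlier stores live
    }
  }
  return dead;
}
```

For the `x = GetX(); x = GetY();` pattern above (store, store, load), the first store is reported dead.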
## Example: definitive initialization

Definitive initialization proves that variables are known to be initialized when
read. If we find a variable which is read when not initialized then we generate
a warning.

```c++
void Init() {
  int x;      // x is uninitialized
  if (cond()) {
    x = 10;   // x is initialized
  } else {
    x = 20;   // x is initialized
  }
  print(x);   // x is initialized
}
```

```c++
void Uninit() {
  int x;      // x is uninitialized
  if (cond()) {
    x = 10;   // x is initialized
  }
  print(x);   // x is maybe uninitialized, x is being read, report a bug.
}
```

For this purpose we can use a lattice in the form of a mapping from variable
declarations to initialization states; each initialization state is represented
by the following lattice:

![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)

A lattice element could also capture the source locations of the branches that
lead us to the corresponding program point. Diagnostics would use this
information to show a sample buggy code path to the user.
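One plausible encoding of the per-variable initialization lattice is sketched below. This is an assumption about its shape (the authoritative picture is the diagram): "maybe uninitialized" acts as the top element that a join of conflicting branches produces, which is exactly what the `Uninit` example above reports.

```cpp
// Sketch of a per-variable initialization lattice: Bottom is below both
// Initialized and Uninitialized, and MaybeUninitialized is the top element.
enum class InitState { Bottom, Initialized, Uninitialized, MaybeUninitialized };

// Join moves upward in the lattice; conflicting facts go to the top.
InitState Join(InitState a, InitState b) {
  if (a == b)
    return a;
  if (a == InitState::Bottom)
    return b;
  if (b == InitState::Bottom)
    return a;
  return InitState::MaybeUninitialized;
}
```

Joining "initialized" from the then-branch with "uninitialized" from the fall-through path yields "maybe uninitialized", which triggers the warning at the read.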
## Example: refactoring raw pointers to `unique_ptr`

Modern idiomatic C++ uses smart pointers to express memory ownership; however,
in pre-C++11 code one can often find raw pointers that own heap memory blocks.

Imagine that we would like to refactor raw pointers that own memory to
`unique_ptr`. There are multiple ways to design a data flow analysis for this
problem; let's look at one way to do it.

For example, we would like to refactor the following code that uses raw
pointers:

```c++
void UniqueOwnership1() {
  int *pi = new int;
  if (...) {
    Borrow(pi);
    delete pi;
  } else {
    TakeOwnership(pi);
  }
}
```

into code that uses `unique_ptr`:

```c++
void UniqueOwnership1() {
  auto pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    TakeOwnership(pi.release());
  }
}
```

This problem can be solved with a lattice in the form of a map from value
declarations to pointer states:

![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)

We can perform the refactoring if at the exit of a function `pi` is
`Compatible`.

```c++
void UniqueOwnership1() {
  int *pi;             // pi is Compatible
  pi = new int;        // pi is Defined
  if (...) {
    Borrow(pi);        // pi is Defined
    delete pi;         // pi is Compatible
  } else {
    TakeOwnership(pi); // pi is Compatible
  }
  // pi is Compatible
}
```

Let's look at an example where the raw pointer owns two different memory blocks:

```c++
void UniqueOwnership2() {
  int *pi = new int; // pi is Defined
  Borrow(pi);
  delete pi;         // pi is Compatible
  if (smth) {
    pi = new int;    // pi is Defined
    Borrow(pi);
    delete pi;       // pi is Compatible
  }
  // pi is Compatible
}
```

It can be refactored to use `unique_ptr` like this:

```c++
void UniqueOwnership2() {
  auto pi = std::make_unique<int>();
  Borrow(pi.get());
  if (smth) {
    pi = std::make_unique<int>();
    Borrow(pi.get());
  }
}
```

In the following example, the raw pointer is used to access the heap object
after the ownership has been transferred.

```c++
void UniqueOwnership3() {
  int *pi = new int; // pi is Defined
  if (...) {
    Borrow(pi);
    delete pi;       // pi is Compatible
  } else {
    vector<unique_ptr<int>> v = {std::unique_ptr<int>(pi)}; // pi is Compatible
    print(*pi);
    use(v);
  }
  // pi is Compatible
}
```

We can refactor this code to use `unique_ptr`, however we would have to
introduce a non-owning pointer variable, since we can't use the moved-from
`unique_ptr` to access the object:

```c++
void UniqueOwnership3() {
  std::unique_ptr<int> pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    int *pi_non_owning = pi.get();
    vector<unique_ptr<int>> v = {std::move(pi)};
    print(*pi_non_owning);
    use(v);
  }
}
```

If the original code didn't call `delete` at the very end of the function, then
our refactoring may change the point at which we run the destructor and release
memory. Specifically, if there is some user code after `delete`, then extending
the lifetime of the object until the end of the function may hold locks for
longer than necessary, introduce memory overhead etc.

One solution is to always replace `delete` with a call to `reset()`, and then
perform another analysis that removes unnecessary `reset()` calls.

```c++
void AddedMemoryOverhead() {
  HugeObject *ho = new HugeObject();
  use(ho);
  delete ho; // Release the large amount of memory quickly.
  LongRunningFunction();
}
```

This analysis will refuse to refactor code that mixes borrowed pointer values
and unique ownership. In the following code, `GetPtr()` returns a borrowed
pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
pointer. We don't distinguish between these two assignments, and we want each
assignment to be paired with a corresponding sink; otherwise, we transition the
pointer to a `Conflicting` state, like in this example.

```c++
void ConflictingOwnership() {
  int *pi;       // pi is Compatible
  pi = GetPtr(); // pi is Defined
  Borrow(pi);    // pi is Defined

  pi = new int;  // pi is Conflicting
  Borrow(pi);
  delete pi;
  // pi is Conflicting
}
```

We could still handle this case by finding a maximal range in the code where
`pi` could be in the `Compatible` state, and only refactoring that part.

```c++
void ConflictingOwnership() {
  int *pi;
  pi = GetPtr();
  Borrow(pi);

  std::unique_ptr<int> pi_unique = std::make_unique<int>();
  Borrow(pi_unique.get());
}
```
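The per-pointer transitions implied by the annotations above can be sketched as a small state machine. This is an illustrative simplification of the lattice's transfer functions, not the actual analysis: any assignment must be paired with a sink (`delete` or an ownership transfer) before the next assignment.

```cpp
// Sketch of the per-pointer states used in the annotations above.
enum class PtrState { Compatible, Defined, Conflicting };

// Transfer for any assignment to the pointer: Compatible -> Defined; a second
// assignment without an intervening sink is a conflict.
PtrState OnAssign(PtrState s) {
  return s == PtrState::Compatible ? PtrState::Defined : PtrState::Conflicting;
}

// Transfer for a sink (delete pi, TakeOwnership(pi)): it pairs with the
// preceding assignment and returns the pointer to the Compatible state.
PtrState OnSink(PtrState s) {
  return s == PtrState::Defined ? PtrState::Compatible : PtrState::Conflicting;
}
```

Tracing `ConflictingOwnership`: `pi = GetPtr()` moves `pi` to `Defined`, and the unpaired `pi = new int` then moves it to `Conflicting`, matching the annotations.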
## Example: finding redundant branch conditions

In the code below, `b1` should not be checked in both the outer and inner `if`
statements. It is likely that there is a bug in this code.

```c++
int F(bool b1, bool b2) {
  if (b1) {
    f();
    if (b1 && b2) { // Check `b1` again -- unnecessary!
      g();
    }
  }
}
```

A checker that finds this pattern syntactically is already implemented in
ClangTidy using AST matchers (`bugprone-redundant-branch-condition`).

To implement it using the data flow analysis framework, we can produce a
warning if any part of the branch condition is implied by the flow condition.

```c++
int F(bool b1, bool b2) {
  // Flow condition: true.
  if (b1) {
    // Flow condition: b1.
    f();
    if (b1 && b2) { // `b1` is implied by the flow condition.
      g();
    }
  }
}
```

One way to check this implication is to use a SAT solver. Alternatively, we
could keep the flow condition in conjunctive normal form (CNF), which makes
the implication easy to check.

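For the common case where the flow condition is just a conjunction of atomic literals, the implication check degenerates to set membership. A minimal sketch under that assumption (`FlowCondition` and `implies` are illustrative names, not framework API):

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical representation: a flow condition that is a conjunction of
// atomic literals, e.g. {"b1"} after entering `if (b1)`.
using FlowCondition = std::set<std::string>;

// A literal in a branch condition is implied by such a flow condition
// exactly when it is already one of the conjuncts.
bool implies(const FlowCondition &fc, const std::string &literal) {
  return fc.count(literal) > 0;
}
```

A real flow condition also contains negations and disjunctions (from `else` branches and merges), which is why a SAT solver is the general answer.
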
## Example: finding unchecked `std::optional` unwraps

Calling `optional::value()` is only valid if `optional::has_value()` is true.
We want to show that when `x.value()` is executed, the flow condition implies
`x.has_value()`.

In the example below, `x.value()` is accessed safely because it is guarded by
the `x.has_value()` check.

```c++
void Example(std::optional<int> &x) {
  if (x.has_value()) {
    use(x.value());
  }
}
```

When entering the `if` branch, we deduce that `x.has_value()` is implied by
the flow condition.

```c++
void Example(std::optional<int> &x) {
  // Flow condition: true.
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    use(x.value());
  }
  // Flow condition: true.
}
```

We also need to prove that `x` is not modified between the check and the value
access. The modification of `x` may be very subtle:

```c++
void unknown_function(std::optional<int> &x);

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    unknown_function(x); // may change x.
    // Flow condition: true.
    use(x.value());
  }
}
```

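A conservative way to model this is to drop every fact the analysis has proven about `x` at any call that may modify it. A sketch with a hypothetical `Environment` type (not the framework's actual data structures):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical environment: maps a variable name to whether we have
// proven `has_value()` for it; an absent entry means "unknown".
using Environment = std::map<std::string, bool>;

// A call that may modify `arg` (e.g. one taking it by non-const
// reference) invalidates everything we knew about it.
void handleOpaqueCall(Environment &env, const std::string &arg) {
  env.erase(arg);
}
```

After the erase, a later `use(x.value())` is no longer justified by the flow condition, so the checker reports it.
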
## Example: finding dead code behind A/B experiment flags

Finding dead code is a classic application of data flow analysis.

Unused A/B experiment flags hide dead code. However, this flavor of dead code
is invisible to the compiler because the flag can be turned on at any moment.

We could make a tool that deletes experiment flags. The user tells us which
flag they want to delete, and we assume that its value is a given constant.

For example, the user could use the tool to remove `example_flag` from this
code:

```c++
DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");

void Example() {
  bool x = GetFlag(FLAGS_example_flag).empty();
  f();
  if (x) {
    g();
  } else {
    h();
  }
}
```

The tool would simplify the code to:

```c++
void Example() {
  f();
  g();
}
```

We can solve this problem with a classic constant propagation lattice combined
with symbolic evaluation.

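A minimal sketch of such a constant-propagation lattice, specialized to a boolean like `x` above (the encoding is illustrative; the classic lattice also has a Bottom element for unreachable code, omitted here):

```cpp
#include <cassert>
#include <optional>

// Illustrative constant-propagation lattice element for a boolean:
// an engaged optional is a known constant, std::nullopt is Top
// ("could be anything").
using BoolLattice = std::optional<bool>;

// Join at a control flow merge: equal constants stay constant,
// anything else goes to Top.
BoolLattice join(BoolLattice a, BoolLattice b) {
  if (a && b && *a == *b)
    return a;
  return std::nullopt;
}
```

Once the analysis proves that `x` is the constant `true`, the `if` can be folded and the `else` branch deleted as dead.
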
## Example: finding inefficient usages of associative containers

Real-world code often accidentally performs repeated lookups in associative
containers:

```c++
map<int, Employee> xs;
xs[42].name = "...";
xs[42].title = "...";
```

To find the above inefficiency we can use the available expressions analysis
to understand that `xs[42]` is evaluated twice, and refactor the code to
perform the lookup once:

```c++
map<int, Employee> xs;
Employee &e = xs[42];
e.name = "...";
e.title = "...";
```

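Available-expressions analysis can be sketched as a set of expressions already computed on every path to the current point; a lookup whose expression is already in the set is redundant. A simplified sketch using canonical strings for expressions (illustrative names, not framework API):

```cpp
#include <cassert>
#include <set>
#include <string>

// Expressions (as canonical strings, e.g. "xs[42]") already computed on
// every path reaching the current program point.
using AvailableExprs = std::set<std::string>;

// Returns true if `expr` was already available (a repeated lookup);
// otherwise records it as available from now on.
bool isRedundant(AvailableExprs &avail, const std::string &expr) {
  return !avail.insert(expr).second;
}
```

A real implementation must also remove ("kill") expressions whose operands may have changed, for example when `xs` is modified between the two lookups.
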
We can also track the `xs.contains()` check in the flow condition to find
redundant checks, as in the example below.

```c++
std::map<int, Employee> xs;
if (!xs.contains(42)) {
  xs.insert({42, someEmployee});
}
```

## Example: refactoring types that implicitly convert to each other

Refactoring one strong type to another is difficult, but the compiler can
help: once you refactor one reference to the type, the compiler will flag
other places where this information flows with type mismatch errors.
Unfortunately, this strategy does not work when you are refactoring types that
implicitly convert to each other, for example, replacing `int32_t` with
`int64_t`.

Imagine that we want to change user IDs from 32-bit to 64-bit integers. In
other words, we need to find all integers tainted with user IDs. We can use
data flow analysis to implement taint analysis.

```c++
void UseUser(int32_t user_id) {
  int32_t id = user_id;
  // Variable `id` is tainted with a user ID.
  ...
}
```

Taint analysis is very well suited to this problem because the program rarely
branches on user IDs, and almost certainly does not perform any computation
(like arithmetic) on them.
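
The taint propagation itself can be sketched as a set of tainted variables updated by a transfer function for assignments. A minimal sketch (names are illustrative; a real implementation works over the CFG and program values, not variable names):

```cpp
#include <cassert>
#include <set>
#include <string>

// Variables currently known to carry a user ID.
using TaintSet = std::set<std::string>;

// Transfer function for an assignment `lhs = rhs;`: taint flows from the
// right-hand side into the left-hand side, and assigning from an
// untainted source clears any previous taint of `lhs`.
void transferAssign(TaintSet &tainted, const std::string &lhs,
                    const std::string &rhs) {
  if (tainted.count(rhs))
    tainted.insert(lhs);
  else
    tainted.erase(lhs);
}
```

In the `UseUser` example above, seeding the set with `user_id` and applying the transfer function to `id = user_id;` marks `id` as tainted, which tells the refactoring tool that `id` must also become `int64_t`.
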