30 KiB

Raw Blame History

Data flow analysis: an informal introduction

Abstract

This document introduces data flow analysis in an informal way. The goal is to give the reader an intuitive understanding of how it works, and show how it applies to a range of refactoring and bug finding problems.

Data flow analysis is a well-established technique; it is described in many papers, books, and videos. If you would like a more formal, or a more thorough explanation of the concepts mentioned in this document, please refer to the following resources:

The Lattice article in Wikipedia.
Videos on the PacketPrep YouTube channel that introduce lattices and the necessary background information: #20, #21, #22, #23, #24, #25.
Introduction to Dataflow Analysis
Introduction to abstract interpretation.
Introduction to symbolic execution.
Static Program Analysis by Anders Møller and Michael I. Schwartzbach.
EXE: automatically generating inputs of death (a paper that successfully applies symbolic execution to real-world software).

Data flow analysis

The purpose of data flow analysis

Data flow analysis is a static analysis technique that proves facts about a program or its fragment. It can make conclusions about all paths through the program, while taking control flow into account and scaling to large programs. The basic idea is propagating facts about the program through the edges of the control flow graph (CFG) until a fixpoint is reached.

Sample problem and an ad-hoc solution

We would like to explain data flow analysis while discussing an example. Let's imagine that we want to track possible values of an integer variable in our program. Here is how a human could annotate the code:

void Example(int n) {
  int x = 0;
  // x is {0}
  if (n > 0) {
    x = 5;
    // x is {5}
  } else {
    x = 42;
    // x is {42}
  }
  // x is {5; 42}
  print(x);
}

We use sets of integers to represent possible values of x. Local variables have unambiguous values between statements, so we annotate program points between statements with sets of possible values.

Here is how we arrived at these annotations. Assigning a constant to x allows us to make a conclusion that x can only have one value. When control flow from the "then" and "else" branches joins, x can have either value.

Abstract algebra provides a nice formalism that models this kind of structure, namely, a lattice. A join-semilattice is a partially ordered set, in which every two elements have a least upper bound (called a join).

join(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x

For this problem we will use the lattice of subsets of integers, with set inclusion relation as ordering and set union as a join.

Lattices are often represented visually as Hasse diagrams. Here is a Hasse diagram for our lattice that tracks subsets of integers:

Computing the join in the lattice corresponds to finding the lowest common ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount of literature on efficiently implementing LCA queries for a DAG, however Efficient Implementation of Lattice Operations (1989) (CiteSeerX, doi) describes a scheme that particularly well-suited for programmatic implementation.

Too much information and "top" values

Let's try to find the possible sets of values of x in a function that modifies x in a loop:

void ExampleOfInfiniteSets() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1;  // x is {0; 1; 2; …}
  }
  print(x);  // x is {0; 1; 2; …}
}

We have an issue: x can have any value greater than zero; that's an infinite set of values, if the program operated on mathematical integers. In C++ int is limited by INT_MAX so technically we have a set {0; 1; …; INT_MAX} which is still really big.

To make our analysis practical to compute, we have to limit the amount of information that we track. In this case, we can, for example, arbitrarily limit the size of sets to 3 elements. If at a certain program point x has more than 3 possible values, we stop tracking specific values at that program point. Instead, we denote possible values of x with the symbol ⊤ (pronounced "top" according to a convention in abstract algebra).

void ExampleOfTopWithALoop() {
  int x = 0;  // x is {0}
  while (condition()) {
    x += 1;   // x is ⊤
  }
  print(x);   // x is ⊤
}

The statement "at this program point, x's possible values are ⊤" is understood as "at this program point x can have any value because we have too much information, or the information is conflicting".

Note that we can get more than 3 possible values even without a loop:

void ExampleOfTopWithoutLoops(int n) {
  int x = 0;  // x is {0}
  switch(n) {
    case 0:  x = 1; break; // x is {1}
    case 1:  x = 9; break; // x is {9}
    case 2:  x = 7; break; // x is {7}
    default: x = 3; break; // x is {3}
  }
  // x is ⊤
}

Uninitialized variables and "bottom" values

When x is declared but not initialized, it has no possible values. We represent this fact symbolically as ⊥ (pronounced "bottom").

void ExampleOfBottom() {
  int x;    // x is ⊥
  x = 42;   // x is {42}
  print(x);
}

Note that using values read from uninitialized variables is undefined behaviour in C++. Generally, compilers and static analysis tools can assume undefined behavior does not happen. We must model uninitialized variables only when we are implementing a checker that specifically is trying to find uninitialized reads. In this example we show how to model uninitialized variables only to demonstrate the concept of "bottom", and how it applies to possible value analysis. We describe an analysis that finds uninitialized reads in a section below.

A practical lattice that tracks sets of concrete values

Taking into account all corner cases covered above, we can put together a lattice that we can use in practice to track possible values of integer variables. This lattice represents sets of integers with 1, 2, or 3 elements, as well as top and bottom. Here is a Hasse diagram for it:

Formalization

Let's consider a slightly more complex example, and think about how we can compute the sets of possible values algorithmically.

void Example(int n) {
  int x;          // x is ⊥
  if (n > 0) {
    if (n == 42) {
       x = 44;    // x is {44}
    } else {
       x = 5;     // x is {5}
    }
    print(x);     // x is {44; 5}
  } else {
    x = n;        // x is ⊤
  }
  print(x);       // x is ⊤
}

As humans, we understand the control flow from the program text. We used our understanding of control flow to find program points where two flows join. Formally, control flow is represented by a CFG (control flow graph):

We can compute sets of possible values by propagating them through the CFG of the function:

When x is declared but not initialized, its possible values are {}. The empty set plays the role of ⊥ in this lattice.
When x is assigned a concrete value, its possible set of values contains just that specific value.
When x is assigned some unknown value, it can have any value. We represent this fact as ⊤.
When two control flow paths join, we compute the set union of incoming values (limiting the number of elements to 3, representig larger sets as ⊤).

The sets of possible values are influenced by:

Statements, for example, assignments.
Joins in control flow, for example, ones that appear at the end of "if" statements.

Effects of statements are modeled by what is formally known as a transfer function. A transfer function takes two arguments: the statement, and the state of x at the previous program point. It produces the state of x at the next program point. For example, the transfer function for assignment ignores the state at the previous program point:

// GIVEN: x is {42; 44}
x = 0;
// CONCLUSION: x is {0}

The transfer function for + performs arithmetic on every set member:

// GIVEN: x is {42, 44}
x = x + 100;
// CONCLUSION: x is {142, 144}

Effects of control flow are modeled by joining the knowledge from all possible previous program points.

if (...) {
  ...
  // GIVEN: x is {42}
} else {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: x is {42; 44}

// GIVEN: x is {42}
while (...) {
  ...
  // GIVEN: x is {44}
}
// CONCLUSION: {42; 44}

The predicate that we marked "given" is usually called a precondition, and the conclusion is called a postcondition.

In terms of the CFG, we join the information from all predecessor basic blocks.

Putting it all together, to model the effects of a basic block we compute:

out = transfer(basic_block, join(in_1, in_2, ..., in_n))

(Note that there are other ways to write this equation that produce higher precision analysis results. The trick is to keep exploring the execution paths separately and delay joining until later. However, we won't discuss those variations here.)

To make a conclusion about all paths through the program, we repeat this computation on all basic blocks until we reach a fixpoint. In other words, we keep propagating information through the CFG until the computed sets of values stop changing.

If the lattice has a finite height and transfer functions are monotonic the algorithm is guaranteed to terminate. Each iteration of the algorithm can change computed values only to larger values from the lattice. In the worst case, all computed values become ⊤, which is not very useful, but at least the analysis terminates at that point, because it can't change any of the values.

Fixpoint iteration can be optimised by only reprocessing basic blocks which had one of their inputs changed on the previous iteration. This is typically implemented using a worklist queue. With this optimisation the time complexity becomes O(m * |L|), where m is the number of basic blocks in the CFG and |L| is the size of lattice used by the analysis.

Symbolic execution: a very short informal introduction

Symbolic values

In the previous example where we tried to figure out what values a variable can have, the analysis had to be seeded with a concrete value. What if there are no assignments of concrete values in the program? We can still deduce some interesting information by representing unknown input values symbolically, and computing results as symbolic expressions:

void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    result = x;   // result is {x}
  } else {
    result = -x;  // result is {-x}
  }
  print(result);  // result is {x; -x}
}

We can't say what specific value gets printed, but we know that it is either x or -x.

Dataflow analysis is an istance of abstract interpretation, and does not dictate how exactly the lattice and transfer functions should be designed, beyond the necessary conditions for the analysis to converge. Nevertheless, we can use symbolic execution ideas to guide our design of the lattice and transfer functions: lattice values can be symbolic expressions, and transfer functions can construct more complex symbolic expressions from symbolic expressions that represent arguments. See this StackOverflow discussion for a further comparison of abstract interpretation and symbolic execution.

Flow condition

A human can say about the previous example that the function returns x when x >= 0, and -x when x < 0. We can make this conclusion programmatically by tracking a flow condition. A flow condition is a predicate written in terms of the program state that is true at a specific program point regardless of the execution path that led to this statement. For example, the flow condition for the program point right before evaluating result = x is x >= 0.

If we enhance the lattice to be a set of pairs of values and predicates, the dataflow analysis computes the following values:

void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    // Flow condition: x >= 0.
    result = x;   // result is {x if x >= 0}
  } else {
    // Flow condition: x < 0.
    result = -x;  // result is {-x if x < 0}
  }
  print(result);  // result is {x if x >= 0; -x if x < 0}
}

Of course, in a program with loops, symbolic expressions for flow conditions can grow unbounded. A practical static analysis system must control this growth to keep the symbolic representations manageable and ensure that the data flow analysis terminates. For example, it can use a constraint solver to prune impossible flow conditions, and/or it can abstract them, losing precision, after their symbolic representations grow beyond some threshold. This is similar to how we had to limit the sizes of computed sets of possible values to 3 elements.

Symbolic pointers

This approach proves to be particularly useful for modeling pointer values, since we don't care about specific addresses but just want to give a unique identifier to a memory location.

void ExampleOfSymbolicPointers(bool b) {
  int x = 0;     // x is {0}
  int* ptr = &x; // x is {0}      ptr is {&x}
  if (b) {
    *ptr = 42;   // x is {42}     ptr is {&x}
  }
  print(x);      // x is {0; 42}  ptr is {&x}
}

Example: finding output parameters

Let's explore how data flow analysis can help with a problem that is hard to solve with other tools in Clang.

Problem description

Output parameters are function parameters of pointer or reference type whose pointee is completely overwritten by the function, and not read before it is overwritten. They are common in pre-C++11 code due to the absence of move semantics. In modern C++ output parameters are non-idiomatic, and return values are used instead.

Imagine that we would like to refactor output parameters to return values to modernize old code. The first step is to identify refactoring candidates through static analysis.

For example, in the following code snippet the pointer c is an output parameter:

struct Customer {
  int account_id;
  std::string name;
}

void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  } else {
    c->name = ...;
  }
}

We would like to refactor this code into:

Customer GetCustomer() {
  Customer c;
  c.account_id = ...;
  if (...) {
    c.name = ...;
  } else {
    c.name = ...;
  }
  return c;
}

However, in the function below the parameter c is not an output parameter because its field name is not overwritten on every path through the function.

void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  }
}

The code also cannot read the value of the parameter before overwriting it:

void GetCustomer(Customer *c) {
  use(c->account_id);
  c->name = ...;
  c->account_id = ...;
}

Functions that escape the pointer also block the refactoring:

Customer* kGlobalCustomer;

void GetCustomer(Customer *c) {
  c->name = ...;
  c->account_id = ...;
  kGlobalCustomer = c;
}

To identify a candidate function for refactoring, we need to do the following:

Find a function with a non-const pointer or reference parameter.
Find the definition of that function.
Prove that the function completely overwrites the pointee on all paths before returning.
Prove that the function reads the pointee only after overwriting it.
Prove that the function does not persist the pointer in a data structure that is live after the function returns.

There are also requirements that all usage sites of the candidate function must satisfy, for example, that function arguments do not alias, that users are not taking the address of the function, and so on. Let's consider verifying usage site conditions to be a separate static analysis problem.

Lattice design

To analyze the function body we can use a lattice which consists of normal states and failure states. A normal state describes program points where we are sure that no behaviors that block the refactoring have occurred. Normal states keep track of all parameter's member fields that are known to be overwritten on every path from function entry to the corresponding program point. Failure states accumulate observed violations (unsafe reads and pointer escapes) that block the refactoring.

In the partial order of the lattice failure states compare greater than normal states, which guarantees that they "win" when joined with normal states. Order between failure states is determined by inclusion relation on the set of accumulated violations (lattice's ⩽ is ⊆ on the set of violations). Order between normal states is determined by reversed inclusion relation on the set of overwritten parameter's member fields (lattice's ⩽ is ⊇ on the set of overwritten fields).

To determine whether a statement reads or writes a field we can implement symbolic evaluation of DeclRefExprs, LValueToRValue casts, pointer dereference operator and MemberExprs.

Using data flow results to identify output parameters

Let's take a look at how we use data flow analysis to identify an output parameter. The refactoring can be safely done when the data flow algorithm computes a normal state with all of the fields proven to be overwritten in the exit basic block of the function.

struct Customer {
  int account_id;
  std::string name;
};

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...; // Overwritten: {c->account_id}
  if (...) {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  } else {
    c->name = ...;     // Overwritten: {c->account_id, c->name}
  }
  // Overwritten: {c->account_id, c->name}
}

When the data flow algorithm computes a normal state, but not all fields are proven to be overwritten we can't perform the refactoring.

void target(bool b, Customer* c) {
  // Overwritten: {}
  if (b) {
    c->account_id = 42;     // Overwritten: {c->account_id}
  } else {
    c->name = "Konrad";  // Overwritten: {c->name}
  }
  // Overwritten: {}
}

Similarly, when the data flow algorithm computes a failure state, we also can't perform the refactoring.

Customer* kGlobalCustomer;

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...;    // Overwritten: {c->account_id}
  if (...) {
    print(c->name);       // Unsafe read
  } else {
    kGlobalCustomer = c;  // Pointer escape
  }
  // Unsafe read, Pointer escape
}

Example: finding dead stores

Let's say we want to find redundant stores, because they indicate potential bugs.

x = GetX();
x = GetY();

The first store to x is never read, probably there is a bug.

The implementation of dead store analysis is very similar to output parameter analysis: we need to track stores and loads, and find stores that were never read.

Liveness analysis is a generalization of this idea, which is often used to answer many related questions, for example:

finding dead stores,
finding uninitialized variables,
finding a good point to deallocate memory,
finding out if it would be safe to move an object.

Example: definitive initialization

Definitive initialization proves that variables are known to be initialized when read. If we find a variable which is read when not initialized then we generate a warning.

void Init() {
  int x;    // x is uninitialized
  if (cond()) {
    x = 10; // x is initialized
  } else {
    x = 20; // x is initialized
  }
  print(x); // x is initialized
}

void Uninit() {
  int x;    // x is uninitialized
  if (cond()) {
    x = 10; // x is initialized
  }
  print(x); // x is maybe uninitialized, x is being read, report a bug.
}

For this purpose we can use lattice in a form of a mapping from variable declarations to initialization states; each initialization state is represented by the followingn lattice:

A lattice element could also capture the source locations of the branches that lead us to the corresponding program point. Diagnostics would use this information to show a sample buggy code path to the user.

Example: refactoring raw pointers to `unique_ptr`

Modern idiomatic C++ uses smart pointers to express memory ownership, however in pre-C++11 code one can often find raw pointers that own heap memory blocks.

Imagine that we would like to refactor raw pointers that own memory to unique_ptr. There are multiple ways to design a data flow analysis for this problem; let's look at one way to do it.

For example, we would like to refactor the following code that uses raw pointers:

void UniqueOwnership1() {
  int *pi = new int;
  if (...) {
    Borrow(pi);
    delete pi;
  } else {
    TakeOwnership(pi);
  }
}

into code that uses unique_ptr:

void UniqueOwnership1() {
  auto pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    TakeOwnership(pi.release());
  }
}

This problem can be solved with a lattice in form of map from value declarations to pointer states:

We can perform the refactoring if at the exit of a function pi is Compatible.

void UniqueOwnership1() {
  int *pi;             // pi is Compatible
  pi = new int;        // pi is Defined
  if (...) {
    Borrow(pi);        // pi is Defined
    delete pi;         // pi is Compatible
  } else {
    TakeOwnership(pi); // pi is Compatible
  }
  // pi is Compatible
}

Let's look at an example where the raw pointer owns two different memory blocks:

void UniqueOwnership2() {
  int *pi = new int;  // pi is Defined
  Borrow(pi);
  delete pi;          // pi is Compatible
  if (smth) {
    pi = new int;     // pi is Defined
    Borrow(pi);
    delete pi;        // pi is Compatible
  }
  // pi is Compatible
}

It can be refactored to use unique_ptr like this:

void UniqueOwnership2() {
  auto pi = make_unique<int>();
  Borrow(pi);
  if (smth) {
    pi = make_unique<int>();
    Borrow(pi);
  }
}

In the following example, the raw pointer is used to access the heap object after the ownership has been transferred.

void UniqueOwnership3() {
  int *pi = new int; // pi is Defined
  if (...) {
    Borrow(pi);
    delete pi;       // pi is Compatible
  } else {
    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
    print(*pi);
    use(v);
  }
  // pi is Compatible
}

We can refactor this code to use unique_ptr, however we would have to introduce a non-owning pointer variable, since we can't use the moved-from unique_ptr to access the object:

void UniqueOwnership3() {
  std::unique_ptr<int> pi = std::make_unique<int>();
  if (...) {
    Borrow(pi);
  } else {
    int *pi_non_owning = pi.get();
    vector<unique_ptr<int>> v = {std::move(pi)};
    print(*pi_non_owning);
    use(v);
  }
}

If the original code didn't call delete at the very end of the function, then our refactoring may change the point at which we run the destructor and release memory. Specifically, if there is some user code after delete, then extending the lifetime of the object until the end of the function may hold locks for longer than necessary, introduce memory overhead etc.

One solution is to always replace delete with a call to reset(), and then perform another analysis that removes unnecessary reset() calls.

void AddedMemoryOverhead() {
  HugeObject *ho = new HugeObject();
  use(ho);
  delete ho; // Release the large amount of memory quickly.
  LongRunningFunction();
}

This analysis will refuse to refactor code that mixes borrowed pointer values and unique ownership. In the following code, GetPtr() returns a borrowed pointer, which is assigned to pi. Then, pi is used to hold a uniquely-owned pointer. We don't distinguish between these two assignments, and we want each assignment to be paired with a corresponding sink; otherwise, we transition the pointer to a Conflicting state, like in this example.

void ConflictingOwnership() {
  int *pi;           // pi is Compatible
  pi = GetPtr();     // pi is Defined
  Borrow(pi);        // pi is Defined

  pi = new int;      // pi is Conflicting
  Borrow(pi);
  delete pi;
  // pi is Conflicting
}

We could still handle this case by finding a maximal range in the code where pi could be in the Compatible state, and only refactoring that part.

void ConflictingOwnership() {
  int *pi;
  pi = GetPtr();
  Borrow(pi);

  std::unique_ptr<int> pi_unique = std::make_unique<int>();
  Borrow(pi_unique.get());
}

Example: finding redundant branch conditions

In the code below b1 should not be checked in both the outer and inner "if" statements. It is likely there is a bug in this code.

int F(bool b1, bool b2) {
  if (b1) {
    f();
    if (b1 && b2) {  // Check `b1` again -- unnecessary!
      g();
    }
  }
}

A checker that finds this pattern syntactically is already implemented in ClangTidy using AST matchers (bugprone-redundant-branch-condition).

To implement it using the data flow analysis framework, we can produce a warning if any part of the branch condition is implied by the flow condition.

int F(bool b1, bool b2) {
  // Flow condition: true.
  if (b1) {
    // Flow condition: b1.
    f();
    if (b1 && b2) { // `b1` is implied by the flow condition.
      g();
    }
  }
}

One way to check this implication is to use a SAT solver. Without a SAT solver, we could keep the flow condition in the CNF form and then it would be easy to check the implication.

Example: finding unchecked `std::optional` unwraps

Calling optional::value() is only valid if optional::has_value() is true. We want to show that when x.value() is executed, the flow condition implies x.has_value().

In the example below x.value() is accessed safely because it is guarded by the x.has_value() check.

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    use(x.value());
  }
}

While entering the if branch we deduce that x.has_value() is implied by the flow condition.

void Example(std::optional<int> x) {
  // Flow condition: true.
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    use(x.value());
  }
  // Flow condition: true.
}

We also need to prove that x is not modified between check and value access. The modification of x may be very subtle:

void F(std::optional<int> &x);

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    unknown_function(x); // may change x.
    // Flow condition: true.
    use(x.value());
  }
}

Example: finding dead code behind A/B experiment flags

Finding dead code is a classic application of data flow analysis.

Unused flags for A/B experiment hide dead code. However, this flavor of dead code is invisible to the compiler because the flag can be turned on at any moment.

We could make a tool that deletes experiment flags. The user tells us which flag they want to delete, and we assume that the it's value is a given constant.

For example, the user could use the tool to remove example_flag from this code:

DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");

void Example() {
  bool x = GetFlag(FLAGS_example_flag).empty();
  f();
  if (x) {
    g();
  } else {
    h();
  }
}

The tool would simplify the code to:

void Example() {
  f();
  g();
}

We can solve this problem with a classic constant propagation lattice combined with symbolic evaluation.

Example: finding inefficient usages of associative containers

Real-world code often accidentally performs repeated lookups in associative containers:

map<int, Employee> xs;
xs[42]->name = "...";
xs[42]->title = "...";

To find the above inefficiency we can use the available expressions analysis to understand that m[42] is evaluated twice.

map<int, Employee> xs;
Employee &e = xs[42];
e->name = "...";
e->title = "...";

We can also track the m.contains() check in the flow condition to find redundant checks, like in the example below.

std::map<int, Employee> xs;
if (!xs.contains(42)) {
  xs.insert({42, someEmployee});
}

Example: refactoring types that implicitly convert to each other

Refactoring one strong type to another is difficult, but the compiler can help: once you refactor one reference to the type, the compiler will flag other places where this information flows with type mismatch errors. Unfortunately this strategy does not work when you are refactoring types that implicitly convert to each other, for example, replacing int32_t with int64_t.

Imagine that we want to change user IDs from 32 to 64-bit integers. In other words, we need to find all integers tainted with user IDs. We can use data flow analysis to implement taint analysis.

void UseUser(int32_t user_id) {
  int32_t id = user_id;
  // Variable `id` is tainted with a user ID.
  ...
}

Taint analysis is very well suited to this problem because the program rarely branches on user IDs, and almost certainly does not perform any computation (like arithmetic).

30 KiB Raw Blame History Unescape Escape