forked from OSchip/llvm-project
198 lines
9.0 KiB
Plaintext
198 lines
9.0 KiB
Plaintext
Static Analyzer: 'Regions'
|
|
--------------------------
|
|
|
|
INTRODUCTION
|
|
|
|
The path-sensitive analysis engine in libAnalysis employs an extensible API
|
|
for abstractly modeling the memory of an analyzed program. This API employs
|
|
the concept of "memory regions" to abstractly model chunks of program memory
|
|
such as program variables and dynamically allocated memory such as those
|
|
returned from 'malloc' and 'alloca'. Regions are hierarchical, with subregions
|
|
modeling subtyping relationships, field and array offsets into larger chunks
|
|
of memory, and so on.
|
|
|
|
The region API consists of two components. The first is the taxonomy and
|
|
representation of regions themselves within the analyzer engine. The primary
|
|
definitions and interfaces are described in
|
|
'include/clang/Analysis/PathSensitive/MemRegion.h'. At the root of the region
|
|
hierarchy is the class 'MemRegion' with specific subclasses refining the
|
|
region concept for variables, heap allocated memory, and so forth.
|
|
|
|
The second component in the region API is the modeling of the binding of
|
|
values to regions. For example, modeling the value stored to a local variable
|
|
'x' consists of recording the binding between the region for 'x' (which
|
|
represents the raw memory associated with 'x') and the value stored to 'x'.
|
|
This binding relationship is captured with the notion of "symbolic stores."
|
|
|
|
Symbolic stores, which can be thought of as representing the relation 'regions
|
|
-> values', are implemented by subclasses of the StoreManager class (Store.h).
|
|
A particular StoreManager implementation has complete flexibility concerning
|
|
(a) *how* to model the binding between regions and values and (b) *what*
|
|
bindings are recorded. Together, both points allow different StoreManagers to
|
|
tradeoff between different levels of analysis precision and scalability
|
|
concerning the reasoning of program memory. Meanwhile, the core path-sensitive
|
|
engine makes no assumptions about (a) or (b), and queries a StoreManager about
|
|
the bindings to a memory region through a generic interface that all
|
|
StoreManagers share. If a particular StoreManager cannot reason about the
|
|
potential bindings of a given memory region (e.g., 'BasicStoreManager' does
|
|
not reason about fields of structures) then the StoreManager can simply return
|
|
'unknown' (represented by 'UnknownVal') for a particular region-binding. This
|
|
separation of concerns not only isolates the core analysis engine from the
|
|
details of reasoning about program memory but also facilities the option of a
|
|
client of the path-sensitive engine to easily swap in different StoreManager
|
|
implementations that internally reason about program memory in very different
|
|
ways.
|
|
|
|
The rest of this document is divided into two parts. We first discuss region
|
|
taxonomy and the semantics of regions. We then discuss the StoreManager
|
|
interface, and details of how the currently available StoreManager classes
|
|
implement region bindings.
|
|
|
|
MEMORY REGIONS and REGION TAXONOMY
|
|
|
|
POINTERS
|
|
|
|
Before talking about the memory regions, we would talk about the pointers
|
|
since memory regions are essentially used to represent pointer values.
|
|
|
|
The pointer is a type of values. Pointer values have two semantic aspects. One
|
|
is its physical value, which is an address or location. The other is the type
|
|
of the memory object residing in the address.
|
|
|
|
Memory regions are designed to abstract these two properties of the
|
|
pointer. The physical value of a pointer is represented by MemRegion
|
|
pointers. The rvalue type of the region corresponds to the type of the pointee
|
|
object.
|
|
|
|
One complication is that we could have different view regions on the same
|
|
memory chunk. They represent the same memory location, but have different
|
|
abstract location, i.e., MemRegion pointers. Thus we need to canonicalize
|
|
the abstract locations to get a unique abstract location for one physical
|
|
location.
|
|
|
|
Furthermore, these different view regions may or may not represent memory
|
|
objects of different types. Some different types are semantically the same,
|
|
for example, 'struct s' and 'my_type' are the same type.
|
|
struct s;
|
|
typedef struct s my_type;
|
|
|
|
But 'char' and 'int' are not the same type in the code below:
|
|
void *p;
|
|
int *q = (int*) p;
|
|
char *r = (char*) p;
|
|
|
|
Thus we need to canonicalize the MemRegion which is used in binding and
|
|
retrieving.
|
|
|
|
SYMBOLIC REGIONS
|
|
|
|
A symbolic region is a map of the concept of symbolic values into the domain
|
|
of regions. It is the way that we represent symbolic pointers. Whenever a
|
|
symbolic pointer value is needed, a symbolic region is created to represent
|
|
it.
|
|
|
|
A symbolic region has no type. It wraps a SymbolData. But sometimes we have
|
|
type information associated with a symbolic region. For this case, a
|
|
TypedViewRegion is created to layer the type information on top of the
|
|
symbolic region. The reason we do not carry type information with the symbolic
|
|
region is that the symbolic regions can have no type. To be consistent, we
|
|
don't let them to carry type information.
|
|
|
|
Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
|
|
and represents a generic chunk of memory.
|
|
|
|
NOTE: We plan not to use loc::SymbolVal in RegionStore and remove it
|
|
gradually.
|
|
|
|
Symbolic regions get their rvalue types through the following ways:
|
|
* through the parameter or global variable that points to it, e.g.:
|
|
|
|
void f(struct s* p) {
|
|
...
|
|
}
|
|
|
|
The symbolic region pointed to by 'p' has type 'struct s'.
|
|
|
|
* through explicit or implicit casts, e.g.:
|
|
void f(void* p) {
|
|
struct s* q = (struct s*) p;
|
|
...
|
|
}
|
|
|
|
We attach the type information to the symbolic region lazily. For the first
|
|
case above, we create the TypedViewRegion only when the pointer is actually
|
|
used to access the pointee memory object, that is when the element or field
|
|
region is created. For the cast case, the TypedViewRegion is created when
|
|
visiting the CastExpr.
|
|
|
|
The reason for doing lazy typing is that symbolic regions are sometimes only
|
|
used to do location comparison.
|
|
|
|
Pointer Casts
|
|
|
|
Pointer casts allow people to impose different 'views' onto a chunk of memory.
|
|
|
|
Usually we have two kinds of casts. One kind of casts cast down with in the
|
|
type hierarchy. It imposes more specific views onto more generic memory
|
|
regions. The other kind of casts cast up with in the type hierarchy. It strips
|
|
away more specific views on top of the more generic memory regions.
|
|
|
|
We simulate the down casts by layering another TypedViewRegion on top of the
|
|
original region. We simulate the up casts by striping away the top
|
|
TypedViewRegion. Down casts is usually simple. For up casts, if the there is
|
|
no TypedViewRegion to be stripped, we return the original region. If the
|
|
underlying region is of the different type than the cast-to type, we flag an
|
|
error state.
|
|
|
|
For toll-free bridging casts, we return the original region.
|
|
|
|
We can set up a partial order for pointer types, with the most general type
|
|
'void*' at the top. The partial order forms a tree with 'void*' as its root
|
|
node.
|
|
|
|
Every MemRegion has a root position in the type tree. For example, the pointee
|
|
region of 'void *p' has its root position at the root node of the tree.
|
|
VarRegion of 'int x' has its root position at the 'int type' node.
|
|
|
|
TypedViewRegion is used to move the region down or up in the tree. Moving
|
|
down in the tree adds a TypedViewRegion. Moving up in the tree removes a
|
|
TypedViewRegion.
|
|
|
|
Do we want to allow moving up beyond the root position? This happens when:
|
|
int x;
|
|
void *p = &x;
|
|
|
|
The region of 'x' has its root position at 'int*' node. the cast to void*
|
|
moves that region up to the 'void*' node. I propose to not allow such casts,
|
|
and assign the region of 'x' for 'p'.
|
|
|
|
Region Bindings
|
|
|
|
The following region kinds are boundable: VarRegion, CompoundLiteralRegion,
|
|
StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.
|
|
|
|
When binding regions, we perform canonicalization on element regions and field
|
|
regions. This is because we can have different views on the same region, some
|
|
of which are essentially the same view with different sugar type names.
|
|
|
|
To canonicalize a region, we get the canonical types for all TypedViewRegions
|
|
along the way up to the root region, and make new TypedViewRegions with those
|
|
canonical types.
|
|
|
|
For ObjC and C++, perhaps another canonicalization rule should be added: for
|
|
FieldRegion, the least derived class that has the field is used as the type
|
|
of the super region of the FieldRegion.
|
|
|
|
All bindings and retrievings are done on the canonicalized regions.
|
|
|
|
Canonicalization is transparent outside the region store manager, and more
|
|
specifically, unaware outside the Bind() and Retrieve() method. We don't need
|
|
to consider region canonicalization when doing pointer cast.
|
|
|
|
Constraint Manager
|
|
|
|
The constraint manager reasons about the abstract location of memory
|
|
objects. We can have different views on a region, but none of these views
|
|
changes the location of that object. Thus we should get the same abstract
|
|
location for those regions.
|