forked from OSchip/llvm-project
143 lines
6.4 KiB
ReStructuredText
143 lines
6.4 KiB
ReStructuredText
DataFlowSanitizer Design Document
|
|
=================================
|
|
|
|
This document sets out the design for DataFlowSanitizer, a general
|
|
dynamic data flow analysis. Unlike other Sanitizer tools, this tool is
|
|
not designed to detect a specific class of bugs on its own. Instead,
|
|
it provides a generic dynamic data flow analysis framework to be used
|
|
by clients to help detect application-specific issues within their
|
|
own code.
|
|
|
|
DataFlowSanitizer is a program instrumentation which can associate
|
|
a number of taint labels with any data stored in any memory region
|
|
accessible by the program. The analysis is dynamic, which means that
|
|
it operates on a running program, and tracks how the labels propagate
|
|
through that program. The tool shall support a large (>100) number
|
|
of labels, such that programs which operate on large numbers of data
|
|
items may be analysed with each data item being tracked separately.
|
|
|
|
Use Cases
|
|
---------
|
|
|
|
This instrumentation can be used as a tool to help monitor how data
|
|
flows from a program's inputs (sources) to its outputs (sinks).
|
|
This has applications from a privacy/security perspective in that
|
|
one can audit how a sensitive data item is used within a program and
|
|
ensure it isn't exiting the program anywhere it shouldn't be.
|
|
|
|
Interface
|
|
---------
|
|
|
|
A number of functions are provided which will create taint labels,
|
|
attach labels to memory regions and extract the set of labels
|
|
associated with a specific memory region. These functions are declared
|
|
in the header file ``sanitizer/dfsan_interface.h``.
|
|
|
|
.. code-block:: c
|
|
|
|
/// Creates and returns a base label with the given description and user data.
|
|
dfsan_label dfsan_create_label(const char *desc, void *userdata);
|
|
|
|
/// Sets the label for each address in [addr,addr+size) to \c label.
|
|
void dfsan_set_label(dfsan_label label, void *addr, size_t size);
|
|
|
|
/// Sets the label for each address in [addr,addr+size) to the union of the
|
|
/// current label for that address and \c label.
|
|
void dfsan_add_label(dfsan_label label, void *addr, size_t size);
|
|
|
|
/// Retrieves the label associated with the given data.
|
|
///
|
|
/// The type of 'data' is arbitrary. The function accepts a value of any type,
|
|
/// which can be truncated or extended (implicitly or explicitly) as necessary.
|
|
/// The truncation/extension operations will preserve the label of the original
|
|
/// value.
|
|
dfsan_label dfsan_get_label(long data);
|
|
|
|
/// Retrieves a pointer to the dfsan_label_info struct for the given label.
|
|
const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
|
|
|
|
/// Returns whether the given label label contains the label elem.
|
|
int dfsan_has_label(dfsan_label label, dfsan_label elem);
|
|
|
|
/// If the given label label contains a label with the description desc, returns
|
|
/// that label, else returns 0.
|
|
dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
|
|
|
|
Taint label representation
|
|
--------------------------
|
|
|
|
As stated above, the tool must track a large number of taint
|
|
labels. This poses an implementation challenge, as most multiple-label
|
|
tainting systems assign one label per bit to shadow storage, and
|
|
union taint labels using a bitwise or operation. This will not scale
|
|
to clients which use hundreds or thousands of taint labels, as the
|
|
label union operation becomes O(n) in the number of supported labels,
|
|
and data associated with it will quickly dominate the live variable
|
|
set, causing register spills and hampering performance.
|
|
|
|
Instead, a low overhead approach is proposed which is best-case O(log\
|
|
:sub:`2` n) during execution. The underlying assumption is that
|
|
the required space of label unions is sparse, which is a reasonable
|
|
assumption to make given that we are optimizing for the case where
|
|
applications mostly copy data from one place to another, without often
|
|
invoking the need for an actual union operation. The representation
|
|
of a taint label is a 16-bit integer, and new labels are allocated
|
|
sequentially from a pool. The label identifier 0 is special, and means
|
|
that the data item is unlabelled.
|
|
|
|
When a label union operation is requested at a join point (any
|
|
arithmetic or logical operation with two or more operands, such as
|
|
addition), the code checks whether a union is required, whether the
|
|
same union has been requested before, and whether one union label
|
|
subsumes the other. If so, it returns the previously allocated union
|
|
label. If not, it allocates a new union label from the same pool used
|
|
for new labels.
|
|
|
|
Specifically, the instrumentation pass will insert code like this
|
|
to decide the union label ``lu`` for a pair of labels ``l1``
|
|
and ``l2``:
|
|
|
|
.. code-block:: c
|
|
|
|
if (l1 == l2)
|
|
lu = l1;
|
|
else
|
|
lu = __dfsan_union(l1, l2);
|
|
|
|
The equality comparison is outlined, to provide an early exit in
|
|
the common cases where the program is processing unlabelled data, or
|
|
where the two data items have the same label. ``__dfsan_union`` is
|
|
a runtime library function which performs all other union computation.
|
|
|
|
Further optimizations are possible, for example if ``l1`` is known
|
|
at compile time to be zero (e.g. it is derived from a constant),
|
|
``l2`` can be used for ``lu``, and vice versa.
|
|
|
|
Memory layout and label management
|
|
----------------------------------
|
|
|
|
The following is the current memory layout for Linux/x86\_64:
|
|
|
|
+---------------+---------------+--------------------+
|
|
| Start | End | Use |
|
|
+===============+===============+====================+
|
|
| 0x700000008000|0x800000000000 | application memory |
|
|
+---------------+---------------+--------------------+
|
|
| 0x200200000000|0x700000008000 | unused |
|
|
+---------------+---------------+--------------------+
|
|
| 0x200000000000|0x200200000000 | union table |
|
|
+---------------+---------------+--------------------+
|
|
| 0x000000010000|0x200000000000 | shadow memory |
|
|
+---------------+---------------+--------------------+
|
|
| 0x000000000000|0x000000010000 | reserved by kernel |
|
|
+---------------+---------------+--------------------+
|
|
|
|
Each byte of application memory corresponds to two bytes of shadow
|
|
memory, which are used to store its taint label. As for LLVM SSA
|
|
registers, we have not found it necessary to associate a label with
|
|
each byte or bit of data, as some other tools do. Instead, labels are
|
|
associated directly with registers. Loads will result in a union of
|
|
all shadow labels corresponding to bytes loaded (which most of the
|
|
time will be short circuited by the initial comparison) and stores will
|
|
result in a copy of the label to the shadow of all bytes stored to.
|