forked from OSchip/llvm-project
292 lines
13 KiB
ReStructuredText
292 lines
13 KiB
ReStructuredText
DataFlowSanitizer Design Document
|
|
=================================
|
|
|
|
This document sets out the design for DataFlowSanitizer, a general
|
|
dynamic data flow analysis. Unlike other Sanitizer tools, this tool is
|
|
not designed to detect a specific class of bugs on its own. Instead,
|
|
it provides a generic dynamic data flow analysis framework to be used
|
|
by clients to help detect application-specific issues within their
|
|
own code.
|
|
|
|
DataFlowSanitizer is a program instrumentation which can associate
|
|
a number of taint labels with any data stored in any memory region
|
|
accessible by the program. The analysis is dynamic, which means that
|
|
it operates on a running program, and tracks how the labels propagate
|
|
through that program.
|
|
|
|
Use Cases
|
|
---------
|
|
|
|
This instrumentation can be used as a tool to help monitor how data
|
|
flows from a program's inputs (sources) to its outputs (sinks).
|
|
This has applications from a privacy/security perspective in that
|
|
one can audit how a sensitive data item is used within a program and
|
|
ensure it isn't exiting the program anywhere it shouldn't be.
|
|
|
|
Interface
|
|
---------
|
|
|
|
A number of functions are provided which will attach taint labels to
|
|
memory regions and extract the set of labels associated with a
|
|
specific memory region. These functions are declared in the header
|
|
file ``sanitizer/dfsan_interface.h``.
|
|
|
|
.. code-block:: c
|
|
|
|
/// Sets the label for each address in [addr,addr+size) to \c label.
|
|
void dfsan_set_label(dfsan_label label, void *addr, size_t size);
|
|
|
|
/// Sets the label for each address in [addr,addr+size) to the union of the
|
|
/// current label for that address and \c label.
|
|
void dfsan_add_label(dfsan_label label, void *addr, size_t size);
|
|
|
|
/// Retrieves the label associated with the given data.
|
|
///
|
|
/// The type of 'data' is arbitrary. The function accepts a value of any type,
|
|
/// which can be truncated or extended (implicitly or explicitly) as necessary.
|
|
/// The truncation/extension operations will preserve the label of the original
|
|
/// value.
|
|
dfsan_label dfsan_get_label(long data);
|
|
|
|
/// Retrieves the label associated with the data at the given address.
|
|
dfsan_label dfsan_read_label(const void *addr, size_t size);
|
|
|
|
/// Returns whether the given label label contains the label elem.
|
|
int dfsan_has_label(dfsan_label label, dfsan_label elem);
|
|
|
|
/// Computes the union of \c l1 and \c l2, resulting in a union label.
|
|
dfsan_label dfsan_union(dfsan_label l1, dfsan_label l2);
|
|
|
|
/// Flushes the DFSan shadow, i.e. forgets about all labels currently associated
|
|
/// with the application memory. Use this call to start over the taint tracking
|
|
/// within the same process.
|
|
///
|
|
/// Note: If another thread is working with tainted data during the flush, that
|
|
/// taint could still be written to shadow after the flush.
|
|
void dfsan_flush(void);
|
|
|
|
The following functions are provided to check origin tracking status and results.
|
|
|
|
.. code-block:: c
|
|
|
|
/// Retrieves the immediate origin associated with the given data. The returned
|
|
/// origin may point to another origin.
|
|
///
|
|
/// The type of 'data' is arbitrary. The function accepts a value of any type,
|
|
/// which can be truncated or extended (implicitly or explicitly) as necessary.
|
|
/// The truncation/extension operations will preserve the label of the original
|
|
/// value.
|
|
dfsan_origin dfsan_get_origin(long data);
|
|
|
|
/// Retrieves the very first origin associated with the data at the given
|
|
/// address.
|
|
dfsan_origin dfsan_get_init_origin(const void *addr);
|
|
|
|
/// Prints the origin trace of the label at the address `addr` to stderr. It also
|
|
/// prints description at the beginning of the trace. If origin tracking is not
|
|
/// on, or the address is not labeled, it prints nothing.
|
|
void dfsan_print_origin_trace(const void *addr, const char *description);
|
|
|
|
/// Prints the origin trace of the label at the address `addr` to a pre-allocated
|
|
/// output buffer. If origin tracking is not on, or the address is`
|
|
/// not labeled, it prints nothing.
|
|
///
|
|
/// `addr` is the tainted memory address whose origin we are printing.
|
|
/// `description` is a description printed at the beginning of the trace.
|
|
/// `out_buf` is the output buffer to write the results to. `out_buf_size` is
|
|
/// the size of `out_buf`. The function returns the number of symbols that
|
|
/// should have been written to `out_buf` (not including trailing null byte '\0').
|
|
/// Thus, the string is truncated iff return value is not less than `out_buf_size`.
|
|
size_t dfsan_sprint_origin_trace(const void *addr, const char *description,
|
|
char *out_buf, size_t out_buf_size);
|
|
|
|
/// Returns the value of `-dfsan-track-origins`.
|
|
int dfsan_get_track_origins(void);
|
|
|
|
The following functions are provided to register hooks called by custom wrappers.
|
|
|
|
.. code-block:: c
|
|
|
|
/// Sets a callback to be invoked on calls to `write`. The callback is invoked
|
|
/// before the write is done. The write is not guaranteed to succeed when the
|
|
/// callback executes. Pass in NULL to remove any callback.
|
|
typedef void (*dfsan_write_callback_t)(int fd, const void *buf, size_t count);
|
|
void dfsan_set_write_callback(dfsan_write_callback_t labeled_write_callback);
|
|
|
|
/// Callbacks to be invoked on calls to `memcmp` or `strncmp`.
|
|
void dfsan_weak_hook_memcmp(void *caller_pc, const void *s1, const void *s2,
|
|
size_t n, dfsan_label s1_label,
|
|
dfsan_label s2_label, dfsan_label n_label);
|
|
void dfsan_weak_hook_strncmp(void *caller_pc, const char *s1, const char *s2,
|
|
size_t n, dfsan_label s1_label,
|
|
dfsan_label s2_label, dfsan_label n_label);
|
|
|
|
Taint label representation
|
|
--------------------------
|
|
|
|
We use an 8-bit unsigned integer for the representation of a
|
|
label. The label identifier 0 is special, and means that the data item
|
|
is unlabelled. This is optimizing for low CPU and code size overhead
|
|
of the instrumentation. When a label union operation is requested at a
|
|
join point (any arithmetic or logical operation with two or more
|
|
operands, such as addition), we can simply OR the two labels in O(1).
|
|
|
|
Users are responsible for managing the 8 integer labels (i.e., keeping
|
|
track of what labels they have used so far, picking one that is yet
|
|
unused, etc).
|
|
|
|
Origin tracking trace representation
|
|
------------------------------------
|
|
|
|
An origin tracking trace is a list of chains. Each chain has a stack trace
|
|
where the DFSan runtime records a label propagation, and a pointer to its
|
|
previous chain. The very first chain does not point to any chain.
|
|
|
|
Every four 4-bytes aligned application bytes share a 4-byte origin trace ID. A
|
|
4-byte origin trace ID contains a 4-bit depth and a 28-bit hash ID of a chain.
|
|
|
|
A chain ID is calculated as a hash from a chain structure. A chain structure
|
|
contains a stack ID and the previous chain ID. The chain head has a zero
|
|
previous chain ID. A stack ID is a hash from a stack trace. The 4-bit depth
|
|
limits the maximal length of a path. The environment variable ``origin_history_size``
|
|
can set the depth limit. Non-positive values mean unlimited. Its default value
|
|
is 16. When reaching the limit, origin tracking ignores following propagation
|
|
chains.
|
|
|
|
The first chain of a trace starts by `dfsan_set_label` with non-zero labels. A
|
|
new chain is appended at the end of a trace at stores or memory transfers when
|
|
``-dfsan-track-origins`` is 1. Memory transfers include LLVM memory transfer
|
|
instructions, glibc memcpy and memmove. When ``-dfsan-track-origins`` is 2, a
|
|
new chain is also appended at loads.
|
|
|
|
Other instructions do not create new chains, but simply propagate origin trace
|
|
IDs. If an instruction has more than one operands with non-zero labels, the origin
|
|
treace ID of the last operand with non-zero label is propagated to the result of
|
|
the instruction.
|
|
|
|
Memory layout and label management
|
|
----------------------------------
|
|
|
|
The following is the memory layout for Linux/x86\_64:
|
|
|
|
+---------------+---------------+--------------------+
|
|
| Start | End | Use |
|
|
+===============+===============+====================+
|
|
| 0x700000000000|0x800000000000 | application 3 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x610000000000|0x700000000000 | unused |
|
|
+---------------+---------------+--------------------+
|
|
| 0x600000000000|0x610000000000 | origin 1 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x510000000000|0x600000000000 | application 2 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x500000000000|0x510000000000 | shadow 1 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x400000000000|0x500000000000 | unused |
|
|
+---------------+---------------+--------------------+
|
|
| 0x300000000000|0x400000000000 | origin 3 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x200000000000|0x300000000000 | shadow 3 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x110000000000|0x200000000000 | origin 2 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x100000000000|0x110000000000 | unused |
|
|
+---------------+---------------+--------------------+
|
|
| 0x010000000000|0x100000000000 | shadow 2 |
|
|
+---------------+---------------+--------------------+
|
|
| 0x000000000000|0x010000000000 | application 1 |
|
|
+---------------+---------------+--------------------+
|
|
|
|
Each byte of application memory corresponds to a single byte of shadow
|
|
memory, which is used to store its taint label. We map memory, shadow, and
|
|
origin regions to each other with these masks and offsets:
|
|
|
|
* shadow_addr = memory_addr ^ 0x500000000000
|
|
|
|
* origin_addr = shadow_addr + 0x100000000000
|
|
|
|
As for LLVM SSA registers, we have not found it necessary to associate a label
|
|
with each byte or bit of data, as some other tools do. Instead, labels are
|
|
associated directly with registers. Loads will result in a union of
|
|
all shadow labels corresponding to bytes loaded, and stores will
|
|
result in a copy of the label of the stored value to the shadow of all
|
|
bytes stored to.
|
|
|
|
Propagating labels through arguments
|
|
------------------------------------
|
|
|
|
In order to propagate labels through function arguments and return values,
|
|
DataFlowSanitizer changes the ABI of each function in the translation unit.
|
|
There are currently two supported ABIs:
|
|
|
|
* Args -- Argument and return value labels are passed through additional
|
|
arguments and by modifying the return type.
|
|
|
|
* TLS -- Argument and return value labels are passed through TLS variables
|
|
``__dfsan_arg_tls`` and ``__dfsan_retval_tls``.
|
|
|
|
The main advantage of the TLS ABI is that it is more tolerant of ABI mismatches
|
|
(TLS storage is not shared with any other form of storage, whereas extra
|
|
arguments may be stored in registers which under the native ABI are not used
|
|
for parameter passing and thus could contain arbitrary values). On the other
|
|
hand the args ABI is more efficient and allows ABI mismatches to be more easily
|
|
identified by checking for nonzero labels in nominally unlabelled programs.
|
|
|
|
Implementing the ABI list
|
|
-------------------------
|
|
|
|
The `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions
|
|
which conform to the native ABI, each of which is callable from an instrumented
|
|
program. This is implemented by replacing each reference to a native ABI
|
|
function with a reference to a function which uses the instrumented ABI.
|
|
Such functions are automatically-generated wrappers for the native functions.
|
|
For example, given the ABI list example provided in the user manual, the
|
|
following wrappers will be generated under the args ABI:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) {
|
|
entry:
|
|
%2 = call i8* @malloc(i64 %0)
|
|
%3 = insertvalue { i8*, i16 } undef, i8* %2, 0
|
|
%4 = insertvalue { i8*, i16 } %3, i16 0, 1
|
|
ret { i8*, i16 } %4
|
|
}
|
|
|
|
define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) {
|
|
entry:
|
|
%2 = call i32 @tolower(i32 %0)
|
|
%3 = insertvalue { i32, i16 } undef, i32 %2, 0
|
|
%4 = insertvalue { i32, i16 } %3, i16 %1, 1
|
|
ret { i32, i16 } %4
|
|
}
|
|
|
|
define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) {
|
|
entry:
|
|
%labelreturn = alloca i16
|
|
%6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn)
|
|
%7 = load i16* %labelreturn
|
|
%8 = insertvalue { i8*, i16 } undef, i8* %6, 0
|
|
%9 = insertvalue { i8*, i16 } %8, i16 %7, 1
|
|
ret { i8*, i16 } %9
|
|
}
|
|
|
|
As an optimization, direct calls to native ABI functions will call the
|
|
native ABI function directly and the pass will compute the appropriate label
|
|
internally. This has the advantage of reducing the number of union operations
|
|
required when the return value label is known to be zero (i.e. ``discard``
|
|
functions, or ``functional`` functions with known unlabelled arguments).
|
|
|
|
Checking ABI Consistency
|
|
------------------------
|
|
|
|
DFSan changes the ABI of each function in the module. This makes it possible
|
|
for a function with the native ABI to be called with the instrumented ABI,
|
|
or vice versa, thus possibly invoking undefined behavior. A simple way
|
|
of statically detecting instances of this problem is to append the suffix
|
|
".dfsan" to the name of each instrumented-ABI function.
|
|
|
|
This will not catch every such problem; in particular function pointers passed
|
|
across the instrumented-native barrier cannot be used on the other side.
|
|
These problems could potentially be caught dynamically.
|