Summary:
Add an option to initialize automatic variables with either a pattern or with
zeroes. The default is still that automatic variables are uninitialized. Also
add attributes to request uninitialized on a per-variable basis, mainly to disable
initialization of large stack arrays when deemed too expensive.
This isn't meant to change the semantics of C and C++. Rather, it's meant to be
a last-resort when programmers inadvertently have some undefined behavior in
their code. This patch aims to make undefined behavior hurt less, which
security-minded people will be very happy about. Notably, this means that
there's no inadvertent information leak when:
- The compiler re-uses stack slots, and a value is used uninitialized.
- The compiler re-uses a register, and a value is used uninitialized.
- Stack structs / arrays / unions with padding are copied.
This patch only addresses stack and register information leaks. There's many
more infoleaks that we could address, and much more undefined behavior that
could be tamed. Let's keep this patch focused, and I'm happy to address related
issues elsewhere.
To keep the patch simple, only some `undef` is removed for now, see
`replaceUndef`. The padding-related infoleaks are therefore not all gone yet.
This will be addressed in a follow-up, mainly because addressing padding-related
leaks should be a stand-alone option which is implied by variable
initialization.
There are three options when it comes to automatic variable initialization:
0. Uninitialized
This is C and C++'s default. It's not changing. Depending on code
generation, a programmer who runs into undefined behavior by using an
uninialized automatic variable may observe any previous value (including
program secrets), or any value which the compiler saw fit to materialize on
the stack or in a register (this could be to synthesize an immediate, to
refer to code or data locations, to generate cookies, etc).
1. Pattern initialization
This is the recommended initialization approach. Pattern initialization's
goal is to initialize automatic variables with values which will likely
transform logic bugs into crashes down the line, are easily recognizable in
a crash dump, without being values which programmers can rely on for useful
program semantics. At the same time, pattern initialization tries to
generate code which will optimize well. You'll find the following details in
`patternFor`:
- Integers are initialized with repeated 0xAA bytes (infinite scream).
- Vectors of integers are also initialized with infinite scream.
- Pointers are initialized with infinite scream on 64-bit platforms because
it's an unmappable pointer value on architectures I'm aware of. Pointers
are initialize to 0x000000AA (small scream) on 32-bit platforms because
32-bit platforms don't consistently offer unmappable pages. When they do
it's usually the zero page. As people try this out, I expect that we'll
want to allow different platforms to customize this, let's do so later.
- Vectors of pointers are initialized the same way pointers are.
- Floating point values and vectors are initialized with a negative quiet
NaN with repeated 0xFF payload (e.g. 0xffffffff and 0xffffffffffffffff).
NaNs are nice (here, anways) because they propagate on arithmetic, making
it more likely that entire computations become NaN when a single
uninitialized value sneaks in.
- Arrays are initialized to their homogeneous elements' initialization
value, repeated. Stack-based Variable-Length Arrays (VLAs) are
runtime-initialized to the allocated size (no effort is made for negative
size, but zero-sized VLAs are untouched even if technically undefined).
- Structs are initialized to their heterogeneous element's initialization
values. Zero-size structs are initialized as 0xAA since they're allocated
a single byte.
- Unions are initialized using the initialization for the largest member of
the union.
Expect the values used for pattern initialization to change over time, as we
refine heuristics (both for performance and security). The goal is truly to
avoid injecting semantics into undefined behavior, and we should be
comfortable changing these values when there's a worthwhile point in doing
so.
Why so much infinite scream? Repeated byte patterns tend to be easy to
synthesize on most architectures, and otherwise memset is usually very
efficient. For values which aren't entirely repeated byte patterns, LLVM
will often generate code which does memset + a few stores.
2. Zero initialization
Zero initialize all values. This has the unfortunate side-effect of
providing semantics to otherwise undefined behavior, programs therefore
might start to rely on this behavior, and that's sad. However, some
programmers believe that pattern initialization is too expensive for them,
and data might show that they're right. The only way to make these
programmers wrong is to offer zero-initialization as an option, figure out
where they are right, and optimize the compiler into submission. Until the
compiler provides acceptable performance for all security-minded code, zero
initialization is a useful (if blunt) tool.
I've been asked for a fourth initialization option: user-provided byte value.
This might be useful, and can easily be added later.
Why is an out-of band initialization mecanism desired? We could instead use
-Wuninitialized! Indeed we could, but then we're forcing the programmer to
provide semantics for something which doesn't actually have any (it's
uninitialized!). It's then unclear whether `int derp = 0;` lends meaning to `0`,
or whether it's just there to shut that warning up. It's also way easier to use
a compiler flag than it is to manually and intelligently initialize all values
in a program.
Why not just rely on static analysis? Because it cannot reason about all dynamic
code paths effectively, and it has false positives. It's a great tool, could get
even better, but it's simply incapable of catching all uses of uninitialized
values.
Why not just rely on memory sanitizer? Because it's not universally available,
has a 3x performance cost, and shouldn't be deployed in production. Again, it's
a great tool, it'll find the dynamic uses of uninitialized variables that your
test coverage hits, but it won't find the ones that you encounter in production.
What's the performance like? Not too bad! Previous publications [0] have cited
2.7 to 4.5% averages. We've commmitted a few patches over the last few months to
address specific regressions, both in code size and performance. In all cases,
the optimizations are generally useful, but variable initialization benefits
from them a lot more than regular code does. We've got a handful of other
optimizations in mind, but the code is in good enough shape and has found enough
latent issues that it's a good time to get the change reviewed, checked in, and
have others kick the tires. We'll continue reducing overheads as we try this out
on diverse codebases.
Is it a good idea? Security-minded folks think so, and apparently so does the
Microsoft Visual Studio team [1] who say "Between 2017 and mid 2018, this
feature would have killed 49 MSRC cases that involved uninitialized struct data
leaking across a trust boundary. It would have also mitigated a number of bugs
involving uninitialized struct data being used directly.". They seem to use pure
zero initialization, and claim to have taken the overheads down to within noise.
Don't just trust Microsoft though, here's another relevant person asking for
this [2]. It's been proposed for GCC [3] and LLVM [4] before.
What are the caveats? A few!
- Variables declared in unreachable code, and used later, aren't initialized.
This goto, Duff's device, other objectionable uses of switch. This should
instead be a hard-error in any serious codebase.
- Volatile stack variables are still weird. That's pre-existing, it's really
the language's fault and this patch keeps it weird. We should deprecate
volatile [5].
- As noted above, padding isn't fully handled yet.
I don't think these caveats make the patch untenable because they can be
addressed separately.
Should this be on by default? Maybe, in some circumstances. It's a conversation
we can have when we've tried it out sufficiently, and we're confident that we've
eliminated enough of the overheads that most codebases would want to opt-in.
Let's keep our precious undefined behavior until that point in time.
How do I use it:
1. On the command-line:
-ftrivial-auto-var-init=uninitialized (the default)
-ftrivial-auto-var-init=pattern
-ftrivial-auto-var-init=zero -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
2. Using an attribute:
int dont_initialize_me __attribute((uninitialized));
[0]: https://users.elis.ugent.be/~jsartor/researchDocs/OOPSLA2011Zero-submit.pdf
[1]: https://twitter.com/JosephBialek/status/1062774315098112001
[2]: https://outflux.net/slides/2018/lss/danger.pdf
[3]: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00615.html
[4]: 776a0955ef
[5]: http://wg21.link/p1152
I've also posted an RFC to cfe-dev: http://lists.llvm.org/pipermail/cfe-dev/2018-November/060172.html
<rdar://problem/39131435>
Reviewers: pcc, kcc, rsmith
Subscribers: JDevlieghere, jkorous, dexonsmith, cfe-commits
Differential Revision: https://reviews.llvm.org/D54604
llvm-svn: 349442
Found via codespell -q 3 -I ../clang-whitelist.txt
Where whitelist consists of:
archtype
cas
classs
checkk
compres
definit
frome
iff
inteval
ith
lod
methode
nd
optin
ot
pres
statics
te
thru
Patch by luzpaz! (This is a subset of D44188 that applies cleanly with a few
files that have dubious fixes reverted.)
Differential revision: https://reviews.llvm.org/D44188
llvm-svn: 329399
for halting the propagation of uninitialized value tracking along
a path. Unlike __attribute__((noreturn)), this attribute (which
is used by clients of the static analyzer) can be used to annotate
functions that essentially never return, but in rare cares may be
allowed to return for (special) debugging purposes. This attribute
has been shown in reducing false positives in the static analyzer
by pruning false postives, and is equally applicable here.
Handling this attribute in the CFG itself is another option, but
this is not something all clients (e.g., possibly -Wunreachable-code)
would want to see.
Addresses <rdar://problem/12281583>.
llvm-svn: 163681
short-circuiting when building the CFG. Also be sure to skip parens before
checking for the && / || special cases. Finally, fix some crashes in CFG
printing in the presence of calls to destructors for array of array of class
type.
llvm-svn: 160691
* Treat compound assignment as a use, at Jordy's request.
* Always add compound assignments into the CFG, so we can correctly diagnose the use in 'return x += 1;'
llvm-svn: 160334
use out of TransferFunctions, and compute it in advance rather than on-the-fly.
This allows us to handle compound assignments with DeclRefExprs on the RHS
correctly, and also makes it trivial to treat const& function parameters as not
initializing the argument. The patch also makes both of those changes.
llvm-svn: 160330
instead push the terminator for the branch down into the basic blocks of the subexpressions of '&&' and '||'
respectively. This eliminates some artifical control-flow from the CFG and results in a more
compact CFG.
Note that this patch only alters the branches 'while', 'if' and 'for'. This was complex enough for
one patch. The remaining branches (e.g., do...while) can be handled in a separate patch, but they
weren't immediately tackled because they were less important.
It is possible that this patch introduces some subtle bugs, particularly w.r.t. to destructor placement.
I've tried to audit these changes, but it is also known that the destructor logic needs some refinement
in the area of '||' and '&&' regardless (i.e., their are known bugs).
llvm-svn: 160218
-Wsometimes-uninitialized diagnostics to make it clearer that the cause
of the issue may be a condition which must always evaluate to true or
false, rather than an uninitialized variable.
To emphasize this, add a new note with a fixit which removes the
impossible condition or replaces it with a constant.
Also, downgrade the diagnostic from -Wsometimes-uninitialized to
-Wconditional-uninitialized when it applies to a range-based for loop,
since the condition is not written explicitly in the code in that case.
llvm-svn: 157511
-Wsometimes-uninitialized. This detects cases where an explicitly-written branch
inevitably leads to an uninitialized variable use (so either the branch is dead
code or there is an uninitialized use bug).
This chunk of warnings tentatively lives within -Wuninitialized, in order to
give it more visibility to existing Clang users.
llvm-svn: 157458
#define TEST int y; int x = y;
void foo() {
TEST
}
-Wuninitialized gives this warning:
invalid-loc.cc:4:3: warning: variable 'y' is uninitialized when used here
[-Wuninitialized]
TEST
^~~~
invalid-loc.cc:2:29: note: expanded from macro 'TEST'
#define TEST int y; int x = y;
^
note: initialize the variable 'y' to silence this warning
1 warning generated.
The second note lacks filename, line number, and code snippet. This change
will remove the fixit and only point to variable declaration.
invalid-loc.cc:4:3: warning: variable 'y' is uninitialized when used here
[-Wuninitialized]
TEST
^~~~
invalid-loc.cc:2:29: note: expanded from macro 'TEST'
#define TEST int y; int x = y;
^
invalid-loc.cc:4:3: note: variable 'y' is declared here
TEST
^
invalid-loc.cc:2:14: note: expanded from macro 'TEST'
#define TEST int y; int x = y;
^
1 warning generated.
llvm-svn: 156045
incorrectly in the CFG, and also the static analyzer. This patch regresses the analyzer a bit, but
that needs to be followed up with a better solution.
Fixes <rdar://problem/10008112>.
llvm-svn: 138372
AnalysisBasedWarnings Sema layer and out of the Analysis library itself.
This returns the uninitialized values analysis to a more pure form,
allowing its original logic to correctly detect some categories of
definitely uninitialized values. Fixes PR10358 (again).
Thanks to Ted for reviewing and updating this patch after his rewrite of
several portions of this analysis.
llvm-svn: 135748
patch, we actually move the state-machine for the value set backwards
one step. This can pretty easily lead to infinite loops where we
continually try to propagate a bit, succeed for one iteration, but then
back up because we find an uninitialized use.
A reduced test case from PR10379 is included.
llvm-svn: 135359
definitely have a path leading to them, and possibly have a path leading
to them; reflect that distinction in the warning text emitted.
llvm-svn: 129126
marked explicitly as uninitialized through direct self initialization:
int x = x;
With r128894 we prevented warnings about this code, and this patch
teaches the analysis engine to continue analyzing subsequent uses of
'x'. This should wrap up PR9624.
There is still an open question of whether we should suppress the
maybe-uninitialized warnings resulting from variables initialized in
this fashion. The definitely-uninitialized uses should always be warned.
llvm-svn: 128932
int x = x;
GCC disables its warnings on this construct as a way of indicating that
the programmer intentionally wants the variable to be uninitialized.
Only the warning on the initializer is turned off in this iteration.
This makes the code a lot more ugly, but starts commenting the
surprising behavior here. This is a WIP, I want to refactor it
substantially for clarity, and to determine whether subsequent warnings
should be suppressed or not.
llvm-svn: 128894
1) Change the CFG to include the DeclStmt for conditional variables, instead of using the condition itself as a faux DeclStmt.
2) Update ExprEngine (the static analyzer) to understand (1), so not to regress.
3) Update UninitializedValues.cpp to initialize all tracked variables to Uninitialized at the start of the function/method.
4) Only use the SelfReferenceChecker (SemaDecl.cpp) on global variables, leaving the dataflow analysis to handle other cases.
The combination of (1) and (3) allows the dataflow-based -Wuninitialized to find self-init problems when the initializer
contained control-flow.
llvm-svn: 128858
Note this can potentially be enhanced to detect if the __block variable
is actually written by the block, or only when the block "escapes" or
is actually used, but that requires more analysis than it is probably worth
for this simple check.
llvm-svn: 128681