[scudo] Scudo thread specific data refactor, part 3
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224
2017-09-27 01:20:02 +08:00
|
|
|
//===-- scudo_tsd_exclusive.cpp ---------------------------------*- C++ -*-===//
|
2017-04-28 04:21:16 +08:00
|
|
|
//
|
|
|
|
// The LLVM Compiler Infrastructure
|
|
|
|
//
|
|
|
|
// This file is distributed under the University of Illinois Open Source
|
|
|
|
// License. See LICENSE.TXT for details.
|
|
|
|
//
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
///
|
[scudo] Scudo thread specific data refactor, part 3
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224
2017-09-27 01:20:02 +08:00
|
|
|
/// Scudo exclusive TSD implementation.
|
2017-04-28 04:21:16 +08:00
|
|
|
///
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
|
[scudo] Scudo thread specific data refactor, part 3
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224
2017-09-27 01:20:02 +08:00
|
|
|
#include "scudo_tsd.h"
|
2017-04-28 04:21:16 +08:00
|
|
|
|
[scudo] Scudo thread specific data refactor, part 3
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224
2017-09-27 01:20:02 +08:00
|
|
|
#if SCUDO_TSD_EXCLUSIVE
|
2017-04-28 04:21:16 +08:00
|
|
|
|
|
|
|
namespace __scudo {
|
|
|
|
|
|
|
|
static pthread_once_t GlobalInitialized = PTHREAD_ONCE_INIT;
|
|
|
|
static pthread_key_t PThreadKey;
|
|
|
|
|
[scudo] Add Android support
Summary:
This change adds Android support to the allocator (but doesn't yet enable it in
the cmake config), and should be the last fragment of the rewritten change
D31947.
Android has more memory constraints than other platforms, so the idea of a
unique context per thread would not have worked. The alternative chosen is to
allocate a set of contexts based on the number of cores on the machine, and
share those contexts within the threads. Contexts can be dynamically reassigned
to threads to prevent contention, based on a scheme suggested by @dvyuokv in
the initial review.
Additionally, given that Android doesn't support ELF TLS (only emutls for now),
we use the TSan TLS slot to make things faster: Scudo is mutually exclusive
with other sanitizers so this shouldn't cause any problem.
An additional change made here, is replacing `thread_local` by `THREADLOCAL`
and using the initial-exec thread model in the non-Android version to prevent
extraneous weak definition and checks on the relevant variables.
Reviewers: kcc, dvyukov, alekseyshl
Reviewed By: alekseyshl
Subscribers: srhines, mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D32649
llvm-svn: 302300
2017-05-06 05:38:22 +08:00
|
|
|
__attribute__((tls_model("initial-exec")))
|
|
|
|
THREADLOCAL ThreadState ScudoThreadState = ThreadNotInitialized;
|
|
|
|
__attribute__((tls_model("initial-exec")))
|
2017-09-22 23:35:37 +08:00
|
|
|
THREADLOCAL ScudoTSD TSD;
|
2017-04-28 04:21:16 +08:00
|
|
|
|
[scudo] Scudo thread specific data refactor, part 2
Summary:
Following D38139, we now consolidate the TSD definition, merging the shared
TSD definition with the exclusive TSD definition. We introduce a boolean set
at initializaton denoting the need for the TSD to be unlocked or not. This
adds some unused members to the exclusive TSD, but increases consistency and
reduces the definitions fragmentation.
We remove the fallback mechanism from `scudo_allocator.cpp` and add a fallback
TSD in the non-shared version. Since the shared version doesn't require one,
this makes overall more sense.
There are a couple of additional cosmetic changes: removing the header guards
from the remaining `.inc` files, added error string to a `CHECK`.
Question to reviewers: I thought about friending `getTSDAndLock` in `ScudoTSD`
so that the `FallbackTSD` could `Mutex.Lock()` directly instead of `lock()`
which involved zeroing out the `Precedence`, which is unused otherwise. Is it
worth doing?
Reviewers: alekseyshl, dvyukov, kcc
Reviewed By: dvyukov
Subscribers: srhines, llvm-commits
Differential Revision: https://reviews.llvm.org/D38183
llvm-svn: 314110
2017-09-25 23:12:08 +08:00
|
|
|
// Fallback TSD for when the thread isn't initialized yet or is torn down. It
|
|
|
|
// can be shared between multiple threads and as such must be locked.
|
|
|
|
ScudoTSD FallbackTSD;
|
|
|
|
|
2017-04-28 04:21:16 +08:00
|
|
|
static void teardownThread(void *Ptr) {
|
2017-05-26 23:39:22 +08:00
|
|
|
uptr I = reinterpret_cast<uptr>(Ptr);
|
2017-04-28 04:21:16 +08:00
|
|
|
// The glibc POSIX thread-local-storage deallocation routine calls user
|
|
|
|
// provided destructors in a loop of PTHREAD_DESTRUCTOR_ITERATIONS.
|
|
|
|
// We want to be called last since other destructors might call free and the
|
|
|
|
// like, so we wait until PTHREAD_DESTRUCTOR_ITERATIONS before draining the
|
|
|
|
// quarantine and swallowing the cache.
|
2017-05-26 23:39:22 +08:00
|
|
|
if (I > 1) {
|
|
|
|
// If pthread_setspecific fails, we will go ahead with the teardown.
|
|
|
|
if (LIKELY(pthread_setspecific(PThreadKey,
|
|
|
|
reinterpret_cast<void *>(I - 1)) == 0))
|
|
|
|
return;
|
2017-04-28 04:21:16 +08:00
|
|
|
}
|
2017-09-22 23:35:37 +08:00
|
|
|
TSD.commitBack();
|
2017-04-28 04:21:16 +08:00
|
|
|
ScudoThreadState = ThreadTornDown;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
static void initOnce() {
|
|
|
|
CHECK_EQ(pthread_key_create(&PThreadKey, teardownThread), 0);
|
|
|
|
initScudo();
|
[scudo] Improve the scalability of the shared TSD model
Summary:
The shared TSD model in its current form doesn't scale. Here is an example of
rpc2-benchmark (with default parameters, which is threading heavy) on a 72-core
machines (defaulting to a `CompactSizeClassMap` and no Quarantine):
- with tcmalloc: 337K reqs/sec, peak RSS of 338MB;
- with scudo (exclusive): 321K reqs/sec, peak RSS of 637MB;
- with scudo (shared): 241K reqs/sec, peak RSS of 324MB.
This isn't great, since the exclusive model uses a lot of memory, while the
shared model doesn't even come close to be competitive.
This is mostly due to the fact that we are consistently scanning the TSD pool
starting at index 0 for an available TSD, which can result in a lot of failed
lock attempts, and touching some memory that needs not be touched.
This CL attempts to make things better in most situations:
- first, use a thread local variable on Linux (intead of pthread APIs) to store
the current TSD in the shared model;
- move the locking boolean out of the TSD: this allows the compiler to use a
register and potentially optimize out a branch instead of reading it from the
TSD everytime (we also save a tiny bit of memory per TSD);
- 64-bit atomic operations on 32-bit ARM platforms happen to be expensive: so
store the `Precedence` in a `uptr` instead of a `u64`. We lose some
nanoseconds of precision and we'll wrap around at some point, but the benefit
is worth it;
- change a `CHECK` to a `DCHECK`: this should never happen, but if something is
ever terribly wrong, we'll crash on a near null AV if the TSD happens to be
null;
- based on an idea by dvyukov@, we are implementing a bound random scan for
an available TSD. This requires computing the coprimes for the number of TSDs,
and attempting to lock up to 4 TSDs in an random order before falling back to
the current one. This is obviously slightly more expansive when we have just
2 TSDs (barely noticeable) but is otherwise beneficial. The `Precedence` still
basically corresponds to the moment of the first contention on a TSD. To seed
on random choice, we use the precedence of the current TSD since it is very
likely to be non-zero (since we are in the slow path after a failed `tryLock`)
With those modifications, the benchmark yields to:
- with scudo (shared): 330K reqs/sec, peak RSS of 327MB.
So the shared model for this specific situation not only becomes competitive but
outperforms the exclusive model. I experimented with some values greater than 4
for the number of TSDs to attempt to lock and it yielded a decrease in QPS. Just
sticking with the current TSD is also a tad slower. Numbers on platforms with
less cores (eg: Android) remain similar.
Reviewers: alekseyshl, dvyukov, javed.absar
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, kristof.beyls, delcypher, llvm-commits, #sanitizers
Differential Revision: https://reviews.llvm.org/D47289
llvm-svn: 334410
2018-06-11 22:50:31 +08:00
|
|
|
FallbackTSD.init();
|
2017-04-28 04:21:16 +08:00
|
|
|
}
|
|
|
|
|
[scudo] Fix improper TSD init after TLS destructors are called
Summary:
Some of glibc's own thread local data is destroyed after a user's thread local
destructors are called, via __libc_thread_freeres. This might involve calling
free, as is the case for strerror_thread_freeres.
If there is no prior heap operation in the thread, this free would end up
initializing some thread specific data that would never be destroyed properly
(as user's pthread destructors have already been called), while still being
deallocated when the TLS goes away. As a result, a program could SEGV, usually
in __sanitizer::AllocatorGlobalStats::Unregister, where one of the doubly linked
list links would refer to a now unmapped memory area.
To prevent this from happening, we will not do a full initialization from the
deallocation path. This means that the fallback cache & quarantine will be used
if no other heap operation has been called, and we effectively prevent the TSD
being initialized and never destroyed. The TSD will be fully initialized for all
other paths.
In the event of a thread doing only frees and nothing else, a TSD would never
be initialized for that thread, but this situation is unlikely and we can live
with that.
Reviewers: alekseyshl
Reviewed By: alekseyshl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37697
llvm-svn: 312939
2017-09-12 03:59:40 +08:00
|
|
|
void initThread(bool MinimalInit) {
|
2017-05-26 23:39:22 +08:00
|
|
|
CHECK_EQ(pthread_once(&GlobalInitialized, initOnce), 0);
|
[scudo] Fix improper TSD init after TLS destructors are called
Summary:
Some of glibc's own thread local data is destroyed after a user's thread local
destructors are called, via __libc_thread_freeres. This might involve calling
free, as is the case for strerror_thread_freeres.
If there is no prior heap operation in the thread, this free would end up
initializing some thread specific data that would never be destroyed properly
(as user's pthread destructors have already been called), while still being
deallocated when the TLS goes away. As a result, a program could SEGV, usually
in __sanitizer::AllocatorGlobalStats::Unregister, where one of the doubly linked
list links would refer to a now unmapped memory area.
To prevent this from happening, we will not do a full initialization from the
deallocation path. This means that the fallback cache & quarantine will be used
if no other heap operation has been called, and we effectively prevent the TSD
being initialized and never destroyed. The TSD will be fully initialized for all
other paths.
In the event of a thread doing only frees and nothing else, a TSD would never
be initialized for that thread, but this situation is unlikely and we can live
with that.
Reviewers: alekseyshl
Reviewed By: alekseyshl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37697
llvm-svn: 312939
2017-09-12 03:59:40 +08:00
|
|
|
if (UNLIKELY(MinimalInit))
|
|
|
|
return;
|
2017-05-26 23:39:22 +08:00
|
|
|
CHECK_EQ(pthread_setspecific(PThreadKey, reinterpret_cast<void *>(
|
|
|
|
GetPthreadDestructorIterations())), 0);
|
[scudo] Improve the scalability of the shared TSD model
Summary:
The shared TSD model in its current form doesn't scale. Here is an example of
rpc2-benchmark (with default parameters, which is threading heavy) on a 72-core
machines (defaulting to a `CompactSizeClassMap` and no Quarantine):
- with tcmalloc: 337K reqs/sec, peak RSS of 338MB;
- with scudo (exclusive): 321K reqs/sec, peak RSS of 637MB;
- with scudo (shared): 241K reqs/sec, peak RSS of 324MB.
This isn't great, since the exclusive model uses a lot of memory, while the
shared model doesn't even come close to be competitive.
This is mostly due to the fact that we are consistently scanning the TSD pool
starting at index 0 for an available TSD, which can result in a lot of failed
lock attempts, and touching some memory that needs not be touched.
This CL attempts to make things better in most situations:
- first, use a thread local variable on Linux (intead of pthread APIs) to store
the current TSD in the shared model;
- move the locking boolean out of the TSD: this allows the compiler to use a
register and potentially optimize out a branch instead of reading it from the
TSD everytime (we also save a tiny bit of memory per TSD);
- 64-bit atomic operations on 32-bit ARM platforms happen to be expensive: so
store the `Precedence` in a `uptr` instead of a `u64`. We lose some
nanoseconds of precision and we'll wrap around at some point, but the benefit
is worth it;
- change a `CHECK` to a `DCHECK`: this should never happen, but if something is
ever terribly wrong, we'll crash on a near null AV if the TSD happens to be
null;
- based on an idea by dvyukov@, we are implementing a bound random scan for
an available TSD. This requires computing the coprimes for the number of TSDs,
and attempting to lock up to 4 TSDs in an random order before falling back to
the current one. This is obviously slightly more expansive when we have just
2 TSDs (barely noticeable) but is otherwise beneficial. The `Precedence` still
basically corresponds to the moment of the first contention on a TSD. To seed
on random choice, we use the precedence of the current TSD since it is very
likely to be non-zero (since we are in the slow path after a failed `tryLock`)
With those modifications, the benchmark yields to:
- with scudo (shared): 330K reqs/sec, peak RSS of 327MB.
So the shared model for this specific situation not only becomes competitive but
outperforms the exclusive model. I experimented with some values greater than 4
for the number of TSDs to attempt to lock and it yielded a decrease in QPS. Just
sticking with the current TSD is also a tad slower. Numbers on platforms with
less cores (eg: Android) remain similar.
Reviewers: alekseyshl, dvyukov, javed.absar
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, kristof.beyls, delcypher, llvm-commits, #sanitizers
Differential Revision: https://reviews.llvm.org/D47289
llvm-svn: 334410
2018-06-11 22:50:31 +08:00
|
|
|
TSD.init();
|
2017-04-28 04:21:16 +08:00
|
|
|
ScudoThreadState = ThreadInitialized;
|
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace __scudo
|
|
|
|
|
[scudo] Scudo thread specific data refactor, part 3
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224
2017-09-27 01:20:02 +08:00
|
|
|
#endif // SCUDO_TSD_EXCLUSIVE
|