llvm-project/compiler-rt/lib/scudo/scudo_tsd_shared.cpp


//===-- scudo_tsd_shared.cpp ------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
///
/// Scudo shared TSD implementation.
///
//===----------------------------------------------------------------------===//
#include "scudo_tsd.h"

#if !SCUDO_TSD_EXCLUSIVE

namespace __scudo {

static pthread_once_t GlobalInitialized = PTHREAD_ONCE_INIT;
pthread_key_t PThreadKey;

static atomic_uint32_t CurrentIndex;
static ScudoTSD *TSDs;
static u32 NumberOfTSDs;
static u32 CoPrimes[SCUDO_SHARED_TSD_POOL_SIZE];
static u32 NumberOfCoPrimes = 0;

#if SANITIZER_LINUX && !SANITIZER_ANDROID
__attribute__((tls_model("initial-exec")))
THREADLOCAL ScudoTSD *CurrentTSD;
#endif
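// Note: the initial-exec TLS model makes accesses to CurrentTSD a cheap,
// load-time-resolved offset from the thread pointer, at the cost of the
// allocator not being loadable via dlopen().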

static void initOnce() {
  CHECK_EQ(pthread_key_create(&PThreadKey, NULL), 0);
  initScudo();
  NumberOfTSDs = Min(Max(1U, GetNumberOfCPUsCached()),
                     static_cast<u32>(SCUDO_SHARED_TSD_POOL_SIZE));
  TSDs = reinterpret_cast<ScudoTSD *>(
      MmapOrDie(sizeof(ScudoTSD) * NumberOfTSDs, "ScudoTSDs"));
  for (u32 I = 0; I < NumberOfTSDs; I++) {
    TSDs[I].init();
    // Compute the GCD of I + 1 and NumberOfTSDs: if it is 1, the two are
    // coprime and I + 1 is recorded as a usable stride for walking the pool.
    u32 A = I + 1;
    u32 B = NumberOfTSDs;
    while (B != 0) {
      const u32 T = A;
      A = B;
      B = T % B;
    }
    if (A == 1)
      CoPrimes[NumberOfCoPrimes++] = I + 1;
  }
}
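
// Worked example: with NumberOfTSDs == 6, the loop above yields
// CoPrimes == {1, 5}. A stride coprime with the pool size visits every index
// exactly once per cycle, e.g. starting at 2 with stride 5 (mod 6):
// 2, 1, 0, 5, 4, 3, then back to 2.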

ALWAYS_INLINE void setCurrentTSD(ScudoTSD *TSD) {
#if SANITIZER_ANDROID
  *get_android_tls_ptr() = reinterpret_cast<uptr>(TSD);
#elif SANITIZER_LINUX
  CurrentTSD = TSD;
#else
  CHECK_EQ(pthread_setspecific(PThreadKey, reinterpret_cast<void *>(TSD)), 0);
#endif  // SANITIZER_ANDROID
}
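
// The matching read side, getCurrentTSD(), is assumed to mirror the three
// branches above; in this codebase it is declared alongside ScudoTSD in
// scudo_tsd.h.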

void initThread(bool MinimalInit) {
  UNUSED(MinimalInit);
  pthread_once(&GlobalInitialized, initOnce);
  // Initial context assignment is done in a plain round-robin fashion.
  u32 Index = atomic_fetch_add(&CurrentIndex, 1, memory_order_relaxed);
  setCurrentTSD(&TSDs[Index % NumberOfTSDs]);
}
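
// Slow path of the TSD acquisition, reached after a failed tryLock() on the
// current TSD. Per the change that introduced this scheme
// (https://reviews.llvm.org/D47289): the pool is scanned in a bounded random
// order, attempting to lock up to 4 TSDs before settling for the least
// contended one; attempting more than 4 was measured to decrease QPS. A TSD's
// Precedence roughly corresponds to the moment of the first contention on it.
// On the 72-core benchmark quoted there, this took the shared model from 241K
// to 330K reqs/sec, ahead of the exclusive model (321K) at roughly half its
// peak RSS (327MB vs 637MB).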
ScudoTSD *getTSDAndLockSlow(ScudoTSD *TSD) {
  if (NumberOfTSDs > 1) {
    // Use the Precedence of the current TSD as our random seed. Since we are
    // in the slow path, it means that tryLock failed, and as a result it's
    // very likely that said Precedence is non-zero.
    u32 RandState = static_cast<u32>(TSD->getPrecedence());
    const u32 R = Rand(&RandState);
    const u32 Inc = CoPrimes[R % NumberOfCoPrimes];
    u32 Index = R % NumberOfTSDs;
    uptr LowestPrecedence = UINTPTR_MAX;
    ScudoTSD *CandidateTSD = nullptr;
    // Go randomly through at most 4 contexts and find a candidate.
    for (u32 I = 0; I < Min(4U, NumberOfTSDs); I++) {
      if (TSDs[Index].tryLock()) {
        setCurrentTSD(&TSDs[Index]);
        return &TSDs[Index];
      }
      const uptr Precedence = TSDs[Index].getPrecedence();
      // A 0 precedence here means another thread just locked this TSD.
      if (Precedence && Precedence < LowestPrecedence) {
        CandidateTSD = &TSDs[Index];
        LowestPrecedence = Precedence;
      }
      // Advance by a stride coprime with NumberOfTSDs so that the scan can
      // reach any TSD in the pool.
      Index += Inc;
      if (Index >= NumberOfTSDs)
        Index -= NumberOfTSDs;
    }
    if (CandidateTSD) {
      CandidateTSD->lock();
      setCurrentTSD(CandidateTSD);
      return CandidateTSD;
    }
  }
  // Last resort, stick with the current one.
  TSD->lock();
  return TSD;
}

}  // namespace __scudo

#endif  // !SCUDO_TSD_EXCLUSIVE