Summary:
The shared TSD model in its current form doesn't scale. Here is an example of
rpc2-benchmark (with default parameters, which is threading heavy) on a 72-core
machines (defaulting to a `CompactSizeClassMap` and no Quarantine):
- with tcmalloc: 337K reqs/sec, peak RSS of 338MB;
- with scudo (exclusive): 321K reqs/sec, peak RSS of 637MB;
- with scudo (shared): 241K reqs/sec, peak RSS of 324MB.
This isn't great, since the exclusive model uses a lot of memory, while the
shared model doesn't even come close to be competitive.
This is mostly due to the fact that we are consistently scanning the TSD pool
starting at index 0 for an available TSD, which can result in a lot of failed
lock attempts, and touching some memory that needs not be touched.
This CL attempts to make things better in most situations:
- first, use a thread local variable on Linux (intead of pthread APIs) to store
the current TSD in the shared model;
- move the locking boolean out of the TSD: this allows the compiler to use a
register and potentially optimize out a branch instead of reading it from the
TSD everytime (we also save a tiny bit of memory per TSD);
- 64-bit atomic operations on 32-bit ARM platforms happen to be expensive: so
store the `Precedence` in a `uptr` instead of a `u64`. We lose some
nanoseconds of precision and we'll wrap around at some point, but the benefit
is worth it;
- change a `CHECK` to a `DCHECK`: this should never happen, but if something is
ever terribly wrong, we'll crash on a near null AV if the TSD happens to be
null;
- based on an idea by dvyukov@, we are implementing a bound random scan for
an available TSD. This requires computing the coprimes for the number of TSDs,
and attempting to lock up to 4 TSDs in an random order before falling back to
the current one. This is obviously slightly more expansive when we have just
2 TSDs (barely noticeable) but is otherwise beneficial. The `Precedence` still
basically corresponds to the moment of the first contention on a TSD. To seed
on random choice, we use the precedence of the current TSD since it is very
likely to be non-zero (since we are in the slow path after a failed `tryLock`)
With those modifications, the benchmark yields to:
- with scudo (shared): 330K reqs/sec, peak RSS of 327MB.
So the shared model for this specific situation not only becomes competitive but
outperforms the exclusive model. I experimented with some values greater than 4
for the number of TSDs to attempt to lock and it yielded a decrease in QPS. Just
sticking with the current TSD is also a tad slower. Numbers on platforms with
less cores (eg: Android) remain similar.
Reviewers: alekseyshl, dvyukov, javed.absar
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, kristof.beyls, delcypher, llvm-commits, #sanitizers
Differential Revision: https://reviews.llvm.org/D47289
llvm-svn: 334410
Summary:
Follow up to D38826.
We introduce `pthread_{get,set}specific` versions of `{get,set}CurrentTSD` to
allow for non Android platforms to use the Shared TSD model.
We now allow `SCUDO_TSD_EXCLUSIVE` to be defined at compile time.
A couple of things:
- I know that `#if SANITIZER_ANDROID` is not ideal within a function, but in
the end I feel it looks more compact and clean than going the .inc route; I
am open to an alternative if anyone has one;
- `SCUDO_TSD_EXCLUSIVE=1` requires ELF TLS support (and not emutls as this uses
malloc). I haven't found anything to enforce that, so it's currently not
checked.
Reviewers: alekseyshl
Reviewed By: alekseyshl
Subscribers: srhines, llvm-commits
Differential Revision: https://reviews.llvm.org/D38854
llvm-svn: 315751
Summary:
This first part just prepares the grounds for part 2 and doesn't add any new
functionality. It mostly consists of small refactors:
- move the `pthread.h` include higher as it will be used in the headers;
- use `errno.h` in `scudo_allocator.cpp` instead of the sanitizer one, update
the `errno` assignments accordingly (otherwise it creates conflicts on some
platforms due to `pthread.h` including `errno.h`);
- introduce and use `getCurrentTSD` and `setCurrentTSD` for the shared TSD
model code;
Reviewers: alekseyshl
Reviewed By: alekseyshl
Subscribers: llvm-commits, srhines
Differential Revision: https://reviews.llvm.org/D38826
llvm-svn: 315583
Summary:
Previous parts: D38139, D38183.
In this part of the refactor, we abstract the Linux vs Android TSD dissociation
in favor of a Exclusive vs Shared one, allowing for easier platform introduction
and configuration.
Most of this change consist of shuffling the files around to reflect the new
organization.
We introduce `scudo_platform.h` where platform specific definition lie. This
involves the TSD model and the platform specific allocator parameters. In an
upcoming CL, those will be configurable via defines, but we currently stick
with conservative defaults.
Reviewers: alekseyshl, dvyukov
Reviewed By: alekseyshl, dvyukov
Subscribers: srhines, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D38244
llvm-svn: 314224