The goal of this patch is to maximize CPU utilization on multi-socket or high core count systems, so that parallel computations such as LLD/ThinLTO can use all hardware threads in the system. Before this patch, on Windows, at most 64 hardware threads could be used, in some cases dispatched on only one CPU socket.
== Background ==
Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU: a hyper-thread if SMT is active, a core otherwise. There was a limit of 32 processors on older 32-bit versions of Windows, later raised to 64 processors on 64-bit versions of Windows. This limit comes from the affinity mask, which historically is sized to sizeof(void*). Consequently, the concept of "processor groups" was introduced to deal with systems with more than 64 hyper-threads.
By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to enable this programmatically, by assigning threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries; one can only specify a "preferred" group on start-up, but the application is free to allocate more groups if it wants to.
This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P (64 cores, 128 hyper-threads) are projected by the OS as two (2) "processor groups". This means that by default, an application can only use half of the cores. This situation will only get worse in the years to come, as dies with more cores appear on the market.
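For illustration, a minimal Win32 sketch (hypothetical helper, assuming Windows 7 or later) of what assigning a thread to another "processor group" looks like in practice:

  #include <windows.h>

  // Move the calling thread onto the given processor group, allowing it to run
  // on every active processor in that group.
  bool moveCurrentThreadToGroup(WORD Group) {
    DWORD NumProcs = GetActiveProcessorCount(Group);
    if (NumProcs == 0)
      return false;
    GROUP_AFFINITY Affinity = {};
    Affinity.Group = Group;
    Affinity.Mask = (NumProcs >= sizeof(ULONG_PTR) * 8)
                        ? ~ULONG_PTR(0)
                        : ((ULONG_PTR(1) << NumProcs) - 1);
    return SetThreadGroupAffinity(GetCurrentThread(), &Affinity, nullptr) != 0;
  }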
== The problem ==
The heavyweight_hardware_concurrency() API was introduced so that only *one hardware thread per core* would be used. Once that API returns, the original intention is lost; only the number of threads is retained. Consider a situation, on Windows, where the system has 2 CPU sockets, 18 cores each, with 2 hyper-threads per core, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std::thread::hardware_concurrency() -- which can only return processors from the current "processor group".
== The changes in this patch ==
To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The decision on the number of threads to use is deferred as late as possible, to the moment the std::threads are created (ThreadPool in the case of ThinLTO).
When using hardware_concurrency(), setting ThreadCount to 0 now means using all possible hardware CPU (SMT) threads. Providing a ThreadCount above the maximum number of threads has no effect; the maximum will be used instead.
heavyweight_hardware_concurrency() is similar to hardware_concurrency(), except that only one thread per hardware *core* will be used.
When LLVM_ENABLE_THREADS is OFF, the threading APIs will always return 1, to ensure any caller loops will be exercised at least once.
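As a rough sketch (the usage below is illustrative, not part of this patch), creating a ThinLTO-style thread pool with the new strategy-based API looks like this:

  #include "llvm/Support/ThreadPool.h"
  #include "llvm/Support/Threading.h"

  void runParallelWork() {
    // heavyweight_hardware_concurrency() returns a ThreadPoolStrategy asking
    // for one thread per hardware core; the actual thread count is only
    // computed when the pool spawns its std::threads.
    llvm::ThreadPool Pool(llvm::heavyweight_hardware_concurrency());
    for (int I = 0; I < 100; ++I)
      Pool.async([I] { (void)I; /* per-task work */ });
    Pool.wait();
  }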
Differential Revision: https://reviews.llvm.org/D71775
ObjectLinkingLayer was not correctly propagating dependencies through local
symbols within an object. This could cause symbol lookup to return before a
searched-for symbol is ready if the following conditions are met:
(1) The definition of the symbol being searched for transitively depends on a
local symbol within the same object, and that local symbol in turn
transitively depends on an external symbol provided by some other module
in the JIT.
(2) Concurrent compilation is enabled.
(3) Thread scheduling causes the lookup of the searched-for symbol to return
before all transitive dependencies of the looked-up symbol are emitted.
This bug was found by inspection and has not been observed in practice.
A jitlink test case has been added to verify that symbol dependencies are
correctly propagated through local symbol definitions.
This is how it should've been and brings it more in line with
std::string_view. There should be no functional change here.
This is mostly mechanical from a custom clang-tidy check, with a lot of
manual fixups. It uncovers a lot of minor inefficiencies.
This doesn't actually modify StringRef yet, I'll do that in a follow-up.
Summary:
This is a follow up on https://reviews.llvm.org/D71473#inline-647262.
There's a caveat here: `Align(1)` relies on the compiler understanding the `Log2_64` implementation to produce good code. One could use `Align()` as a replacement, but I believe it is less clear in that case that the alignment is one.
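For reference, a tiny sketch (illustrative only) of the two spellings being compared:

  #include "llvm/Support/Alignment.h"

  llvm::Align oneByValue() { return llvm::Align(1); }  // goes through the Log2_64 path
  llvm::Align oneByDefault() { return llvm::Align(); } // default-constructed alignment of 1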
Reviewers: xbolva00, courbet, bollu
Subscribers: arsenm, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, Jim, kerbowa, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D73099
This commit adds a ManglingOptions struct to IRMaterializationUnit, and replaces
IRCompileLayer::CompileFunction with a new IRCompileLayer::IRCompiler class. The
ManglingOptions struct defines the emulated-TLS state (via a bool member,
EmulatedTLS, which is true if emulated-TLS is enabled and false otherwise). The
IRCompileLayer::IRCompiler class wraps an IRCompiler (the same way that the
CompileFunction typedef used to), but adds a method to return the
IRCompileLayer::ManglingOptions that the compiler will use.
These changes allow us to correctly determine the symbols that will be produced
when a thread local global variable defined at the IR level is compiled with or
without emulated TLS. This is required for ORCv2, where MaterializationUnits
must declare their interface up-front.
Most ORCv2 clients should not require any changes. Clients writing custom IR
compilers will need to wrap their compiler in an IRCompileLayer::IRCompiler,
rather than an IRCompileLayer::CompileFunction, however this should be a
straightforward change (see modifications to CompileUtils.* in this patch for an
example).
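As a rough sketch of that change (class and constructor signatures are reconstructed from this description, so treat them as illustrative rather than exact):

  #include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
  #include "llvm/Support/Error.h"
  using namespace llvm;
  using namespace llvm::orc;

  class MyIRCompiler : public IRCompileLayer::IRCompiler {
  public:
    MyIRCompiler(IRCompileLayer::ManglingOptions MO)
        : IRCompileLayer::IRCompiler(std::move(MO)) {}

    // Plays the role of the old CompileFunction: IR module in, object out.
    Expected<std::unique_ptr<MemoryBuffer>> operator()(Module &M) override {
      // ... run codegen and return the resulting object buffer ...
      return createStringError(inconvertibleErrorCode(), "not implemented");
    }
  };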
The MaterializationResponsibility::defineMaterializing method allows clients to
add new definitions that are in the process of being materialized to the JIT.
This patch adds support to defineMaterializing for symbols with weak linkage
where the new definitions may be rejected if another materializer concurrently
defines the same symbol. If a weak symbol is rejected it will not be added to
the MaterializationResponsibility's responsibility set. Clients can check for
membership in the responsibility set via the
MaterializationResponsibility::getSymbols() method before resolving any
such weak symbols.
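A hedged sketch of that check, as it might appear inside a custom MaterializationUnit's materialize() implementation (the symbol name and flag choices are illustrative):

  #include "llvm/ExecutionEngine/Orc/Core.h"
  #include "llvm/Support/Error.h"
  using namespace llvm;
  using namespace llvm::orc;

  void materializeWithExtraWeakSymbol(MaterializationResponsibility &R,
                                      SymbolStringPtr WeakName) {
    JITSymbolFlags WeakFlags = JITSymbolFlags::Exported;
    WeakFlags |= JITSymbolFlags::Weak;
    SymbolFlagsMap Extra;
    Extra[WeakName] = WeakFlags;
    if (auto Err = R.defineMaterializing(std::move(Extra))) {
      // A real error, not just a lost race for the weak symbol: give up.
      consumeError(std::move(Err));
      R.failMaterialization();
      return;
    }
    if (!R.getSymbols().count(WeakName)) {
      // Another materializer defined WeakName first: it was rejected and is
      // not in our responsibility set, so we must not resolve or emit it.
    }
    // ... resolve and emit the symbols we are responsible for ...
  }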
This patch also adds code to RTDyldObjectLinkingLayer to tag COFF comdat symbols
introduced during codegen as weak, on the assumption that these are COFF comdat
constants. This fixes http://llvm.org/PR40074.
Based on Don Hinton's patch in https://reviews.llvm.org/D72406. This feature
was accidentally left out of e9e26c01cd, and
would have pessimized concurrent compilation in the default case.
Thanks for spotting this Don!
This patch makes the target triple available via the LLJIT interface, and moves
the IRTransformLayer from LLLazyJIT down into LLJIT. Together these changes make
it easier to use the lazyReexports utility with LLJIT, and to apply IR
transforms to code as it is compiled in LLJIT (rather than requiring transforms
to be applied manually before code is added). A code example is added in
llvm/examples/LLJITExamples/LLJITWithLazyReexports
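A hedged sketch of the new surface (accessor names and the transform signature are as I understand them and may differ slightly in-tree):

  #include "llvm/ExecutionEngine/Orc/LLJIT.h"
  using namespace llvm;
  using namespace llvm::orc;

  void configure(LLJIT &J) {
    // The target triple is now queryable directly from LLJIT.
    const Triple &TT = J.getTargetTriple();
    (void)TT;

    // Apply an IR transform to each module as it is compiled.
    J.getIRTransformLayer().setTransform(
        [](ThreadSafeModule TSM, const MaterializationResponsibility &R)
            -> Expected<ThreadSafeModule> {
          // ... inspect or optimize the module here ...
          return std::move(TSM);
        });
  }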
A bug in the existing implementation meant that lazyReexports would not work if
the aliased name differed from the alias's name, i.e. all lazy reexports had to
be of the form (lib1, name) -> (lib2, name). This patch fixes the issue by
capturing the alias's name in the NotifyResolved callback. To simplify this
capture, and the LazyCallThroughManager code in general, the NotifyResolved
callback is updated to use llvm::unique_function rather than a custom class.
No test case yet: This can only be tested at runtime, and the only in-tree
client (lli) always uses aliases with matching names. I will add a new LLJIT
example shortly that will directly test the lazyReexports API and the
non-trivial alias use case.
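For illustration, a hedged sketch of the now-supported case where the alias and aliasee names differ (the helper name and flag choices are illustrative):

  #include "llvm/ExecutionEngine/Orc/LazyReexports.h"
  using namespace llvm;
  using namespace llvm::orc;

  // Expose AliasName in TargetJD as a lazy reexport of AliaseeName in ImplJD,
  // i.e. a (lib1, name1) -> (lib2, name2) mapping with name1 != name2.
  Error addLazyAlias(JITDylib &TargetJD, JITDylib &ImplJD,
                     LazyCallThroughManager &LCTM, IndirectStubsManager &ISM,
                     SymbolStringPtr AliasName, SymbolStringPtr AliaseeName) {
    JITSymbolFlags AliasFlags = JITSymbolFlags::Exported;
    AliasFlags |= JITSymbolFlags::Callable;
    SymbolAliasMap Aliases;
    Aliases[AliasName] = SymbolAliasMapEntry(std::move(AliaseeName), AliasFlags);
    return TargetJD.define(lazyReexports(LCTM, ISM, ImplJD, std::move(Aliases)));
  }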
The argument is llvm::null() everywhere except llvm::errs() in
llvm-objdump in -DLLVM_ENABLE_ASSERTIONS=On builds. It is used by no
target but X86 in -DLLVM_ENABLE_ASSERTIONS=On builds.
If we ever need to add verbose logging to disassemblers, we can record the
log with a member function instead of passing it around as an argument.
This fixes an off-by-one error in the argc value computed by runAsMain, and
switches lli back to using the input bitcode (rather than the string "lli") as
the effective program name.
Thanks to Stefan Graenitz for spotting the bug.
LLJITBuilder will now use JITLink on supported platforms even if a custom
JITTargetMachineBuilder is supplied, provided that neither the code model,
nor the relocation model, nor the ObjectLinkingLayerCreator is set.
JITLink (which underlies ObjectLinkingLayer) is a replacement for RuntimeDyld.
It supports the native code model, and linker plugins that enable a wider range
of features than RuntimeDyld.
Currently only enabled for MachO/x86-64 and MachO/arm64. New architectures will
be added as JITLink support for them is developed.
This relieves ObjectLinkingLayer clients of the responsibility of holding the
memory manager. This makes it easier to select between RTDyldObjectLinkingLayer
(which already owned its memory manager factory) and ObjectLinkingLayer at
runtime as clients aren't required to hold a jitlink::MemoryManager field just
in case ObjectLinkingLayer is selected.
This patch removes the magic "main" JITDylib from ExecutionEngine. The main
JITDylib was created automatically at ExecutionSession construction time, and
all subsequently created JITDylibs were added to the main JITDylib's
links-against list by default. This saves a couple of lines of boilerplate for
simple JIT setups, but this isn't worth introducing magical behavior for.
ORCv2 clients should now construct their own main JITDylib using
ExecutionSession::createJITDylib and set up its linkages manually using
JITDylib::setSearchOrder (or related methods in JITDylib).
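A hedged sketch of that setup (the search-order argument shape has varied across ORC revisions, so treat it as illustrative):

  #include "llvm/ExecutionEngine/Orc/Core.h"
  using namespace llvm;
  using namespace llvm::orc;

  void setUpDylibs(ExecutionSession &ES) {
    JITDylib &MainJD = ES.createJITDylib("main");
    JITDylib &UtilsJD = ES.createJITDylib("utils");
    // Make lookups in MainJD also search UtilsJD's exported symbols, which the
    // old magic main JITDylib's links-against list used to arrange implicitly.
    MainJD.setSearchOrder(
        {{&UtilsJD, JITDylibLookupFlags::MatchExportedSymbolsOnly}});
  }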
The runAsMain function takes a pointer to a function with a standard C main
signature, int(*)(int, char*[]), and invokes it using the given arguments and
program name. The arguments are copied into writable temporary storage as
required by the C and C++ specifications, so runAsMain is safe to use when calling
main functions that modify their arguments in-place.
This patch also uses the new runAsMain function to replace hand-rolled versions
in lli, llvm-jitlink, and the SpeculativeJIT example.
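A hedged sketch of calling the new utility (the header location and the argument values are assumptions):

  #include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
  using namespace llvm;
  using namespace llvm::orc;

  // A main-like callee; mutating argv here is fine because runAsMain hands the
  // function temporary, writable copies of the arguments.
  static int myMain(int argc, char *argv[]) { return argc; }

  int callIt() {
    return runAsMain(&myMain, {"foo", "bar"}, StringRef("my-program"));
  }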
This patch substantially updates ORCv2's lookup API in order to support weak
references, and to better support static archives. Key changes:
-- Each symbol being looked for is now associated with a SymbolLookupFlags
value. If the associated value is SymbolLookupFlags::RequiredSymbol then
the symbol must be defined in one of the JITDylibs being searched (or be
able to be generated in one of these JITDylibs via an attached definition
generator) or the lookup will fail with an error. If the associated value is
SymbolLookupFlags::WeaklyReferencedSymbol then the symbol is permitted to be
undefined, in which case it will simply not appear in the resulting
SymbolMap if the rest of the lookup succeeds.
Since lookup now requires these flags for each symbol, the lookup method now
takes an instance of a new SymbolLookupSet type rather than a SymbolNameSet.
SymbolLookupSet is a vector-backed set of (name, flags) pairs. Clients are
responsible for ensuring that the set property (i.e. unique elements) holds,
though this is usually simple and SymbolLookupSet provides convenience
methods to support this (a short sketch follows this list).
-- Lookups now have an associated LookupKind value, which is either
LookupKind::Static or LookupKind::DLSym. Definition generators can inspect
the lookup kind when determining whether or not to generate new definitions.
The StaticLibraryDefinitionGenerator is updated to only pull in new objects
from the archive if the lookup kind is Static. This allows lookup to be
re-used to emulate dlsym for JIT'd symbols without pulling in new objects
from archives (which would not happen in a normal dlsym call).
-- JITLink is updated to allow externals to be assigned weak linkage, and
weak externals now use the SymbolLookupFlags::WeaklyReferencedSymbol value
for lookups. Unresolved weak references will be assigned the default value of
zero.
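A short sketch, as referenced above, of building a SymbolLookupSet (symbol roles are illustrative; the surrounding call to lookup is elided):

  #include "llvm/ExecutionEngine/Orc/Core.h"
  using namespace llvm;
  using namespace llvm::orc;

  SymbolLookupSet buildLookupSet(SymbolStringPtr Required, SymbolStringPtr Weak) {
    SymbolLookupSet Symbols;
    Symbols.add(std::move(Required)); // defaults to SymbolLookupFlags::RequiredSymbol
    Symbols.add(std::move(Weak), SymbolLookupFlags::WeaklyReferencedSymbol);
    // If the weakly referenced symbol is undefined in every searched JITDylib
    // the lookup can still succeed; it will simply be absent from the SymbolMap.
    return Symbols;
  }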
Since this patch was modifying the lookup API anyway, it also replaces all of the
"MatchNonExported" boolean arguments with a "JITDylibLookupFlags" enum for
readability. If a JITDylib's associated value is
JITDylibLookupFlags::MatchExportedSymbolsOnly then the lookup will only
match against exported (non-hidden) symbols in that JITDylib. If a JITDylib's
associated value is JITDylibLookupFlags::MatchAllSymbols then the lookup will
match against any symbol defined in the JITDylib.
Summary:
Most libraries are defined in the lib/ directory but there are also a
few libraries defined in tools/ e.g. libLLVM, libLTO. I'm defining
"Component Libraries" as libraries defined in lib/ that may be included in
libLLVM.so. Explicitly marking the libraries in lib/ as component
libraries allows us to remove some fragile checks that attempt to
differentiate between lib/ libraries and tools/ libraries:
1. In tools/llvm-shlib, because
llvm_map_components_to_libnames(LIB_NAMES "all") returned a list of
all libraries defined in the whole project, there was custom code
needed to filter out libraries defined in tools/, none of which should
be included in libLLVM.so. This code assumed that any library
defined as static was from lib/ and everything else should be
excluded.
With this change, llvm_map_components_to_libnames(LIB_NAMES, "all")
only returns libraries that have been added to the LLVM_COMPONENT_LIBS
global cmake property, so this custom filtering logic can be removed.
Doing this also fixes the build with BUILD_SHARED_LIBS=ON
and LLVM_BUILD_LLVM_DYLIB=ON.
2. There was some code in llvm_add_library that assumed that
libraries defined in lib/ would not have LLVM_LINK_COMPONENTS or
ARG_LINK_COMPONENTS set. This is only true because libraries
defined in lib/ use LLVMBuild.txt and don't set these values.
This code has been fixed now to check if the library has been
explicitly marked as a component library, which should now make it
easier to remove LLVMBuild at some point in the future.
I have tested this patch on Windows, MacOS and Linux with release builds
and the following combinations of CMake options:
- "" (No options)
- -DLLVM_BUILD_LLVM_DYLIB=ON
- -DLLVM_LINK_LLVM_DYLIB=ON
- -DBUILD_SHARED_LIBS=ON
- -DBUILD_SHARED_LIBS=ON -DLLVM_BUILD_LLVM_DYLIB=ON
- -DBUILD_SHARED_LIBS=ON -DLLVM_LINK_LLVM_DYLIB=ON
Reviewers: beanz, smeenai, compnerd, phosek
Reviewed By: beanz
Subscribers: wuzish, jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, mgorny, mehdi_amini, sbc100, jgravelle-google, hiraditya, aheejin, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, steven_wu, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, dang, Jim, lenary, s.egerton, pzheng, sameer.abuasal, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70179
In a34680a33e OrcError.h and Orc/RPC/*.h were split out from the rest of
ExecutionEngine in order to eliminate false dependencies for remote JIT
targets (see https://reviews.llvm.org/D68732), however this broke modules
builds (see https://reviews.llvm.org/D69817).
This patch splits these headers out into a separate module, LLVM_OrcSupport,
in order to fix the modules build.
Fixes <rdar://56377508>.
It was failing with
PerfJITEventListener.cpp:489:7: error: 'ManagedStatic' in namespace 'llvm' does not name a template type
llvm::ManagedStatic<PerfJITEventListener> PerfListener;
It was failing with
llvm/lib/ExecutionEngine/Orc/DebugUtils.cpp:56:10:
error: could not convert ‘Obj’ from ‘std::unique_ptr<llvm::MemoryBuffer>’
to ‘llvm::Expected<std::unique_ptr<llvm::MemoryBuffer> >’
return Obj;
^
Adds a DumpObjects utility that can be used to dump JIT'd objects to disk.
Instances of DumpObjects may be used by ObjectTransformLayer as no-op
transforms.
This patch also adds an ObjectTransformLayer to LLJIT and an example of how
to use this utility to dump JIT'd objects in LLJIT.
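A hedged sketch of that LLJIT usage (the getObjTransformLayer() accessor name and the dump directory are assumptions):

  #include "llvm/ExecutionEngine/Orc/DebugUtils.h"
  #include "llvm/ExecutionEngine/Orc/LLJIT.h"
  using namespace llvm;
  using namespace llvm::orc;

  // Dump every JIT'd object into "jit-objs/" as it flows through LLJIT's
  // ObjectTransformLayer; DumpObjects behaves as a no-op transform that
  // returns the buffer unchanged after writing it to disk.
  void installObjectDumper(LLJIT &J) {
    J.getObjTransformLayer().setTransform(DumpObjects("jit-objs"));
  }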
Some targets (e.g. MachO/arm64) use relocations to fix some CFI record fields
in the eh-frame section. When relocations are used the initial (pre-relocation)
content of the eh-frame section can no longer be interpreted by following the
eh-frame specification. This causes errors in the existing eh-frame parser.
This patch moves eh-frame handling into two LinkGraph passes that are run after
relocations have been parsed (but before they are applied). The first pass
breaks up blocks in the eh-frame section into per-CFI-record blocks, and the
second parses blocks of (potentially multiple) CFI records and adds the
appropriate edges to any CFI fields that do not have existing relocations.
These passes can be run independently of one another. By handling eh-frame
splitting/fixing with LinkGraph passes we can both re-use existing relocations
for CFI record fields and avoid applying eh-frame fixups before parsing the
section (which would complicate the linker and require extra temporary
allocations of working memory).
LinkGraph::splitBlock will split a block at a given index, returning a new
block covering the range [ 0, index ) and modifying the original block to
cover the range [ index, original-block-size ). Block addresses, content,
edges and symbols will be updated as necessary. This utility will be used
in upcoming improvements to JITLink's eh-frame support.
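A hedged sketch of the behaviour described above (the surrounding pass context is elided; splitBlock may also take an optional cache argument):

  #include "llvm/ExecutionEngine/JITLink/JITLink.h"
  using namespace llvm::jitlink;

  // Split B in half. The returned block covers the first half at B's original
  // address; B itself now covers the second half, and its address, content,
  // edges and symbols have been updated accordingly.
  void splitInHalf(LinkGraph &G, Block &B) {
    Block &FirstHalf = G.splitBlock(B, B.getSize() / 2);
    (void)FirstHalf;
  }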
Summary:
When creating an ORC remote JIT target, the current library split forces the target process to link large portions of LLVM (Core, Execution Engine, JITLink, Object, MC, Passes, RuntimeDyld, Support, Target, and TransformUtils). This occurs because the ORC RPC interfaces rely on the static globals the ORC Error types require, which starts a cycle of pulling in more and more.
This patch breaks the ORC RPC Error implementations out into an "OrcError" library which only depends on LLVM Support. It also pulls the ORC RPC headers into their own subdirectory.
With this patch code can include the Orc/RPC/*.h headers and will only incur link dependencies on LLVMOrcError and LLVMSupport.
Reviewers: lhames
Reviewed By: lhames
Subscribers: mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68732
Sections may have zero size and zero-sized sections may share a start address
with other zero-sized sections. For the section overlap test to function
correctly zero-sized sections must be ordered before any non-zero sized ones.
This should fix the intermittent failures in the
test/ExecutionEngine/JITLink/X86/MachO_zero_fill_alignment.s test case that
have been observed on some builders.
It currently returns just a section_iterator and has a report_fatal_error call inside.
This change adds a way to return errors and handle them on the caller side.
The patch also changes/improves current users and adds test cases.
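A generic sketch of the pattern being introduced (the real function isn't named here, so the helper and its logic are purely illustrative):

  #include "llvm/Object/ObjectFile.h"
  #include "llvm/Support/Error.h"
  using namespace llvm;
  using namespace llvm::object;

  // Return an Expected<section_iterator> instead of calling report_fatal_error,
  // leaving the caller to decide how to react to the failure.
  Expected<section_iterator> findCompanionSection(const ObjectFile &Obj,
                                                  const SectionRef &Sec) {
    for (const SectionRef &Candidate : Obj.sections())
      if (!(Candidate == Sec))
        return section_iterator(Candidate);
    return createStringError(object_error::parse_failed,
                             "no companion section found");
  }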
Differential revision: https://reviews.llvm.org/D69167
llvm-svn: 375408
Works on this dependency chain:
ArrayRef.h ->
Hashing.h -> --CUT--
Host.h ->
StringMap.h / StringRef.h
ArrayRef is very popular, but Host.h is rarely needed. Move the
IsBigEndianHost constant to SwapByteOrder.h. Clients of that header are
more likely to need it.
llvm-svn: 375316