forked from OSchip/llvm-project
244 lines
9.9 KiB
Markdown
244 lines
9.9 KiB
Markdown
The New PE/COFF Linker
|
|
======================
|
|
|
|
This directory contains an experimental linker for the PE/COFF file
|
|
format. Because the fundamental design of this port is different from
|
|
the other ports of LLD, this port is separated to this directory.
|
|
|
|
The other ports are based on the Atom model, in which symbols and
|
|
references are represented as vertices and edges of graphs.
|
|
We don't use that model to aim for performance and simplicity.
|
|
Our plan is to implement a linker for the PE/COFF format based on a
|
|
different idea, and then apply the same idea to the ELF if proved to
|
|
be effective.
|
|
|
|
Overall Design
|
|
--------------
|
|
|
|
This is a list of important data types in this linker.
|
|
|
|
* SymbolBody
|
|
|
|
SymbolBody is a class for symbols. They may be created for symbols
|
|
in object files or in archive file headers. The linker may create
|
|
them out of nothing.
|
|
|
|
There are mainly three types of SymbolBodies: Defined, Undefined, or
|
|
Lazy. Defined symbols are for all symbols that are considered as
|
|
"resolved", including real defined symbols, COMDAT symbols, common
|
|
symbols, absolute symbols, linker-created symbols, etc. Undefined
|
|
symbols are for undefined symbols, which need to be replaced by
|
|
Defined symbols by the resolver. Lazy symbols represent symbols we
|
|
found in archive file headers -- which can turn into Defined symbols
|
|
if we read archieve members, but we haven't done that yet.
|
|
|
|
* Symbol
|
|
|
|
Symbol is a pointer to a SymbolBody. There's only one Symbol for
|
|
each unique symbol name (this uniqueness is guaranteed by the symbol
|
|
table). Because SymbolBodies are created for each file
|
|
independently, there can be many SymbolBodies for the same
|
|
name. Thus, the relationship between Symbols and SymbolBodies is 1:N.
|
|
|
|
The resolver keeps the Symbol's pointer to always point to the "best"
|
|
SymbolBody. Pointer mutation is the resolve operation in this
|
|
linker.
|
|
|
|
SymbolBodies have pointers to their Symbols. That means you can
|
|
always find the best SymbolBody from any SymbolBody by following
|
|
pointers twice. This structure makes it very easy to find
|
|
replacements for symbols. For example, if you have an Undefined
|
|
SymbolBody, you can find a Defined SymbolBody for that symbol just
|
|
by going to its Symbol and then to SymbolBody, assuming the resolver
|
|
have successfully resolved all undefined symbols.
|
|
|
|
* Chunk
|
|
|
|
Chunk represents a chunk of data that will occupy space in an
|
|
output. Each regular section becomes a chunk.
|
|
Chunks created for common or BSS symbols are not backed by sections.
|
|
The linker may create chunks out of nothing to append additional
|
|
data to an output.
|
|
|
|
Chunks know about their size, how to copy their data to mmap'ed
|
|
outputs, and how to apply relocations to them. Specifically,
|
|
section-based chunks know how to read relocation tables and how to
|
|
apply them.
|
|
|
|
* SymbolTable
|
|
|
|
SymbolTable is basically a hash table from strings to Symbols, with
|
|
a logic to resolve symbol conflicts. It resolves conflicts by symbol
|
|
type. For example, if we add Undefined and Defined symbols, the
|
|
symbol table will keep the latter. If we add Defined and Lazy
|
|
symbols, it will keep the former. If we add Lazy and Undefined, it
|
|
will keep the former, but it will also trigger the Lazy symbol to
|
|
load the archive member to actually resolve the symbol.
|
|
|
|
* OutputSection
|
|
|
|
OutputSection is a container of Chunks. A Chunk belongs to at most
|
|
one OutputSection.
|
|
|
|
There are mainly three actors in this linker.
|
|
|
|
* InputFile
|
|
|
|
InputFile is a superclass of file readers. We have a different
|
|
subclass for each input file type, such as regular object file,
|
|
archive file, etc. They are responsible for creating and owning
|
|
SymbolBodies and Chunks.
|
|
|
|
* Writer
|
|
|
|
The writer is responsible for writing file headers and Chunks to a
|
|
file. It creates OutputSections, put all Chunks into them, assign
|
|
unique, non-overlapping addresses and file offsets to them, and then
|
|
write them down to a file.
|
|
|
|
* Driver
|
|
|
|
The linking process is drived by the driver. The driver
|
|
|
|
- processes command line options,
|
|
- creates a symbol table,
|
|
- creates an InputFile for each input file and put all symbols in it
|
|
into the symbol table,
|
|
- checks if there's no remaining undefined symbols,
|
|
- creates a writer,
|
|
- and passes the symbol table to the writer to write the result to a
|
|
file.
|
|
|
|
Performance
|
|
-----------
|
|
|
|
Currently it's able to self-host on the Windows platform. It takes 1.2
|
|
seconds to self-host on my Xeon 2580 machine, while the existing
|
|
Atom-based linker takes 5 seconds to self-host. We believe the
|
|
performance difference comes from simplification and optimizations we
|
|
made to the new port. Notable differences are listed below.
|
|
|
|
* Reduced number of relocation table reads
|
|
|
|
In the existing design, relocation tables are read from beginning to
|
|
construct graphs because they consist of graph edges. In the new
|
|
design, they are not read until we actually apply relocations.
|
|
|
|
This simplification has two benefits. One is that we don't create
|
|
additional objects for relocations but instead consume relocation
|
|
tables directly. The other is that it reduces number of relocation
|
|
entries we have to read, because we won't read relocations for
|
|
dead-stripped COMDAT sections. Large C++ programs tend to consist of
|
|
lots of COMDAT sections. In the existing design, the time to process
|
|
relocation table is linear to size of input. In this new model, it's
|
|
linear to size of output.
|
|
|
|
* Reduced number of symbol table lookup
|
|
|
|
Symbol table lookup can be a heavy operation because number of
|
|
symbols can be very large and each symbol name can be very long
|
|
(think of C++ mangled symbols -- time to compute a hash value for a
|
|
string is linear to the length.)
|
|
|
|
We look up the symbol table exactly only once for each symbol in the
|
|
new design. This is I believe the minimum possible number. This is
|
|
achieved by the separation of Symbol and SymbolBody. Once you get a
|
|
pointer to a Symbol by looking up the symbol table, you can always
|
|
get the latest symbol resolution result by just dereferencing a
|
|
pointer. (I'm not sure if the idea is new to the linker. At least,
|
|
all other linkers I've investigated so far seem to look up hash
|
|
tables or sets more than once for each new symbol, but I may be
|
|
wrong.)
|
|
|
|
* Reduced number of file visits
|
|
|
|
The symbol table implements the Windows linker semantics. We treat
|
|
the symbol table as a bucket of all known symbols, including symbols
|
|
in archive file headers. We put all symbols into one bucket as we
|
|
visit new files. That means we visit each file only once.
|
|
|
|
This is different from the Unix linker semantics, in which we only
|
|
keep undefined symbols and visit each file one by one until we
|
|
resolve all undefined symbols. In the Unix model, we have to visit
|
|
archive files many times if there are circular dependencies between
|
|
archives.
|
|
|
|
* Avoiding creating additional objects or copying data
|
|
|
|
The data structures described in the previous section are all thin
|
|
wrappers for classes that LLVM libObject provides. We avoid copying
|
|
data from libObject's objects to our objects. We read much less data
|
|
than before. For example, we don't read symbol values until we apply
|
|
relocations because these values are not relevant to symbol
|
|
resolution. Again, COMDAT symbols may be discarded during symbol
|
|
resolution, so reading their attributes too early could result in a
|
|
waste. We use underlying objects directly where doing so makes
|
|
sense.
|
|
|
|
Parallelism
|
|
-----------
|
|
|
|
The abovementioned data structures are also chosen with
|
|
multi-threading in mind. It should relatively be easy to make the
|
|
symbol table a concurrent hash map, so that we let multiple workers
|
|
work on symbol table concurrently. Symbol resolution in this design is
|
|
a single pointer mutation, which allows the resolver work concurrently
|
|
in a lock-free manner using atomic pointer compare-and-swap.
|
|
|
|
It should also be easy to apply relocations and write chunks concurrently.
|
|
|
|
We created an experimental multi-threaded linker using the Microsoft
|
|
ConcRT concurrency library, and it was able to link itself in 0.5
|
|
seconds, so we think the design is promising.
|
|
|
|
Link-Time Optimization
|
|
----------------------
|
|
|
|
LTO is implemented by handling LLVM bitcode files as object files.
|
|
The linker resolves symbols in bitcode files normally. If all symbols
|
|
are successfully resolved, it then calls an LLVM libLTO function
|
|
with all bitcode files to convert them to one big regular COFF file.
|
|
Finally, the linker replaces bitcode symbols with COFF symbols,
|
|
so that we can link the input files as if they were in the native
|
|
format from the beginning.
|
|
|
|
The details are described in this document.
|
|
http://llvm.org/docs/LinkTimeOptimization.html
|
|
|
|
Glossary
|
|
--------
|
|
|
|
* RVA
|
|
|
|
Short for Relative Virtual Address.
|
|
|
|
Windows executables or DLLs are not position-independent; they are
|
|
linked against a fixed address called an image base. RVAs are
|
|
offsets from an image base.
|
|
|
|
Default image bases are 0x140000000 for executables and 0x18000000
|
|
for DLLs. For example, when we are creating an executable, we assume
|
|
that the executable will be loaded at address 0x140000000 by the
|
|
loader, so we apply relocations accordingly. Result texts and data
|
|
will contain raw absolute addresses.
|
|
|
|
* VA
|
|
|
|
Short for Virtual Address. Equivalent to RVA + image base. It is
|
|
rarely used. We almost always use RVAs instead.
|
|
|
|
* Base relocations
|
|
|
|
Relocation information for the loader. If the loader decides to map
|
|
an executable or a DLL to a different address than their image
|
|
bases, it fixes up binaries using information contained in the base
|
|
relocation table. A base relocation table consists of a list of
|
|
locations containing addresses. The loader adds a difference between
|
|
RVA and actual load address to all locations listed there.
|
|
|
|
Note that this run-time relocation mechanism is much simpler than ELF.
|
|
There's no PLT or GOT. Images are relocated as a whole just
|
|
by shifting entire images in memory by some offsets. Although doing
|
|
this breaks text sharing, I think this mechanism is not actually bad
|
|
on today's computers.
|