The New PE/COFF Linker
======================

This directory contains an experimental linker for the PE/COFF file
format. Because the fundamental design of this port differs from that
of the other LLD ports, it lives in its own directory.

The other ports are based on the Atom model, in which symbols and
references are represented as vertices and edges of graphs.
We do not use that model, because we are aiming for performance and
simplicity. Our plan is to implement a linker for the PE/COFF format
based on a different idea, and then to apply the same idea to ELF if
it proves to be effective.

Overall Design
--------------

This is a list of important data types in this linker.

* SymbolBody

  SymbolBody is a class for symbols. SymbolBodies may be created for
  symbols in object files or in archive file headers. The linker may
  also create them out of nothing.

  There are three main types of SymbolBodies: Defined, Undefined, and
  Lazy. Defined symbols are all symbols that are considered
  "resolved", including real defined symbols, COMDAT symbols, common
  symbols, absolute symbols, linker-created symbols, etc. Undefined
  symbols are symbols that the resolver still needs to replace with
  Defined symbols. Lazy symbols represent symbols we found in archive
  file headers. They can turn into Defined symbols if we read the
  archive members, but we haven't done that yet.

* Symbol

  Symbol is a pointer to a SymbolBody. There's only one Symbol for
  each unique symbol name (this uniqueness is guaranteed by the symbol
  table). Because SymbolBodies are created for each file
  independently, there can be many SymbolBodies for the same
  name. Thus, the relationship between Symbols and SymbolBodies is 1:N.

  The resolver keeps each Symbol's pointer pointing to the "best"
  SymbolBody. Mutating that pointer is the resolve operation in this
  linker.

  SymbolBodies have pointers to their Symbols. That means you can
  always find the best SymbolBody from any SymbolBody by following
  pointers twice. This structure makes it very easy to find
  replacements for symbols. For example, if you have an Undefined
  SymbolBody, you can find a Defined SymbolBody for that symbol just
  by going to its Symbol and then to that Symbol's SymbolBody, assuming
  the resolver has successfully resolved all undefined symbols.

* Chunk

  A Chunk represents a piece of data that will occupy space in the
  output. Chunks may be backed by sections of input files, but they
  can also be created for something else, such as common or BSS
  symbols. The linker may also create chunks out of nothing to append
  additional data to the output.

  Chunks know about their size, how to copy their data to mmap'ed
  outputs, and how to apply relocations to them. Specifically,
  section-based chunks know how to read relocation tables and how to
  apply them.

* SymbolTable

  SymbolTable is basically a hash table from strings to Symbols, with
  logic to resolve symbol conflicts. It resolves conflicts by symbol
  type. For example, if we add an Undefined symbol and then a Defined
  symbol, the symbol table will keep the latter. If we add an Undefined
  symbol and then a Lazy symbol, it will keep the latter. If we add a
  Lazy symbol and then an Undefined symbol, it will keep the former,
  but it will also trigger the Lazy symbol to load the archive member
  to actually resolve the symbol.

* OutputSection

  OutputSection is a container of Chunks. A Chunk belongs to at most
  one OutputSection.
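
The relationship between Symbol, SymbolBody, and SymbolTable can be
sketched roughly as follows. This is a minimal illustration, not the
actual LLD code: the field names (`backref`, `loadRequested`), the
helper `best()`, and the use of the enumerator order as a priority are
all assumptions made for the example. It models only the three
conflict cases listed above and ignores everything else, such as
duplicate definitions.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

struct Symbol;

// Hypothetical, simplified SymbolBody. The enumerator order doubles as a
// priority (an assumption of this sketch): Defined > Lazy > Undefined.
struct SymbolBody {
  enum Kind { Undefined, Lazy, Defined };

  SymbolBody(Kind k, std::string n) : kind(k), name(std::move(n)) {}

  Kind kind;
  std::string name;
  Symbol *backref = nullptr;  // set by SymbolTable::add
  bool loadRequested = false; // stand-in for "read the archive member"
};

// A Symbol is just a slot that points to the current best SymbolBody.
struct Symbol {
  SymbolBody *body;
};

class SymbolTable {
public:
  // Adds a body, doing the single hash lookup for its name, and returns
  // the unique Symbol for that name.
  Symbol *add(SymbolBody *b) {
    auto res = table.emplace(b->name, Symbol{b});
    // Element addresses in an unordered_map stay valid across rehashing.
    Symbol *sym = &res.first->second;
    b->backref = sym;
    if (res.second)
      return sym; // first body seen for this name

    SymbolBody *old = sym->body;
    if (b->kind > old->kind)
      sym->body = b; // e.g. Undefined then Defined, or Undefined then Lazy
    else if (old->kind == SymbolBody::Lazy && b->kind == SymbolBody::Undefined)
      old->loadRequested = true; // Lazy stays, but its member must be loaded
    return sym;
  }

private:
  std::unordered_map<std::string, Symbol> table;
};

// Follows two pointers: body -> Symbol -> best body.
inline SymbolBody *best(const SymbolBody *b) {
  assert(b->backref && "body was never added to a SymbolTable");
  return b->backref->body;
}
```

With this layout, an Undefined body for `foo` that was added before a
Defined body for `foo` reaches the Defined one via `best()`, without
another hash lookup.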

There are three main actors in this linker.

* InputFile

  InputFile is a superclass for file readers. We have a different
  subclass for each input file type, such as regular object files and
  archive files. InputFiles are responsible for creating and owning
  SymbolBodies and Chunks.

* Writer

  The writer is responsible for writing file headers and Chunks to a
  file. It creates OutputSections, puts all Chunks into them, assigns
  unique, non-overlapping addresses and file offsets to them, and then
  writes them to a file.

* Driver

  The linking process is driven by the driver. The driver

  - processes command line options,
  - creates a symbol table,
  - creates an InputFile for each input file and puts all symbols in it
    into the symbol table,
  - checks that no undefined symbols remain,
  - creates a writer,
  - and passes the symbol table to the writer to write the result to a
    file.
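
The address assignment step of the writer can be sketched as a single
pass over the Chunks of an OutputSection. This is only an illustration
under stated assumptions: the `align` field, the `alignTo` helper, and
the `assignOffsets` function are hypothetical names, and the sketch
assumes that every alignment is a power of two.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified Chunk: only what the layout pass needs.
struct Chunk {
  uint64_t size;
  uint64_t align;  // assumed to be a power of two
  uint64_t offset; // assigned by assignOffsets
};

// A container of Chunks, as described above.
struct OutputSection {
  std::vector<Chunk *> chunks;
  uint64_t size = 0;
};

// Rounds v up to the next multiple of align (a power of two).
inline uint64_t alignTo(uint64_t v, uint64_t align) {
  return (v + align - 1) & ~(align - 1);
}

// Places chunks back to back starting at `start`, padding as needed for
// alignment. Each chunk gets a unique range that overlaps no other.
void assignOffsets(OutputSection &sec, uint64_t start) {
  uint64_t off = start;
  for (Chunk *c : sec.chunks) {
    assert(c->align != 0 && (c->align & (c->align - 1)) == 0);
    off = alignTo(off, c->align);
    c->offset = off;
    off += c->size;
  }
  sec.size = off - start;
}
```

Because every chunk starts at or after the end of the previous one, the
assigned ranges cannot overlap.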

Performance
-----------

The linker is currently able to self-host on Windows. It takes 1.2
seconds to self-host on my Xeon 2580 machine, while the existing
Atom-based linker takes 5 seconds to self-host. We believe the
performance difference comes from the simplifications and optimizations
we made in the new port. Notable differences are listed below.

* Reduced number of relocation table reads

  In the existing design, relocation tables are read up front to
  construct graphs, because relocations become graph edges. In the new
  design, they are not read until we actually apply relocations.

  This simplification has two benefits. One is that we don't create
  additional objects for relocations but instead consume relocation
  tables directly. The other is that it reduces the number of
  relocation entries we have to read, because we won't read relocations
  for dead-stripped COMDAT sections. Large C++ programs tend to consist
  of lots of COMDAT sections. In the existing design, the time to
  process relocation tables is linear in the size of the input. In this
  new model, it's linear in the size of the output.

* Reduced number of symbol table lookups

  A symbol table lookup can be a heavy operation because the number of
  symbols can be very large and each symbol name can be very long
  (think of C++ mangled symbols; the time to compute a hash value for
  a string is linear in its length).

  We look up the symbol table exactly once for each symbol in the new
  design. I believe this is the minimum possible number. This is
  achieved by the separation of Symbol and SymbolBody. Once you get a
  pointer to a Symbol by looking up the symbol table, you can always
  get the latest symbol resolution result by just dereferencing a
  pointer. (I'm not sure if the idea is new to linkers. At least,
  all other linkers I've investigated so far seem to look up hash
  tables or sets more than once for each new symbol, but I may be
  wrong.)

* Reduced number of file visits

  The symbol table implements the Windows linker semantics. We treat
  the symbol table as a bucket of all known symbols, including symbols
  in archive file headers. We put all symbols into one bucket as we
  visit new files. That means we visit each file only once.

  This is different from the Unix linker semantics, in which we only
  keep undefined symbols and visit each file one by one until we
  resolve all undefined symbols. In the Unix model, we have to visit
  archive files many times if there are circular dependencies between
  archives.

* Avoiding creating additional objects or copying data

  The data structures described in the previous section are all thin
  wrappers for classes that LLVM libObject provides. We avoid copying
  data from libObject's objects to our objects. We read much less data
  than before. For example, we don't read symbol values until we apply
  relocations because these values are not relevant to symbol
  resolution. Again, COMDAT symbols may be discarded during symbol
  resolution, so reading their attributes too early would be wasted
  work. We use the underlying objects directly where doing so makes
  sense.
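
The deferred relocation reading described above can be illustrated with
a toy model. Everything here is an assumption made for the example: the
`RawReloc` record, the `live` flag, and the "add a value to one byte"
relocation are hypothetical and much simpler than real COFF
relocations. The point is only that a discarded chunk's relocation
table is never touched, and that the table is borrowed rather than
copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical raw relocation record, as it might sit in an input file.
struct RawReloc {
  uint32_t offset; // byte offset within the section
  uint32_t addend; // toy relocation: value added to that byte
};

// Hypothetical section-backed chunk.
struct SectionChunk {
  std::vector<uint8_t> data;
  const std::vector<RawReloc> *relocs; // borrowed from the input, not copied
  bool live;                           // false if the COMDAT was discarded
};

// Copies live chunks to `out` and applies their relocations in place.
// Returns how many relocation entries were actually read.
size_t writeChunks(const std::vector<SectionChunk> &chunks,
                   std::vector<uint8_t> &out) {
  size_t entriesRead = 0;
  for (const SectionChunk &c : chunks) {
    if (!c.live)
      continue; // discarded: its relocation table is never read
    size_t base = out.size();
    out.insert(out.end(), c.data.begin(), c.data.end());
    for (const RawReloc &r : *c.relocs) {
      assert(r.offset < c.data.size());
      out[base + r.offset] =
          static_cast<uint8_t>(out[base + r.offset] + r.addend);
      ++entriesRead;
    }
  }
  return entriesRead;
}
```

In this model the number of relocation entries read depends only on the
chunks that end up in the output, not on the total size of the input.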

Parallelism
-----------

The data structures described above were also chosen with
multi-threading in mind. It should be relatively easy to make the
symbol table a concurrent hash map, so that multiple workers can
work on the symbol table concurrently. Symbol resolution in this design
is a single pointer mutation, which allows the resolver to work
concurrently in a lock-free manner using atomic pointer compare-and-swap.
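
A lock-free resolve step along these lines might look like the sketch
below. It is not taken from the experimental multi-threaded linker. The
integer `rank` field is a hypothetical stand-in for the real conflict
rules (for example Defined above Lazy above Undefined), and the
function name `resolve` is an assumption made for the example.

```cpp
#include <atomic>

// Hypothetical body with a priority; higher rank means "better".
struct SymbolBody {
  int rank;
};

// The Symbol is a single atomic pointer to the current best body.
struct Symbol {
  std::atomic<SymbolBody *> body{nullptr};
};

// Tries to install `candidate` as the best body for `sym`.
// Returns the body that is current once this call is done.
SymbolBody *resolve(Symbol &sym, SymbolBody *candidate) {
  SymbolBody *cur = sym.body.load(std::memory_order_acquire);
  // Retry only while the candidate still beats whatever is installed.
  while (cur == nullptr || candidate->rank > cur->rank) {
    if (sym.body.compare_exchange_weak(cur, candidate,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire))
      return candidate;
    // On failure, `cur` now holds the value another thread installed,
    // and the loop condition is evaluated again against it.
  }
  return cur;
}
```

Each worker calls `resolve` for the bodies it creates. A worker whose
compare-and-swap loses to a better body simply stops retrying, so no
lock is needed around the Symbol.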

It should also be easy to apply relocations and write chunks concurrently.

We created an experimental multi-threaded linker using the Microsoft
ConcRT concurrency library, and it was able to link itself in 0.5
seconds, so we think the design is promising.