.. _Readers:

Developing lld Readers
======================

Introduction
------------

The purpose of a "Reader" is to take an object file in a particular format
and create an `lld::File`:cpp:class: (which is a graph of Atoms)
representing the object file.  A Reader inherits from
`lld::Reader`:cpp:class: which lives in
:file:`include/lld/ReaderWriter/Reader.h` and
:file:`lib/ReaderWriter/Reader.cpp`.

The Reader infrastructure for an object format ``Foo`` requires the
following pieces in order to fit into lld:

:file:`include/lld/ReaderWriter/ReaderFoo.h`

   .. cpp:class:: ReaderOptionsFoo : public ReaderOptions

      This Options class is the only way to configure how the Reader will
      parse any file into an `lld::File`:cpp:class: object.  This class
      should be declared in the `lld`:cpp:class: namespace.

   .. cpp:function:: Reader *createReaderFoo(ReaderOptionsFoo &reader)

      This factory function configures and creates the Reader. This function
      should be declared in the `lld`:cpp:class: namespace.

:file:`lib/ReaderWriter/Foo/ReaderFoo.cpp`

   .. cpp:class:: ReaderFoo : public Reader

      This is the concrete Reader class which can be called to parse
      object files. It should be declared in an anonymous namespace or,
      if it shares code with `lld::WriterFoo`:cpp:class:, in a nested
      namespace (e.g. `lld::foo`:cpp:class:).

You may have noticed that :cpp:class:`ReaderFoo` is not declared in the
``.h`` file. An important design aspect of lld is that all Readers are
created *only* through an object-format-specific
:cpp:func:`createReaderFoo` factory function. The creation of the Reader is
parametrized through a :cpp:class:`ReaderOptionsFoo` class. This options
class is the one-and-only way to control how the Reader operates when
parsing an input file into an Atom graph. For instance, you may want the
Reader to only accept certain architectures. The options class can be
instantiated from command line options or be programmatically configured.
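
The split between a public options class, a public factory function, and a
hidden concrete Reader can be sketched as follows. This is a minimal
illustration, not lld's actual code: the ``lld_sketch`` namespace, the
``acceptsArch()`` query, and ``addAllowedArch()`` are hypothetical names, and
the factory returns a ``std::unique_ptr`` here instead of the raw pointer shown
in the declaration above.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>

namespace lld_sketch {

// Stand-in for lld::Reader; acceptsArch() is a hypothetical query that
// exists only so the sketch has observable behavior.
class Reader {
public:
  virtual ~Reader() = default;
  virtual bool acceptsArch(const std::string &arch) const = 0;
};

// --- what would live in ReaderFoo.h ---

// Options class: the only way callers influence the Reader.
class ReaderOptionsFoo {
public:
  void addAllowedArch(std::string arch) { allowed_.insert(std::move(arch)); }
  // An empty set means "accept every architecture".
  bool isAllowed(const std::string &arch) const {
    return allowed_.empty() || allowed_.count(arch) != 0;
  }

private:
  std::set<std::string> allowed_;
};

std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options);

// --- what would live in ReaderFoo.cpp ---

namespace {
// Concrete Reader, invisible outside this translation unit.
class ReaderFoo : public Reader {
public:
  explicit ReaderFoo(const ReaderOptionsFoo &options) : options_(options) {}
  bool acceptsArch(const std::string &arch) const override {
    return options_.isAllowed(arch);
  }

private:
  const ReaderOptionsFoo options_; // copied once, never mutated
};
} // namespace

std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options) {
  return std::make_unique<ReaderFoo>(options);
}

} // namespace lld_sketch
```

A caller fills in a ``ReaderOptionsFoo``, passes it to ``createReaderFoo()``,
and from then on only sees the abstract ``Reader`` interface.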

Where to start
--------------

The lld project already has a skeleton of source code for Readers for
``ELF``, ``PECOFF``, ``MachO``, and lld's native Atom graph format
(both binary ``Native`` and ``YAML`` representations).  If your file format
is a variant of one of those, you should modify the existing Reader to
support your variant. This is done by customizing the Options
class for the Reader and making appropriate changes to the ``.cpp`` file to
interpret those options and act accordingly.

If your object file format is not a variant of any existing Reader, you'll need
to create a new Reader subclass with the organization described above.

Readers are factories
---------------------

The linker will usually instantiate your Reader only once.  That one Reader will
have its ``parseFile()`` method called many times with different input files.
To support multithreaded linking, the Reader may be parsing multiple input
files in parallel. Therefore, there should be no parsing state in your Reader
object.  Any parsing state should be in member variables of your File subclass
or in some temporary object.

The key method to implement in a reader is::

  virtual error_code parseFile(std::unique_ptr<MemoryBuffer> mb,
                               std::vector<std::unique_ptr<File>> &result);

It takes a memory buffer (which contains the contents of the object file
being read) and, on success, appends one or more instantiated ``lld::File``
objects, each of which is a collection of Atoms, to ``result``. The result is a
vector of File pointers (instead of a single File pointer) because some file
formats allow multiple object "files" to be encoded in one file system file.
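
As a rough sketch of a stateless ``parseFile()``, the example below keeps all
parsing state in locals and appends to ``result`` only after the whole buffer
has been parsed. Everything in it is an assumption made for illustration:
``ReaderToy``, the ``File`` stand-in, the use of ``std::string`` in place of
``MemoryBuffer``, and the toy text format in which each line names an atom and
a line containing ``---`` starts a new member file.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Stand-in for lld::File: just the names of the atoms it contains.
struct File {
  std::vector<std::string> atomNames;
};

class ReaderToy {
public:
  // const: the Reader has no mutable members, so calls may run in parallel.
  std::error_code parseFile(std::unique_ptr<std::string> buffer,
                            std::vector<std::unique_ptr<File>> &result) const {
    if (!buffer)
      return std::make_error_code(std::errc::invalid_argument);

    // All parsing state is local to this call.
    std::vector<std::unique_ptr<File>> parsed;
    parsed.push_back(std::make_unique<File>());

    std::size_t pos = 0;
    while (pos <= buffer->size()) {
      std::size_t end = buffer->find('\n', pos);
      if (end == std::string::npos)
        end = buffer->size();
      std::string line = buffer->substr(pos, end - pos);
      pos = end + 1;
      if (line == "---")
        parsed.push_back(std::make_unique<File>()); // next member file
      else if (!line.empty())
        parsed.back()->atomNames.push_back(line);
    }

    // Publish the results only after the whole buffer parsed cleanly.
    for (std::unique_ptr<File> &file : parsed)
      result.push_back(std::move(file));
    return std::error_code(); // success
  }
};
```

Because the atom names are copied here, the buffer is simply released when
``parseFile()`` returns.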


Memory Ownership
----------------

If parseFile() is successful, it either passes ownership of the MemoryBuffer
to the File object, or it deletes the MemoryBuffer.  The former is done if the
Atoms contain pointers into the MemoryBuffer (e.g. StringRefs for symbols
or ArrayRefs for section content).  If parseFile() fails, the MemoryBuffer
must be deleted by the Reader.

Atoms are always owned by their File object. During core linking when Atoms
are coalesced or stripped away, core linking does not delete them.
Core linking just removes those unused Atoms from its internal list.
The destructor of a File object is responsible for deleting all Atoms it
owns, and if ownership of the MemoryBuffer was passed to it, the File
destructor needs to delete that too.
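
The ownership rules above can be sketched with standard library types. This is
an illustration under stated assumptions, not lld's implementation: ``File``,
``Atom``, and ``NameRef`` are simplified stand-ins, ``std::string`` replaces
``MemoryBuffer``, and ``NameRef`` plays the role of a ``StringRef`` pointing
into the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Non-owning view into the buffer (a simplified stand-in for StringRef).
struct NameRef {
  const char *data;
  std::size_t size;
  std::string str() const { return std::string(data, size); }
};

struct Atom {
  NameRef name;
};

class File {
public:
  // The File takes ownership of the buffer because its Atoms point into it.
  explicit File(std::unique_ptr<std::string> buffer)
      : buffer_(std::move(buffer)) {}

  // Creates an Atom named by bytes [offset, offset + size) of the buffer.
  const Atom *addAtom(std::size_t offset, std::size_t size) {
    assert(offset + size <= buffer_->size());
    atoms_.push_back(
        std::make_unique<Atom>(Atom{NameRef{buffer_->data() + offset, size}}));
    return atoms_.back().get();
  }

  std::size_t atomCount() const { return atoms_.size(); }

private:
  // Members are destroyed in reverse order: Atoms first, then the buffer.
  std::unique_ptr<std::string> buffer_;
  std::vector<std::unique_ptr<Atom>> atoms_;
};
```

A linker pass that holds a ``std::vector<const Atom *>`` can drop entries from
that list freely; the Atoms stay alive until the ``File`` itself is destroyed.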

Making Atoms
------------

The internal model of lld is purely Atom based, but most object files do not
have an explicit concept of Atoms; instead, most have "sections". The way
to think of this is that a section is just a list of Atoms with common
attributes.

The first step in parsing section-based object files is to cleave each
section into a list of Atoms. The technique may vary by section type. For
code sections (e.g. .text), there are usually symbols at the start of each
function. Those symbol addresses are the points at which the section is
cleaved into discrete Atoms.  Some file formats (like ELF) also include the
length of each symbol in the symbol table. Otherwise, the length of each
Atom is calculated to run to the start of the next symbol or the end of the
section.
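
A minimal sketch of symbol-based cleaving is shown below. The ``Symbol`` and
``AtomRange`` structs and the ``cleaveSection()`` function are hypothetical
names, and a ``size`` of zero is used here to mean "the format did not record
a length". The sketch ignores aliases (several symbols at one offset) and any
bytes that precede the first symbol.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Symbol {
  std::string name;
  std::size_t offset; // offset of the symbol within its section
  std::size_t size;   // 0 when the symbol table records no length
};

struct AtomRange {
  std::string name;
  std::size_t offset;
  std::size_t length;
};

// Cleaves a section of sectionSize bytes at each symbol offset.
std::vector<AtomRange> cleaveSection(std::vector<Symbol> symbols,
                                     std::size_t sectionSize) {
  std::sort(symbols.begin(), symbols.end(),
            [](const Symbol &a, const Symbol &b) { return a.offset < b.offset; });

  std::vector<AtomRange> atoms;
  atoms.reserve(symbols.size());
  for (std::size_t i = 0; i < symbols.size(); ++i) {
    assert(symbols[i].offset <= sectionSize);
    // Without an explicit size, run to the next symbol or the section end.
    std::size_t next =
        (i + 1 < symbols.size()) ? symbols[i + 1].offset : sectionSize;
    std::size_t length =
        symbols[i].size != 0 ? symbols[i].size : next - symbols[i].offset;
    atoms.push_back(AtomRange{symbols[i].name, symbols[i].offset, length});
  }
  return atoms;
}
```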

Other section types can be implicitly cleaved. For instance, C-string literals
or unwind info (e.g. .eh_frame) can be cleaved by having the Reader look at
the content of the section.  It is important to cleave sections into Atoms
to remove false dependencies. For instance, the .eh_frame section often
has no symbols, but contains "pointers" to the functions for which it
has unwind info.  If the .eh_frame section were not cleaved (but left as one
big Atom), there would always be a reference (from the .eh_frame Atom) to
each function, so the linker would be unable to coalesce or dead-strip
the function Atoms.
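
Content-based cleaving is easiest to see for C-string literals. The sketch
below assumes the section content is available as a ``std::string`` and that
each literal ends with a NUL byte; ``cleaveCStrings()`` is a hypothetical
helper, not an lld function.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns one (offset, length) pair per literal. The length includes the
// terminating NUL; an unterminated tail becomes a final piece of its own.
std::vector<std::pair<std::size_t, std::size_t>>
cleaveCStrings(const std::string &content) {
  std::vector<std::pair<std::size_t, std::size_t>> pieces;
  std::size_t start = 0;
  while (start < content.size()) {
    std::size_t nul = content.find('\0', start);
    std::size_t end = (nul == std::string::npos) ? content.size() : nul + 1;
    pieces.push_back(std::make_pair(start, end - start));
    start = end;
  }
  return pieces;
}
```

Each pair would become one anonymous Atom, so an unreferenced literal can be
stripped without affecting its neighbors.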

The lld Atom model also requires that a reference to an undefined symbol be
modeled as a Reference to an UndefinedAtom. So the Reader also needs to
create an UndefinedAtom for each undefined symbol in the object file.

Once all Atoms have been created, the second step is to create References
(recall that Atoms are "nodes" and References are "edges"). Most References
are created by looking at the "relocation records" in the object file. If
a function contains a call to "malloc", there is usually a relocation record
specifying the address in the section and the symbol table index. Your
Reader will need to convert the address into a source Atom and an offset
within it, and the symbol table index into a target Atom. If "malloc" is not
defined in the object file,
the target Atom of the Reference will be an UndefinedAtom.
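
Turning a relocation record into a Reference might look like the following
sketch. All of the types and both functions are hypothetical simplifications:
Atoms are described only by their section ranges (sorted by offset), and the
target is recorded by name together with a flag that says whether it would be
an UndefinedAtom.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct AtomRange {
  std::string name;
  std::size_t offset;
  std::size_t length;
};
struct SymbolEntry {
  std::string name;
  bool defined; // false for symbols such as an external "malloc"
};
struct Relocation {
  std::size_t address;     // offset in the section being fixed up
  std::size_t symbolIndex; // index into the symbol table
};
struct Reference {
  std::size_t sourceAtom;   // index of the Atom containing the fixup
  std::size_t offsetInAtom; // fixup location relative to that Atom
  std::string targetName;
  bool targetIsUndefined;
};

// Binary search over Atoms sorted by offset; returns atoms.size() on a miss.
std::size_t findAtom(const std::vector<AtomRange> &atoms, std::size_t address) {
  auto it = std::upper_bound(
      atoms.begin(), atoms.end(), address,
      [](std::size_t addr, const AtomRange &a) { return addr < a.offset; });
  if (it == atoms.begin())
    return atoms.size();
  --it;
  if (address - it->offset >= it->length)
    return atoms.size();
  return static_cast<std::size_t>(it - atoms.begin());
}

bool makeReference(const std::vector<AtomRange> &atoms,
                   const std::vector<SymbolEntry> &symbols,
                   const Relocation &reloc, Reference &out) {
  std::size_t source = findAtom(atoms, reloc.address);
  if (source == atoms.size() || reloc.symbolIndex >= symbols.size())
    return false; // malformed relocation record
  const SymbolEntry &target = symbols[reloc.symbolIndex];
  out = Reference{source, reloc.address - atoms[source].offset, target.name,
                  !target.defined};
  return true;
}
```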


Performance
-----------

Once you have the above working to parse an object file into Atoms and
References, you'll want to look at performance.  Some techniques that can
help performance are:

* Use llvm::BumpPtrAllocator or pre-allocate one big vector<Reference> and then
  just have each atom point to its subrange of References in that vector.
  This can be faster than allocating each Reference as a separate object.
* Pre-scan the symbol table to determine how many atoms are in each section,
  then allocate space for all the Atom objects at once.
* Don't copy symbol names or section content to each Atom; instead, use
  StringRef and ArrayRef in each Atom to point to its name and content in the
  MemoryBuffer.
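
The first two techniques can be sketched with a single shared vector of
References in which each Atom records only its subrange. ``FileSketch`` and
its methods are hypothetical names; the counts passed to the constructor are
assumed to come from a pre-scan of the symbol table and relocation records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Reference {
  std::size_t offsetInAtom;
  std::size_t targetAtom;
};

class FileSketch {
public:
  // The counts come from a pre-scan, so each vector allocates only once.
  FileSketch(std::size_t atomCount, std::size_t refCount) {
    atoms_.reserve(atomCount);
    refs_.reserve(refCount);
  }

  // Starts a new Atom; its References are those added until the next call.
  std::size_t beginAtom() {
    atoms_.push_back(AtomRecord{refs_.size(), 0});
    return atoms_.size() - 1;
  }

  void addReference(std::size_t offsetInAtom, std::size_t targetAtom) {
    assert(!atoms_.empty());
    refs_.push_back(Reference{offsetInAtom, targetAtom});
    ++atoms_.back().refCount;
  }

  // Each Atom's References are a contiguous subrange of the shared vector.
  const Reference *refsBegin(std::size_t atom) const {
    return refs_.data() + atoms_[atom].firstRef;
  }
  const Reference *refsEnd(std::size_t atom) const {
    return refsBegin(atom) + atoms_[atom].refCount;
  }

private:
  struct AtomRecord {
    std::size_t firstRef; // index into refs_, stable even if refs_ grows
    std::size_t refCount;
  };
  std::vector<AtomRecord> atoms_;
  std::vector<Reference> refs_;
};
```

Storing an index and a count, rather than pointers, keeps the Atom records
valid even if the pre-scan undercounts and the vector has to grow.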


Testing
-------

We are still working on infrastructure to test Readers. The issue is that
you don't want to check in binary files to the test suite. And the tools
for creating your object file from assembly source may not be available on
every OS.

We are investigating a way to use YAML to describe the sections, symbols,
and content of a file, and then have some code write out an object
file from that YAML description.

Once that is in place, you can write test cases that contain section/symbol
YAML and are run through the linker to produce Atom/Reference-based YAML, which
is then run through FileCheck to verify that the Atoms and References are as
expected.