Convert PDB docs to unix line endings. No other changes.

llvm-svn: 359712
Nico Weber 2019-05-01 19:15:05 +00:00
parent a0df4d37b0
commit 0a4aeec16e
7 changed files with 848 additions and 848 deletions


@ -1,3 +1,3 @@
=====================================
The PDB Global Symbol Stream
=====================================


@ -1,103 +1,103 @@
The PDB Serialized Hash Table Format
====================================
.. contents::
:local:
.. _hash_intro:
Introduction
============
One of the design goals of the PDB format is to provide accelerated access to
debug information. For this reason, there are several occasions where hash
tables are serialized and embedded directly in the file, rather than requiring
a consumer to read a list of values and reconstruct the hash table on the fly.
The serialization format supports hash tables of arbitrarily large size and
capacity, as well as arbitrary value types and hash functions. The only
supported key type is a uint32. The only requirement is that the producer and
consumer agree on the hash function. As such, the hash function is not
discussed further in this document; it is assumed that for a particular
instance of a PDB file hash table, the appropriate hash function is being used.
On-Disk Format
==============
.. code-block:: none
.--------------------.-- +0
| Size |
.--------------------.-- +4
| Capacity |
.--------------------.-- +8
| Present Bit Vector |
.--------------------.-- +N
| Deleted Bit Vector |
.--------------------.-- +M ─╮
| Key | │
.--------------------.-- +M+4 │
| Value | │
.--------------------.-- +M+4+sizeof(Value) │
... ├─ |Capacity| Bucket entries
.--------------------. │
| Key | │
.--------------------. │
| Value | │
.--------------------. ─╯
- **Size** - The number of values contained in the hash table.
- **Capacity** - The number of buckets in the hash table. Producers should
keep the number of occupied buckets no greater than ``2/3*Capacity+1`` (that
is, a load factor of roughly 2/3).
- **Present Bit Vector** - A serialized bit vector which contains information
about which buckets have valid values. If the bucket has a value, the
corresponding bit will be set, and if the bucket doesn't have a value (either
because the bucket is empty or because the value is a tombstone value) the bit
will be unset.
- **Deleted Bit Vector** - A serialized bit vector which contains information
about which buckets have tombstone values. If the entry in this bucket is
deleted, the bit will be set, otherwise it will be unset.
- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first
entry is the key (always a uint32), and the second entry is the value. The
state of each bucket (valid, empty, deleted) can be determined by examining
the present and deleted bit vectors.
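As a rough illustration of how the two bit vectors combine into a bucket state, consider the sketch below. The names ``BucketState``, ``classifyBucket``, ``maxLoad``, and ``withinLoadLimit`` are hypothetical and exist only for this example. It also assumes that the ``2/3*Capacity+1`` bound is evaluated with integer arithmetic, and that the present bit wins if both bits happen to be set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; they are not part of the format.
enum class BucketState { Empty, Valid, Deleted };

// Combine the present and deleted bits of one bucket into a single state.
// Assumption: if both bits are set, the present bit takes priority.
inline BucketState classifyBucket(bool IsPresent, bool IsDeleted) {
  if (IsPresent)
    return BucketState::Valid;
  return IsDeleted ? BucketState::Deleted : BucketState::Empty;
}

// Largest Size a producer should allow for a given Capacity, assuming the
// bound 2/3*Capacity+1 is computed with integer division.
inline uint32_t maxLoad(uint32_t Capacity) {
  return static_cast<uint32_t>(static_cast<uint64_t>(Capacity) * 2 / 3 + 1);
}

// True if a table holding Size values in Capacity buckets respects the bound.
inline bool withinLoadLimit(uint32_t Size, uint32_t Capacity) {
  assert(Size <= Capacity && "cannot hold more values than buckets");
  return Size <= maxLoad(Capacity);
}
```

For example, a table with 12 buckets would be allowed up to ``maxLoad(12)`` values before a producer following this rule grows it.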
.. _hash_bit_vectors:
Present and Deleted Bit Vectors
===============================
The bit vectors indicating the status of each bucket are serialized as follows:
.. code-block:: none
.--------------------.-- +0
| Word Count |
.--------------------.-- +4
| Word_0 | ─╮
.--------------------.-- +8 │
| Word_1 | │
.--------------------.-- +12 ├─ |Word Count| values
... │
.--------------------. │
| Word_N | │
.--------------------. ─╯
The words, when viewed as a contiguous block of bytes, represent a bit vector with
the following layout:
.. code-block:: none
.------------. .------------.------------.
| Word_N | ... | Word_1 | Word_0 |
.------------. .------------.------------.
| | | | |
+N*32 +(N-1)*32 +64 +32 +0
where the k'th bit of this bit vector represents the status of the k'th bucket
in the hash table.
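The layout above can be sketched as a small reader. This is a minimal sketch, not LLVM's implementation: ``readU32``, ``readBitVector``, and ``testBit`` are hypothetical helpers, the words are assumed to be little-endian, bit 0 is assumed to be the least significant bit of ``Word_0``, and buckets beyond the stored words are assumed to read as unset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a little-endian uint32 from Data at byte offset Off.
inline uint32_t readU32(const std::vector<uint8_t> &Data, size_t Off) {
  assert(Off + 4 <= Data.size() && "read past end of buffer");
  return uint32_t(Data[Off]) | (uint32_t(Data[Off + 1]) << 8) |
         (uint32_t(Data[Off + 2]) << 16) | (uint32_t(Data[Off + 3]) << 24);
}

// Parse a serialized bit vector (Word Count, then that many words) starting
// at Off, and advance Off past it.
inline std::vector<uint32_t> readBitVector(const std::vector<uint8_t> &Data,
                                           size_t &Off) {
  uint32_t WordCount = readU32(Data, Off);
  Off += 4;
  std::vector<uint32_t> Words;
  for (uint32_t I = 0; I < WordCount; ++I, Off += 4)
    Words.push_back(readU32(Data, Off));
  return Words;
}

// Bit K lives in word K / 32 at bit position K % 32.
// Assumption: bits beyond the stored words are treated as unset.
inline bool testBit(const std::vector<uint32_t> &Words, uint32_t K) {
  size_t Word = K / 32;
  return Word < Words.size() && ((Words[Word] >> (K % 32)) & 1u) != 0;
}
```

A consumer would call ``readBitVector`` twice in a row, once for the present vector and once for the deleted vector, and then query ``testBit`` with a bucket index.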


@ -1,80 +1,80 @@
=====================================
The Module Information Stream
=====================================
.. contents::
:local:
.. _modi_stream_intro:
Introduction
============
The Module Info Stream (henceforth referred to as the Modi stream) contains
information about a single module (object file, import library, etc.) that
contributes to the binary this PDB contains debug information about. There
is one modi stream for each module, and the mapping between modi stream index
and module is contained in the :doc:`DBI Stream <DbiStream>`. The modi stream
for a single module contains line information for the compiland, as well as
all CodeView information for the symbols defined in the compiland. Finally,
there is a "global refs" substream which is not well understood.
.. _modi_stream_layout:
Stream Layout
=============
A modi stream is laid out as follows:
.. code-block:: c++
struct ModiStream {
uint32_t Signature;
uint8_t Symbols[SymbolSize-4];
uint8_t C11LineInfo[C11Size];
uint8_t C13LineInfo[C13Size];
uint32_t GlobalRefsSize;
uint8_t GlobalRefs[GlobalRefsSize];
};
- **Signature** - Unknown. In practice only the value of ``4`` has been
observed. It is hypothesized that this value corresponds to the set of
``CV_SIGNATURE_xx`` defines in ``cvinfo.h``, with the value of ``4``
meaning that this module has C13 line information (as opposed to C11 line
information). A corollary of this is that we expect to only ever see
C13 line info, and that we do not understand the format of C11 line info.
- **Symbols** - The :ref:`CodeView Symbol Substream <modi_symbol_substream>`.
``SymbolSize`` is equal to the value of ``SymByteSize`` for the
corresponding module's entry in the :ref:`Module Info Substream <dbi_mod_info_substream>`
of the :doc:`DBI Stream <DbiStream>`.
- **C11LineInfo** - A block containing CodeView line information in C11
format. ``C11Size`` is equal to the value of ``C11ByteSize`` from the
:ref:`Module Info Substream <dbi_mod_info_substream>` of the
:doc:`DBI Stream <DbiStream>`. If this value is ``0``, then C11 line
information is not present. As mentioned previously, the format of
C11 line info is not understood and we assume all line information in modern PDBs
to be in C13 format.
- **C13LineInfo** - A block containing CodeView line information in C13
format. ``C13Size`` is equal to the value of ``C13ByteSize`` from the
:ref:`Module Info Substream <dbi_mod_info_substream>` of the
:doc:`DBI Stream <DbiStream>`. If this value is ``0``, then C13 line
information is not present.
- **GlobalRefs** - The meaning of this substream is not understood.
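To make the field sizes above concrete, the following sketch computes where each substream would begin inside a modi stream. ``ModiLayout`` and ``computeModiLayout`` are hypothetical names used only here. The sketch assumes the module has a symbol substream, so that ``SymByteSize`` is at least 4 and includes the 4-byte ``Signature``.

```cpp
#include <cassert>
#include <cstdint>

// Byte offsets of each part of one modi stream (hypothetical helper type).
struct ModiLayout {
  uint32_t SymbolsBegin;         // first byte after the 4-byte Signature
  uint32_t C11Begin;             // start of C11LineInfo
  uint32_t C13Begin;             // start of C13LineInfo
  uint32_t GlobalRefsSizeOffset; // where the uint32 GlobalRefsSize is stored
  uint32_t GlobalRefsBegin;      // start of the GlobalRefs bytes
};

// The three sizes come from the module's entry in the DBI stream.
inline ModiLayout computeModiLayout(uint32_t SymByteSize, uint32_t C11ByteSize,
                                    uint32_t C13ByteSize) {
  assert(SymByteSize >= 4 && "SymbolSize is assumed to cover the Signature");
  ModiLayout L;
  L.SymbolsBegin = 4;
  L.C11Begin = SymByteSize; // 4 + (SymbolSize - 4)
  L.C13Begin = L.C11Begin + C11ByteSize;
  L.GlobalRefsSizeOffset = L.C13Begin + C13ByteSize;
  L.GlobalRefsBegin = L.GlobalRefsSizeOffset + 4;
  return L;
}
```

With ``C11ByteSize`` equal to ``0``, ``C11Begin`` and ``C13Begin`` coincide, which matches the statement that C11 line information is then not present.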
.. _modi_symbol_substream:
The CodeView Symbol Substream
=============================
The CodeView Symbol Substream is an array of variable-length
records describing the functions, variables, inlining information,
and other symbols defined in the compiland. The entire array consumes
``SymbolSize-4`` bytes. The format of a CodeView Symbol Record (and
thus of an array of CodeView Symbol Records) is described in
:doc:`CodeViewSymbols`.


@ -1,179 +1,179 @@
=====================================
The MSF File Format
=====================================
.. contents::
:local:
.. _msf_layout:
File Layout
===========
The MSF file format consists of the following components:
1. :ref:`msf_superblock`
2. :ref:`msf_freeblockmap` (also known as Free Page Map, or FPM)
3. Data
Each component is stored as an indexed block, the length of which is specified
in ``SuperBlock::BlockSize``. The file consists of 1 or more iterations of the
following pattern (sometimes referred to as an "interval"):
1. 1 block of data
2. Free Block Map 1 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 1)
3. Free Block Map 2 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 2)
4. ``SuperBlock::BlockSize - 3`` blocks of data
In the first interval, the first data block is used to store
:ref:`msf_superblock`.
The following diagram demonstrates the general layout of the file (\| denotes
the end of an interval, and is for visualization purposes only):
+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+
| Block Index | 0 | 1 | 2 | 3 - 4095 | \| | 4096 | 4097 | 4098 | 4099 - 8191 | \| | ... |
+=============+=======================+==================+==================+==========+====+======+======+======+=============+====+=====+
| Meaning | :ref:`msf_superblock` | Free Block Map 1 | Free Block Map 2 | Data | \| | Data | FPM1 | FPM2 | Data | \| | ... |
+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+
The file may end after any block, including immediately after an FPM1.
.. note::
LLVM only supports 4096 byte blocks (sometimes referred to as the "BigMsf"
variant), so the rest of this document will assume a block size of 4096.
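The interval pattern described above can be sketched as a function that classifies a block index. ``BlockKind`` and ``classifyBlock`` are hypothetical names. The sketch only reflects the positional pattern and does not say whether a given FPM block is the active one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of a block by its position in the file.
enum class BlockKind { SuperBlock, FreeBlockMap1, FreeBlockMap2, Data };

// Each interval holds BlockSize blocks: one data block (the SuperBlock in the
// first interval), FPM1, FPM2, and then BlockSize - 3 data blocks.
inline BlockKind classifyBlock(uint32_t BlockIndex, uint32_t BlockSize = 4096) {
  assert(BlockSize > 3 && "an interval needs room for both FPM blocks");
  if (BlockIndex == 0)
    return BlockKind::SuperBlock;
  switch (BlockIndex % BlockSize) {
  case 1:
    return BlockKind::FreeBlockMap1;
  case 2:
    return BlockKind::FreeBlockMap2;
  default:
    return BlockKind::Data;
  }
}
```

For a 4096-byte block size, blocks 4097 and 4098 fall on the FPM positions of the second interval, as in the table.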
.. _msf_superblock:
The Superblock
==============
At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as
follows:
.. code-block:: c++
struct SuperBlock {
char FileMagic[sizeof(Magic)];
ulittle32_t BlockSize;
ulittle32_t FreeBlockMapBlock;
ulittle32_t NumBlocks;
ulittle32_t NumDirectoryBytes;
ulittle32_t Unknown;
ulittle32_t BlockMapAddr;
};
- **FileMagic** - Must be equal to ``"Microsoft C/C++ MSF 7.00\r\n"``
followed by the bytes ``1A 44 53 00 00 00``.
- **BlockSize** - The block size of the internal file system. Valid values are
512, 1024, 2048, and 4096 bytes. Certain aspects of the MSF file layout vary
depending on the block sizes. For the purposes of LLVM, we handle only block
sizes of 4KiB, and all further discussion assumes a block size of 4KiB.
- **FreeBlockMapBlock** - The index of a block within the file, at which begins
a bitfield representing the set of all blocks within the file which are "free"
(i.e. the data within that block is not used). See :ref:`msf_freeblockmap` for
more information.
**Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``!
- **NumBlocks** - The total number of blocks in the file. ``NumBlocks * BlockSize``
should equal the size of the file on disk.
- **NumDirectoryBytes** - The size of the stream directory, in bytes. The stream
directory contains information about each stream's size and the set of blocks
that it occupies. It will be described in more detail later.
- **BlockMapAddr** - The index of a block within the MSF file. At this block is
an array of ``ulittle32_t``'s listing the blocks that the stream directory
resides on. For large MSF files, the stream directory (which describes the
block layout of each stream) may not fit entirely on a single block. As a
result, this extra layer of indirection is introduced, whereby this block
contains the list of blocks that the stream directory occupies, and the stream
directory itself can be stitched together accordingly. The number of
``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.
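The constraints listed above can be gathered into a small validation sketch. ``SuperBlockFields``, ``numDirectoryBlocks``, and ``validateSuperBlock`` are hypothetical names, and the fields are assumed to have already been decoded from little-endian into host integers. The sketch treats the "should equal the file size" rule as a hard requirement and adds a bounds check on ``BlockMapAddr``; both are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// 26 characters of text, then the bytes 1A 44 53 00 00 00 (32 bytes total).
// The literal is split so that "\x1A" is not parsed together with 'D'.
static const char Magic[] = "Microsoft C/C++ MSF 7.00\r\n\x1A" "DS\0\0";
static_assert(sizeof(Magic) == 32, "magic must be 32 bytes");

// Hypothetical host-order copy of the on-disk SuperBlock.
struct SuperBlockFields {
  char FileMagic[32];
  uint32_t BlockSize;
  uint32_t FreeBlockMapBlock;
  uint32_t NumBlocks;
  uint32_t NumDirectoryBytes;
  uint32_t Unknown;
  uint32_t BlockMapAddr;
};

// ceil(NumDirectoryBytes / BlockSize): entries in the array at BlockMapAddr.
inline uint32_t numDirectoryBlocks(const SuperBlockFields &SB) {
  assert(SB.BlockSize != 0 && "block size must be validated first");
  return static_cast<uint32_t>(
      (uint64_t(SB.NumDirectoryBytes) + SB.BlockSize - 1) / SB.BlockSize);
}

inline bool validateSuperBlock(const SuperBlockFields &SB, uint64_t FileSize) {
  if (std::memcmp(SB.FileMagic, Magic, sizeof(Magic)) != 0)
    return false;
  if (SB.BlockSize != 512 && SB.BlockSize != 1024 && SB.BlockSize != 2048 &&
      SB.BlockSize != 4096)
    return false;
  if (SB.FreeBlockMapBlock != 1 && SB.FreeBlockMapBlock != 2)
    return false;
  if (uint64_t(SB.NumBlocks) * SB.BlockSize != FileSize)
    return false; // stricter than the "should" in the text
  return SB.BlockMapAddr < SB.NumBlocks; // extra sanity check (assumption)
}
```

A reader restricted to the 4096-byte variant, as LLVM is, would additionally reject the three smaller block sizes.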
.. _msf_freeblockmap:
The Free Block Map
==================
The Free Block Map (sometimes referred to as the Free Page Map, or FPM) is a
series of blocks which contains a bit flag for every block in the file. The
flag will be set to 0 if the block is in use, and 1 if the block is unused.
Each file contains two FPMs, one of which is active at any given time. This
feature is designed to support incremental and atomic updates of the underlying
MSF file. While writing to an MSF file, if the active FPM is FPM1, you can
write your new modified bitfield to FPM2, and vice versa. Only when you commit
the file to disk do you need to swap the value in the SuperBlock to point to
the new ``FreeBlockMapBlock``.
The Free Block Maps are stored as a series of single blocks throughout the file
at intervals of ``BlockSize`` blocks. Because each FPM block is of size ``BlockSize``
bytes, it contains 8 times as many bits as an interval has blocks. This means
that the first block of each FPM refers to the first 8 intervals of the file
(the first 32768 blocks), the second block of each FPM refers to the next 8
intervals, and so on. This results in far more FPM blocks being present than are
required, but in order to maintain backwards compatibility the format must stay
this way.
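The arithmetic above can be sketched as a lookup from a block index to the location of its free flag. ``FpmBitLocation`` and ``locateFpmBit`` are hypothetical names, and the sketch assumes that within each byte the flags are ordered from the least significant bit upward.

```cpp
#include <cassert>
#include <cstdint>

// Where the free/used flag of one block is stored (hypothetical helper type).
struct FpmBitLocation {
  uint32_t BlockIndex; // file block that holds the flag
  uint32_t ByteOffset; // byte within that block
  uint32_t BitInByte;  // bit within that byte (assumed LSB-first)
};

// Locate the flag for Block in the FPM selected by FreeBlockMapBlock (1 or 2).
inline FpmBitLocation locateFpmBit(uint32_t Block, uint32_t FreeBlockMapBlock,
                                   uint32_t BlockSize = 4096) {
  assert((FreeBlockMapBlock == 1 || FreeBlockMapBlock == 2) &&
         "FreeBlockMapBlock can only be 1 or 2");
  uint32_t BitsPerFpmBlock = BlockSize * 8; // blocks covered by one FPM block
  uint32_t FpmChunk = Block / BitsPerFpmBlock;
  FpmBitLocation Loc;
  // The FPM blocks that are actually used sit in consecutive intervals.
  Loc.BlockIndex = FreeBlockMapBlock + FpmChunk * BlockSize;
  Loc.ByteOffset = (Block % BitsPerFpmBlock) / 8;
  Loc.BitInByte = Block % 8;
  return Loc;
}
```

With 4096-byte blocks, block 32767 is the last one described by the first FPM block, and block 32768 is the first one described by the FPM block in the second interval.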
The Stream Directory
====================
The Stream Directory is the root of all access to the other streams in an MSF
file. Beginning at byte 0 of the stream directory is the following structure:
.. code-block:: c++
struct StreamDirectory {
ulittle32_t NumStreams;
ulittle32_t StreamSizes[NumStreams];
ulittle32_t StreamBlocks[NumStreams][];
};
And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.
Note that each of the last two arrays is of variable length, and in particular
that the second array is jagged.
**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4
streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.
Stream 0: ceil(1000 / 4096) = 1 block
Stream 1: ceil(8000 / 4096) = 2 blocks
Stream 2: ceil(16000 / 4096) = 4 blocks
Stream 3: ceil(9000 / 4096) = 3 blocks
In total, 10 blocks are used. Let's see what the stream directory might look
like:
.. code-block:: c++
struct StreamDirectory {
ulittle32_t NumStreams = 4;
ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};
ulittle32_t StreamBlocks[][] = {
{4},
{5, 6},
{11, 9, 7, 8},
{10, 15, 12}
};
};
In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``
would equal ``60``, and the block at ``SuperBlock->BlockMapAddr`` would hold an
array of one ``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.
Note also that the streams are discontiguous, and that part of stream 3 is in the
middle of part of stream 2. You cannot assume anything about the layout of the
blocks!
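The example above can be replayed with a small parser. This is a sketch under stated assumptions: ``ParsedDirectory`` and ``parseDirectory`` are hypothetical names, the directory is assumed to have already been stitched together from its blocks and decoded into host-order ``uint32_t`` words, and malformed input is caught only by assertions rather than reported as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decoded form of the stream directory (hypothetical helper type).
struct ParsedDirectory {
  std::vector<uint32_t> StreamSizes;
  std::vector<std::vector<uint32_t>> StreamBlocks; // jagged: one list per stream
};

inline ParsedDirectory parseDirectory(const std::vector<uint32_t> &Words,
                                      uint32_t BlockSize = 4096) {
  assert(!Words.empty() && "directory must start with NumStreams");
  ParsedDirectory Dir;
  size_t Pos = 0;
  uint32_t NumStreams = Words[Pos++];
  assert(Words.size() - 1 >= NumStreams && "truncated StreamSizes");
  for (uint32_t I = 0; I < NumStreams; ++I)
    Dir.StreamSizes.push_back(Words[Pos++]);
  for (uint32_t Size : Dir.StreamSizes) {
    // ceil(Size / BlockSize) block indices follow for this stream.
    uint32_t NumBlocks =
        static_cast<uint32_t>((uint64_t(Size) + BlockSize - 1) / BlockSize);
    assert(Pos + NumBlocks <= Words.size() && "truncated StreamBlocks");
    auto First = Words.begin() + static_cast<std::ptrdiff_t>(Pos);
    Dir.StreamBlocks.emplace_back(First, First + NumBlocks);
    Pos += NumBlocks;
  }
  return Dir;
}
```

Feeding it the 15 words of the example yields four block lists of lengths 1, 2, 4, and 3.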
Alignment and Block Boundaries
==============================
As may be clear by now, it is possible for a single field (whether it be a
high-level record, a long string field, or even a single ``uint16``) to begin and
end in separate blocks. For example, if the block size is 4096 bytes, and a
``uint16`` field begins at the last byte of the current block, then it would
need to end on the first byte of the next block. Since blocks are not
necessarily contiguously laid out in the file, this means that both the consumer
and the producer of an MSF file must be prepared to split data apart
accordingly. In the aforementioned example, because values are stored in
little-endian order, the low byte of the ``uint16`` would be written to the last
byte of one block of the stream, and the high byte would be written to the first
byte of the stream's next block, which could be tens of thousands of bytes later
(or even earlier!) in the file, depending on what the stream directory says.
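One way to handle such a split field is to translate every stream offset to a file offset individually. The sketch below does this for a ``uint16``. ``streamToFileOffset`` and ``readU16`` are hypothetical helpers, and the code assumes little-endian byte order, as the ``ulittle32_t`` fields elsewhere in the format suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a byte offset within a stream to a byte offset within the file, given
// the stream's block list from the stream directory.
inline uint64_t streamToFileOffset(const std::vector<uint32_t> &Blocks,
                                   uint64_t StreamOffset,
                                   uint32_t BlockSize = 4096) {
  uint64_t Index = StreamOffset / BlockSize;
  assert(Index < Blocks.size() && "offset past end of stream");
  return uint64_t(Blocks[static_cast<size_t>(Index)]) * BlockSize +
         StreamOffset % BlockSize;
}

// Read a uint16 one byte at a time so that it may straddle two blocks.
// Assumption: the value is stored low byte first (little-endian).
inline uint16_t readU16(const std::vector<uint8_t> &File,
                        const std::vector<uint32_t> &Blocks,
                        uint64_t StreamOffset, uint32_t BlockSize = 4096) {
  uint8_t Lo = File.at(static_cast<size_t>(
      streamToFileOffset(Blocks, StreamOffset, BlockSize)));
  uint8_t Hi = File.at(static_cast<size_t>(
      streamToFileOffset(Blocks, StreamOffset + 1, BlockSize)));
  return static_cast<uint16_t>(Lo | (Hi << 8));
}
```

Real readers usually copy whole runs of bytes per block rather than single bytes; the per-byte form is used here only because it makes the boundary case obvious.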


@ -1,3 +1,3 @@
=====================================
The PDB Public Symbol Stream
=====================================


@ -1,312 +1,312 @@
=====================================
The PDB TPI and IPI Streams
=====================================
.. contents::
:local:
.. _tpi_intro:
Introduction
============
The PDB TPI Stream (Index 2) and IPI Stream (Index 4) contain information about
all types used in the program. Each stream is organized as a :ref:`header <tpi_header>`
followed by a list of :doc:`CodeView Type Records <CodeViewTypes>`. Types are
referenced from various streams and records throughout the PDB by their
:ref:`type index <type_indices>`. In general, the sequence of type records
following the :ref:`header <tpi_header>` forms a topologically sorted DAG
(directed acyclic graph), which means that a type record B can only refer to
the type A if ``A.TypeIndex < B.TypeIndex``. While there are rare cases where
this property will not hold (particularly when dealing with object files
compiled with MASM), an implementation should try very hard to make this
property hold, as it means the entire type graph can be constructed in a single
pass.
.. important::
Type records form a topologically sorted DAG (directed acyclic graph).
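The ordering property can be checked in a single pass, as the following sketch suggests. It uses a deliberately simplified, hypothetical representation in which each record is reduced to the list of type indices it refers to, and it assumes the first record has type index ``0x1000``.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Refs[I] lists the type indices referenced by the I'th record after the
// header (a simplified stand-in for real CodeView type records).
inline bool
isTopologicallySorted(const std::vector<std::vector<uint32_t>> &Refs,
                      uint32_t TypeIndexBegin = 0x1000) {
  assert(Refs.size() <= 0xFFFFFFFFu - TypeIndexBegin &&
         "type index would overflow 32 bits");
  for (size_t I = 0; I < Refs.size(); ++I) {
    uint32_t Self = TypeIndexBegin + static_cast<uint32_t>(I);
    for (uint32_t Ref : Refs[I])
      if (Ref >= Self) // a record may only refer to strictly earlier types
        return false;
  }
  return true;
}
```

Indices below ``TypeIndexBegin`` always pass this check, since they are smaller than the index of any record in the stream.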
.. _tpi_ipi:
TPI vs IPI Stream
=================
Recent versions of the PDB format (that is, all versions covered by this document)
have 2 streams with identical layout, henceforth referred to as the TPI stream
and IPI stream. Subsequent contents of this document describing the on-disk
format apply equally whether it is for the TPI Stream or the IPI Stream. The
only difference between the two is in *which* CodeView records are allowed to
appear in each one, summarized by the following table:
+----------------------+---------------------+
| TPI Stream | IPI Stream |
+======================+=====================+
| LF_POINTER | LF_FUNC_ID |
+----------------------+---------------------+
| LF_MODIFIER | LF_MFUNC_ID |
+----------------------+---------------------+
| LF_PROCEDURE | LF_BUILDINFO |
+----------------------+---------------------+
| LF_MFUNCTION | LF_SUBSTR_LIST |
+----------------------+---------------------+
| LF_LABEL | LF_STRING_ID |
+----------------------+---------------------+
| LF_ARGLIST | LF_UDT_SRC_LINE |
+----------------------+---------------------+
| LF_FIELDLIST | LF_UDT_MOD_SRC_LINE |
+----------------------+---------------------+
| LF_ARRAY | |
+----------------------+---------------------+
| LF_CLASS | |
+----------------------+---------------------+
| LF_STRUCTURE | |
+----------------------+---------------------+
| LF_INTERFACE | |
+----------------------+---------------------+
| LF_UNION | |
+----------------------+---------------------+
| LF_ENUM | |
+----------------------+---------------------+
| LF_TYPESERVER2 | |
+----------------------+---------------------+
| LF_VFTABLE | |
+----------------------+---------------------+
| LF_VTSHAPE | |
+----------------------+---------------------+
| LF_BITFIELD | |
+----------------------+---------------------+
| LF_METHODLIST | |
+----------------------+---------------------+
| LF_PRECOMP | |
+----------------------+---------------------+
| LF_ENDPRECOMP | |
+----------------------+---------------------+
The usage of these records is described in more detail in
:doc:`CodeView Type Records <CodeViewTypes>`.
.. _type_indices:
Type Indices
============
A type index is a 32-bit integer that uniquely identifies a type inside of an
object file's ``.debug$T`` section or a PDB file's TPI or IPI stream. The
value of the type index for the first type record from the TPI stream is given
by the ``TypeIndexBegin`` member of the :ref:`TPI Stream Header <tpi_header>`
although in practice this value is always equal to 0x1000 (4096).
Any type index with the high bit set is considered to come from the IPI stream,
although this appears to be more of a hack, and LLVM does not generate type
indices of this nature. They can, however, be observed in Microsoft PDBs
occasionally, so one should be prepared to handle them. Note that having the
high bit set is not a necessary condition for a type index to come from the IPI
stream; it is only a sufficient one.
Once the high bit is cleared, any type index >= ``TypeIndexBegin`` is presumed
to come from the appropriate stream, and any type index less than this is a
bitmask which can be decomposed as follows:
.. code-block:: none
.---------------------------.------.----------.
|           Unused          | Mode |   Kind   |
'---------------------------'------'----------'
|+32                        |+12   |+8        |+0
- **Kind** - A value from the following enum:
.. code-block:: c++
enum class SimpleTypeKind : uint32_t {
None = 0x0000, // uncharacterized type (no type)
Void = 0x0003, // void
NotTranslated = 0x0007, // type not translated by cvpack
HResult = 0x0008, // OLE/COM HRESULT
SignedCharacter = 0x0010, // 8 bit signed
UnsignedCharacter = 0x0020, // 8 bit unsigned
NarrowCharacter = 0x0070, // really a char
WideCharacter = 0x0071, // wide char
Character16 = 0x007a, // char16_t
Character32 = 0x007b, // char32_t
SByte = 0x0068, // 8 bit signed int
Byte = 0x0069, // 8 bit unsigned int
Int16Short = 0x0011, // 16 bit signed
UInt16Short = 0x0021, // 16 bit unsigned
Int16 = 0x0072, // 16 bit signed int
UInt16 = 0x0073, // 16 bit unsigned int
Int32Long = 0x0012, // 32 bit signed
UInt32Long = 0x0022, // 32 bit unsigned
Int32 = 0x0074, // 32 bit signed int
UInt32 = 0x0075, // 32 bit unsigned int
Int64Quad = 0x0013, // 64 bit signed
UInt64Quad = 0x0023, // 64 bit unsigned
Int64 = 0x0076, // 64 bit signed int
UInt64 = 0x0077, // 64 bit unsigned int
Int128Oct = 0x0014, // 128 bit signed int
UInt128Oct = 0x0024, // 128 bit unsigned int
Int128 = 0x0078, // 128 bit signed int
UInt128 = 0x0079, // 128 bit unsigned int
Float16 = 0x0046, // 16 bit real
Float32 = 0x0040, // 32 bit real
Float32PartialPrecision = 0x0045, // 32 bit PP real
Float48 = 0x0044, // 48 bit real
Float64 = 0x0041, // 64 bit real
Float80 = 0x0042, // 80 bit real
Float128 = 0x0043, // 128 bit real
Complex16 = 0x0056, // 16 bit complex
Complex32 = 0x0050, // 32 bit complex
Complex32PartialPrecision = 0x0055, // 32 bit PP complex
Complex48 = 0x0054, // 48 bit complex
Complex64 = 0x0051, // 64 bit complex
Complex80 = 0x0052, // 80 bit complex
Complex128 = 0x0053, // 128 bit complex
Boolean8 = 0x0030, // 8 bit boolean
Boolean16 = 0x0031, // 16 bit boolean
Boolean32 = 0x0032, // 32 bit boolean
Boolean64 = 0x0033, // 64 bit boolean
Boolean128 = 0x0034, // 128 bit boolean
};
- **Mode** - A value from the following enum:
.. code-block:: c++
enum class SimpleTypeMode : uint32_t {
Direct = 0, // Not a pointer
NearPointer = 1, // Near pointer
FarPointer = 2, // Far pointer
HugePointer = 3, // Huge pointer
NearPointer32 = 4, // 32 bit near pointer
FarPointer32 = 5, // 32 bit far pointer
NearPointer64 = 6, // 64 bit near pointer
NearPointer128 = 7 // 128 bit near pointer
};
Note that for pointers, the bitness is represented in the mode. So a ``void*``
would have a type index with ``Mode=NearPointer32, Kind=Void`` if built for 32-bits
but a type index with ``Mode=NearPointer64, Kind=Void`` if built for 64-bits.
By convention, the type index for ``std::nullptr_t`` is constructed the same way
as the type index for ``void*``, but using the bitless enumeration value
``NearPointer``.
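The decomposition above can be sketched in a few lines of C++. This is a minimal illustration, not LLVM's implementation: the helper names are hypothetical, the masks are taken directly from the bit diagram (Kind in bits 0-7, Mode in bits 8-11), and ``TypeIndexBegin`` is assumed to be 0x1000 unless the caller passes the value from the header.

```cpp
#include <cstdint>

// Hypothetical names; only the bit positions come from the diagram above.
const uint32_t kIpiHighBit = 0x80000000u; // high bit: index refers to the IPI stream
const uint32_t kKindMask = 0x000000ffu;   // bits 0-7
const uint32_t kModeMask = 0x00000f00u;   // bits 8-11
const uint32_t kModeShift = 8;

struct SimpleTypeParts {
  uint32_t Mode; // a SimpleTypeMode value
  uint32_t Kind; // a SimpleTypeKind value
};

inline bool hasIpiHighBit(uint32_t Index) { return (Index & kIpiHighBit) != 0; }

// True if, once the high bit is cleared, the index falls in the reserved
// range below TypeIndexBegin and is therefore a Mode/Kind bitmask.
inline bool isSimpleTypeIndex(uint32_t Index, uint32_t TypeIndexBegin = 0x1000) {
  return (Index & ~kIpiHighBit) < TypeIndexBegin;
}

// Split a simple type index into its Mode and Kind fields.
inline SimpleTypeParts decomposeSimpleTypeIndex(uint32_t Index) {
  uint32_t Low = Index & ~kIpiHighBit;
  SimpleTypeParts Parts = {(Low & kModeMask) >> kModeShift, Low & kKindMask};
  return Parts;
}

// Build a simple type index from a Mode and a Kind.
inline uint32_t makeSimpleTypeIndex(uint32_t Mode, uint32_t Kind) {
  return ((Mode << kModeShift) & kModeMask) | (Kind & kKindMask);
}
```

With these helpers, ``makeSimpleTypeIndex(6, 0x03)`` (``NearPointer64`` and ``Void``) yields 0x0603, the 64-bit ``void*``, while the same call with mode 4 (``NearPointer32``) yields 0x0403.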
.. _tpi_header:
Stream Header
=============
At offset 0 of the TPI Stream is a header with the following layout:
.. code-block:: c++
struct TpiStreamHeader {
uint32_t Version;
uint32_t HeaderSize;
uint32_t TypeIndexBegin;
uint32_t TypeIndexEnd;
uint32_t TypeRecordBytes;
uint16_t HashStreamIndex;
uint16_t HashAuxStreamIndex;
uint32_t HashKeySize;
uint32_t NumHashBuckets;
int32_t HashValueBufferOffset;
uint32_t HashValueBufferLength;
int32_t IndexOffsetBufferOffset;
uint32_t IndexOffsetBufferLength;
int32_t HashAdjBufferOffset;
uint32_t HashAdjBufferLength;
};
- **Version** - A value from the following enum.
.. code-block:: c++
enum class TpiStreamVersion : uint32_t {
V40 = 19950410,
V41 = 19951122,
V50 = 19961031,
V70 = 19990903,
V80 = 20040203,
};
Similar to the :doc:`PDB Stream <PdbStream>`, this value always appears to be
``V80``, and no other values have been observed. It is assumed that should
another value be observed, the layout described by this document may not be
accurate.
- **HeaderSize** - ``sizeof(TpiStreamHeader)``
- **TypeIndexBegin** - The numeric value of the type index representing the
first type record in the TPI stream. This is usually the value 0x1000 as type
indices lower than this are reserved (see :ref:`Type Indices <type_indices>` for
a discussion of reserved type indices).
- **TypeIndexEnd** - One greater than the numeric value of the type index
representing the last type record in the TPI stream. The total number of type
records in the TPI stream can be computed as ``TypeIndexEnd - TypeIndexBegin``.
- **TypeRecordBytes** - The number of bytes of type record data following the header.
- **HashStreamIndex** - The index of a stream which contains a list of hashes for
every type record. This value may be -1 (``0xFFFF``, since the field is a
``uint16_t``), indicating that hash information is not present. In practice a
valid stream index is always observed, so any producer implementation should be
prepared to emit this stream to ensure compatibility with tools which may expect
it to be present.
- **HashAuxStreamIndex** - Presumably the index of a stream which contains a separate
hash table, although this has not been observed in practice and it's unclear what it
might be used for.
- **HashKeySize** - The size of a hash value (usually 4 bytes).
- **NumHashBuckets** - The number of buckets used to generate the hash values in the
aforementioned hash streams.
- **HashValueBufferOffset / HashValueBufferLength** - The offset and size within
the TPI Hash Stream of the list of hash values. It should be assumed that there
are either 0 hash values, or a number equal to the number of type records in the
TPI stream (``TypeIndexEnd - TypeIndexBegin``). Thus, if ``HashValueBufferLength``
is neither 0 nor equal to ``(TypeIndexEnd - TypeIndexBegin) * HashKeySize``, we
can consider the PDB malformed.
- **IndexOffsetBufferOffset / IndexOffsetBufferLength** - The offset and size
within the TPI Hash Stream of the Type Index Offsets Buffer. This is a list of
pairs of uint32_t's where the first value is a :ref:`Type Index <type_indices>`
and the second value is the offset in the type record data of the type with this
index. This can be used to do a binary search followed by a linear search to
get amortized O(log n) lookup by type index.
- **HashAdjBufferOffset / HashAdjBufferLength** - The offset and size within
the TPI hash stream of a serialized hash table whose keys are the hash values
in the hash value buffer and whose values are type indices. This appears to
be useful in incremental linking scenarios, so that if a type is modified an
entry can be created mapping the old hash value to the new type index so that
a PDB file consumer can always have the most up to date version of the type
without forcing the incremental linker to garbage collect and update
references that point to the old version to now point to the new version.
The layout of this hash table is described in :doc:`HashTable`.
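The consistency rules stated in the field descriptions above can be checked mechanically. The following is a sketch under stated assumptions: the struct repeats the layout given earlier, the helper names are hypothetical, and the ``static_assert`` assumes a common ABI in which the struct has no padding.

```cpp
#include <cstdint>

struct TpiStreamHeader {
  uint32_t Version;
  uint32_t HeaderSize;
  uint32_t TypeIndexBegin;
  uint32_t TypeIndexEnd;
  uint32_t TypeRecordBytes;
  uint16_t HashStreamIndex;
  uint16_t HashAuxStreamIndex;
  uint32_t HashKeySize;
  uint32_t NumHashBuckets;
  int32_t HashValueBufferOffset;
  uint32_t HashValueBufferLength;
  int32_t IndexOffsetBufferOffset;
  uint32_t IndexOffsetBufferLength;
  int32_t HashAdjBufferOffset;
  uint32_t HashAdjBufferLength;
};

// Five uint32_t, two uint16_t, eight 32-bit fields: 20 + 4 + 32 bytes.
static_assert(sizeof(TpiStreamHeader) == 56, "unexpected padding in header");

// Number of type records in the stream (hypothetical helper).
inline uint32_t numTypeRecords(const TpiStreamHeader &H) {
  return H.TypeIndexEnd - H.TypeIndexBegin;
}

// -1 stored in a uint16_t field reads back as 0xFFFF.
inline bool hasHashStream(const TpiStreamHeader &H) {
  return H.HashStreamIndex != 0xFFFF;
}

// The hash value buffer holds either no hashes or exactly one per record.
inline bool hashValueBufferLooksValid(const TpiStreamHeader &H) {
  if (H.TypeIndexEnd < H.TypeIndexBegin)
    return false;
  if (H.HashValueBufferLength == 0)
    return true;
  uint64_t Expected = uint64_t(numTypeRecords(H)) * H.HashKeySize;
  return H.HashValueBufferLength == Expected;
}
```

A reader would typically run checks like these right after copying the header out of the stream and before trusting any of the buffer offsets.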
.. _tpi_records:
CodeView Type Record List
=========================
Following the header, there are ``TypeRecordBytes`` bytes of data that represent a
variable length array of :doc:`CodeView type records <CodeViewTypes>`. The number
of such records (i.e. the length of the array) can be determined by computing the
value ``Header.TypeIndexEnd - Header.TypeIndexBegin``.
O(log n) random access is provided by way of the Type Index Offsets array (if
present) described previously.
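The binary-search half of that lookup might look like the sketch below. It assumes the Type Index Offsets entries have already been read into memory and are sorted by type index; the struct and function names are hypothetical. The function returns the entry from which a linear walk over the records should begin.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One pair from the Type Index Offsets buffer (hypothetical struct name).
struct TypeIndexOffset {
  uint32_t Index;  // type index of a record
  uint32_t Offset; // offset of that record within the type record data
};

// Returns the entry with the greatest Index <= Target, or nullptr if every
// entry has a larger index. Assumes Entries is sorted by Index.
inline const TypeIndexOffset *
findScanStart(const std::vector<TypeIndexOffset> &Entries, uint32_t Target) {
  auto It = std::upper_bound(
      Entries.begin(), Entries.end(), Target,
      [](uint32_t T, const TypeIndexOffset &E) { return T < E.Index; });
  if (It == Entries.begin())
    return nullptr;
  return &*(It - 1);
}
```

From the returned entry, a consumer would step forward ``Target - Entry->Index`` records, starting at ``Entry->Offset``, to reach the requested type.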

View File

@ -1,168 +1,168 @@
=====================================
The PDB File Format
=====================================
.. contents::
:local:
.. _pdb_intro:
Introduction
============
PDB (Program Database) is a file format invented by Microsoft that contains
debug information that can be consumed by debuggers and other tools. Since
officially supported APIs exist on Windows for querying debug information from
PDBs even without the user understanding the internals of the file format, a
large ecosystem of tools has been built for Windows to consume this format. In
order for Clang to be able to generate programs that can interoperate with these
tools, it is necessary for us to generate PDB files ourselves.
At the same time, LLVM has a long history of being able to cross-compile from
any platform to any platform, and we wish for the same to be true here. So it
is necessary for us to understand the PDB file format at the byte-level so that
we can generate PDB files entirely on our own.
This manual describes what we know about the PDB file format today: the layout
of the file, the various streams contained within, the format of individual
records within, and more.
We would like to extend our heartfelt gratitude to Microsoft, without whom we
would not be where we are today. Much of the knowledge contained within this
manual was learned through reading code published by Microsoft on their `GitHub
repo <https://github.com/Microsoft/microsoft-pdb>`__.
.. _pdb_layout:
File Layout
===========
.. important::
Unless otherwise specified, all numeric values are encoded in little endian.
If you see a type such as ``uint16_t`` or ``uint64_t`` going forward, always
assume it is little endian!
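Because of this rule, a portable reader should assemble multi-byte values from individual bytes rather than casting a buffer pointer, which would only work on a little-endian host. A minimal sketch, with hypothetical helper names:

```cpp
#include <cstdint>

// Read a little-endian 16-bit value from a byte buffer.
inline uint16_t readLE16(const uint8_t *P) {
  return static_cast<uint16_t>(uint32_t(P[0]) | (uint32_t(P[1]) << 8));
}

// Read a little-endian 32-bit value from a byte buffer.
inline uint32_t readLE32(const uint8_t *P) {
  return uint32_t(P[0]) | (uint32_t(P[1]) << 8) | (uint32_t(P[2]) << 16) |
         (uint32_t(P[3]) << 24);
}
```

For instance, the four bytes ``0B CA 31 01`` decode to 20040203.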
.. toctree::
:hidden:
MsfFile
PdbStream
TpiStream
DbiStream
ModiStream
PublicStream
GlobalStream
HashTable
CodeViewSymbols
CodeViewTypes
.. _msf:
The MSF Container
-----------------
A PDB file is really just a special case of an MSF (Multi-Stream Format) file.
An MSF file is actually a miniature "file system within a file". It contains
multiple streams (aka files) which can represent arbitrary data, and these
streams are divided into blocks which may not necessarily be contiguously
laid out within the file (aka fragmented). Additionally, the MSF contains a
stream directory (aka MFT) which describes how the streams (files) are laid
out within the MSF.
For more information about the MSF container format, stream directory, and
block layout, see :doc:`MsfFile`.
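To make the "file system within a file" idea concrete, the sketch below reassembles one stream from a list of possibly non-contiguous blocks. It is only an illustration: the function name and parameters are hypothetical, and it assumes the block size, the stream's block list, and the stream's length have already been obtained from the stream directory.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenate the listed blocks of File into one contiguous stream buffer.
// Stops early if a block lies outside the file (treated as malformed input).
inline std::vector<uint8_t> gatherStream(const std::vector<uint8_t> &File,
                                         const std::vector<uint32_t> &Blocks,
                                         std::size_t BlockSize,
                                         std::size_t StreamSize) {
  std::vector<uint8_t> Out;
  Out.reserve(StreamSize);
  for (uint32_t Block : Blocks) {
    if (Out.size() >= StreamSize)
      break;
    std::size_t Begin = std::size_t(Block) * BlockSize;
    if (Begin >= File.size())
      break;
    // The last block of a stream may be only partially used.
    std::size_t Want = std::min(BlockSize, StreamSize - Out.size());
    std::size_t Avail = std::min(Want, File.size() - Begin);
    auto First = File.begin() + static_cast<std::ptrdiff_t>(Begin);
    Out.insert(Out.end(), First, First + static_cast<std::ptrdiff_t>(Avail));
  }
  return Out;
}
```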
.. _streams:
Streams
-------
The PDB format contains a number of streams which describe various information
such as the types, symbols, source files, and compilands (e.g. object files)
of a program, as well as some additional streams containing hash tables that are
used by debuggers and other tools to provide fast lookup of records and types
by name, and various other information about how the program was compiled such
as the specific toolchain used, and more. A summary of streams contained in a
PDB file is as follows:
+--------------------+------------------------------+-------------------------------------------+
| Name | Stream Index | Contents |
+====================+==============================+===========================================+
| Old Directory | - Fixed Stream Index 0 | - Previous MSF Stream Directory |
+--------------------+------------------------------+-------------------------------------------+
| PDB Stream | - Fixed Stream Index 1 | - Basic File Information |
| | | - Fields to match EXE to this PDB |
| | | - Map of named streams to stream indices |
+--------------------+------------------------------+-------------------------------------------+
| TPI Stream | - Fixed Stream Index 2 | - CodeView Type Records |
| | | - Index of TPI Hash Stream |
+--------------------+------------------------------+-------------------------------------------+
| DBI Stream | - Fixed Stream Index 3 | - Module/Compiland Information |
| | | - Indices of individual module streams |
| | | - Indices of public / global streams |
| | | - Section Contribution Information |
| | | - Source File Information |
| | | - References to streams containing |
| | | FPO / PGO Data |
+--------------------+------------------------------+-------------------------------------------+
| IPI Stream | - Fixed Stream Index 4 | - CodeView Type Records |
| | | - Index of IPI Hash Stream |
+--------------------+------------------------------+-------------------------------------------+
| /LinkInfo | - Contained in PDB Stream | - Unknown |
| | Named Stream map | |
+--------------------+------------------------------+-------------------------------------------+
| /src/headerblock | - Contained in PDB Stream | - Summary of embedded source file content |
| | Named Stream map | (e.g. natvis files) |
+--------------------+------------------------------+-------------------------------------------+
| /names | - Contained in PDB Stream | - PDB-wide global string table used for |
| | Named Stream map | string de-duplication |
+--------------------+------------------------------+-------------------------------------------+
| Module Info Stream | - Contained in DBI Stream | - CodeView Symbol Records for this module |
| | - One for each compiland | - Line Number Information |
+--------------------+------------------------------+-------------------------------------------+
| Public Stream | - Contained in DBI Stream | - Public (Exported) Symbol Records |
| | | - Index of Public Hash Stream |
+--------------------+------------------------------+-------------------------------------------+
| Global Stream      | - Contained in DBI Stream    | - Single combined master symbol table     |
| | | - Index of Global Hash Stream |
+--------------------+------------------------------+-------------------------------------------+
| TPI Hash Stream | - Contained in TPI Stream | - Hash table for looking up TPI records |
| | | by name |
+--------------------+------------------------------+-------------------------------------------+
| IPI Hash Stream | - Contained in IPI Stream | - Hash table for looking up IPI records |
| | | by name |
+--------------------+------------------------------+-------------------------------------------+
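The split between fixed and looked-up streams in the table above can be sketched as follows. This is a minimal illustration in Python, not LLVM's implementation: ``FixedStream``, ``resolve_stream``, and the contents of ``example_map`` are hypothetical names and values chosen for this sketch, and the named stream map is modelled as a plain dictionary rather than the serialized hash table that the PDB Stream stores.

```python
from enum import IntEnum


class FixedStream(IntEnum):
    """Streams whose index is fixed, as listed in the table above."""
    OLD_DIRECTORY = 0
    PDB = 1
    TPI = 2
    DBI = 3
    IPI = 4


def resolve_stream(name, named_stream_map):
    """Return the stream index for a fixed stream name or a named stream.

    `named_stream_map` stands in for the map stored in the PDB Stream;
    here it is a plain dict from name to stream index (an assumption).
    """
    if name in FixedStream.__members__:
        return int(FixedStream[name])
    if name in named_stream_map:
        return named_stream_map[name]
    raise KeyError(f"no stream named {name!r}")


# Hypothetical indices; real values differ from one PDB file to another.
example_map = {"/LinkInfo": 5, "/src/headerblock": 6, "/names": 7}
```

Streams such as the Module Info, Public, and Global streams are not covered by this sketch, because their indices have to be read out of the DBI Stream itself.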
More information about the structure of each of these can be found on the
following pages:
:doc:`PdbStream`
Information about the PDB Info Stream and how it is used to match PDBs to EXEs.
:doc:`TpiStream`
Information about the TPI stream and the CodeView records contained within.
:doc:`DbiStream`
Information about the DBI stream and its relevant substreams, including the Module Substreams,
source file information, and the CodeView symbol records contained within.
:doc:`ModiStream`
Information about the Module Information Stream, of which there is one for each compilation
unit, and the format of the symbols contained within.
:doc:`PublicStream`
Information about the Public Symbol Stream.
:doc:`GlobalStream`
Information about the Global Symbol Stream.
:doc:`HashTable`
Information about the serialized hash table format used internally to represent things such
as the Named Stream Map and the Hash Adjusters in the :doc:`TPI/IPI Stream <TpiStream>`.
CodeView
========
CodeView is another format that comes into the picture. While MSF defines
the structure of the overall file, and PDB defines the set of streams that
appear within the MSF file and the format of those streams, CodeView defines
the format of **symbol and type records** that appear within specific streams.
Refer to the pages on :doc:`CodeViewSymbols` and :doc:`CodeViewTypes` for
more information about the CodeView format.
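As a rough illustration of what "record" means here, the sketch below walks a buffer of CodeView records. It assumes the common CodeView record prefix, which this page does not describe: a little-endian 16-bit length (counting the bytes that follow the length field) and then a 16-bit record kind. The function name ``iter_records`` and the sample bytes are made up for the example; the pages linked above remain the authority on the actual layout.

```python
import struct


def iter_records(data):
    """Yield (kind, payload) for each record in `data`.

    Assumed prefix: uint16 length (excluding the length field itself),
    then uint16 kind, both little-endian.
    """
    offset = 0
    while offset + 4 <= len(data):
        length, kind = struct.unpack_from("<HH", data, offset)
        if length < 2:
            raise ValueError("record length too small to hold a kind")
        end = offset + 2 + length
        if end > len(data):
            raise ValueError("record runs past the end of the buffer")
        # The payload is everything after the 2-byte kind field.
        yield kind, data[offset + 4:end]
        offset = end


# Two made-up records with arbitrary kinds and payloads.
sample = struct.pack("<HH2s", 4, 0x1234, b"ab") + struct.pack("<HH", 2, 0x5678)
records = list(iter_records(sample))
```

The walker only separates records from one another; interpreting a payload depends on the record kind and on whether the stream holds symbol or type records.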