This change seems to make LLD 0.6% faster when linking Clang with
debug info. I don't want us to have lots of local optimizations,
but this function is very hot, and the improvement is small but
not negligible, so I think it's worth doing.
llvm-svn: 288757
ICF is short for Identical Code Folding. It is a size optimization to
identify two or more functions that happened to have the same contents
to merges them. It usually reduces output size by a few percent.
ICF is slow because it is computationally intensive process. I tried
to paralellize it before but failed because I couldn't make a
parallelized version produce consistent outputs. Although it didn't
create broken executables, every invocation of the linker generated
slightly different output, and I couldn't figure out why.
I think I now understand what was going on, and also came up with a
simple algorithm to fix it. So is this patch.
The result is very exciting. Chromium for example has 780,662 input
sections in which 20,774 are reducible by ICF. LLD previously took
7.980 seconds for ICF. Now it finishes in 1.065 seconds.
As a result, LLD can now link a Chromium binary (output size 1.59 GB)
in 10.28 seconds on my machine with ICF enabled. Compared to gold
which takes 40.94 seconds to do the same thing, this is an amazing
number.
From here, I'll describe what we are doing for ICF, what was the
previous problem, and what I did in this patch.
In ICF, two sections are considered identical if they have the same
section flags, section data, and relocations. Relocations are tricky,
becuase two relocations are considered the same if they have the same
relocation type, values, and if they point to the same section _in
terms of ICF_.
Here is an example. If foo and bar defined below are compiled to the
same machine instructions, ICF can (and should) merge the two,
although their relocations point to each other.
void foo() { bar(); }
void bar() { foo(); }
This is not an easy problem to solve.
What we are doing in LLD is some sort of coloring algorithm. We color
non-identical sections using different colors repeatedly, and sections
in the same color when the algorithm terminates are considered
identical. Here is the details:
1. First, we color all sections using their hash values of section
types, section contents, and numbers of relocations. At this moment,
relocation targets are not taken into account. We just color
sections that apparently differ in different colors.
2. Next, for each color C, we visit sections having color C to see
if their relocations are the same. Relocations are considered equal
if their targets have the same color. We then recolor sections that
have different relocation targets in new colors.
3. If we recolor some section in step 2, relocations that were
previously pointing to the same color targets may now be pointing to
different colors. Therefore, repeat 2 until a convergence is
obtained.
Step 2 is a heavy operation. For Chromium, the first iteration of step
2 takes 2.882 seconds, and the second iteration takes 1.038 seconds,
and in total it needs 23 iterations.
Parallelizing step 1 is easy because we can color each section
independently. This patch does that.
Parallelizing step 2 is tricky. We could work on each color
independently, but we cannot recolor sections in place, because it
will break the invariance that two possibly-identical sections must
have the same color at any moment.
Consider sections S1, S2, S3, S4 in the same color C, where S1 and S2
are identical, S3 and S4 are identical, but S2 and S3 are not. Thread
A is about to recolor S1 and S2 in C'. After thread A recolor S1 in
C', but before recolor S2 in C', other thread B might observe S1 and
S2. Then thread B will conclude that S1 and S2 are different, and it
will split thread B's sections into smaller groups wrongly. Over-
splitting doesn't produce broken results, but it loses a chance to
merge some identical sections. That was the cause of indeterminism.
To fix the problem, I made sections have two colors, namely current
color and next color. At the beginning of each iteration, both colors
are the same. Each thread reads from current color and writes to next
color. In this way, we can avoid threads from reading partial
results. After each iteration, we flip current and next.
This is a very simple solution and is implemented in less than 50
lines of code.
I tested this patch with Chromium and confirmed that this parallelized
ICF produces the identical output as the non-parallelized one.
Differential Revision: https://reviews.llvm.org/D27247
llvm-svn: 288373
They return new vectors, but at the same time they mutate other vectors,
so returning values doesn't make much sense. We should just mutate two
vectors.
llvm-svn: 287979
The function was used only within Relocations.cpp, but now we are
using it in many places, so this patch moves it to a file that fits
to the functionality.
llvm-svn: 287943
We have different functions to stringize objects to construct
error messages. For InputFile, we have getFilename, and for
InputSection, we have getName. You had to memorize them.
I think this is the case where the function overloading comes in handy.
This patch defines toString() functions that are overloaded for all these
types, so that you just call it in error().
Differential Revision: https://reviews.llvm.org/D27030
llvm-svn: 287787
Previously, we set (uintptr_t)-1 to InputSectionBase::OutSec to record
that a section has already been set to be assigned to some output section
by linker scripts. Later, we restored nullptr to the pointer to use
the field for the original purpose. That overloading is not very easy to
understand.
This patch adds a bit flag for that purpose, so that we don't need
to piggyback the flag on an unrelated pointer.
llvm-svn: 287508
Also this patch uses file-scope functions instead of class member function.
Now that ICF class is not visible from outside, InputSection class
can no longer be "friend" of it. So I removed the friend relation
and just make it expose the features to public.
llvm-svn: 287480
MergeOutputSection class was a bit hard to use because it provdes
a series of finalize functions that have to be called in a right way
at a right time. It also intereacted with MergeInputSection, and the
logic was somewhat entangled between the two classes.
This patch simplifies it by providing only one finalize function.
Now, all you have to do is to call MergeOutputSection::finalize
when you have added all sections to the output section. Then, it
internally merges strings and initliazes StringPiece objects.
I think this is much easier to understand.
This patch also adds comments.
llvm-svn: 287314
Relocations are the last thing that we wore storing a raw section
pointer to and parsing on demand.
With this patch we parse it only once and store a pointer to the
actual data.
The patch also changes where we store it. It is now in
InputSectionBase. Not all sections have relocations, but most do and
this simplifies the logic. It also means that we now only support one
relocation section per section. Given that that constraint is
maintained even with -r with gold bfd and lld, I think it is OK.
llvm-svn: 286459
The disadvantage is that we use uint64_t instad of uint32_t for some
value in 32 bit files. The advantage is a substantially simpler code,
faster builds and less code duplication.
llvm-svn: 286414
Previously, we have both input and output section for .MIPS.abiflags.
Now we have only one class for .MIPS.abiflags, which is MipsAbiFlagsSection.
This class is a synthetic input section.
.MIPS.abiflags sections are handled as regular sections until
the control reaches Writer. Writer then aggregates all sections
whose type is SHT_MIPS_ABIFLAGS to create a single synthesized
input section. The synthesized section is then processed normally
as if it came from an input file.
llvm-svn: 286398
Previously, we have both input and output sections for .reginfo and
.MIPS.options. Now for each such sections we have one synthetic input
sections: MipsReginfoSection and MipsOptionsSection respectively.
Both sections are handled as regular sections until the control reaches
Writer. Writer then aggregates all sections whose type is SHT_MIPS_REGINFO
or SHT_MIPS_OPTIONS to create a single synthesized input section. In that
moment Writer also save GP0 value to the MipsGp0 field of the corresponding
ObjectFile. This value required for R_MIPS_GPREL16 and R_MIPS_GPREL32
relocations calculation.
Differential revision: https://reviews.llvm.org/D26444
llvm-svn: 286397
A CommonInputSection is a section containing all common symbols.
That was an input section but was abstracted in a different way
than the synthetic input sections because it was written before
the synthetic input section was invented.
This patch rewrites CommonInputSection as a synthetic input section
so that it behaves better with other sections.
llvm-svn: 286053
We are going to have many more classes for linker-synthesized
input sections, so it's worth to be added to a separate file
than to the file for regular input sections.
llvm-svn: 285740
Instead of storing a pointer, store the members we need.
The reason for doing this is that it makes it far easier to create
synthetic sections. It also avoids reading data from files multiple
times., which might help with cross endian linking and host
architectures with slow unaligned access.
There are obvious compacting opportunities, but this already has mixed
results even on native x86_64 linking.
There is also the possibility of better refactoring the code for
handling common symbols, but this already shows that a custom class is
not necessary.
llvm-svn: 285148
We were fairly inconsistent as to what information should be accessed
with getSectionHdr and what information (like alignment) was stored
elsewhere.
Now all section info has a dedicated getter. The code is also a bit
more compact.
llvm-svn: 285079
Builds were failing with:
InputSection.h(139): error C2338: SectionPiece is too big
because MSVC does record layout differently, probably not packing the
'OutputOff' and 'Live' bitfields because their types are of different
size. Using size_t for 'Live' seems to fix it.
llvm-svn: 284740
Previously, we supported only SHF_COMPRESSED sections because it's
new and it's the ELF standard. But there are object files compressed
in the GNU style out there, so we had to support it.
Sections compressed in the GNU style start with ".zdebug_" and
contain different headers than the ELF standard's one. In this
patch, getRawCompressedData is responsible to handle it.
A tricky thing about GNU-style compressed sections is that we have
to rename them when creating output sections. ".zdebug_" prefix
implies the section is compressed. We need to rename ".zdebug_"
".debug" because our output sections are not compressed.
We do that in this patch.
llvm-svn: 284068
.ARM.exidx sections have a reverse dependency on the section they have
a SHF_LINK_ORDER dependency on. In other words a .ARM.exidx section is
live only if the executable section it describes is live. We implement
this with a reverse dependency field in InputSection.
Adding the dependency to InputSection is the simplest implementation
but it could be moved out to a separate map if it were found to decrease
performance for non ARM targets.
Differential revision: https://reviews.llvm.org/D25234
llvm-svn: 283734
The .ARM.exidx sections contain a table. Each entry has two fields:
- PREL31 offset to the function the table entry describes
- Action to take, either cantunwind, inline unwind, or PREL31 offset to
.ARM.extab section
The table entries must be sorted in order of the virtual addresses the
first entry of the table describes. Traditionally this is implemented by
the SHF_LINK_ORDER dependency. Instead of implementing this directly we
sort the table entries post relocation.
The .ARM.exidx OutputSection is described by the PT_ARM_EXIDX program
header
Differential revision: https://reviews.llvm.org/D25127
llvm-svn: 283730
This spreads out computing the hash and using it in a hash table. The
speedups are:
firefox
master 6.811232891
patch 6.559280249 1.03841162939x faster
chromium
master 4.369323666
patch 4.33171853 1.00868134338x faster
chromium fast
master 1.856679971
patch 1.850617741 1.00327578725x faster
the gold plugin
master 0.32917962
patch 0.325711944 1.01064645023x faster
clang
master 0.558015452
patch 0.550284165 1.01404962652x faster
llvm-as
master 0.032563515
patch 0.032152077 1.01279662275x faster
the gold plugin fsds
master 0.356221362
patch 0.352772162 1.00977741549x faster
clang fsds
master 0.635096494
patch 0.627249229 1.01251060127x faster
llvm-as fsds
master 0.030183188
patch 0.029889544 1.00982430511x faster
scylla
master 3.071448906
patch 2.938484138 1.04524944215x faster
This seems to be because we don't stall as much. When linking firefox
stalled-cycles-frontend goes from 57.56% to 55.55%.
With -O2 the difference is even more significant since we avoid
recomputing the hash. For firefox we go from 9.990295265 to
9.149627521 seconds (1.09x faster).
llvm-svn: 283367
It is pretty easy to get the data from the InputSection, so we don't
have to store it.
This opens the way for storing the hash instead.
llvm-svn: 283357
This simplifies error handling as there is now only one place in the
code that needs to consider the possibility that the name is
corrupted. Before we would do it in every access.
llvm-svn: 280937