2021-05-26 02:57:16 +08:00
|
|
|
//===- ConcatOutputSection.h ------------------------------------*- C++ -*-===//
|
2020-05-02 07:29:06 +08:00
|
|
|
//
|
|
|
|
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
|
|
// See https://llvm.org/LICENSE.txt for license information.
|
|
|
|
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
|
//
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
|
2022-04-08 06:13:27 +08:00
|
|
|
#ifndef LLD_MACHO_CONCAT_OUTPUT_SECTION_H
|
|
|
|
#define LLD_MACHO_CONCAT_OUTPUT_SECTION_H
|
2020-05-02 07:29:06 +08:00
|
|
|
|
|
|
|
#include "InputSection.h"
|
|
|
|
#include "OutputSection.h"
|
|
|
|
#include "lld/Common/LLVM.h"
|
2021-03-30 08:33:48 +08:00
|
|
|
#include "llvm/ADT/DenseMap.h"
|
[lld-macho] Refactor segment/section creation, sorting, and merging
Summary:
There were a few issues with the previous setup:
1. The section sorting comparator used a declarative map of section names to
determine the correct order, but it turns out we need to match on more than
just names -- in particular, an upcoming diff will sort based on whether the
S_ZERO_FILL flag is set. This diff changes the sorter to a more imperative but
flexible form.
2. We were sorting OutputSections stored in a MapVector, which left the
MapVector in an inconsistent state -- the wrong keys map to the wrong values!
In practice, we weren't doing key lookups (only container iteration) after the
sort, so this was fine, but it was still a dubious state of affairs. This diff
copies the OutputSections to a vector before sorting them.
3. We were adding unneeded OutputSections to OutputSegments and then filtering
them out later, which meant that we had to remember whether an OutputSegment
was in a pre- or post-filtered state. This diff only adds the sections to the
segments if they are needed.
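
Points 1 and 2 above can be illustrated with a minimal sketch. It is not lld's code: `Section`, `sectionLess`, and `sortedSections` are hypothetical names, a plain vector of pairs stands in for the MapVector, and the `zeroFill` bool stands in for the S_ZERO_FILL flag. The idea is to copy pointers out of the keyed container and sort the copy with an imperative comparator, so the keyed container itself is never reordered.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output section; not lld's real class.
struct Section {
  std::string name;
  bool zeroFill; // stands in for the S_ZERO_FILL flag
};

// Imperative comparator: zero-fill sections sort last, ties break by name.
inline bool sectionLess(const Section *a, const Section *b) {
  assert(a && b);
  if (a->zeroFill != b->zeroFill)
    return !a->zeroFill;
  return a->name < b->name;
}

// Copy pointers out of the keyed container and sort only the copy.
// The keyed container keeps its original key/value pairing.
inline std::vector<const Section *>
sortedSections(const std::vector<std::pair<std::string, Section>> &keyed) {
  std::vector<const Section *> out;
  out.reserve(keyed.size());
  for (const auto &kv : keyed)
    out.push_back(&kv.second);
  std::stable_sort(out.begin(), out.end(), sectionLess);
  return out;
}
```

Because only the copied vector is sorted, a later key lookup in the original container would still find the right value.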
In addition to those major changes, two minor ones worth noting:
1. I renamed all OutputSection variable names to `osec`, to parallel `isec`.
Previously we were using some inconsistent combination of `osec`, `os`, and
`section`.
2. I added a check (and a test) for InputSections with names that clashed with
those of our synthetic OutputSections.
Reviewers: #lld-macho
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81887
2020-06-15 15:03:24 +08:00
|
|
|
#include "llvm/ADT/MapVector.h"
|
2020-05-02 07:29:06 +08:00
|
|
|
|
|
|
|
namespace lld {
|
|
|
|
namespace macho {
|
|
|
|
|
2021-03-30 08:33:48 +08:00
|
|
|
class Defined;
|
|
|
|
|
2020-05-02 07:29:06 +08:00
|
|
|
// Linking multiple files will inevitably mean resolving sections in different
|
|
|
|
// files that are labeled with the same segment and section name. This class
|
|
|
|
// contains all such sections and writes the data from each section sequentially
|
|
|
|
// in the final binary.
|
2022-04-08 06:13:27 +08:00
|
|
|
class ConcatOutputSection : public OutputSection {
|
2020-05-02 07:29:06 +08:00
|
|
|
public:
|
2021-05-26 02:57:16 +08:00
|
|
|
explicit ConcatOutputSection(StringRef name)
|
|
|
|
: OutputSection(ConcatKind, name) {}
|
2020-05-02 07:29:06 +08:00
|
|
|
|
[lld-macho] Implement cstring deduplication
Our implementation draws heavily from LLD-ELF's, which in turn delegates
its string deduplication to llvm-mc's StringTableBuilder. The messiness of
this diff is largely due to the fact that we've previously assumed that
all InputSections get concatenated together to form the output. This is
no longer true with CStringInputSections, which split their contents into
StringPieces. StringPieces are much more lightweight than InputSections,
which is important as we create a lot of them. They may also overlap in
the output, which makes it possible for strings to be tail-merged. In
fact, the initial version of this diff implemented tail merging, but
I've dropped it for reasons I'll explain later.
**Alignment Issues**
Mergeable cstring literals are found under the `__TEXT,__cstring`
section. In contrast to ELF, which puts strings that need different
alignments into different sections, clang's Mach-O backend puts them all
in one section. Strings that need to be aligned have the `.p2align`
directive emitted before them, which simply translates into zero padding
in the object file.
I *think* ld64 extracts the desired per-string alignment from this data
by preserving each string's offset from the last section-aligned
address. I'm not entirely certain since it doesn't seem consistent about
doing this; but perhaps this can be chalked up to cases where ld64 has
to deduplicate strings with different offset/alignment combos -- it
seems to pick one of their alignments to preserve. This doesn't seem
correct in general; we can in fact induce ld64 to produce a crashing
binary just by linking in an additional object file that only contains
cstrings and no code. See PR50563 for details.
Moreover, this scheme seems rather inefficient: since unaligned and
aligned strings are all put in the same section, which has a single
alignment value, it doesn't seem possible to tell whether a given string
doesn't have any alignment requirements. Preserving offset+alignments
for strings that don't need it is wasteful.
In practice, the crashes seen so far seem to stem from x86_64 SIMD
operations on cstrings. X86_64 requires SIMD accesses to be
16-byte-aligned. So for now, I'm thinking of just aligning all strings
to 16 bytes on x86_64. This is indeed wasteful, but implementation-wise
it's simpler than preserving per-string alignment+offsets. It also
avoids the aforementioned crash after deduplication of
differently-aligned strings. Finally, the overhead is not huge: using
16-byte alignment (vs no alignment) is only a 0.5% size overhead when
linking chromium_framework.
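
The "align every string to 16 bytes" scheme can be sketched roughly as follows. This is an illustration only, not the code in this diff: `alignUp` and `layoutStrings` are hypothetical helpers, and a plain hash map stands in for whatever deduplication structure the linker actually uses. Each distinct string gets the next aligned offset, and duplicates reuse the offset of their first occurrence.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Round `offset` up to the next multiple of `align` (a power of two).
inline uint64_t alignUp(uint64_t offset, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (offset + align - 1) & ~(align - 1);
}

// Hypothetical helper: returns one output offset per input string, in input
// order. Distinct strings are placed at `align`-byte boundaries; duplicates
// share the offset of their first occurrence. `totalSize` receives the end
// of the last placed string, including its NUL terminator.
inline std::vector<uint64_t> layoutStrings(const std::vector<std::string> &strs,
                                           uint64_t align,
                                           uint64_t &totalSize) {
  std::unordered_map<std::string, uint64_t> seen;
  std::vector<uint64_t> offsets;
  uint64_t end = 0;
  for (const std::string &s : strs) {
    auto it = seen.find(s);
    if (it == seen.end()) {
      uint64_t off = alignUp(end, align);
      it = seen.emplace(s, off).first;
      end = off + s.size() + 1; // +1 for the NUL terminator
    }
    offsets.push_back(it->second);
  }
  totalSize = end;
  return offsets;
}
```

The padding between strings is where the size overhead mentioned above comes from: a short string still occupies a full 16-byte slot before the next one can start.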
With these alignment requirements, it doesn't make sense to attempt tail
merging -- most strings will not be eligible since their overlaps aren't
likely to start at a 16-byte boundary. Tail-merging (with alignment) for
chromium_framework only improves size by 0.3%.
It's worth noting that LLD-ELF only does tail merging at `-O2`. By
default (at `-O1`), it just deduplicates w/o tail merging. @thakis has
also mentioned that they saw it regress compressed size in some cases
and therefore turned it off. `ld64` does not seem to do tail merging at
all.
**Performance Numbers**
CString deduplication reduces chromium_framework from 250MB to 242MB, or
about a 3.2% reduction.
Numbers for linking chromium_framework on my 3.2 GHz 16-Core Intel Xeon W:
     N    Min    Max    Median    Avg       Stddev
x   20    3.91   4.03   3.935     3.95      0.034641016
+   20    3.99   4.14   4.015     4.0365    0.0492336
Difference at 95.0% confidence
0.0865 +/- 0.027245
2.18987% +/- 0.689746%
(Student's t, pooled s = 0.0425673)
As expected, cstring merging incurs some non-trivial overhead.
When passing `--no-literal-merge`, it seems that performance is the
same, i.e. the refactoring in this diff didn't cost us.
     N    Min    Max    Median    Avg       Stddev
x   20    3.91   4.03   3.935     3.95      0.034641016
+   20    3.89   4.02   3.935     3.9435    0.043197831
No difference proven at 95.0% confidence
Reviewed By: #lld-macho, gkm
Differential Revision: https://reviews.llvm.org/D102964
2021-06-08 11:47:12 +08:00
|
|
|
const ConcatInputSection *firstSection() const { return inputs.front(); }
|
|
|
|
const ConcatInputSection *lastSection() const { return inputs.back(); }
|
[lld-macho] Have ICF operate on all sections at once
ICF previously operated only within a given OutputSection. We would
merge all CFStrings first, then merge all regular code sections in a
second phase. This worked fine since CFStrings would never reference
regular `__text` sections. However, I would like to expand ICF to merge
functions that reference unwind info. Unwind info references the LSDA
section, which can in turn reference the `__text` section, so we cannot
perform ICF in phases.
In order to have ICF operate on InputSections spanning multiple
OutputSections, we need a way to distinguish InputSections that are
destined for different OutputSections, so that we don't fold across
section boundaries. We achieve this by creating OutputSections early,
and setting `InputSection::parent` to point to them. This is what
LLD-ELF does. (This change should also make it easier to implement the
`section$start$` symbols.)
This diff also folds InputSections w/o checking their flags, which I
think is the right behavior -- if they are destined for the same
OutputSection, they will have the same flags in the output (even if
their input flags differ). I.e. the `parent` pointer check subsumes the
`flags` check. In practice this has nearly no effect (ICF did not become
any more effective on chromium_framework).
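
A toy sketch of the "`parent` check subsumes the `flags` check" idea, under loud assumptions: `OutSec`, `InSec`, `mayFold`, and `foldInto` are hypothetical names, and real ICF also has to compare relocations and iterate to a fixed point, which is omitted here. Two inputs are treated as foldable only when they share a destination output section and have identical bytes; their input flags are deliberately ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins; not lld's real OutputSection / InputSection.
struct OutSec {};

struct InSec {
  const OutSec *parent;             // destination output section
  std::string data;                 // raw section contents
  uint32_t flags;                   // input flags, ignored when folding
  const InSec *canonical = nullptr; // set once folded into another section
};

// Fold candidates must target the same output section and have equal
// contents. Flags are not compared: the output section decides them.
inline bool mayFold(const InSec &a, const InSec &b) {
  return a.parent == b.parent && a.data == b.data;
}

// Record that `dup` is represented by `keep` in the output.
inline void foldInto(InSec &dup, const InSec &keep) {
  assert(mayFold(dup, keep));
  dup.canonical = &keep;
}
```

Comparing `parent` pointers is what keeps folding from crossing output-section boundaries, even when all inputs are processed in a single pass.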
I've also updated ICF.cpp's block comment to better reflect its current
status.
Reviewed By: #lld-macho, smeenai
Differential Revision: https://reviews.llvm.org/D105641
2021-07-18 01:42:26 +08:00
|
|
|
bool isNeeded() const override { return !inputs.empty(); }
|
2020-05-02 07:29:06 +08:00
|
|
|
|
|
|
|
// These accessors will only be valid after finalizing the section
|
2020-06-17 08:27:28 +08:00
|
|
|
uint64_t getSize() const override { return size; }
|
2020-05-02 07:29:06 +08:00
|
|
|
uint64_t getFileSize() const override { return fileSize; }
|
|
|
|
|
2022-04-08 06:13:27 +08:00
|
|
|
// Assign values to InputSection::outSecOff. In contrast to TextOutputSection,
|
|
|
|
// which does this in its implementation of `finalize()`, we can do this
|
|
|
|
// without `finalize()`'s sequential guarantees detailed in the block comment
|
|
|
|
// of `OutputSection::finalize()`.
|
|
|
|
virtual void finalizeContents();
|
2020-05-02 07:29:06 +08:00
|
|
|
|
2022-04-08 06:13:27 +08:00
|
|
|
void addInput(ConcatInputSection *input);
|
2020-05-02 07:29:06 +08:00
|
|
|
void writeTo(uint8_t *buf) const override;
|
|
|
|
|
2020-05-06 07:37:34 +08:00
|
|
|
static bool classof(const OutputSection *sec) {
|
2021-05-26 02:57:16 +08:00
|
|
|
return sec->kind() == ConcatKind;
|
2020-05-06 07:37:34 +08:00
|
|
|
}
|
|
|
|
|
2021-07-18 01:42:26 +08:00
|
|
|
static ConcatOutputSection *getOrCreateForInput(const InputSection *);
|
|
|
|
|
2022-04-08 06:13:27 +08:00
|
|
|
std::vector<ConcatInputSection *> inputs;
|
2020-05-02 07:29:06 +08:00
|
|
|
|
2022-04-08 06:13:27 +08:00
|
|
|
protected:
|
2020-05-02 07:29:06 +08:00
|
|
|
size_t size = 0;
|
|
|
|
uint64_t fileSize = 0;
|
2022-04-08 06:13:27 +08:00
|
|
|
void finalizeOne(ConcatInputSection *);
|
|
|
|
|
|
|
|
private:
|
|
|
|
void finalizeFlags(InputSection *input);
|
|
|
|
};
|
|
|
|
|
|
|
|
// ConcatOutputSections that contain code (text) require special handling to
|
|
|
|
// support thunk insertion.
|
|
|
|
class TextOutputSection : public ConcatOutputSection {
|
|
|
|
public:
|
|
|
|
explicit TextOutputSection(StringRef name) : ConcatOutputSection(name) {}
|
|
|
|
void finalizeContents() override {}
|
|
|
|
void finalize() override;
|
|
|
|
bool needsThunks() const;
|
|
|
|
void writeTo(uint8_t *buf) const override;
|
|
|
|
|
|
|
|
private:
|
|
|
|
uint64_t estimateStubsInRangeVA(size_t callIdx) const;
|
|
|
|
|
|
|
|
std::vector<ConcatInputSection *> thunks;
|
2020-05-02 07:29:06 +08:00
|
|
|
};
|
|
|
|
|
2021-03-30 08:33:48 +08:00
|
|
|
// We maintain one ThunkInfo per real function.
|
|
|
|
//
|
|
|
|
// The "active thunk" is represented by the sym/isec pair that
|
|
|
|
// turns over during finalize(): as the call-site address advances,
|
|
|
|
// the active thunk goes out of branch-range, and we create a new
|
|
|
|
// thunk to take its place.
|
|
|
|
//
|
|
|
|
// The remaining members -- bools and counters -- apply to the
|
|
|
|
// collection of thunks associated with the real function.
|
|
|
|
|
|
|
|
struct ThunkInfo {
|
|
|
|
// These denote the active thunk:
|
2021-06-08 11:47:12 +08:00
|
|
|
Defined *sym = nullptr; // private-extern symbol for active thunk
|
|
|
|
ConcatInputSection *isec = nullptr; // input section for active thunk
|
2021-03-30 08:33:48 +08:00
|
|
|
|
|
|
|
// The following values are cumulative across all thunks on this function
|
|
|
|
uint32_t callSiteCount = 0; // how many calls to the real function?
|
|
|
|
uint32_t callSitesUsed = 0; // how many call sites processed so far?
|
|
|
|
uint32_t thunkCallCount = 0; // how many call sites went to thunk?
|
|
|
|
uint8_t sequence = 0; // how many thunks created so far?
|
|
|
|
};
|
|
|
|
|
2021-07-18 01:42:26 +08:00
|
|
|
NamePair maybeRenameSection(NamePair key);
|
|
|
|
|
|
|
|
// Output sections are added to output segments in iteration order
|
|
|
|
// of concatOutputSections, so it must have deterministic iteration order.
|
|
|
|
extern llvm::MapVector<NamePair, ConcatOutputSection *> concatOutputSections;
|
|
|
|
|
2021-03-30 08:33:48 +08:00
|
|
|
extern llvm::DenseMap<Symbol *, ThunkInfo> thunkMap;
|
|
|
|
|
2020-05-02 07:29:06 +08:00
|
|
|
} // namespace macho
|
|
|
|
} // namespace lld
|
|
|
|
|
|
|
|
#endif
|