llvm-project/lld/MachO/InputSection.cpp

//===- InputSection.cpp ---------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
#include "InputSection.h"
#include "ConcatOutputSection.h"
#include "Config.h"
#include "InputFiles.h"
#include "OutputSegment.h"
#include "Symbols.h"
#include "SyntheticSections.h"
#include "Target.h"
#include "UnwindInfoSection.h"
#include "Writer.h"
#include "lld/Common/Memory.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/xxhash.h"

using namespace llvm;
using namespace llvm::MachO;
using namespace llvm::support;
using namespace lld;
using namespace lld::macho;

// Verify ConcatInputSection's size on 64-bit builds. The size of std::vector
// can differ based on STL debug levels (e.g. iterator debugging on MSVC's STL),
// so account for that.
static_assert(sizeof(void *) != 8 ||
                  sizeof(ConcatInputSection) ==
                      sizeof(std::vector<Reloc>) + 96,
              "Try to minimize ConcatInputSection's size, we create many "
              "instances of it");

std::vector<ConcatInputSection *> macho::inputSections;

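// Zero-fill sections (e.g. __bss) occupy address space but no bytes in the
// output file, so they contribute nothing to the file size.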
uint64_t InputSection::getFileSize() const {
  return isZeroFill(getFlags()) ? 0 : getSize();
}

uint64_t InputSection::getVA(uint64_t off) const {
  return parent->addr + getOffset(off);
}

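// Compute the address a relocation of the given type should point to: branch
// relocations may be diverted to a stub, GOT-load relocations point at the
// symbol's GOT slot, and TLV relocations point at its thread-local pointer
// entry; everything else gets the symbol's plain virtual address.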
static uint64_t resolveSymbolVA(const Symbol *sym, uint8_t type) {
  const RelocAttrs &relocAttrs = target->getRelocAttrs(type);
  if (relocAttrs.hasAttr(RelocAttrBits::BRANCH))
    return sym->resolveBranchVA();
  if (relocAttrs.hasAttr(RelocAttrBits::GOT))
    return sym->resolveGotVA();
  if (relocAttrs.hasAttr(RelocAttrBits::TLV))
    return sym->resolveTlvVA();
  return sym->getVA();
}

// ICF needs to hash any section that might be duplicated so that it can
// match on content rather than identity.
bool ConcatInputSection::isHashableForICF() const {
  switch (sectionType(getFlags())) {
  case S_REGULAR:
    return true;
  case S_CSTRING_LITERALS:
  case S_4BYTE_LITERALS:
  case S_8BYTE_LITERALS:
  case S_16BYTE_LITERALS:
  case S_LITERAL_POINTERS:
    llvm_unreachable("found unexpected literal type in ConcatInputSection");
  case S_ZEROFILL:
  case S_GB_ZEROFILL:
  case S_NON_LAZY_SYMBOL_POINTERS:
  case S_LAZY_SYMBOL_POINTERS:
  case S_SYMBOL_STUBS:
  case S_MOD_INIT_FUNC_POINTERS:
  case S_MOD_TERM_FUNC_POINTERS:
  case S_COALESCED:
  case S_INTERPOSING:
  case S_DTRACE_DOF:
  case S_LAZY_DYLIB_SYMBOL_POINTERS:
  case S_THREAD_LOCAL_REGULAR:
  case S_THREAD_LOCAL_ZEROFILL:
  case S_THREAD_LOCAL_VARIABLES:
  case S_THREAD_LOCAL_VARIABLE_POINTERS:
  case S_THREAD_LOCAL_INIT_FUNCTION_POINTERS:
    return false;
  default:
    llvm_unreachable("Section type");
  }
}

void ConcatInputSection::hashForICF() {
  assert(data.data()); // zeroFill sections have a null data pointer but a
                       // non-zero size
  assert(icfEqClass[0] == 0); // don't overwrite a unique ID!
  // Turn on the top bit to guarantee that valid hashes have no collisions
  // with the small-integer unique IDs for ICF-ineligible sections.
  icfEqClass[0] = xxHash64(data) | (1ull << 63);
}

void ConcatInputSection::foldIdentical(ConcatInputSection *copy) {
  align = std::max(align, copy->align);
  copy->live = false;
  copy->wasCoalesced = true;
  copy->replacement = this;

  // Merge the sorted vectors of symbols together.
  auto it = symbols.begin();
  for (auto copyIt = copy->symbols.begin(); copyIt != copy->symbols.end();) {
    if (it == symbols.end()) {
      symbols.push_back(*copyIt++);
      it = symbols.end();
    } else if ((*it)->value > (*copyIt)->value) {
      std::swap(*it++, *copyIt);
    } else {
      ++it;
    }
  }
  copy->symbols.clear();

  // Remove duplicate compact unwind info for symbols at the same address.
  if (symbols.empty())
    return;
  it = symbols.begin();
  uint64_t v = (*it)->value;
  for (++it; it != symbols.end(); ++it) {
    Defined *d = *it;
    if (d->value == v)
      d->unwindEntry = nullptr;
    else
      v = d->value;
  }
}

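// Copy this section's data into the output buffer and apply its relocations
// in place.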
void ConcatInputSection::writeTo(uint8_t *buf) {
  assert(!shouldOmitFromOutput());

  if (getFileSize() == 0)
    return;

  memcpy(buf, data.data(), data.size());

  for (size_t i = 0; i < relocs.size(); i++) {
    const Reloc &r = relocs[i];
    uint8_t *loc = buf + r.offset;
    uint64_t referentVA = 0;
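    // Subtractor relocations come in pairs: this reloc names the symbol to
    // subtract, and the following reloc supplies the minuend; the value
    // written is the difference between the two addresses.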
    if (target->hasAttr(r.type, RelocAttrBits::SUBTRAHEND)) {
      const Symbol *fromSym = r.referent.get<Symbol *>();
      const Reloc &minuend = relocs[++i];
      uint64_t minuendVA;
      if (const Symbol *toSym = minuend.referent.dyn_cast<Symbol *>())
        minuendVA = toSym->getVA() + minuend.addend;
      else {
        auto *referentIsec = minuend.referent.get<InputSection *>();
        assert(!::shouldOmitFromOutput(referentIsec));
        minuendVA = referentIsec->getVA(minuend.addend);
      }
      referentVA = minuendVA - fromSym->getVA();
    } else if (auto *referentSym = r.referent.dyn_cast<Symbol *>()) {
      if (target->hasAttr(r.type, RelocAttrBits::LOAD) &&
          !referentSym->isInGot())
        target->relaxGotLoad(loc, r.type);
      referentVA = resolveSymbolVA(referentSym, r.type) + r.addend;

      if (isThreadLocalVariables(getFlags())) {
        // References from thread-local variable sections are treated as
        // offsets relative to the start of the thread-local data memory area,
        // which is initialized via copying all the TLV data sections (which
        // are all contiguous).
        if (isa<Defined>(referentSym))
          referentVA -= firstTLVDataSection->addr;
      }
    } else if (auto *referentIsec = r.referent.dyn_cast<InputSection *>()) {
      assert(!::shouldOmitFromOutput(referentIsec));
      referentVA = referentIsec->getVA(r.addend);
    }
    target->relocateOne(loc, r, referentVA, getVA() + r.offset);
  }
}

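// Split the section's contents at null terminators into StringPieces,
// recording each piece's offset and (when deduplication is enabled) its hash.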
void CStringInputSection::splitIntoPieces() {
  size_t off = 0;
  StringRef s = toStringRef(data);
  while (!s.empty()) {
    size_t end = s.find(0);
    if (end == StringRef::npos)
      fatal(toString(this) + ": string is not null terminated");
    size_t size = end + 1;
    uint32_t hash = config->dedupLiterals ? xxHash64(s.substr(0, size)) : 0;
    pieces.emplace_back(off, hash);
    s = s.substr(size);
    off += size;
  }
}

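// Find the StringPiece that contains the byte at the given section offset.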
StringPiece &CStringInputSection::getStringPiece(uint64_t off) {
if (off >= data.size())
fatal(toString(this) + ": offset is outside the section");
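// Pieces are sorted by inSecOff, so binary-search for the first piece that
// starts after `off`; the piece just before it is the one containing `off`.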
auto it =
partition_point(pieces, [=](StringPiece p) { return p.inSecOff <= off; });
return it[-1];
}
const StringPiece &CStringInputSection::getStringPiece(uint64_t off) const {
return const_cast<CStringInputSection *>(this)->getStringPiece(off);
}
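// Map an offset within this input section to the corresponding offset in the
// deduplicated output section: the containing piece's output offset plus the
// offset of the string within that piece.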
uint64_t CStringInputSection::getOffset(uint64_t off) const {
const StringPiece &piece = getStringPiece(off);
uint64_t addend = off - piece.inSecOff;
return piece.outSecOff + addend;
}
WordLiteralInputSection::WordLiteralInputSection(StringRef segname,
StringRef name,
InputFile *file,
ArrayRef<uint8_t> data,
uint32_t align, uint32_t flags)
: InputSection(WordLiteralKind, segname, name, file, data, align, flags) {
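// Record log2 of the literal size so that offsets can be converted to
// literal indices with a shift.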
switch (sectionType(flags)) {
case S_4BYTE_LITERALS:
power2LiteralSize = 2;
break;
case S_8BYTE_LITERALS:
power2LiteralSize = 3;
break;
case S_16BYTE_LITERALS:
power2LiteralSize = 4;
break;
default:
llvm_unreachable("invalid literal section type");
}
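// Track liveness with one bit per literal. Literals start out live unless
// dead-stripping is in effect, in which case they must be marked live later.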
live.resize(data.size() >> power2LiteralSize, !config->deadStrip);
}
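// Map an offset within this input section to its output offset: mask off the
// low bits to find the start of the containing literal, look up where the
// deduplicated copy of that literal landed in the output section, then add
// the sub-literal offset back in.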
uint64_t WordLiteralInputSection::getOffset(uint64_t off) const {
auto *osec = cast<WordLiteralSection>(parent);
const uintptr_t buf = reinterpret_cast<uintptr_t>(data.data());
switch (sectionType(getFlags())) {
case S_4BYTE_LITERALS:
return osec->getLiteral4Offset(buf + (off & ~3LLU)) | (off & 3);
case S_8BYTE_LITERALS:
return osec->getLiteral8Offset(buf + (off & ~7LLU)) | (off & 7);
case S_16BYTE_LITERALS:
return osec->getLiteral16Offset(buf + (off & ~15LLU)) | (off & 15);
default:
llvm_unreachable("invalid literal section type");
}
}
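// A section contains code if it is a regular or coalesced section that either
// has the pure-instructions attribute set or is one of the handful of __TEXT
// sections known to contain code.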
bool macho::isCodeSection(const InputSection *isec) {
uint32_t type = sectionType(isec->getFlags());
if (type != S_REGULAR && type != S_COALESCED)
return false;
uint32_t attr = isec->getFlags() & SECTION_ATTRIBUTES_USR;
if (attr == S_ATTR_PURE_INSTRUCTIONS)
return true;
if (isec->getSegName() == segment_names::text)
return StringSwitch<bool>(isec->getName())
.Cases(section_names::textCoalNt, section_names::staticInit, true)
.Default(false);
return false;
}
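// The __DATA,__cfstring section holds CFString literal structs.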
bool macho::isCfStringSection(const InputSection *isec) {
return isec->getName() == section_names::cfString &&
isec->getSegName() == segment_names::data;
}
std::string lld::toString(const InputSection *isec) {
return (toString(isec->getFile()) + ":(" + isec->getName() + ")").str();
}