llvm-project

Commit Graph

Author	SHA1	Message	Date
Alexander Shaposhnikov	e3654fc274	[BOLT] Uniquify names of local symbols Summary: 1. Uniquify names of local symbols. 2. Handle aliases. (cherry picked from FBD20270196)	2020-03-04 18:36:44 -08:00
Alexander Shaposhnikov	842a25f785	[BOLT] Mark functions containing data as non-simple Summary: Temporarily mark functions containing data as non-simple. (cherry picked from FBD20213279)	2020-03-02 22:41:12 -08:00
Maksim Panchenko	cb9c991dcb	[BOLT] Remove allow-section-relocations option Summary: The option is not used. Remove all related code. (cherry picked from FBD20237859)	2020-03-03 15:51:24 -08:00
Maksim Panchenko	c7e012e145	[BOLT][NFC] Get rid of BestFit parameter Summary: The parameter is no longer used. (cherry picked from FBD20236516)	2020-03-03 14:28:42 -08:00
Alexander Shaposhnikov	b0cbb60165	[BOLT] Fix begin decrementing Summary: Fix begin decrementing. (cherry picked from FBD20232474)	2020-03-03 13:36:32 -08:00
Maksim Panchenko	d89bb53afa	[BOLT][NFC] Factor out relocation processing (cherry picked from FBD20087297)	2020-02-24 17:10:02 -08:00
Rafael Auler	340da8f294	[BOLT] Fix shrink wrapping to check pops Summary: Shrink wrapping has a mode where it will directly move push pop pairs, instead of replacing them with stores/loads. This is an ambitious mode that is triggered sometimes, but whenever matching with a push, it would operate with the assumption that the restoring instruction was a pop, not a load, otherwise it would assert. Fix this assertion to bail nicely back to non-pushpop mode (use regular store and load instructions). (cherry picked from FBD20085905)	2020-02-18 16:00:40 -08:00
Maksim Panchenko	2df4e7b99e	[BOLT][NFC] Minor refactoring of RewriteInstance (cherry picked from FBD20087424)	2020-02-24 17:12:41 -08:00
Maksim Panchenko	495761dc70	[BOLT][NFC] Remove unused BinarySection member functions (cherry picked from FBD20087243)	2020-02-24 16:56:45 -08:00
Maksim Panchenko	3b45212e84	[BOLT] Delete ExecutableFileMemoryManager::registerNoteSection() Summary: The interface is no longer in use. (cherry picked from FBD20070558)	2020-02-24 09:40:32 -08:00
Alexander Shaposhnikov	01b7c90242	[BOLT] Add missing override Summary: Add missing override in X86MCPlusBuilder.cpp. (cherry picked from FBD20064222)	2020-02-23 22:27:28 -08:00
Maksim Panchenko	be43f89c4f	[BOLT][llvm] Update llvm.patch Summary: (cherry picked from FBD20063562)	2020-02-23 19:51:33 -08:00
Alexander Shaposhnikov	76aa1c26aa	[BOLT] Enable reversing the order of basic blocks Summary: Enable reversing the order of basic blocks. (cherry picked from FBD19943692)	2020-02-17 13:35:09 -08:00
Alexander Shaposhnikov	4ad5048393	[BOLT] Add first bits to build CFG Summary: Add first bits to build CFG. (cherry picked from FBD19943472)	2020-02-17 12:18:42 -08:00
Alexander Shaposhnikov	5b64bf2128	[BOLT] Disassemble functions from a MachO binary Summary: Add first bits to disassemble functions from a MachO binary. (cherry picked from FBD19900493)	2020-02-11 14:30:33 -08:00
Rafael Auler	a9d85413ac	[BOLT] Emit long nops by default Summary: Change our X86 target to use long nops by default. In general, BOLT does not put nops into the instruction stream that is going to be executed, since it doesn't align basic blocks, only functions. Since we rebased BOLT, our relationship with MCAssembler changed because it stopped using multibyte nops and we never needed to revisit that. But it makes a difference if we want to mitigate perf issues with the Intel JCC erratum, since the nops inserted are going to be decoded and executed. To make MCAssembler emit long nops again, we need to explicitly set mattr (Features) of the X86 target. (cherry picked from FBD19987277)	2020-02-19 16:13:58 -08:00
Maksim Panchenko	9711286858	[BOLT] Get rid of BinarySection::IsLocal Summary: The flag is no longer used/needed. (cherry picked from FBD19951571)	2020-02-18 09:20:17 -08:00
Alexander Shaposhnikov	16630f5c58	[BOLT] Factor out NameResolver from RewriteInstance Summary: Factor out the helper class NameResolver from the class RewriteInstance. (cherry picked from FBD19943916)	2020-02-17 14:37:46 -08:00
Alexander Shaposhnikov	754b6569f6	[BOLT] Add missing std::move Summary: Add missing std::move in the method BinaryFunction::addAlternativeName (cherry picked from FBD19944661)	2020-02-17 17:53:12 -08:00
Alexander Shaposhnikov	36cf37c4c1	[BOLT] Add initial bits for parsing MachO files Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later. (cherry picked from FBD19648161)	2020-01-30 13:10:48 -08:00
Rafael Auler	58a129a602	[BOLT] Move peepholes pass after sctc Summary: There are two peephole subpasses, remove-double-jumps and remove-useless-conditional-branches, that operate by reading branches directly, which makes them tricky to run before fix-branches. In the case of remove-double-jumps, it will even lead to suboptimal code if the patched branch was going to be removed by fix-branches when the target is the fall-through. If the final target is a tail call, it will lead to a broken CFG in the worst case. Fix this by moving these passes after SCTC, which already produces CFGs with conditional tail calls. (cherry picked from FBD18795592)	2019-12-03 12:28:22 -08:00
Rafael Auler	c82e7fd1cc	[BOLT] Decoder cache friendly alignment wrt Intel JCC Erratum Summary: This diff ports reviews.llvm.org/D70157 to our LLVM tree, which makes the integrated assembler able to align X86 control-flow changing instructions in a way to reduce the performance impact of the ucode update on Intel processors that implement the JCC erratum mitigation. See white paper "Mitigations for Jump Conditional Code Erratum" by Intel published November 2019. To port this patch, I changed classifySecondInstInMacroFusion to analyze instruction opcodes directly instead of analyzing the CondCond operand (in more recent versions of LLVM, all conditional branches share the same opcode, but with a different conditional operand). I also pulled to our tree Alignment.h as a dependency, and the macroop analyzing helpers. x86-align-branch-boundary and -x86-align-branch are the two flags that control nop insertion to avoid disabling the decoder cache, following the original patch. In BOLT, I added the flag x86-align-branch-boundary-hot-only to request the alignment to only be applied to hot code, which is turned on by default. The reason is because such alignment is expensive to perform on large modules, but if we limit it to hot code, the relaxation pass runtime becomes tolerable. (cherry picked from FBD19828850)	2020-02-10 18:50:53 -08:00
Alexander Shaposhnikov	d5b8fc8fbe	[BOLT] Make the methods isText/isData more robust Summary: Make the methods isText/isData work for MachO. (cherry picked from FBD19849460)	2020-02-11 17:54:48 -08:00
Alexander Shaposhnikov	c3c4b15a2e	[BOLT] Remove BinaryContext::getFunctionData Summary: In this diff we refactor the code around getting the original binary encoding of function's body. The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData. (cherry picked from FBD19824368)	2020-02-10 15:35:11 -08:00
Maksim Panchenko	41de03b8e9	[BOLT] Fix section names under `-generate-link-sections` Summary: Use proper function while printing modified function name to file. (cherry picked from FBD19791847)	2020-02-07 09:39:38 -08:00
Rafael Auler	0080d74506	[BOLT] Fix issue with strict and builtin_unreachable Summary: In strict mode, a jump table with targets generated by builtin_unreachable (located at the very end of the function) was asserting when being recreated by postProcessIndirectBranches. Fix this. (cherry picked from FBD19614981)	2020-01-28 18:38:10 -08:00
Maksim Panchenko	d57513e4ab	[BOLT] Fix symbol table issue with ICF Summary: Not all symbol table entries were updated after ICF. (cherry picked from FBD19319685)	2020-01-08 13:32:59 -08:00
Maksim Panchenko	ac697b7d3a	[BOLT] Replace list of Names with Symbols for BinaryFunction Summary: BinaryFunction used to have a list of Names associated with its main entry point. However, the function is primarily identified by its corresponding symbol or symbols, and these symbols are available as we are creating them for a corresponding BinaryData object. There's also no reason to emit symbols for alternative function names (aliases), so change the code to only emit needed symbols. When we emit a cold fragment for a function, only emit one cold symbol for the fragment instead of one per every main entry symbol/name. When we match a symbol to an entry point in the function, with this change we can first go through the list of main entry symbols (now that they are available). (cherry picked from FBD19426709)	2020-01-13 11:56:59 -08:00
Alexander Shaposhnikov	7a59783d7a	[BOLT] Move createBinaryContext to BinaryContext Summary: 1. Move createBinaryContext to BinaryContext. 2. Add support for nonlinux triples in createBinaryContext. 3. Remove unnecessary std::move in DWARFRewriter.cpp. (cherry picked from FBD19421314)	2020-01-15 15:23:45 -08:00
Rafael Auler	961d3d02d8	[BOLT] Move postProcessEntryPoints after disassembly Summary: Call postProcessEntryPoints only after all functions have been disassembled and all interprocedural references have been processed, when all possible entry points have been accounted for. This makes our detection of bad entries more robust as it does not depend on the order of the functions any more. (cherry picked from FBD19404767)	2020-01-14 17:12:03 -08:00
Maksim Panchenko	0283271f29	[BOLT] Do not report error on mismatched instruction encoding Summary: When the validation of instruction encoding fails but we are able to continue processing the binary, do not report an error. Report encoding format only under `-v=1`. (cherry picked from FBD19376531)	2020-01-13 11:24:10 -08:00
Maksim Panchenko	45b27d7b44	[BOLT] Get rid of Names in BinaryData Summary: For BinaryData, we used to maintain a vector of StringRef names and also a vector of pointers to MCSymbol's associated with the data. There was an unnecessary duplication of information and an associated overhead of keeping it in sync. Fix it by removing Names and using Symbols wherever Names were used. Also merge two variants of registerNameAtAddress() and remove unreachable/dead code in the process. (cherry picked from FBD19359123)	2020-01-10 16:17:47 -08:00
Maksim Panchenko	088e3c032a	[BOLT] Improve handling of secondary function entry points Summary: "Fix symbol table entries for secondary entries" diff broke the inliner. Fix the breakage and make the discovery of secondary entry points more accurate. Add ability to BinaryContext::getFunctionForSymbol() to return an entry point discriminator and use it instead of calling getEntryForSymbol() and isSecondaryEntry(). This is the preferred way since getFunctionForSymbol() is thread-safe. (cherry picked from FBD19295983)	2020-01-06 14:57:15 -08:00
Alexander Shaposhnikov	8c7f524afb	[BOLT] Fix build of the runtime on OSX Summary: Fix the compilation error on OSX (cherry picked from FBD19269806)	2020-01-02 16:20:13 -08:00
Rafael Auler	de284bc510	[BOLT] Fix symbol table entries for secondary entries Summary: Commit "Support full instrumentation" changed the map SymbolToFunction in BinaryContext to map secondary entries of functions too. This introduced unexpected behavior in our symbol table rewriting logic, which caused it to mistakenly write them with the address of the original function. Fix the behavior of getBinaryFunctionAtAddress to correct this. Also fix other users of SymbolToFunction to ensure they are not accidentally using secondary entries when they shouldn't. (cherry picked from FBD19168319)	2019-12-18 12:14:42 -08:00
Xin-Xin Wang	9aa276d349	[BOLT] Make .debug_loc update deterministic Summary: Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done. The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end. (cherry picked from FBD18153069)	2019-10-25 11:47:51 -07:00
Maksim Panchenko	d414acfbb6	[perf2bolt] Better mmap event matching Summary: When perf tool reports a mapping address of a binary, it is not always the address of the first loadable segment we were checking against. As a result, perf2bolt was not working properly for binaries where the first segment was not executable. The fix is to check if the address reported by mmap event matches any of the loadable segments. Note that the segment alignment has to be applied to get the real loadable address of the segment. Fixes facebookincubator/BOLT#65 (cherry picked from FBD19146419)	2019-12-17 11:17:31 -08:00
Rafael Auler	16a497c627	[BOLT] Support full instrumentation Summary: Add full instrumentation support (branches, direct and indirect calls). Add output statistics to show how many hot bytes were split from cold ones in functions. Add -cold-threshold option to allow splitting warm code (non-zero count). Add option in bolt-diff to report missing functions in profile 2. In instrumentation, fini hooks are fixed to run proper finalization code after program finishes. Hooks for startup are added to set up the runtime structures that need initialization, such as indirect call hash tables. Add support for automatically dumping profile data every N seconds by forking a watcher process during runtime. (cherry picked from FBD17644396)	2019-12-13 17:27:03 -08:00
Rafael Auler	e46d52de5b	[BOLT] Fix non-determinism in ICP with threads Summary: -icp-top-callsites selects candidates for optimization until a threshold is met. Currently, this parameter is set to 99% of calls by default. The order of functions evaluated changes in parallel mode, thus the functions that may be included to satisfy 99% of all calls may change, leading to different optimization decisions when running in parallel versus sequential. Fix this by enabling optimizations for all branches with the same frequency once we reach our budget instead of cutting off immediately after our budget is satisfied. In that way, order of functions has no impact on which functions are optimized. (cherry picked from FBD18902239)	2019-12-13 16:46:00 -08:00
Xin-Xin Wang	bdb60857e8	[BOLT] Make .debug_loc update deterministic Summary: Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done. The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end. (cherry picked from FBD18153069)	2019-10-25 11:47:51 -07:00
Maksim Panchenko	e5d1334ad5	[perf2bolt] Ignore mmap events unrelated to execution Summary: Some processes can mmap the main binary for the purpose of introspection. We should ignore such mmap events for fixed-address binaries. For PIC binaries, we record the mapping and do the address filtering later for all sample events. (cherry picked from FBD18844314)	2019-12-05 16:52:15 -08:00
Xin-Xin Wang	6f93d53bf5	[BOLT] Remove test for impossible debug ranges condition Summary: The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value. I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section. (cherry picked from FBD18153119)	2019-10-25 15:18:37 -07:00
Xin-Xin Wang	112c4251f5	[BOLT] Separate DebugRangesSectionsWriter into Ranges and ARanges Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic (cherry picked from FBD18153057)	2019-10-25 11:24:49 -07:00
Xin-Xin Wang	8e2d3f7c30	[BOLT] Fix invalid abbrev error when reading debug_info section with readelf Summary: This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump. Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead. (cherry picked from FBD18097728)	2019-10-23 15:19:49 -07:00
Rafael Auler	28f91871b3	[PERF2BOLT/BOLT] Improve support for .so Summary: Avoid asserting on inputs that are shared libraries with R_X86_64_64 static relocs and RELATIVE dynamic relocations matching those. Our relocation checking mechanism would expect the result of the static relocation to be encoded in the binary, but the linker instead puts it as an addend in the RELATIVE dyn reloc. Also fix aggregation for .so if the executable segment is not the first one in the binary. (cherry picked from FBD18651868)	2019-11-14 16:07:11 -08:00
Rafael Auler	4bcc53a408	[BOLT] Fix shrink wrapping empty BB issue Summary: When combining icp=calls and shrink wrapping, the former may generate empty BBs that are going to trigger a bug in the shrink wrapping restore placement strategy. The restore is wrongly pushed to the BB successor instead of being added to the current block. Add a pass to go over the CFG to fix empty blocks by adding a temporary NOP instruction that is going to be deleted later. Empty BBs are not supported by one of the analyses done at this pass. (cherry picked from FBD18717994)	2019-11-26 15:09:40 -08:00
Maksim Panchenko	3cc4fc267b	[BOLT] Proper support for -trap-avx512 option Summary: If -trap-avx512 option is not set, verify that we correctly encode AVX-512 instructions and treat them as ordinary instructions. (cherry picked from FBD18666427)	2019-11-22 14:53:20 -08:00
Maksim Panchenko	7350d40404	[BOLT][NFC] Refactor BinaryFunction::addEntryPoint() Summary: There is no need to support existing functionality of adding entry points after the CFG is built as the function is only called in empty or disassembled state. Previously we used to run disassemble+buildCFG per function, but now these phases are decoupled. Also, remove a couple of redundant checks. (cherry picked from FBD18622822)	2019-11-11 17:02:37 -08:00
Maksim Panchenko	a09659fd54	[BOLT] Refactor markAmbiguousRelocations() Summary: Refactor markAmbiguousRelocations() code and move it to BinaryContext. Also remove a redundant check. (cherry picked from FBD18623815)	2019-11-18 14:08:17 -08:00
Maksim Panchenko	658f270417	[BOLT] Refactor data PC relocations in BinaryContext Summary: We only use locations of PC relocations and ignore the rest of the data. There's no need to store type and value. (cherry picked from FBD18623280)	2019-11-19 18:52:08 -08:00
Maksim Panchenko	b07e870d78	[BOLT] Add BinarySection::flushPendingRelocations() (cherry picked from FBD18623527)	2019-11-20 00:16:19 -08:00
Maksim Panchenko	3b1b9916dd	[BOLT][NFC] Refactor data section emission code Summary: RewriteInstance::emitDataSection() -> BinarySection::emitAsData() (cherry picked from FBD18623050)	2019-11-19 14:47:49 -08:00
spupyrev	95a1c7f553	speeding up ext-tsp Summary: Speeding up cache+/ext-tsp block reordering algorithm. At a high level, the speedup is achieved by: - precomputing and memorizing all jumps between a pair of chains (instead of extracting them on every merge iteration); - using a cache of size O(|E|) instead of O(|V|^2) as in previous version. The final output is identical to previous one subject to a new deterministic comparison of double values. (cherry picked from FBD18380870)	2019-10-31 13:32:25 -07:00
Maksim Panchenko	6796b7216b	[BOLT] Fix jump table analysis for non-simple functions Summary: When we disassemble functions, we add discovered jump tables to a global container in BinaryContext. Later, we analyze and verify all jump tables. However, analysis for non-simple functions might fail for numerous reasons, e.g. there would be no instruction at a destination. Since we are not overwriting non-simple functions, it is not a critical error. Thus, we can safely skip jump table analysis for non-simple functions. (cherry picked from FBD18422997)	2019-11-10 21:09:01 -08:00
Maksim Panchenko	72b52edcbb	[BOLT] Free more memory in BinaryFunction::releaseCFG() Summary: Free more lists in BinaryFunction::releaseCFG(). Release BinaryFunction::Relocations after disassembly. Do not populate BinaryFunction::MoveRelocations as we are not using them currently. Also remove PCRelativeRelocationOffsets that weren't used. (cherry picked from FBD18413256)	2019-11-08 14:41:31 -08:00
Maksim Panchenko	d5ddb320ef	[BOLT] Free memory for CFG after emission Summary: Once we emit function code, we no longer need CFG for next phases that use basic blocks for address-translation and symbol update purposes. We free memory used by CFG and instructions. The freed memory gets reused by later phases resulting in overall memory usage reduction. We can probably improve memory consumption even further by replacing BinaryBasicBlocks with more compact data structures. (cherry picked from FBD18408954)	2019-10-31 16:54:48 -07:00
Maksim Panchenko	f2b257bec8	[BOLT] Update SDTs based on translation tables Summary: We used to emit special annotations to update SDT markers. However, we can just use "Offset" annotations for the same purpose. Unlike BAT, we have to generate "reverse" address translation tables. This approach eliminates reliance on instructions after code emission. (cherry picked from FBD18318660)	2019-11-03 21:57:15 -08:00
Maksim Panchenko	98e63610b1	[BOLT] Create OffsetTranslationTable for basic blocks Summary: Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes dependency on instructions after the code emission. (cherry picked from FBD18283965)	2019-11-01 16:19:45 -07:00
Maksim Panchenko	a1388308f0	[BOLT] Use NameResolver class for local symbols Summary: NameResolver class is used to assign unique names to local symbols. (cherry picked from FBD18277131)	2019-11-01 12:31:17 -07:00
Maksim Panchenko	1ed3ac17ff	[BOLT] Fix section offsets after debug stripping Summary: By default, we strip debug sections from the binary. Even though we did not write the sections, we allocated space for them in the output binary by mistake. (cherry picked from FBD18218708)	2019-10-29 14:49:49 -07:00
Maksim Panchenko	ed8be23e73	[BOLT][llvm] Reduce memory used by MCInst Summary: BOLT creates MCInst for every instruction from the input. For large binaries, this means we are creating tens if not hundreds of millions of instructions. If the number of operands for average instruction is much less than 8, we benefit from changing the type of Operands from SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems to be the optimal type for X86-64 on average. The size of MCInst goes down from 176 to 80 which often reduces BOLT memory consumption by gigabytes. (cherry picked from FBD18218924)	2019-10-28 17:40:18 -07:00
Rafael Auler	a3295715e4	[AArch64] Recognize one extra br idiom Summary: We do not support optimizing functions with jump tables in AArch64, but we do need to detect them. This idiom is slightly different from the ones we've seen before. It encodes jump table entries as relative to the jump table itself instead of relative to the indirect branch (BR) instruction. (cherry picked from FBD18191100)	2019-10-28 16:16:35 -07:00
Maksim Panchenko	8fb6512a23	[BOLT][Docs] Instructions for linking with jemalloc/tcmalloc (cherry picked from FBD18050722)	2019-10-21 15:57:36 -07:00
Maksim Panchenko	12aca4005c	[BOLT] Ignore __builtin_unreachable destination Summary: For functions with unknown control flow, do not populate TakenBranches with an entry pointing to the end of the function. (cherry picked from FBD18034019)	2019-10-20 20:46:32 -07:00
Rafael Auler	b807641e2a	[BOLT] Fix stale functions when using BAT Summary: If collecting data in Intel Skylake machines, we may face a bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This makes perf2bolt interpret it as an invalid trace, which ordinarily we discard during aggregation. However, in BAT, since we do not disassemble the binary where the collection happened but rely only on the translation table, it is not possible to detect bad traces and discard them. This gets to the fdata file, and this invalid trace ends up invalidating the profile for the whole function (by being treated as stale by BOLT). In this patch, we detect Skylake by looking for LBRs with 32 entries, and discard the first 2 entries to avoid running into this problem. It also fixes an issue with collision in the translation map by prioritizing the last basic block when more than one share the same output address. (cherry picked from FBD17996791)	2019-10-17 16:35:57 -07:00
Maksim Panchenko	103b0a77cc	[BOLT] Fix non-determinism while reading debug info Summary: When reading debug info in parallel, CUs for functions were populated in parallel and the order was non-deterministic. We used the first CU from the non-deterministically-ordered list to set the line number resulting in different outputs. The fix is to sort the list after it's been created and before assigning the line table unit. (cherry picked from FBD17946889)	2019-10-14 17:57:36 -07:00
Rafael Auler	698a4684ac	[BOLT] Fix merge-fdata and heatmap in BAT Summary: merge-fdata for legacy format was simply appending all input strings to output, but the real format supports some header strings that can't be simply concatenated to output. Check for the header string used by BAT before merging fdata to avoid creating an output file with invalid lines (header in the middle of the fdata file). For heatmap, avoid reading BAT tables, since they won't be used. (cherry picked from FBD17943131)	2019-10-11 13:32:14 -07:00
Xin-Xin Wang	d87f95065a	[BOLT] Add missing CMake test dependencies Summary: I noticed when setting up a new repository for bolt that bolt tests would fail unexpectedly when running `ninja check-bolt` and `ninja check-llvm`. This turns out to be because dependencies for bolt binaries were not specified in the CMake configuration so they were not built before running the tests. This diff adds the dependencies to the CMake configuration for check-bolt and check-llvm so that bolt binaries are built before running tests. (cherry picked from FBD17919505)	2019-10-14 16:03:54 -07:00
Maksim Panchenko	8c6ea8540a	[BOLT] Improve object discovery runtime Summary: (cherry picked from FBD17872824)	2019-10-08 11:03:33 -07:00
Rafael Auler	13948f376d	[BOLT] Do not emit BAT for non-simple in nonreloc Summary: Doing so causes corrupt entries to be emitted. (cherry picked from FBD17774505)	2019-10-04 16:28:03 -07:00
Mark Santaniello	c9f4bbdc22	[llvm-bolt] Bugfix jemalloc sized deallocation segfault Summary: C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`. Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue. (cherry picked from FBD17618981)	2019-09-26 16:51:22 -07:00
Rafael Auler	ba31344fa9	[BOLT] Fix build for Mac Summary: Change our CMake config for the standalone runtime instrumentation library to check for the elf.h header before using it, so the build doesn't break on systems lacking it. Also fix a SmallPtrSet usage where its elements are not really pointers, but uint64_t, breaking the build in Apple's Clang. (cherry picked from FBD17505759)	2019-09-20 11:29:35 -07:00
Maksim Panchenko	5e6d246b9c	[BOLT] Reword message for macro-op fusion optimization Summary: With the word "missed", the previous message about opportunities for macro-op fusion optimization could be misleading. (cherry picked from FBD17464603)	2019-09-18 15:33:03 -07:00
Maksim Panchenko	c823220116	[BOLT] Better check for compiler de-virtualization bug Summary: The existing check for compiler de-virtualization bug was not working when the relocation reference did not fall on a function boundary. As a result, we were falsely detecting "unmarked object in code". When running the check, the address could be arbitrary, except it shouldn't match any existing function. Additionally, check that there's a proper reference to the de-virtualized callee to avoid false positives. (cherry picked from FBD17433887)	2019-09-17 14:24:31 -07:00
Maksim Panchenko	e9c6c73bb8	[BOLT][non-reloc] Change function splitting in non-relocation mode Summary: This diff applies to non-relocation mode mostly. In this mode, we are limited by original function boundaries, i.e. if a function becomes larger after optimizations (e.g. because of the newly introduced branches) then we might not be able to write the optimized version, unless we split the function. At the same time, we do not benefit from function splitting as we do in the relocation mode since we are not moving functions/fragments, and the hot code does not become more compact. For the reasons described above, we used to execute multiple re-write attempts to optimize the binary and we would only split functions that were too large to fit into their original space. After the first attempt, we would know functions that did not fit into their original space. Then we would re-run all our passes again feeding back the function information and forcefully splitting such functions. Some functions still wouldn't fit even after the splitting (mostly because of the branch relaxation for conditional tail calls that does not happen in non-relocation mode). Yet we have emitted debug info as if they were successfully overwritten. That's why we had one more stage to write the functions again, marking failed-to-emit functions non-simple. Sadly, there was a bug in the way 2nd and 3rd attempts interacted, and we were not splitting the functions correctly and as a result we were emitting less optimized code. One of the reasons we had the multi-pass rewrite scheme in place, was that we did not have an ability to precisely estimate the code size before the actual code emission. Recently, BinaryContext obtained such functionality, and now we can use it instead of relying on the multi-pass rewrite. This eliminates redundant work of re-running the same function passes multiple times. Because function splitting runs before a number of optimization passes that run on post-CFG state (those rely on the splitting pass), we cannot estimate the non-split code size with 100% accuracy. However, it is good enough for over 99% of the cases to extract most of the performance gains for the binary. As a result of eliminating the multi-pass rewrite, the processing time in non-relocation mode with `-split-functions=2` is greatly reduced. With debug info update, it is less than half of what it used to be. New semantics for `-split-functions=<n>` (split functions into hot and cold regions): `=0`: do not split any function; `=1`: in non-relocation mode only split functions too large to fit into original code space; `=2`: same as 1 (backwards compatibility); `=3`: split all functions. (cherry picked from FBD17362607)	2019-09-11 15:42:22 -07:00
Wenlei He	615a318b60	[BOLT] Filter perf samples by PID Summary: `perf2bolt` accepts an executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimization for functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process. (cherry picked from FBD17178898)	2019-09-03 22:24:06 -07:00
Wenlei He	8cd1ba599b	[BOLT] Ignore LBR from kernel interrupts Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBR entries from perf input that are the result of kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale and thus bail out of optimization for functions. Ideally, a user-mode filter should be set for `perf record` so we don't have asynchronous LBRs. However, these are identifiable as the kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module. (cherry picked from FBD17170107)	2019-09-03 10:01:26 -07:00
Rafael Auler	cc4b2fb614	[BOLT] Efficient edge profiling in instrumented mode Summary: Change our edge profiling technique when using instrumentation so that we do not instrument every edge. Instead, build the spanning tree for the CFG and omit instrumentation for edges in the spanning tree. Infer the edge count for these edges when writing the profile during run time. The inference works with a bottom-up traversal of the spanning tree and establishes the value of the edge connecting to the parent based on a simple flow equation involving output and input edges, where the only unknown variable is the parent edge. This requires some engineering in the runtime lib to support dynamic allocation for building these graphs at runtime. (cherry picked from FBD17062773)	2019-08-07 16:09:50 -07:00
Rafael Auler	52786928ff	[BOLT] Fix perf2bolt race in BAT mode Summary: We start a thread to preprocess the profile while the main thread continues to disassemble the input binary. We should not disassemble it in BAT mode, however, the test to check whether we have BAT in the input binary depends on the preprocessing thread, so there is a race where we may start disassembling functions just because the preprocessing thread didn't conclude we are in BAT mode. Fix this and make the main thread check for BAT without depending on the preprocessing thread. (cherry picked from FBD17124370)	2019-08-29 16:18:43 -07:00
Rafael Auler	1f6564f117	[BOLT] Support .plt.got section Summary: We decode the regular .plt section and we are able to perform optimizations on it with -plt=hot or -plt=all, however, .plt.got sections are not decoded by BOLT. This patch teaches BOLT how to read them. They are created by the bfd linker whenever there is no need for the dynamic linker to lazy-bind the symbol (when they are eagerly resolved at binary load time). These entries are 8-byte sized instead of 16-byte sized like the regular PLT, and contain a single indirect call instruction with 7 bytes and a nop. (cherry picked from FBD17060515)	2019-08-26 15:03:38 -07:00
Rafael Auler	243507db99	[BOLT] Fix aggregator w.r.t. split functions Summary: We should not rely on split function detection while aggregating data, but only look up the original function names in the symbol table. Split function detection should be done by BOLT and not perf2bolt while writing the profile. Then, BOLT, when reading it, will take care of combining functions if necessary. This caused a bug in bolted data collection where samples in cold parts of a function were being falsely attributed to the hot part of a function instead of being attributed to the cold part, causing incorrect translation of addresses. (cherry picked from FBD16993065)	2019-08-23 12:18:31 -07:00
Maksim Panchenko	f588d7a6ea	[BOLT] Tighter control of jump table detection Summary: We were too permissive by allowing more jump tables during the preliminary scan of memory. This allowed for jump tables to be falsely detected. And since we didn't have a way to backtrack the jump table creation, we had to assert. This diff refactors the code that analyzes jump table contents. Preliminary and final passes share the same code. The only difference should be the detection of instruction boundaries that are available during the final pass. This should affect strict relocation mode only. (cherry picked from FBD16923335)	2019-08-19 14:06:36 -07:00
Maksim Panchenko	bf030f336a	[BOLT] Fix misleading output Summary: BOLT prints "spawning thread to pre-process profile" message even when it is not running in the aggregation mode. Fix that. (cherry picked from FBD16908596)	2019-08-19 17:11:42 -07:00
Rafael Auler	821480d27f	[BOLT] Encode instrumentation tables in file Summary: Avoid directly allocating string and description tables in binary's static data region, since they are not needed during runtime except when writing the profile at exit. Change the runtime library to open the tables on disk and read only when necessary. (cherry picked from FBD16626030)	2019-08-02 11:20:13 -07:00
Rafael Auler	62aa74f836	[BOLT] Support instrumentation via runtime library Summary: To allow the development of future instrumentation work, this patch adds support in BOLT for linking arbitrary libraries into the binary processed by BOLT. We use the orc relocation handling mechanism for that. With this support, this patch also moves code programmatically generated in X86 assembly language by X86MCPlusBuilder to C code written in a new library called bolt_rt. Change CMake to support this library as an external project in the same way as clang does with compiler_rt. This library is installed in the lib/ folder relative to the BOLT root installation, and by default instrumentation will look for the library at that location to finish processing the binary with instrumentation. (cherry picked from FBD16572013)	2019-07-24 14:03:43 -07:00
laith sakka	f77cccf681	Rename option (cherry picked from FBD16655093)	2019-08-05 13:56:48 -07:00
laith sakka	c1564a1026	Add test for parallel mode Summary: Add a flag that disables writing the bolt-info section, and add a test that runs bolt in two modes, parallel and sequential, and asserts that the resulting binaries are the same. (cherry picked from FBD16575440)	2019-07-30 17:55:27 -07:00
laith sakka	cc8415406c	Rewrite frame analysis using parallel utilities Summary: Rewrite frame analysis using parallel utilities (cherry picked from FBD16499130)	2019-07-25 11:57:08 -07:00
laith sakka	5084534699	Rewrite ICF using parallel utilities Summary: Rewrite ICF using parallel utilities (cherry picked from FBD16472975)	2019-07-24 17:13:15 -07:00
Maksim Panchenko	8d5854ef09	[BOLT] Add option to verify instruction encoder/decoder Summary: Add option `-check-encoding` to verify if the input to LLVM disassembler matches the output of the assembler. When set, the verification runs on every instruction in processed functions. I'm not enabling the option by default as it could be quite noisy on x86 where instruction encoding is ambiguous and can include redundant prefixes. (cherry picked from FBD16595415)	2019-07-31 16:03:49 -07:00
Maksim Panchenko	79ff4ec1cb	[perf2bolt] Enforce strict mode for perf2bolt Summary: In strict relocation mode, we get better function coverage. However, if the profile used for optimization was converted using non-strict mode, then it wouldn't match functions exclusive to strict mode. Hence, we have to enforce strict relocation mode for profile conversion, so it can be used for either mode. I'm also adding parallel profile pre-processing unless `--no-threads` is specified. This masks the runtime overhead of function disassembly on multi-core machines. (cherry picked from FBD16587855)	2019-06-11 13:24:10 -07:00
laith sakka	1bce256e67	Fix race condition in buildCFG Summary: Switch to sequential execution when print-all is passed, since the function getDynoStats has unsafe access to the annotation allocators. (cherry picked from FBD16503502)	2019-07-25 14:41:57 -07:00
laith sakka	6443c46b9d	Run hfsort+ in parallel Summary: hfsort+ performs an expensive analysis to determine the new order of the functions. 99% of the time during hfsort+ is spent in the function runPassTwo. This diff runs the body of the hot loop in runPassTwo in parallel speeding up the total runtime of reorder-functions pass by up to 4x (cherry picked from FBD16450780)	2019-07-23 15:49:02 -07:00
Maksim Panchenko	a9b9aa1e02	[BOLT] Add code padding verification Summary: In non-relocation mode, we allow data objects to be embedded in the code. Such objects could be unmarked, and could occupy an area between functions, the area which is considered to be code padding. When we disassemble code, we detect references into the padding area and adjust it, so that it is not overwritten during the code emission. We assume the reference to be pointing to the beginning of the object. However, assembly-written functions may reference the middle of an object and use negative offsets to reference data fields. Thus, conservatively, we reduce the possibly-overwritten padding area to a minimum if the object reference was detected. Since we also allow functions with unknown code in non-relocation mode, it is possible that we miss references to some objects in code. To cover such cases, we need to verify the padding area before we allow to overwrite it. (cherry picked from FBD16477787)	2019-07-23 20:48:41 -07:00
Maksim Panchenko	6722875047	[BOLT] Fix processing PLT without relocs Summary: Some binaries may not have a relocation section corresponding to PLT. Handle them properly. (cherry picked from FBD16477841)	2019-07-24 22:08:36 -07:00
Maksim Panchenko	98fdba2cc7	[BOLT][NFC] Fix white space (cherry picked from FBD16473918)	2019-07-24 17:54:14 -07:00
laith sakka	744a2417dd	Run findSubprograms in preprocessDebugInfo in parallel Summary: While reading debug info, the function findSubprograms runs on each compilation unit. This diff parallelizes that loop, reducing its runtime by 70%. (cherry picked from FBD16362867)	2019-07-17 20:54:53 -07:00
laith sakka	b50500893d	Lock-based parallelization for updateDebugInfo Summary: BOLT spends a decent amount of time creating patches to update debug information when -update-debug-sections is passed. In updateDebugInfo, patches are created to update .debug_info and .debug_abbrev sections while .debug_loc and .debug_ranges contents are populated. This diff uses a lock-based approach to parallelize updateDebugInfo functions and reduces the runtime of the function by around 30%. (cherry picked from FBD16352261)	2019-07-17 14:58:17 -07:00
Facebook Github Bot	86800abc81	[BOLT][PR] Target compilation based on LLVM CMake configuration Summary: Minimalist implementation of target configurable compilation. Fixes https://github.com/facebookincubator/BOLT/issues/59 Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/60 GitHub Author: Pierre RAMOIN <pierre.ramoin@amadeus.com> (cherry picked from FBD16461879)	2019-07-24 11:05:08 -07:00
Maksim Panchenko	2c9c6b164b	[BOLT] Fix issue printing CTCs without annotations Summary: After stripping annotations, conditional tail calls no longer can be identified by their corresponding tag. We can check the number of basic block successors instead. Fixes facebookincubator/BOLT#58. (cherry picked from FBD16444718)	2019-07-22 20:57:19 -07:00
laith sakka	fde5a2b470	Run shrink wrapping in parallel Summary: Shrink wrapping is an expensive part of frame optimizations if performed on all functions. This diff makes it run in parallel, reducing wall time. (cherry picked from FBD16092651)	2019-07-02 10:48:43 -07:00
laith sakka	7d42835418	Run buildCFG in disassembly in parallel Summary: This diff parallelizes the construction of the call graph during disassembly. The diff includes a change to parallel-utilities where another interface is added that supports running tasks on binaryFunctions that involve adding instruction annotations. This pattern is common in different places, e.g. frame optimizations, and as such justifies creating an interface that abstracts out all the messy details. (cherry picked from FBD16232809)	2019-07-12 07:25:50 -07:00
laith sakka	f4ab6e6924	run finalize functions in parallel Summary: (cherry picked from FBD16188733)	2019-07-10 10:59:56 -07:00
laith sakka	98539b0966	run aligner pass in parallel Summary: This diff parallelizes the aligner pass. (cherry picked from FBD16176327)	2019-07-09 17:59:41 -07:00
laith sakka	9977b03fea	Run reorder blocks in parallel Summary: This diff changes the reorderBasicBlocks pass to run in parallel. It does so by adding locks to the fix branches function and creating temporary MCCodeEmitters when estimating basic block code size. (cherry picked from FBD16161149)	2019-07-08 12:32:58 -07:00
Rafael Auler	1169f1fdd8	[BOLT] Support duplicating jump tables Summary: If two indirect branches use the same jump table, we need to detect this and duplicate jump tables so we can modify this CFG correctly. This is necessary for instrumentation and shrink wrapping. For the latter, we only detect this and bail, fixing this old known issue with shrink wrapping. Other minor changes to support better instrumentation: add an option to instrument only hot functions, add LOCK prefix to instrumentation increment instruction, speed up splitting critical edges by avoiding calling recomputeLandingPads() unnecessarily. (cherry picked from FBD16101312)	2019-07-02 16:56:41 -07:00
Rafael Auler	8880969ced	[BOLT] Restrict creation of jump tables Summary: Heuristic that creates a jump table for every memory access, including those we do not match against a pattern in an indirect jump, is too permissive and has false positives. Guard this logic under strict mode until we figure out a better strategy. (cherry picked from FBD16192205)	2019-07-10 15:41:34 -07:00
laith sakka	3cfc76cdbf	Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel. Summary: Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do. In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations. This will make it easier to implement parallel tasks, and eliminate redundant coding efforts. (cherry picked from FBD16116077)	2019-07-03 17:23:19 -07:00
laith sakka	f10d1fe0f3	Run cleanAnnotations within frame analysis in parallel Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations(). (cherry picked from FBD16096711)	2019-07-02 13:42:17 -07:00
laith sakka	00c252f6d8	Clean SPTMap in frame analysis in parallel Summary: This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 seconds on HHVM and bringing the runtime of the frame optimizer down to 33 seconds on HHVM. (cherry picked from FBD15914371)	2019-06-19 18:01:00 -07:00
laith sakka	86b529bd54	run SPT in parallel, and split annotation allocator Summary: This diff includes two main changes: 1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator. 2) SPT is parallelized. (cherry picked from FBD15913492)	2019-06-14 19:56:11 -07:00
Wenlei He	4e90fc1e3b	[BOLT] Prioritize Jump Table ICP target by frequency and index count Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that, we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices have a better chance of getting promoted. (cherry picked from FBD16099701)	2019-07-02 15:51:20 -07:00
Maksim Panchenko	078ece1691	[BOLT] Fix out-of-bounds entry points Summary: Check that a symbol address is less than the next function address before considering it for a secondary entry. (cherry picked from FBD16056468)	2019-06-28 11:53:34 -07:00
Maksim Panchenko	e89ad0db4b	[BOLT] Introduce strict relocation mode Summary: In strict relocation mode we rely on relocations to represent all possible entry points into a function. Most of the code generated by tested compilers (gcc and clang) will result in relocations against any internal labels for jump tables and for computed goto tables. In situations where we cannot properly reconstruct a jump table, or when we cannot determine a table that guides an indirect jump, e.g. when multiple computed goto tables are used, we conservatively assume that the indirect jump can end up at any possible basic block referenced by relocations. In strict mode, simple functions may include the aforementioned instructions with unknown control flow with a conservative list of destinations added to the containing basic block. This allows us to expand coverage of simple functions and to enable code reordering optimizations for more functions. The strict mode is recommended when BOLT is used with a well-formed code generated by a compiler. To use the strict mode, add "-strict" on the command line. Another effect of this diff, is that with relocations, we will always replace the immediate operand of an instruction with a symbol if the relocation exists against this operand. Also this diff fixes issues with Clang compiled with -fpic. (cherry picked from FBD15872849)	2019-06-28 09:21:27 -07:00
Maksim Panchenko	06e7a1e059	[BOLT] Ignore false function references Summary: A relocation can have an addend that makes it look as the relocated value is in a different section from the symbol being relocated. E.g., a relocation against a variable in .rodata could have a negative offset that will make it look like it is against a symbol in .text (a section that typically precedes .rodata). Unless the relocation is against a section symbol, we know exactly the symbol that is being relocated and there is no issue. However, when the linker leaves only a section relocation (i.e. a relocation against a section symbol when a temporary original symbol gets deleted), we have to guess the relocated symbol, and can falsely detect a function reference in the case described above. The fix is to keep a section relocation if the corresponding relocated value falls into a different section, and to detect and ignore false function reference. (cherry picked from FBD16030791)	2019-06-27 03:20:17 -07:00
Wenlei He	459add2827	[BOLT] Force non-relocation mode for heatmap generation Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbols in zero-sized code sections during function discovery. (cherry picked from FBD16009610)	2019-06-26 11:06:46 -07:00
Rafael Auler	0d23cbaa52	[BOLT] Initial experimental instrumentation pass Summary: An instrumentation pass that modifies the input binary to generate a profile after execution finishes. It modifies branches to increment counters stored in the process memory and injects a new function that dumps this data to an fdata file, readable by BOLT. This instrumentation is experimental and currently uses a naive approach where every branch is instrumented. This is not ideal for runtime performance, but should be good enough for us to evaluate/debug LBR profile quality against instrumentation. Does not support instrumenting indirect calls yet, only direct calls, direct branches and indirect local branches. (cherry picked from FBD15998096)	2019-06-19 20:10:49 -07:00
Rafael Auler	db02a1a142	[BOLT] Ignore empty funcs in relocation mode Summary: Make BOLT ignore empty functions (those containing no instructions, despite having some space allocated to it filled with zeroes). (cherry picked from FBD15981683)	2019-06-24 20:23:22 -07:00
Rafael Auler	bda13b7dd8	[BOLT] Add option to print profile bias stats Summary: Profile bias may happen depending on the hardware counter used to trigger LBR sampling, on the hardware implementation and as an intrinsic characteristic of relying on LBRs. Since we infer fall-through execution and these non-taken branches take zero hardware resources to be represented, LBR-based profile likely overrepresents paths with fall throughs and underrepresents paths with many taken branches. This patch adds an option to print statistics about profile bias so we can better understand these biases. The goal is to analyze differences in the sum of the frequency of all incoming edges in a basic block versus the sum of all outgoing. In an ideally sampled profile, these differences should be close to zero. With this option, the user gets the mean of these differences in flow as a percentage of the input flow. For example, if this number is 15%, it means, on average, a block observed 15% more or less flow going out of it in comparison with the flow going in. We also print the standard deviation so we can have an idea of how spread apart are different measurements of flow differences. If variance is low, it means the average bias is happening across all blocks, which is compatible with using LBRs. If the variance is high, it means some blocks in the profile have a much higher bias than others, which is compatible with using a biased event such as cycles to sample LBRs because it overrepresents paths that end in an expensive instruction. (cherry picked from FBD15790517)	2019-06-10 17:26:48 -07:00
laith sakka	1ec091e6f5	Parallelize ICF Pass Summary: ICF consumes 10-15% of bolt runtime; for HHVM that is around 45 seconds. This diff performs some parallelization of the pass to make it faster. A 60% reduction in ICF runtime is measured with the parallel version for HHVM. (cherry picked from FBD15589515)	2019-05-31 16:45:31 -07:00
Maksim Panchenko	9894de0094	[BOLT] Check instruction boundaries while populating jump tables Summary: Now that we populate jump tables after all functions are disassembled, we can check for instruction boundaries corresponding to jump table entries. No need to delegate this task to postProcessJumpTables(). (cherry picked from FBD15814762)	2019-06-13 15:31:30 -07:00
Maksim Panchenko	9e2ad3f593	[BOLT] Delay populating jump tables Summary: During the initial disassembly pass, only identify jump tables without populating the contents. Later, after all functions have been disassembled, we have a better idea of jump table boundaries and can do a better job of populating their entries. As a result, we no longer have embedded jump tables (i.e. a jump table that is part of another jump table). If we ever need to keep sequential jump tables inseparable during the output, we can always add such functionality later. Fixes facebookincubator/BOLT#56. (cherry picked from FBD15800427)	2019-06-12 18:21:02 -07:00
laith sakka	66cf16208f	Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis. Summary: During frame analysis, the functions do not change, and stack pointer tracking does not need to be performed more than one time. The current implementation performs the SPT analysis multiple times per function during the frame analysis; we can eliminate such redundant computation. On HHVM with -frame-opts=hot, this saves around a minute, which is 40% of the frame optimization runtime (129s to 76s). fdata should be passed for a reasonable evaluation (we need the call graph). However, this comes at a memory cost: around 2G added to the peak when only -frame-opt=hot is used, but when all the usual flags are passed, the effect on the peak is only 200K (measured from one test). Update: When jemalloc is used, the baseline became much better and the following runtimes are observed: [jemalloc] hhvm 85 --> 72. clang 27 --> 23. [malloc] hhvm 129 --> 76. clang 34 --> 27. (cherry picked from FBD15707003)	2019-06-06 12:58:14 -07:00
Maksim Panchenko	9df5063c0e	[perf2bolt] Option to use event PC with LBR stack Summary: Add an option to get extra profile trace using the recorded event PC. The trace goes from the latest LBR record destination to the event PC. (cherry picked from FBD15711804)	2019-06-06 19:38:06 -07:00
Maksim Panchenko	fac6a89c23	[BOLT] Better handling of address references Summary: We used to handle PC-relative address references differently from direct address references. As a result, some cases, such as escaped function label address, were not handled when dealing with absolute (non-PIC) code. This diff moves processing of an address reference into BinaryContext::handleAddressRef() which is called for both PIC and non-PIC code. (cherry picked from FBD15643535)	2019-06-04 15:30:22 -07:00
laith sakka	d3c1821f5f	Compile Bolt using std 14. Summary: Compile Bolt using std 14. We want that to be able to use some threading and locking tools that do not exist in std 11. (cherry picked from FBD15671736)	2019-06-05 10:32:29 -07:00
Rafael Auler	21f4303bfd	Support data collection in bolted binaries Summary: Similarly to how the compiler relies on DWARF to map samples, so it is possible to collect profile data in binaries optimized by PGO techniques and retrofit data to be used in a representation of the program that was not optimized by PGO, this diff implements an option in BOLT to encode a table in the output binary that allows us to map data collected in optimized binaries back to the address space used in the input binary (where the profile is useful, since we do not support running BOLT on a binary already optimized by BOLT). The goal is to offer an option to support BOLT in scenarios where it is not easy to run a special deployment of the binary with a version that was not optimized by BOLT just for data collection. This feature is enabled with the -enable-bat flag. BAT stands for BOLT Address Translation, which refers to the process of mapping output to input addresses. (cherry picked from FBD15531860)	2019-04-12 17:33:46 -07:00
Laith Sakka	3df2c9ea1f	Update SDT locations after bolt reordering Summary: Update SDT locations in the .note section to match the new locations after bolt reorders the code. (cherry picked from FBD15427779)	2019-05-17 07:58:27 -07:00
Maksim Panchenko	9ef9a7b1be	[BOLT] Use regex matching for function names passed on command line Summary: Options such as `-print-only`, `-skip-funcs`, etc. now take regular expressions. Internally, the option is converted to '^funcname$' form prior to regex matching. This ensures that names without special symbols will match exactly, i.e. "foo" will not match "foo123". (cherry picked from FBD15551930)	2019-05-29 18:33:09 -07:00
Laith Sakka	c8038da36e	Minor-fix: remove duplicate definition of SPT optimization timer Summary: (cherry picked from FBD28111560)	2019-05-22 15:03:42 -07:00
Maksim Panchenko	e5b1d9cd8c	[BOLT][NFC] Fix white space (cherry picked from FBD15485688)	2019-05-23 15:49:36 -07:00
Maksim Panchenko	f57d3c00fc	[BOLT] Better verification of jump tables Summary: Run analyzeIndirectBranch() using basic block boundaries instead of running ad-hoc validation of the jump table assumptions. (cherry picked from FBD15465034)	2019-05-22 18:14:34 -07:00
Maksim Panchenko	be344c8de7	[BOLT] Refactor handling of interproc refs Summary: Move handling of interprocedural references to BinaryContext. Post-process indirect branches immediately after the CFG is built. This is almost NFC. Since indirect branches are now post-processed before the profile data is processed it interferes with the way the profile data in YAML format is handled. (cherry picked from FBD15456003)	2019-05-22 11:26:58 -07:00
Maksim Panchenko	d047df12c5	[BOLT] Add an option to specialize memcpy() for 1 byte copy Summary: Add an option: -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,... to specialize calls to memcpy() in listed functions (the name could be supplied in regex) for size 1. The optimization will dynamically check if the size argument equals 1 and execute a one-byte copy; otherwise it will call memcpy() as usual. Specific call sites could be indicated after ":" using their numeric count from the start of the function. (cherry picked from FBD15428936)	2019-05-20 20:11:40 -07:00
Laith Saed Sakka	ca659e4336	Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations Summary: SDT markers that appear as nops in the assembly are preserved and not eliminated. Functions with SDT markers are also flagged. Inlining and folding are disabled for functions that have SDT markers. (cherry picked from FBD15379799)	2019-05-16 12:46:32 -07:00
Laith Saed Sakka	4755825447	Parse statically defined tracepoint markers from .note.stapsdt section Summary: Parse statically defined tracepoints(SDT) markers from the ELF file, and store them. Add an option to print SDTs (-print-sdt). Add test case for parsing and printing SDTs. (cherry picked from FBD15366712)	2019-05-15 17:19:18 -07:00
Rafael Auler	f1fde44154	[BOLT] Improve ICP activation policy and hot jt processing Summary: Previously, ICP worked with a budget of N targets to convert to direct calls. As long as the frequency of up to N of the hottest targets surpassed a given fraction (threshold) of the total frequency, say, 90%, then the optimization would convert a number of targets (up to N) to direct calls. Otherwise, it would completely abort processing this call site. The intent was to convert a given fraction of the indirect call site frequency to use direct calls instead, but this ends up being an "all or nothing" strategy. In this patch we change this to operate with the same strategy seen in LLVM's ICP, with two thresholds. The idea is that the hottest target of an indirect call site will be compared against these two thresholds: one checks its frequency relative to the total frequency of the original indirect call site, and the other checks its frequency relative to the remaining, unconverted targets (excluding the hottest targets that were already converted to direct calls). The remaining threshold is typically set higher than the total threshold. This allows us more control over ICP. I expose two pairs of knobs, one for jump tables and another for indirect calls. To improve the promotion of hot jump table indices when we have memory profile, I also fix a bug that could cause us to promote extra indices besides the hottest ones as seen in the memory profile. When we have the memory profile, I reapply the dual threshold checks to the memory profile which specifies exactly which indices are hot. I then update N, the number of targets to be promoted, based on this new information, and update frequency information. To allow us to work with smaller profiles, I also created an option in perf2bolt to filter out memory samples outside the statically allocated area of the binary (heap/stack). This option is on by default. (cherry picked from FBD15187832)	2019-05-02 12:28:34 -07:00
Maksim Panchenko	fee61231ef	[BOLT] Move JumpTable management to BinaryContext Summary: Make BinaryContext responsible for creation and management of JumpTables. This will be used for detection and resolution of jump table conflicts across functions. (cherry picked from FBD15196017)	2019-05-02 17:42:06 -07:00
Maksim Panchenko	4b55967d9e	[perf2bolt] Pass `-f` flag to perf Summary: perf tool requires the input data to be owned by the current user or root, otherwise it rejects the input. Use `-f` option to override this behavior. (cherry picked from FBD15160678)	2019-04-30 17:08:22 -07:00
Maksim Panchenko	310b32fbe5	[BOLT] Limit jump table size by containing object Summary: While checking for a size of a jump table, we've used containing section as a boundary. This worked for most cases as typically jump tables are not marked with symbol table entries. However, the compiler may generate objects for indirect goto's. (cherry picked from FBD15158905)	2019-04-30 15:47:10 -07:00
Maksim Panchenko	f1dfd38dec	[BOLT][NFC] Move DynoStats out of BinaryFunction Summary: Move DynoStats into separate source files. (cherry picked from FBD15138883)	2019-04-29 12:51:10 -07:00
Maksim Panchenko	2b1523362e	[BOLT] Strip debug sections by default Summary: We used to ignore debug sections by default, but we kept them in the binary which led to invalid debug information in the output. It's better to strip debug info and print a warning to the user. Note: we are not updating debug info by default due to high memory requirements for large applications. (cherry picked from FBD15128947)	2019-04-26 15:30:12 -07:00
Rafael Auler	21ee0e98c7	[BOLT] Fix symboltable update bug Summary: Commit "Update symbols for secondary entry points" introduced a bug by using getBinaryFunctionContainingAddress() instead of getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter would fetch the correct BinaryFunction object for addresses of functions that were ICF'd. As a result of this bug, the dynamic symbol table was not updated for function symbols that were folded by ICF. (cherry picked from FBD15112941)	2019-04-26 19:52:36 -07:00
Maksim Panchenko	caa0fafa18	[BOLT] Fix profile reading in non-reloc mode Summary: In non-relocation mode we may execute multiple re-write passes either because we need to split large functions or update debug information for large functions (in this context large functions are functions that do not fit into the original function boundaries after optimizations). When we execute another pass, we reset RewriteInstance and run most of the steps such as disassembly and profile matching for the 2nd or 3rd time. However, when we match a profile, we check `Used` flag, and don't use the profile for the 2nd time. Since we didn't reset the flag while resetting the rest of the states, we ignored profile for all functions. Resetting the flag in-between rewrite passes solves the problem. (cherry picked from FBD15110959)	2019-04-26 16:32:28 -07:00
Maksim Panchenko	5717b0c427	[perf2bolt] Fix print report for pre-aggregated profile Summary: For pre-aggregated profile, we were using the number of records in the profile for `NumTraces` ignoring the counts per record. As a result, the reported percentage of mismatched traces was bogus. (cherry picked from FBD15093123)	2019-04-25 16:34:50 -07:00
Maksim Panchenko	492e4a515e	[BOLT] Automatically enable -hot-text Summary: Enable -hot-text by default if reordering functions. Also fail immediately if function reordering is specified on the command line in non-relocation mode. (cherry picked from FBD15095178)	2019-04-25 17:00:05 -07:00
Brian Gesiak	91b2de3c23	[BOLT] Minimize BOLT's diff with LLVM by removing trivial changes (NFC) Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM. (cherry picked from FBD15064122)	2019-04-24 11:24:15 -04:00
Rafael Auler	4e4d39c21c	[BOLT] Update symbols for secondary entry points Summary: Update the output ELF symbol table for symbols representing secondary entry points for functions. Previously, those were left unchanged in the symtab. (cherry picked from FBD15010517)	2019-04-18 16:32:22 -07:00
Brian Gesiak	eba1a67730	Fix casting issues on macOS Summary: `size_t` is platform-dependent, and on macOS it is defined as `unsigned long long`. This is not the same type as is used in many calls to templated functions that expect the same type. As a result, on macOS, calls to `std::max` fail because a template function that takes `uint64_t, unsigned long long` cannot be found. To work around the issue: * Specify explicit `std::max` and `std::min` functions where necessary, to work around the compiler trying (and failing) to find a suitable instantiation. * For lambda return types, specify an explicit return type where necessary. * For `operator ==()` calls, use an explicit cast where necessary. (cherry picked from FBD15030283)	2019-04-22 11:27:50 -04:00
Brian Gesiak	d9f1bd42fd	[cmake] Only build enabled targets Summary: When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I encountered an error: ``` CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies): The dependency target "AArch64CommonTableGen" of target "LLVMBOLTTargetAArch64" does not exist. Call Stack (most recent call first): cmake/modules/AddLLVM.cmake:607 (llvm_add_library) tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library) ``` The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is added by CMake unconditionally. The LLVM project, on the other hand, only adds the subdirectories that are enabled, by using a `foreach` loop over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from `llvm/lib/Target/CMakeLists.txt`, to this project avoids the error. (cherry picked from FBD15030236)	2019-04-22 11:19:02 -04:00
Rafael Auler	3b422eafd0	[BOLT] Fix non-determinism in shrink wrapping Summary: Iterating over SmallPtrSet is non-deterministic. Change it to SmallSetVector. Similarly, do not sort a vector of ProgramPoint when computing the dominance frontier, as ProgramPoint uses the pointer value to determine order. Use a SmallSetVector there too to avoid duplicates instead of sorting + uniqueing. (cherry picked from FBD14992085)	2019-04-17 18:20:56 -07:00
Maksim Panchenko	433f3e3e02	[BOLT] Process CFIs for functions with FDE size mismatch Summary: If a function size indicated in FDE is different from the one in the symbol table, we can keep processing the function as we are using the max size for internal purposes. Typically this happens for assembly-written functions with padding at the end. This padding is not included in FDE, but it is in the symbol table. (cherry picked from FBD14987653)	2019-04-17 15:17:55 -07:00
Maksim Panchenko	99ef4c90c1	[BOLT] Basic support for split functions Summary: This adds very basic and limited support for split functions. In non-relocation mode, split functions are ignored, while their debug info is properly updated. No support in the relocation mode yet. Split functions consist of a main body and one or more fragments. For fragments, the main part is called their parent. Any fragment could only be entered via its parent or another fragment. The short-term goal is to correctly update debug information for split functions, while the long-term goal is to have a complete support including full optimization. Note that if we don't detect split bodies, we would have to add multiple entry points via tail calls, which we would rather avoid. Parent functions and fragments are represented by a `BinaryFunction` and are marked accordingly. For now they are marked as non-simple, and thus only supported in non-relocation mode. Once we start building a CFG, it should be a common graph (i.e. the one that includes all fragments) in the parent function. The function discovery is unchanged, except for the detection of `\.cold\.` pattern in the function name, which automatically marks the function as a fragment of another function. Because of the local function name ambiguity, we cannot rely on the function name to establish child fragment and parent relationship. Instead we rely on disassembly processing. `BinaryContext::getBinaryFunctionContainingAddress()` now returns a parent function if an address from its fragment is passed. There's no jump table support at the moment. Jump tables can have source and destinations in both fragment and parent. Parent functions that enter their fragments via C++ exception handling mechanism are not yet supported. (cherry picked from FBD14970569)	2019-04-16 10:24:34 -07:00
Maksim Panchenko	ffae5e73f3	[BOLT] Fix an issue with std::errc Summary: On some platforms `llvm::make_error_code(std::errc::no_such_process) == std::errc::no_such_process` evaluates to false. (cherry picked from FBD14944405)	2019-04-15 16:42:49 -07:00
Rafael Auler	31fc56b313	[BOLT] Fix adjustFunctionBoundaries w.r.t. entry points Summary: Don't consider symbols in another section when processing additional entry points for a function. (cherry picked from FBD14962853)	2019-04-16 14:35:29 -07:00
Maksim Panchenko	22ba3dc816	[BOLT] Add another section to the list of hot text movers Summary: (cherry picked from FBD14954472)	2019-04-16 10:39:05 -07:00
Maksim Panchenko	27dcec9717	[BOLT] Abort processing if the profile has no valid data Summary: It's possible to pass a profile in invalid format to BOLT, and we silently ignore it. This could cause a regression as such scenario can go undetected. We should abort processing if no valid data was seen in the profile and issue a warning if it was partially invalid. (cherry picked from FBD14941211)	2019-04-15 14:03:01 -07:00
Maksim Panchenko	8f98268518	[BOLT] Reduce warnings for non-simple functions Summary: If a function was already marked as non-simple, there's no reason to issue a warning that it has a reference in the middle of an instruction. Besides, sometimes there wouldn't be instructions disassembled at a given entry, and the warning would be incorrect. (cherry picked from FBD14938227)	2019-04-15 11:56:55 -07:00
Maksim Panchenko	e50e89be9e	[BOLT] Handle R_X86_64_converted_reloc_bit Summary: In binutils 2.30 a bfd linker accidentally started modifying some relocations on output under `-q/--emit-relocs` by turning on R_X86_64_converted_reloc_bit. As a result, BOLT ignored such relocations and failed to correctly update the binary. This diff filters out R_X86_64_converted_reloc_bit from the relocation type. (cherry picked from FBD14907832)	2019-04-11 17:11:08 -07:00
Maksim Panchenko	315ae74de3	[BOLT] Include <numeric> for std::iota Summary: Some compilers require <numeric> header. (cherry picked from FBD14868132)	2019-04-09 21:22:41 -07:00
Maksim Panchenko	88375d311e	[BOLT] Sort basic block successors for printing Summary: For easier analysis of the hottest targets of jump tables it helps to have basic block successors sorted based on the taken frequency. (cherry picked from FBD14856640)	2019-04-09 11:27:23 -07:00
Maksim Panchenko	a8e05d067d	[BOLT] Add interface to extract values from static addresses (cherry picked from FBD14858028)	2019-04-09 12:29:40 -07:00
Maksim Panchenko	7d89b113d8	[BOLT][NFC] Indentation fix (cherry picked from FBD14856700)	2019-04-09 11:31:45 -07:00
Rafael Auler	90996eb54b	[PERF2BOLT] Print a better message if perf.data lacks LBR Summary: If processing the perf.data in LBR mode but the data was collected without -j, currently we confusingly report all samples to mismatch the input binary, even though the samples match but lack LBR info. Change perf2bolt to detect this scenario and print a helpful message instructing the user to collect data with LBR. (cherry picked from FBD14817732)	2019-04-05 17:27:25 -07:00
Maksim Panchenko	624a0e810d	[DWARF][BOLT] Convert DW_AT_(low\|high)_pc to DW_AT_ranges only if necessary Summary: While updating DWARF, we used to convert address ranges for functions into DW_AT_ranges format, even if the ranges were not split and still had a simple [low, high) form. We had to do this because functions with contiguous ranges could be sharing an abbrev with non-contiguous range function, and we had to convert the abbrev. It turns out that the excessive usage of DW_AT_ranges may lead to internal core dumps in gdb in the presence of .gdb_index. I still don't know the root cause of it, but reducing the number of DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the issue. We can keep a simple range for DIEs that are guaranteed not to share an abbrev with any non-contiguous function. Hence we have to postpone the update of function ranges until we've seen all DIEs. Note that DIEs from different compilation units could share the same abbrev, and hence we have to process DIEs from all compilation units. (cherry picked from FBD14814043)	2019-04-01 20:26:41 -07:00
Maksim Panchenko	c8a927696c	[BOLT] Detect internal references into a middle of instruction Summary: Some instructions in assembly-written functions could reference 8-byte constants from other instructions using 4-byte offsets, presumably to save a couple of bytes. Detect such cases, and skip processing such functions until we teach BOLT how to handle references into a middle of instruction. (cherry picked from FBD14768212)	2019-04-03 22:31:12 -07:00
Maksim Panchenko	7fd487066f	[BOLT] Move BinaryFunctions into a BinaryContext and more Summary: A long due refactoring that makes interfaces cleaner and less awkward. Mainly makes the future work way easier. (cherry picked from FBD14766284)	2019-04-03 15:52:01 -07:00
Maksim Panchenko	8894853f42	[BOLT][DWARF] Dedup .debug_abbrev section patches Summary: When we patch .debug_abbrev we issue many duplicate patches. Instead of storing these patches as a vector, use a hash map. This saves some processing time and memory. (cherry picked from FBD14691292)	2019-03-29 14:22:54 -07:00
Maksim Panchenko	297d1a4e1a	[BOLT] Do not write jump table section headers Summary: In non-relocation mode we were accidentally emitting section headers for every single jump table. This happened with default `-jump-tables=basic`. (cherry picked from FBD14653282)	2019-03-27 13:58:31 -07:00
Maksim Panchenko	d1b76f2ac2	[BOLT] Allocate enough space past __hot_end for huge pages Summary: While using "-hot-text" option, we might not get enough cold text to fill up the last huge page, and we can get data allocated on this page producing undesirable effects. To prevent this from happening, always make sure to allocate enough space past __hot_end. (cherry picked from FBD14575100)	2019-03-21 21:13:45 -07:00
Maksim Panchenko	69faf61372	[BOLT] Fix section lookup while deleting symbols Summary: While removing redundant local symbols, we used new section index to lookup the corresponding section in the old section table. As a result, we used to either not remove the correct symbols, or remove the wrong ones. (cherry picked from FBD14552047)	2019-03-20 16:13:09 -07:00
Maksim Panchenko	b8d3dc40ea	[BOLT] Use local binding for cold fragment symbols Summary: We used to use existing symbol binding while duplicating and renaming cold fragment symbols. As a result, some of those were emitted with global binding. This confuses gdb, and it starts treating those symbols as additional entry points. The fix is to always emit such symbols with a local binding. This also means that we have to sort static symbol table before emission to make sure local symbols precede all others. (cherry picked from FBD14529265)	2019-03-19 13:46:21 -07:00
Maksim Panchenko	6bcb3389dd	[BOLT] Place hot text mover functions into a separate section Summary: Create a separate pass for assigning functions to sections. Detect functions originating from special sections (by default .stub and .mover) and place them into ".text.mover" if "-hot-text" option is specified. Cold functions are isolated from hot functions even when no function re-ordering is specified. (cherry picked from FBD14512628)	2019-03-15 13:43:36 -07:00
Maksim Panchenko	17cd2034f3	[BOLT] Fix debug line info emission Summary: GDB does not like if the first entry in the line info table after end_sequence entry is not marked with is_stmt. If this happens, it will not print the correct line number information for such address. Note that everything works fine starting with the first address marked with is_stmt. This could happen if the first instruction in the cold section wasn't marked with is_stmt. The fix is to always emit debug line info for the first instruction in any function fragment with is_stmt flag. (cherry picked from FBD14516629)	2019-03-18 19:22:26 -07:00
Maksim Panchenko	61ea19edf8	[BOLT][NFC] Fix compilation warnings Summary: Get rid of warnings while building with Clang. (cherry picked from FBD14495620)	2019-03-15 15:06:41 -07:00
Maksim Panchenko	0a55001a0e	[BOLT] Fix -hot-functions-at-end option Summary: Make "-hot-functions-at-end" option work again. (cherry picked from FBD14476242)	2019-03-14 20:32:04 -07:00
Maksim Panchenko	163adbec9f	[BOLT] Refactor allocatable sections rewrite part Summary: This refactoring makes it easier to create new code sections and control code placement. As an example, cold code is being placed into ".text.cold" which is emitted independently from ".text", and the final address assignment becomes more flexible. Previously, in non-relocation mode we used to emit temporary section name into .shstrtab. This resulted in unnecessary bloat of this section. There was unnecessary padding emitted at the end of text section. After fixing this, the output binary becomes smaller. I had to change the way exception handling tables are re-written as the current infra does not support cross-section label difference. This means we have to emit absolute landing pad addresses, which might not work for PIE binaries. I'm going to address this once I investigate the current exception handling issues in PIEs. This diff temporarily disables "-hot-functions-at-end" option. (cherry picked from FBD14475693)	2019-03-14 18:51:05 -07:00
Maksim Panchenko	a9e64947c5	[NFC][BOLT] Move ExecutableFileMemoryManager into its own file (cherry picked from FBD14474800)	2019-03-14 18:49:40 -07:00
Rafael Auler	c593563d1f	Do not assert on addresses read from processIndirectBranch Summary: As part of our heuristics to decode an indirect branch, if we suspect the branch is an indirect tail call, we add its probable target to the BC::InterproceduralReferences vector to detect functions with more than one entry point. However, if this probable target is not in an allocatable section, we were asserting. Remove this assertion and change the code to conditionally store to InterproceduralReferences instead. The probable target could be garbage at this point because of analyzeIndirectBranch failing to identify the load instruction that has the memory address of the target, so we should tolerate this. (cherry picked from FBD14432821)	2019-03-12 16:36:35 -07:00
Maksim Panchenko	0c704eb75a	[BOLT-HEATMAP] Initial heat map implementation Summary: Add heatmap subcommand to produce heatmaps based on perf.data with LBR. The output is produced in colored ASCII format. llvm-bolt heatmap -p perf.data <executable> -block-size=<uint> - size of a heat map block in bytes (default 64) -line-size=<uint> - number of entries per line (default 256) -max-address=<uint> - maximum address considered valid for heatmap (default 4GB) -o=<string> - heatmap output file (default stdout) (cherry picked from FBD13969992)	2019-02-05 15:28:19 -08:00
Maksim Panchenko	ff6e21290f	[BOLT] New inliner implementation Summary: Addresses correctness issues related to inlining. Inlining heuristics are not part of this diff. (cherry picked from FBD13796888)	2019-01-31 11:23:02 -08:00
Maksim Panchenko	365bd1f1c8	[BOLT] For non-simple functions always update jump tables in-place Summary: For non-simple functions we can miss a reference to a jump table or to an indirect goto table. If we move the jump table, the missed reference will not get updated, and the corresponding indirect jump will end up in the old (wrong) location. Updating the original jump table in-place should take care of the issue. (cherry picked from FBD13849776)	2019-01-28 13:46:18 -08:00
Rafael Auler	af81c7ff80	[perf2bolt] Add support for generating autofdo input Summary: Autofdo tools support. (cherry picked from FBD13779026)	2019-01-22 17:21:45 -08:00
Maksim Panchenko	c6ce2abb7d	[perf2bolt] Optimize memory usage in perf2bolt Summary: While converting perf profile, we only need CFG for functions that were profiled and can skip building CFG for the rest. This saves us some processing time and memory. Breakdown processing of perf.data into two steps. The first step parses the data, saves it in intermediate format, and marks functions with the profile. The second step attributes the profile to functions with CFG. When we disassemble and build CFG for functions in aggregate-only mode, we skip functions without the profile. (cherry picked from FBD13706697)	2019-01-15 23:43:40 -08:00
Maksim Panchenko	2fe0c38d6b	[perf2bolt] Better tracking of process forking Summary: Improve tracking of forked processes. If a process corresponding to the input binary has forked/started before 'perf record' was initiated, then the full name of the binary will be recorded in a corresponding MMAP2 event. We've been handling such cases well so far. However, if the process was forked after 'perf record' has started, and execve(2) wasn't called afterwards, then there will be no MMAP2 event recorded corresponding to the mapping of the main binary (unrelated MMAP2 events could still be recorded). To track such cases, we need to parse 'perf script --show-task-events' command output, and to scan for PERF_RECORD_FORK events, and then add forked process PIDs to the list associated with the input binary. If the fork event was followed by an exec event (PERF_RECORD_COMM exec) of a different binary, then the forked PID should be ignored. If the exec event was associated with our input binary, then the correct MMAP2 event was recorded and parsed. To track if the event occurred before or after 'perf record', we parse event's time. This helps us to differentiate some events. E.g. the exec event is only registered correctly if it happened after perf recording has started (otherwise the "exec" part is missing), and thus we only record forks with non-zero time stamps. (cherry picked from FBD13250904)	2018-11-21 20:04:00 -08:00
Maksim Panchenko	067a385000	[BOLT] Add thresholds for function splitting Summary: Use newly added function size estimation to measure the effectiveness and guide function splitting. Two new tuning options are added: -split-threshold=<uint> split function only if its main size is reduced by more than given amount of bytes. Default value: 0, i.e. split iff the size is reduced. Note that on some architectures the size can increase after splitting. -split-align-threshold=<uint> when deciding to split a function, apply this alignment while doing the size comparison (see -split-threshold). Default value: 2. (cherry picked from FBD13136352)	2018-11-15 16:03:34 -08:00
Maksim Panchenko	b0f7fddd35	[BOLT] Add method for better function size estimation Summary: Add BinaryContext::calculateEmittedSize() that ephemerally emits code to allow precise estimation of the function size. Relaxation and macro-op alignment adjustments are taken into account. (cherry picked from FBD13092139)	2018-11-15 16:02:16 -08:00
Maksim Panchenko	e1b8fade7f	[BOLT] Add branch priority policy for blocks with 2 successors Summary: On x86 the difference between long and short jump instructions could be either 4 or 3 bytes, depending if it's a conditional jump or not. For a basic block with 2 jump instructions, if we know that one of the successors is in a different code region, then we can make it a target of an unconditional jump instruction. This will save 1 byte in case the conditional jump happens to be a short one. (cherry picked from FBD13078139)	2018-11-14 14:43:59 -08:00
Maksim Panchenko	40d9fcfdca	[BOLT] Workaround for Clang de-virtualization bug Summary: When Clang is boot-strapped with (Thin)LTO, it may produce a code fragment similar to below: .LFT663334 (6 instructions, align : 1) Predecessors: .LFT663333 00000538: movb $0x1, %al 0000053a: movl %eax, -0x2c(%rbp) 0000053d: movl $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx 00000542: testb $0x1, %cl 00000545: movq -0x40(%rbp), %r14 00000549: je .Ltmp1071462 Successors: .Ltmp1071462, .LFT663335 .LFT663335 (2 instructions, align : 1) Predecessors: .LFT663334 0000054b: movq (%r12), %rax 0000054f: movq .Ltmp0(%rax), %rcx Successors: .Ltmp1071462 .Ltmp1071462 (7 instructions, align : 1) Predecessors: .LFT663334, .LFT663335 00000556: movq %r12, %rdi 00000559: callq *%rcx ....... The code above is making a call by dereferencing a pointer to a member function. A pointer to a member function could either be a regular function, or a virtual function. To differentiate between the two, AMD64 ABI (originated from Itanium ABI) uses the last bit of the pointer. The call instruction sequence varies depending if the function is virtual or not, and the pointer's last bit is checked. If it's "1" then the value of the pointer (minus 1) is used as an offset in the object vtable to get the address of the function, otherwise the pointer is used directly as a function address. In this specific case, a de-virtualization is taking place, but it's not complete. Compiler knows that the member function pointer is actually a non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka "clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore keeps the code (unreachable/dead) to make a virtual call while using (_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable. This is obviously wrong, but since the code is unreachable, it will never affect the runtime correctness. 
The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT creates a label ".Ltmp0" pointing to this last byte that is referenced by the instruction sequence above. It just happens that the last byte is also in the middle of the last instruction, and as a result, BOLT never emits the label, hence resulting in the error message "Undefined temporary symbol". The workaround is to detect non-pc-relative relocations from code pointing to some (fptr - 1). Note that this is not completely error-proof, but non-pc-relative references from code into a middle of a function are quite rare, and chances that in a normal situation they will point to a byte preceding some function address are virtually zero. (cherry picked from FBD13030310)	2018-11-12 12:38:50 -08:00
Maksim Panchenko	30fd960951	[BOLT] Update local symbol count in symbol table Summary: Fix sh_info entry for symbol table section to reflect updated number of local symbols. (cherry picked from FBD10503216)	2018-10-22 18:48:12 -07:00
Maksim Panchenko	a76b13d48e	[perf2bolt] Pre-aggregate LBR samples Summary: Pre-aggregating LBR data cuts perf2bolt processing times in half. (cherry picked from FBD10420286)	2018-10-02 17:16:26 -07:00
Rafael Auler	74a71c6812	Fix bug in analyzeRelocation for GOT entries Summary: Special case GOT relocs to ignore addend subtracting logic in analyzeRelocation, since the addend does not refer to the target of the instruction being analyzed. Also make the code honor the comments in the special case about zeroed out ExtractValue but non-zero addend. Fix facebookincubator/BOLT#40 (cherry picked from FBD10355019)	2018-10-11 18:12:09 -07:00
Facebook Github Bot	b166ccbea8	[BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis Summary: This pull request fixes two compiler warnings: - missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning) - misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning) Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39 GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de> (cherry picked from FBD10202092)	2018-10-04 10:46:16 -07:00
Igor Sugak	c3c80822a3	[BOLT] Capitalize i Summary: as titled (cherry picked from FBD10136655)	2018-10-01 16:22:46 -07:00
Igor Sugak	cc2276d3f1	[BOLT] fix build with gcc-4.8.5 Summary: These are two minor changes to make it compatible with gcc-4.8.5 (cherry picked from FBD9884971)	2018-09-17 12:17:33 -07:00
Maksim Panchenko	ce508b58c6	[BOLT] Support relocations without symbols Summary: lld may generate relocations without associated symbols. Instead of rejecting binaries with such relocations, we can re-create the symbol the relocation is against based on the extracted value. (cherry picked from FBD10054576)	2018-09-21 12:00:20 -07:00
Rafael Auler	bd0b99c45d	[BOLT] Change stub-insertion pass for AArch64 Summary: Previously, we were expanding eligible branches with stubs. After expansion, we were computing which stubs were unnecessary and removing them, assuming ranges were shortening as code is removed. The problem with this approach is that for branches that refer to code that is not managed by BOLT, the distance to that location can increase and we can end up with an out-of-range branch. This rewrites the pass to be simpler, only increasing size and expanding code with stubs as needed after each iteration, stopping when code stops increasing. Besides this rewrite, the stub-insertion pass now supports stubs grouping similar to what the linker does, allowing different functions to share the same veneer that jumps to a common callee. It also fixes a bug in the previous implementation that, in very large functions that use TBZ/TBNZ (+-32KB range), it would mistakenly try to reuse a local stub BB that is out of range. This includes a change to allow hot functions to be put at the end of the .text section, closer to the heap, requiring no veneers to jump to JITted code. And finally it enables eliminate veneers pass by default. (cherry picked from FBD10023158)	2018-09-17 13:36:59 -07:00
Maksim Panchenko	1387a9d761	[BOLT] Keep .text section in file when using old text Summary: If we reuse text section under `-use-old-text` option, then there's no need to rename it. Tools, such as perf, seem to not like binaries without `.text`. Additionally, check if the code fits into `.text` using the page alignment, otherwise we were skipping the alignment relying on the user detecting the warning message. This could have resulted in unexpected performance drops. Also add `-no-huge-pages` option to use regular page size for code alignment purposes (i.e. 4KiB instead of 2MiB). (cherry picked from FBD10024670)	2018-09-24 20:58:31 -07:00
Maksim Panchenko	53b72d0f2e	[BOLT] Ignore symbols from non-allocatable sections Summary: While creating BinaryData objects we used to process all symbol table entries. However, some symbols could belong to non-allocatable sections, and thus we have to ignore them for the purpose of analyzing in-memory data. (cherry picked from FBD9666511)	2018-09-05 14:36:52 -07:00
Maksim Panchenko	8026760ac0	[BOLT] Fix another issue with profile after ICP Summary: For jump tables ICP was using profile from the jump table itself which doesn't work correctly if the jump table is re-used at different code locations. (cherry picked from FBD9618774)	2018-08-30 13:21:50 -07:00

832 Commits