llvm-project

Commit Graph

Author	SHA1	Message	Date
Zino Benaissa	60b15062e1	[BOLT] Dump dynamic execution per instruction opcode Summary: We extended DynoStats to dump the histogram per instruction opcode. By default the dump is turned off. Use '-print-dyno-opcode-stats' to enable the dump. BOLT also dumps for each instruction opcode the maximum execution count and corresponding function name and basic block offsets where the instruction occurs. Below is a sample of the dump: Opcode, Execution Count, Max Exec Count, Function Name:Offset SHR8rCL, 232, 232, _ZNK5folly14AsyncSSLSocket4goodEv:53 VPADDDYrr, 13956, 388, chacha20_encrypt_bytes.part.0/3:736 PMOVSXBWrr, 4, 2, ares_expand_name/1:264 VMOVAPSmr, 1082, 43, chacha20_encrypt_bytes.part.0/3:2864 VPSHUFBrr, 9540, 1667, chacha20_encrypt_bytes.part.0/3:4416 VPUNPCKLDQYrr, 1102, 188, jsimd_ycc_rgb_convert_avx2/1:125 VPBROADCASTQYrm, 39, 39, chacha20_encrypt_bytes.part.0/3:400 PMOVSXWDrr, 8, 2, ares_expand_name/1:264 VPORrr, 817, 129, jsimd_idct_islow_avx2/1:41 PSLLDri, 8690752, 65644, blockmix_salsa8_xor/1:1424 (cherry picked from FBD28859624)	2021-05-24 21:33:43 -07:00
Maksim Panchenko	c9f5f47b51	[BOLT] Add support for .plt.sec and refactor PLT-reading code Summary: A binary can contain multiple PLT sections with different name and attributes (such as an entry size). Extend the support to .plt.sec and refactor the code to make future extensions simpler. (cherry picked from FBD29502107)	2021-06-30 14:41:41 -07:00
Joey Thaman	4c12afc1f4	[BOLT][NFC] Resolved all clang-12 warnings for bolt Summary: clang-12 now compiles bolt without warnings. Some warnings were fixed if possible while others were suppressed by doing (void)variable for unused variable warnings or moving code inside assert statements of LLVM_DEBUG blocks. (cherry picked from FBD29469054)	2021-06-29 12:11:56 -07:00
Maksim Panchenko	1de0746790	[BOLT] Read all dynamic relocations and refactor code Summary: Add code to read more dynamic relocations (DT_JMPREL) and enforce strict checks that corresponding sections sizes match .dynamic entry description. (cherry picked from FBD29502109)	2021-06-30 14:38:50 -07:00
Alexander Yermolovich	f7499c6711	[BOLT][DWARF] Fix writing out dwo with DWP as input Summary: The code for writing out dwo files wasn't handling case where DWP is an input. Because all the sections are part of the same binary. One note with current implementation. .debug-str.dwo will have strings for all the dwo objects. This is because llvm-dwp de-duplicates strings and combines them in to one section. It then re-writes .debug-str-offsets.dwo to point to new .debug-str.dwo section. (cherry picked from FBD29244835)	2021-06-18 15:57:34 -07:00
Maksim Panchenko	3e5ce1f282	[BOLT][TESTS] Remove dynamic relocations from YAML tests Summary: Our YAML objects contain references to dynamic relocations via .dynamic, but there are no corresponding relocation sections. Change .dynamic contents to specify no dynamic relocations. (cherry picked from FBD29502108)	2021-06-30 14:33:59 -07:00
Amir Ayupov	a07d24cc4b	[BOLT][NFC] Un-inline checking AArch64 linker veneers out of disassemble loop Summary: Move the AArch64 `matchLinkerVeneer` check out of a for-loop in `BinaryFunction::disassemble` (cherry picked from FBD29411348)	2021-06-25 17:52:51 -07:00
Amir Ayupov	c7c0803b59	[BOLT][NFC] Un-inline indirect branch handling out of disassemble loop Summary: Move the `processIndirectBranch` switch statement out of a for-loop in `BinaryFunction::disassemble` (cherry picked from FBD29411346)	2021-06-25 17:49:43 -07:00
Amir Ayupov	8f751bc058	[BOLT][NFC] Un-inline adding external references out of disassemble loop Summary: Move the code that handles true external references (non-unreachable) out of a for-loop in `BinaryFunction::disassemble`. (cherry picked from FBD29411345)	2021-06-25 17:32:25 -07:00
Amir Ayupov	8f7a400629	[BOLT][NFC] Delete MoveRelocations entirely Summary: MoveRelocations are unused. Remove interfaces and emission part. (cherry picked from FBD29468409)	2021-06-25 17:06:21 -07:00
Maksim Panchenko	38c5887992	[BOLT][NFC] Always process runtime relocations Summary: Dynamic relocations applied at runtime should be processed even in non-relocation mode. (cherry picked from FBD29311906)	2021-06-22 13:46:06 -07:00
Amir Ayupov	ef1b1e7184	[BOLT][NFC] Refactor handlePCRelOperand Summary: Move error logging to handlePCRelOperand, reduce code duplication (cherry picked from FBD29309702)	2021-06-17 17:41:28 -07:00
Amir Ayupov	b964e852d5	[BOLT][NFC] Readability improvements in X86,Aarch64 MCPlusBuilder Summary: Minor refactorings in target specific MCPlusBuilders to improve readability (cherry picked from FBD29309701)	2021-06-17 18:22:32 -07:00
James Luo	dea6c247d9	[BOLT][CSSPGO] Relate decoded pseudo probe basic blocks Summary: Assign decoded pseudo probe to correlated output block Pseudo probes can then be encoded to a proper address (cherry picked from FBD29211688)	2021-06-25 11:42:58 -07:00
Amir Ayupov	521a61b056	[BOLT][NFC] Use MCPlusBuilder::isPseudo Summary: Consistently use this interface across BOLT codebase (cherry picked from FBD29171718)	2021-06-16 12:10:20 -07:00
Amir Ayupov	da276d73c7	[BOLT] Handle R_X86_64_64 in flushPendingRelocations Summary: Handle R_X86_64_64 the same way as R_X86_64_32; `getSizeForType` takes care of the size: ```x86_64 ABI relocation types Name Value Field Calculation R_X86_64_64 1 word64 S + A R_X86_64_32 10 word32 S + A ``` (cherry picked from FBD29370417)	2021-06-24 12:18:16 -07:00
Maksim Panchenko	f46af9e9bc	[BOLT][TESTS] Fix ICF test case Summary: Host compiler may generate duplicate functions and as a result BOLT can fold more than 1 function. (cherry picked from FBD29347302)	2021-06-23 16:13:30 -07:00
Joey Thaman	be0da0fac2	Throw an error in instrument for dynamic libs Summary: In InstrumentationRuntimeLibrary, throw an error if the program uses dynamic libraries (cherry picked from FBD29265147)	2021-06-21 07:45:52 -07:00
Maksim Panchenko	bbbd159ccb	[BOLT] Fix undefined symbol warnings/errors Summary: When we fold a function in relocation mode, make sure to clear its state to avoid emitting relocations against undefined symbols. (cherry picked from FBD29245320)	2021-06-18 14:35:39 -07:00
Sameeran joshi	ba915af1cd	[PR][BOLT] Print revision in perf2bolt and bolt-diff modes Summary: Fix issue facebookincubator/BOLT#160 PR facebookincubator/BOLT#172 (cherry picked from FBD29139522)	2021-06-08 23:28:37 +05:30
Rafael Auler	e485a9830b	Rebase: [BOLT][DebugFission] Fix reading support for DWP Summary: After digging further into the DWARF APIs and llvm-symbolizer, this is a more streamlined way of doing it, and the address base gets set properly. Writing out dwo files with dwp input will be a separate patch. (cherry picked from FBD31361529)	2021-06-16 09:52:03 -07:00
Vladislav Khmelevsky	a8b9319536	[PR] Patch allocatable relocations for AArch64 Summary: PR facebookincubator/BOLT#166 Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei (cherry picked from FBD28910060)	2021-06-02 00:03:56 +03:00
Vladislav Khmelevsky	2cf9008a60	[PR] Instrumentation: Disable signals on mutex lock Summary: When an indirect call is instrumented, it locks SimpleHashTable's mutex on the get() call. If we receive a signal while the mutex is locked and the signal handler also calls an indirect function, we end up with a deadlock. PR facebookincubator/BOLT#167 Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei (cherry picked from FBD28909921)	2021-06-04 19:51:06 +03:00
Maksim Panchenko	1efadeedf2	[BOLT] Fix rodata load simplification pass Summary: If the target address has a runtime relocation against it, do not perform the load simplification. (cherry picked from FBD29091939)	2021-06-13 15:37:31 -07:00
Amir Ayupov	f7f0a571d7	[BOLT][NFC] Suppress addList override warning Summary: Suppresses the warning ``` src/DebugData.h:338:20: warning: 'addList' overrides a member function but is not marked 'override' [-Wsuggest-override] ``` (cherry picked from FBD28858201)	2021-06-02 19:12:13 -07:00
James Luo	8a919593c7	[BOLT][CSSPGO] Pseudo probe decoding Summary: Make bolt decode pseudo probe section in binary For more detail of pseudo probe, check https://reviews.llvm.org/D86490. (cherry picked from FBD28856316)	2021-06-11 13:06:12 -07:00
Alexander Yermolovich	226d1c3b0b	[BOLT] Change how DF DWO logging is handled Summary: Changing assert to a warning when DWO debug information can't be retrieved. Usually due to invalid path. (cherry picked from FBD29005217)	2021-06-09 12:55:09 -07:00
Amir Ayupov	2da5b12a3d	[BOLT] Hugify: check for THP support via sysfs Summary: Remove dependence on kernel version check, query sysfs directly instead. (cherry picked from FBD28858208)	2021-06-02 19:11:52 -07:00
Maksim Panchenko	7bccf8d25d	[BOLT][NFC] Fix debug info printouts for inlined functions Summary: While printing debug info for instructions, we should use line tables from the corresponding DWARF CU which could be different from the containing function CU in case of inlined instructions. (cherry picked from FBD28908324)	2021-06-04 12:31:31 -07:00
Amir Ayupov	65d227c035	[BOLT][TEST] Fix test case to conform to analyzePICJumpTable pattern matching Summary: Make sure that jump table is properly recognized in `split_func_jump_table_fragment.s`. (cherry picked from FBD28839976)	2021-06-02 10:50:47 -07:00
James Luo	1c06193d0f	[BOLT] Resolve JumpTable namespace issue in pseudo probe decoder migration Summary: This diff fixes the JumpTable namespace conflicts during the migration of pseudo probe decoder. (cherry picked from FBD28859927)	2021-06-02 22:46:57 -07:00
Maksim Panchenko	a26370389a	[BOLT][NFC] Disable ProcessAllSections in RuntimeDyld Summary: FBD55943 changed the way ProcessAllSections works in RuntimeDyld. After the change, all sections, including symbol table, section table, etc. are loaded into memory whenever ProcessAllSections is enabled. In BOLT we rely on RuntimeDyld for processing sections with relocations. These include most allocatable sections and additionally .debug_line. The latter is skipped by RuntimeDyld without ProcessAllSections flag. If we enable ProcessAllSections, we will have to deal with allocating memory for more sections than we need (see above) and later to filter them out. The alternative is to mark all sections that we actually plan to use as "required for execution" (using RuntimeDyld terminology). For .debug_line section on ELF it means adding SHF_ALLOC flag. On MachO, RuntimeDyld currently treats all sections as required. (cherry picked from FBD28729398)	2021-05-26 16:23:34 -07:00
Vladislav Khmelevsky	5a6c379f5b	[PR] Instrumentation: Emit paddings to preserve data alignment Summary: Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei facebookincubator/BOLT#156 (cherry picked from FBD28521843)	2021-05-14 14:09:05 +03:00
Vladislav Khmelevsky	79807d99fe	[PR] Introduce loop inversion pass Summary: This patch introduces LoopInversionPass. Its main purpose is to ensure that the loop layout is optimal depending on the profile information. So if profile information shows that the loop is used, the unconditional jump instruction must be executed only once and vice-versa. Please take a look to the pass header file and test for more details. Also change link_fdata script a bit, to be able to change FDATA prefix, like FileCheck does. Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei PR facebookincubator/BOLT#153 (cherry picked from FBD28391811)	2021-05-11 20:59:13 +03:00
Amir Ayupov	12e9fec697	Rebase: [BOLT] DebugFission Support Summary: Implemented support for Debug Fission. For the most part it doesn't impact Monolithic execution path. One area that was changed is the DW_AT_low_pc/DW_AT_high_pc conversion. Before it was to DW_AT_ranges/DW_AT_low_pc, now DW_AT_low_pc is kept in same place. Another more visible impact is in Skeleton CU the DW_AT_low_pc is replaced with DW_AT_ranges_base if it's not originally present and bolt converted ranges conversion inside the dwo units. Output of this are multiple .dwo files with updated debug information. (cherry picked from FBD29569788)	2021-04-01 11:43:00 -07:00
Amir Ayupov	99d7f90635	[BOLT][NFC][TEST] Added llvm-dwarfdump and llvm-mc to BOLT_TEST_DEPS (cherry picked from FBD28427352)	2021-05-13 15:36:43 -07:00
Maksim Panchenko	ba6fdb8113	[BOLT] Preserve original jump table relocations Summary: Remove relocations against internal function labels, e.g. jump table relocations, only when overwriting them. While reading an input file with relocations, we create internal relocations against code references (we skip PIC relocations). Later, when we discover jump tables, we remove corresponding relocations with the assumption that original relocations will either be ignored or replaced by new relocations. However, it is possible to miss some references to the jump table, in which case the original entries will not be ignored. While such situation is abnormal, it is still a better/safer approach to preserve relocations if we are not replacing them with new ones. (cherry picked from FBD28406628)	2021-05-12 23:35:10 -07:00
Maksim Panchenko	81c59d9a54	[BOLT][NFC] Change interface for searching relocations (cherry picked from FBD28406629)	2021-05-12 23:29:04 -07:00
Amir Ayupov	500edf26c9	[BOLT][NFC] Address warning about ProgramPoint implicit copy constructor Summary: Explicit assignment operator can be replaced with an implicit one. Remove it to allow an implicit copy constructor: ``` bolt/src/Passes/DataflowAnalysis.h:74:8: warning: definition of implicit copy constructor for 'ProgramPoint' is deprecated because it has a user-declared copy assignment operator [-Wdeprecated-copy] void operator=(const ProgramPoint &PP) { ^ bolt/src/Passes/DataflowAnalysis.h:62:14: note: in implicit copy constructor for 'llvm::bolt::ProgramPoint' first required here return ProgramPoint(&*Last); ``` (cherry picked from FBD28335138)	2021-05-10 14:16:25 -07:00
Maksim Panchenko	fe37f1870e	[BOLT][NFC] Follow LLVM variable initialization style (cherry picked from FBD28417604)	2021-05-13 10:50:47 -07:00
Vladislav Khmelevsky	b728bfc70a	[PR] Add missing includes Summary: Adds missing headers removed by IWYU. NB: this caused build breakage on ubuntu-latest Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei (cherry picked from FBD28368185)	2021-05-11 15:55:57 +03:00
Vladislav Khmelevsky	de298c08fd	[PR] Fix tests build with -no-pie option Summary: Since gcc/ld could produce and expect PIE files we need to pass -no-pie option to avoid linking errors for tests. Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei (cherry picked from FBD28360045)	2021-05-11 03:25:49 +03:00
Alexey Moksyakov	ce84e9607a	[PR] Fix bb reordering optimization Summary: The reorder-blocks optimization pass doesn't take into account that the available offset for legacy Jcc instructions (for example, JRCXZ, with an 8-bit operand) has to be less than 255 bytes. This is a rare case, so an extra check was added to exclude functions with such unsupported instructions from optimization passes. Alexey Moksyakov Advanced Software Technology Lab, Huawei (cherry picked from FBD28264117)	2021-04-23 11:34:40 +03:00
Amir Ayupov	9a884543f1	[BOLT][NFC] Avoid unnecessary copies with push_back Summary: Small refactoring inspired by clang-tidy modernize-use-emplace (cherry picked from FBD28307493)	2021-05-07 18:43:25 -07:00
Amir Ayupov	94653797f3	Rebase: [BOLT][NFC] Avoid binutils in tests Summary: Replace binutils tools with llvm tools (cherry picked from FBD29575630)	2021-05-04 16:45:28 -07:00
Amir Ayupov	eb99a6665c	Rebase: [BOLT][NFC] Remove unneeded includes with include-what-you-use Summary: Ran iwyu multiple times, manually picked header remove lines. Reached fixed point wrt removal: iwyu doesn't automatically remove any more headers or forward declarations. (cherry picked from FBD29569221)	2021-04-30 13:54:02 -07:00
Maksim Panchenko	5239182075	[perf2bolt] Further relax segment matching Summary: Previously, we used p_align value of the code segment to predict the mapping of the segment at runtime. However, at times the reported value is not aligned and at other times the actual aligned value will be different because of the different page size used. All we know is that the page size used at runtime should not exceed p_align value. Adjust our segment address matching accordingly. (cherry picked from FBD28133066)	2021-04-30 15:02:29 -07:00
Maksim Panchenko	bd86c06c1b	[BOLT][NFC] Remove CFIReaderWriter::fdes() (cherry picked from FBD27918126)	2021-04-21 12:33:08 -07:00
Maksim Panchenko	f8fa3e97d5	[BOLT] Remove -dump-eh-frame option Summary: The option duplicates functionality of "llvm-dwarfdump -eh-frame". (cherry picked from FBD27917505)	2021-04-21 12:13:22 -07:00
Maksim Panchenko	3355936e14	[BOLT][NFC] Remove RewriteInstance::EHFrame (cherry picked from FBD27915725)	2021-04-21 11:24:15 -07:00
Amir Ayupov	f84f451a54	[BOLT][NFC] Use const reference for MCInstrDesc Summary: Addressing comments from the review for "Expand auto types". Use const reference in MCPlusBuilder for MCInstrDesc where the copy is not necessary. (cherry picked from FBD27844344)	2021-04-17 21:48:46 -07:00
Amir Ayupov	c7306cc219	Rebase: [BOLT][NFC] Expand auto types Summary: Expanded auto types across BOLT semi-automatically with the aid of clangd LSP (cherry picked from FBD33289309)	2021-04-08 00:19:26 -07:00
Rafael Auler	dc2673a039	[BOLT] Fix value invalidation bug in runtimelib Summary: We can't use a fragment of the old LibPath as an input to create a new one. (cherry picked from FBD27642728)	2021-04-07 21:40:23 -07:00
Rafael Auler	35732d954b	[BOLT] Remove cantFail in getAddressRanges calls Summary: We may have a CU with empty ranges, so accept errors coming from DWARFDie::getAddressRanges(). This happens when using tools that selectively strip debuginfo from the binary. (cherry picked from FBD27602731)	2021-04-06 12:57:09 -07:00
Amir Ayupov	f1bfb18ceb	[BOLT] Refactor SectionPatchers map to a Patcher in BinarySection Summary: Refactor SectionPatches to avoid the use of extra map and a cast from StringRef to std::string. cherry-picked from FBD26756560 (cherry picked from FBD27490641)	2021-03-18 13:06:18 -07:00
Amir Ayupov	081e39aa15	Rebase: [cherry-pick] [BOLT] Add option to skip writing an output file Summary: The user may wish to run BOLT for printing statistics only (i.e. to check that the profile is valid). Add an option to run BOLT without writing any output file, similar to a dry run. This option is triggered by supplying -o with "/dev/null". (cherry picked from FBD29568632)	2021-03-29 16:04:57 -07:00
Maksim Panchenko	e7169be93f	[BOLT] Do not assert on jump table heuristic failure Summary: During the initial indirect jump analysis, we used to assert that the discovered jump table type matched the pattern of the corresponding instruction sequence. E.g., for PIC jump table memory we expected the PIC jump table instruction sequence. The assertions were too conservative, as in the case of a mismatch we can mark the indirect jump as having an unknown control flow. That should be sufficient to either skip the function processing or rely on relocation information for possible recovery of the control flow. (cherry picked from FBD27255816)	2021-03-23 13:41:41 -07:00
Rafael Auler	b3c34d568a	[BOLT] Fix instrumentation bug in duplicated JTs Summary: Fix a bug with instrumentation when trying to instrument functions that share a jump table with multiple indirect jumps. Usually, each indirect jump that uses a JT will have its own copy of it. When this does not happen, we need to duplicate the jump table safely, so we can split the edges correctly (each copy of the jump table may have different split edges). For this to happen, we need to correctly match the sequence of instructions that perform the indirect jump to identify the base address of the jump table and patch it to point to the new cloned JT. It was reported to us a case in which the compiler generated suboptimal code to do an indirect jump which our matcher failed to identify. Fixes facebookincubator/BOLT#126 (cherry picked from FBD27065579)	2021-03-15 16:34:25 -07:00
Maksim Panchenko	b11c826889	[BOLT] Fix false references to zero-sized objects Summary: Whenever BOLT encounters a data reference in code, it tries to convert it into <Object+Offset> form. The primary reason behind this approach is to support read-only data-reordering optimization. However, with the current level of the linker and compiler support we don't have enough information to always correctly restore the original <Object+Offset>. E.g. with zero-sized symbols we have to speculate that the actual size of the underlying object extends to the next symbol. Most of the time, there will be an object pointed by a zero-sized symbol and even if we are guessing incorrectly, there will be no harm in creating references of such form. The problem happens when there's no object corresponding to the original symbol and the next object is an (unmarked) jump table: A: # <- zero-sized object .LJUMP_TABLE: .long <entry1> .long <entry2> .... .LB: .long 21 .LC: .long 42 The jump table will be moved and all references past it (up to the next named object) will be incorrectly updated. We should not speculate about the size of A in a case like that and treat all discovered data objects (and thus references) independently. (cherry picked from FBD27005660)	2021-03-15 12:06:56 -07:00
Vladislav Khmelevsky	76d346ca14	[BOLT][PR] Instrumentation: Introduce -no-counters-clear and -wait-forks options Summary: This PR introduces 2 new instrumentation options: 1. instrumentation-no-counters-clear: Discussed at https://github.com/facebookincubator/BOLT/issues/121 2. instrumentation-wait-forks: Since the instrumentation counters are mapped as MAP_SHARED it will be nice to add ability to wait until all forks of the parent process will die using tracking of process group. The last patch is just emitBinary code refactor. Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/125 GitHub Author: Vladislav Khmelevskyi <Vladislav.Khmelevskyi@huawei.com> (cherry picked from FBD26919011)	2021-03-09 16:18:11 -08:00
Maksim Panchenko	225a8d7f2c	[BOLT] Ignore TBSS section at layout time Summary: TBSS section is a "virtual" section that does not take memory or file space. Ignore it completely while adjusting section sizes. (cherry picked from FBD26824484)	2021-03-04 16:31:12 -08:00
Vladislav Khmelevsky	ec9751eef5	[BOLT][PR] readDynamicRelocations: Skip NONE relocations Summary: NONE relocations should not be processed during dynamic relocations read process Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/118 GitHub Author: Vladislav Khmelevsky <Vladislav.Khmelevskyi@huawei.com> (cherry picked from FBD26489881)	2021-02-17 15:36:58 -08:00
Alexander Yermolovich	06959eedcf	Fix up test for Update DW_AT_stmt_list for .debug_types Summary: As titled. (cherry picked from FBD28112186)	2021-03-17 17:08:26 -07:00
Rafael Auler	da752c9c5c	Fix license for a few remaining files Summary: As titled. (cherry picked from FBD28112137)	2021-03-17 15:04:19 -07:00
Alexander Yermolovich	0ec91a25df	Update DW_AT_stmt_list for .debug_types Summary: There is no real link between CU and TU, so relying on fact that address are the same, and we are updating all of them. (cherry picked from FBD28112114)	2021-02-17 15:30:10 -08:00
Rafael Auler	16521f1f79	[BOLT] Update license headers Summary: Update license and fix headers for some files. (cherry picked from FBD28112041)	2021-03-15 18:04:18 -07:00
Amir Ayupov	1c5d3a056c	Rebase: Merge BOLT codebase in monorepo Summary: This commit is the first step in rebasing all of BOLT history in the LLVM monorepo. It also solves trivial build issues by updating BOLT codebase to use current LLVM. There is still work left in rebasing some BOLT features and in making sure everything is working as intended. History has been rewritten to put BOLT in the /bolt folder, as opposed to /tools/llvm-bolt. (cherry picked from FBD33289252)	2020-12-01 16:29:39 -08:00
Alexander Shaposhnikov	0a8aaf56bb	[BOLT] Add support for reading profile on Mach-O Summary: Add support for reading profile on Mach-O. (cherry picked from FBD25777049)	2021-01-29 16:37:07 -08:00
Alexander Shaposhnikov	a0dd5b05dc	[BOLT] Add support for dumping profile on MacOS Summary: Add support for dumping profile on MacOS. (cherry picked from FBD25751363)	2021-01-28 12:44:14 -08:00
Alexander Shaposhnikov	3b876cc3e7	[BOLT] Add support for dumping counters on MacOS Summary: Add support for dumping counters on MacOS (cherry picked from FBD25750516)	2021-01-28 12:32:03 -08:00
Alexander Shaposhnikov	6a84124e1d	[BOLT] Add support for __literal16 section on MachO Summary: 1. Add support for __literal16 section in the instrumentation runtime library for MacOS. 2. Fix emitting __counters section. (cherry picked from FBD25746342)	2021-01-28 12:04:46 -08:00
Sergey Pupyrev	fea6b4e469	an updated version of ExtTSP Summary: a few minor updates in block reordering: - some refactoring to improve readability; - optimized chain splitting strategy to improve quality of layout and performance of the algorithm. (cherry picked from FBD25126220)	2021-01-27 18:29:16 -08:00
Alexander Shaposhnikov	d6e60c5bec	[BOLT] Enable intToStr for MacOS Summary: Enable intToStr et al. in the runtime library for MacOS. (cherry picked from FBD25745358)	2021-01-20 16:40:17 -08:00
Alexander Shaposhnikov	faaefff618	[BOLT] Fix operator new signature Summary: Use size_t for the first parameter of operator new. https://en.cppreference.com/w/cpp/memory/new/operator_new (cherry picked from FBD25750921)	2021-01-20 12:56:41 -08:00
Amir Ayupov	a86cd533b3	[BOLT] Fix missing newlines in debug prints (cherry picked from FBD25966797)	2021-01-19 18:43:16 -08:00
Rafael Auler	0de92b8346	[PERF2BOLT] Relax segment matching requirements Summary: When looking at perf.data's available binaries and their respective mmap'ed segments, match them with the input binary by looking at both aligned and non-aligned addresses. If we suppose the alignment is the mmap'ed page size, we may miss some cases and perf2bolt will refuse to proceed because it failed to match the input binary with a process recorded in perf.data. (cherry picked from FBD25732673)	2021-01-11 06:24:46 -08:00
Rafael Auler	e3898d5969	[BOLT] Add threshold options for lite mode Summary: Add options for trading processing speed for binary performance. -lite-threshold-pct=<uint> Threshold (in percent) for selecting functions to process in lite mode. Higher threshold means fewer functions to process. E.g threshold of 90 means only top 10 percent of functions with profile will be processed. -lite-threshold-count=<uint> Similar to '-lite-threshold-pct' but specify threshold using absolute function call count. I.e. limit processing to functions executed at least the specified number of times. -no-scan Do not scan cold functions for external references (may result in slower binary). (cherry picked from FBD24739092)	2020-12-30 12:23:58 -08:00
Rafael Auler	e0261a22ce	[TEST] Remove dependency on debug output Summary: Test mistakenly used -debug output, which makes it fail on no-asserts build. (cherry picked from FBD25399449)	2020-12-09 12:25:58 -08:00
Rafael Auler	d2f68039bc	[BOLT] Fix shrinkwrapping bug when changing frame alignment Summary: This fixes a bug with shrink wrapping when trying to move push-pops in a function where we are not allowed to modify the stack layout for alignment reasons. In this bug, we failed to propagate alignment requirement upwards in the call graph from function A to B when: (1) there is a cycle in the call graph and (2) the distance from A to B is greater than 1 in the call graph and (3) there is a node in the path from A to B, not including A or B, that does not access parameters in the stack. (cherry picked from FBD25315977)	2020-12-03 20:09:32 -08:00
Alexander Shaposhnikov	e067f2adf4	Inject instrumentation's global dtor on MachO Summary: This diff is a preparation for dumping the profile generated by BOLT's instrumentation on MachO. 1/ Function "bolt_instr_fini" is placed into the predefined section "__fini" 2/ In the instrumentation pass we create a symbol "bolt_instr_fini" and replace the last global destructor with it. This is a temporary solution, in the future we need to register bolt_instr_fini in addition to the existing destructors without dropping the last one. (cherry picked from FBD25071864)	2020-11-19 18:18:28 -08:00
Alexander Shaposhnikov	1b258b8908	Refactor syscall wrappers for OSX Summary: Refactor syscall wrappers for OSX. (cherry picked from FBD25084642)	2020-11-19 14:56:45 -08:00
Amir Ayupov	f9d00d418b	[BOLT] Handle insertion of updated CFI at the first basic block Summary: Fix corner case of insertion of updated CFI with unset `PrevBB`. Handle it in the same way as inserting past hot-cold split point. (cherry picked from FBD24943911)	2020-11-17 18:40:19 -08:00
Alexander Shaposhnikov	1cf23e5ee8	Link the instrumentation runtime on OSX Summary: Link the instrumentation runtime on OSX. (cherry picked from FBD24390019)	2020-11-17 13:57:29 -08:00
Maksim Panchenko	7eaf63a118	[BOLT] Fix data race while running split functions pass Summary: In BinaryContext::calculateEmittedSize(), after the temporary code emission, we have to perform a cleanup and mark all symbols used during the emission as undefined and unregistered (so that we can emit them again later). The cleanup is happening even for symbols that were referenced and not defined by emitted code. If all emitted symbols are local, there is no risk that one thread will define a symbol while some other thread will undefine it in its cleanup code. Such behavior is expected as local symbols can only be referenced within the containing function and each function is processed in one thread. However, secondary entry points have associated global symbols and if we emit them, then it is possible for a thread to undefine a symbol while the other thread had defined it and was in the process of emitting the fragment with it. In such case, a data race may happen and the thread that contains the definition of the symbol may define it twice causing a redefinition error. To avoid the data race, we skip the emission of secondary entry global symbols when emitting code used only for the size estimation. (cherry picked from FBD24986007)	2020-11-16 14:34:51 -08:00
Sergey Pupyrev	1e9b733008	a new version of hfsort+ Summary: A faster and better version of function reordering: - fixed a bug when some computed probabilities were negative; - changed an O(n^2) loop to a priority queue to find a candidate of chains to merge (cherry picked from FBD24571208)	2020-11-14 13:18:58 -08:00
Amir Ayupov	6401af89c7	[BOLT] Support jump tables in split fragments with entries pointing back to parent functions Summary: Support jump tables belonging to split fragments with entries pointing back to parent functions. While skipping such families of functions, make sure to use the topmost fragment to ignore its fragments. (cherry picked from FBD24907438)	2020-11-12 11:54:51 -08:00
Amir Ayupov	e8234b3b98	[BOLT] Add invalid offset for a JT entry pointing to a fragment Summary: In a jump table identification, register an invalid offset for jump table entries pointing to function fragments. These invalid offsets have no effect other than padding the jump table size, calculated as `max(OffsetEntries, Entries)`. Correct jump table size is required in strict mode (enabled by default in aggregation mode by `perf2bolt`) in accounting of all PC-relative relocations in data. Functions containing these jump tables with invalid offsets are marked to be ignored immediately afterwards in `populateJumpTables`. (cherry picked from FBD24897464)	2020-11-12 11:54:44 -08:00
Amir Ayupov	157129b751	[BOLT] Debug logging in analyzeJumpTable Summary: Added debug logging in/around `analyzeJumpTable`: - Dump jump table entries as they are being processed: ```BOLT-DEBUG: analyzeJumpTable in read_encoded_value_with_base/2(2) Checking 0x428ff40 -> OK: real entry * Checking 0x428ff44 -> OK: real entry * Checking 0x428ff48 -> OK: real entry * Checking 0x428ff4c -> OK: real entry * Checking 0x428ff50 -> OK: real entry * Checking 0x428ff54 -> OK: address in split fragment * Checking 0x428ff58 -> OK: address in split fragment * Checking 0x428ff5c -> OK: address in split fragment * Checking 0x428ff60 -> OK: address in split fragment * Checking 0x428ff64 -> OK: real entry * Checking 0x428ff68 -> OK: real entry * Checking 0x428ff6c -> OK: real entry * Checking 0x428ff70 -> OK: real entry BOLT-DEBUG: analyzeJumpTable in classify_object_over_fdes/1(2) Checking 0x428ff74 -> OK: real entry ... ``` - Dump skipped functions: ``` Skipping _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode.part.2/1(2) family Ignoring _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode.part.2/1(2) Ignoring _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode.part.2.cold.3/1(2) Skipping _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode family Ignoring _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode Ignoring _ZNK6icu_676number4impl12RoundingImpl5applyERNS1_15DecimalQuantityER10UErrorCode.cold.4/1(2) ``` - Dump values of unclaimed PC-relative relocations in data. (cherry picked from FBD24898172)	2020-11-12 11:54:38 -08:00
Amir Ayupov	c0cb550536	Minimize X86/shrinkwrapping-critedge test case Summary: Minimized test case while preserving the CFG subgraph with an issue (cherry picked from FBD24871063)	2020-11-10 21:22:57 -08:00
Amir Ayupov	e54d389799	[BOLT] Disable DynoStats printing after SCTC Summary: Introduce new BinaryFunction flag `IsCanonicalCFG`, which gets unset by SCTC pass. Make DynoStats collection conditional on this new flag. SCTC leaves CFG in a state where branch counters of BBs with tail calls/conditional tail calls are not available (except via annotations, which get stripped by `lower-annotations`). Without branch counters, DynoStats are invalid. (cherry picked from FBD24558050)	2020-11-10 10:51:23 -08:00
Amir Ayupov	c36b71686c	Improve cold fragment name matching Summary: Fix cold fragment name matching regex by replacing existing regexes `.\.cold\..` and `.\.cold` and combining them into `.\.cold(\.\d)?`, applied to restored name (with BOLT-added suffixes stripped) This allows matching names like "execute_stack_op.cold/1", which previously weren't recognized. (cherry picked from FBD24804880)	2020-11-09 12:38:51 -08:00
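The name matching described in c36b71686c can be sketched in Python. This is an illustration only, not BOLT's code: the helper names are hypothetical, the shape of the BOLT-added suffixes (`/1`, `/1(2)`) is inferred from names that appear elsewhere in this log, and the pattern is written with explicit quantifiers (`.+`, `\d+`), which is an assumption about the intended expression.

```python
import re

# Assumption: BOLT-added suffixes look like "/1", "/1(2)" or "/1(*2)".
BOLT_SUFFIX = re.compile(r"/\d+(\(\*?\d+\))?$")

# Assumption: combined cold-fragment pattern with explicit quantifiers.
COLD_FRAGMENT = re.compile(r".+\.cold(\.\d+)?")


def restore_name(name: str) -> str:
    """Strip the (assumed) BOLT-added suffix from a function name."""
    return BOLT_SUFFIX.sub("", name)


def is_cold_fragment(name: str) -> bool:
    """Return True if the restored name looks like a cold fragment."""
    return COLD_FRAGMENT.fullmatch(restore_name(name)) is not None


print(is_cold_fragment("execute_stack_op.cold/1"))
print(is_cold_fragment("execute_stack_op/1"))
```

Matching against the restored name is what lets a name such as "execute_stack_op.cold/1" be recognized even though the suffix follows the `.cold` part.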
Amir Ayupov	f86a78a4e7	Lost in rebase: call registerFragment with a reference to TargetBF Summary: Fixes broken build due to a lost dereferencing (cherry picked from FBD24799948)	2020-11-06 12:22:22 -08:00
Amir Ayupov	2b09d672ce	Conservatively handle jump tables in split functions Summary: - Allow jump table entries to point to locations inside the function and its fragments. Reasoning behind this is that jump table identification has the logic of stopping at entry which belongs to a function different from the one originally referencing jump table. This assumption is invalid for jump tables with entries pointing to both parent function and cold fragments, leading to "unclaimed PC-relative relocations" assertion. - Add fragment identification heuristic based on function name regex and contiguous jump table entries. Currently, parent-to-fragment relationship is set up based on interprocedural references – direct references from the parent function. These references don't include references through jump table. Additionally, some fragments are only reachable through jump table. In that case, in order to fully consume jump table, add parent-to-fragment relationship during `analyzeJumpTable` using the following heuristics: 1. Fragment is identified as such based on name (contains `.cold.` part), but 2. Parent function is not set – no direct interprocedural references to that fragment, and 3. Fragment has the name of the form <parent>.cold(.\d+) * For split functions with jump table entries spanning parent and fragments, mark parent and all fragments as ignored. (cherry picked from FBD24456904)	2020-11-06 11:19:03 -08:00
Amir Ayupov	dc48354f71	processInterproceduralReferences: record references to cold fragments as entry points Summary: For interprocedural references to fragments, record them as fragment entry points. Not registering these entry points leads to UCE removing the blocks and "Undefined temporary symbol" assertion. (cherry picked from FBD24511281)	2020-11-06 10:57:47 -08:00
Amir Ayupov	5452287710	Extract BinaryContext::registerFragment Summary: registerFragment to be reused in adding fragments reachable only through jump tables. (cherry picked from FBD24656651)	2020-11-06 10:27:33 -08:00
Vladislav Khmelevsky	58460460d9	[BOLT][PR] Handle TLS relocations on AArch64 Summary: Some of the TLS relocations like R_AARCH64_TLSDESC_ADR_PAGE21 must be handled by bolt and should not be skipped by the removed condition. Some of the TLS relocations like R_AARCH64_TLS_TPREL64 could really be skipped here, but AFAIU this condition was added as part of BOLT's self-optimization, so to prevent future problems here my suggestion is not to add another condition like "isTLS(RType) && isTLSRelocatable(RType)", but just remove it since absence of this condition should not break any other TLS relocation. Vladislav Khmelevsky, Advanced Software Technology Lab, Huawei Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/103 GitHub Author: Vladislav Khmelevsky <Vladislav.Khmelevskyi@huawei.com> (cherry picked from FBD24745928)	2020-11-04 16:45:58 -08:00
Maksim Panchenko	4f4239ceba	[BOLT] Fix C++ exceptions for shared objects Summary: Fix several issues to make C++ exceptions work in shared objects: * Set MCObjectFileInfo PIC type based on the input binary type. * Support indirect (DW_EH_PE_indirect) encoding while writing exception Type Table. * Use different LPStart value and landing pad encoding for .so's. * Disable splitting of exception-handling code for .so's because of the new encoding. (cherry picked from FBD24698765)	2020-11-04 11:44:02 -08:00
Rafael Auler	c1bb4dcb2b	[BOLT] Remove threaded EliminateUnreachableBlock version Summary: EliminateUnreachableBlocks has a data race because it depends on BinaryContext::computeCodeSize. computeCodeSize supports independent Emitters, enabling a lock-free execution. Unfortunately, that is almost as expensive as the lock. Removing the boilerplate code for parallelization of this pass turned out to be the best alternative: no races and slightly better execution time for HHVM. (cherry picked from FBD24716250)	2020-11-03 11:28:59 -08:00
Rafael Auler	37921b489a	[BOLT] Please sanitizers Summary: In BinaryContext, we had StringRef holding a reference to an r-value std::string. This triggers clang's address sanitizer warnings. In MCPlusBuilder we had a left shift overflowing a type, which is undefined behavior. Similarly, in CallGraph, we had a hash function shifting a negative value, which is also UB. The last two trigger the UB sanitizer. (cherry picked from FBD24661045)	2020-10-30 15:11:52 -07:00
Rafael Auler	3e78082c1b	[DOCS] Add instrumentation instructions to README Summary: Add basic instructions on how to instrument a binary. (cherry picked from FBD24660183)	2020-10-30 14:45:30 -07:00
Rafael Auler	eb12d719ac	[BOLT] Fix no-asserts build Summary: Only use dump() method under DEBUG() macro. (cherry picked from FBD24666481)	2020-10-30 19:59:07 -07:00
Maksim Panchenko	6b185cccf4	[BOLT] Always keep dynamic symbols defined Summary: Some symbols in .dynsym will be erroneously marked as belonging to a non-allocatable section that BOLT can remove. In that case, keep the original invalid index for such symbols instead of setting the UNDEF index. (cherry picked from FBD24488677)	2020-10-22 16:35:29 -07:00
Amir Ayupov	5f2f96c4c9	Add pass number to dot dump filename Summary: Change .dot dumps filename format from <function>-<passname>.dot to <function>-<passidx>_<passname>.dot This change helps navigate dumps by making the pass order explicit. Example: execute_stack_op.cold.6-1(2)-00_build-cfg.dot execute_stack_op.cold.6-1(2)-01_validate-internal-calls.dot execute_stack_op.cold.6-1(*2)-02_strip-rep-ret.dot ... (cherry picked from FBD24452903)	2020-10-21 17:08:32 -07:00
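The `<function>-<passidx>_<passname>.dot` format from 5f2f96c4c9 can be sketched as follows. The function name `dot_dump_filename` is hypothetical, and two details are inferred from the example filenames rather than stated in the commit: that `/` in the function name is replaced by `-`, and that the pass index is zero-padded to two digits.

```python
def dot_dump_filename(function_name: str, pass_index: int, pass_name: str) -> str:
    """Build a .dot dump filename that makes the pass order explicit."""
    # Assumption: "/" is not allowed in the filename and becomes "-".
    safe_name = function_name.replace("/", "-")
    # Assumption: the pass index is zero-padded to two digits.
    return f"{safe_name}-{pass_index:02d}_{pass_name}.dot"


for index, name in enumerate(["build-cfg", "validate-internal-calls"]):
    print(dot_dump_filename("execute_stack_op.cold.6/1(2)", index, name))
```

With a padded index in front of the pass name, a plain lexicographic directory listing shows the dumps in pass order.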
Maksim Panchenko	d91add0bfe	[BOLT] Fix PatchEntries pass Summary: While refactoring the pass, I removed the important transactional property of the patching process. Restore it. (cherry picked from FBD24440214)	2020-10-21 12:31:09 -07:00
Maksim Panchenko	d6d88399fc	[BOLT] Enable lite mode by default with relocations Summary: When optimizing input with relocations, make it faster and less memory-hungry with lite mode. (cherry picked from FBD24374241)	2020-10-17 15:09:06 -07:00
Rafael Auler	e4396c41da	[BOLT] Ignore __hot_start, __hot_end from input Summary: When -hot-text is on, do not read __hot_start and __hot_end from input (inserted by a linker script with the intent of ordering functions). This can confuse BOLT into creating a function with this name depending on which address the symbol lands and we will assert when trying to emit our own __hot_start/__hot_end with symbol redefinition. (cherry picked from FBD24366636)	2020-10-17 00:50:27 -07:00
Alexander Shaposhnikov	6133d2598b	Inject a hook into the entry point on MachO Summary: This diff is a preparation for loading the runtime on MachO. The proposed schema is the following: 1/ Function "bolt_instr_setup" is placed into the predefined section "setup" (in the final setting this function will be coming from the instrumentation runtime but we still will be placing it into this section). 2/ In the instrumentation pass we create a symbol "bolt_instr_setup" and inject the corresponding call into the beginning of the function representing the entry point of the binary. (cherry picked from FBD24329530)	2020-10-15 01:39:35 -07:00
Maksim Panchenko	f15532c2aa	[BOLT][DWARF] Streamline processing of DWARF unit DIEs Summary: Do not store processed DWARF DIEs, but instead process them while reading one at a time. Reduces memory consumption when updating debug info by 10%-25%. (cherry picked from FBD24327029)	2020-10-16 00:11:24 -07:00
Alexander Shaposhnikov	bbd9d610fe	Add first bits to cross-compile the runtime for OSX Summary: Add first bits to cross-compile the runtime for OSX. (cherry picked from FBD24330977)	2020-10-15 03:51:56 -07:00
Rafael Auler	0b6df06e04	[BOLT] In shrinkwrap, do not split prefix/instr Summary: When placing restore instructions in the shrink wrapping pass, we typically put them right before the last instruction of a block at the dominance frontier. If this instruction happened to have a prefix, because the MC lib separates prefix into separate MCInsts, we would accidentally put a load between a prefix and another instruction. Fix this. (cherry picked from FBD24295324)	2020-10-14 12:40:33 -07:00
Maksim Panchenko	53bd88c7fe	[BOLT] Refactor reading of debug line info Summary: Match BinaryFunction to a DWARFUnit based on the unit's address ranges skipping the parsing of DIEs. (cherry picked from FBD24269325)	2020-10-12 21:04:42 -07:00
Maksim Panchenko	9f15b9f3c2	[BOLT] Fix debug line info in lite relocation mode Summary: Emit line info for functions that were not emitted in relocation mode. (cherry picked from FBD24267650)	2020-10-12 20:16:59 -07:00
Alexander Shaposhnikov	473a6199ab	Add first bits to support emitting instrumented code on MachO Summary: Add first bits to support emitting instrumented code on MachO. This diff enables us to instrument branches / emit counters. (cherry picked from FBD24255164)	2020-10-12 10:11:17 -07:00
Maksim Panchenko	247b4181a3	[BOLT] Emit symbol size for functions Summary: On targets that support it, emit size of the emitted function symbol. At the moment there's no use for the size except that it is visible in a temporary .o file symbol table. (cherry picked from FBD24246177)	2020-10-12 13:02:50 -07:00
Alexander Shaposhnikov	528da5d795	Fix handling of _end symbol on MachO Summary: _end is "defined" but its address doesn't belong to any section. This diff adds special handling for this symbol. (cherry picked from FBD24249120)	2020-10-12 03:56:50 -07:00
Maksim Panchenko	c27e254056	[BOLT] Change label name for cold fragments Summary: Append ".cold.0" suffix to the original part of the name, such that "foo/1" becomes "foo.cold.0/1" instead of "foo/1.cold.0". (cherry picked from FBD24246112)	2020-10-12 11:26:07 -07:00
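The renaming in c27e254056 amounts to inserting the suffix before the `/N` part of the name rather than after it. A minimal sketch, assuming a hypothetical helper `cold_label` and a fragment number parameter that the commit does not describe beyond `0`:

```python
def cold_label(name: str, fragment: int = 0) -> str:
    """Append ".cold.<fragment>" to the original part of the name."""
    # Split "foo/1" into the original part "foo" and the local id "1".
    base, separator, local_id = name.partition("/")
    return f"{base}.cold.{fragment}{separator}{local_id}"


print(cold_label("foo/1"))
```

Names without a `/N` part simply get the suffix appended.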
Alexander Shaposhnikov	7f1fd80762	Add support for emitting code into a new segment on MachO Summary: Add support for emitting code into a new segment on MachO (in the instrumentation mode). (cherry picked from FBD24097547)	2020-10-02 19:25:17 -07:00
Maksim Panchenko	843309c075	[BOLT] Disable PatchEntries in non-relocation mode on ELF Summary: At the moment we are not using PatchEntries pass in non-relocation mode on ELF. However, we will use it on MachO. (cherry picked from FBD24235271)	2020-10-09 19:37:12 -07:00
Maksim Panchenko	0465d952cc	[BOLT] Refactor PatchEntries pass Summary: Use injected functions with fixed addresses to patch original function entries. (cherry picked from FBD24133890)	2020-10-09 16:06:27 -07:00
Alexander Shaposhnikov	0376abe252	Add ToolPath field to MachORewriteInstance Summary: Add ToolPath field to MachORewriteInstance. This will enable us to locate the runtime library relative to the tool's location. (cherry picked from FBD24183448)	2020-10-07 17:52:47 -07:00
Rafael Auler	35632d4828	[BOLT] Refactor relocations class impl per arch, NFC Summary: Do not mix relocation codes from different archs. Even though they do not intersect at the moment, this could easily introduce bugs once new relocations are supported (for example, ILP32 for AArch64). (cherry picked from FBD24169425)	2020-10-07 15:40:51 -07:00
Alexander Shaposhnikov	59c21b42da	Precompute symbol section indices on MachO Summary: Precompute symbol section indices on Mach-O. (cherry picked from FBD24133810)	2020-10-06 01:30:55 -07:00
Alexander Shaposhnikov	71e185f2da	Add -check-overlapping-elements option Summary: This diff adds a command line option to disable the check of overlapping elements in Mach-O parsing. This check in its current form is prohibitively expensive for large binaries. A long-term fix would be to reimplement the check in a more efficient manner (and contribute it to the upstream). (cherry picked from FBD24109468)	2020-10-05 02:35:26 -07:00
Rafael Auler	d7fb998637	[BOLT] Fix sign issue when validating X86 relocations Summary: In analyzeRelocations, we extract the result of the relocation from binary code to recreate the target of it in a few special cases. For R_X86_64_32S relocations, however, we were neglecting the possibility of the encoded value in the instruction to be negative. (cherry picked from FBD24096347)	2020-10-05 12:41:03 -07:00
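The sign issue in d7fb998637 comes from reading a 32-bit field that the relocation defines as sign-extended. The sketch below shows the general technique only; `sign_extend_32` is a hypothetical helper, not a function from BOLT.

```python
def sign_extend_32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    # If bit 31 is set, the encoded value is negative.
    if value & 0x80000000:
        return value - (1 << 32)
    return value


# Treated as unsigned, 0xFFFFFFF0 would be a large positive address;
# as a signed 32-bit field it encodes a small negative value.
print(sign_extend_32(0xFFFFFFF0))
```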
Alexander Shaposhnikov	2808c800e8	Read the entry point address on MachO Summary: Read the entry point address on MachO (cherry picked from FBD24039370)	2020-09-30 19:10:24 -07:00
Amir Ayupov	d1ec11b28f	postProcessEntryPoints: return after setIgnored and setSimple are set Summary: This patch fixes the assertion failure during instrumentation. The assertion is raised by `getInstructionAtOffset` , which expects `CurrentState` to be either `Disassembled` or `CFG`. The function is called from `postProcessEntryPoints`, which goes over Labels and performs a series of checks. The checks call BinaryFunction methods `setSimple(false)` or `setIgnored()`. However, if `setIgnored` is invoked, it resets the state to `Empty`. Thus subsequent call to `getInstructionAtOffset` will fail. (cherry picked from FBD24005197)	2020-09-29 19:37:47 -07:00
Alexander Shaposhnikov	0601ae6438	Set InputFileOffset for MachO sections Summary: Set InputFileOffset for MachO sections. (cherry picked from FBD23903542)	2020-09-24 03:22:31 -07:00
Maksim Panchenko	a10f799290	[BOLT][Linux] Initial support for special Linux Kernel sections Summary: Enable initial support for reading and patching special Linux kernel sections. Author: Tanvir Ahmed Khan <takh@fb.com> GitHub Author: takhandipu (cherry picked from FBD22998869)	2020-09-15 11:42:03 -07:00
Maksim Panchenko	a82cff0f52	[BOLT] Eliminate "shallow" function lookup Summary: Whenever we search for a function based on its address in the input binary, we now always return a corresponding fragment for split functions. If the user needs an access to the main fragment, they can call getTopmostFragment(). (cherry picked from FBD23670311)	2020-09-14 15:48:32 -07:00
Maksim Panchenko	62469b5036	[BOLT] Do not map sections with zero address Summary: Sections that do not originate from the input binary will have an input address set to zero and thus do not have to be mapped. Mapping such sections caused a build time regression in non-relocation mode. (cherry picked from FBD23670334)	2020-09-14 14:31:50 -07:00
Amir Ayupov	8c4ba8f165	Bugfix for splitting critical edges in shrink wrapping Summary: Fix issue with splitting critical edges originating at the same BB in ShrinkWrapping::splitFrontierCritEdges. Splitting of critical edges originating at the same FromBB wasn't handled correctly as the Frontier at index corresponding to FromBB was overwritten with basic blocks created for multiple DestinationBBs. (cherry picked from FBD23232398)	2020-08-20 19:00:29 -07:00
Rafael Auler	9bc4a8db18	Fix BAT cold-to-hot mappings Summary: Right now, if activity is recorded in cold parts, we write to the .fdata file the ".cold" name instead of the correct name of the function. Fix this. (cherry picked from FBD23148705)	2020-08-18 11:55:56 -07:00
Maksim Panchenko	aaf49b095f	[perf2bolt] Issue error when writing YAML for BOLTed input Summary: When the input file is processed by BOLT, we cannot save profile in YAML format as it requires CFG representation of functions. (cherry picked from FBD22941794)	2020-08-12 18:10:41 -07:00
Alexander Shaposhnikov	8b989765f6	Add first bits to support emitting more than 255 sections on MachO Summary: Add first bits to support emitting more than 255 sections. Update llvm.patch to include the changes in MachOObjectFile.cpp. (cherry picked from FBD22655053)	2020-07-21 17:26:00 -07:00
Rafael Auler	6547813d1f	Print when we are operating in lite mode Summary: (cherry picked from FBD22968343)	2020-08-06 14:43:33 -07:00
takh	0033a7612d	Linux kernel marker to update special sections Summary: This diff adds SDT marker like LK marker to update special lk sections (cherry picked from FBD22932157)	2020-08-04 13:50:00 -07:00
Maksim Panchenko	8f2a962866	[perf2bolt] Fix for SKL bug workaround Summary: Some LBR traces contain less than 16/32 entries. When the first trace is less than the standard length, we fail to enable SKL bug workaround with BAT mode enabled. The result is a bad profile from perf2bolt. The fix is to check the length of every trace, not just the first one. Issue facebookincubator/BOLT#94 (cherry picked from FBD22917971)	2020-08-04 10:59:37 -07:00
Amir Ayupov	8f7cb54ae5	Added execution count threshold option Summary: Added execution count threshold option (execution-count-threshold) controlling the optimizations that are sensitive to the accuracy of the profiling data: - BB reordering - function splitting - frame opts - shrink wrapping - indirect call promotion (cherry picked from FBD22682171)	2020-07-27 18:07:18 -07:00
Rafael Auler	c6799a689d	[BOLT] Fix stack alignment for runtime lib Summary: Right now, the SAVE_ALL sequence executed upon entry of both of our runtime libs (hugify and instrumentation) will cause the stack to not be aligned at a 16B boundary because it saves 15 8-byte regs. Change the code sequence to adjust for that. The compiler may generate code that assumes the stack is aligned by using movaps instructions, which will crash. (cherry picked from FBD22744307)	2020-07-27 16:52:51 -07:00
Rafael Auler	ed02946281	[BOLT] Fix hot_end symbol update with user function order Summary: If no profile data is provided, but only a user-provided order file for functions, fix the placement of the __hot_end symbol. (cherry picked from FBD22713265)	2020-07-24 10:28:36 -07:00
Amir Ayupov	6b89a9cb44	Handle intra-function call in instrumentOneTarget Summary: Added handling of intra-function calls (internal control transfer by call instruction) to instrumentOneTarget (cherry picked from FBD22606033)	2020-07-17 23:16:56 -07:00
Amir Ayupov	f7d4bed9d1	Extracted sequence insertion function into helper function Summary: Factored out common code from multiple places into a helper function (cherry picked from FBD22606101)	2020-07-17 23:16:52 -07:00
Maksim Panchenko	937244b4f2	[BOLT] Allow to specify -reorder-functions option multiple times Summary: Need to be able to override the option. (cherry picked from FBD22583585)	2020-07-17 10:08:51 -07:00
Rafael Auler	6c8fc28892	Revert "[BOLT] Add the FeatureMiner pass to extract Calder's features." This reverts commit 2476f46af02ccce04e9ed456462dd098460e4e1f. Reviewed By: maks (cherry picked from FBD28111787)	2020-07-16 17:35:55 -07:00
Rafael Auler	170f73ac9e	[BOLT] Fix fix-branches in presence of JRCXZ and friends Summary: Do not fail/assert when trying to reorder blocks that terminate with JRCXZ/JECXZ/LOOP instructions. We cannot invert the condition of these instructions, so just treat them accordingly in fixBranches(). (cherry picked from FBD22487107)	2020-07-15 23:02:58 -07:00
Angélica Moreira	181327d763	[BOLT] Add the FeatureMiner pass to extract Calder's features. (cherry picked from FBD19844247)	2020-07-07 23:01:22 -07:00
Tanvir Ahmed Khan	f40ffa0dc8	Report stale sample count and percentage Summary: This diff adds extra reporting of total number of stale branch samples for the binary. (cherry picked from FBD22304965)	2020-07-06 21:35:44 -07:00
Maksim Panchenko	3e795c8a5f	[BOLT] Ignore addresses from non-allocatable sections Summary: We've accidentally registered TBSS section address with a BinaryContext resulting in addresses being attributed to it when getSectionForAddress() was called. (cherry picked from FBD22369323)	2020-07-06 14:39:44 -07:00
takh	a9fac6a89f	Support for CDF distribution of heatmap buckets Summary: This diff adds the support for generating CDF distributions of heatmap buckets. (cherry picked from FBD22128291)	2020-06-18 16:47:21 -07:00
Xun Li	84eae1a413	[Bolt] Improve coding style for runtime lib related code Summary: Reading through the LLVM coding standard again, realized a few places where I didn't follow the standard when coding. Addressing them: 1. prefer static functions over functions in unnamed namespace. 2. #include as little as possible in headers 3. Have vtable anchors. (cherry picked from FBD22353046)	2020-07-02 14:28:13 -07:00
Maksim Panchenko	e233dec467	[BOLT] Skip R_X86_64_PLT32 relocation verification Summary: R_X86_64_PLT32 relocations recorded by the linker may point to the PLT section instead of being resolved to the symbol reported by the relocation. Sometimes they could point to the symbol too. Disable internal verification for this type of relocation. Include a fix for symbol address calculation when it is based on the extracted value. The truncation to the relocation size is needed if the result overflows. (cherry picked from FBD22317952)	2020-06-30 19:58:43 -07:00
Rafael Auler	26ad0bd951	[TESTS] Re-add issue20/issue26 tests Summary: Re-add tests removed because they used to depend on yaml2obj. Rewrite them with an assembler (llvm-mc) and use the system linker to produce a valid ELF as input to BOLT. (cherry picked from FBD22323449)	2020-06-30 18:36:49 -07:00
Rafael Auler	41cb6b68ed	Update X86/pre-aggregated-perf.test Summary: Add REQUIRED statement. (cherry picked from FBD22290759)	2020-06-24 18:24:07 -07:00
Maksim Panchenko	ffaba22476	[BOLT] Do not emit duplicate org symbols Summary: When adding symbols for patched functions, we may end up emitting multiple symbols per function if the function has multiple names (e.g. after identical code folding by the linker). (cherry picked from FBD22294112)	2020-06-24 12:36:15 -07:00
Maksim Panchenko	250ca4082e	[BOLT] Add static binary support Summary: Accept binaries without dynamic section/segment as a valid input. Modify the check for invalid debug info "executables" that are result of running "objcopy --only-keep-debug". Instead of checking for an empty dynamic segment, check that ".text" is mapped into a valid segment. Move SegmentMapInfo inside BinaryContext. Fixes facebookincubator/BOLT#91 Temporarily removing issue*.test tests that use yaml2obj and operate on fake binaries. (cherry picked from FBD22271481)	2020-06-26 16:52:07 -07:00
Maksim Panchenko	94230a2c07	[perf2bolt] Relax rules for aggregation in strict mode Summary: While aggregating perf.data events, even in strict mode, there is no need to process all functions since we are not generating an output binary. However, it's still important to convert data for as many functions as possible, even for ones with unknown internal control flow. (cherry picked from FBD22248390)	2020-06-25 16:29:17 -07:00
Maksim Panchenko	4aaa8892dd	[BOLT] Ignore duplicate relocations Summary: lld linker may emit static relocations against addresses that also have dynamic relocations associated with them. When this happens, BOLT fails to validate the extracted value at the address. Read dynamic relocations in the binary and ignore static relocations at addresses that have a duplicate dynamic relocation. (cherry picked from FBD22192345)	2020-06-23 12:22:58 -07:00
Maksim Panchenko	13baf47a3c	[BOLT] Add '-force-patch' to forcefully patch old entries Summary: The option is useful for debugging. Also, print personality function when dumping a function. (cherry picked from FBD22169482)	2020-06-22 13:08:28 -07:00
Maksim Panchenko	4946b881a8	[BOLT] Fix getNewValueForSymbol() Summary: getNewValueForSymbol() uses orc::RTDyldObjectLinkingLayer::findSymbol() to resolve symbol values. The latter will always return JITSymbol, even if there was no symbol defined. The address for the undefined symbol will be zero, but some symbols could legally be resolved to zero too. We need to distinguish between real zero-valued symbols and symbols that were not emitted and are not visible by orc::RTDyldObjectLinkingLayer. If zero address is returned by ORC, check for a binary data with the same name and use its address for the symbol resolution. (cherry picked from FBD22175269)	2020-06-22 16:16:08 -07:00
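The fallback described in 4946b881a8 can be sketched with plain dictionaries. Everything here is a stand-in: `jit_symbols` models what the ORC lookup reports (with undefined symbols yielding address zero), `binary_data` models BOLT's binary data objects by name, and `new_value_for_symbol` is a hypothetical name.

```python
def new_value_for_symbol(name: str, jit_symbols: dict, binary_data: dict) -> int:
    """Resolve a symbol, falling back to binary data on a zero address."""
    # The lookup always "succeeds"; an undefined symbol reports address 0.
    address = jit_symbols.get(name, 0)
    if address == 0:
        # Zero is ambiguous: check for binary data with the same name.
        address = binary_data.get(name, 0)
    return address


jit = {"emitted_func": 0x401000}
data = {"not_emitted_obj": 0x602000, "real_zero": 0}
print(hex(new_value_for_symbol("not_emitted_obj", jit, data)))
```

A symbol that legitimately resolves to zero still comes back as zero, while a symbol that was simply not emitted picks up the address of the matching binary data.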
Maksim Panchenko	ae296ea665	[BOLT] Allow to overwrite -use-old-text option (cherry picked from FBD22169409)	2020-06-22 14:05:19 -07:00
Maksim Panchenko	12b7987d4f	[BOLT] Ignore functions that failed validation Summary: If a function failed internal calls validation, we can ignore it and keep processing the binary. (cherry picked from FBD22169381)	2020-06-22 12:59:03 -07:00
Maksim Panchenko	efce443e0d	[BOLT] Create entry points for internal refs from external code Summary: If we detect an internal function reference from code outside of the function, then create an entry point at that location. (cherry picked from FBD22169337)	2020-06-22 13:05:13 -07:00
Maksim Panchenko	0403adde32	[BOLT] Fixes for scanExternalRefs() Summary: In my previous commit, I've accidentally reverted the condition while evaluating a branch target. Also, do not emit instruction for relocation purposes in scanExternalRefs() if there was no TargetSymbol set and we have not produced new relocations. (cherry picked from FBD22169317)	2020-06-22 12:50:49 -07:00
Maksim Panchenko	8374e8e3fe	[BOLT] Properly register symbols at secondary entry points Summary: We may end up with a secondary entry symbol set to zero if there was no symbol in the input file at the entry point address, and if we skipped the function emission, e.g. if it was ignored. In that case, the symbol should be properly initialized with a proper address. (cherry picked from FBD22169167)	2020-06-22 12:37:48 -07:00
Maksim Panchenko	15fffe2824	[BOLT] Fix memory error Summary: Fix for double-free I've introduced earlier. (cherry picked from FBD22132595)	2020-06-18 20:59:01 -07:00
Maksim Panchenko	db4642d0a6	[BOLT] Support -hot-text in lite mode Summary: Update special symbol references in functions that are not emitted. (cherry picked from FBD22120995)	2020-06-18 11:10:41 -07:00
Maksim Panchenko	e7c3464226	[BOLT] Disable trapping on AVX-512 by default Summary: (cherry picked from FBD22118562)	2020-06-18 09:55:05 -07:00
Maksim Panchenko	0ce0bce9e7	[BOLT] Support for lite mode with relocations Summary: Add '-lite' support for relocations for improved processing time, memory consumption, and more resilient processing of binaries with embedded assembly code. In lite relocation mode, BOLT will skip full processing of functions without a profile. It will run scanExternalRefs() on such functions to discover external references and to create internal relocations to update references to optimized functions. Note that we could have relied on the compiler/linker to provide relocations for function references. However, there's no assurance that all such references are reported. E.g., the compiler can resolve inter-procedural references internally, leaving no relocations for the linker. The scan process takes less than 10 seconds per 100MB of code on modern hardware. It's a reasonable overhead to live with considering the flexibility it provides. If BOLT fails to scan or disassemble a function, e.g., due to a data object embedded in code, or an unsupported instruction, it enables a patching mode to guarantee that the failed function will call optimized/moved versions of functions. The patching happens at original function entry points. '-skip=<func1,func2,...>' option now can be used to skip processing of arbitrary functions in the relocation mode. With '-use-old-text' or '-strict' we require all functions to be processed. As such, it is incompatible with '-lite' option, and '-skip' option will only disable optimizations of listed functions, not their disassembly and emission. (cherry picked from FBD22040717)	2020-06-15 00:15:47 -07:00
Xun Li	e22378d20a	Be more flexible when locating runtime libs Summary: In some cases we install bolt binary into one level deeper in bin/, such as bin/install/, we need to go back one more level to find lib directory. (cherry picked from FBD22070974)	2020-06-16 09:59:27 -07:00
Alexander Shaposhnikov	0823882d47	Link functions on MachO Summary: Add first bits for linking functions on MachO. (cherry picked from FBD21991721)	2020-06-12 20:16:27 -07:00
Alexander Shaposhnikov	7950e1e5bb	Provide a redundant declaration of KernelBaseAddr Summary: Adjust the code to make it buildable with clang-10. (cherry picked from FBD22055933)	2020-06-15 16:06:07 -07:00
takh	48b71ad219	Generate heatmap for linux kernel Summary: This diff handles several challenges related to heatmap generation for Linux kernel (vmlinux elf file): - If the input binary elf file contains the section `__ksymtab`, this diff assumes that this is the linux kernel `vmlinux` file and enables an extra flag `LinuxKernelMode` - In `LinuxKernelMode`, we only support heat map generation right now, therefore it ensures that current BOLT mode is heat map generation. Otherwise, it exits with error. - For some Linux symbol and section combinations, BOLT may not be able to find section for symbol (especially symbols that specify the end of some section). For such cases, we show a warning message without exiting which was the previous behavior. - Linux kernel elf file does not contain dynamic section, therefore, we don't exit when no dynamic section is found for linux kernel binary. - Current `ParseMMap` logic does not work with linux kernel. MMap entries for linux kernel uses `PERF_RECORD_MMAP` format instead of typical `PERF_RECORD_MMAP2` format. Since linux kernel address mapping is absolute (same as specified in the ELF file), we avoid calling `ParseMMap` in linux kernel mode. - Linux kernel entries are registered with PID -1, therefore `BinaryMMapInfo` lookup is not required for linux kernel entries. Similarly, `adjustLBR` is also not required. - Default max address in linux kernel mode is highest unsigned 64-bit integer instead of current 4GBs. - Added another new parameter for heatmap, `MinAddress`, in case of Linux kernel mode which is `KernelBaseAddress`, otherwise, it is 0. While registering Heatmap sample counts from LBR entries, any address lower than this `MinAddress` is ignored. - `IgnoreInterruptLBR` is disabled in linux kernel mode to ensure that kernel entries are processed Currently, linux kernel heat map also includes heat map for Linux kernel modules that are not part of vmlinux elf file. This is intentional to identify other potential optimization opportunities. If reviewers think, those modules should be omitted, I will disable those modules based on highest end address of a vmlinux elf section. (cherry picked from FBD21992765)	2020-06-10 23:00:39 -07:00
Maksim Panchenko	2d524fd5e2	[BOLT] Update section index for symbols from unemitted functions Summary: Under some conditions, e.g. while running in lite mode or when a function is non-simple, BOLT may decide not to emit function code and hence there's no need to update the symbol. However, since we change section table, the corresponding section index may need an update. Also, update section index for ICF symbols. (cherry picked from FBD21970017)	2020-06-09 19:12:06 -07:00
Xun Li	9bd7161529	Adding automatic huge page support Summary: This patch enables automated hugify for Bolt. When running Bolt against a binary with -hugify specified, Bolt will inject a call to a runtime library function at the entry of the binary. The runtime library calls madvise to map the hot code region into a 2M huge page. We support both new kernels with THP support and old kernels. For kernels with THP support we simply make a madvise call, while for old kernels, we first copy the code out, remap the memory with huge page, and then copy the code back. With this change, we no longer need to manually call into hugify_self and precompile it with --hot-text. Instead, we could simply combine --hugify option with existing optimizations, and at runtime it will automatically move hot code into 2M pages. Some details around the changes made: 1. Add a command line option to support --hugify. --hugify will automatically turn on --hot-text to get the proper hot code symbols. However, running with both --hugify and --hot-text is not allowed, since --hot-text is used on binaries that have a precompiled call to hugify_self, which contradicts the purpose of --hugify. 2. Moved the common utility functions out of instr.cpp to common.h, which will also be used by hugify.cpp. Added a few new system call definitions. 3. Added a new class that inherits RuntimeLibrary, and implemented the necessary emit and link logic for hugify. 4. Added a simple test for hugify. (cherry picked from FBD21384529)	2020-05-02 11:14:38 -07:00
Xun Li	00892a5fd0	Refactor runtime library Summary: As we are adding more types of runtime libraries, it would be better to move the runtime library out of RewriteInstance so that it could grow separately. This also requires splitting the current implementation of Instrumentation.cpp to two separate pieces, one as normal Pass, one as the runtime library. The Instrumentation Pass would pass over the generated data to the runtime library, which will use it to emit the binary and perform linking. This patch does the following: 1. Turn Instrumentation class into an optimization pass. Register the pass in the pass manager instead of in RewriteInstance. 2. Split all the data that are generated by Instrumentation that's needed by runtime library into a separate data structure called InstrumentationSummary. At the creation of Instrumentation pass, we create an instance of such data structure, which will be moved over to the runtime at the end of the pass. 3. Added a runtime library member to BinaryContext. Set the member at the end of Instrumentation pass. 4. In BinaryEmitter, make BinaryContext also emit runtime library binary. 5. Created a base class RuntimeLibrary, that defines the interface of a runtime library, along with a few common helper functions. 6. Created InstrumentationRuntimeLibrary which inherits from RuntimeLibrary, that does all the work (mostly copied over) for emit and linking. 7. Added a new directory called RuntimeLibs, and put all the runtime library related files into it. (cherry picked from FBD21694762)	2020-05-21 14:28:47 -07:00
Alexander Shaposhnikov	cd067ae1e8	Emit functions on MachO Summary: Start emitting functions (for MachO input binaries). (cherry picked from FBD21721586)	2020-05-26 04:21:04 -07:00
Xun Li	2b65b3aa6b	Use shuffle instead of random_shuffle Summary: random_shuffle is deprecated in C++14. (cherry picked from FBD21698180)	2020-05-21 16:46:27 -07:00
Xun Li	8a680745dd	Remove const call to take_front Summary: take_front() is a const member of StringRef. Calling it does nothing. This suggests that this line of code is useless, deleting it. But it's good to double check, what was the original intention here? (cherry picked from FBD21697637)	2020-05-21 16:25:05 -07:00
Maksim Panchenko	8729171182	[BOLT] Refactor profile-handling code Summary: This diff handles several issues related to profile reading and handling: * Unifies interface used by 3 profile readers in ProfileReaderBase. * Adds automatic detection of the profile file contents. * Removes reader-specific fields from BinaryFunction and BinaryData. All the information is stored in instruction annotations. * Removes implicit memory dependencies in annotations on profile reader instance. * Adds lite mode support to YAML reader. * Moves profile reading code out of BinaryFunction. (cherry picked from FBD21601411)	2020-05-07 23:00:29 -07:00
Maksim Panchenko	cce49b9522	[BOLT] Remove StringRef from IndirectCallProfile Summary: IndirectCallProfile was holding to a StringRef from a profile reader providing an implicit dependency on the reader. (cherry picked from FBD21587101)	2020-05-14 17:34:20 -07:00
Rafael Auler	f91d121eee	[BOLT] Add option to tag version Summary: Add a dummy option in BOLT to allow us to write any string in the bolt info section. This is accomplished since we record the complete argv vector. This string is used to tag this binary with any ID that can later be associated with a specific BOLT invocation. (cherry picked from FBD21441902)	2020-05-06 17:31:25 -07:00
Maksim Panchenko	689447bf10	[BOLT] Change .debug_line emission for non-simple functions Summary: We use a special routine to emit line info for functions that we do not overwrite. The resulting DWARF was not quite efficient as we were advancing addresses using larger-than-needed opcodes. Since there were only a few functions that we didn't emit/overwrite, it was not a big issue. However, in lite mode the majority of functions are not overwritten and as a result, the inefficiency in debug line encoding got exposed and binaries were getting larger than expected .debug_line sections. Fix it by using more conventional line table opcodes for address advancing. (cherry picked from FBD21423074)	2020-05-05 23:56:50 -07:00
Maksim Panchenko	96c4168ddc	[BOLT] Ignore kernel interrupts by default (cherry picked from FBD21431563)	2020-05-06 11:52:16 -07:00
Xun Li	7b61bdf8ea	Check runtime lib format within archiver Summary: We only support linking ELF runtime library right now. If the library is an archiver, we check that each individual library inside the archiver is an ELF library. (cherry picked from FBD21388672)	2020-05-04 13:57:21 -07:00
Maksim Panchenko	924d0bdb08	[BOLT] Introduce lite processing mode without relocations Summary: When optimizing a binary without relocations, we can skip processing functions without profile (cold functions). By skipping processing of cold functions, we reduce the processing time and memory. We call such mode a lite mode, and it is enabled by default. Some processing is still done for functions without profile even in lite mode. scanExternalRefs() function is used to detect secondary entry points to functions that are not marked in the symbol table. Note that the no-relocation requirement is a temporary limitation of the lite mode. (cherry picked from FBD21366567)	2020-05-03 15:49:58 -07:00
Maksim Panchenko	04c5d4fcab	[BOLT] Introduce isIgnored() function attribute Summary: Whenever a function is not meant for processing, e.g. when the user requests to optimize only a subset of functions, mark the function as ignored. Use this attribute instead of opts::shouldProcess(). (cherry picked from FBD21374806)	2020-05-03 13:54:45 -07:00
Maksim Panchenko	4e69764c65	[BOLT] Fix dyno stats after ICF in non-reloc mode Summary: The commit that fixed ICF determinism in non-relocation mode disabled profile merging for functions. Dyno stats output needs to be updated to reflect the lack of merge. (cherry picked from FBD21366046)	2020-05-01 17:51:43 -07:00
Maksim Panchenko	b62a1774af	[BOLT] Cover PIC jump table reference in non-strict mode Summary: In non-strict relocation mode it was possible to miss a jump table reference leading to incorrect code. (cherry picked from FBD21251467)	2020-04-26 17:51:07 -07:00
Maksim Panchenko	ac36e17a73	[BOLT][BFC] Refactor code for adding secondary function entries Summary: In non-relocation mode, the code for marking a function non-simple was decoupled from the code that added new entry points. Fix that. (cherry picked from FBD21264247)	2020-04-27 13:40:53 -07:00
Maksim Panchenko	5296b6d12a	[BOLT] Change symbol handling for secondary function entries Summary: Some functions could be called at an address inside their function body. Typically, these functions are written in assembly as C/C++ does not have a multi-entry function concept. The addresses inside a function body that could be referenced from outside are called secondary entry points. In BOLT we support processing functions with secondary/multiple entry points. We used to mark basic blocks representing those entry points with a special flag. There was only one problem - each basic block has exactly one MCSymbol associated with it, and for the most efficient processing we prefer that symbol to be local/temporary. However, in certain scenarios, e.g. when running in non-relocation mode, we need the entry symbol to be global/non-temporary. We could create global symbols for secondary points ahead of time when the entry point is marked in the symbol table. But not all such entries are properly marked. This means that potentially we could discover an entry point only after disassembling the code that references it, and it could happen after a local label was already created at the same location together with all its references. Replacing the local symbol and updating the references turned out to be an error-prone process. This diff takes a different approach. All basic blocks are created with permanently local symbols. Whenever there's a need to add a secondary entry point, we create an extra global symbol or use an existing one at that location. Containing BinaryFunction maps a local symbol of a basic block to the global symbol representing a secondary entry point. This way we can tell if the basic block is a secondary entry point, and we emit both symbols for all secondary entry points. Since secondary entry points are quite rare, the overhead of this approach is minimal. 
Note that the same location could be referenced via local symbol from inside a function and via global entry point symbol from outside. This is true for both primary and secondary entry points. (cherry picked from FBD21150193)	2020-04-19 22:29:54 -07:00
Maksim Panchenko	ac1af09e82	[BOLT][NFC] Change wording while reporting functions stats Summary: (cherry picked from FBD21242167)	2020-04-24 16:36:22 -07:00
Maksim Panchenko	fbca177a83	[BOLT] Speedup PLT processing Summary: With larger PLT sizes, linear PLT symbol name lookup becomes a bottleneck. (cherry picked from FBD21223695)	2020-04-23 21:29:10 -07:00
Maksim Panchenko	0ea98d1f0b	[BOLT] Option to fail if invalid profile detected Summary: Add an option to fail processing of the input binary if the profile is not accurate: -stale-threshold=<uint> - maximum percentage of stale functions to tolerate (default: 100) Default (100) means never to fail. A function profile is considered stale if any branch in its profile has invalid source or destination. Use `-stale-threshold=0` to fail if any staleness is detected in the profile. (cherry picked from FBD21189036)	2020-04-22 15:09:49 -07:00
Maksim Panchenko	33e0b2aa58	[BOLT] Do not emit old .eh_frame in relocation mode Summary: In relocation mode, there is no use for old .eh_frame section. Moreover, it can interfere with new EH frames via .eh_frame_hdr when the original .text is reused. (cherry picked from FBD21120070)	2020-04-19 12:55:43 -07:00
Maksim Panchenko	23edb3ed9c	[BOLT] Option to control .text alignment Summary: Add option `-align-text=<n>` to control .text alignment within a segment. Set to page size by default. (cherry picked from FBD21120063)	2020-04-19 15:02:50 -07:00
Maksim Panchenko	10245b5c5b	[BOLT] Emit ICF symbols for large functions Summary: In non-relocation mode, make sure we emit extra symbols for a folded function even if the function was not overwritten due to its large size. (cherry picked from FBD21080467)	2020-04-16 00:05:01 -07:00
Maksim Panchenko	606532bdf1	[BOLT] Fix .eh_frame update with ICF in non-relocation mode Summary: In a rare case, we may fold a function and fail to emit it in non-relocation mode due to a function size increase. At the same time, the function that the original function was folded into could have been successfully emitted, e.g. because it was split in the presence of a profile information. Later, because the function was not emitted, we have to use its original .eh_frame entry in the preserved .eh_frame section. However, that entry is no longer referencing the original function, but the function that the original was folded into. This happens since the original symbol gets emitted at the other function location. As a result, .eh_frame entry for the folded function is missing. To prevent incorrect update of the original .eh_frame, create relocations against absolute values. This guarantees preservation of the section contents while updating pc-relative references. (cherry picked from FBD21061130)	2020-04-16 00:02:35 -07:00
Maksim Panchenko	1be7a82540	[BOLT] Speedup RTDyld external symbol resolution Summary: RuntimeDyldImpl::resolveExternalSymbols() some time ago used to call getSymbolAddress() while in the second loop. That call could have modified the contents of ExternalSymbolRelocations that the loop was iterating over. Thus the code was written in a way that erased the processed entry on every loop iteration and reset the map iterator. With large number of entries in ExternalSymbolRelocations the loop code becomes a performance bottleneck. Since getSymbolAddress() is no longer used, the ExternalSymbolRelocations could be iterated in a straightforward way and the map cleared before the function exit. (cherry picked from FBD21057058)	2019-11-11 13:29:46 -08:00
Rafael Auler	6dbd15bc01	[BOLT-X86] Fix instrumentation issue with indirect calls Summary: Indirect calls that use RSP to compute the target address would break in instrumentation mode because we were adding instructions that changed the stack pointer. Fix this. (cherry picked from FBD20883791)	2020-04-06 17:38:11 -07:00
Maksim Panchenko	401fa5b493	[BOLT] Further speedup ICF Summary: Further speedup ICF by applying stricter rules for congruent functions. While checking symbolic operands in congruent functions, consider operands congruent only if they are equal or reference functions with identical hashes, i.e. potentially foldable functions. Note that jump table operands are handled as a special case. (cherry picked from FBD20912054)	2020-04-07 22:10:12 -07:00
Maksim Panchenko	ee0371ad97	[BOLT] Speedup ICF by better function hashing Summary: Too many hash collisions may cause ICF to run slowly. We used to hash BinaryFunction only looking at instruction opcodes, ignoring instruction operands. With many almost identical functions, such approach may lead to long ICF processing time. By including operands into the hash, we reduce the number of collisions and improve the runtime often by a factor of 2 or more. (cherry picked from FBD20888957)	2020-04-07 00:21:37 -07:00
Maksim Panchenko	abda7dc6a7	[BOLT] Fix ICF non-determinism in non-relocation mode Summary: ICF may fold functions in arbitrary order when running multi-threaded. This is fine in relocation mode as we end up with just one function holding all function symbols. However, in non-relocation mode we keep all function bodies, and if we keep merging profiles in non-deterministic order, we end up with functions with non deterministic profiles. The fix for non-relocation mode is to not merge profiles as the factual new profile could be different from the merged one since both function instances are potentially callable. Additionally, emit extra symbols for ICF functions in non-relocation mode to make it possible to track the folding. (cherry picked from FBD20889866)	2020-04-04 20:12:38 -07:00
Maksim Panchenko	b08d82d91b	[BOLT] Verify exceptions action table equivalence in ICF Summary: Some functions may have exactly the same code and exception handlers. However, their action tables could be different leading to mismatching semantics. We should verify their equivalence while running ICF. (cherry picked from FBD20889035)	2020-03-30 19:08:24 -07:00
Maksim Panchenko	58b0d9e7b0	[BOLT][DWARF] Add support for base address in DWARF location lists Summary: The version of LLVM that we are based on lacks the support for base address in DWARF location lists. Add the missing pieces. (cherry picked from FBD20640784)	2020-03-24 22:05:37 -07:00
Maksim Panchenko	bbbf679b42	[BOLT] Refactor ELF symbol table rewriting code Summary: Make ELF symbol table rewriting code more structured. While at it, remove symbols from non-allocatable sections. (cherry picked from FBD20243386)	2020-02-26 20:43:18 -08:00
Maksim Panchenko	a07f1a26e7	[BOLT] Refactor section prefixes (cherry picked from FBD20400886)	2020-03-11 15:51:32 -07:00
Maksim Panchenko	1f3e351a9c	[BOLT] Refactor code and data emission code Summary: Consolidate code and data emission code in ELF-independent BinaryEmitter. The high-level interface includes only two functions emitBinaryContext() and emitFunctionBody() used by RewriteInstance and BinaryContext respectively. (cherry picked from FBD20332901)	2020-03-06 15:06:37 -08:00
Maksim Panchenko	74a2777c54	[BOLT] Refactor ELF parts of instrumentation code Summary: This is a prerequisite for larger emitter refactoring. Since .dynamic is read unconditionally, add an error message if the section is missing, or the size of the section is zero. (cherry picked from FBD20331735)	2020-03-08 19:04:39 -07:00
Maksim Panchenko	af553124d3	[BOLT] Refactor emission of original .eh_frame Summary: There is no need to treat the emission of the original `.eh_frame` section as a special case. (cherry picked from FBD20323360)	2020-03-07 11:19:09 -08:00
Alexander Shaposhnikov	e3654fc274	[BOLT] Uniquify names of local symbols Summary: 1. Uniquify names of local symbols. 2. Handle aliases. (cherry picked from FBD20270196)	2020-03-04 18:36:44 -08:00
Alexander Shaposhnikov	842a25f785	[BOLT] Mark functions containing data as non-simple Summary: Temporarily mark functions containing data as non-simple. (cherry picked from FBD20213279)	2020-03-02 22:41:12 -08:00
Maksim Panchenko	cb9c991dcb	[BOLT] Remove allow-section-relocations option Summary: The option is not used. Remove all related code. (cherry picked from FBD20237859)	2020-03-03 15:51:24 -08:00
Maksim Panchenko	c7e012e145	[BOLT][NFC] Get rid of BestFit parameter Summary: The parameter is no longer used. (cherry picked from FBD20236516)	2020-03-03 14:28:42 -08:00
Alexander Shaposhnikov	b0cbb60165	[BOLT] Fix begin decrementing Summary: Fix begin decrementing. (cherry picked from FBD20232474)	2020-03-03 13:36:32 -08:00
Maksim Panchenko	d89bb53afa	[BOLT][NFC] Factor out relocation processing (cherry picked from FBD20087297)	2020-02-24 17:10:02 -08:00
Rafael Auler	340da8f294	[BOLT] Fix shrink wrapping to check pops Summary: Shrink wrapping has a mode where it will directly move push pop pairs, instead of replacing them with stores/loads. This is an ambitious mode that is triggered sometimes, but whenever matching with a push, it would operate with the assumption that the restoring instruction was a pop, not a load, otherwise it would assert. Fix this assertion to bail nicely back to non-pushpop mode (use regular store and load instructions). (cherry picked from FBD20085905)	2020-02-18 16:00:40 -08:00
Maksim Panchenko	2df4e7b99e	[BOLT][NFC] Minor refactoring of RewriteInstance (cherry picked from FBD20087424)	2020-02-24 17:12:41 -08:00
Maksim Panchenko	495761dc70	[BOLT][NFC] Remove unused BinarySection member functions (cherry picked from FBD20087243)	2020-02-24 16:56:45 -08:00
Maksim Panchenko	3b45212e84	[BOLT] Delete ExecutableFileMemoryManager::registerNoteSection() Summary: The interface is no longer in use. (cherry picked from FBD20070558)	2020-02-24 09:40:32 -08:00
Alexander Shaposhnikov	01b7c90242	[BOLT] Add missing override Summary: Add missing override in X86MCPlusBuilder.cpp. (cherry picked from FBD20064222)	2020-02-23 22:27:28 -08:00
Maksim Panchenko	be43f89c4f	[BOLT][llvm] Update llvm.patch Summary: (cherry picked from FBD20063562)	2020-02-23 19:51:33 -08:00
Alexander Shaposhnikov	76aa1c26aa	[BOLT] Enable reversing the order of basic blocks Summary: Enable reversing the order of basic blocks. (cherry picked from FBD19943692)	2020-02-17 13:35:09 -08:00
Alexander Shaposhnikov	4ad5048393	[BOLT] Add first bits to build CFG Summary: Add first bits to build CFG. (cherry picked from FBD19943472)	2020-02-17 12:18:42 -08:00
Alexander Shaposhnikov	5b64bf2128	[BOLT] Disassemble functions from a MachO binary Summary: Add first bits to disassemble functions from a MachO binary. (cherry picked from FBD19900493)	2020-02-11 14:30:33 -08:00
Rafael Auler	a9d85413ac	[BOLT] Emit long nops by default Summary: Change our X86 target to use long nops by default. In general, BOLT does not put nops into the instruction stream that is going to be executed, since it doesn't align basic blocks, only functions. Since we rebased BOLT, our relationship with MCAssembler changed because it stopped using multibyte nops and we never needed to revisit that. But it makes a difference if we want to mitigate perf issues with the Intel JCC erratum, since the nops inserted are going to be decoded and executed. To make MCAssembler emit long nops again, we need to explicitly set mattr (Features) of the X86 target. (cherry picked from FBD19987277)	2020-02-19 16:13:58 -08:00
Maksim Panchenko	9711286858	[BOLT] Get rid of BinarySection::IsLocal Summary: The flag is no longer used/needed. (cherry picked from FBD19951571)	2020-02-18 09:20:17 -08:00
Alexander Shaposhnikov	16630f5c58	[BOLT] Factor out NameResolver from RewriteInstance Summary: Factor out the helper class NameResolver from the class RewriteInstance. (cherry picked from FBD19943916)	2020-02-17 14:37:46 -08:00
Alexander Shaposhnikov	754b6569f6	[BOLT] Add missing std::move Summary: Add missing std::move in the method BinaryFunction::addAlternativeName (cherry picked from FBD19944661)	2020-02-17 17:53:12 -08:00
Alexander Shaposhnikov	36cf37c4c1	[BOLT] Add initial bits for parsing MachO files Summary: Start adding initial bits for MachO, this diff contains some small preparations for finding functions inside a MachO binary, this will be done in the next diff. The concept of a section in the MachO world is quite different from ELF, nevertheless, for functions for now it more or less fits into the current picture (in BOLT), but things will diverge more significantly a bit later. (cherry picked from FBD19648161)	2020-01-30 13:10:48 -08:00
Rafael Auler	58a129a602	[BOLT] Move peepholes pass after sctc Summary: There are two peephole subpasses, remove-double-jumps and remove-useless-conditional-branches, that operate by reading branches directly, which makes them tricky to run before fix-branches. In the case of remove-double-jumps, it will even lead to suboptimal code if the patched branch was going to be removed by fix-branches when the target is the fall-through. If the final target is a tail call, it will lead to a broken CFG in the worst case. Fix this by moving these passes after SCTC, which already produces CFGs with conditional tail calls. (cherry picked from FBD18795592)	2019-12-03 12:28:22 -08:00
Rafael Auler	c82e7fd1cc	[BOLT] Decoder cache friendly alignment wrt Intel JCC Erratum Summary: This diff ports reviews.llvm.org/D70157 to our LLVM tree, which makes the integrated assembler able to align X86 control-flow changing instructions in a way to reduce the performance impact of the ucode update on Intel processors that implement the JCC erratum mitigation. See white paper "Mitigations for Jump Conditional Code Erratum" by Intel published November 2019. To port this patch, I changed classifySecondInstInMacroFusion to analyze instruction opcodes directly instead of analyzing the CondCond operand (in more recent versions of LLVM, all conditional branches share the same opcode, but with a different conditional operand). I also pulled to our tree Alignment.h as a dependency, and the macroop analyzing helpers. x86-align-branch-boundary and -x86-align-branch are the two flags that control nop insertion to avoid disabling the decoder cache, following the original patch. In BOLT, I added the flag x86-align-branch-boundary-hot-only to request the alignment to only be applied to hot code, which is turned on by default. The reason is because such alignment is expensive to perform on large modules, but if we limit it to hot code, the relaxation pass runtime becomes tolerable. (cherry picked from FBD19828850)	2020-02-10 18:50:53 -08:00
Alexander Shaposhnikov	d5b8fc8fbe	[BOLT] Make the methods isText/isData more robust Summary: Make the methods isText/isData work for MachO. (cherry picked from FBD19849460)	2020-02-11 17:54:48 -08:00
Alexander Shaposhnikov	c3c4b15a2e	[BOLT] Remove BinaryContext::getFunctionData Summary: In this diff we refactor the code around getting the original binary encoding of function's body. The main changes are: remove BinaryContext::getFunctionData, remove the parameter of the method BinaryFunction::disassemble, introduce BinaryFunction::getData. (cherry picked from FBD19824368)	2020-02-10 15:35:11 -08:00
Maksim Panchenko	41de03b8e9	[BOLT] Fix section names under `-generate-link-sections` Summary: Use proper function while printing modified function name to file. (cherry picked from FBD19791847)	2020-02-07 09:39:38 -08:00
Rafael Auler	0080d74506	[BOLT] Fix issue with strict and builtin_unreachable Summary: In strict mode, a jump table with targets generated by builtin_unreachable (located at the very end of the function) was asserting when being recreated by postProcessIndirectBranches. Fix this. (cherry picked from FBD19614981)	2020-01-28 18:38:10 -08:00
Maksim Panchenko	d57513e4ab	[BOLT] Fix symbol table issue with ICF Summary: Not all symbol table entries were updated after ICF. (cherry picked from FBD19319685)	2020-01-08 13:32:59 -08:00
Maksim Panchenko	ac697b7d3a	[BOLT] Replace list of Names with Symbols for BinaryFunction Summary: BinaryFunction used to have a list of Names associated with its main entry point. However, the function is primarily identified by its corresponding symbol or symbols, and these symbols are available as we are creating them for a corresponding BinaryData object. There's also no reason to emit symbols for alternative function names (aliases), so change the code to only emit needed symbols. When we emit a cold fragment for a function, only emit one cold symbol for the fragment instead of one per every main entry symbol/name. When we match a symbol to an entry point in the function, with this change we can first go through the list of main entry symbols (now that they are available). (cherry picked from FBD19426709)	2020-01-13 11:56:59 -08:00
Alexander Shaposhnikov	7a59783d7a	[BOLT] Move createBinaryContext to BinaryContext Summary: 1. Move createBinaryContext to BinaryContext. 2. Add support for nonlinux triples in createBinaryContext. 3. Remove unnecessary std::move in DWARFRewriter.cpp. (cherry picked from FBD19421314)	2020-01-15 15:23:45 -08:00
Rafael Auler	961d3d02d8	[BOLT] Move postProcessEntryPoints after disassembly Summary: Call postProcessEntryPoints only after all functions have been disassembled and all interprocedural references have been processed, when all possible entry points have been accounted for. This makes our detection of bad entries more robust as it does not depend on the order of the functions any more. (cherry picked from FBD19404767)	2020-01-14 17:12:03 -08:00
Maksim Panchenko	0283271f29	[BOLT] Do not report error on mismatched instruction encoding Summary: When the validation of instruction encoding fails but we are able to continue processing the binary, do not report an error. Report encoding format only under `-v=1`. (cherry picked from FBD19376531)	2020-01-13 11:24:10 -08:00
Maksim Panchenko	45b27d7b44	[BOLT] Get rid of Names in BinaryData Summary: For BinaryData, we used to maintain a vector of StringRef names and also a vector of pointers to MCSymbol's associated with the data. There was an unnecessary duplication of information and an associated overhead of keeping it in sync. Fix it by removing Names and using Symbols wherever Names were used. Also merge two variants of registerNameAtAddress() and remove unreachable/dead code in the process. (cherry picked from FBD19359123)	2020-01-10 16:17:47 -08:00
Maksim Panchenko	088e3c032a	[BOLT] Improve handling of secondary function entry points Summary: "Fix symbol table entries for secondary entries" diff broke the inliner. Fix the breakage and make the discovery of secondary entry points more accurate. Add ability to BinaryContext::getFunctionForSymbol() to return an entry point discriminator and use it instead of calling getEntryForSymbol() and isSecondaryEntry(). This is the preferred way since getFunctionForSymbol() is thread-safe. (cherry picked from FBD19295983)	2020-01-06 14:57:15 -08:00
Alexander Shaposhnikov	8c7f524afb	[BOLT] Fix build of the runtime on OSX Summary: Fix the compilation error on OSX (cherry picked from FBD19269806)	2020-01-02 16:20:13 -08:00
Rafael Auler	de284bc510	[BOLT] Fix symbol table entries for secondary entries Summary: Commit "Support full instrumentation" changed the map SymbolToFunction in BinaryContext to map secondary entries of functions too. This introduced unexpected behavior in our symbol table rewriting logic, which caused it to mistakenly write them with the address of the original function. Fix the behavior of getBinaryFunctionAtAddress to correct this. Also fix other users of SymbolToFunction to ensure they are not accidentally using secondary entries when they shouldn't. (cherry picked from FBD19168319)	2019-12-18 12:14:42 -08:00
Xin-Xin Wang	9aa276d349	[BOLT] Make .debug_loc update deterministic Summary: Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done. The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end. (cherry picked from FBD18153069)	2019-10-25 11:47:51 -07:00
Maksim Panchenko	d414acfbb6	[perf2bolt] Better mmap event matching Summary: When perf tool reports a mapping address of a binary, it is not always the address of the first loadable segment we were checking against. As a result, perf2bolt was not working properly for binaries where the first segment was not executable. The fix is to check if the address reported by mmap event matches any of the loadable segments. Note that the segment alignment has to be applied to get real loadable address of the segment. Fixes facebookincubator/BOLT#65 (cherry picked from FBD19146419)	2019-12-17 11:17:31 -08:00
Rafael Auler	16a497c627	[BOLT] Support full instrumentation Summary: Add full instrumentation support (branches, direct and indirect calls). Add output statistics to show how many hot bytes were split from cold ones in functions. Add -cold-threshold option to allow splitting warm code (non-zero count). Add option in bolt-diff to report missing functions in profile 2. In instrumentation, fini hooks are fixed to run proper finalization code after program finishes. Hooks for startup are added to set up the runtime structures that need initialization, such as indirect call hash tables. Add support for automatically dumping profile data every N seconds by forking a watcher process during runtime. (cherry picked from FBD17644396)	2019-12-13 17:27:03 -08:00
Rafael Auler	e46d52de5b	[BOLT] Fix non-determinism in ICP with threads Summary: -icp-top-callsites selects candidates for optimization until a threshold is met. Currently, this parameter is set to 99% of calls by default. The order of functions evaluated changes in parallel mode, thus the functions that may be included to satisfy 99% of all calls may change, leading to different optimization decisions when running in parallel versus sequential. Fix this by enabling optimizations for all branches with the same frequency once we reach our budget instead of cutting off immediately after our budget is satisfied. In that way, order of functions has no impact on which functions are optimized. (cherry picked from FBD18902239)	2019-12-13 16:46:00 -08:00
Xin-Xin Wang	bdb60857e8	[BOLT] Make .debug_loc update deterministic Summary: Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done. The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end. (cherry picked from FBD18153069)	2019-10-25 11:47:51 -07:00
Maksim Panchenko	e5d1334ad5	[perf2bolt] Ignore mmap events unrelated to execution Summary: Some processes can mmap the main binary for the purpose of introspection. We should ignore such mmap events for fixed-address binaries. For PIC binaries, we record the mapping and do the address filtering later for all sample events. (cherry picked from FBD18844314)	2019-12-05 16:52:15 -08:00
Xin-Xin Wang	6f93d53bf5	[BOLT] Remove test for impossible debug ranges condition Summary: The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges` which gets its value from `DebugRangesSectionWriter.SectionOffset` which has type `uint32_t`. The condition seems to be left over from a time where something was using `-1` as an error value. I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section. (cherry picked from FBD18153119)	2019-10-25 15:18:37 -07:00
Xin-Xin Wang	112c4251f5	[BOLT] Separate DebugRangesSectionsWriter into Ranges and ARanges Summary: The `.debug_aranges` section is already deterministic and is logically separate from the `.debug_ranges` section so separate them into separate classes so that it will be easier to make DebugRangesSectionsWriter deterministic (cherry picked from FBD18153057)	2019-10-25 11:24:49 -07:00
Xin-Xin Wang	8e2d3f7c30	[BOLT] Fix invalid abbrev error when reading debug_info section with readelf Summary: This fixes a bug which causes the debug_info and debug_loc sections to be unreadable by readelf/objdump. Basically, we're using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' instead. (cherry picked from FBD18097728)	2019-10-23 15:19:49 -07:00
Rafael Auler	28f91871b3	[PERF2BOLT/BOLT] Improve support for .so Summary: Avoid asserting on inputs that are shared libraries with R_X86_64_64 static relocs and RELATIVE dynamic relocations matching those. Our relocation checking mechanism would expect the result of the static relocation to be encoded in the binary, but the linker instead puts it as an addend in the RELATIVE dyn reloc. Also fix aggregation for .so if the executable segment is not the first one in the binary. (cherry picked from FBD18651868)	2019-11-14 16:07:11 -08:00
Rafael Auler	4bcc53a408	[BOLT] Fix shrink wrapping empty BB issue Summary: When combining icp=calls and shrink wrapping, the former may generate empty BBs that are going to trigger a bug in shrink wrapping restore placement strategy. The restore is wrongly pushed to the BB successor instead of being added to the current block. Add a pass to go over the CFG to fix empty blocks by adding a temporary NOP instruction that is going to be deleted later. Empty BBs are not supported by one of the analyses done in this pass. (cherry picked from FBD18717994)	2019-11-26 15:09:40 -08:00
Maksim Panchenko	3cc4fc267b	[BOLT] Proper support for -trap-avx512 option Summary: If -trap-avx512 option is not set, verify that we correctly encode AVX-512 instructions and treat them as ordinary instructions. (cherry picked from FBD18666427)	2019-11-22 14:53:20 -08:00
Maksim Panchenko	7350d40404	[BOLT][NFC] Refactor BinaryFunction::addEntryPoint() Summary: There is no need to support existing functionality of adding entry points after the CFG is built as the function is only called in empty or disassembled state. Previously we used to run disassemble+buildCFG per function, but now these phases are decoupled. Also, remove a couple of redundant checks. (cherry picked from FBD18622822)	2019-11-11 17:02:37 -08:00
Maksim Panchenko	a09659fd54	[BOLT] Refactor markAmbiguousRelocations() Summary: Refactor markAmbiguousRelocations() code and move it to BinaryContext. Also remove a redundant check. (cherry picked from FBD18623815)	2019-11-18 14:08:17 -08:00
Maksim Panchenko	658f270417	[BOLT] Refactor data PC relocations in BinaryContext Summary: We only use locations of PC relocations and ignore the rest of the data. There's no need to store type and value. (cherry picked from FBD18623280)	2019-11-19 18:52:08 -08:00
Maksim Panchenko	b07e870d78	[BOLT] Add BinarySection::flushPendingRelocations() (cherry picked from FBD18623527)	2019-11-20 00:16:19 -08:00
Maksim Panchenko	3b1b9916dd	[BOLT][NFC] Refactor data section emission code Summary: RewriteInstance::emitDataSection() -> BinarySection::emitAsData() (cherry picked from FBD18623050)	2019-11-19 14:47:49 -08:00
spupyrev	95a1c7f553	speeding up ext-tsp Summary: Speeding up cache+/ext-tsp block reordering algorithm. On a high-level, the speedup is achieved by: - precomputing and memorizing all jumps between a pair of chains (instead of extracting them on every merge iteration); - using a cache of size O(|E|) instead of O(|V|^2) as in previous version. The final output is identical to previous one subject to a new deterministic comparison of double values. (cherry picked from FBD18380870)	2019-10-31 13:32:25 -07:00
Maksim Panchenko	6796b7216b	[BOLT] Fix jump table analysis for non-simple functions Summary: When we disassemble functions, we add discovered jump tables to a global container in BinaryContext. Later, we analyze and verify all jump tables. However, analysis for non-simple functions might fail for numerous reasons, e.g. there would be no instruction at a destination. Since we are not overwriting non-simple functions, it is not a critical error. Thus, we can safely skip jump table analysis for non-simple functions. (cherry picked from FBD18422997)	2019-11-10 21:09:01 -08:00
Maksim Panchenko	72b52edcbb	[BOLT] Free more memory in BinaryFunction::releaseCFG() Summary: Free more lists in BinaryFunction::releaseCFG(). Release BinaryFunction::Relocations after disassembly. Do not populate BinaryFunction::MoveRelocations as we are not using them currently. Also remove PCRelativeRelocationOffsets that weren't used. (cherry picked from FBD18413256)	2019-11-08 14:41:31 -08:00
Maksim Panchenko	d5ddb320ef	[BOLT] Free memory for CFG after emission Summary: Once we emit function code, we no longer need CFG for next phases that use basic blocks for address-translation and symbol update purposes. We free memory used by CFG and instructions. The freed memory gets reused by later phases resulting in overall memory usage reduction. We can probably improve memory consumption even further by replacing BinaryBasicBlocks with more compact data structures. (cherry picked from FBD18408954)	2019-10-31 16:54:48 -07:00
Maksim Panchenko	f2b257bec8	[BOLT] Update SDTs based on translation tables Summary: We used to emit special annotations to update SDT markers. However, we can just use "Offset" annotations for the same purpose. Unlike BAT, we have to generate "reverse" address translation tables. This approach eliminates reliance on instructions after code emission. (cherry picked from FBD18318660)	2019-11-03 21:57:15 -08:00
Maksim Panchenko	98e63610b1	[BOLT] Create OffsetTranslationTable for basic blocks Summary: Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes dependency on instructions after the code emission. (cherry picked from FBD18283965)	2019-11-01 16:19:45 -07:00
Maksim Panchenko	a1388308f0	[BOLT] Use NameResolver class for local symbols Summary: NameResolver class is used to assign unique names to local symbols. (cherry picked from FBD18277131)	2019-11-01 12:31:17 -07:00
Maksim Panchenko	1ed3ac17ff	[BOLT] Fix section offsets after debug stripping Summary: By default, we strip debug sections from the binary. Even though we did not write the sections, we allocated space for them in the output binary by mistake. (cherry picked from FBD18218708)	2019-10-29 14:49:49 -07:00
Maksim Panchenko	ed8be23e73	[BOLT][llvm] Reduce memory used by MCInst Summary: BOLT creates MCInst for every instruction from the input. For large binaries, this means we are creating tens if not hundreds of millions of instructions. If the number of operands for average instruction is much less than 8, we benefit from changing the type of Operands from SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems to be the optimal type for X86-64 on average. The size of MCInst goes down from 176 to 80 which often reduces BOLT memory consumption by gigabytes. (cherry picked from FBD18218924)	2019-10-28 17:40:18 -07:00
Rafael Auler	a3295715e4	[AArch64] Recognize one extra br idiom Summary: We do not support optimizing functions with jump tables in AArch64, but we do need to detect them. This idiom is slightly different from the ones we've seen before. It encodes jump table entries as relative to the jump table itself instead of relative to the indirect branch (BR) instruction. (cherry picked from FBD18191100)	2019-10-28 16:16:35 -07:00
Maksim Panchenko	8fb6512a23	[BOLT][Docs] Instructions for linking with jemalloc/tcmalloc (cherry picked from FBD18050722)	2019-10-21 15:57:36 -07:00
Maksim Panchenko	12aca4005c	[BOLT] Ignore __builtin_unreachable destination Summary: For functions with unknown control flow, do not populate TakenBranches with an entry pointing to the end of the function. (cherry picked from FBD18034019)	2019-10-20 20:46:32 -07:00
Rafael Auler	b807641e2a	[BOLT] Fix stale functions when using BAT Summary: If collecting data in Intel Skylake machines, we may face a bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This makes perf2bolt interpret it as an invalid trace, which ordinarily we discard during aggregation. However, in BAT, since we do not disassemble the binary where the collection happened but rely only on the translation table, it is not possible to detect bad traces and discard them. This gets to the fdata file, and this invalid trace ends up invalidating the profile for the whole function (by being treated as stale by BOLT). In this patch, we detect Skylake by looking for LBRs with 32 entries, and discard the first 2 entries to avoid running into this problem. It also fixes an issue with collision in the translation map by prioritizing the last basic block when more than one share the same output address. (cherry picked from FBD17996791)	2019-10-17 16:35:57 -07:00
Maksim Panchenko	103b0a77cc	[BOLT] Fix non-determinism while reading debug info Summary: When reading debug info in parallel, CUs for functions were populated in parallel and the order was non-deterministic. We used the first CU from the non-deterministically-ordered list to set the line number resulting in different outputs. The fix is to sort the list after it's been created and before assigning the line table unit. (cherry picked from FBD17946889)	2019-10-14 17:57:36 -07:00
Rafael Auler	698a4684ac	[BOLT] Fix merge-fdata and heatmap in BAT Summary: merge-fdata for legacy format was simply appending all input strings to output, but the real format supports some header strings that can't be simply concatenated to output. Check for the header string used by BAT before merging fdata to avoid creating an output file with invalid lines (header in the middle of the fdata file). For heatmap, avoid reading BAT tables, since they won't be used. (cherry picked from FBD17943131)	2019-10-11 13:32:14 -07:00
Xin-Xin Wang	d87f95065a	[BOLT] Add missing CMake test dependencies Summary: I noticed when setting up a new repository for bolt that bolt tests would fail unexpectedly when running `ninja check-bolt` and `ninja check-llvm`. This turns out to be because dependencies for bolt binaries were not specified in the CMake configuration so they were not built before running the tests. This diff adds the dependencies to the CMake configuration for check-bolt and check-llvm so that bolt binaries are built before running tests. (cherry picked from FBD17919505)	2019-10-14 16:03:54 -07:00
Maksim Panchenko	8c6ea8540a	[BOLT] Improve object discovery runtime Summary: (cherry picked from FBD17872824)	2019-10-08 11:03:33 -07:00
Rafael Auler	13948f376d	[BOLT] Do not emit BAT for non-simple in nonreloc Summary: Doing so causes corrupt entries to be emitted. (cherry picked from FBD17774505)	2019-10-04 16:28:03 -07:00
Mark Santaniello	c9f4bbdc22	[llvm-bolt] Bugfix jemalloc sized deallocation segfault Summary: C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`. Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue. (cherry picked from FBD17618981)	2019-09-26 16:51:22 -07:00
Rafael Auler	ba31344fa9	[BOLT] Fix build for Mac Summary: Change our CMake config for the standalone runtime instrumentation library to check for the elf.h header before using it, so the build doesn't break on systems lacking it. Also fix a SmallPtrSet usage where its elements are not really pointers, but uint64_t, breaking the build in Apple's Clang. (cherry picked from FBD17505759)	2019-09-20 11:29:35 -07:00
Maksim Panchenko	5e6d246b9c	[BOLT] Reword message for macro-op fusion optimization Summary: With the word "missed", the previous message about opportunities for macro-op fusion optimization could be misleading. (cherry picked from FBD17464603)	2019-09-18 15:33:03 -07:00
Maksim Panchenko	c823220116	[BOLT] Better check for compiler de-virtualization bug Summary: The existing check for compiler de-virtualization bug was not working when the relocation reference did not fall on a function boundary. As a result, we were falsely detecting "unmarked object in code". When running the check, the address could be arbitrary, except it shouldn't match any existing function. Additionally, check that there's a proper reference to the de-virtualized callee to avoid false positives. (cherry picked from FBD17433887)	2019-09-17 14:24:31 -07:00
Maksim Panchenko	e9c6c73bb8	[BOLT][non-reloc] Change function splitting in non-relocation mode Summary: This diff applies to non-relocation mode mostly. In this mode, we are limited by original function boundaries, i.e. if a function becomes larger after optimizations (e.g. because of the newly introduced branches) then we might not be able to write the optimized version, unless we split the function. At the same time, we do not benefit from function splitting as we do in the relocation mode since we are not moving functions/fragments, and the hot code does not become more compact. For the reasons described above, we used to execute multiple re-write attempts to optimize the binary and we would only split functions that were too large to fit into their original space. After the first attempt, we would know functions that did not fit into their original space. Then we would re-run all our passes again feeding back the function information and forcefully splitting such functions. Some functions still wouldn't fit even after the splitting (mostly because of the branch relaxation for conditional tail calls that does not happen in non-relocation mode). Yet we have emitted debug info as if they were successfully overwritten. That's why we had one more stage to write the functions again, marking failed-to-emit functions non-simple. Sadly, there was a bug in the way 2nd and 3rd attempts interacted, and we were not splitting the functions correctly and as a result we were emitting less optimized code. One of the reasons we had the multi-pass rewrite scheme in place, was that we did not have an ability to precisely estimate the code size before the actual code emission. Recently, BinaryContext obtained such functionality, and now we can use it instead of relying on the multi-pass rewrite. This eliminates redundant work of re-running the same function passes multiple times. 
Because function splitting runs before a number of optimization passes that run on post-CFG state (those rely on the splitting pass), we cannot estimate the non-split code size with 100% accuracy. However, it is good enough for over 99% of the cases to extract most of the performance gains for the binary. As a result of eliminating the multi-pass rewrite, the processing time in non-relocation mode with `-split-functions=2` is greatly reduced. With debug info update, it is less than half of what it used to be. New semantics for `-split-functions=<n>`: -split-functions - split functions into hot and cold regions =0 - do not split any function =1 - in non-relocation mode only split functions too large to fit into original code space =2 - same as 1 (backwards compatibility) =3 - split all functions (cherry picked from FBD17362607)	2019-09-11 15:42:22 -07:00
Wenlei He	615a318b60	[BOLT] Filter perf samples by PID Summary: `perf2bolt` accepts an executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimization for functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process. (cherry picked from FBD17178898)	2019-09-03 22:24:06 -07:00
Wenlei He	8cd1ba599b	[BOLT] Ignore LBR from kernel interrupts Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBRs from perf input that are the result of kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale, and thus bail out of optimization for functions. Ideally, a user mode filter needs to be set for `perf record` so we don't have asynchronous LBRs. However, these are identifiable as the kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module. (cherry picked from FBD17170107)	2019-09-03 10:01:26 -07:00
Rafael Auler	cc4b2fb614	[BOLT] Efficient edge profiling in instrumented mode Summary: Change our edge profiling technique when using instrumentation so that it does not instrument every edge. Instead, build the spanning tree for the CFG and omit instrumentation for edges in the spanning tree. Infer the edge count for these edges when writing the profile during run time. The inference works with a bottom-up traversal of the spanning tree and establishes the value of the edge connecting to the parent based on a simple flow equation involving output and input edges, where the only unknown variable is the parent edge. This requires some engineering in the runtime lib to support dynamic allocation for building these graphs at runtime. (cherry picked from FBD17062773)	2019-08-07 16:09:50 -07:00
Rafael Auler	52786928ff	[BOLT] Fix perf2bolt race in BAT mode Summary: We start a thread to preprocess the profile while the main thread continues to disassemble the input binary. We should not disassemble it in BAT mode, however, the test to check whether we have BAT in the input binary depends on the preprocessing thread, so there is a race where we may start disassembling functions just because the preprocessing thread didn't conclude we are in BAT mode. Fix this and make the main thread check for BAT without depending on the preprocessing thread. (cherry picked from FBD17124370)	2019-08-29 16:18:43 -07:00
Rafael Auler	1f6564f117	[BOLT] Support .plt.got section Summary: We decode the regular .plt section and we are able to perform optimizations on it with -plt=hot or -plt=all, however, .plt.got sections are not decoded by BOLT. This patch teaches BOLT how to read them. They are created by the bfd linker whenever there is no need for the dynamic linker to lazy-bind the symbol (when they are eagerly resolved at binary load time). These entries are 8-byte sized instead of 16-byte sized like the regular PLT, and contain a single indirect call instruction with 7 bytes and a nop. (cherry picked from FBD17060515)	2019-08-26 15:03:38 -07:00
Rafael Auler	243507db99	[BOLT] Fix aggregator w.r.t. split functions Summary: We should not rely on split function detection while aggregating data, but only look up the original function names in the symbol table. Split function detection should be done by BOLT and not perf2bolt while writing the profile. Then, BOLT, when reading it, will take care of combining functions if necessary. This caused a bug in bolted data collection where samples in cold parts of a function were being falsely attributed to the hot part of a function instead of being attributed to the cold part, causing incorrect translation of addresses. (cherry picked from FBD16993065)	2019-08-23 12:18:31 -07:00
Maksim Panchenko	f588d7a6ea	[BOLT] Tighter control of jump table detection Summary: We were too permissive by allowing more jump tables during the preliminary scan of memory. This allowed for jump tables to be falsely detected. And since we didn't have a way to backtrack the jump table creation, we had to assert. This diff refactors the code that analyzes jump table contents. Preliminary and final passes share the same code. The only difference should be the detection of instruction boundaries that are available during the final pass. This should affect strict relocation mode only. (cherry picked from FBD16923335)	2019-08-19 14:06:36 -07:00
Maksim Panchenko	bf030f336a	[BOLT] Fix misleading output Summary: BOLT prints "spawning thread to pre-process profile" message even when it is not running in the aggregation mode. Fix that. (cherry picked from FBD16908596)	2019-08-19 17:11:42 -07:00
Rafael Auler	821480d27f	[BOLT] Encode instrumentation tables in file Summary: Avoid directly allocating string and description tables in binary's static data region, since they are not needed during runtime except when writing the profile at exit. Change the runtime library to open the tables on disk and read only when necessary. (cherry picked from FBD16626030)	2019-08-02 11:20:13 -07:00
Rafael Auler	62aa74f836	[BOLT] Support instrumentation via runtime library Summary: To allow the development of future instrumentation work, this patch adds support in BOLT for linking arbitrary libraries into the binary processed by BOLT. We use the ORC relocation handling mechanism for that. With this support, this patch also moves code programmatically generated in X86 assembly language by X86MCPlusBuilder to C code written in a new library called bolt_rt. Change CMake to support this library as an external project in the same way as clang does with compiler_rt. This library is installed in the lib/ folder relative to BOLT root installation and by default instrumentation will look for the library at that location to finish processing the binary with instrumentation. (cherry picked from FBD16572013)	2019-07-24 14:03:43 -07:00
laith sakka	f77cccf681	Rename option (cherry picked from FBD16655093)	2019-08-05 13:56:48 -07:00
laith sakka	c1564a1026	Add test for parallel mode Summary: Add a flag that disables writing the bolt-info section and add a test that runs bolt in two modes, parallel and sequential, and asserts that the resulting binaries are the same. (cherry picked from FBD16575440)	2019-07-30 17:55:27 -07:00
laith sakka	cc8415406c	Rewrite frame analysis using parallel utilities Summary: Rewrite frame analysis using parallel utilities (cherry picked from FBD16499130)	2019-07-25 11:57:08 -07:00
laith sakka	5084534699	Rewrite ICF using parallel utilities Summary: Rewrite ICF using parallel utilities (cherry picked from FBD16472975)	2019-07-24 17:13:15 -07:00
Maksim Panchenko	8d5854ef09	[BOLT] Add option to verify instruction encoder/decoder Summary: Add option `-check-encoding` to verify if the input to LLVM disassembler matches the output of the assembler. When set, the verification runs on every instruction in processed functions. I'm not enabling the option by default as it could be quite noisy on x86 where instruction encoding is ambiguous and can include redundant prefixes. (cherry picked from FBD16595415)	2019-07-31 16:03:49 -07:00
Maksim Panchenko	79ff4ec1cb	[perf2bolt] Enforce strict mode for perf2bolt Summary: In strict relocation mode, we get better function coverage. However, if the profile used for optimization was converted using non-strict mode, then it wouldn't match functions exclusive to strict mode. Hence, we have to enforce strict relocation mode for profile conversion, so it can be used for either mode. I'm also adding parallel profile pre-processing unless `--no-threads` is specified. This masks the runtime overhead of function disassembly on multi-core machines. (cherry picked from FBD16587855)	2019-06-11 13:24:10 -07:00
laith sakka	1bce256e67	Fix race condition in buildCFG Summary: Switch to sequential execution when print-all is passed, since the function getDynoStats has unsafe access to the annotation allocators. (cherry picked from FBD16503502)	2019-07-25 14:41:57 -07:00
laith sakka	6443c46b9d	Run hfsort+ in parallel Summary: hfsort+ performs an expensive analysis to determine the new order of the functions. 99% of the time during hfsort+ is spent in the function runPassTwo. This diff runs the body of the hot loop in runPassTwo in parallel speeding up the total runtime of reorder-functions pass by up to 4x (cherry picked from FBD16450780)	2019-07-23 15:49:02 -07:00
Maksim Panchenko	a9b9aa1e02	[BOLT] Add code padding verification Summary: In non-relocation mode, we allow data objects to be embedded in the code. Such objects could be unmarked, and could occupy an area between functions, the area which is considered to be code padding. When we disassemble code, we detect references into the padding area and adjust it, so that it is not overwritten during the code emission. We assume the reference to be pointing to the beginning of the object. However, assembly-written functions may reference the middle of an object and use negative offsets to reference data fields. Thus, conservatively, we reduce the possibly-overwritten padding area to a minimum if the object reference was detected. Since we also allow functions with unknown code in non-relocation mode, it is possible that we miss references to some objects in code. To cover such cases, we need to verify the padding area before we allow to overwrite it. (cherry picked from FBD16477787)	2019-07-23 20:48:41 -07:00
Maksim Panchenko	6722875047	[BOLT] Fix processing PLT without relocs Summary: Some binaries may not have a relocation section corresponding to PLT. Handle them properly. (cherry picked from FBD16477841)	2019-07-24 22:08:36 -07:00
Maksim Panchenko	98fdba2cc7	[BOLT][NFC] Fix white space (cherry picked from FBD16473918)	2019-07-24 17:54:14 -07:00
laith sakka	744a2417dd	Run findSubprograms in preprocessDebugInfo in parallel Summary: While reading debug info the function findSubprograms runs on each compilation unit. This diff parallelizes that loop, reducing its runtime by 70%. (cherry picked from FBD16362867)	2019-07-17 20:54:53 -07:00
laith sakka	b50500893d	Lock-based parallelization for updateDebugInfo Summary: BOLT spends a decent amount of time creating patches to update debug information when -update-debug-sections is passed. In updateDebugInfo patches are created to update .debug_info and .debug_abbrev sections while .debug_loc and .debug_ranges contents are populated. This diff uses a lock-based approach to parallelize updateDebugInfo functions and reduces the runtime of the function by around 30%. (cherry picked from FBD16352261)	2019-07-17 14:58:17 -07:00
Facebook Github Bot	86800abc81	[BOLT][PR] Target compilation based on LLVM CMake configuration Summary: Minimalist implementation of target configurable compilation. Fixes https://github.com/facebookincubator/BOLT/issues/59 Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/60 GitHub Author: Pierre RAMOIN <pierre.ramoin@amadeus.com> (cherry picked from FBD16461879)	2019-07-24 11:05:08 -07:00
Maksim Panchenko	2c9c6b164b	[BOLT] Fix issue printing CTCs without annotations Summary: After stripping annotations, conditional tail calls no longer can be identified by their corresponding tag. We can check the number of basic block successors instead. Fixes facebookincubator/BOLT#58. (cherry picked from FBD16444718)	2019-07-22 20:57:19 -07:00
laith sakka	fde5a2b470	Run shrink wrapping in parallel Summary: Shrink wrapping is an expensive part of frame optimizations if performed on all functions. This diff makes it run in parallel, reducing wall time. (cherry picked from FBD16092651)	2019-07-02 10:48:43 -07:00
laith sakka	7d42835418	Run buildCFG in disassembly in parallel Summary: This diff parallelizes the construction of call graph during disassembly. The diff includes a change to parallel-utilities where another interface is added that supports running tasks on binaryFunctions that involve adding instruction annotations. This pattern is common in different places, e.g. frame optimizations, and as such justifies creating an interface that abstracts out all the messy details. (cherry picked from FBD16232809)	2019-07-12 07:25:50 -07:00
laith sakka	f4ab6e6924	run finalize functions in parallel Summary: (cherry picked from FBD16188733)	2019-07-10 10:59:56 -07:00
laith sakka	98539b0966	run aligner pass in parallel Summary: This diff parallelizes the aligner pass (cherry picked from FBD16176327)	2019-07-09 17:59:41 -07:00
laith sakka	9977b03fea	Run reorder blocks in parallel Summary: This diff changes the reorderBasicBlocks pass to run in parallel. It does so by adding locks to the fix branches function and creating temporary MCCodeEmitters when estimating basic block code size. (cherry picked from FBD16161149)	2019-07-08 12:32:58 -07:00
Rafael Auler	1169f1fdd8	[BOLT] Support duplicating jump tables Summary: If two indirect branches use the same jump table, we need to detect this and duplicate jump tables so we can modify this CFG correctly. This is necessary for instrumentation and shrink wrapping. For the latter, we only detect this and bail, fixing this old known issue with shrink wrapping. Other minor changes to support better instrumentation: add an option to instrument only hot functions, add LOCK prefix to instrumentation increment instruction, speed up splitting critical edges by avoiding calling recomputeLandingPads() unnecessarily. (cherry picked from FBD16101312)	2019-07-02 16:56:41 -07:00
Rafael Auler	8880969ced	[BOLT] Restrict creation of jump tables Summary: Heuristic that creates a jump table for every memory access, including those we do not match against a pattern in an indirect jump, is too permissive and has false positives. Guard this logic under strict mode until we figure out a better strategy. (cherry picked from FBD16192205)	2019-07-10 15:41:34 -07:00
laith sakka	3cfc76cdbf	Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel. Summary: Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do. In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations. This will make it easier to implement parallel tasks, and eliminate redundant coding efforts. (cherry picked from FBD16116077)	2019-07-03 17:23:19 -07:00
laith sakka	f10d1fe0f3	Run cleanAnnotations within frame analysis in parallel Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations() (cherry picked from FBD16096711)	2019-07-02 13:42:17 -07:00
laith sakka	00c252f6d8	Clean SPTMap in frame analysis in parallel Summary: This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 seconds on HHVM and bringing the frame optimizer runtime down to 33 seconds on HHVM. (cherry picked from FBD15914371)	2019-06-19 18:01:00 -07:00
laith sakka	86b529bd54	run SPT in parallel, and split annotation allocator Summary: This diff includes two main changes: 1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotations to be deallocated completely right after the end of their usage. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator. 2) SPT is parallelized. (cherry picked from FBD15913492)	2019-06-14 19:56:11 -07:00
Wenlei He	4e90fc1e3b	[BOLT] Prioritize Jump Table ICP target by frequency and index count Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices have a better chance of getting promoted. (cherry picked from FBD16099701)	2019-07-02 15:51:20 -07:00
Maksim Panchenko	078ece1691	[BOLT] Fix out-of-bounds entry points Summary: Check that a symbol address is less than the next function address before considering it for a secondary entry. (cherry picked from FBD16056468)	2019-06-28 11:53:34 -07:00
Maksim Panchenko	e89ad0db4b	[BOLT] Introduce strict relocation mode Summary: In strict relocation mode we rely on relocations to represent all possible entry points into a function. Most of the code generated by tested compilers (gcc and clang) will result in relocations against any internal labels for jump tables and for computed goto tables. In situations where we cannot properly reconstruct a jump table, or when we cannot determine a table that guides an indirect jump, e.g. when multiple computed goto tables are used, we conservatively assume that the indirect jump can end up at any possible basic block referenced by relocations. In strict mode, simple functions may include the aforementioned instructions with unknown control flow with a conservative list of destinations added to the containing basic block. This allows us to expand coverage of simple functions and to enable code reordering optimizations for more functions. The strict mode is recommended when BOLT is used with a well-formed code generated by a compiler. To use the strict mode, add "-strict" on the command line. Another effect of this diff, is that with relocations, we will always replace the immediate operand of an instruction with a symbol if the relocation exists against this operand. Also this diff fixes issues with Clang compiled with -fpic. (cherry picked from FBD15872849)	2019-06-28 09:21:27 -07:00
Maksim Panchenko	06e7a1e059	[BOLT] Ignore false function references Summary: A relocation can have an addend that makes it look as if the relocated value is in a different section from the symbol being relocated. E.g., a relocation against a variable in .rodata could have a negative offset that will make it look like it is against a symbol in .text (a section that typically precedes .rodata). Unless the relocation is against a section symbol, we know exactly the symbol that is being relocated and there is no issue. However, when the linker leaves only a section relocation (i.e. a relocation against a section symbol when a temporary original symbol gets deleted), we have to guess the relocated symbol, and can falsely detect a function reference in the case described above. The fix is to keep a section relocation if the corresponding relocated value falls into a different section, and to detect and ignore false function references. (cherry picked from FBD16030791)	2019-06-27 03:20:17 -07:00
Wenlei He	459add2827	[BOLT] Force non-relocation mode for heatmap generation Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbols in zero-sized code sections during function discovery. (cherry picked from FBD16009610)	2019-06-26 11:06:46 -07:00
Rafael Auler	0d23cbaa52	[BOLT] Initial experimental instrumentation pass Summary: An instrumentation pass that modifies the input binary to generate a profile after execution finishes. It modifies branches to increment counters stored in the process memory and injects a new function that dumps this data to an fdata file, readable by BOLT. This instrumentation is experimental and currently uses a naive approach where every branch is instrumented. This is not ideal for runtime performance, but should be good enough for us to evaluate/debug LBR profile quality against instrumentation. Does not support instrumenting indirect calls yet, only direct calls, direct branches and indirect local branches. (cherry picked from FBD15998096)	2019-06-19 20:10:49 -07:00
Rafael Auler	db02a1a142	[BOLT] Ignore empty funcs in relocation mode Summary: Make BOLT ignore empty functions (those containing no instructions, despite having some space allocated to it filled with zeroes). (cherry picked from FBD15981683)	2019-06-24 20:23:22 -07:00
Rafael Auler	bda13b7dd8	[BOLT] Add option to print profile bias stats Summary: Profile bias may happen depending on the hardware counter used to trigger LBR sampling, on the hardware implementation and as an intrinsic characteristic of relying on LBRs. Since we infer fall-through execution and these non-taken branches take zero hardware resources to be represented, LBR-based profile likely overrepresents paths with fall throughs and underrepresents paths with many taken branches. This patch adds an option to print statistics about profile bias so we can better understand these biases. The goal is to analyze differences in the sum of the frequency of all incoming edges in a basic block versus the sum of all outgoing. In an ideally sampled profile, these differences should be close to zero. With this option, the user gets the mean of these differences in flow as a percentage of the input flow. For example, if this number is 15%, it means, on average, a block observed 15% more or less flow going out of it in comparison with the flow going in. We also print the standard deviation so we can have an idea of how spread apart are different measurements of flow differences. If variance is low, it means the average bias is happening across all blocks, which is compatible with using LBRs. If the variance is high, it means some blocks in the profile have a much higher bias than others, which is compatible with using a biased event such as cycles to sample LBRs because it overrepresents paths that end in an expensive instruction. (cherry picked from FBD15790517)	2019-06-10 17:26:48 -07:00
laith sakka	1ec091e6f5	Parallelize ICF Pass Summary: ICF consumes 10-15% of bolt runtime; for HHVM that is around 45 seconds. This diff parallelizes parts of the pass to make it faster. A 60% reduction in the ICF runtime is measured on the parallel version for HHVM. (cherry picked from FBD15589515)	2019-05-31 16:45:31 -07:00
Maksim Panchenko	9894de0094	[BOLT] Check instruction boundaries while populating jump tables Summary: Now that we populate jump tables after all functions are disassembled, we can check for instruction boundaries corresponding to jump table entries. No need to delegate this task to postProcessJumpTables(). (cherry picked from FBD15814762)	2019-06-13 15:31:30 -07:00
Maksim Panchenko	9e2ad3f593	[BOLT] Delay populating jump tables Summary: During the initial disassembly pass, only identify jump tables without populating the contents. Later, after all functions have been disassembled, we have a better idea of jump table boundaries and can do a better job of populating their entries. As a result, we no longer have embedded jump tables (i.e. a jump table that is part of another jump table). If we ever need to keep sequential jump tables inseparable during the output, we can always add such functionality later. Fixes facebookincubator/BOLT#56. (cherry picked from FBD15800427)	2019-06-12 18:21:02 -07:00
laith sakka	66cf16208f	Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis. Summary: During frame analysis, the functions do not change, and stack pointer tracking does not need to be performed more than one time. The current implementation performs the SPT analysis multiple times per function during the frame analysis; we can eliminate such redundant computation. On HHVM with -frame-opts=hot, this saves around a minute, which is 40% of the frame optimization runtime (129s to 76s). fdata should be passed for a reasonable evaluation (we need the call graph). However, this comes at a memory cost: around 2G is added to the peak when only -frame-opt=hot is used, but when all the usual flags are passed, the effect on the peak is only 200K (measured from one test). Update: When jemalloc is used, the baseline is much better and the following runtimes are observed: [jemalloc] hhvm 85 --> 72. clang 27 --> 23. [malloc] hhvm 129 --> 76. clang 34 --> 27. (cherry picked from FBD15707003)	2019-06-06 12:58:14 -07:00
Maksim Panchenko	9df5063c0e	[perf2bolt] Option to use event PC with LBR stack Summary: Add an option to get extra profile trace using the recorded event PC. The trace goes from the latest LBR record destination to the event PC. (cherry picked from FBD15711804)	2019-06-06 19:38:06 -07:00
Maksim Panchenko	fac6a89c23	[BOLT] Better handling of address references Summary: We used to handle PC-relative address references differently from direct address references. As a result, some cases, such as escaped function label address, were not handled when dealing with absolute (non-PIC) code. This diff moves processing of an address reference into BinaryContext::handleAddressRef() which is called for both PIC and non-PIC code. (cherry picked from FBD15643535)	2019-06-04 15:30:22 -07:00
laith sakka	d3c1821f5f	Compile Bolt using std 14. Summary: Compile Bolt using std 14. We want that to be able to use some threading and locking tools that do not exist in std 11. (cherry picked from FBD15671736)	2019-06-05 10:32:29 -07:00
Rafael Auler	21f4303bfd	Support data collection in bolted binaries Summary: Similarly to how the compiler relies on DWARF to map samples, so it is possible to collect profile data in binaries optimized by PGO techniques and retrofit data to be used in a representation of the program that was not optimized by PGO, this diff implements an option in BOLT to encode a table in the output binary that allows us to map data collected in optimized binaries back to the address space used in the input binary (where the profile is useful, since we do not support running BOLT on a binary already optimized by BOLT). The goal is to offer an option to support BOLT in scenarios where it is not easy to run a special deployment of the binary with a version that was not optimized by BOLT just for data collection. This feature is enabled with the -enable-bat flag. BAT stands for BOLT Address Translation, which refers to the process of mapping output to input addresses. (cherry picked from FBD15531860)	2019-04-12 17:33:46 -07:00
Laith Sakka	3df2c9ea1f	Update SDT locations after bolt reordering Summary: Update SDT locations in .note section to match the new location after bolt reorders the code. (cherry picked from FBD15427779)	2019-05-17 07:58:27 -07:00
Maksim Panchenko	9ef9a7b1be	[BOLT] Use regex matching for function names passed on command line Summary: Options such as `-print-only`, `-skip-funcs`, etc. now take regular expressions. Internally, the option is converted to '^funcname$' form prior to regex matching. This ensures that names without special symbols will match exactly, i.e. "foo" will not match "foo123". (cherry picked from FBD15551930)	2019-05-29 18:33:09 -07:00
Laith Sakka	c8038da36e	Minor-fix: remove duplicate definition of SPT optimization timer Summary: (cherry picked from FBD28111560)	2019-05-22 15:03:42 -07:00
Maksim Panchenko	e5b1d9cd8c	[BOLT][NFC] Fix white space (cherry picked from FBD15485688)	2019-05-23 15:49:36 -07:00
Maksim Panchenko	f57d3c00fc	[BOLT] Better verification of jump tables Summary: Run analyzeIndirectBranch() using basic block boundaries instead of running ad-hoc validation of the jump table assumptions. (cherry picked from FBD15465034)	2019-05-22 18:14:34 -07:00
Maksim Panchenko	be344c8de7	[BOLT] Refactor handling of interproc refs Summary: Move handling of interprocedural references to BinaryContext. Post-process indirect branches immediately after the CFG is built. This is almost NFC. Since indirect branches are now post-processed before the profile data is processed it interferes with the way the profile data in YAML format is handled. (cherry picked from FBD15456003)	2019-05-22 11:26:58 -07:00
Maksim Panchenko	d047df12c5	[BOLT] Add an option to specialize memcpy() for 1 byte copy Summary: Add an option: -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,... to specialize calls to memcpy() in listed functions (the name could be supplied in regex) for size 1. The optimization will dynamically check if the size argument equals 1 and execute a one byte copy, otherwise it will call memcpy() as usual. Specific call sites could be indicated after ":" using their numeric count from the start of the function. (cherry picked from FBD15428936)	2019-05-20 20:11:40 -07:00
Laith Saed Sakka	ca659e4336	Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations Summary: SDT markers that appear as nops in the assembly are preserved and not eliminated. Functions with SDT markers are also flagged. Inlining and folding are disabled for functions that have SDT markers. (cherry picked from FBD15379799)	2019-05-16 12:46:32 -07:00
Laith Saed Sakka	4755825447	Parse statically defined tracepoint markers from .note.stapsdt section Summary: Parse statically defined tracepoint (SDT) markers from the ELF file, and store them. Add an option to print SDTs (-print-sdt). Add test case for parsing and printing SDTs. (cherry picked from FBD15366712)	2019-05-15 17:19:18 -07:00
Rafael Auler	f1fde44154	[BOLT] Improve ICP activation policy and hot jt processing Summary: Previously, ICP worked with a budget of N targets to convert to direct calls. As long as the frequency of up to N of the hottest targets surpassed a given fraction (threshold) of the total frequency, say, 90%, then the optimization would convert a number of targets (up to N) to direct calls. Otherwise, it would completely abort processing this call site. The intent was to convert a given fraction of the indirect call site frequency to use direct calls instead, but this ends up being an "all or nothing" strategy. In this patch we change this to operate with the same strategy seen in LLVM's ICP, with two thresholds. The idea is that the hottest target of an indirect call site will be compared against these two thresholds: one checks its frequency relative to the total frequency of the original indirect call site, and the other checks its frequency relative to the remaining, unconverted targets (excluding the hottest targets that were already converted to direct calls). The remaining threshold is typically set higher than the total threshold. This allows us more control over ICP. I expose two pairs of knobs, one for jump tables and another for indirect calls. To improve the promotion of hot jump table indices when we have memory profile, I also fix a bug that could cause us to promote extra indices besides the hottest ones as seen in the memory profile. When we have the memory profile, I reapply the dual threshold checks to the memory profile which specifies exactly which indices are hot. I then update N, the number of targets to be promoted, based on this new information, and update frequency information. To allow us to work with smaller profiles, I also created an option in perf2bolt to filter out memory samples outside the statically allocated area of the binary (heap/stack). This option is on by default. (cherry picked from FBD15187832)	2019-05-02 12:28:34 -07:00
Maksim Panchenko	fee61231ef	[BOLT] Move JumpTable management to BinaryContext Summary: Make BinaryContext responsible for creation and management of JumpTables. This will be used for detection and resolution of jump table conflicts across functions. (cherry picked from FBD15196017)	2019-05-02 17:42:06 -07:00
Maksim Panchenko	4b55967d9e	[perf2bolt] Pass `-f` flag to perf Summary: perf tool requires the input data to be owned by the current user or root, otherwise it rejects the input. Use `-f` option to override this behavior. (cherry picked from FBD15160678)	2019-04-30 17:08:22 -07:00
Maksim Panchenko	310b32fbe5	[BOLT] Limit jump table size by containing object Summary: While checking for a size of a jump table, we've used containing section as a boundary. This worked for most cases as typically jump tables are not marked with symbol table entries. However, the compiler may generate objects for indirect goto's. (cherry picked from FBD15158905)	2019-04-30 15:47:10 -07:00
Maksim Panchenko	f1dfd38dec	[BOLT][NFC] Move DynoStats out of BinaryFunction Summary: Move DynoStats into separate source files. (cherry picked from FBD15138883)	2019-04-29 12:51:10 -07:00

1191 Commits