Summary:
When the perf tool reports a mapping address for a binary, it is not
always the address of the first loadable segment, which is what we were
checking against. As a result, perf2bolt was not working properly for
binaries where the first segment was not executable.
The fix is to check if the address reported by the mmap event matches any
of the loadable segments. Note that the segment alignment has to be
applied to get the real load address of the segment.
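A minimal sketch of the matching logic, with illustrative types and names
(not BOLT's actual code), assuming `p_vaddr` and `p_align` come from the
ELF program headers:

```cpp
#include <cstdint>
#include <vector>

struct LoadSegment {
  uint64_t Address;   // p_vaddr from the program header
  uint64_t Alignment; // p_align, a power of two
};

// An mmap event matches the binary if its address equals the aligned
// start of *any* PT_LOAD segment, not just the first one.
bool matchesAnyLoadSegment(uint64_t MMapAddress,
                           const std::vector<LoadSegment> &Segments) {
  for (const LoadSegment &Seg : Segments) {
    // Round the segment address down to its alignment to get the address
    // the kernel actually maps it at.
    const uint64_t AlignedAddr = Seg.Address & ~(Seg.Alignment - 1);
    if (MMapAddress == AlignedAddr)
      return true;
  }
  return false;
}
```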
Fixes facebookincubator/BOLT#65
(cherry picked from FBD19146419)
Summary:
Add full instrumentation support (branches, direct and
indirect calls). Add output statistics to show how many hot bytes
were split from cold ones in functions. Add -cold-threshold option
to allow splitting warm code (non-zero count). Add option in
bolt-diff to report missing functions in profile 2.
In instrumentation, fini hooks are fixed to run proper finalization
code after the program finishes. Hooks for startup are added to set up
the runtime structures that need initialization, such as indirect-call
hash tables.
Add support for automatically dumping profile data every N seconds by
forking a watcher process during runtime.
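A minimal sketch of the watcher idea, with hypothetical names (the actual
runtime library code differs):

```cpp
#include <unistd.h>

// Assumed hook that writes the current counters to disk; provided
// elsewhere by the instrumentation runtime in this sketch.
extern "C" void dumpProfileData();

// Fork a watcher process that dumps profile data every IntervalSecs
// seconds while the parent continues running the instrumented program.
void spawnProfileWatcher(unsigned IntervalSecs) {
  if (fork() != 0)
    return; // parent: resume normal execution
  for (;;) {
    sleep(IntervalSecs);
    dumpProfileData();
  }
}
```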
(cherry picked from FBD17644396)
Summary:
-icp-top-callsites selects candidates for optimization until a
threshold is met. Currently, this parameter is set to 99% of calls by
default. The order of functions evaluated changes in parallel mode,
thus the functions that may be included to satisfy 99% of all calls may
change, leading to different optimization decisions when running in
parallel versus sequential mode.
Fix this by enabling optimizations for all branches with the same
frequency once we reach our budget, instead of cutting off immediately
after the budget is satisfied. That way, the order of functions has no
impact on which functions are optimized.
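A minimal sketch of the order-independent cutoff, with illustrative types
(not BOLT's actual code), assuming call sites are already sorted by
descending count:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CallSite {
  uint64_t Count; // execution count of the call site
};

// Accept call sites until the coverage budget (e.g. 0.99 of all calls) is
// met, then keep accepting only those that tie the last accepted count, so
// the cutoff falls on a frequency boundary and traversal order is moot.
size_t selectTopCallSites(const std::vector<CallSite> &Sorted,
                          uint64_t TotalCalls, double Budget) {
  uint64_t Covered = 0;
  uint64_t LastCount = 0;
  size_t NumSelected = 0;
  for (const CallSite &CS : Sorted) {
    const bool BudgetMet = Covered >= Budget * TotalCalls;
    if (BudgetMet && CS.Count != LastCount)
      break; // stop only when the frequency changes
    Covered += CS.Count;
    LastCount = CS.Count;
    ++NumSelected;
  }
  return NumSelected;
}
```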
(cherry picked from FBD18902239)
Summary:
Change the single DebugLocWriter to one for each compilation unit. Then, each thread can write to its own DebugLocWriter and we can combine the data in a deterministic order once the threads are done.
The only catch is that each thread would need the offset of the location lists it adds, so we make a list of pending location list patches and compute the final offsets at the end.
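A minimal sketch of the pending-patch bookkeeping, with hypothetical types
(the actual DebugLocWriter interface differs): each thread records location
lists at CU-local offsets, and the final offsets are resolved once all
per-CU buffers are concatenated in a fixed order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Each patch remembers which per-CU buffer a location list lives in, its
// offset inside that buffer, and where in .debug_info the final
// .debug_loc offset must be written.
struct LocListPatch {
  size_t CUIndex;
  uint64_t LocalOffset;
  uint64_t PatchSite;
};

// Once all threads finish, per-CU buffer sizes are fixed, so the final
// offset is the sum of the preceding buffers plus the local offset.
uint64_t finalOffset(const std::vector<std::string> &CUBuffers,
                     const LocListPatch &P) {
  uint64_t Base = 0;
  for (size_t I = 0; I < P.CUIndex; ++I)
    Base += CUBuffers[I].size();
  return Base + P.LocalOffset;
}
```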
(cherry picked from FBD18153069)
Summary:
Some processes can mmap the main binary for the purpose of
introspection. We should ignore such mmap events for fixed-address
binaries. For PIC binaries, we record the mapping and do the address
filtering later for all sample events.
(cherry picked from FBD18844314)
Summary:
The condition `DebugRangesOffset == -1U` can never happen since DebugRangesOffset has type `uint64_t` and the value always comes from `RangesSectionWriter->addRanges`, which gets its value from `DebugRangesSectionWriter.SectionOffset`, which has type `uint32_t`. The condition seems to be left over from a time when something was using `-1` as an error value.
I'm removing that check so I can use `-1` as a tag to refer to the empty range that will be at the beginning of the ranges section.
(cherry picked from FBD18153119)
Summary: The `.debug_aranges` section is already deterministic and logically separate from the `.debug_ranges` section, so split the writers into separate classes to make it easier to make DebugRangesSectionsWriter deterministic.
(cherry picked from FBD18153057)
Summary:
This fixes a bug that causes the debug_info and debug_loc sections to be unreadable by readelf/objdump.
Basically, we were using 12 bytes of a ULEB128 value to fill in space, but readelf can't read more than 9 bytes of ULEB128. Thus, we replace that value with a string of 'a' characters instead.
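A sketch of the constraint, with an illustrative helper (not the actual
patch): LLVM's encodeULEB128 can pad an encoding to a fixed width, but a
12-byte padded value trips readelf's 9-byte limit, so the filler is written
as bytes that are never parsed as ULEB128.

```cpp
#include "llvm/Support/LEB128.h"
#include "llvm/Support/raw_ostream.h"

// Hypothetical filler routine: write NumBytes of 'a' instead of a
// ULEB128 value padded past readelf's 9-byte limit.
void writeFiller(llvm::raw_ostream &OS, unsigned NumBytes) {
  // Not: llvm::encodeULEB128(0, OS, /*PadTo=*/12); -- unreadable by readelf.
  for (unsigned I = 0; I < NumBytes; ++I)
    OS << 'a';
}
```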
(cherry picked from FBD18097728)
Summary:
Avoid asserting on inputs that are shared libraries with
R_X86_64_64 static relocs and RELATIVE dynamic relocations matching
those. Our relocation checking mechanism would expect the result of
the static relocation to be encoded in the binary, but the linker
instead puts it as an addend in the RELATIVE dyn reloc.
Also fix aggregation for .so if the executable segment is not the
first one in the binary.
(cherry picked from FBD18651868)
Summary:
When combining icp=calls and shrink wrapping, the former may
generate empty BBs that trigger a bug in the shrink-wrapping
restore placement strategy. The restore is wrongly pushed to the BB's
successor instead of being added to the current block. Add a pass to
go over the CFG and fix empty blocks by adding a temporary NOP
instruction that is going to be deleted later. Empty BBs are not
supported by one of the analyses done in this pass.
(cherry picked from FBD18717994)
Summary:
If -trap-avx512 option is not set, verify that we correctly encode
AVX-512 instructions and treat them as ordinary instructions.
(cherry picked from FBD18666427)
Summary:
There is no need to support the existing functionality of adding entry
points after the CFG is built, as the function is only called in the empty
or disassembled state. Previously we used to run disassemble+buildCFG per
function, but now these phases are decoupled.
Also, remove a couple of redundant checks.
(cherry picked from FBD18622822)
Summary:
We only use the locations of PC-relative relocations and ignore the rest
of the data. There's no need to store the type and value.
(cherry picked from FBD18623280)
Summary:
Speeding up cache+/ext-tsp block reordering algorithm.
On a high-level, the speedup is achieved by:
- precomputing and memoizing all jumps between a pair of chains
(instead of extracting them on every merge iteration);
- using a cache of size O(|E|) instead of O(|V|^2) as in the previous version.
The final output is identical to the previous one, subject to a new
deterministic comparison of double values.
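A minimal sketch of the memoization, with illustrative structures (the real
algorithm lives in the cache+/ext-tsp pass): jumps between each pair of
chains are collected once up front, so a merge iteration looks them up
instead of re-scanning all edges; the cache stores one entry per CFG edge,
i.e. O(|E|) space.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Jump {
  uint64_t SrcBlock, DstBlock, ExecCount;
};

using ChainPair = std::pair<uint64_t, uint64_t>; // (ChainId, ChainId)

// Filled once before merging starts; every CFG edge appears in exactly
// one bucket, so total size is O(|E|).
std::map<ChainPair, std::vector<Jump>> JumpCache;

const std::vector<Jump> &jumpsBetween(uint64_t ChainA, uint64_t ChainB) {
  return JumpCache[{ChainA, ChainB}]; // lookup only, no edge re-extraction
}
```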
(cherry picked from FBD18380870)
Summary:
When we disassemble functions, we add discovered jump tables to a global
container in BinaryContext. Later, we analyze and verify all jump
tables. However, analysis for non-simple functions might fail for numerous
reasons, e.g., there might be no instruction at a destination. Since we
are not overwriting non-simple functions, it is not a critical error.
Thus, we can safely skip jump table analysis for non-simple functions.
(cherry picked from FBD18422997)
Summary:
Free more lists in BinaryFunction::releaseCFG().
Release BinaryFunction::Relocations after disassembly.
Do not populate BinaryFunction::MoveRelocations as we are not using them
currently.
Also remove PCRelativeRelocationOffsets that weren't used.
(cherry picked from FBD18413256)
Summary:
Once we emit function code, we no longer need the CFG for the next phases
that use basic blocks for address translation and symbol update
purposes. We free memory used by CFG and instructions. The freed
memory gets reused by later phases resulting in overall memory usage
reduction.
We can probably improve memory consumption even further by replacing
BinaryBasicBlocks with more compact data structures.
(cherry picked from FBD18408954)
Summary:
We used to emit special annotations to update SDT markers. However,
we can just use "Offset" annotations for the same purpose. Unlike BAT,
we have to generate "reverse" address translation tables.
This approach eliminates reliance on instructions after code emission.
(cherry picked from FBD18318660)
Summary:
Use BinaryBasicBlock::OffsetTranslationTable for BAT. This removes
dependency on instructions after the code emission.
(cherry picked from FBD18283965)
Summary:
By default, we strip debug sections from the binary. Even though we did
not write the sections, we allocated space for them in the output binary
by mistake.
(cherry picked from FBD18218708)
Summary:
BOLT creates MCInst for every instruction from the input. For large
binaries, this means we are creating tens if not hundreds of millions of
instructions. If the number of operands for the average instruction is much
less than 8, we benefit from changing the type of Operands from
SmallVector<MCOperand, 8> to SmallVector<MCOperand, 2>. That seems
to be the optimal type for X86-64 on average.
The size of MCInst goes down from 176 to 80 which often reduces BOLT
memory consumption by gigabytes.
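A sketch of the change (the real edit is in MCInst itself); with 16-byte
MCOperands, shrinking the inline capacity from 8 to 2 saves 6 * 16 = 96
bytes per instruction, matching the 176 -> 80 byte reduction quoted above:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"

using OperandsBefore = llvm::SmallVector<llvm::MCOperand, 8>; // 8 inline
using OperandsAfter  = llvm::SmallVector<llvm::MCOperand, 2>; // 2 inline

// Operands beyond the inline capacity spill to the heap; since the average
// X86-64 instruction has far fewer than 8 operands, the smaller buffer
// shrinks every MCInst without a measurable slowdown.
```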
(cherry picked from FBD18218924)
Summary:
We do not support optimizing functions with jump tables in
AArch64, but we do need to detect them. This idiom is slightly different
from the ones we've seen before: it encodes jump table entries as offsets
relative to the jump table itself instead of relative to the indirect
branch (BR) instruction.
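A minimal sketch of the address computation for this idiom, with
illustrative types (entry width varies in practice):

```cpp
#include <cstdint>

// Target of the indirect branch for the table-relative idiom: each entry
// is a signed offset from the jump table's own address, so the target is
// TableAddress + Entry[Index], not BRAddress + Entry[Index] as in the
// previously supported pattern.
uint64_t jumpTargetTableRelative(uint64_t TableAddress,
                                 const int32_t *Entries, unsigned Index) {
  return TableAddress + static_cast<int64_t>(Entries[Index]);
}
```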
(cherry picked from FBD18191100)
Summary:
For functions with unknown control flow, do not populate TakenBranches
with an entry pointing to the end of the function.
(cherry picked from FBD18034019)
Summary:
When collecting data on Intel Skylake machines, we may face a
bug where LBR0 or LBR1 may be duplicated w.r.t. the next entry. This
makes perf2bolt interpret it as an invalid trace, which ordinarily we
discard during aggregation. However, in BAT, since we do not disassemble
the binary where the collection happened but rely only on the
translation table, it is not possible to detect bad traces and discard
them. This gets into the fdata file, and the invalid trace ends up
invalidating the profile for the whole function (by being treated as
stale by BOLT).
In this patch, we detect Skylake by looking for LBRs with 32 entries,
and discard the first 2 entries to avoid running into this problem.
It also fixes an issue with collision in the translation map by
prioritizing the last basic block when more than one shares the same
output address.
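A minimal sketch of the workaround, using illustrative types (not the
aggregator's actual data structures):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct LBREntry {
  uint64_t From, To;
};

// Skylake exposes 32 LBR entries, and LBR0/LBR1 may duplicate their
// neighbors, so samples carrying exactly 32 entries have their first two
// entries dropped before traces are formed.
void filterSkylakeLBR(std::vector<LBREntry> &Sample) {
  constexpr size_t SkylakeLBRSize = 32;
  if (Sample.size() == SkylakeLBRSize)
    Sample.erase(Sample.begin(), Sample.begin() + 2);
}
```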
(cherry picked from FBD17996791)
Summary:
When reading debug info in parallel, CUs for functions were populated in
parallel and the order was non-deterministic. We used the first CU from
the non-deterministically-ordered list to set the line number resulting
in different outputs.
The fix is to sort the list after it's been created and before assigning
the line table unit.
(cherry picked from FBD17946889)
Summary:
merge-fdata for the legacy format was simply appending all input
strings to the output, but the real format supports some header strings
that can't simply be concatenated. Check for the header
string used by BAT before merging fdata to avoid creating an output
file with invalid lines (a header in the middle of the fdata file).
For heatmap, avoid reading BAT tables, since they won't be used.
(cherry picked from FBD17943131)
Summary:
I noticed when setting up a new repository for bolt that bolt tests
would fail unexpectedly when running `ninja check-bolt` and
`ninja check-llvm`. This turns out to be because dependencies for bolt
binaries were not specified in the CMake configuration, so they were not
built before running the tests. This diff adds the dependencies to the
CMake configuration for check-bolt and check-llvm so that bolt binaries
are built before running tests.
(cherry picked from FBD17919505)
Summary:
C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`.
Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue.
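A sketch of the idea behind the fix, with a hypothetical typedef name:
every allocation and deallocation site must agree on the SmallVector's
inline element count, because under C++14 sized deallocation the compiler
passes the size it computed from the static type to `operator delete`, and
jemalloc's `sdallocx` trusts that size.

```cpp
#include "llvm/ADT/SmallVector.h"
#include <cstdint>

// One typedef used everywhere keeps the "16" consistent between the code
// that news the vector and the code that deletes it; a mismatch would make
// sized delete report a size different from the one actually allocated.
using AddressVector = llvm::SmallVector<uint64_t, 16>;
```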
(cherry picked from FBD17618981)
Summary:
Change our CMake config for the standalone runtime instrumentation
library to check for the elf.h header before using it, so the build
doesn't break on systems lacking it. Also fix a SmallPtrSet usage where
its elements are not really pointers, but uint64_t, breaking the build
in Apple's Clang.
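A sketch of the container fix (the variable name is illustrative):
SmallPtrSet requires pointer-like element types, which Apple's Clang
enforces, so a set of raw addresses can use SmallSet instead.

```cpp
#include "llvm/ADT/SmallSet.h"
#include <cstdint>

// SmallPtrSet<uint64_t, 4> fails to build where the pointer-type
// requirement is enforced; SmallSet accepts integral keys.
llvm::SmallSet<uint64_t, 4> IndirectCallTargets;
```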
(cherry picked from FBD17505759)
Summary:
With the word "missed", the previous message about opportunities for
macro-op fusion optimization could be misleading.
(cherry picked from FBD17464603)
Summary:
The existing check for compiler de-virtualization bug was not working
when the relocation reference did not fall on a function boundary.
As a result, we were falsely detecting "unmarked object in code".
When running the check, the address could be arbitrary, except it
shouldn't match any existing function. Additionally, check that there's
a proper reference to the de-virtualized callee to avoid false
positives.
(cherry picked from FBD17433887)
Summary:
This diff applies to non-relocation mode mostly. In this mode, we are
limited by original function boundaries, i.e. if a function becomes
larger after optimizations (e.g. because of the newly introduced
branches) then we might not be able to write the optimized version,
unless we split the function. At the same time, we do not benefit from
function splitting as we do in the relocation mode since we are not
moving functions/fragments, and the hot code does not become more
compact.
For the reasons described above, we used to execute multiple re-write
attempts to optimize the binary and we would only split functions that
were too large to fit into their original space.
After the first attempt, we would know functions that did not fit
into their original space. Then we would re-run all our passes again
feeding back the function information and forcefully splitting
such functions. Some functions still wouldn't fit even after the
splitting (mostly because of the branch relaxation for conditional tail
calls that does not happen in non-relocation mode). Yet we emitted
debug info as if they had been successfully overwritten. That's why we had
one more stage to write the functions again, marking failed-to-emit
functions non-simple. Sadly, there was a bug in the way the 2nd and 3rd
attempts interacted: we were not splitting the functions correctly,
and as a result we were emitting less optimized code.
One of the reasons we had the multi-pass rewrite scheme in place was
that we did not have the ability to precisely estimate the code size
before the actual code emission. Recently, BinaryContext obtained such
functionality, and now we can use it instead of relying on the
multi-pass rewrite. This eliminates redundant work of re-running
the same function passes multiple times.
Because function splitting runs before a number of optimization passes
that run on post-CFG state (those rely on the splitting pass), we
cannot estimate the non-split code size with 100% accuracy. However,
it is good enough for over 99% of the cases to extract most of the
performance gains for the binary.
As a result of eliminating the multi-pass rewrite, the processing time
in non-relocation mode with `-split-functions=2` is greatly reduced.
With debug info update, it is less than half of what it used to be.
New semantics for `-split-functions=<n>`:
-split-functions - split functions into hot and cold regions
=0 - do not split any function
=1 - in non-relocation mode only split functions too large to fit
into original code space
=2 - same as 1 (backwards compatibility)
=3 - split all functions
(cherry picked from FBD17362607)
Summary: `perf2bolt` accepts an executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimizing those functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process.
(cherry picked from FBD17178898)
Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBR entries in the perf input that are the result of kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale, causing it to bail out of optimizing the affected functions. Ideally, a user-mode filter should be set for `perf record` so we don't get asynchronous LBRs in the first place. However, these entries are identifiable since the kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module.
(cherry picked from FBD17170107)
Summary:
Change our edge profiling technique so that instrumentation
does not need to cover every edge. Instead, build a spanning tree
for the CFG and omit instrumentation for edges in the spanning tree.
Infer the edge count for these edges when writing the profile during
run time. The inference works with a bottom-up traversal of the spanning
tree and establishes the value of the edge connecting to the parent based
on a simple flow equation involving output and input edges, where the
only unknown variable is the parent edge.
This requires some engineering in the runtime lib to support dynamic
allocation for building these graphs at runtime.
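A minimal sketch of the inference step, with illustrative structures and
assuming every tree edge flows from parent to child: non-tree edges carry
counters, so after the recursion returns, the flow equation in = out at
each node has the parent edge as its only unknown.

```cpp
#include <cstdint>
#include <vector>

struct TreeNode {
  uint64_t KnownIn = 0;  // instrumented (non-tree) incoming counts
  uint64_t KnownOut = 0; // instrumented (non-tree) outgoing counts
  std::vector<int> Children;
  int Parent = -1;
};

// Bottom-up traversal: child parent-edges are resolved first and added to
// this node's outgoing flow; then KnownIn + parent == KnownOut gives the
// count of the uninstrumented edge to the parent.
void inferParentEdge(std::vector<TreeNode> &Nodes, int Cur,
                     std::vector<uint64_t> &ParentCount) {
  TreeNode &N = Nodes[Cur];
  for (int Child : N.Children) {
    inferParentEdge(Nodes, Child, ParentCount);
    N.KnownOut += ParentCount[Child]; // tree edge Cur -> Child, now known
  }
  if (N.Parent != -1)
    ParentCount[Cur] = N.KnownOut - N.KnownIn;
}
```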
(cherry picked from FBD17062773)
Summary:
We start a thread to preprocess the profile while the main
thread continues to disassemble the input binary. We should not
disassemble it in BAT mode, however, the test to check whether we have
BAT in the input binary depends on the preprocessing thread, so there
is a race where we may start disassembling functions just because the
preprocessing thread hasn't yet concluded that we are in BAT mode. Fix this and
make the main thread check for BAT without depending on the
preprocessing thread.
(cherry picked from FBD17124370)
Summary:
We decode the regular .plt section and we are able to perform
optimizations on it with -plt=hot or -plt=all, however, .plt.got
sections are not decoded by BOLT. This patch teaches BOLT how to read
them. They are created by the bfd linker whenever there is no need for
the dynamic linker to lazy-bind the symbol (when they are eagerly
resolved at binary load time). These entries are 8 bytes in size instead
of 16 bytes like regular PLT entries, and contain a single 7-byte
indirect call instruction followed by a nop.
(cherry picked from FBD17060515)
Summary:
We should not rely on split function detection while aggregating
data, but only look up the original function names in the symbol table.
Split function detection should be done by BOLT and not perf2bolt while
writing the profile. Then, BOLT, when reading it, will take care of
combining functions if necessary.
Relying on split-function detection caused a bug in bolted data collection
where samples in the cold part of a function were falsely attributed to
the hot part instead of the cold part, causing incorrect translation of
addresses.
(cherry picked from FBD16993065)
Summary:
We were too permissive by allowing more jump tables during the
preliminary scan of memory. This allowed jump tables to be falsely
detected, and since we didn't have a way to backtrack jump table
creation, we had to assert.
This diff refactors the code that analyzes jump table contents.
Preliminary and final passes share the same code. The only difference
should be the detection of instruction boundaries that are available
during the final pass.
This should affect strict relocation mode only.
(cherry picked from FBD16923335)
Summary:
BOLT prints "spawning thread to pre-process profile" message even when
it is not running in the aggregation mode. Fix that.
(cherry picked from FBD16908596)
Summary:
Avoid directly allocating string and description tables in
binary's static data region, since they are not needed during runtime
except when writing the profile at exit. Change the runtime library to
open the tables on disk and read only when necessary.
(cherry picked from FBD16626030)
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use the ORC relocation handling mechanism for
that. With this support, this patch also moves code programmatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to the BOLT root
installation, and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.
(cherry picked from FBD16572013)