llvm-project

Commit Graph

Author	SHA1	Message	Date
laith sakka	fde5a2b470	Run shrink wrapping in parallel Summary: Shrink wrapping is an expensive part of frame optimizations if performed on all functions. This diff makes it run in parallel, reducing wall time. (cherry picked from FBD16092651)	2019-07-02 10:48:43 -07:00
laith sakka	7d42835418	Run buildCFG in disassembly in parallel Summary: This diff parallelizes the construction of the call graph during disassembly. The diff includes a change to parallel-utilities where another interface is added that supports running tasks on binaryFunctions that involve adding instruction annotations. This pattern is common in different places, e.g. frame optimizations, and justifies creating an interface that abstracts out all the messy details. (cherry picked from FBD16232809)	2019-07-12 07:25:50 -07:00
laith sakka	f4ab6e6924	run finalize functions in parallel Summary: (cherry picked from FBD16188733)	2019-07-10 10:59:56 -07:00
laith sakka	98539b0966	run aligner pass in parallel Summary: This diff parallelizes the aligner pass (cherry picked from FBD16176327)	2019-07-09 17:59:41 -07:00
laith sakka	9977b03fea	Run reorder blocks in parallel Summary: This diff changes the reorderBasicBlocks pass to run in parallel. It does so by adding locks to the fix branches function and creating temporary MCCodeEmitters when estimating basic block code size. (cherry picked from FBD16161149)	2019-07-08 12:32:58 -07:00
Rafael Auler	1169f1fdd8	[BOLT] Support duplicating jump tables Summary: If two indirect branches use the same jump table, we need to detect this and duplicate jump tables so we can modify this CFG correctly. This is necessary for instrumentation and shrink wrapping. For the latter, we only detect this and bail, fixing this old known issue with shrink wrapping. Other minor changes to support better instrumentation: add an option to instrument only hot functions, add LOCK prefix to instrumentation increment instruction, speed up splitting critical edges by avoiding calling recomputeLandingPads() unnecessarily. (cherry picked from FBD16101312)	2019-07-02 16:56:41 -07:00
Rafael Auler	8880969ced	[BOLT] Restrict creation of jump tables Summary: Heuristic that creates a jump table for every memory access, including those we do not match against a pattern in an indirect jump, is too permissive and has false positives. Guard this logic under strict mode until we figure out a better strategy. (cherry picked from FBD16192205)	2019-07-10 15:41:34 -07:00
laith sakka	3cfc76cdbf	Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel. Summary: Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do. In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations. This will make it easier to implement parallel tasks, and eliminate redundant coding efforts. (cherry picked from FBD16116077)	2019-07-03 17:23:19 -07:00
laith sakka	f10d1fe0f3	Run cleanAnnotations within frame analysis in parallel Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations() (cherry picked from FBD16096711)	2019-07-02 13:42:17 -07:00
laith sakka	00c252f6d8	Clean SPTMap in frame analysis in parallel Summary: This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 on HHVM and bringing the runtime of the frame optimizer down to 33 seconds on HHVM. (cherry picked from FBD15914371)	2019-06-19 18:01:00 -07:00
laith sakka	86b529bd54	run SPT in parallel, and split annotation allocator Summary: This diff includes two main changes: 1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator. 2) SPT is parallelized. (cherry picked from FBD15913492)	2019-06-14 19:56:11 -07:00
Wenlei He	4e90fc1e3b	[BOLT] Prioritize Jump Table ICP target by frequency and indice count Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices have a better chance of getting promoted. (cherry picked from FBD16099701)	2019-07-02 15:51:20 -07:00
Maksim Panchenko	078ece1691	[BOLT] Fix out-of-bounds entry points Summary: Check that a symbol address is less than the next function address before considering it for a secondary entry. (cherry picked from FBD16056468)	2019-06-28 11:53:34 -07:00
Maksim Panchenko	e89ad0db4b	[BOLT] Introduce strict relocation mode Summary: In strict relocation mode we rely on relocations to represent all possible entry points into a function. Most of the code generated by tested compilers (gcc and clang) will result in relocations against any internal labels for jump tables and for computed goto tables. In situations where we cannot properly reconstruct a jump table, or when we cannot determine a table that guides an indirect jump, e.g. when multiple computed goto tables are used, we conservatively assume that the indirect jump can end up at any possible basic block referenced by relocations. In strict mode, simple functions may include the aforementioned instructions with unknown control flow with a conservative list of destinations added to the containing basic block. This allows us to expand coverage of simple functions and to enable code reordering optimizations for more functions. The strict mode is recommended when BOLT is used with a well-formed code generated by a compiler. To use the strict mode, add "-strict" on the command line. Another effect of this diff, is that with relocations, we will always replace the immediate operand of an instruction with a symbol if the relocation exists against this operand. Also this diff fixes issues with Clang compiled with -fpic. (cherry picked from FBD15872849)	2019-06-28 09:21:27 -07:00
Maksim Panchenko	06e7a1e059	[BOLT] Ignore false function references Summary: A relocation can have an addend that makes it look as the relocated value is in a different section from the symbol being relocated. E.g., a relocation against a variable in .rodata could have a negative offset that will make it look like it is against a symbol in .text (a section that typically precedes .rodata). Unless the relocation is against a section symbol, we know exactly the symbol that is being relocated and there is no issue. However, when the linker leaves only a section relocation (i.e. a relocation against a section symbol when a temporary original symbol gets deleted), we have to guess the relocated symbol, and can falsely detect a function reference in the case described above. The fix is to keep a section relocation if the corresponding relocated value falls into a different section, and to detect and ignore false function reference. (cherry picked from FBD16030791)	2019-06-27 03:20:17 -07:00
Wenlei He	459add2827	[BOLT] Force non-relocation mode for heatmap generation Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbols in zero-sized code sections during function discovery. (cherry picked from FBD16009610)	2019-06-26 11:06:46 -07:00
Rafael Auler	0d23cbaa52	[BOLT] Initial experimental instrumentation pass Summary: An instrumentation pass that modifies the input binary to generate a profile after execution finishes. It modifies branches to increment counters stored in the process memory and injects a new function that dumps this data to an fdata file, readable by BOLT. This instrumentation is experimental and currently uses a naive approach where every branch is instrumented. This is not ideal for runtime performance, but should be good enough for us to evaluate/debug LBR profile quality against instrumentation. Does not support instrumenting indirect calls yet, only direct calls, direct branches and indirect local branches. (cherry picked from FBD15998096)	2019-06-19 20:10:49 -07:00
Rafael Auler	db02a1a142	[BOLT] Ignore empty funcs in relocation mode Summary: Make BOLT ignore empty functions (those containing no instructions, despite having some space allocated to it filled with zeroes). (cherry picked from FBD15981683)	2019-06-24 20:23:22 -07:00
Rafael Auler	bda13b7dd8	[BOLT] Add option to print profile bias stats Summary: Profile bias may happen depending on the hardware counter used to trigger LBR sampling, on the hardware implementation and as an intrinsic characteristic of relying on LBRs. Since we infer fall-through execution and these non-taken branches take zero hardware resources to be represented, LBR-based profile likely overrepresents paths with fall throughs and underrepresents paths with many taken branches. This patch adds an option to print statistics about profile bias so we can better understand these biases. The goal is to analyze differences in the sum of the frequency of all incoming edges in a basic block versus the sum of all outgoing. In an ideally sampled profile, these differences should be close to zero. With this option, the user gets the mean of these differences in flow as a percentage of the input flow. For example, if this number is 15%, it means, on average, a block observed 15% more or less flow going out of it in comparison with the flow going in. We also print the standard deviation so we can have an idea of how spread apart are different measurements of flow differences. If variance is low, it means the average bias is happening across all blocks, which is compatible with using LBRs. If the variance is high, it means some blocks in the profile have a much higher bias than others, which is compatible with using a biased event such as cycles to sample LBRs because it overrepresents paths that end in an expensive instruction. (cherry picked from FBD15790517)	2019-06-10 17:26:48 -07:00
laith sakka	1ec091e6f5	Parallelize ICF Pass Summary: ICF consumes 10-15% of bolt runtime; for HHVM that is around 45 seconds. This diff performs some parallelization of the pass to make it faster. A 60% reduction in the ICF runtime is measured with the parallel version on HHVM. (cherry picked from FBD15589515)	2019-05-31 16:45:31 -07:00
Maksim Panchenko	9894de0094	[BOLT] Check instruction boundaries while populating jump tables Summary: Now that we populate jump tables after all functions are disassembled, we can check for instruction boundaries corresponding to jump table entries. No need to delegate this task to postProcessJumpTables(). (cherry picked from FBD15814762)	2019-06-13 15:31:30 -07:00
Maksim Panchenko	9e2ad3f593	[BOLT] Delay populating jump tables Summary: During the initial disassembly pass, only identify jump tables without populating the contents. Later, after all functions have been disassembled, we have a better idea of jump table boundaries and can do a better job of populating their entries. As a result, we no longer have embedded jump tables (i.e. a jump table that is part of another jump table). If we ever need to keep sequential jump tables inseparable during the output, we can always add such functionality later. Fixes facebookincubator/BOLT#56. (cherry picked from FBD15800427)	2019-06-12 18:21:02 -07:00
laith sakka	66cf16208f	Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis. Summary: During frame analysis, the functions do not change, and stack pointer tracking does not need to be performed more than one time. The current implementation performs the SPT analysis multiple times per function during the frame analysis; we can eliminate such computation redundancy. On HHVM with -frame-opts=hot, this saves around a minute, which is 40% of the frame optimization runtime (129s to 76s). fdata should be passed for a reasonable evaluation (we need the call graph). However, this comes at a memory cost: around 2G is added to the peak when only -frame-opt=hot is used, but when all the usual flags are passed, the effect on the peak is only 200K (measured from one test). Update: When jemalloc is used the base became way better and the following runtimes are observed: [jemalloc] hhvm 85 --> 72. clang 27 --> 23. [malloc] hhvm 129 --> 76. clang 34 --> 27. (cherry picked from FBD15707003)	2019-06-06 12:58:14 -07:00
Maksim Panchenko	9df5063c0e	[perf2bolt] Option to use event PC with LBR stack Summary: Add an option to get extra profile trace using the recorded event PC. The trace goes from the latest LBR record destination to the event PC. (cherry picked from FBD15711804)	2019-06-06 19:38:06 -07:00
Maksim Panchenko	fac6a89c23	[BOLT] Better handling of address references Summary: We used to handle PC-relative address references differently from direct address references. As a result, some cases, such as escaped function label address, were not handled when dealing with absolute (non-PIC) code. This diff moves processing of an address reference into BinaryContext::handleAddressRef() which is called for both PIC and non-PIC code. (cherry picked from FBD15643535)	2019-06-04 15:30:22 -07:00
laith sakka	d3c1821f5f	Compile Bolt using std 14. Summary: Compile Bolt using std 14. We want that to be able to use some threading and locking tools that do not exist in std 11. (cherry picked from FBD15671736)	2019-06-05 10:32:29 -07:00
Rafael Auler	21f4303bfd	Support data collection in bolted binaries Summary: Similarly to how the compiler relies on DWARF to map samples, so it is possible to collect profile data in binaries optimized by PGO techniques and retrofit data to be used in a representation of the program that was not optimized by PGO, this diff implements an option in BOLT to encode a table in the output binary that allows us to map data collected in optimized binaries back to the address space used in the input binary (where the profile is useful, since we do not support running BOLT on a binary already optimized by BOLT). The goal is to offer an option to support BOLT in scenarios where it is not easy to run a special deployment of the binary with a version that was not optimized by BOLT just for data collection. This feature is enabled with the -enable-bat flag. BAT stands for BOLT Address Translation, which refers to the process of mapping output to input addresses. (cherry picked from FBD15531860)	2019-04-12 17:33:46 -07:00
Laith Sakka	3df2c9ea1f	Update SDT locations after bolt reordering Summary: Update SDT locations in .note section to match the new location after bolt reorders the code. (cherry picked from FBD15427779)	2019-05-17 07:58:27 -07:00
Maksim Panchenko	9ef9a7b1be	[BOLT] Use regex matching for function names passed on command line Summary: Options such as `-print-only`, `-skip-funcs`, etc. now take regular expressions. Internally, the option is converted to '^funcname$' form prior to regex matching. This ensures that names without special symbols will match exactly, i.e. "foo" will not match "foo123". (cherry picked from FBD15551930)	2019-05-29 18:33:09 -07:00
Laith Sakka	c8038da36e	Minor-fix: remove duplicate definition of SPT optimization timer Summary: (cherry picked from FBD28111560)	2019-05-22 15:03:42 -07:00
Maksim Panchenko	e5b1d9cd8c	[BOLT][NFC] Fix white space (cherry picked from FBD15485688)	2019-05-23 15:49:36 -07:00
Maksim Panchenko	f57d3c00fc	[BOLT] Better verification of jump tables Summary: Run analyzeIndirectBranch() using basic block boundaries instead of running ad-hoc validation of the jump table assumptions. (cherry picked from FBD15465034)	2019-05-22 18:14:34 -07:00
Maksim Panchenko	be344c8de7	[BOLT] Refactor handling of interproc refs Summary: Move handling of interprocedural references to BinaryContext. Post-process indirect branches immediately after the CFG is built. This is almost NFC. Since indirect branches are now post-processed before the profile data is processed it interferes with the way the profile data in YAML format is handled. (cherry picked from FBD15456003)	2019-05-22 11:26:58 -07:00
Maksim Panchenko	d047df12c5	[BOLT] Add an option to specialize memcpy() for 1 byte copy Summary: Add an option: -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,... to specialize calls to memcpy() in listed functions (the name could be supplied in regex) for size 1. The optimization will dynamically check if the size argument equals to 1 and execute a one byte copy, otherwise it will call memcpy() as usual. Specific call sites could be indicated after ":" using their numeric count from the start of the function. (cherry picked from FBD15428936)	2019-05-20 20:11:40 -07:00
Laith Saed Sakka	ca659e4336	Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations Summary: SDT markers that appear as nops in the assembly are preserved and not eliminated. Functions with SDT markers are also flagged. Inlining and folding are disabled for functions that have SDT markers. (cherry picked from FBD15379799)	2019-05-16 12:46:32 -07:00
Laith Saed Sakka	4755825447	Parse statically defined tracepoint markers from .note.stapsdt section Summary: Parse statically defined tracepoints(SDT) markers from the ELF file, and store them. Add an option to print SDTs (-print-sdt). Add test case for parsing and printing SDTs. (cherry picked from FBD15366712)	2019-05-15 17:19:18 -07:00
Rafael Auler	f1fde44154	[BOLT] Improve ICP activation policy and hot jt processing Summary: Previously, ICP worked with a budget of N targets to convert to direct calls. As long as the frequency of up to N of the hottest targets surpassed a given fraction (threshold) of the total frequency, say, 90%, then the optimization would convert a number of targets (up to N) to direct calls. Otherwise, it would completely abort processing this call site. The intent was to convert a given fraction of the indirect call site frequency to use direct calls instead, but this ends up being an "all or nothing" strategy. In this patch we change this to operate with the same strategy seen in LLVM's ICP, with two thresholds. The idea is that the hottest target of an indirect call site will be compared against these two thresholds: one checks its frequency relative to the total frequency of the original indirect call site, and the other checks its frequency relative to the remaining, unconverted targets (excluding the hottest targets that were already converted to direct calls). The remaining threshold is typically set higher than the total threshold. This allows us more control over ICP. I expose two pairs of knobs, one for jump tables and another for indirect calls. To improve the promotion of hot jump table indices when we have memory profile, I also fix a bug that could cause us to promote extra indices besides the hottest ones as seen in the memory profile. When we have the memory profile, I reapply the dual threshold checks to the memory profile which specifies exactly which indices are hot. I then update N, the number of targets to be promoted, based on this new information, and update frequency information. To allow us to work with smaller profiles, I also created an option in perf2bolt to filter out memory samples outside the statically allocated area of the binary (heap/stack). This option is on by default. (cherry picked from FBD15187832)	2019-05-02 12:28:34 -07:00
Maksim Panchenko	fee61231ef	[BOLT] Move JumpTable management to BinaryContext Summary: Make BinaryContext responsible for creation and management of JumpTables. This will be used for detection and resolution of jump table conflicts across functions. (cherry picked from FBD15196017)	2019-05-02 17:42:06 -07:00
Maksim Panchenko	4b55967d9e	[perf2bolt] Pass `-f` flag to perf Summary: perf tool requires the input data to be owned by the current user or root, otherwise it rejects the input. Use `-f` option to override this behavior. (cherry picked from FBD15160678)	2019-04-30 17:08:22 -07:00
Maksim Panchenko	310b32fbe5	[BOLT] Limit jump table size by containing object Summary: While checking for a size of a jump table, we've used containing section as a boundary. This worked for most cases as typically jump tables are not marked with symbol table entries. However, the compiler may generate objects for indirect goto's. (cherry picked from FBD15158905)	2019-04-30 15:47:10 -07:00
Maksim Panchenko	f1dfd38dec	[BOLT][NFC] Move DynoStats out of BinaryFunction Summary: Move DynoStats into separate source files. (cherry picked from FBD15138883)	2019-04-29 12:51:10 -07:00
Maksim Panchenko	2b1523362e	[BOLT] Strip debug sections by default Summary: We used to ignore debug sections by default, but we kept them in the binary which led to invalid debug information in the output. It's better to strip debug info and print a warning to the user. Note: we are not updating debug info by default due to high memory requirements for large applications. (cherry picked from FBD15128947)	2019-04-26 15:30:12 -07:00
Rafael Auler	21ee0e98c7	[BOLT] Fix symboltable update bug Summary: Commit "Update symbols for secondary entry points" introduced a bug by using getBinaryFunctionContainingAddress() instead of getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter would fetch the correct BinaryFunction object for addresses of functions that were ICF'd. As a result of this bug, the dynamic symbol table was not updated for function symbols that were folded by ICF. (cherry picked from FBD15112941)	2019-04-26 19:52:36 -07:00
Maksim Panchenko	caa0fafa18	[BOLT] Fix profile reading in non-reloc mode Summary: In non-relocation mode we may execute multiple re-write passes either because we need to split large functions or update debug information for large functions (in this context large functions are functions that do not fit into the original function boundaries after optimizations). When we execute another pass, we reset RewriteInstance and run most of the steps such as disassembly and profile matching for the 2nd or 3rd time. However, when we match a profile, we check `Used` flag, and don't use the profile for the 2nd time. Since we didn't reset the flag while resetting the rest of the states, we ignored profile for all functions. Resetting the flag in-between rewrite passes solves the problem. (cherry picked from FBD15110959)	2019-04-26 16:32:28 -07:00
Maksim Panchenko	5717b0c427	[perf2bolt] Fix print report for pre-aggregated profile Summary: For pre-aggregated profile, we were using the number of records in the profile for `NumTraces` ignoring the counts per record. As a result, the reported percentage of mismatched traces was bogus. (cherry picked from FBD15093123)	2019-04-25 16:34:50 -07:00
Maksim Panchenko	492e4a515e	[BOLT] Automatically enable -hot-text Summary: Enable -hot-text by default if reordering functions. Also fail immediately if function reordering is specified on the command line in non-relocation mode. (cherry picked from FBD15095178)	2019-04-25 17:00:05 -07:00
Brian Gesiak	91b2de3c23	[BOLT] Minimize BOLT's diff with LLVM by removing trivial changes (NFC) Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM. (cherry picked from FBD15064122)	2019-04-24 11:24:15 -04:00
Rafael Auler	4e4d39c21c	[BOLT] Update symbols for secondary entry points Summary: Update the output ELF symbol table for symbols representing secondary entry points for functions. Previously, those were left unchanged in the symtab. (cherry picked from FBD15010517)	2019-04-18 16:32:22 -07:00
Brian Gesiak	eba1a67730	Fix casting issues on macOS Summary: `size_t` is platform-dependent, and on macOS it is defined as `unsigned long long`. This is not the same type as is used in many calls to templated functions that expect the same type. As a result, on macOS, calls to `std::max` fail because a template function that takes `uint64_t, unsigned long long` cannot be found. To work around the issue: * Specify explicit `std::max` and `std::min` functions where necessary, to work around the compiler trying (and failing) to find a suitable instantiation. * For lambda return types, specify an explicit return type where necessary. * For `operator ==()` calls, use an explicit cast where necessary. (cherry picked from FBD15030283)	2019-04-22 11:27:50 -04:00
Brian Gesiak	d9f1bd42fd	[cmake] Only build enabled targets Summary: When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I encountered an error: ``` CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies): The dependency target "AArch64CommonTableGen" of target "LLVMBOLTTargetAArch64" does not exist. Call Stack (most recent call first): cmake/modules/AddLLVM.cmake:607 (llvm_add_library) tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library) ``` The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is added by CMake unconditionally. The LLVM project, on the other hand, only adds the subdirectories that are enabled, by using a `foreach` loop over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from `llvm/lib/Target/CMakeLists.txt`, to this project avoids the error. (cherry picked from FBD15030236)	2019-04-22 11:19:02 -04:00
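
Commit 9ef9a7b1be says that function-name options are wrapped in the '^funcname$' form before regex matching, so that "foo" does not match "foo123". The sketch below illustrates that anchoring with the standard `<regex>` library. The helper name `matchesFunctionName` is hypothetical and this is not BOLT's actual code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (an assumption, not BOLT's implementation): wrap the
// user-supplied pattern as ^pattern$ so the pattern must cover the whole name.
bool matchesFunctionName(const std::string &Pattern, const std::string &Name) {
  assert(!Pattern.empty() && "expected a non-empty pattern");
  const std::regex Anchored("^" + Pattern + "$");
  // With both anchors in place, a search succeeds only on a full-name match.
  return std::regex_search(Name, Anchored);
}
```

Under this sketch a plain name such as "foo" matches only itself, while a pattern such as "foo.*" still matches "foo123".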
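
Commit f1fde44154 describes a dual-threshold ICP policy: each candidate target is checked against its share of the total call-site frequency and against its share of the frequency that remains after hotter targets were promoted. The following is a minimal sketch of that idea only. The function name, the use of integer percentages, and the decision to stop at the first failing target are all assumptions, not code taken from BOLT or LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns how many of the hottest targets pass both
// thresholds. Frequencies must be sorted from hottest to coldest.
size_t countPromotableTargets(const std::vector<uint64_t> &Frequencies,
                              size_t MaxTargets,
                              uint64_t TotalPercentThreshold,
                              uint64_t RemainingPercentThreshold) {
  uint64_t Total = 0;
  for (uint64_t F : Frequencies)
    Total += F;

  uint64_t Remaining = Total; // frequency not yet converted to direct calls
  size_t Promoted = 0;
  for (uint64_t F : Frequencies) {
    if (Promoted == MaxTargets || F == 0)
      break;
    assert(F <= Remaining && "frequencies must sum to Total");
    // Threshold 1: share of the original call site's total frequency.
    if (F * 100 < TotalPercentThreshold * Total)
      break;
    // Threshold 2: share of the remaining, unconverted frequency.
    if (F * 100 < RemainingPercentThreshold * Remaining)
      break;
    Remaining -= F;
    ++Promoted;
  }
  return Promoted;
}
```

With frequencies {70, 20, 10}, a total threshold of 30% and a remaining threshold of 50%, only the first target is promoted: the second target covers 20 of the remaining 30 but only 20% of the total.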
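
Commit bda13b7dd8 reports the mean and standard deviation of the per-block difference between incoming and outgoing flow, expressed relative to the input flow. The sketch below shows one way such statistics could be computed. The struct and function names, the use of the absolute difference, the population standard deviation, and the skipping of blocks with zero input flow are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical types: summed edge frequencies into and out of a basic block.
struct BlockFlow {
  uint64_t In;
  uint64_t Out;
};

struct BiasStats {
  double Mean;   // mean of |Out - In| / In, as a fraction (0.15 == 15%)
  double StdDev; // population standard deviation of the same quantity
};

BiasStats computeFlowBias(const std::vector<BlockFlow> &Blocks) {
  std::vector<double> Diffs;
  for (const BlockFlow &B : Blocks) {
    if (B.In == 0)
      continue; // assumption: blocks without input flow are skipped
    const double In = static_cast<double>(B.In);
    const double Out = static_cast<double>(B.Out);
    Diffs.push_back(std::fabs(Out - In) / In);
  }
  if (Diffs.empty())
    return {0.0, 0.0};

  double Sum = 0.0;
  for (double D : Diffs)
    Sum += D;
  const double Mean = Sum / static_cast<double>(Diffs.size());

  double SquaredSum = 0.0;
  for (double D : Diffs)
    SquaredSum += (D - Mean) * (D - Mean);
  return {Mean, std::sqrt(SquaredSum / static_cast<double>(Diffs.size()))};
}
```

In the commit's terms, a low deviation with a nonzero mean would suggest a bias spread evenly across blocks, while a high deviation would suggest that some blocks are far more biased than others.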
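
Commit 5717b0c427 fixes the mismatch percentage for pre-aggregated profiles by weighting each record by its count instead of counting records. A small sketch of the difference follows. The `AggregatedRecord` type and the function name are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record of a pre-aggregated profile: one trace shape together
// with the number of times it was observed.
struct AggregatedRecord {
  uint64_t Count;
  bool Mismatched;
};

// Percentage of mismatched traces, weighting every record by its count.
double mismatchedTracePercent(const std::vector<AggregatedRecord> &Records) {
  uint64_t NumTraces = 0;
  uint64_t NumMismatched = 0;
  for (const AggregatedRecord &R : Records) {
    NumTraces += R.Count; // counting records here would ignore the weights
    if (R.Mismatched)
      NumMismatched += R.Count;
  }
  if (NumTraces == 0)
    return 0.0;
  return 100.0 * static_cast<double>(NumMismatched) /
         static_cast<double>(NumTraces);
}
```

For two records with counts 90 (matched) and 10 (mismatched), the weighted result is 10%, whereas dividing by the number of records would report 50%.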
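
Commit d047df12c5 specializes selected memcpy() call sites for a size of 1: the size argument is checked at run time and a single byte is copied directly, otherwise memcpy() is called as usual. BOLT applies this at the binary level. The source-level sketch below only mirrors the resulting control flow, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Source-level illustration (an assumption, not BOLT output) of a call site
// specialized for Size == 1.
void *memcpySpecializedForOne(void *Dst, const void *Src, size_t Size) {
  if (Size == 1) {
    // Fast path: copy one byte without calling memcpy().
    *static_cast<unsigned char *>(Dst) =
        *static_cast<const unsigned char *>(Src);
    return Dst;
  }
  // Slow path: fall back to the regular library call.
  return std::memcpy(Dst, Src, Size);
}
```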
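
Commit eba1a67730 works around failed template deduction in `std::max` and `std::min` calls when the two argument types differ on a given platform. A minimal sketch of that workaround, assuming a call that mixes `uint64_t` and `size_t`, is shown below. The function `largerOf` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example of the workaround: name the template argument
// explicitly so both operands are converted to one type.
uint64_t largerOf(uint64_t A, size_t B) {
  // A plain std::max(A, B) does not compile on platforms where uint64_t and
  // size_t are distinct types, because deduction sees two different types.
  return std::max<uint64_t>(A, B);
}
```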
