Summary:
An instrumentation pass that modifies the input binary to
generate a profile after execution finishes. It modifies branches to
increment counters stored in the process memory and injects a new
function that dumps this data to an fdata file, readable by BOLT.
This instrumentation is experimental and currently uses a naive
approach where every branch is instrumented. This is not ideal for
runtime performance, but should be good enough for us to
evaluate/debug LBR profile quality against instrumentation.
It does not support instrumenting indirect calls yet; only direct calls,
direct branches, and indirect local branches are covered.
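A minimal sketch of what the injected code amounts to (names and the output
format are illustrative only; the real pass emits the fdata format BOLT reads):
```
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

constexpr size_t NumBranches = 1024;                // assumed table size
static std::atomic<uint64_t> Counters[NumBranches]; // one slot per branch

// The code injected at each instrumented branch boils down to:
static inline void hitBranch(size_t BranchID) {
  Counters[BranchID].fetch_add(1, std::memory_order_relaxed);
}

// The injected dump function runs when execution finishes and writes the
// counters out (illustrative format, not the real fdata syntax).
static void dumpCounters() {
  if (FILE *F = std::fopen("profile.fdata", "w")) {
    for (size_t I = 0; I < NumBranches; ++I)
      std::fprintf(F, "branch %zu count %llu\n", I,
                   static_cast<unsigned long long>(Counters[I].load()));
    std::fclose(F);
  }
}

static struct RegisterDump {
  RegisterDump() { std::atexit(dumpCounters); }
} RegisterDumpAtExit;
```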
(cherry picked from FBD15998096)
Summary:
Make BOLT ignore empty functions (those containing no instructions,
despite having some space allocated to them and filled with zeroes).
(cherry picked from FBD15981683)
Summary:
Profile bias may happen depending on the hardware counter used
to trigger LBR sampling, on the hardware implementation, and as an
intrinsic characteristic of relying on LBRs. Since we infer fall-through
execution, and these non-taken branches take zero hardware resources to
be represented, an LBR-based profile likely overrepresents paths with
fall-throughs and underrepresents paths with many taken branches. This patch
adds an option to print statistics about profile bias so we can better
understand these biases.
The goal is to analyze differences in the sum of the frequency of all
incoming edges in a basic block versus the sum of all outgoing. In an
ideally sampled profile, these differences should be close to zero. With
this option, the user gets the mean of these differences in flow as a
percentage of the input flow. For example, if this number is 15%, it
means, on average, a block observed 15% more or less flow going out of
it in comparison with the flow going in. We also print the standard
deviation so we can have an idea of how spread apart the different
measurements of flow differences are. If the variance is low, it means the
average bias is happening across all blocks, which is compatible with
using LBRs. If the variance is high, it means some blocks in the profile
have a much higher bias than others, which is compatible with using a
biased event such as cycles to sample LBRs because it overrepresents
paths that end in an expensive instruction.
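A minimal sketch of the statistic being described, assuming per-block sums of
incoming and outgoing edge frequencies are already computed:
```
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct BlockFlow { uint64_t In, Out; }; // summed edge frequencies per block

void printFlowBias(const std::vector<BlockFlow> &Blocks) {
  std::vector<double> Diffs;
  for (const BlockFlow &B : Blocks)
    if (B.In > 0) // skip blocks with no input flow to avoid division by zero
      Diffs.push_back(std::fabs((double)B.Out - (double)B.In) / (double)B.In);
  if (Diffs.empty())
    return;
  double Mean = 0.0;
  for (double D : Diffs)
    Mean += D;
  Mean /= Diffs.size();
  double Var = 0.0;
  for (double D : Diffs)
    Var += (D - Mean) * (D - Mean);
  const double StdDev = std::sqrt(Var / Diffs.size());
  // E.g. "mean 15.0%" reads as: on average, a block sees 15% more or less
  // flow going out than coming in.
  std::printf("flow bias: mean %.1f%%, stddev %.1f%%\n",
              Mean * 100.0, StdDev * 100.0);
}
```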
(cherry picked from FBD15790517)
Summary:
ICF consumes 10-15% of BOLT's runtime; for HHVM, that is around 45 seconds.
This diff parallelizes parts of the pass to make it faster.
A 60% reduction in ICF runtime is measured with the parallel version on HHVM.
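A hedged sketch of one way such a pass can be parallelized, using
llvm::ThreadPool; hashFunction() and foldCongruentFunctions() are hypothetical
placeholders, not BOLT's actual ICF code:
```
#include "llvm/Support/ThreadPool.h"
#include <cstdint>
#include <map>
#include <vector>

class BinaryFunction;                                         // BOLT type
uint64_t hashFunction(const BinaryFunction &);                // hypothetical
void foldCongruentFunctions(std::vector<BinaryFunction *> &); // hypothetical

void parallelICF(std::vector<BinaryFunction *> &Functions) {
  // Only functions with equal hashes can possibly fold, so the buckets are
  // independent and can be processed concurrently.
  std::map<uint64_t, std::vector<BinaryFunction *>> Buckets;
  for (BinaryFunction *BF : Functions)
    Buckets[hashFunction(*BF)].push_back(BF);

  llvm::ThreadPool Pool;
  for (auto &KV : Buckets)
    Pool.async([&Bucket = KV.second] { foldCongruentFunctions(Bucket); });
  Pool.wait();
}
```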
(cherry picked from FBD15589515)
Summary:
Now that we populate jump tables after all functions are disassembled,
we can check for instruction boundaries corresponding to jump table
entries. No need to delegate this task to postProcessJumpTables().
(cherry picked from FBD15814762)
Summary:
During the initial disassembly pass, only identify jump tables
without populating the contents. Later, after all functions have been
disassembled, we have a better idea of jump table boundaries and can do
a better job of populating their entries.
As a result, we no longer have embedded jump tables (i.e. a jump table
that is part of another jump table). If we ever need to keep
sequential jump tables inseparable during the output, we can always
add such functionality later.
Fixes facebookincubator/BOLT#56.
(cherry picked from FBD15800427)
Summary:
During frame analysis, the functions do not change, so stack pointer tracking
(SPT) does not need to be performed more than once.
The current implementation performs the SPT analysis multiple times per
function during frame analysis; we can eliminate this redundant computation.
On HHVM with -frame-opt=hot, this saves around a minute, which is 40% of the
frame optimization runtime (129s to 76s).
fdata should be passed for a reasonable evaluation (we need the call graph).
However, this comes at a memory cost: around 2G is added to the peak when only
-frame-opt=hot is used, but when all the usual flags are passed, the effect on
the peak is only 200K (measured from one test).
Update:
When jemalloc is used, the baseline gets much better and the following runtimes
(in seconds) are observed:
[jemalloc]
hhvm 85 --> 72.
clang 27 --> 23.
[malloc]
hhvm 129 --> 76.
clang 34 --> 27.
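A minimal memoization sketch of the idea, with hypothetical names
(BinaryFunction is the BOLT type; SPTResult and runSPT() are illustrative):
```
#include <unordered_map>

class BinaryFunction; // BOLT type

struct SPTResult {
  // Hypothetical container for the per-function stack-pointer-tracking state.
};

SPTResult runSPT(const BinaryFunction &BF); // hypothetical analysis driver

// Compute SPT once per function and reuse it across frame-analysis queries;
// this trades the extra memory described above for the saved recomputation.
const SPTResult &getSPT(const BinaryFunction &BF) {
  static std::unordered_map<const BinaryFunction *, SPTResult> Cache;
  auto It = Cache.find(&BF);
  if (It == Cache.end())
    It = Cache.emplace(&BF, runSPT(BF)).first;
  return It->second;
}
```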
(cherry picked from FBD15707003)
Summary:
Add an option to get extra profile trace using the recorded event PC.
The trace goes from the latest LBR record destination to the event PC.
(cherry picked from FBD15711804)
Summary:
We used to handle PC-relative address references differently from direct
address references. As a result, some cases, such as escaped function
label address, were not handled when dealing with absolute (non-PIC)
code. This diff moves processing of an address reference into
BinaryContext::handleAddressRef() which is called for both PIC and
non-PIC code.
(cherry picked from FBD15643535)
Summary:
Compile BOLT using C++14.
We want this in order to use some threading and locking tools that do not
exist in C++11.
(cherry picked from FBD15671736)
Summary:
The compiler relies on DWARF so that it is possible to collect profile data
in binaries optimized by PGO techniques and map the samples back to a
representation of the program that was not optimized by PGO. Similarly, this
diff implements an option in BOLT to encode a table in the output binary that
allows us to map data collected in optimized binaries back to the address
space used in the input binary (where the profile is useful, since we do not
support running BOLT on a binary already optimized by BOLT). The goal is to
offer an option to
support BOLT in scenarios where it is not easy to run a special deployment of
the binary with a version that was not optimized by BOLT just for data
collection.
This feature is enabled with the -enable-bat flag. BAT stands for BOLT
Address Translation, which refers to the process of mapping output to
input addresses.
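A minimal sketch of the translation direction BAT enables, assuming a decoded
table keyed by output addresses (the encoded table itself is more compact):
```
#include <cstdint>
#include <map>

// Translates an output-binary address back to the input address space.
uint64_t translate(const std::map<uint64_t, uint64_t> &OutputToInput,
                   uint64_t OutputAddr) {
  auto It = OutputToInput.upper_bound(OutputAddr);
  if (It == OutputToInput.begin())
    return OutputAddr; // address not covered by the table
  --It;                // last entry starting at or before OutputAddr
  return It->second + (OutputAddr - It->first);
}
```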
(cherry picked from FBD15531860)
Summary:
Options such as `-print-only`, `-skip-funcs`, etc. now take regular
expressions. Internally, the option is converted to '^funcname$' form
prior to regex matching. This ensures that names without special
symbols will match exactly, i.e. "foo" will not match "foo123".
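A minimal sketch of the anchoring, using llvm::Regex (the helper name is
hypothetical):
```
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Regex.h"

// "foo" becomes "^foo$", so it matches "foo" but not "foo123".
bool nameMatches(llvm::StringRef Pattern, llvm::StringRef FunctionName) {
  llvm::Regex RE("^" + Pattern.str() + "$");
  return RE.match(FunctionName);
}
```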
(cherry picked from FBD15551930)
Summary:
Run analyzeIndirectBranch() using basic block boundaries instead of
running ad-hoc validation of the jump table assumptions.
(cherry picked from FBD15465034)
Summary:
Move handling of interprocedural references to BinaryContext.
Post-process indirect branches immediately after the CFG is built.
This is almost NFC: since indirect branches are now post-processed
before the profile data is processed, this change interferes with the way
profile data in YAML format is handled.
(cherry picked from FBD15456003)
Summary:
Add an option:
-memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,...
to specialize calls to memcpy() in the listed functions (names can be
supplied as regular expressions) for size 1. The optimization dynamically
checks if the size argument equals 1 and executes a one-byte copy; otherwise
it calls memcpy() as usual. Specific call sites can be indicated
after ":" using their numeric count from the start of the function.
(cherry picked from FBD15428936)
Summary:
SDT markers, which appear as nops in the assembly, are preserved and not
eliminated. Functions with SDT markers are also flagged. Inlining and folding
are disabled for functions that have SDT markers.
(cherry picked from FBD15379799)
Summary:
Parse statically defined tracepoint (SDT) markers from the ELF file, and store them.
Add an option to print SDTs (-print-sdt).
Add test case for parsing and printing SDTs.
(cherry picked from FBD15366712)
Summary:
Previously, ICP worked with a budget of N targets to convert to
direct calls. As long as the frequency of up to N of the hottest targets
surpassed a given fraction (threshold) of the total frequency, say, 90%,
then the optimization would convert a number of targets (up to N) to
direct calls. Otherwise, it would completely abort processing this call
site. The intent was to convert a given fraction of the indirect call
site frequency to use direct calls instead, but this ends up being an
"all or nothing" strategy.
In this patch we change this to operate with the same strategy seen in
LLVM's ICP, with two thresholds. The idea is that the hottest target of
an indirect call site will be compared against these two thresholds: one
checks its frequency relative to the total frequency of the original
indirect call site, and the other checks its frequency relative to the
remaining, unconverted targets (excluding the hottest targets that were
already converted to direct calls). The remaining threshold is typically
set higher than the total threshold. This allows us more control over
ICP.
I expose two pairs of knobs, one for jump tables and another for
indirect calls.
To improve the promotion of hot jump table indices when we have memory
profile, I also fix a bug that could cause us to promote extra indices
besides the hottest ones as seen in the memory profile. When we have the
memory profile, I reapply the dual threshold checks to the memory
profile which specifies exactly which indices are hot. I then update N,
the number of targets to be promoted, based on this new information, and
update frequency information.
To allow us to work with smaller profiles, I also created an option in
perf2bolt to filter out memory samples outside the statically allocated
area of the binary (heap/stack). This option is on by default.
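A hedged sketch of the dual-threshold test described above (names and knob
semantics are illustrative, not BOLT's actual code):
```
#include <cstddef>
#include <cstdint>
#include <vector>

// Targets are sorted by descending frequency; returns how many of the top N
// to promote. Both thresholds are fractions in [0, 1], with
// RemainingThreshold typically set higher than TotalThreshold.
std::size_t countToPromote(const std::vector<uint64_t> &Freqs, std::size_t N,
                           double TotalThreshold, double RemainingThreshold) {
  uint64_t Total = 0;
  for (uint64_t F : Freqs)
    Total += F;
  uint64_t Remaining = Total;
  std::size_t Promoted = 0;
  while (Promoted < N && Promoted < Freqs.size()) {
    const uint64_t Hottest = Freqs[Promoted];
    // The hottest unconverted target must pass both checks: against the
    // total frequency and against the remaining, unconverted frequency.
    if (Hottest < TotalThreshold * Total ||
        Hottest < RemainingThreshold * Remaining)
      break;
    Remaining -= Hottest; // exclude the now-converted target
    ++Promoted;
  }
  return Promoted;
}
```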
(cherry picked from FBD15187832)
Summary:
Make BinaryContext responsible for creation and management of
JumpTables. This will be used for detection and resolution of jump table
conflicts across functions.
(cherry picked from FBD15196017)
Summary:
The perf tool requires the input data to be owned by the current user or
root, otherwise it rejects the input. Use `-f` option to override this
behavior.
(cherry picked from FBD15160678)
Summary:
While checking the size of a jump table, we used the containing
section as a boundary. This worked in most cases, as jump tables are
typically not marked with symbol table entries. However, the compiler
may generate objects for indirect gotos.
(cherry picked from FBD15158905)
Summary:
We used to ignore debug sections by default, but we kept them in the
binary which led to invalid debug information in the output. It's better
to strip debug info and print a warning to the user.
Note: we are not updating debug info by default due to high memory
requirements for large applications.
(cherry picked from FBD15128947)
Summary:
Commit "Update symbols for secondary entry points" introduced
a bug by using getBinaryFunctionContainingAddress() instead of
getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter
would fetch the correct BinaryFunction object for addresses of functions
that were ICF'd. As a result of this bug, the dynamic symbol table was
not updated for function symbols that were folded by ICF.
(cherry picked from FBD15112941)
Summary:
In non-relocation mode we may execute multiple re-write passes either
because we need to split large functions or update debug information for
large functions (in this context large functions are functions that do
not fit into the original function boundaries after optimizations).
When we execute another pass, we reset RewriteInstance and run most of
the steps such as disassembly and profile matching for the 2nd or 3rd
time. However, when we match a profile, we check the `Used` flag and don't
use the profile a 2nd time. Since we didn't reset the flag while
resetting the rest of the state, we ignored the profile for all functions.
Resetting the flag in-between rewrite passes solves the problem.
(cherry picked from FBD15110959)
Summary:
For pre-aggregated profiles, we were using the number of records in the
profile for `NumTraces`, ignoring the count carried by each record. As a
result, the reported percentage of mismatched traces was bogus.
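A minimal sketch of the fix (the record type and its fields are assumptions,
not BOLT's actual aggregator types):
```
#include <cstdint>
#include <vector>

struct PreAggregatedEntry { uint64_t From, To, Count; }; // assumed fields

uint64_t countTraces(const std::vector<PreAggregatedEntry> &Entries) {
  uint64_t NumTraces = 0;
  for (const PreAggregatedEntry &E : Entries)
    NumTraces += E.Count; // previously: ++NumTraces once per record
  return NumTraces;
}
```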
(cherry picked from FBD15093123)
Summary:
Enable -hot-text by default if reordering functions.
Also fail immediately if function reordering is specified on the command
line in non-relocation mode.
(cherry picked from FBD15095178)
Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM.
(cherry picked from FBD15064122)
Summary:
Update the output ELF symbol table for symbols representing
secondary entry points for functions. Previously, those were left
unchanged in the symtab.
(cherry picked from FBD15010517)
Summary:
`size_t` is platform-dependent, and on macOS it is defined as
`unsigned long`, while `uint64_t` is `unsigned long long` there. Many calls
pass both types to templated functions that expect a single type. As a result,
on macOS, calls to `std::max` fail because a template instantiation that takes
`uint64_t, unsigned long` cannot be found.
To work around the issue:
* Specify explicit `std::max` and `std::min` instantiations where necessary,
to work around the compiler trying (and failing) to find a suitable
instantiation (see the sketch after this list).
* For lambda return types, specify an explicit return type where necessary.
* For `operator ==()` calls, use an explicit cast where necessary.
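A minimal sketch of the first two workarounds (variable names are
illustrative):
```
#include <algorithm>
#include <cstddef>
#include <cstdint>

uint64_t pickLimit(uint64_t Limit, std::size_t Size) {
  // std::max(Limit, Size) fails to deduce a single T where uint64_t and
  // size_t are distinct builtin types, as on macOS.
  return std::max<uint64_t>(Limit, Size); // explicit instantiation
  // Alternatively: std::max(Limit, static_cast<uint64_t>(Size));
}
```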
(cherry picked from FBD15030283)
Summary:
When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I
encountered an error:
```
CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies):
The dependency target "AArch64CommonTableGen" of target
"LLVMBOLTTargetAArch64" does not exist.
Call Stack (most recent call first):
cmake/modules/AddLLVM.cmake:607 (llvm_add_library)
tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library)
```
The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is
added by CMake unconditionally. The LLVM project, on the other hand,
only adds the subdirectories that are enabled, by using a `foreach` loop
over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from
`llvm/lib/Target/CMakeLists.txt`, to this project avoids the error.
(cherry picked from FBD15030236)
Summary:
Iterating over SmallPtrSet is non-deterministic. Change it to
SmallSetVector. Similarly, do not sort a vector of ProgramPoint when
computing the dominance frontier, as ProgramPoint uses the pointer value
to determine order. Use a SmallSetVector there too, avoiding duplicates
instead of sorting and uniquing.
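A minimal sketch of the SmallSetVector usage pattern described above:
```
#include "llvm/ADT/SetVector.h"

void iterateDeterministically(int *A, int *B) {
  // SmallSetVector keeps SmallPtrSet-style membership testing but iterates
  // in insertion order, so results are stable from run to run.
  llvm::SmallSetVector<int *, 8> Set;
  Set.insert(A);
  Set.insert(B);
  Set.insert(A); // duplicate: ignored, no sort-and-unique step needed
  for (int *P : Set)
    (void)P; // visits A then B on every run
}
```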
(cherry picked from FBD14992085)
Summary:
If a function size indicated in FDE is different from the one in the
symbol table, we can keep processing the function as we are using the
max size for internal purposes. Typically this happens for
assembly-written functions with padding at the end. This padding is not
included in FDE, but it is in the symbol table.
(cherry picked from FBD14987653)
Summary:
This adds very basic and limited support for split functions.
In non-relocation mode, split functions are ignored, while their debug
info is properly updated. No support in the relocation mode yet.
Split functions consist of a main body and one or more fragments.
For fragments, the main part is called their parent. A fragment
can only be entered via its parent or another fragment.
The short-term goal is to correctly update debug information for split
functions, while the long-term goal is to have a complete support
including full optimization. Note that if we don't detect split
bodies, we would have to add multiple entry points via tail calls,
which we would rather avoid.
Parent functions and fragments are represented by a `BinaryFunction`
and are marked accordingly. For now they are marked as non-simple, and
thus only supported in non-relocation mode. Once we start building a
CFG, it should be a common graph (i.e. the one that includes all
fragments) in the parent function.
The function discovery is unchanged, except for the detection of
`\.cold\.` pattern in the function name, which automatically marks the
function as a fragment of another function.
Because of the local function name ambiguity, we cannot rely on the
function name to establish child fragment and parent relationship.
Instead we rely on disassembly processing.
`BinaryContext::getBinaryFunctionContainingAddress()` now returns a
parent function if an address from its fragment is passed.
There's no jump table support at the moment. Jump tables can have
sources and destinations in both the fragment and the parent.
Parent functions that enter their fragments via C++ exception handling
mechanism are not yet supported.
(cherry picked from FBD14970569)
Summary:
On some platforms
`llvm::make_error_code(std::errc::no_such_process) == std::errc::no_such_process`
evaluates to false.
(cherry picked from FBD14944405)
Summary:
It's possible to pass a profile in invalid format to BOLT, and we
silently ignore it. This could cause a regression, as such a scenario can
go undetected. We should abort processing if no valid data was seen in
the profile and issue a warning if it was partially invalid.
(cherry picked from FBD14941211)
Summary:
If a function was already marked as non-simple, there's no reason to
issue a warning that it has a reference in the middle of an
instruction. Besides, sometimes there wouldn't be instructions
disassembled at a given entry, and the warning would be incorrect.
(cherry picked from FBD14938227)
Summary:
In binutils 2.30 a bfd linker accidentally started modifying some
relocations on output under `-q/--emit-relocs` by turning on
R_X86_64_converted_reloc_bit. As a result, BOLT ignored such
relocations and failed to correctly update the binary.
This diff filters out R_X86_64_converted_reloc_bit from the relocation
type.
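A minimal sketch of the filtering, assuming the binutils-internal value of
(1 << 7) for the bit:
```
#include <cstdint>

// The bit is internal to bfd; binutils defines it as (1 << 7).
constexpr uint64_t R_X86_64_converted_reloc_bit = 1ULL << 7;

uint64_t canonicalRelocType(uint64_t Type) {
  return Type & ~R_X86_64_converted_reloc_bit; // strip the stray bit
}
```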
(cherry picked from FBD14907832)
Summary:
For easier analysis of the hottest targets of jump tables, it helps to
have basic block successors sorted by taken frequency.
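A minimal sketch of such an ordering (the types are illustrative; BOLT keeps
branch counts alongside its successor lists):
```
#include <algorithm>
#include <cstdint>
#include <vector>

struct SuccInfo { const void *Block; uint64_t TakenCount; };

void sortByTakenFrequency(std::vector<SuccInfo> &Successors) {
  std::stable_sort(Successors.begin(), Successors.end(),
                   [](const SuccInfo &A, const SuccInfo &B) {
                     return A.TakenCount > B.TakenCount; // hottest first
                   });
}
```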
(cherry picked from FBD14856640)
Summary:
When processing perf.data in LBR mode, if the data was
collected without -j, we currently and confusingly report that all samples
mismatch the input binary, even though the samples match and merely
lack LBR info. Change perf2bolt to detect this scenario and print
a helpful message instructing the user to collect data with LBR.
(cherry picked from FBD14817732)
Summary:
While updating DWARF, we used to convert address ranges for functions
into DW_AT_ranges format, even if the ranges were not split and still
had a simple [low, high) form. We had to do this because functions with
contiguous ranges could share an abbrev with a non-contiguous-range
function, and we had to convert the abbrev.
It turns out that excessive usage of DW_AT_ranges may lead to
internal core dumps in gdb in the presence of .gdb_index.
I still don't know the root cause, but reducing the number of
DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the
issue.
We can keep a simple range for DIEs that are guaranteed not to
share an abbrev with any non-contiguous function. Hence we have to
postpone the update of function ranges until we've seen all DIEs.
Note that DIEs from different compilation units could share the same
abbrev, and hence we have to process DIEs from all compilation units.
(cherry picked from FBD14814043)
Summary:
Some instructions in assembly-written functions could reference 8-byte
constants from other instructions using 4-byte offsets, presumably to
save a couple of bytes.
Detect such cases, and skip processing such functions until we teach
BOLT how to handle references into the middle of an instruction.
(cherry picked from FBD14768212)