llvm-project

Commit Graph

Author	SHA1	Message	Date
laith sakka	98539b0966	run aligner pass in parallel Summary: this diff parallelize the aligner pass (cherry picked from FBD16176327)	2019-07-09 17:59:41 -07:00
laith sakka	9977b03fea	Run reorder blocks in parallel Summary: This diff change reorderBasicBlocks pass to run in parallel, it does so by adding locks to the fix branches function, and creating temporary MCCodeEmitters when estimating basic block code size. (cherry picked from FBD16161149)	2019-07-08 12:32:58 -07:00
Rafael Auler	1169f1fdd8	[BOLT] Support duplicating jump tables Summary: If two indirect branches use the same jump table, we need to detect this and duplicate dump tables so we can modify this CFG correctly. This is necessary for instrumentation and shrink wrapping. For the latter, we only detect this and bail, fixing this old known issue with shrink wrapping. Other minor changes to support better instrumentation: add an option to instrument only hot functions, add LOCK prefix to instrumentation increment instruction, speed up splitting critical edges by avoiding calling recomputeLandingPads() unnecessarily. (cherry picked from FBD16101312)	2019-07-02 16:56:41 -07:00
Rafael Auler	8880969ced	[BOLT] Restrict creation of jump tables Summary: Heuristic that creates a jump table for every memory access, including those we do not match against a pattern in an indirect jump, is too permissive and has false positives. Guard this logic under strict mode until we figure out a better strategy. (cherry picked from FBD16192205)	2019-07-10 15:41:34 -07:00
laith sakka	3cfc76cdbf	Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel. Summary: Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do. In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that setup the scheduling and granularity configurations. This will make it easier to implement parallel tasks, and eliminate redundant coding efforts. (cherry picked from FBD16116077)	2019-07-03 17:23:19 -07:00
laith sakka	f10d1fe0f3	Run cleanAnnotations within frame analysis in parallel Summary: This diff parallelize the function FrameAnalysis::cleanAnnotations() (cherry picked from FBD16096711)	2019-07-02 13:42:17 -07:00
laith sakka	00c252f6d8	Clean SPTMap in frame anaylsis in parallel Summary: This diff parallelize the STPClean() function reducing its runtime from 5 seconds to 0.4 on HHVM, Making the runtime for the frame optimizer goes down to 33 seconds on HHVM. (cherry picked from FBD15914371)	2019-06-19 18:01:00 -07:00
laith sakka	86b529bd54	run SPT in parallel, and split annotation allocator Summary: This diff includes two main changes: 1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator. 2) SPT is parallelized. (cherry picked from FBD15913492)	2019-06-14 19:56:11 -07:00
Wenlei He	4e90fc1e3b	[BOLT] Prioritize Jump Table ICP target by frequency and indice count Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices has better chance of getting promoted. (cherry picked from FBD16099701)	2019-07-02 15:51:20 -07:00
Maksim Panchenko	078ece1691	[BOLT] Fix out-of-bounds entry points Summary: Check that a symbol address is less than the next function address before considering it for a secondary entry. (cherry picked from FBD16056468)	2019-06-28 11:53:34 -07:00
Maksim Panchenko	e89ad0db4b	[BOLT] Introduce strict relocation mode Summary: In strict relocation mode we rely on relocations to represent all possible entry points into a function. Most of the code generated by tested compilers (gcc and clang) will result in relocations against any internal labels for jump tables and for computed goto tables. In situations where we cannot properly reconstruct a jump table, or when we cannot determine a table that guides an indirect jump, e.g. when multiple computed goto tables are used, we conservatively assume that the indirect jump can end up at any possible basic block referenced by relocations. In strict mode, simple functions may include the aforementioned instructions with unknown control flow with a conservative list of destinations added to the containing basic block. This allows us to expand coverage of simple functions and to enable code reordering optimizations for more functions. The strict mode is recommended when BOLT is used with a well-formed code generated by a compiler. To use the strict mode, add "-strict" on the command line. Another effect of this diff, is that with relocations, we will always replace the immediate operand of an instruction with a symbol if the relocation exists against this operand. Also this diff fixes issues with Clang compiled with -fpic. (cherry picked from FBD15872849)	2019-06-28 09:21:27 -07:00
Maksim Panchenko	06e7a1e059	[BOLT] Ignore false function references Summary: A relocation can have an addend that makes it look as the relocated value is in a different section from the symbol being relocated. E.g., a relocation against a variable in .rodata could have a negative offset that will make it look like it is against a symbol in .text (a section that typically precedes .rodata). Unless the relocation is against a section symbol, we know exactly the symbol that is being relocated and there is no issue. However, when the linker leaves only a section relocation (i.e. a relocation against a section symbol when a temporary original symbol gets deleted), we have to guess the relocated symbol, and can falsely detect a function reference in the case described above. The fix is to keep a section relocation if the corresponding relocated value falls into a different section, and to detect and ignore false function reference. (cherry picked from FBD16030791)	2019-06-27 03:20:17 -07:00
Wenlei He	459add2827	[BOLT] Force non-relocation mode for heatmap generation Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This changes disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbol in zero-sized code section during function discovery. (cherry picked from FBD16009610)	2019-06-26 11:06:46 -07:00
Rafael Auler	0d23cbaa52	[BOLT] Initial experimental instrumentation pass Summary: An instrumentation pass that modifies the input binary to generate a profile after execution finishes. It modifies branches to increment counters stored in the process memory and injects a new function that dumps this data to an fdata file, readable by BOLT. This instrumentation is experimental and currently uses a naive approach where every branch is instrumented. This is not ideal for runtime performance, but should be good enough for us to evaluate/debug LBR profile quality against instrumentation. Does not support instrumenting indirect calls yet, only direct calls, direct branches and indirect local branches. (cherry picked from FBD15998096)	2019-06-19 20:10:49 -07:00
Rafael Auler	db02a1a142	[BOLT] Ignore empty funcs in relocation mode Summary: Make BOLT ignore empty functions (those containing no instructions, despite having some space allocated to it filled with zeroes). (cherry picked from FBD15981683)	2019-06-24 20:23:22 -07:00
Rafael Auler	bda13b7dd8	[BOLT] Add option to print profile bias stats Summary: Profile bias may happen depending on the hardware counter used to trigger LBR sampling, on the hardware implementation and as an intrinsic characteristic of relying on LBRs. Since we infer fall-through execution and these non-taken branches take zero hardware resources to be represented, LBR-based profile likely overrepresents paths with fall throughs and underrepresents paths with many taken branches. This patch adds an option to print statistics about profile bias so we can better understand these biases. The goal is to analyze differences in the sum of the frequency of all incoming edges in a basic block versus the sum of all outgoing. In an ideally sampled profile, these differences should be close to zero. With this option, the user gets the mean of these differences in flow as a percentage of the input flow. For example, if this number is 15%, it means, on average, a block observed 15% more or less flow going out of it in comparison with the flow going in. We also print the standard deviation so we can have an idea of how spread apart are different measurements of flow differences. If variance is low, it means the average bias is happening across all blocks, which is compatible with using LBRs. If the variance is high, it means some blocks in the profile have a much higher bias than others, which is compatible with using a biased event such as cycles to sample LBRs because it overrepresents paths that end in an expensive instruction. (cherry picked from FBD15790517)	2019-06-10 17:26:48 -07:00
laith sakka	1ec091e6f5	Parallelize ICF Pass Summary: ICF consumes 10-15% of bolt runtime, for HHVM that is around 45 seconds. this diff perform some parallelization for the pass to make it faster. A 60% reduction in the ICF runtime is measured on the parallel version for HHVM. (cherry picked from FBD15589515)	2019-05-31 16:45:31 -07:00
Maksim Panchenko	9894de0094	[BOLT] Check instruction boundaries while populating jump tables Summary: Now that we populate jump tables after all functions are disassembled, we can check for instruction boundaries corresponding to jump table entries. No need to delegate this task to postProcessJumpTables(). (cherry picked from FBD15814762)	2019-06-13 15:31:30 -07:00
Maksim Panchenko	9e2ad3f593	[BOLT] Delay populating jump tables Summary: During the initial disassembly pass, only identify jump tables without populating the contents. Later, after all functions have been disassembled, we have a better idea of jump table boundaries and can do a better job of populating their entries. As a result, we no longer have embedded jump tables (i.e. a jump table that is parter of another jump table). If we ever need to keep sequential jump tables inseparable during the output, we can always add such functionality later. Fixes facebookincubator/BOLT#56. (cherry picked from FBD15800427)	2019-06-12 18:21:02 -07:00
laith sakka	66cf16208f	Use singleton instances for SPT (stack pointer tracking) in FrameAnalysis. Summary: During frame analysis, the functions do not change, and stack pointer tracking does not need to be performed more than one time. The current implementation performs the SPT analysis multiple times per function during the frame analysis, we ca eliminate such computation redundancy. On HHVM with -frame-opts=hot, this save around a minute which is 40% of the frame optimization runtime. (129s to 76s). fdata should be passed for a reasonable evaluation (we need the call graph). However, this comes at a memory cost, around 2G to the peak when only -frame-opt=hot only is used but, When all the usual flags are passed, the effect is to the peak is only 200K (measured from one test). Update: When jemalloc is used the base became way better and the following runtime are observed: [jemalloc] hhvm 85 --> 72. clang 27 --> 23. [malloc] hhvm 129 --> 76. clang 34 --> 27. (cherry picked from FBD15707003)	2019-06-06 12:58:14 -07:00
Maksim Panchenko	9df5063c0e	[perf2bolt] Option to use event PC with LBR stack Summary: Add an option to get extra profile trace using the recorded event PC. The trace goes from the latest LBR record destination to the event PC. (cherry picked from FBD15711804)	2019-06-06 19:38:06 -07:00
Maksim Panchenko	fac6a89c23	[BOLT] Better handling of address references Summary: We used to handle PC-relative address references differently from direct address references. As a result, some cases, such as escaped function label address, were not handled when dealing with absolute (non-PIC) code. This diff moves processing of an address reference into BinaryContext::handleAddressRef() which is called for both PIC and non-PIC code. (cherry picked from FBD15643535)	2019-06-04 15:30:22 -07:00
laith sakka	d3c1821f5f	Compile Bolt using std 14. Summary: Compile Bolt using std 14. We want that to be able to use some threading the locking tools that do not exists in std 11. (cherry picked from FBD15671736)	2019-06-05 10:32:29 -07:00
Rafael Auler	21f4303bfd	Support data collection in bolted binaries Summary: Similarly to how the compiler relies on DWARF to map samples, so it is possible to collect profile data in binaries optimized by PGO techniques and retrofit data to be used in a representation of the program that was not optimized by PGO, this diff implements an option in BOLT to encode a table in the output binary that allows us to map data collected in optimized binaries back to the address space used in the input binary (where the profile is useful, since we do not support running BOLT on a binary already optimized by BOLT). The goal is to offer an option to support BOLT in scenarios where it is not easy to run a special deployment of the binary with a version that was not optimized by BOLT just for data collection. This feature is enabled with the -enable-bat flag. BAT stands for BOLT Address Translation, which refers to the process of mapping output to input addresses. (cherry picked from FBD15531860)	2019-04-12 17:33:46 -07:00
Laith Sakka	3df2c9ea1f	Update SDT locations after bolt reordering Summary: Update SDT locations in .note section to match the new location after bolt reorder the code. (cherry picked from FBD15427779)	2019-05-17 07:58:27 -07:00
Maksim Panchenko	9ef9a7b1be	[BOLT] Use regex matching for function names passed on command line Summary: Options such as `-print-only`, `-skip-funcs`, etc. now take regular expressions. Internally, the option is converted to '^funcname$' form prior to regex matching. This ensures that names without special symbols will match exactly, i.e. "foo" will not match "foo123". (cherry picked from FBD15551930)	2019-05-29 18:33:09 -07:00
Laith Sakka	c8038da36e	Minor-fix: remove duplicate definition of SPT optimization timer Summary: (cherry picked from FBD28111560)	2019-05-22 15:03:42 -07:00
Maksim Panchenko	e5b1d9cd8c	[BOLT][NFC] Fix white space (cherry picked from FBD15485688)	2019-05-23 15:49:36 -07:00
Maksim Panchenko	f57d3c00fc	[BOLT] Better verification of jump tables Summary: Run analyzeIndirectBranch() using basic block boundaries instead of running ad-hoc validation of the jump table assumptions. (cherry picked from FBD15465034)	2019-05-22 18:14:34 -07:00
Maksim Panchenko	be344c8de7	[BOLT] Refactor handling of interproc refs Summary: Move handling of interprocedural references to BinaryContext. Post-process indirect branches immediately after the CFG is built. This is almost NFC. Since indirect branches are now post-processed before the profile data is processed it interferes with the way the profile data in YAML format is handled. (cherry picked from FBD15456003)	2019-05-22 11:26:58 -07:00
Maksim Panchenko	d047df12c5	[BOLT] Add an option to specialize memcpy() for 1 byte copy Summary: Add an option: -memcpy1-spec=func1,func2:cs1,func3:cs1:cs2,... to specialize calls to memcpy() in listed functions (the name could be supplied in regex) for size 1. The optimization will dynamically check if the size argument equals to 1 and execute a one byte copy, otherwise it will call memcpy() as usual. Specific call sites could be indicated after ":" using their numeric count from the start of the function. (cherry picked from FBD15428936)	2019-05-20 20:11:40 -07:00
Laith Saed Sakka	ca659e4336	Preserve nops that are SDT markers in binaries and disable SDT conflicting optimizations Summary: SDT markers that appears as nops in the assembly, are preserved and not eliminated. Functions with SDT markers are also flagged. Inlining and folding are disabled for functions that have SDT markers. (cherry picked from FBD15379799)	2019-05-16 12:46:32 -07:00
Laith Saed Sakka	4755825447	Parse statically defined tracepoint markers from .note.stapsdt section Summary: Parse statically defined tracepoints(SDT) markers from the ELF file, and store them. Add an option to print SDTs (-print-sdt). Add test case for parsing and printing SDTs. (cherry picked from FBD15366712)	2019-05-15 17:19:18 -07:00
Rafael Auler	f1fde44154	[BOLT] Improve ICP activation policy and hot jt processing Summary: Previously, ICP worked with a budget of N targets to convert to direct calls. As long as the frequency of up to N of the hottest targets surpassed a given fraction (threshold) of the total frequency, say, 90%, then the optimization would convert a number of targets (up to N) to direct calls. Otherwise, it would completely abort processing this call site. The intent was to convert a given fraction of the indirect call site frequency to use direct calls instead, but this ends up being a "all or nothing" strategy. In this patch we change this to operate with the same strategy seem in LLVM's ICP, with two thresholds. The idea is that the hottest target of an indirect call site will be compared against these two thresholds: one checks its frequency relative to the total frequency of the original indirect call site, and the other checks its frequency relative to the remaining, unconverted targets (excluding the hottest targets that were already converted to direct calls). The remaining threshold is typically set higher than the total threshold. This allows us more control over ICP. I expose two pairs of knobs, one for jump tables and another for indirect calls. To improve the promotion of hot jump table indices when we have memory profile, I also fix a bug that could cause us to promote extra indices besides the hottest ones as seen in the memory profile. When we have the memory profile, I reapply the dual threshold checks to the memory profile which specifies exactly which indices are hot. I then update N, the number of targets to be promoted, based on this new information, and update frequency information. To allow us to work with smaller profiles, I also created an option in perf2bolt to filter out memory samples outside the statically allocated area of the binary (heap/stack). This option is on by default. (cherry picked from FBD15187832)	2019-05-02 12:28:34 -07:00
Maksim Panchenko	fee61231ef	[BOLT] Move JumpTable management to BinaryContext Summary: Make BinaryContext responsible for creation and management of JumpTables. This will be used for detection and resolution of jump table conflicts across functions. (cherry picked from FBD15196017)	2019-05-02 17:42:06 -07:00
Maksim Panchenko	4b55967d9e	[perf2bot] Pass `-f` flag to perf Summary: perf tool requires the input data to be owned by the current user or root, otherwise it rejects the input. Use `-f` option to override this behavior. (cherry picked from FBD15160678)	2019-04-30 17:08:22 -07:00
Maksim Panchenko	310b32fbe5	[BOLT] Limit jump table size by containing object Summary: While checking for a size of a jump table, we've used containing section as a boundary. This worked for most cases as typically jump tables are not marked with symbol table entries. However, the compiler may generate objects for indirect goto's. (cherry picked from FBD15158905)	2019-04-30 15:47:10 -07:00
Maksim Panchenko	f1dfd38dec	[BOLT][NFC] Move DynoStats out of BinaryFunction Summary: Move DynoStats into separate source files. (cherry picked from FBD15138883)	2019-04-29 12:51:10 -07:00
Maksim Panchenko	2b1523362e	[BOLT] Strip debug sections by default Summary: We used to ignore debug sections by default, but we kept them in the binary which led to invalid debug information in the output. It's better to strip debug info and print a warning to the user. Note: we are not updating debug info by default due to high memory requirements for large applications. (cherry picked from FBD15128947)	2019-04-26 15:30:12 -07:00
Rafael Auler	21ee0e98c7	[BOLT] Fix symboltable update bug Summary: Commit "Update symbols for secondary entry points" introduced a bug by using getBinaryFunctionContainingAddress() instead of getBinaryFunctionAtAddress() regarding ICF'd functions. Only the latter would fetch the correct BinaryFunction object for addresses of functions that were ICF'd. As a result of this bug, the dynamic symbol table was not updated for function symbols that were folded by ICF. (cherry picked from FBD15112941)	2019-04-26 19:52:36 -07:00
Maksim Panchenko	caa0fafa18	[BOLT] Fix profile reading in non-reloc mode Summary: In non-relocation mode we may execute multiple re-write passes either because we need to split large functions or update debug information for large functions (in this context large functions are functions that do not fit into the original function boundaries after optimizations). When we execute another pass, we reset RewriteInstance and run most of the steps such as disassembly and profile matching for the 2nd or 3rd time. However, when we match a profile, we check `Used` flag, and don't use the profile for the 2nd time. Since we didn't reset the flag while resetting the rest of the states, we ignored profile for all functions. Resetting the flag in-between rewrite passes solves the problem. (cherry picked from FBD15110959)	2019-04-26 16:32:28 -07:00
Maksim Panchenko	5717b0c427	[perf2bolt] Fix print report for pre-aggregated profile Summary: For pre-aggregated profile, we were using the number of records in the profile for `NumTraces` ignoring the counts per record. As a result, the reported percentage of mismatched traces was bogus. (cherry picked from FBD15093123)	2019-04-25 16:34:50 -07:00
Maksim Panchenko	492e4a515e	[BOLT] Automatically enable -hot-text Summary: Enable -hot-text by default if reordering functions. Also fail immediately if function reordering is specified on the command line in non-relocation mode. (cherry picked from FBD15095178)	2019-04-25 17:00:05 -07:00
Brian Gesiak	91b2de3c23	[BOLT] Minimize BOLT's diff with LLVM by removing trivial changes (NFC) Summary: BOLT works as a series of patches rebased onto upstream LLVM at revision `f137ed238db`. Some of these patches introduce unnecessary whitespace changes or includes. Remove these to minimize the diff with upstream LLVM. (cherry picked from FBD15064122)	2019-04-24 11:24:15 -04:00
Rafael Auler	4e4d39c21c	[BOLT] Update symbols for secondary entry points Summary: Update the output ELF symbol table for symbols representing secondary entry points for functions. Previously, those were left unchanged in the symtab. (cherry picked from FBD15010517)	2019-04-18 16:32:22 -07:00
Brian Gesiak	eba1a67730	Fix casting issues on macOS Summary: `size_t` is platform-dependent, and on macOS it is defined as `unsigned long long`. This is not the same type as is used in many calls to templated functions that expect the same type. As a result, on macOS, calls to `std::max` fail because a template function that takes `uint64_t, unsigned long long` cannot be found. To work around the issue: * Specify explicit `std::max` and `std::min` functions where necessary, to work around the compiler trying (and failing) to find a suitable instantiation. * For lambda return types, specify an explicit return type where necessary. * For `operator ==()` calls, use an explicit cast where necessary. (cherry picked from FBD15030283)	2019-04-22 11:27:50 -04:00
Brian Gesiak	d9f1bd42fd	[cmake] Only build enabled targets Summary: When attempting to build llvm-bolt with `-DLLVM_ENABLE_TARGETS="X86"`, I encountered an error: ``` CMake Error at cmake/modules/AddLLVM.cmake:559 (add_dependencies): The dependency target "AArch64CommonTableGen" of target "LLVMBOLTTargetAArch64" does not exist. Call Stack (most recent call first): cmake/modules/AddLLVM.cmake:607 (llvm_add_library) tools/llvm-bolt/src/Target/AArch64/CMakeLists.txt:1 (add_llvm_library) ``` The issue is that the `llvm-bolt/src/Target/AArch64` subdirectory is added by CMake unconditionally. The LLVM project, on the other hand, only adds the subdirectories that are enabled, by using a `foreach` loop over `LLVM_TARGETS_TO_BUILD`. Copying that same loop, from `llvm/lib/Target/CMakeLists.txt`, to this project avoids the error. (cherry picked from FBD15030236)	2019-04-22 11:19:02 -04:00
Rafael Auler	3b422eafd0	[BOLT] Fix non-determinism in shrink wrapping Summary: Iterating over SmallPtrSet is non-deterministic. Change it to SmallSetVector. Similarly, do not sort a vector of ProgramPoint when computing the dominance frontier, as ProgramPoint uses the pointer value to determine order. Use a SmallSetVector there too to avoid duplicates instead of sorting + uniqueing. (cherry picked from FBD14992085)	2019-04-17 18:20:56 -07:00
Maksim Panchenko	433f3e3e02	[BOLT] Process CFIs for functions with FDE size mismatch Summary: If a function size indicated in FDE is different from the one in the symbol table, we can keep processing the function as we are using the max size for internal purposes. Typically this happens for assembly-written functions with padding at the end. This padding is not included in FDE, but it is in the symbol table. (cherry picked from FBD14987653)	2019-04-17 15:17:55 -07:00
Maksim Panchenko	99ef4c90c1	[BOLT] Basic support for split functions Summary: This adds very basic and limited support for split functions. In non-relocation mode, split functions are ignored, while their debug info is properly updated. No support in the relocation mode yet. Split functions consist of a main body and one or more fragments. For fragments, the main part is called their parent. Any fragment could only be entered via its parent or another fragment. The short-term goal is to correctly update debug information for split functions, while the long-term goal is to have a complete support including full optimization. Note that if we don't detect split bodies, we would have to add multiple entry points via tail calls, which we would rather avoid. Parent functions and fragments are represented by a `BinaryFunction` and are marked accordingly. For now they are marked as non-simple, and thus only supported in non-relocation mode. Once we start building a CFG, it should be a common graph (i.e. the one that includes all fragments) in the parent function. The function discovery is unchanged, except for the detection of `\.cold\.` pattern in the function name, which automatically marks the function as a fragment of another function. Because of the local function name ambiguity, we cannot rely on the function name to establish child fragment and parent relationship. Instead we rely on disassembly processing. `BinaryContext::getBinaryFunctionContainingAddress()` now returns a parent function if an address from its fragment is passed. There's no jump table support at the moment. Jump tables can have source and destinations in both fragment and parent. Parent functions that enter their fragments via C++ exception handling mechanism are not yet supported. (cherry picked from FBD14970569)	2019-04-16 10:24:34 -07:00
Maksim Panchenko	ffae5e73f3	[BOLT] Fix an issue with std:errc Summary: On some platforms `llvm::make_error_code(std::errc::no_such_process) == std::errc::no_such_process` evaluates to false. (cherry picked from FBD14944405)	2019-04-15 16:42:49 -07:00
Rafael Auler	31fc56b313	[BOLT] Fix adjustFunctionBoundaries w.r.t. entry points Summary: Don't consider symbols in another section when processing additional entry points for a function. (cherry picked from FBD14962853)	2019-04-16 14:35:29 -07:00
Maksim Panchenko	22ba3dc816	[BOLT] Add another section to the list of hot text movers Summary: (cherry picked from FBD14954472)	2019-04-16 10:39:05 -07:00
Maksim Panchenko	27dcec9717	[BOLT] Abort processing if the profile has no valid data Summary: It's possible to pass a profile in invalid format to BOLT, and we silently ignore it. This could cause a regression as such scenario can go undetected. We should abort processing if no valid data was seen in the profile and issue a warning if it was partially invalid. (cherry picked from FBD14941211)	2019-04-15 14:03:01 -07:00
Maksim Panchenko	8f98268518	[BOLT] Reduce warnings for non-simple functions Summary: If a function was already marked as non-simple, there's no reason to issue a warning that it has a reference in the middle of an instruction. Besides, sometimes there wouldn't be instructions disassembled at a given entry, and the warning would be incorrect. (cherry picked from FBD14938227)	2019-04-15 11:56:55 -07:00
Maksim Panchenko	e50e89be9e	[BOLT] Handle R_X86_64_converted_reloc_bit Summary: In binutils 2.30 a bfd linker accidentally started modifying some relocations on output under `-q/--emit-relocs` by turning on R_X86_64_converted_reloc_bit. As a result, BOLT ignored such relocations and failed to correctly update the binary. This diff filters out R_X86_64_converted_reloc_bit from the relocation type. (cherry picked from FBD14907832)	2019-04-11 17:11:08 -07:00
Maksim Panchenko	315ae74de3	[BOLT] Include <numeric> for std::iota Summary: Some compilers require <numeric> header. (cherry picked from FBD14868132)	2019-04-09 21:22:41 -07:00
Maksim Panchenko	88375d311e	[BOLT] Sort basic block successors for printing Summary: For easier analysis of the hottest targets of jump tables it helps to have basic block successors sorted based on the taken frequency. (cherry picked from FBD14856640)	2019-04-09 11:27:23 -07:00
Maksim Panchenko	a8e05d067d	[BOLT] Add interface to extract values from static addresses (cherry picked from FBD14858028)	2019-04-09 12:29:40 -07:00
Maksim Panchenko	7d89b113d8	[BOLT][NFC] Indentation fix (cherry picked from FBD14856700)	2019-04-09 11:31:45 -07:00
Rafael Auler	90996eb54b	[PERF2BOLT] Print a better message if perf.data lacks LBR Summary: If processing the perf.data in LBR mode but the data was collected without -j, currently we confusingly report all samples to mismatch the input binary, even though the samples match but lack LBR info. Change perf2bolt to detect this scenario and print a helpful message instructing the user to collect data with LBR. (cherry picked from FBD14817732)	2019-04-05 17:27:25 -07:00
Maksim Panchenko	624a0e810d	[DWARF][BOLT] Convert DW_AT_(low\|high)_pc to DW_AT_ranges only if necessary Summary: While updating DWARF, we used to convert address ranges for functions into DW_AT_ranges format, even if the ranges were not split and still had a simple [low, high) form. We had to do this because functions with contiguous ranges could be sharing an abbrev with non-contiguous range function, and we had to convert the abbrev. It turns out, that the excessive usage of DW_AT_ranges may lead to internal core dumps in gdb in the presence of .gdb_index. I still don't know the root cause of it, but reducing the number DW_AT_ranges used by DW_TAG_subprogram DIEs does alleviate the issue. We can keep a simple range for DIEs that are guaranteed not to share an abbrev with any non-contiguous function. Hence we have to postpone the update of function ranges until we've seen all DIEs. Note that DIEs from different compilation units could share the same abbrev, and hence we have to process DIEs from all compilation units. (cherry picked from FBD14814043)	2019-04-01 20:26:41 -07:00
Maksim Panchenko	c8a927696c	[BOLT] Detect internal references into a middle of instruction Summary: Some instructions in assembly-written functions could reference 8-byte constants from another instructions using 4-byte offsets, presumably to save a couple of bytes. Detect such cases, and skip processing such functions until we teach BOLT how to handle references into a middle of instruction. (cherry picked from FBD14768212)	2019-04-03 22:31:12 -07:00
Maksim Panchenko	7fd487066f	[BOLT] Move BinaryFunctions into a BinaryContext and more Summary: A long due refactoring that makes interfaces cleaner and less awkward. Mainly makes the future work way easier. (cherry picked from FBD14766284)	2019-04-03 15:52:01 -07:00
Maksim Panchenko	8894853f42	[BOLT][DWARF] Dedup .debug_abbrev section patches Summary: When we patch .debug_abbrev we issue many duplicate patches. Instead of storing these patches as a vector, use a hash map. This saves some processing time and memory. (cherry picked from FBD14691292)	2019-03-29 14:22:54 -07:00
Maksim Panchenko	297d1a4e1a	[BOLT] Do not write jump table section headers Summary: In non-relocation mode we were accidentally emitting section headers for every single jump table. This happened with default `-jump-tables=basic`. (cherry picked from FBD14653282)	2019-03-27 13:58:31 -07:00
Maksim Panchenko	d1b76f2ac2	[BOLT] Allocate enough space past __hot_end for huge pages Summary: While using "-hot-text" option, we might not get enough cold text to fill up the last huge page, and we can get data allocated on this page producing undesirable effects. To prevent this from happening, always make sure to allocate enough space past __hot_end. (cherry picked from FBD14575100)	2019-03-21 21:13:45 -07:00
Maksim Panchenko	69faf61372	[BOLT] Fix section lookup while deleting symbols Summary: While removing redundant local symbols, we used new section index to lookup the corresponding section in the old section table. As a result, we used to either not remove the correct symbols, or remove the wrong ones. (cherry picked from FBD14552047)	2019-03-20 16:13:09 -07:00
Maksim Panchenko	b8d3dc40ea	[BOLT] Use local binding for cold fragment symbols Summary: We used to use existing symbol binding while duplicating and renaming cold fragment symbols. As a result, some of those were emitted with global binding. This confuses gdb, and it starts treating those symbols as additional entry points. The fix is to always emit such symbols with a local binding. This also means that we have to sort static symbol table before emission to make sure local symbols precede all others. (cherry picked from FBD14529265)	2019-03-19 13:46:21 -07:00
Maksim Panchenko	6bcb3389dd	[BOLT] Place hot text mover functions into a separate section Summary: Create a separate pass for assigning functions to sections. Detect functions originating from special sections (by default .stub and .mover) and place them into ".text.mover" if "-hot-text" options is specified. Cold functions are isolated from hot functions even when no function re-ordering is specified. (cherry picked from FBD14512628)	2019-03-15 13:43:36 -07:00
Maksim Panchenko	17cd2034f3	[BOLT] Fix debug line info emission Summary: GDB does not like if the first entry in the line info table after end_sequence entry is not marked with is_stmt. If this happens, it will not print the correct line number information for such address. Note that everything works fine starting with the first address marked with is_stmt. This could happen if the first instruction in the cold section wasn't marked with is_stmt. The fix is to always emit debug line info for the first instruction in any function fragment with is_stmt flag. (cherry picked from FBD14516629)	2019-03-18 19:22:26 -07:00
Maksim Panchenko	61ea19edf8	[BOLT][NFC] Fix compilation warnings Summary: Get rid of warnings while building with Clang. (cherry picked from FBD14495620)	2019-03-15 15:06:41 -07:00
Maksim Panchenko	0a55001a0e	[BOLT] Fix -hot-functions-at-end option Summary: Make "-hot-functions-at-end" option work again. (cherry picked from FBD14476242)	2019-03-14 20:32:04 -07:00
Maksim Panchenko	163adbec9f	[BOLT] Refactor allocatable sections rewrite part Summary: This refactoring makes it easier to create new code sections and control code placement. As an example, cold code is being placed into ".text.cold" which is emitted independently from ".text", and the final address assignment becomes more flexible. Previously, in non-relocation mode we used to emit temporary section name into .shstrtab. This resulted in unnecessary bloat of this section. There was unnecessary padding emitted at the end of text section. After fixing this, the output binary becomes smaller. I had to change the way exception handling tables are re-written as the current infra does not support cross-section label difference. This means we have to emit absolute landing pad addresses, which might not work for PIE binaries. I'm going to address this once I investigate the current exception handling issues in PIEs. This diff temporarily disables "-hot-functions-at-end" option. (cherry picked from FBD14475693)	2019-03-14 18:51:05 -07:00
Maksim Panchenko	a9e64947c5	[NFC][BOLT] Move ExecutableFileMemoryManager into its own file (cherry picked from FBD14474800)	2019-03-14 18:49:40 -07:00
Rafael Auler	c593563d1f	Do not assert on addresses read from processIndirectBranch Summary: As part of our heuristics to decode an indirect branch, if we suspect the branch is an indirect tail call, we add its probable target to the BC::InterproceduralReferences vector to detect functions with more than one entry point. However, if this probable target is not in an allocatable section, we were asserting. Remove this assertion and change the code to conditionally store to InterproceduralReferences instead. The probable target could be garbage at this point because of analyzeIndirectBranch failing to identify the load instruction that has the memory address of the target, so we should tolerate this. (cherry picked from FBD14432821)	2019-03-12 16:36:35 -07:00
Maksim Panchenko	0c704eb75a	[BOLT-HEATMAP] Initial heat map implementation Summary: Add heatmap subcommand to produce heatmaps based on perf.data with LBR. The output is produced in colored ASCII format. llvm-bolt heatmap -p perf.data <executable> -block-size=<uint> - size of a heat map block in bytes (default 64) -line-size=<uint> - number of entries per line (default 256) -max-address=<uint> - maximum address considered valid for heatmap (default 4GB) -o=<string> - heatmap output file (default stdout) (cherry picked from FBD13969992)	2019-02-05 15:28:19 -08:00
Maksim Panchenko	ff6e21290f	[BOLT] New inliner implementation Summary: Addresses correctness issues related to inlining. Inlining heuristics are not part of this diff. (cherry picked from FBD13796888)	2019-01-31 11:23:02 -08:00
Maksim Panchenko	365bd1f1c8	[BOLT] For non-simple functions always update jump tables in-place Summary: For non-simple function we can miss a reference to a jump table or to an indirect goto table. If we move the jump table, the missed reference will not get updated, and the corresponding indirect jump will end up in the old (wrong) location. Updating the original jump table in-place should take care of the issue. (cherry picked from FBD13849776)	2019-01-28 13:46:18 -08:00
Rafael Auler	af81c7ff80	[perf2bolt] Add support for generating autofdo input Summary: Autofdo tools support. (cherry picked from FBD13779026)	2019-01-22 17:21:45 -08:00
Maksim Panchenko	c6ce2abb7d	[perf2bolt] Optimize memory usage in perf2bolt Summary: While converting perf profile, we only need CFG for functions that were profiled and can skip building CFG for the rest. This saves us some processing time and memory. Breakdown processing of perf.data into two steps. The first step parses the data, saves it in intermediate format, and marks functions with the profile. The second step attributes the profile to functions with CFG. When we disassemble and build CFG for functions in aggregate-only mode, we skip functions without the profile. (cherry picked from FBD13706697)	2019-01-15 23:43:40 -08:00
Maksim Panchenko	2fe0c38d6b	[perf2bolt] Better tracking of process forking Summary: Improve tracking of forked processes. If a process corresponding to the input binary has forked/started before 'perf record' was initiated, then the full name of the binary will be recorded in a corresponding MMAP2 event. We've being handling such cases well so far. However, if the process was forked after 'perf record' has started, and execve(2) wasn't called afterwards, then there will be no MMAP2 event recorded corresponding to the mapping of the main binary (unrelated MMAP2 events could still be recorded). To track such cases, we need to parse 'perf script --show-task-events' command output, and to scan for PERF_RECORD_FORK events, and then add forked process PIDs to the list associated with the input binary. If the fork event was followed by an exec event (PERF_RECORD_COMM exec) of a different binary, then the forked PID should be ignored. If the exec event was associated with our input binary, then the correct MMAP2 event was recorded and parsed. To track if the event occurred before or after 'perf record', we parse event's time. This helps us to differentiate some events. E.g. the exec event is only registered correctly if it happened after perf recording has started (otherwise the "exec" part is missing), and thus we only record forks with non-zero time stamps. (cherry picked from FBD13250904)	2018-11-21 20:04:00 -08:00
Maksim Panchenko	067a385000	[BOLT] Add thresholds for function splitting Summary: Use newly added function size estimation to measure the effectiveness and guide function splitting. Two new tuning options are added: -split-threshold=<uint> split function only if its main size is reduced by more than given amount of bytes. Default value: 0, i.e. split iff the size is reduced. Note that on some architectures the size can increase after splitting. -split-align-threshold=<uint> when deciding to split a function, apply this alignment while doing the size comparison (see -split-threshold). Default value: 2. (cherry picked from FBD13136352)	2018-11-15 16:03:34 -08:00
Maksim Panchenko	b0f7fddd35	[BOLT] Add method for better function size estimation Summary: Add BinaryContext::calculateEmittedSize() that ephemerally emits code to allow precise estimation of the function size. Relaxation and macro-op alignment adjustments are taken into account. (cherry picked from FBD13092139)	2018-11-15 16:02:16 -08:00
Maksim Panchenko	e1b8fade7f	[BOLT] Add branch priority policy for blocks with 2 successors Summary: On x86 the difference between long and short jump instructions could be either 4 or 3 bytes, depending if it's a conditional jump or not. For a basic block with 2 jump instructions, if we know that one of the successors is in a different code region, then we can make it a target of an unconditional jump instruction. This will save 1 byte in case the conditional jump happens to be a short one. (cherry picked from FBD13078139)	2018-11-14 14:43:59 -08:00
Maksim Panchenko	40d9fcfdca	[BOLT] Workaround for Clang de-virtualization bug Summary: When Clang is boot-strapped with (Thin)LTO, it may produce a code fragment similar to below: .LFT663334 (6 instructions, align : 1) Predecessors: .LFT663333 00000538: movb $0x1, %al 0000053a: movl %eax, -0x2c(%rbp) 0000053d: movl $"_ZN5clang6Parser12ConsumeParenEv/1", %ecx 00000542: testb $0x1, %cl 00000545: movq -0x40(%rbp), %r14 00000549: je .Ltmp1071462 Successors: .Ltmp1071462, .LFT663335 .LFT663335 (2 instructions, align : 1) Predecessors: .LFT663334 0000054b: movq (%r12), %rax 0000054f: movq .Ltmp0(%rax), %rcx Successors: .Ltmp1071462 .Ltmp1071462 (7 instructions, align : 1) Predecessors: .LFT663334, .LFT663335 00000556: movq %r12, %rdi 00000559: callq *%rcx ....... The code above is making a call by dereferencing a pointer to a member function. A pointer to a member function could either be a regular function, or a virtual function. To differentiate between the two, AMD64 ABI (originated from Itanium ABI) uses the last bit of the pointer. The call instruction sequence varies depending if the function is virtual or not, and the pointer's last bit is checked. If it's "1" then the value of the pointer (minus 1) is used as an offset in the object vtable to get the address of the function, otherwise the pointer is used directly as a function address. In this specific case, a de-virtualization is taking place, but it's not complete. Compiler knows that the member function pointer is actually a non-virtual function _ZN5clang6Parser12ConsumeParenEv (aka "clang::Parser::ConsumeParen()"). However, it keeps the (dead) code that checks the last bit of _ZN5clang6Parser12ConsumeParenEv, and furthermore keeps the code (unreachable/dead) to make a virtual call while using (_ZN5clang6Parser12ConsumeParenEv - 1) as an offset into the vtable. This is obviously wrong, but since the code is unreachable, it will never affect the runtime correctness. The value "_ZN5clang6Parser12ConsumeParenEv - 1" falls into a last byte of a function preceding _ZN5clang6Parser12ConsumeParenEv, and BOLT creates a label ".Ltmp0" pointing to this last byte that is referenced in by the instruction sequence above. It just happens that the last byte is also in the middle of the last instruction, and as a result, BOLT never emits the label, hence resulting in the error message "Undefined temporary symbol". The workaround is to detect non-pc-relative relocations from code pointing to some (fptr - 1). Note that this is not completely error-prone, but non-pc-relative references from code into a middle of a function are quite rare, and chances that in a normal situation they will point to a byte preceding some function address are virtually zero. (cherry picked from FBD13030310)	2018-11-12 12:38:50 -08:00
Maksim Panchenko	30fd960951	[BOLT] Update local symbol count in symbol table Summary: Fix sh_info entry for symbol table section to reflect updated number of local symbols. (cherry picked from FBD10503216)	2018-10-22 18:48:12 -07:00
Maksim Panchenko	a76b13d48e	[perf2bolt] Pre-aggregate LBR samples Summary: Pre-aggregating LBR data cuts pef2bolt processing times in half. (cherry picked from FBD10420286)	2018-10-02 17:16:26 -07:00
Rafael Auler	74a71c6812	Fix bug in analyzeRelocation for GOT entries Summary: Special case GOT relocs to ignore addend subtracting logic in analyzeRelocation, since the addend does not refer to the target of the instruction being analyzed. Also make the code honor the comments in the special case about zeroed out ExtractValue but non-zero addend. Fix facebookincubator/BOLT#40 (cherry picked from FBD10355019)	2018-10-11 18:12:09 -07:00
Facebook Github Bot	b166ccbea8	[BOLT][PR] Fix compiler warnings in BinaryContext and RegAnalysis Summary: This pull request fixes two compiler warnings: - missing `break;` in a switch-case statement in RegAnalysis.cpp (-Wimplicit-fallthrough warning) - misleading indentation in BinaryContext.cpp (-Wmisleading-indentation warning) Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/39 GitHub Author: Andreas Ziegler <andreas.ziegler@fau.de> (cherry picked from FBD10202092)	2018-10-04 10:46:16 -07:00
Igor Sugak	c3c80822a3	[BOLT] Capitalize i Summary: as titled (cherry picked from FBD10136655)	2018-10-01 16:22:46 -07:00
Igor Sugak	cc2276d3f1	[BOLT] fix build with gcc-4.8.5 Summary: These are two minor changes to make it copatible with gcc-4.8.5 (cherry picked from FBD9884971)	2018-09-17 12:17:33 -07:00
Maksim Panchenko	ce508b58c6	[BOLT] Support relocations without symbols Summary: lld may generate relocations without associated symbols. Instead of rejecting binaries with such relocations, we can re-create the symbol the relocation is against based on the extracted value. (cherry picked from FBD10054576)	2018-09-21 12:00:20 -07:00
Rafael Auler	bd0b99c45d	[BOLT] Change stub-insertion pass for AArch64 Summary: Previously, we were expanding eligible branches with stubs. After expansion, we were computing which stubs were unnecessary and removing them, assuming ranges were shortening as code is removed. The problem with this approach is that for branches that refer to code that is not managed by BOLT, the distance to that location can increase and we can end up with an out-of-range branch. This rewrites the pass to be simpler, only increasing size and expanding code with stubs as needed after each iteration, stopping when code stops increasing. Besides this rewrite, the stub-insertion pass now supports stubs grouping similar to what the linker does, allowing different functions to share the same veneer that jumps to a common callee. It also fixes a bug in the previous implementation that, in very large functions that use TBZ/TBNZ (+-32KB range), it would mistakenly try to reuse a local stub BB that is out of range. This includes a change to allow hot functions to be put at the end of the .text section, closer to the heap, requiring no veneers to jump to JITted code. And finally it enables eliminate veneers pass by default. (cherry picked from FBD10023158)	2018-09-17 13:36:59 -07:00
Maksim Panchenko	1387a9d761	[BOLT] Keep .text section in file when using old text Summary: If we reuse text section under `-use-old-text` option, then there's no need to rename it. Tools, such as perf, seem to not like binaries without `.text`. Additionally, check if the code fits into `.text` using the page alignment, otherwise we were skipping the alignment relying on the user detecting the warning message. This could have resulted in unexpected performance drops. Also add `-no-huge-pages` option to use regular page size for code alignment purposes (i.e. 4KiB instead of 2MiB). (cherry picked from FBD10024670)	2018-09-24 20:58:31 -07:00
Maksim Panchenko	53b72d0f2e	[BOLT] Ignore symbols from non-allocatable sections Summary: While creating BinaryData objects we used to process all symbol table entries. However, some symbols could belong to non-allocatable sections, and thus we have to ignore them for the purpose of analyzing in-memory data. (cherry picked from FBD9666511)	2018-09-05 14:36:52 -07:00
Maksim Panchenko	8026760ac0	[BOLT] Fix another issue with profile after ICP Summary: For jump tables ICP was using profile from the jump table itself which doesn't work correct if the jump table is re-used at different code locations. (cherry picked from FBD9618774)	2018-08-30 13:21:50 -07:00
spupyrev	41ed5431a0	[BOLT] turning on the compact aligner by default Summary: Making UseCompactAligner true by default (cherry picked from FBD9325158)	2018-08-14 14:49:10 -07:00
Maksim Panchenko	cd19f718b4	[BOLT] Merge jump table profile data Summary: While running ICF pass we have skipped merging profile data for jump tables. We were only updating profile in the CFG. Fix that. (cherry picked from FBD9595523)	2018-08-30 13:21:29 -07:00
Maksim Panchenko	69e6004a42	[perf2bolt] Fix processing of binaries with names over 15 chars long Summary: Do not truncate the binary name for comparison purposes as the binary name we are getting from "perf script" is no longer truncated. (cherry picked from FBD9596409)	2018-08-30 14:51:10 -07:00
Rafael Auler	d0a80b0870	[BOLT] Change ForceRelocation behavior Summary: Only record address as addend if the target of the relocation is the pseudo-symbol Zero. (cherry picked from FBD9551543)	2018-08-28 18:15:13 -07:00
Maksim Panchenko	708a550084	[BOLT] Fix profile after ICP Summary: After optimizing a target of a jump table, ICP was not updating edge counts corresponding to that target. As a result the edge could be left hot and negatively influence the code layout. (cherry picked from FBD9524396)	2018-08-23 22:47:46 -07:00
Maksim Panchenko	2511b09985	[BOLT][DWARF] Fix line info for empty CU DIEs Summary: In some rare cases a compiler may generate DWARF that contains an empty CU DIE that references a debug line fragment. That fragment will contain no file name information, and we fail to register it. Then, as a result, DW_AT_stmt_list is not updated for the CU. This may cause some DWARF-processing tools to segfault. As a solution/workaround, we register "<unknown>" file name for such debug line tables. (cherry picked from FBD9526705)	2018-08-27 20:12:59 -07:00
Rafael Auler	a7e0704be6	[BOLT] Reduce AArch64 target feature flags Summary: Eliminate some flags that are not recognized and are currently printing warnings when BOLT runs on AArch64. (cherry picked from FBD9499971)	2018-08-24 10:42:00 -07:00
Rafael Auler	af1177d99f	[BOLT] Add mattr options to AArch64 target Summary: Make the AArch64 subtarget enable all features, so the disassembler won't choke on extension instructions. (cherry picked from FBD9477066)	2018-08-22 18:47:39 -07:00
Rafael Auler	9c4fcafa37	[BOLT] Add update-build-id option, on by default Summary: The build-id is used by tools to uniquely identify binaries. Update the output binary build-id with a different number to make it distinguishable from the input binary. This implementation just flips the last build-id bit. (cherry picked from FBD9235336)	2018-08-08 17:55:24 -07:00
Rafael Auler	510a8c4bbe	[BOLT] Fix shrink-wrapping CFI update Summary: When updating CFI for a function that was optimized by shrink-wrapping, if the function had no frame pointers, the CFI update algorithm was incorrect. (cherry picked from FBD9328658)	2018-08-14 17:32:06 -07:00
Maksim Panchenko	88bb145164	[BOLT] Update allocatable relocation sections Summary: Position-independent binaries may have runtime relocations of type R_X86_64_RELATIVE that need an update if they were pointing to one of the functions that we have relocated. (cherry picked from FBD9374164)	2018-08-16 16:53:14 -07:00
Maksim Panchenko	87788ca926	[perf2bolt] Support profiling of PIEs and .so's Summary: Processing profile data for binaries with flexible load address (such as position-independent executables and shared objects) requires adjusting binary addresses depending on the base load address. For every PID the mapping will be more or less unique when executing with ASLR enabled, thus we have to keep the mapping record for all PIDs associated with the binary. Then we adjust the addresses based on those mappings. (cherry picked from FBD9368566)	2018-08-14 13:24:44 -07:00
Maksim Panchenko	560c23411a	[perf2bolt] Use mmap events for PID collection Summary: Switch from using `perf script --show-task-events` to `perf script --show-mmap-events` for associating a binary with PIDs in perf.data. The output of the former command does not provide enough information for PIE/.so processing. (cherry picked from FBD9346586)	2018-08-14 13:24:44 -07:00
Rafael Auler	b10d4724c3	[BOLT] Fix pseudo calculation in BinaryBasicBlock Summary: A recent commit broke our tests because it was depending on getNumNonPseudos() at a very late stage of our optimization pipeline. The problem was in a instruction deletion member function in BinaryBasicBlock that was not updating the number of pseudos after deletion. Fix this. (cherry picked from FBD9305972)	2018-08-13 14:36:38 -07:00
Laith Saed Sakka	b2382dc552	retpoline insertion : further updates. Summary: Couple of updates: 1) Handle address pattern with segment register. 2) Assume R11 available for PLT calls always. 3) Add CFI state to each BB. 4) early exit getMacroOpFusionPair if Instruction.size() <2. (cherry picked from FBD9172426)	2018-08-03 16:36:06 -07:00
Maksim Panchenko	c35dc2a386	[BOLT] Detect and handle fixed indirect branches Summary: Sometimes GCC can generate code where one of jump table entries is being used by an indirect branch with a fixed memory reference, such as "jmp *(JT+8)". If we don't convert such branches to direct ones and move jump tables, then the indirect branch will reference the old table value and will end up at the non-updated destination, possibly causing a runtime crash. This fix converts such indirect branches into direct ones. For now we mark functions containing indirect branches with fixed destination as non-simple to prevent unreachable code elimination problem triggered by related dead/unreachable jump table. (cherry picked from FBD9192363)	2018-08-06 11:22:45 -07:00
Laith Saed Sakka	06e1554158	Retpoline Insertion Pass Summary: retpoline insertion implemented for reloc mode, (cherry picked from FBD8832838)	2018-07-25 19:07:41 -07:00
Maksim Panchenko	39f6fcc947	[BOLT] Add support for IFUNC Summary: Relocation value verification was failing for IFUNC as the real value used for relocation wasn't the symbol value, but a corresponding PLT entry. Relax the verification and skip any symbols of ST_Other type. (cherry picked from FBD9123741)	2018-07-30 10:29:47 -07:00
Maksim Panchenko	df94786119	[BOLT] Fix range checks Summary: containsRange() functions were incorrectly checking for an empty range at the end of containing object. I.e. [a,b) was reporting true for containing [b,b). (cherry picked from FBD9074643)	2018-07-30 16:30:18 -07:00
Maksim Panchenko	fe9f8219fa	[BOLT] Fix TBSS-related issue Summary: TLS segment provide a template for initializing thread-local storage for every new thread. It consists of initialized and uninitialized parts. The uninitialized part of TLS, .tbss, is completely meaningless from a binary analysis perspective. It doesn't take any space in the file, or in memory. Note that this is different from a regular .bss section that takes space in memory. We should not place .tbss into a list of allocatable sections, otherwise it may cause conflicts with objects contained in the next section. (cherry picked from FBD9074056)	2018-07-30 16:30:18 -07:00
Maksim Panchenko	771d976543	[BOLT][NFC] Minor code refactoring (cherry picked from FBD8882632)	2018-07-12 10:13:03 -07:00
Maksim Panchenko	49920a8fad	[BOLT] Add R_X86_64_PC64 relocation support (cherry picked from FBD8980691)	2018-07-24 14:30:16 -07:00
spupyrev	631da736b0	[BOLT] further speeding up cache+ Summary: For large binaries, cache+ algorithm adds a noticeable overhead in comparison with cache. This modification restricts search space of the optimization, which makes cache+ as fast as cache for all tested binaries. There is a tiny (in the order of 0.01%) regression in cache-related metrics, but this is not noticeable in practice. (cherry picked from FBD8369968)	2018-05-17 18:27:13 -07:00
Rafael Auler	ddfcf4f266	[BOLT] Add parser for pre-aggregated perf data Summary: The regular perf2bolt aggregation job is to read perf output directly. However, if the data is coming from a database instead of perf, one could write a query to produce a pre-aggregated file. This function deals with this case. The pre-aggregated file contains aggregated LBR data, but without binary knowledge. BOLT will parse it and, using information from the disassembled binary, augment it with fall-through edge frequency information. After this step is finished, this data can be either written to disk to be consumed by BOLT later, or can be used by BOLT immediately if kept in memory. File format syntax: {B\|F\|f} [<start_id>:]<start_offset> [<end_id>:]<end_offset> <count> [<mispred_count>] B - indicates an aggregated branch F - an aggregated fall-through (trace) f - an aggregated fall-through with external origin - used to disambiguate between a return hitting a basic block head and a regular internal jump to the block <start_id> - build id of the object containing the start address. We can skip it for the main binary and use "X" for an unknown object. This will save some space and facilitate human parsing. <start_offset> - hex offset from the object base load address (0 for the main executable unless it's PIE) to the start address. <end_id>, <end_offset> - same for the end address. <count> - total aggregated count of the branch or a fall-through. <mispred_count> - the number of times the branch was mispredicted. Omitted for fall-throughs. Example F 41be50 41be50 3 F 41be90 41be90 4 f 41be90 41be90 7 B 4b1942 39b57f0 3 0 B 4b196f 4b19e0 2 0 (cherry picked from FBD8887182)	2018-07-17 18:31:46 -07:00
Laith Saed Sakka	27f3032447	Add initial function injection support Summary: This diff have the API needed to inject functions using bolt. In relocation mode injected functions are emitted between the cold and the hot functions, In non-reloc mode injected functions are emitted a next text section. (cherry picked from FBD8715965)	2018-07-08 12:14:08 -07:00
Maksim Panchenko	6e45f5aeec	[perf2bolt] Enforce file matching in perf2bolt Summary: If the input binary does not have a build-id and the name does not match any file names in perf.data, then reject the binary, and issue an error message suggesting to rename it to one of the listed names from perf.data. (cherry picked from FBD8846181)	2018-07-13 15:26:41 -07:00
Maksim Panchenko	f2f164f474	[perf2bolt] Fix perf build-id matching Summary: Recent compiler tool chains can produce build-ids that are less than 40 characters long. Linux perf, however, always outputs 40 characters, expanding the string with 0's as needed. Fix the matching by only checking the string prefix. (cherry picked from FBD8839452)	2018-07-13 10:49:41 -07:00
Rafael Auler	7aee0adbf9	[BOLT-AArch64] Create cold symbols on demand Summary: Rework the logic we use for managing references to constant islands. Defer the creation of the cold versions to when we split the function and will need them. (cherry picked from FBD8228803)	2018-05-31 10:33:53 -07:00
Maksim Panchenko	44a36937f8	[BOLT] Fix llvm-dwarfdump issues Summary: llvm-dwarfdump is relying on getRelocatedSection() to return section_end() for ELF files of types other than relocatable objects. We've changed the function to return relocatable section for other types of ELF files. As a result, llvm-dwarfdump started re-processing relocations for sections that already had relocations applied, e.g. in executable files, and this resulted in wrong values reported. As a workaround/solution, we make this function return relocated section for executable (and any non-relocatable objects) files only if the section is allocatable. (cherry picked from FBD8760175)	2018-07-06 21:30:23 -07:00
Maksim Panchenko	66e0313d15	[perf2bolt] Accept `-` as a valid misprediction symbol Summary: As reported in GH-28 `perf` can produce `-` symbol for misprediction bit if the bit is not supported by the kernel/HW. In this case we can ignore the bit. (cherry picked from FBD8786827)	2018-07-10 10:25:55 -07:00
Rafael Auler	12380b8b06	Fix assembly after adding entry points Summary: When a given function B, located after function A, references one of A's basic blocks, it registers a new global symbol at the reference address and update A's Labels vector via BinaryFunction::addEntryPoint(). However, we don't update A's branch targets at this point. So we end up with an inconsistent CFG, where the basic block names are global symbols, but the internal branch operands are still referencing the old local name of the corresponding blocks that got promoted to an entry point. This patch fix this by detecting this situation in addEntryPoint and iterating over all instructions, looking for references to the old symbol and replacing them to use the new global symbol (since this is now an entry point). Fixes facebookincubator/BOLT#26 (cherry picked from FBD8728407)	2018-07-03 11:57:46 -07:00
Rafael Auler	544d1577c1	Avoid removing BBs referenced by JTs Summary: While removing unreachable blocks, we may decide to remove a block that is listed as a target in a jump table entry. If we do that, this label will be then undefined and LLVM assembler will crash. Mitigate this for now by not removing such blocks, as we don't support removing unnecessary jump tables yet. Fixes facebookincubator/BOLT#20 (cherry picked from FBD8730269)	2018-07-03 17:02:33 -07:00
Laith Saed Sakka	b6c4d8e924	-- Adding Veneer elimination pass and Veneer count to dyno stats. Summary: Create a pass that performs veneers elimination . (cherry picked from FBD8359299)	2018-06-07 11:10:37 -07:00
Maksim Panchenko	207ac19c63	Revert "[LongJumpPass] X86 enablement. First attempt." This reverts commit 010b0f7603fc9fa209c6dc95ce4b9c08e7b70d75. (cherry picked from FBD28111178)	2018-07-06 14:54:53 -07:00
Puyan Lotfi	64c429da89	[LongJumpPass] X86 enablement. First attempt. (cherry picked from FBD8753328)	2018-07-06 12:31:36 -07:00
Maksim Panchenko	b447979b8c	[BOLT] Fix diagnostics printing in data aggregator Summary: Print correct part of the string while reporting an error. (cherry picked from FBD8745329)	2018-07-05 20:47:38 -07:00
Maksim Panchenko	d7b2474f83	[DebugInfo] Change default value of FDEPointerEncoding Summary: If the encoding is not specified in CIE augmentation string, then it should be DW_EH_PE_absptr instead of DW_EH_PE_omit. (cherry picked from FBD8740274)	2018-07-05 14:21:49 -07:00
Maksim Panchenko	365613b404	[BOLT] Fix no-assertions build Summary: In release build without assertions MCInst::dump() is undefined and causes link time failure. Fixes facebookincubator/BOLT#27. (cherry picked from FBD8732905)	2018-07-04 10:33:26 -07:00
Maksim Panchenko	a6a37995d9	[BOLT] Reject processing of PIE binaries Summary: Check if the input binary ELF type. Reject any binary not of ET_EXEC type, including position-independent executables (PIEs). Also print the first function containing PIC jump table. (cherry picked from FBD8707274)	2018-06-29 21:12:55 -07:00
Maksim Panchenko	edc0cb1121	[LLVM] Accept `S` in augmentation strings in CIE Summary: Ignore 'S' in augmentation string on input. It just marks a signal frame. All we have to do is propagate it. Fixes facebookincubator/BOLT#21 This was already in LLVM trunk rL331738. Update llvm.patch. (cherry picked from FBD8707222)	2018-06-29 20:30:36 -07:00
Maksim Panchenko	6802948028	[BOLT] Allow jump tables with 2 entries Summary: GCC 8 can generate jump tables with just 2 entries. Modify our heuristic to accept it. We still assert that there's more than one entry. (cherry picked from FBD8709416)	2018-06-30 13:30:47 -07:00
Rafael Auler	8835f90d1e	[X86] Support a subset of internal calls Summary: Add support for functions with internal calls, necessary for handling Intel MKL library and some code observed in google core dumper library. This is not optimizing these functions, but only identifying them, running analyses to assure we will not break those functions if we move them, and then "freezing" these functions (marking as not simple so Bolt will not try to reorder it or touch it in any way). (cherry picked from FBD8364381)	2018-06-11 13:18:44 -07:00
Facebook Github Bot	07353e9590	[BOLT][PR] In some cases DB could be nullptr Summary: When processing binary with -debug mode in some cases, BD could be nullptr. It will be better to fail later on assert than here with segfault. Closes https://github.com/facebookincubator/BOLT/pull/18 GitHub Author: Alexander Gryanko <xpahos@gmail.com> (cherry picked from FBD8650719)	2018-06-26 17:02:00 -07:00
Rafael Auler	72ecd12f2f	Disable -split-eh in non-relocation mode Summary: This option only works in relocation mode. In non-relocation mode, it generates invalid references that cause MCStreamer to fail. Disable this flag if the user requested and print a warning. (cherry picked from FBD8625990)	2018-06-25 14:55:48 -07:00
Rafael Auler	5b2eab6538	[BOLT] Fix call to evaluateX86MemOperands Summary: There was a call site not providing a displament immediate value. This assertion is firing in opensource. (cherry picked from FBD8576033)	2018-06-21 11:03:57 -07:00
Rafael Auler	8f717dd25e	[BOLT] Add initial bolt-only test infra Summary: Create folders and setup to make LIT run BOLT-only tests. Add a test example. This will add a new make/ninja rule "check-bolt" that the user can invoke to run LIT on this folder. (cherry picked from FBD8595786)	2018-06-22 13:50:07 -07:00
Maksim Panchenko	1baa2529ea	[merge-fdata] Support legacy/non-YAML profile format Summary: Concatenate profile contents if they are not in YAML format. (cherry picked from FBD8579955)	2018-06-21 14:45:38 -07:00
Maksim Panchenko	3ab2929b36	[BOLT] Fix support for PIC jump tables Summary: BOLT heuristics failed to work if false PIC jump table entries were accepted when they were pointing inside a function, but not at an instruction boundary. This fix checks if the destination falls at instruction boundary, and if it does not, it truncates the jump table. This, of course, still does not guarantee that the entry corresponds to a real destination, and we can have "false positive" entry(ies). However, it shouldn't affect correctness of the function, but the CFG may have edges that are never taken. We may update an incorrect jump table entry, corresponding to an unrelated data, and for that reason we force moving of jump tables if a PIC jump table was detected. (cherry picked from FBD8559588)	2018-06-20 21:43:22 -07:00
Rafael Auler	35c09dc4dd	[BOLT] Add a user friendly error reporting message Summary: In case we fail to disassemble or to build the CFG for a function, print instructions on bug reporting. (cherry picked from FBD8549737)	2018-06-20 12:03:24 -07:00
Maksim Panchenko	221107c5fb	[BOLT] Update llvm.patch Summary: (cherry picked from FBD8475998)	2018-06-17 22:29:27 -07:00
Maksim Panchenko	a7d025139f	Revert "[Bolt][NFC] Change capitalization s/BOLT/Bolt/g" Summary: (cherry picked from FBD8431879)	2018-06-14 14:27:20 -07:00
Maksim Panchenko	789162276d	[Bolt][NFC] Change capitalization s/BOLT/Bolt/g (cherry picked from FBD8373789)	2018-06-11 19:46:40 -07:00
Maksim Panchenko	232046f9b2	[Bolt] Reduce verbosity while reporting hash collisions Summary: Don't report all data objects with hash collisions by default. Only report the summary, and use -v=1 for providing the full list. (cherry picked from FBD8372241)	2018-06-11 17:17:25 -07:00
Bill Nell	706abb6c95	[BOLT] Hash anonymous symbol names Summary: This diff replaces the addresses in all the {SYMBOLat,HOLEat,DATAat} symbols with hash values based on the data contained in the symbol. It should make the profiling data for anonymous symbols robust to address changes. The only small problem with this approach is that the hashed name for padding symbols of the same size collide frequently. This shouldn't be a big deal since it would be weird if those symbols were hot. On a test run with hhvm there were 26 collisions (out of ~338k symbols). Most of the collisions were from small (2,4,8 byte) objects. (cherry picked from FBD7134261)	2018-06-06 03:17:32 -07:00
spupyrev	779541283a	[BOLT] merging cold basic blocks to reduce #jumps Summary: This diff introduces a modification of cache+ block ordering algorithm, which reordered and merges cold blocks in a function with the goal of reducing the number of (non-fallthrough) jumps, and thus, the code size. (cherry picked from FBD8044978)	2018-05-17 11:14:15 -07:00
Maksim Panchenko	b4dbd35d6c	[BOLT] Initial support for memcpy() inlininig Summary: Add "-inline-memcpy" option to inline calls to memcpy() using "rep movsb" instruction. The pass is X86-specific. Calls to _memcpy8 are optimized too using a special return value (dest+size). The implementation is very primitive in that it does not track liveness of %rax after return, and no %rcx substitution. This is going to get improved if we find the optimization to be useful. (cherry picked from FBD8211890)	2018-05-26 12:40:51 -07:00
Rafael Auler	42e6512241	[BOLT-AArch64] Detect linker stubs and address them Summary: In AArch64, when the binary gets large, the linker inserts stubs with 3 instructions: ADRP to load the PC-relative address of a page; ADD to add the offset of the page; and a branch instruction to do an indirect jump to the contents of X16 (the linker-reserved reg). The problem is that the linker does not issue a relocation for this (since this is not code coming from the assembler), so BOLT has no idea what is the real target, unless it recognizes these instructions and extract the target by combining the operands of the instructions from the stub. This diff does exactly that. (cherry picked from FBD7882653)	2018-04-30 14:47:32 -07:00
Maksim Panchenko	929b0908f7	[BOLT][NFC] Move ICF pass into a separate file Summary: Consolidate code used by identical code folding under Passes/IdenticalCodeFolding.cpp. (cherry picked from FBD8109916)	2018-05-22 15:52:21 -07:00
Maksim Panchenko	6302e18f94	[PERF2BOLT] Improve file matching Summary: If the input binary for perf2bolt has a build-id and perf data has recorded build-ids, then try to match them. Adjust the file name if build-ids match to cover cases where the binary was renamed after data collection. If there's no matching build-id report an error and exit. While scanning task events, truncate the name to 15 characters prior to matching, since that's how names are reported by perf. (cherry picked from FBD8034436)	2018-05-16 13:31:13 -07:00
Maksim Panchenko	13968f7fa9	[BOLT] Add option to print functions with bad layout Summary: Option `-report-bad-layout=N` prints top N functions with layouts that have cold blocks placed in the middle of hot blocks. The sorting is based on execution_count / number_of_basic_blocks formula. (cherry picked from FBD8051950)	2018-05-17 16:58:29 -07:00
Maksim Panchenko	3af3537383	[BOLT] Properly handle non-standard function refs Summary: Application code can reference functions in a non-standard way, e.g. using arithmetic and bitmask operations on them. One example is if a program checks if a function is below a certain address or within a certain address range to perform a low-level optimization or generate a proper code (JIT). Instead of relying on a relocation value (symbol+addend), we use only the symbol value, and then check if the value is inside the function. If it is, we treat it as a code reference against location within the function, otherwise we handle it as a non-standard function reference and issue a warning. (cherry picked from FBD7996274)	2018-05-14 11:10:26 -07:00
Maksim Panchenko	1750fee2ac	[BOLT] Add option to ignore function hash in profile Summary: When we make changes to MCInst opcodes (or get changes from upstream), a hash value for BinaryFunction changes. As a result, we are unable to match profile generated by a previous version of BOLT. Add option `-profile-ignore-hash` to match profile while ignoring function hash value. With this option we match functions with common names using the number of basic blocks. (cherry picked from FBD7983269)	2018-05-11 18:30:47 -07:00
Maksim Panchenko	56b38a14c5	[BOLT] Fix dyno-stats for PLT calls Summary: To accurately account for PLT optimization, each PLT call should be counted as an extra indirect call instruction, which in turn is a load, a call, an indirect call, and instruction entry in dyno stats. (cherry picked from FBD7978980)	2018-05-11 15:30:56 -07:00
spupyrev	e4f39bda51	adjusting cache stats for non-simple functions Summary: While working with a binary in non-relocations mode, I realized some cache metrics are not computed correctly. Hence, this fix. In addition, logging the number of functions with modified ordering of basic blocks, which is helpful for analysis. (cherry picked from FBD7975392)	2018-05-11 12:03:19 -07:00
Bill Nell	729da2da22	[BOLT] Static data reordering pass. Summary: Enable BOLT to reorder data sections in a binary based on memory profiling data. This diff adds a new pass to BOLT that can reorder data sections for better locality based on memory profiling data. For now, the algorithm to order data is primitive and just relies on the frequency of loads to order the contents of a section. We could probably do a lot better by looking at what functions use the hot data and grouping together hot data that is used by a single function (or cluster of functions). Block ordering might give some hints on how to order the data better as well. The new pass has two basic modes: inplace and split (when inplace is false). The default is split since inplace hasn't really been tested much. When splitting is on, the cold data is copied to a "cold" version of the section while the hot data is kept in the original section, e.g. for .rodata, .rodata will contain the hot data and .bolt.org.rodata will contain the cold bits. In inplace mode, the section contents are reordered inplace. In either mode, all relocations to data within that section are updated to reflect new data locations. Things to improve: - The current algorithm is really dumb and doesn't seem to lead to any wins. It certainly could use some improvement. - Private symbols can have data that leaks over to an adjacent symbol, e.g. a string that has a common suffix can start in one symbol and leak over (with the common suffix) into the next. For now, we punt on adjacent private symbols. - Handle ambiguous relocations better. Section relocations that point to the boundary of two symbols will prevent the adjacent symbols from being moved because we can't tell which symbol the relocation is for. - Handle jump tables. Right now jump table support must be basic if data reordering is enabled. - Being able to handle TLS. A good amount of data access in some binaries are happening in TLS. It would be worthwhile to be able to reorder any TLS sections too. - Handle sections with writeable data. This hasn't been tested so probably won't work. We could try to prevent false sharing in writeable sections as well. - A pie in the sky goal would be to use DWARF info to reorder types. (cherry picked from FBD6792876)	2018-04-20 20:03:31 -07:00
Maksim Panchenko	bdf21f7617	[BOLT] Align basic blocks based on execution count Summary: The default is not changing, i.e. we are not aligning code within a function by default. New meaning of options for aligning basic blocks: -align-blocks triggers basic block alignment based on profile -preserve-blocks-alignment tries to preserve basic block alignment seen on input Tuning options for "-align-blocks": -align-blocks-min-size=<uint> blocks smaller than the specified size wouldn't be aligned -align-blocks-threshold=<uint> align only blocks with frequency larger than containing function execution frequency specified in percent. E.g. 1000 means aligning blocks that are 10 times more frequently executed than the containing function. (cherry picked from FBD7921980)	2017-11-07 15:42:28 -08:00
Maksim Panchenko	9c6f965616	[BOLT] Getting open-source ready Summary: BOLT sources are being moved under tools/llvm-bolt/src and tools/llvm-bolt will contain more files such as LICENSE.txt, README.txt, etc. Remove trailing white spaces from our sources. Create llvm.patch by running > git diff f137ed238db11440f03083b1c88b7ffc0f4af65e include lib > \ tools/llvm-bolt/llvm.patch README.txt has instructions on checking out sources and applying the patch. (cherry picked from FBD7878380)	2018-05-04 10:10:41 -07:00
Maksim Panchenko	caad4bcf3a	[BOLT] Fix crash while writing new profile Summary: New profile writer was crashing as functions were lacking a profile flags. Fix it by requiring flags when marking function as profiled. Generate new profile for clang. The new profile has more coverage and results in better overall improvement from BOLT. It was generated by merging multiple runs of: % perf record -e cycles:u -j any,u -F32000 -- \ ./clang bf.cpp -O2 -std=c++11 -c -o /tmp/bf.o (cherry picked from FBD7798580)	2018-04-27 14:16:42 -07:00
Rafael Auler	d6003e94eb	[BOLT-AArch64] Fix -icf, -use-old-text and -update-debug-sections Summary: Refactor MCInst comparison code to support target-dependent functionality. This was necessary because AArch64 uses MCTargetExprs that only the AArch64 backend knows how to unpack it and compare. Also fix a bug where a relocation against a constant island would make BOLT create a fixed reference against a code location in a similar way to read-only data, so when we asked to -use-old-text, the code would break for this particular HHVM function (_ZN5folly2io4zlib18defaultZlibOptionsEv) because the reference now contains invalid data, since the original .text was overwritten. Finally, fix a bug with -update-debug-sections on AArch64 where the update loop wasn't expecting a function with zero basic blocks, which can happen on AArch64 because some functions contain just a constant island. (cherry picked from FBD7679244)	2018-04-12 10:07:11 -07:00
spupyrev	aa91281ac3	[BOLT] improving cache metrics Summary: Modifying parameters of block reordering algorithm that result in better performance. Additionally extending some cache-related metrics (cherry picked from FBD7578336)	2018-03-28 09:10:25 -07:00
Rafael Auler	db949fc1f5	[PERF2BOLT] Add support for non-LBR aggregation Summary: Previously, we depended on the python script perf2bolt.py whenever operating with non-LBR data. (cherry picked from FBD7620125)	2018-04-13 11:18:46 -07:00
Rafael Auler	a30fff6e36	[BOLT-AArch64] Fix BOLT build on AArch64 Summary: Whenever building BOLT in an AArch64 box, we need to make sure we do not run tests that are excluse for x86. This diff also adds a tag for expensive tests, so the user can disable them, which is useful when using a memory-constrained machine to run BOLT tests. It also removes ifdefs that caused BOLT to behave diferently when running in a non-x86 host. Finally, it changes a case where we depended on updated libstdc++ implementation for insert to make the codebase more friendly with boxes that do not have the newer version of the lib. (cherry picked from FBD7625715)	2018-04-13 15:34:09 -07:00
Maksim Panchenko	120d26727a	[BOLT] Restore macro-fusion optimization Summary: Restore the optimization with some modifications: * Only enabled in relocation mode. * Covers instructions other than TEST/CMP. * Prints missed macro-fusion opportunities for input. * By default enabled for all hot code. * Without profile enabled for all code. The new command-line option: -align-macro-fusion - fix instruction alignment for macro-fusion (x86 relocation mode) =none - do not insert alignment no-ops for macro-fusion =hot - only insert alignment no-ops on hot execution paths (default) =all - always align instructions to allow macro-fusion (cherry picked from FBD7644042)	2018-04-13 15:46:19 -07:00
Maksim Panchenko	c13cd9084d	[BOLT] Fix tests Summary: During a rebase function hashes changed and new profile stopped matching functions. (cherry picked from FBD7618919)	2018-04-13 10:09:55 -07:00
Maksim Panchenko	dc12911fea	[BOLT] Report when operating in relocation mode Summary: Since BOLT can use relocations in the binary automatically, it's not always clear if we are operating in relocation mode or not. This diff adds "BOLT-INFO" message indicating if the relocation mode in ON. (cherry picked from FBD7557492)	2018-04-09 13:47:43 -07:00
Maksim Panchenko	8b049d3c7f	[BOLT] Support for non-LBR profile in YAML Summary: Expanded YAML profile format to support different kinds of profile including LBR and non-LBR (and memevents in the future). The profile now starts with a header that includes the profile description. "profile-flags" field includes either "lbr" or "sample", but not both at the same time. It could also include "memevent" in addition to other flags. For now, the only way to generate non-LBR YAML profile is through conversion. Once task is done, it should be possible to use perf2bolt for it. (cherry picked from FBD7595693)	2018-04-09 19:10:19 -07:00
Maksim Panchenko	4878770072	[BOLT][Cleanup] Remove branch history Summary: We are not using branch histories and don't have plans to. Clean up the code. (cherry picked from FBD7588644)	2018-04-11 11:23:14 -07:00
Maksim Panchenko	190693059a	[merge-fdata] Rewrite merge-fdata to use YAML format Summary: merge-fdata now operates on .fdata files in YAML format. The previous format is not supported, which means that non-LBR data could not be merged and memory data has to be merged with "cat" command. (cherry picked from FBD7544031)	2018-04-05 13:03:05 -07:00
Rafael Auler	7df6a6d5c6	[BOLT-AArch64] Fix AArch64 port - make it work with hhvm Summary: This diff has 3 fixes. First fixes the way relocations are read and interpreted for AArch64, so the references are preserved correctly. Second, it fixes constant islands to be able to live in the very first address of a function (which means there is no code, but this function contains just a constant island). Third, it fixes function splitting to do not outline entry points for AArch64. This was done because some functions may load pointers to its internal basic blocks, issueing a short-range ADR instruction to do so without its pair ADRP (since the size of the function is supposed to be small). But when we move this block to a cold region, that is not the case anymore. Since blocks with a reference are marked as entry points, we conservatively disable outlining for them in AArch64. (cherry picked from FBD7505067)	2018-03-20 14:34:58 -07:00
Maksim Panchenko	489e514530	[BOLT] Improve annotations format and processing Summary: Change the way annotations are stored and processed. Embed annotation type/index into immediate value stored as an operand. This limits the effective range of values that could be stored as annotations to 56 bits, which is still plenty for most integer types that we use and for pointers on real systems. High 8 bits are reserved for storing annotation type/index. Expand the interface for general annotations to include reference to annotations by index. The main purpose of this interface is to improve performance of annotations that are used by heavy (>O(N)) algorithms, such as data flow analysis. For -frame-opt pass, new memory usage and processing times are slightly better compared to those before refactoring. (cherry picked from FBD7492017)	2018-03-29 18:42:06 -07:00
Maksim Panchenko	d8cf08b243	[BOLT] Use MCPlus::getNumPrimeOperands() Summary: Use MCPlus::getNumPrimeOperands() to get the real number of operands on MCInst. Alternatively, use MCInstrDesc::getNumOperands(). (cherry picked from FBD7507666)	2018-04-04 15:00:00 -07:00
Maksim Panchenko	7956da0fe8	[BOLT] Fix CFG in BinaryFunction::eraseInvalidBBs() Summary: When we erase invalid/unreachable basic blocks, we have to remove them from a list of predecessors of regular blocks, otherwise the CFG will be left in a broken state containing references to removed basic blocks. (cherry picked from FBD7464292)	2018-03-30 17:44:14 -07:00
Maksim Panchenko	0d729f218b	[BOLT] Fix relocation verification Summary: We verify that relocation information matches a value stored in a binary, i.e. "ExtractedValue == SymbolValue + Addend". However, because of the size of the relocation, and the fact that an addend is always of type int64_t, we have to sign-extend the extracted value, and then we might get mismatch in higher bits in certain scenarios. Hence, we should only compare values that are truncated to a relocation size. Discovered while processing hhvm binary with modified compiler flags. (cherry picked from FBD7462559)	2018-03-30 15:49:34 -07:00
Maksim Panchenko	77f35bd0e9	[BOLT] Fix iterator issue Summary: Getting a forward iterator from reverse iterator was implemented incorrectly. For some reason erase worked on it, but it's clearly wrong and printing the instruction (before the deletion) results in an error. (cherry picked from FBD7457457)	2018-03-30 10:54:42 -07:00
Maksim Panchenko	a62f4fda46	[BOLT][Refactoring] Isolate changes to MC layer Summary: Changes that we made to MCInst, MCOperand, MCExpr, etc. are now all moved into tools/llvm-bolt. That required a change to the way we handle annotations and any extra operands for MCInst. Any MCPlus information is now attached via an extra operand of type MCInst with an opcode ANNOTATION_LABEL. Since this operand is MCInst, we attach extra info as operands to this instruction. For first-level annotations use functions to access the information, such as getConditionalTailCall() or getEHInfo(), etc. For the rest, optional or second-class annotations, use a general named-annotation interface such as getAnnotationAs<uint64_t>(Inst, "Count"). I did a test on HHVM binary, and a memory consumption went down a little bit while the runtime remained the same. (cherry picked from FBD7405412)	2018-03-19 18:32:12 -07:00
spupyrev	0dea33737a	[BOLT] improvements for CFG construction Summary: Some improvements for CFG construction: - getting rid of fallthrough-inferrence, as this is already done DataAggregator; - adjusting block counts for blocks with non-zero outgoing edges to make sure they're not outlined; - making sure that all functions (including non-simple ones) are reordered and placed in the hot section. The main goal of the diff is to make sure that constructed CFG graphs exactly correspond to the input profile data. (cherry picked from FBD7323205)	2018-03-22 09:48:59 -07:00
spupyrev	3458e92285	removing compact-mode Summary: this is not needed but makes code harder to read; hence, removing (cherry picked from FBD7257937)	2018-03-14 09:05:26 -07:00
Bill Nell	faacdf6080	[BOLT] Fix assertion when building test binary Summary: The binary had some unexpected ovelapping symbols: .str.34.llvm.2944770977690351622/1 address = 0x48e9ec7, next address = 0x48e9ed2, size = 21 PG.LC135/1 address = 0x48e9ed2, next address = 0x48e9eef, size = 29 BOLT wasn't expecting this type of overlap when generating HOLE symbols, so it was asserting. I've changed the code to deal with this case. I'll need to change the reordering pass to mark these types of symbols as unmoveable for now. (cherry picked from FBD7304195)	2018-03-16 09:03:12 -07:00
Bill Nell	598a346abf	[BOLT] Fix assertion when setting size of jump table symbol Summary: This assertion was making sure that when we patched up symbol sizes that we wouldn't modify the size of a symbol that has already had its size set. The issue here is that private symbols are sometimes composed of multiple objects internally (e.g. jump tables). In this particular case a jump table started at the same address as the private data blob it was contained in. Currently, there isn't any good way of differentiating symbols that start at the same address (except possibly using multimaps for certain data structures). I'm hacking around it by modifying the assertion to ignore jump tables and skip setting the size when it has already been set. This shouldn't affect any existing optimizations since the only thing that depended on sizes is data reordering and that currently ignores jump tables and private data blobs. (cherry picked from FBD7269207)	2018-03-13 18:59:22 -07:00
Maksim Panchenko	48ae32a33b	[BOLT] Introduce MCPlus layer Summary: Refactor architecture-specific code out of llvm into llvm-bolt. Introduce MCPlusBuilder, a class that is taking over MCInstrAnalysis responsibilities, i.e. creating, analyzing, and modifying instructions. To access the builder use BC->MIB, i.e. substitute MIA with MIB. MIB is an acronym for MCInstBuilder, that's what MCPlusBuilder used to be. The name stuck, and I find it better than MPB. Instructions are still MCInst, and a bunch of BOLT-specific code still lives in LLVM, but the staff under Target/* is significantly reduced. (cherry picked from FBD7300101)	2018-03-09 09:45:13 -08:00
Maksim Panchenko	8c16594f2e	[BOLT] Fix ORC to properly update symbols Summary: In new ORC, the sequence of how sections are allocated and loaded is changed. Now everything is delayed until emitAndFinalize() is called, and all actions are supposed to happen via notification functors. There are two functors that we pass to new ObjectLinkingLayer object. One is used to notify when objects are loaded, and the other - once they are finalized. We use the first one to remap sections to proper addresses, and that's the earliest place where we can do it. However, ORC decides to update symbols right before that, and as a result they are updated with non-mapped values. There are two possible fixes for that. This diff postpones the update to the symbol table until the notifier is called. I don't know what other tools depend on the existing sequence, and the proper fix may involve creating a third notifier to be called before the symbol table update. (cherry picked from FBD7280973)	2018-03-14 15:07:16 -07:00
Rafael Auler	2fe37b4435	[BOLT] Fix remove-unused-stores in rebased bolt Summary: Rebased version revealed a mistake when computing the dataflow for the "remove-unused-stores" optimization. This is disabled in prod but it doesn't hurt to fix it, so the tests for the rebased bolt go green again. (cherry picked from FBD7253418)	2018-03-12 20:24:01 -07:00
Rafael Auler	6644548c74	[BOLTDIFF] Add a tool to audit performance differences Summary: This is a simple bolt-based tool that instantiates two RewriteInstances objects and compares them. Add a method to RewriteInstance to enable us to compare two objects. Include a mechanism to match functions from binary 1 to binary 2 and finally print the largest differences in profiling data from one binary to another. (cherry picked from FBD6517076)	2017-12-07 15:00:41 -08:00
Maksim Panchenko	d660f8b1fe	[BOLT] Disassemble all functions before building CFGs Summary: This makes it possible to do adjustments to all functions based on information gained during disassembly. E.g. if we detect an entry point after the CFG for a function is constructed, we have to take a conservative approach, and mark such function as non-simple. Now we have this information before building the CFG. This could also be used to do other processing/post-processing on disassembled functions that might affect CFG construction of other functions (e.g. early detection of functions that never return). The drawback of this approach is that we lose cache locality and some processing performance as a result. I've measured 5 second difference on HHVM binary. (cherry picked from FBD7258466)	2018-02-14 12:06:17 -08:00
Bill Nell	0e4d86bf19	[BOLT] Refactor global symbol handling code. Summary: This is preparation work for static data reordering. I've created a new class called BinaryData which represents a symbol contained in a section. It records almost all the information relevant for dealing with data, e.g. names, address, size, alignment, profiling data, etc. BinaryContext still stores and manages BinaryData objects similar to how it managed symbols and global addresses before. The interfaces are not changed too drastically from before either. There is a bit of overlap between BinaryData and BinaryFunction. I would have liked to do some more refactoring to make a BinaryFunctionFragment that subclassed from BinaryData and then have BinaryFunction be composed or associated with BinaryFunctionFragments. I've also attempted to use (symbol + offset) for when addresses are pointing into the middle of symbols with known sizes. This changes the simplify rodata loads optimization slightly since the expression on an instruction can now also be a (symbol + offset) rather than just a symbol. One of the overall goals for this refactoring is to make sure every relocation is associated with a BinaryData object. This requires adding "hole" BinaryData's wherever there are gaps in a section's address space. Most of the holes seem to be data that has no associated symbol info. In this case we can't do any better than lumping all the adjacent hole symbols into one big symbol (there may be more than one actual data object that contributes to a hole). At least the combined holes should be moveable. Jump tables have similar issues. They appear to mostly be sub-objects for top level local symbols. The main problem is that we can't recognize jump tables at the time we scan the symbol table, we have to wait til disassembly. When a jump table is discovered we add it as a sub-object to the existing local symbol. If there are one or more existing BinaryData's that appear in the address range of a newly created jump table, those are added as sub-objects as well. (cherry picked from FBD6362544)	2017-11-14 20:05:11 -08:00
Rafael Auler	32b332ad2d	[BOLT] Fix ShrinkWrapping bugs and enable testing Summary: Fix a few ShrinkWrapping bugs: - Using push-pop mode in a function that required aligned stack - Correctly update the edges in jump tables after splitting critical edges - Fix stack pointer restores based on RBP + offset, when we change the stack layout in push-pop mode. (cherry picked from FBD6755232)	2017-12-14 17:26:19 -08:00
Rafael Auler	6d0401ccfb	[BOLT/LSDA] Fix alignment Summary: Fix a bug introduced by rebasing with respect to aligned ULEBs. This wasn't breaking anything but it is good to keep LDSA aligned. (cherry picked from FBD7094742)	2018-02-26 20:09:14 -08:00
Bill Nell	ddefc770b0	[BOLT] Refactoring of section handling code Summary: This is a big refactoring of the section handling code. I've removed the SectionInfoMap and NoteSectionInfo and stored all the associated info about sections in BinaryContext and BinarySection classes. BinarySections should now hold all the info we care about for each section. They can be initialized from SectionRefs but don't necessarily require one to be created. There are only one or two spots that needed access to the original SectionRef to work properly. The trickiest part was making sure RewriteInstance.cpp iterated over the proper sets of sections for each of it's different types of processing. The different sets are broken down roughly as allocatable and non-alloctable and "registered" (I couldn't think up a better name). "Registered" means that the section has been updated to include output information, i.e. contents, file offset/address, new size, etc. It may help to have special iterators on BinaryContext to iterate over the different classes to make things easier. I can do that if you guys think it is worthwhile. I found pointee_iterator in the llvm ADT code. Use that for iterating over BBs in BinaryFunction rather than the custom iterator class. (cherry picked from FBD6879086)	2018-02-01 16:33:43 -08:00
Maksim Panchenko	6744f0dbeb	[BOLT] Fix jump table placement for non-simple functions Summary: When we move a jump table to either hot or cold new section (-jump-tables=move), we rely on a number of taken branches from the table to decide if it's hot or cold. However, if the function is non-simple, we always get 0 count, and always move the table to the cold section. Instead, we should make a conservative decision based on the execution count of the function. (cherry picked from FBD7058127)	2018-02-22 11:20:46 -08:00
Andy Newell	e15623058e	Cache+ speed, reduce mallocs Summary: Speed of cache+ by skipping mallocs on vectors. Although this change speeds up the algorithm by 2x, this is still not enough for some binaries where some functions have ~2500 hot basic blocks. Hence, introduce a threshold for expensive optimizations in CachePlusReorderAlgorithm. If the number of hot basic blocks exceeds the threshold (2048 by default), we use a cheaper version, which is quite fast. (cherry picked from FBD6928075)	2018-02-09 09:58:19 -08:00
Maksim Panchenko	5599c01911	[BOLT] Fixes for new profile Summary: Do a better job of recording fall-through branches in new profile mode (-prof-compat-mode=0). For this we need to record offsets for all instructions that are last in the containing basic block. Change the way we convert conditional tail calls. Now we never reverse the condition. This is required for better profile matching. The original approach of preserving the direction was controversial to start with. Add "-infer-fall-throughs" option (on by default) to allow disabling inference of fall-through edge counts. (cherry picked from FBD6994293)	2018-02-13 11:21:59 -08:00
Maksim Panchenko	a24c5543ea	[BOLT] Improved function profile matching Summary: Prioritize functions with 100% name match when doing LTO "fuzzy" name matching. Avoid re-assigning profile to a function. (cherry picked from FBD6992179)	2018-02-14 12:30:27 -08:00
Maksim Panchenko	1298d99a41	[BOLT] Limited "support" for AVX-512 Summary: In relocation mode trap on entry to any function that has AVX-512 instructions. This is controlled by "-trap-avx512" option which is on by default. If the option is disabled and AVX-512 instruction is seen in relocation mode, then we abort while re-writing the binary. (cherry picked from FBD6893165)	2018-02-02 16:07:11 -08:00
Rafael Auler	8a5a30156e	[BOLT rebase] Rebase fixes on top of LLVM Feb2018 Summary: This commit includes all code necessary to make BOLT working again after the rebase. This includes a redesign of the EHFrame work, cherry-pick of the 3dnow disassembly work, compilation error fixes, and port of the debug_info work. The macroop fusion feature is not ported yet. The rebased version has minor changes to the "executed instructions" dynostats counter because REP prefixes are considered a part of the instruction it applies to. Also, some X86 instructions had the "mayLoad" tablegen property removed, which BOLT uses to identify and account for loads, thus reducing the total number of loads reported by dynostats. This was observed in X86::MOVDQUmr. TRAP instructions are not terminators anymore, changing our CFG. This commit adds compensation to preserve this old behavior and minimize tests changes. debug_info sections are now slightly larger. The discriminator field in the line table is slightly different due to a change upstream. New profiles generated with the other bolt are incompatible with this version because of different hash values calculated for functions, so they will be considered 100% stale. This commit changes the corresponding test to XFAIL so it can be updated. The hash function changes because it relies on raw opcode values, which change according to the opcodes described in the X86 tablegen files. When processing HHVM, bolt was observed to be using about 800MB more memory in the rebased version and being about 5% slower. (cherry picked from FBD7078072)	2018-02-06 15:00:23 -08:00
Maksim Panchenko	600cf0ecf6	[BOLT] Fix memory regression Summary: This fixes the increased memory consumption introduced in an earlier diff while I was working on new profiling infra. The increase came from a delayed release of memory allocated to intermediate structures used to build CFG. In this diff we release them ASAP, and don't keep them for all functions at the same time. (cherry picked from FBD6890067)	2018-02-02 14:46:21 -08:00
Maksim Panchenko	f85264ae18	[BOLT] Reduce the usage of "Offset" annotation Summary: Limiting "Offset" annotation only to instructions that actually need it, improves the memory consumption on HHVM binary by 1GB. (cherry picked from FBD6878943)	2018-02-01 14:36:29 -08:00
Bill Nell	501601259b	[BOLT] Fix branch info stats after SCTC Summary: SCTC was incorrectly swapping BranchInfo when reversing the branch condition. This was wrong because when we remove the successor BB later, it removes the BranchInfo for that BB. In this case the successor would be the BB with the stats we had just swapped. Instead leave BranchInfo as it is and read the branch count from the false or true branch depending on whether we reverse or replace the branch, respectively. The call to removeSuccessor later will remove the unused BranchInfo we no longer care about. (cherry picked from FBD6876799)	2018-02-01 14:24:26 -08:00
Bill Nell	1207e1d229	[BOLT] Fix lookup of non-allocatable sections in RewriteInstance Summary: Register all sections with BinaryContext. Store all sections in a set ordered by (address, size, name). Add two separate maps to lookup sections by address or by name. Non-allocatable sections are not stored in the address->section map since they all "start" at 0. (cherry picked from FBD6862973)	2018-01-31 12:12:59 -08:00
Qinfan Wu	2b8194fa50	Handle types CU list in updateGdbIndexSection Summary: Handle types CU list in `updateGdbIndexSection`. It looks like the types part of `.gdb_index` isn't empty when `-fdebug-types-section` is used. So instead of aborting, we copy the part to new `.gdb_index` section. (cherry picked from FBD6770460)	2018-01-31 11:52:39 -08:00
Maksim Panchenko	d114ef1fa5	[BOLT] Fix profile for multi-entry functions Summary: When we read profile for functions, we initialize counts for entry blocks first, and then populate counts for all blocks based on incoming edges. During the second phase we ignore the entry blocks because we expect them to be already initialized. For the primary entry at offset 0 it's the correct thing to do, since we treat all incoming branches as calls or tail calls. However, for secondary entries we only consider external edges to be from calls and don't increase entry count if an edge originates from inside the function. Thus we need to update the secondary entry basic block counts with internal edges too. (cherry picked from FBD6836817)	2018-01-23 15:18:41 -08:00
Bill Nell	304c8ba80a	[BOLT] Handle multiple sections with the same name Summary: Multiple sections can have the same name, so we need to make the NameToSectionMap into a multimap. (cherry picked from FBD6847622)	2018-01-30 13:18:40 -08:00
Rafael Auler	48370744d9	[BOLT] Do not assert on bad data Summary: A test is asserting on impossible addresses coming from perf.data, instead of just reporting it as bad data. Fix this behavior. (cherry picked from FBD6835590)	2018-01-29 10:37:30 -08:00
spupyrev	626e977c4a	[BOLT] faster cache+ implementation Summary: Speeding up cache+ algorithm. The idea is to find and merge "fallthrough" successors before main optimization. For a pair of blocks, A and B, block B is the fallthrough successor of A, if (i) all jumps (based on profile) from A goes to B and (ii) all jumps to B are from A. Such blocks should be adjacent in an optimal ordering, and should not be considered for splitting. (This gives the speed up). The gap between cache and cache+ reduced from ~2m to ~1m. (cherry picked from FBD6799900)	2018-01-24 12:29:38 -08:00
Bill Nell	89feb847ea	[BOLT] Refactor relocation analysis code. Summary: Refactor the relocation anaylsis code. It should be a little better at validating that the relocation value matches up with the symbol address + addend stored in the relocation (except on aarch64). It is also a little better at finding the symbol address used to do the lookup in BinaryContext, rather than just using symbol address + addend. (cherry picked from FBD6814702)	2018-01-24 05:42:11 -08:00
Bill Nell	2640b4071f	[BOLT] Refactoring - add BinarySection class Summary: Add BinarySection class that is a wrapper around SectionRef. This is refactoring work for static data reordering. (cherry picked from FBD6792785)	2018-01-23 15:10:24 -08:00
Rafael Auler	907ca25841	[BOLT-AArch64] Support large test binary Summary: Rewrite how data/code markers are interpreted, so the code can have constant islands essentially anywhere. This is necessary to accomodate custom AArch64 assembly code coming from mozjpeg. Allow any function to refer to the constant island owned by any other function. When this happens, we pull the constant island from the referred function and emit it as our own, so it will live nearby the code that refers to it, allowing us to freely reorder functions and code pieces. Make bolt more strict about not changing anything in non-simple ARM functions, as we need to preserve offsets for those functions we don't interpret their jump tables (currently any function with jump tables in ARM is non-simple and is left untouched). (cherry picked from FBD6402324)	2017-11-22 16:17:36 -08:00
Maksim Panchenko	b6cb112feb	[BOLT] New profile format Summary: A new profile that is more resilient to minor binary modifications. BranchData is eliminated. For calls, the data is converted into instruction annotations if the profile matches a function. If a profile cannot be matched, AllCallSites data should have call sites profiles. The new profile format is YAML, which is quite verbose. It still takes less space than the older format because we avoid function name repetition. The plan is to get rid of the old profile format eventually. merge-fdata does not work with the new format yet. (cherry picked from FBD6753747)	2017-12-13 23:12:01 -08:00
Rafael Auler	f8f52d01d0	[BOLT-AArch64] Support SPEC17 programs and organize AArch64 tests Summary: Add a few new relocation types to support a wider variety of binaries, add support for constant island duplication (so we can split functions in large binaries) and make LongJmp pass really precise with respect to layout, so we don't miss stubs insertions at the correct places for really large binaries. In LongJmp, introduce "freeze" annotations so fixBranches won't mess the jumps we carefully determined that needed a stub. (cherry picked from FBD6294390)	2017-11-09 16:59:18 -08:00
spupyrev	a599fe1bbc	[BOLT] a new block reordering algorithm Summary: A new block reordering algorithm, cache+, that is designed to optimize i-cache performance. On a high level, this algorithm is a greedy heuristic that merges clusters (ordered sequences) of basic blocks, similarly to how it is done in OptimizeCacheReorderAlgorithm. There are two important differences: (a) the metric that is optimized in the procedure, and (b) how two clusters are merged together. Initially all clusters are isolated basic blocks. On every iteration, we pick a pair of clusters whose merging yields the biggest increase in the ExtTSP metric (see CacheMetrics.cpp for exact implementation), which models how i-cache "friendly" a pecific cluster is. A pair of clusters giving the maximum gain is merged to a new clusters. The procedure stops when there is only one cluster left, or when merging does not increase ExtTSP. In the latter case, the remaining clusters are sorted by density. An important aspect is the way two clusters are merged. Unlike earlier algorithms (e.g., OptimizeCacheReorderAlgorithm or Pettis-Hansen), two clusters, X and Y, are first split into three, X1, X2, and Y. Then we consider all possible ways of gluing the three clusters (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and choose the one producing the largest score. This improves the quality of the final result (the search space is larger) while keeping the implementation sufficiently fast. (cherry picked from FBD6466264)	2017-12-01 16:54:08 -08:00
Rafael Auler	1fa80594cf	[BOLT] Do not assign a LP to tail calls Summary: Do not assign a LP to tail calls. They are not calls in the view of an unwinder, they are just regular branches. We were hitting an assertion in BinaryFunction::removeConditionalTailCalls() complaining about landing pads in a CTC, however it was in fact a builtin_unreachable being conservatively treated as a CTC. (cherry picked from FBD6564957)	2017-12-13 19:08:43 -08:00
Rafael Auler	660daac2d0	[BOLT] Fix -simplify-rodata-loads wrt data chunks with relocs Summary: The pass was previously copying data that would change after layout because it had a relocation at the copied address. (cherry picked from FBD6541334)	2017-12-11 17:07:56 -08:00
Maksim Panchenko	85f5f4fb63	[BOLT] Fix debugging derp (cherry picked from FBD28110992)	2017-12-11 17:22:49 -08:00
Maksim Panchenko	67cef1f536	debug (cherry picked from FBD28110897)	2017-12-11 11:44:07 -08:00
Maksim Panchenko	d15b93bade	[BOLT] Major overhaul of profiling in BOLT Summary: Profile reading was tightly coupled with building CFG. Since I plan to move to a new profile format that will be associated with CFG it is critical to decouple the two phases. We now have read profile right after the cfg was constructed, but before it is "canonicalized", i.e. CTCs will till be there. After reading the profile, we do a post-processing pass that fixes CFG and does some post-processing for debug info, such as inference of fall-throughs, which is still required with the current format. Another good reason for decoupling is that we can use profile with CFG to more accurately record fall-through branches during aggregation. At the moment we use "Offset" annotations to facilitate location of instructions corresponding to the profile. This might not be super efficient. However, once we switch to the new profile format the offsets would be no longer needed. We might keep them for the aggregator, but if we have to trust LBR data that might not be strictly necessary. I've tried to make changes while keeping backwards compatibly. This makes it easier to verify correctness of the changes, but that also means that we lose accuracy of the profile. Some refactoring is included. Flag "-prof-compat-mode" (on by default) is used for bug-level backwards compatibility. Disable it for more accurate tracing. (cherry picked from FBD6506156)	2017-11-28 09:57:21 -08:00
Maksim Panchenko	b6f7c68a6c	[BOLT] Automatically detect and use relocations Summary: If relocations are available in the binary, use them by default. If "-relocs" is specified, then require relocations for further processing. Use "-relocs=0" to forcefully ignore relocations. Instead of `opts::Relocs` use `BinaryContext::HasRelocations` to check for the presence of the relocations. (cherry picked from FBD6530023)	2017-12-09 21:40:39 -08:00
Maksim Panchenko	2b9bafed83	[BOLT] Consistent DFS ordering for landing pads Summary: The list of landing pads in BinaryBasicBlock was sorted by their address in memory. As a result, the DFS order was not always deterministic. The change is to store landing pads in the order they appear in invoke instructions while keeping them unique. Also, add Throwers verification to validateCFG(). (cherry picked from FBD6529032)	2017-12-08 20:27:49 -08:00
Maksim Panchenko	10274633ee	[BOLT] Options to facilitate debugging Summary: Some helpful options: -print-dyno-stats-only while printing functions output dyno-stats and skip instructions -report-stale print a list of functions with a stale profile (cherry picked from FBD6505141)	2017-12-06 15:45:57 -08:00
Rafael Auler	70d44ab20a	[BOLT] Add REX prefix rebalancing pass Summary: Add a pass to rebalance the usage of REX prefixes, moving them from the hot code path to the cold path whenever possible. To do this, we rank the usage frequency of each register and exchange an X86 classic reg with an extended one (which requires a REX prefix) whenever the classic register is being used less times than the extended one. There are two versions of this pass: regular one will only consider RBX as classic and R12-R15 as extended registers because those are callee-saved, which means their scope is local to the function and therefore they can be easily interchanged within the function without further consequences. The aggressive version relies on liveness analysis to detect if the value of a register is being used as a caller-saved value (written to without being read first), which also is eligible for reallocation. However, it showed limited results and is not the default option because it is expensive. Currently, this pass does not update debug info. This means that if a substitution is made, the AT_LOCATION of a variable inside a function may be outdated and GDB will display the wrong value if you ask it to print the value of the affected variable. Updating DWARF involves a painful task of writing a new DWARF expression parser/writer similar to the one we already have for CFI expressions. I'll defer the task of writing this until we determine this optimization is enabled in production. So far, it is experimental to be combined with other optimizations to help us find a new set of optimizations that is beneficial. (cherry picked from FBD6476659)	2017-11-14 18:20:40 -08:00
Bill Nell	cd0a075a08	[BOLT] Fix ICP nested jump table handling and general stats. Summary: Load elimination for ICP wasn't handling nested jump tables correctly. It wasn't offseting the indices by the range of the nested table. I also wasn't computing some of the stats ICP correctly in all cases which was leading to weird results in the stats. (cherry picked from FBD6453693)	2017-11-29 17:38:39 -08:00
spupyrev	48a53a7b55	a new i-cache metric Summary: The diff introduces two measures for i-cache performance: a TSP measure (currently used for optimization) and an "extended" TSP measure that takes into account jumps between non-consecutive basic blocks. The two measures are computed for estimated addresses/sizes of basic blocks and for the actually omitted addresses/sizes. Intuitively, the Extended-TSP metric quantifies the expected number of i-cache misses for a given ordering of basic blocks. It has 5 parameters: - FallthroughWeight is the impact of fallthrough jumps on the score - ForwardWeight is the impact of forward (but not fallthrough) jumps - BackwardWeight is the impact of backward jumps - ForwardDistance is the max distance of a forward jump affecting the score - BackwardDistance is the max distance of a backward jump affecting the score We're still learning the "best" values for the options but default values look reasonable so far. (cherry picked from FBD6331418)	2017-11-14 16:51:24 -08:00
Rafael Auler	21eb2139ee	Introduce pass to reduce jump tables footprint Summary: Add a pass to identify indirect jumps to jump tables and reduce their entries size from 8 to 4 bytes. For PIC jump tables, it will convert the PIC code to non-PIC (since BOLT only processes static code, it makes no sense to use expensive PIC-style jumps in static code). Add corresponding improvements to register scavenging pass and add a MCInst matcher machinery. (cherry picked from FBD6421582)	2017-11-02 00:30:11 -07:00
Bill Nell	39a8c36697	[BOLT] Use getNumPrimeOperands in shortenInstruction. Summary: Apply maks' review comments (cherry picked from FBD6451164)	2017-11-30 13:30:49 -08:00
Bill Nell	a71b5700c0	[BOLT] Fix bug in shortening peephole. Summary: The arithmetic shortening code on x86 was broken. It would sometimes shorten instructions with immediate operands that wouldn't fit into 8 bits. (cherry picked from FBD6444699)	2017-11-29 17:40:14 -08:00
Bill Nell	0bab742949	[BOLT] Fix icp-top-callsites option, remove icp-always-on. Summary: The icp-top-callsites option was using basic block counts to pick the top callsites while the ICP main loop was using branch info from the targets of each call. These numbers do not exactly match up so there was a dispcrepancy in computing the top calls. I've switch top callsites over to use the same stats as the main loop. The icp-always-on option was redundant with -icp-top-callsites=100, so I removed it. (cherry picked from FBD6370977)	2017-11-19 11:17:57 -08:00
Bill Nell	591e0ef3ba	[BOLT] Add timers for non-optimization related phases. Summary: Add timers for non-optimization related phases. There are two new options, -time-build for disassembling functions and building CFGs, and -time-rewrite for phases in executeRewritePass(). (cherry picked from FBD6422006)	2017-11-27 18:00:24 -08:00
Rafael Auler	dc23def477	[PERF2BOLT] Fix aggregator wrt traces with REP RET Summary: Previously the perf2bolt aggregator was rejecting traces finishing with REP RET (return instruction with REP prefix) as a result of the migration from objdump output to LLVM disassembler, which decodes REP as a separate instruction. Add code to detect REP RET and treat it as a single return instruction. (cherry picked from FBD6417496)	2017-11-27 12:58:21 -08:00
Bill Nell	b2f132c7c2	[RFC] [BOLT] Use iterators for MC branch/call analysis code. Summary: Here's an implementation of an abstract instruction iterator for the branch/call analysis code in MCInstrAnalysis. I'm posting it up to see what you guys think. It's a bit sloppy with constness and probably needs more tidying up. (cherry picked from FBD6244012)	2017-11-04 19:22:05 -07:00
Bill Nell	c4d7460ed6	[BOLT] Improve ICP for virtual method calls and jump tables using value profiling. Summary: Use value profiling data to remove the method pointer loads from vtables when doing ICP at virtual function and jump table callsites. The basic process is the following: 1. Work backwards from the callsite to find the most recent def of the call register. 2. Work back from the call register def to find the instruction where the vtable is loaded. 3. Find out of there is any value profiling data associated with the vtable load. If so, record all these addresses as potential vtables + method offsets. 4. Since the addresses extracted by #3 will be vtable + method offset, we need to figure out the method offset in order to determine the actual vtable base address. At this point I virtually execute all the instructions that occur between #3 and #2 that touch the method pointer register. The result of this execution should be the method offset. 5. Fetch the actual method address from the appropriate data section containing the vtable using the computed method offset. Make sure that this address maps to an actual function symbol. 6. Try to associate a vtable pointer with each target address in SymTargets. If every target has a vtable, then this is almost certainly a virtual method callsite. 7. Use the vtable address when generating the promoted call code. It's basically the same as regular ICP code except that the compare is against the vtable and not the method pointer. Additionally, the instructions to load up the method are dumped into the cold call block. For jump tables, the basic idea is the same. I use the memory profiling data to find the hottest slots in the jumptable and then use that information to compute the indices of the hottest entries. We can then compare the index register to the hot index values and avoid the load from the jump table. Note: I'm assuming the whole call is in a single BB. According to @rafaelauler, this isn't always the case on ARM. This also isn't always the case on X86 either. If there are non-trivial arguments that are passed by value, there could be branches in between the setup and the call. I'm going to leave fixing this until later since it makes things a bit more complicated. I've also fixed a bug where ICP was introducing a conditional tail call. I made sure that SCTC fixes these up afterwards. I have no idea why I made it introduce a CTC in the first place. (cherry picked from FBD6120768)	2017-10-20 12:11:34 -07:00
spupyrev	1475c4da71	speeding up caches for hfsort+ Summary: When running hfsort+, we invalidate too many cache entries, which leads to inefficiencies. It seems we only need to invalidate cache for pairs of clusters (Into, X) and (X, Into) when modifying cluster Into (for all clusters X). With the modification, we do not really need ShortCache, since it is computed only once per pair of clusters. (cherry picked from FBD6341039)	2017-11-15 14:17:39 -08:00
Maksim Panchenko	0836fa7d08	[BOLT] Fix handling of RememberState CFI Summary: When RememberState CFI happens to be the last CFI in a basic block, we used to set the state of the next basic block to a CFI prior to executing RememberState instruction. This contradicts comments in annotateCFIState() function and also differs form behaviour of getCFIStateAtInstr(). As a result we were getting code like the following: .LBB0121166 (21 instructions, align : 1) CFI State : 0 .... 0000001a: !CFI $1 ; OpOffset Reg6 -16 0000001a: !CFI $2 ; OpRememberState .... Successors: .Ltmp4167600, .Ltmp4167601 CFI State: 3 .Ltmp4167601 (13 instructions, align : 1) CFI State : 2 .... Notice that the state at the entry of the 2nd basic block is less than the state at the exit of the previous basic block. In practice we have never seen basic blocks where RememberState was the last CFI instruction in the basic block, and hence we've never run into this issue before. The fix is a synchronization of handling of last RememberState instruction by annotateCFIState() and getCFIStateAtInstr(). In the example above, the CFI state at the entry to the second BB will be 3 after this diff. (cherry picked from FBD6314916)	2017-11-13 11:05:47 -08:00
Bill Nell	7eaaaaba96	[BOLT] Add finer control of peephole pass. Summary: Add selective control over peephole options. This makes it easier to test which ones might have a positive effect. (cherry picked from FBD6289659)	2017-11-08 18:49:33 -08:00
Rafael Auler	a3b719e0f9	[BOLT] Fix ASAN bugs Summary: Fix a leak in DEBUGRewriter.cpp and an address out of bounds issue in edit distance calculation. (cherry picked from FBD6290026)	2017-11-08 14:29:20 -08:00
Bill Nell	e9aa6e1a33	[BOLT] Fix N-1'th sctc bug. Summary: The logic to append an unconditional branch at the end of a block that had the condition flipped on its conditional tail was broken. It should have been looking at the successor to PredBB instead of BB. It also wasn't skipping invalid blocks when finding the fallthrough block. This fixes the SCTC bug uncovered by @spupyrev's work on block reordering. (cherry picked from FBD6269493)	2017-11-07 16:00:26 -08:00
Maksim Panchenko	f8e6f66c1e	[BOLT] Fix segfault in debug print Summary: With "-debug" flag we are using a dump in intermediate state when basic block's list is initialized, but layout is not. In new isSplit() funciton we were checking the size() which uses basic block list, and then we were accessing the (uninitiazed) layout. Instead of checking size() we should be checking layout_size(). (cherry picked from FBD6277770)	2017-11-08 14:42:14 -08:00
spupyrev	a0c041f72a	[BOLT] Custom function alignment Summary: A new 'compact' function aligner that takes function sizes in consideration. The approach is based on the following assumptions: -- It is not desirable to introduce a large offset when aligning short functions, as it leads to a lot of "wasted" address space. -- For longer functions, the offset can be larger than the default 32 bytes; However, using 64 bytes for the offset still worsen performance, as again a lot of address space is wasted. -- Cold parts of functions can still use the default max-32 offset. The algorithm is switched on/off by flag 'use-compact-aligner' and is controlled by parameters align-functions-max-bytes and align-cold-functions-max-bytes described above. In my tests the best performance is produced with '-use-compact-aligner=true -align-functions-max-bytes=48 -align-cold-functions-max-bytes=32'. (cherry picked from FBD6194092)	2017-10-27 15:05:31 -07:00
Rafael Auler	dd6ecdd782	[BOLT-AArch64] Support reordering spec06 gcc relocs Summary: Enhance the basic infrastructure for relocation mode for AArch64 to make a reasonably large program work after reordering (gcc). Detect jump table patterns and skip optimizing functions with jump tables in AArch64, as those will require extra future effort to fully decode. To make these work in relocation mode, we skip changing the function body and introduce a mode to preserve even the original nops. By not changing any local offsets in the function, the input original jump tables should just work. Functions with no jump tables are optimized with BB reordering. No other optimizations have been tested. (cherry picked from FBD6130117)	2017-10-16 11:12:22 -07:00
Maksim Panchenko	69ddcfa5cb	[BOLT] Fix implementation for TSP solution Summary: Fix a bug in reconstruction of an optimal path. When calculating the best path we need to take into account a path from new "last" node to the current last node. Add "-tsp-threshold" (defaults to 10) to control when the TSP algorithm should be used. (cherry picked from FBD6253461)	2017-11-06 11:52:58 -08:00
Rafael Auler	624b2d984a	[BOLT-AArch64] Support relocation mode for bzip2 Summary: As we deal with incomplete addresses in address-computing sequences of code in AArch64, we found it is easier to handle them in relocation mode in the presence of relocations. Incomplete addresses may mislead BOLT into thinking there are instructions referring to a basic block when, in fact, this may be the base address of a data reference. If the relocation is present, we can easily spot such cases. This diff contains extensions in relocation mode to understand and deal with AArch64 relocations. It also adds code to process data inside functions as marked by AArch64 ABI (symbol table entries named "$d"). In our code, this is called constant islands handling. Last, it extends bughunter with a "cross" mode, in which the host generates the binaries and the user test them (uploading to the target), useful when debugging in AArch64. (cherry picked from FBD6024570)	2017-09-20 10:43:01 -07:00
Rafael Auler	76d7740cc9	[BOLT-AArch64] Support reordering bzip2 no relocs Summary: Add functionality to support reordering bzip2 compiled to AArch64, with function splitting but without relocations: * Expand the AArch64 backend to support inverting branches and analyzing branches so BOLT reordering machinery is able to shuffle blocks and fix branches correctly; * Add a new pass named LongJmp to add stubs whenever code needs to jump to the cold area, when using function splitting, because of the limited target encoding capability in AArch64 (as a RISC architecture). (cherry picked from FBD5748184)	2017-08-31 11:45:37 -07:00
Rafael Auler	fe6e9b4ab5	[BOLT-AArch64] Support rewriting bzip2 Summary: Add basic AArch64 read/write capability to be able to disassemble bzip2 for AArch64 compiled with gcc 5.4.0 and write it back after going through the basic BOLT pipeline with no block reordering (NOPs/unreachable blocks get removed). This is not for relocation mode. (cherry picked from FBD5701994)	2017-08-24 14:37:35 -07:00
spupyrev	19fea92792	improving hfsort+ algorithm Summary: A few improvements for hfsort+ algorithm. The goal of the diff is (i) to make the resulting function order more i-cache "friendly" and (ii) fix a bug with incorrect input edge weights. A specific list of changes is as follows: - The "samples" field of CallGraph.Node should be at least the sum of incoming edge weights. Fixed with a new method CallGraph::adjustArcWeights() - A new optimization pass for hfsort+ in which pairs of functions that call each other with very high probability (>=0.99) are always merged. This improves the resulting i-cache but may worsen i-TLB. See a new method HFSortPlus::runPassOne() - Adjusted optimization goal to make the resulting ordering more i-cache "friendly", see HFSortPlus::expectedCalls and HFSortPlus::mergeGain - Functions w/o samples are now reordered too (they're placed at the end of the list of hot functions). These functions do appear in the call graph, as some of their basic blocks have samples in the LBR dataset. See HfSortPlus::initializeClusters (cherry picked from FBD6248850)	2017-10-27 21:15:57 -07:00
Bill Nell	848cb78080	[BOLT] Fix BOLT build Summary: The latest change to MCInstrAnalysis broke then clang build. This fixes it. (cherry picked from FBD6262308)	2017-11-07 11:27:35 -08:00
Bill Nell	0b967eb012	[BOLT] Always call fixBranches in relocation mode. Summary: If you attempted to use a function filter on a binary when in relocation mode, the resulting binary would probably crash. This is because we weren't calling fixBranches on all functions. This was breaking bughunter.sh I also strengthened the validation of basic blocks. The cond branch should always be non-null when there are two successors. (cherry picked from FBD6261930)	2017-11-06 21:04:28 -08:00
Maksim Panchenko	e838b354ce	[BOLT][Refactoring] Move basic block reordering to BinaryPasses Summary: Refactor basic block reordering code out of the BinaryFunction. BinaryFunction::isSplit() is now checking if the first and the last blocks in the layout belong to the same fragment. As a result, it no longer returns true for functions that have their cold part optimized away. Change type for returned "size" from unsigned to size_t. Fix lines over 80 characters long. (cherry picked from FBD6250825)	2017-11-06 11:52:31 -08:00
Bill Nell	46866f5fa0	[BOLT] Refactor branch analysis code. Summary: Move the indirect branch analysis code from BinaryFunction to MCInstrAnalysis/X86MCTargetDesc.cpp. In the process of doing this, I've added an MCRegInfo to MCInstrAnalysis which allowed me to remove a bunch of extra method parameters. I've also had to refactor how BinaryFunction held on to instructions/offsets so that it would be easy to pass a sequence of instructions to the analysis code (rather than a map keyed by offset). Note: I think there are a bunch of MCInstrAnalysis methods that have a BitVector output parameter that could be changed to a return value since the size of the vector is based on the number of registers, i.e. from MCRegisterInfo. I haven't done this in order to keep the diff a more manageable size. (cherry picked from FBD6213556)	2017-11-01 10:26:07 -07:00
Bill Nell	9e42885d04	[BOLT] Add value profiling to BOLT Summary: Add support for reading value profiling info from perf data. This diff adds support in DataReader/DataAggregator for value profiling data. Each event is recorded as two Locations (a PC and an address/value) and a count. For now, I'm assuming that the value profiling data is in the same file as the usual BOLT profiling data. Collecting both at the same time seems to work. (cherry picked from FBD6076877)	2017-10-16 13:09:43 -07:00
Maksim Panchenko	1288c81c9b	[BOLT][Refactoring] Change landing pads handling Summary: Change the way we store and handle landing pads and throwers. (cherry picked from FBD6169992)	2017-10-26 18:36:30 -07:00
spupyrev	244a476a2e	using offsets for CG Summary: Arc->AvgOffset can be used for function/block ordering to distinguish between calls from the beggining of a function and calls from the end of the function. This makes a difference for large functions. (cherry picked from FBD6094221)	2017-10-18 15:18:52 -07:00
Maksim Panchenko	61e5fbf8c3	[BOLT][Refactoring] Get rid of TailCallTerminatedBlocks, etc. Summary: More changes to allow separation of CFG construction and profile assignment. Misc cleanups. (cherry picked from FBD6158653)	2017-10-23 23:32:40 -07:00
Bill Nell	c58996fd55	[BOLT] Add ability to specify custom printers for annotations. Summary: This will give us the ability to print annotations in a more meaningful way. Especially annotations that could be interpreted in multiple ways. I've added one register name printer for liveness analysis. We can update the other dataflow annotations as needed. I also noticed that BitVector annotations were leaking since they contain heap allocated memory. I made removeAnnotation call the annotation destructor explicitly to mitigate this but it won't fix the problem when annotations are just dropped en masse. (cherry picked from FBD6105999)	2017-10-19 12:36:48 -07:00
Maksim Panchenko	2ab7472329	[BOLT] Account for FDE functions when calculating max function size Summary: When we calculate maximum function size we only used to rely on the symbol table information, and ignore function info coming from FDEs. Invalid maximum function size can lead to code emission over the code of neighbouring function. Fix this by considering FDE functions when determining the maximum function size. (cherry picked from FBD6025613)	2017-10-10 14:54:09 -07:00
Maksim Panchenko	1e1833c8a2	[BOLT][Refactoring] Make CTC first class operand, etc. Summary: This diff is a preparation for decoupling function disassembly, profile association, and CFG construction phases. We used to have multiple ways to mark conditional tail calls with annotations or TailCallOffsets map. Since CTC information is affecting the correctness, it is justifiable to have it as a operand class for instruction with a destination (0 is a valid one). "Offset" annotation now replaces "EdgeCountData" and "IndirectBranchData" annotations to extract profile data for any given instruction. Inlining for small functions was broken in a presence of profiled (annotated) instructions and hence I had to remove "-inline-small-functions" from the test case. Also fix an issue with UNDEF section for created __hot_start/__hot_end symbols. Now the symbols use ABS section. (cherry picked from FBD6087284)	2017-10-12 14:57:11 -07:00
spupyrev	b77172ce2f	updating cache metrics Summary: This is a replacement of a previous diff. The implemented metric ('graph distance') is not very useful at the moment but I plan to add more relevant metrics in the subsequent diff. This diff fixes some obvious problems and moves the call of CalcMetrics::printAll to the right place. (cherry picked from FBD6072312)	2017-10-16 16:53:50 -07:00
Maksim Panchenko	4c8f48be3d	[BOLT] Fix function order output option Summary: Add support to output both function order and section order files as the former is useful for offloading functions sorting and the latter is useful for linker script generation: -generate-function-order=<file> -generate-link-sections=<file> (cherry picked from FBD6078446)	2017-10-17 10:05:16 -07:00
Maksim Panchenko	bee9132a54	[BOLT] Change function order file format for linker script Summary: Change output of "-generate-function-order=<file>" to match expected format used for a linker script: * Prefix function names with ".text". * Strip internal suffix from local function names. E.g. for function with names "foo/1" and "foo/foo.c/1" we will only output "foo". * Output (with indentation) duplicate names for folded functions. (cherry picked from FBD6071020)	2017-10-16 15:22:05 -07:00
Maksim Panchenko	1605f07f5c	[BOLT] Create symbol table entries under -hot-text if they did not exist Summary: If "-hot-text" options is specified and the input binary did not have __hot_start/__hot_end symbols, then add them to the symbol table. (cherry picked from FBD6027737)	2017-10-10 18:06:45 -07:00
Maksim Panchenko	3d3fefff46	[BOLT] Use 32 as the default max bytes for function alignment Summary: Several benchmarks (hhvm, compilers) show that 32 provides a good balance between I-Cache performance and iTLB misses. (cherry picked from FBD6026476)	2017-10-10 16:36:01 -07:00
Rafael Auler	7689cf2417	[BOLT] Fix bolt_info ELF note Summary: Small fix - align the end of the descriptor string as well, since readelf will detect when it is not aligned and print an error instead of printing BOLT version and command line. (cherry picked from FBD6023643)	2017-10-10 13:30:05 -07:00
Rafael Auler	0cc2a62f6a	[BOLT] Write bolt info according to ELF spec Summary: Follow ELF spec for NOTE sections when writing bolt info. Since tools such as "readelf -n" will not recognize a custom code identifying our new note section, we use GNU "gold linker version" note, tricking readelf into printing bolt info. (cherry picked from FBD6010153)	2017-10-06 17:54:26 -07:00
Rafael Auler	0ed144a188	[PERF2BOLT] Check build-ids of binaries when aggregating Summary: Check the build-id of the input binary against the build-id of the binary used during profiling data collection with perf, as reported in perf.data. If they differ, issue a warning, since the user should use exactly the same binary. If we cannot determine the build-id of either the input binary or the one registered in the input perf.data, cancel the build-id check but print a log message. (cherry picked from FBD6001917)	2017-10-06 14:42:46 -07:00
spupyrev	f77a6acd71	fixing sizes Summary: In some (weird) cases, a Function is marked 'split' but doesn't contain any 'cold' basic block. In that case, the size of the last basic block of the function is computed incorrectly. Hence, this fix. (cherry picked from FBD6012963)	2017-10-09 14:15:38 -07:00
Rafael Auler	9df6dce234	[PERF2BOLT] Fix aggregator wrt new output format of perf Summary: Perf is now outputting one less space, which broke our previous (flaky) assumptions about field separators when processing the output file. Make it more resilient by accepting any number of spaces before reading LBR entries. (cherry picked from FBD6014941)	2017-10-09 15:52:13 -07:00
Rafael Auler	f02c8c29ee	[PERF2BOLT] Improve user messages about profiling stats Summary: Improve messages and color-code bad traces percentage, warning user about a potential input binary mismatch. (cherry picked from FBD5915934)	2017-09-26 14:42:43 -07:00
Maksim Panchenko	f32784f4cb	[BOLT] Ignore Clang LTO artifact file symbol Summary: The presence of ld-temp.o symbol is somewhat indeterministic. I couldn't find out exactly when it's generated, it could be related to LTO vs ThinLTO, but not always. If the symbol is there, it could affect names of most of functions in LTO binary. The status of the symbol may change between the binary the profile was collected on, and the binary BOLT is called on. As a result, we may mismatch many function names. It is safe to ignore this symbol. (cherry picked from FBD5908955)	2017-09-25 18:05:37 -07:00
Bill Nell	aa05dc91c5	Fix SCTC bug when two pred/succ BB are in a loop. Summary: It's possible that two basic blocks being conidered for SCTC are in a loop in the CFG. In this case a block that is both a predecessor and a successor may have been processed and marked invalid by a previous iteration of the SCTC loop. We should skip rewriting in this case. (cherry picked from FBD5886721)	2017-09-21 15:45:39 -07:00
Rafael Auler	42f957bb75	[BOLT] Integrate perf2bolt into llvm-bolt Summary: Move the data aggregator logic from our python script to our C++ LLVM/BOLT libs. This has a dramatic reduction in processing time for profiling data (from 45 minutes for HHVM to 5 minutes) because we directly use BOLT as a disassembler in order to validate traces found in the LBR and to add the fallthrough counts. Previously, the python approach relied on parsing the output objdump to check traces. (cherry picked from FBD5761313)	2017-09-01 18:13:51 -07:00
Maksim Panchenko	156fc73157	[BOLT] Fix SCTC bug Summary: If conditional branch has been converted to conditional tail call, it may be considered for SCTC optimization later since it will appear as a tail call. We have to make sure that the tail call we are considering is not a conditional branch. (cherry picked from FBD5884777)	2017-09-19 16:59:05 -07:00
Maksim Panchenko	b006d2a860	[BOLT] Fix issue with exception handlers splitting Summary: A cold part of a function can start with a landing pad. As a result, this landing pad will have offset 0 from the start of the corresponding FDE, and it wouldn't get registered by exception-handling runtime. The solution is to use a different landing pad base address (LPStart), such as (FDE_start - 1). (cherry picked from FBD5876561)	2017-09-20 13:32:46 -07:00
Rafael Auler	ef0ec9edf9	[BOLT] Fix frameopt=all for gcc Summary: Fix two bugs. First, stack pointer tracking, the dataflow analysis, was converging to the "superposition" state (meaning that at this point there are multiple and conflicting states) too early in case the entry state in the BB was "empty" AND there was an SP computation in the block. In these cases, we need to propagate an "empty" value as well and wait for an iteration where the input is not empty (only entry BBs start with a non-empty well-defined value). Previously, it was propagating "superposition", meaning there is a conflict of states in this block, which is not true, since the input is empty and, therefore, there is no preceding state to justify a collision of states. Second, if SPT failed and has no idea about the stack values in a block (if it is in the superposition state at a given point in a BB), shrink wrapping should not attempt to insert computation into those blocks that we do not understand what is happening. Fix it to bail on those cases. (cherry picked from FBD5858402)	2017-09-18 16:26:00 -07:00
Rafael Auler	9df155ce11	[BOLT] Introduce non-LBR mode Summary: Add support to read profiles collected without LBR. This involves adapting our data aggregator perf2bolt and adding support in llvm-bolt itself to read this data. This patch also introduces different options to convert basic block execution count to edge count, so BOLT can operate with its regular algorithms to perform basic block layout. The most successful approach is the default one. (cherry picked from FBD5664735)	2017-08-02 10:59:33 -07:00
Maksim Panchenko	29d4f4cfac	[BOLT] Ignore TLS relocations types Summary: No special handling is required for TLS relocations types, and if we see them in the binary we can safely ignore those types. (cherry picked from FBD5853889)	2017-09-13 11:21:47 -07:00
Maksim Panchenko	ec5b3b0a65	[BOLT] Fix bug in SCTC Summary: After SCTC optimization fixDoubleJumps() was relying on CFG information on the number of successors of a basic block. It ignored the fact that conditional tail call had a successor outside of the function and deleted a containing basic block. Discovered while testing old HHVM with disabled jump tables. (cherry picked from FBD5752903)	2017-08-31 17:28:14 -07:00
Maksim Panchenko	bd8e4b9e87	[BOLT] Support PIC-style exception tables Summary: Exceptions tables for PIC may contain indirect type references that are also encoded using relative addresses. This diff adds support for such encodings. We read PIC-style type info table, and write it using new encoding. (cherry picked from FBD5716060)	2017-08-27 17:04:06 -07:00
Maksim Panchenko	49d1f5698d	[BOLT] PLT optimization Summary: Add an option to optimize PLT calls: -plt - optimize PLT calls (requires linking with -znow) =none - do not optimize PLT calls =hot - optimize executed (hot) PLT calls =all - optimize all PLT calls When optimized, the calls are converted to use GOT reference indirectly. GOT entries are guaranteed to contain a valid function pointer if lazy binding is disabled - hence the requirement for linker's -znow option. Note: we can add an entry to .dynamic and drop a requirement for -znow if we were moving .dynamic to a new segment. (cherry picked from FBD5579789)	2017-08-04 11:21:05 -07:00
Maksim Panchenko	0c07445110	[BOLT] Fix printing of dyno-stats Summary: We used to print dyno-stats after instruction lowering which was skewing our metrics as tail calls were no longer recognized as calls for one thing. The fix is to control the point at which dyno-stats printing pass is run and run it immediately before instruction lowering. In the future we may decide to run the pass before some other intervening pass. (cherry picked from FBD5605639)	2017-08-10 13:18:44 -07:00
Rafael Auler	21c48f7d78	Fix profiling for functions with multiple entry points Summary: Fix issue in memcpy where one of its entry points was getting no profiling data and was wrongly considered cold, being put in the cold region. (cherry picked from FBD5569156)	2017-08-02 18:14:01 -07:00
Rafael Auler	b81ff8a8fc	[BOLT] Fix SCTC issue with hot-cold split Summary: SCTC was deleting an unconditional branch to a block in the cold area because it was the next block in the layout vector. Fix the condition to only delete such branches when source and target are in the same allocation area (either both hot or both cold). (cherry picked from FBD5570300)	2017-08-04 20:14:24 -07:00
Maksim Panchenko	e4290d083f	[BOLT] Disable last basic block assertion. Summary: While converting code from __builtin_unreachable() we were asserting that a basic block with a conditional jump and a single CFG successor was the last one before converting the jump to an unconditional one. However, if that code was executed after a conditional tail call conversion in the same function, the original last basic block will no longer be the last one in the post-conversion layout. I'm disabling the assertion since it doesn't seem worth it to add extra checks for the basic block that used to be the last one. (cherry picked from FBD5570298)	2017-08-04 19:39:45 -07:00
Maksim Panchenko	ae409f0b27	[BOLT] Better match LTO functions profile. Summary: * Improve profile matching for LTO binaries that don't match 100%. * Fix profile matching for '.LTHUNK' functions. Add external outgoing branches (calls) for profile validation. There's an improvement for 100% match profile and for stale LTO profile. However, we are still not fully closing the gap with stale profile when LTO is enabled. (NOTE: I haven't updated all test cases yet) (cherry picked from FBD5529293)	2017-07-17 11:22:22 -07:00
Maksim Panchenko	d27b31ee07	[BOLT] Fix reading LSDA address for PIC code Summary: Fix a bug while reading LSDA address in PIC format. The base address was wrong for PC-relative value. There's more work involved in making PIC code with C++ exceptions work. (cherry picked from FBD5538755)	2017-08-01 11:19:01 -07:00
Yue Zhao	eb64d03b73	Reformat the register strings in the output so Stoke can parse without preprocessing. Summary: Minor change. Reformat the def-in, live-out register strings so that Stoke can parse without doing preprocessing. (cherry picked from FBD5537421)	2017-07-27 12:52:56 -07:00
Bohan Ren	87481cb494	[BOLT] Improve Jump-Distance Metric -- Consider Function Execution Count Summary: Function execution count is very important. When calculating metric, we should care more about functions which are known to be executed. The correlations between this metric and both CPU time is slightly improved to be close to 96% and the correlation between this metric and Cache Miss remains the same 96%. Thanks the suggestion from Sergey! (cherry picked from FBD5494720)	2017-07-25 16:27:00 -07:00
Rafael Auler	787db1cf3e	Recognize AArch64 as a valid input Summary: BOLT needs to be configured with the LLVM AArch64 backend. If the backend is linked into the LLVM library, start processing AArch64 binaries. (cherry picked from FBD5489369)	2017-07-25 09:11:42 -07:00
Yue Zhao	70bad8d34d	add: get function score to find hot functions refine the dumped csv format Summary: minor modification of the bolt stoke pass (cherry picked from FBD5471011)	2017-07-13 15:02:52 -07:00
Yue Zhao	6d845719ce	get analysis information of functions Summary: complete the StokeInfo pass, ignore previous arc diff (cherry picked from FBD5306863)	2017-06-13 17:24:27 -07:00
Rafael Auler	4e29afeb18	[BOLT] Add cold symbols to the symbol table Summary: Create new .symtab and .strtab sections, so we can change their sizes and not only patch them. Remove local symbols and add symbols to identify the cold part of split functions. (cherry picked from FBD5345460)	2017-06-27 16:25:59 -07:00
Bohan Ren	4d34471eeb	[BOLT] Improved Jump-Distance Metric Summary: Current existing Jump-Distance Metric (Previously named Call-Distance) will ignore some traversals. This modified version adds those missing traversals back. The correlation remains the same: around 97% correlation with CPU and Cache Miss (which implies that even though some traversals are ignored, it doesn't affect correlation that much.) (cherry picked from FBD5369653)	2017-07-04 15:59:29 -07:00
Rafael Auler	4ecd3856e9	[BOLT] Fix shrink-wrapping bugs Summary: Make shrink-wrapping more stable. Changes: * Correctly detect landing pads at the dominance frontier, bailing on such cases because we are not prepared to split LPs that are target of a critical edge. * Disable FOP's store removal by default - this is experimental and shouldn t go to prod because removing a store that we failed to detect it's actually necessary is disastrous. This pass currently doesn't have a great impact on the number of stores reduced, so it is not a problem. Most stores reduced are due shrink wrapping anyway. * Fix stack access identification - correctly estimate memory length of weird instructions, bail if we don't know. * Make rules for shrink-wrapping more strict: cancel shrink wrapping on a number of cases when we are not 100% sure that we are dealing with a regular callee-saved register. * Add basic block folding to SW. Sometimes when splitting critical edges we create a lot of redundant BBs with the same instructions, same successor but different predecessor. Fold all identical BBs created by splitting critical edges. * Change defaults: now the threshold used to determine when to perform SW is more conservative, to be sure we are moving a spill to a colder area. This effort, along with BB folding, helps us to avoid hurting icache performance by indiscriminately increasing code size. (cherry picked from FBD5315086)	2017-06-22 16:34:01 -07:00
Bohan Ren	ec304396c3	[BOLT] Call Distance Metric Summary: Designed a new metric, which shows 93.46% correltation with Cache Miss and 86% correlation with CPU Time. Definition: One can get all the traversal path for each function. And for each traversal, we will define a distance. The distance represents how far two connected basic blocks are. Therefore, for each traversal, I will go through the basic blocks one by one, until the end of the traversal and sum up the distance for the neighboring basic blocks. Distance between two connected basic blocks is the distance of the centers of two blocks in the binary file. (cherry picked from FBD5242526)	2017-06-13 16:29:39 -07:00
Rafael Auler	3469396269	[BOLT] Set local symbols in relocation mode to zero Summary: Strobelight is getting confused by local symbols that we do not update in relocation mode. These symbols were preserved by the linker in relocation mode in order support emitting relocations against local labels, but they are unused. Issue a quick fix to this by detecting such symbols and setting their value to zero. This patch also fixes an issue with the symbol table that was assigning the wrong section index to symbols associated with the .text section. (cherry picked from FBD5271277)	2017-06-16 20:04:43 -07:00
Bill Nell	59e90f0f43	[BOLT] Make function reordering more robust with stale data. Summary: Rewrote the guts of buildCallGraph. There are two new options to control how the CG is created. UsePerfData controls whether we use the perf data directly to construct the CG for functions with a stale profile. IgnoreRecursiveCalls omits recursive calls from the CG since they might be skewing results unfairly for heavily recursive functions. I've changed the way BinaryFunction::estimateHotSize() works. If the function is marked as split, I count the size of all the non-cold blocks. This gives a different but more accurate answer than the old method. I've improved and updated the CG build stats with extra information. (cherry picked from FBD5224183)	2017-06-09 13:17:36 -07:00
Rafael Auler	8233c7d204	[BOLT] Bail frame analysis on PUSHes escaping vars Summary: Some PUSH instructions may contain memory addresses pushed to the stack. If this memory address is from an object in the stack, cancel further frame analysis for this function since it may be escaping a variable. This fixes a bug with deleting used stores (in frameopt) in hhvm trunk. (cherry picked from FBD5270590)	2017-06-16 15:02:26 -07:00
Yue Zhao	37d0f81df5	BinaryFunction.h: Clarify commet for getSize(), add getNumNonPseudos() Summary: Minor fix and add new function (cherry picked from FBD5270376)	2017-06-16 17:06:13 -07:00
Bill Nell	dc4dd64800	[BOLT] More HFSort+ refactoring Summary: Move most of hfsort+ into a class so the state can more easily be shared. (cherry picked from FBD5216206)	2017-06-08 10:55:28 -07:00
Bohan Ren	f819f53d27	Normalize Clusters Twice Summary: This one will normalize cluster twice, leaving edges connecting two basic block untouched (cherry picked from FBD5207416)	2017-06-07 20:25:30 -07:00
Rafael Auler	eeea415dd2	[BOLT] Fix SCTC execution count assertion Summary: SCTC is currently asserting (my fault :-) when running in combination with hot jump table entries optimization. This optimization sets the frequency for edges connecting basic blocks it creates and jump table targets based on the execution count of the original BB containing the indirect jump. This is OK as an estimation, but it breaks our assumption that the sum of the frequency of preds edges equals to our BB frequency. This happens because the frequency of the BB is rarely equal to its outgoing edges frequency. SCTC, in turn, was updating the execution count for BBs with tail calls by subtracting the frequency count of predecessor edges. Because hot jump table entries optimization broke the BB exec count = sum(preds freq) invariant, SCTC was asserting. To trigger this, the input program must have a jump table where each entry contains a tail call. This happens in the HHVM binary for func _ZN4HPHP11collections5issetEPNS_10ObjectDataEPKNS_10TypedValueE. (cherry picked from FBD5222504)	2017-06-09 15:52:50 -07:00
Bohan Ren	eb63a0b295	[BOLT] Expand BOLT report for basic block ordering Summary: Add a new positional option onto bolt: "-print-function-statistics=<uint64>" which prints information about block ordering for requested number of functions. (cherry picked from FBD5105323)	2017-05-22 11:04:01 -07:00
Bill Nell	ea53066287	[BOLT] Fix hfsort+ caching mechanism Summary: There's good news and bad news. The good news is that this fixes the caching mechanism used by hfsort+ so that we always get the correct end results, i.e. the order is the same whether the cache is enabled or not. The bad news is that it takes about the same amount of time as the original to run. (~6min) The good news is that I can make some improvements on this implementation which I'll put up in another diff. The problem with the old caching mechanism is that it was caching values that were dependent on adjacent sets of clusters. It only invalidated the clusters being merged and none of other clusters that might have been affected. This version computes the adjacency information up front and updates it after every merge, rather than recomputing it for each iteration. It uses the adjacency data to properly invalidate any cached values. (cherry picked from FBD5203023)	2017-06-06 17:43:45 -07:00
Rafael Auler	583790ee22	Fix dynostats for conditional tail calls Summary: Don't treat conditional tail calls as branches for dynostats. Count taken conditional tails calls as calls. Change SCTC to report dynamic numbers after it is done. (cherry picked from FBD5203708)	2017-06-07 14:20:39 -07:00
Rafael Auler	2baa4c7a2c	[BOLT] Only print stats when requested Summary: Make LLVM timers only output numbers when the -time-opts option is used. (cherry picked from FBD5212221)	2017-06-08 13:46:17 -07:00
Bill Nell	8eaa2fdd9f	[BOLT] Fix hfsort+ crash when no perf data is present. Summary: hfsort+ was trying to access the back() of an empty vector when no perf data is present. Just add a guard around that code. (cherry picked from FBD5206962)	2017-06-07 18:31:06 -07:00
Maksim Panchenko	f9436bc903	[BOLT] Fix ELF inter-section references Summary: Since we are stripping non-allocatable relocation sections from the binary and adding new sections it changes section indices in the binary. Some sections refer to other sections by their index which is stored in sh_link or sh_info field. Hence we need to update these field. In the past update of indices was done ad-hoc and as we started adding more complex updates to section header table the update mechanism became broken in some cases. As a result, we were putting wrong indices into sh_link/sh_info. The broken case was discovered while investigating a problem with a stripped BOLTed binary. In BOLTed binary .rela.plt was incorrectly pointing to one of the debug sections and strip command removed the debug section together with .rela section that was referencing it. The new update mechanism computes complete old to new section index mapping and updates sh_link/sh_info fields based on the mapping before writing section header entries into an output file. (cherry picked from FBD5207378)	2017-06-07 20:06:29 -07:00
Rafael Auler	2c23094299	Split FrameAnalysis and improve LivenessAnalysis Summary: Split FrameAnalysis into FrameAnalysis and RegAnalysis, since some optimizations only require register information about functions, not frame information. Refactor callgraph walking code into the CallGraphWalker class, allowing any analysis that depend on the call graph to easily traverse it via a visitor pattern. Also fix LivenessAnalysis, which was broken because it was not considering registers read into callees and incorporating this into caller. (cherry picked from FBD5177901)	2017-06-02 16:57:22 -07:00
Rafael Auler	d850ca3622	[BOLT] Add shrink wrapping pass Summary: Add an implementation for shrink wrapping, a frame optimization that moves callee-saved register spills from hot prologues to cold successors. (cherry picked from FBD4983706)	2017-05-01 16:52:54 -07:00
Maksim Panchenko	4b485f4167	[BOLT] Fix misc issues in relocation mode. Summary: Fix issues discovered while testing LTO mode with bfd linker: * Correctly update absolute function references from code with addend. * Support .got.plt section generated by bfd linker. * Support quirks of .tbss section. * Don't ignore functions if the size in FDE doesn't match the size in the symbol table. Instead keep processing using the maximum indicated size. (cherry picked from FBD5178831)	2017-06-02 18:41:31 -07:00
Bill Nell	382c660ee5	[BOLT] Make hfsort+ deterministic and add test case Summary: Make hfsort+ algorithm deterministic. We only had a test for hfsort. Since hfsort+ is going to be the default, I've added a test for that too. (cherry picked from FBD5143143)	2017-05-26 17:42:39 -07:00
Bill Nell	5feee9f1d8	[BOLT] More CG refactoring Summary: Do some additional refactoring of the CallGraph class. Add a BinaryFunctionCallGraph class that has the BOLT specific bits. This is in preparation to moving the generic CallGraph class into a library that both BOLT and HHVM can use. Make data members of CallGraph private and add the appropriate accessor methods. (cherry picked from FBD5143468)	2017-05-26 15:46:46 -07:00
Maksim Panchenko	95ab659fe4	[BOLT] Do not assert on an empty location list. Summary: Clang generates an empty debug location list, which doesn't make sense, but we probably shouldn't assert on it and instead issue a warning in verbosity mode. There is only a single empty location list in the whole llvm binary. (cherry picked from FBD5166666)	2017-06-01 12:30:52 -07:00
Bill Nell	733e8c464f	HFSort/call graph refactoring Summary: I've factored out the call graph code from dataflow and function reordering code and done a few small renames/cleanups. I've also moved the function reordering pass into a separate file because it was starting to get big. I've got more refactoring planned for hfsort/call graph but this is a start. (cherry picked from FBD5140771)	2017-05-26 12:53:21 -07:00
Bill Nell	9b190cc74b	[BOLT] Fix SCTC again again. Summary: I put the const_cast<BinaryFunction *>(this) on the wrong version of getBasicBlockAfter(). It's on the right one now. (cherry picked from FBD5159127)	2017-05-31 14:23:37 -07:00
Maksim Panchenko	6c32079d57	[BOLT] Update addresses for DW_TAG_GNU_call_site and DW_TAG_label. Summary: Some DWARF tags (such as GNU_call_site and label) reference instruction addresses in the input binary. When we update debug info we need to update these tags too with new addresses. Also fix base address used for calculation of output addresses in relocation mode. (cherry picked from FBD5155814)	2017-05-31 09:36:49 -07:00
Bill Nell	35d2530a40	[BOLT] Fix SCTC again. Summary: Respect hot/cold boundaries when using BinaryFunction::getBasicBlockAfter(). (cherry picked from FBD5153379)	2017-05-30 19:06:22 -07:00
Maksim Panchenko	2e744e6867	[BOLT] Emit sorted DWARF ranges and location lists. Summary: When producing address ranges and location lists for debug info add a post-processing step that sorts them and merges adjacent entries. Fix a memory allocation/free issue for .debug_ranges section. (cherry picked from FBD5130583)	2017-05-24 15:20:27 -07:00
Bill Nell	96943d2f4b	Add option to generate function order file. Summary: Add -generate-function-order=<filename> option to write the computed function order to a file. We can read this order in later rather than recomputing each time we process a binary with BOLT. (cherry picked from FBD5127915)	2017-05-24 18:40:29 -07:00
Maksim Panchenko	2428567f7d	[BOLT] Fix no-assertions build. (cherry picked from FBD5130285)	2017-05-25 10:29:38 -07:00
Maksim Panchenko	174e3a825b	[BOLT] Fix C++ ABI function alignment. Summary: C++ functions have to be aligned at 2-bytes minimum on x86-64. (cherry picked from FBD5128185)	2017-05-24 21:59:01 -07:00
Bill Nell	5cd58961a9	Add .bolt_info notes section containing BOLT revision and command line args. Summary: Optinally add a .bolt_info notes section containing BOLT revision and command line args. The new section is controlled by the -add-bolt-info flag which is on by default. (cherry picked from FBD5125890)	2017-05-24 14:14:16 -07:00
Rafael Auler	2ee4bbd3c1	[BOLT] Optimize jump tables with hot entries Summary: This diff is similar to Bill's diff for optimizing jump tables (and is built on top of it), but it differs in the strategy used to optimize the jump table. The previous approach loads the target address from the jump table and compare it to check if it is a hot target. This accomplishes branch misprediction reduction by promote the indirect jmp to a (more predictable) direct jmp. load %r10, JMPTABLE cmp %r10, HOTTARGET je HOTTARGET ijmp [JMPTABLE + %index * scale] The idea in this diff is instead to make dcache better by avoiding the load of the jump table, leaving branch mispredictions as a secondary target. To do this we compare the index used in the indirect jmp and if it matches a known hot entry, it performs a direct jump to the target. cmp %index, HOTINDEX je CORRESPONDING_TARGET ijmp [JMPTABLE + %index * scale] The downside of this approach is that we may have multiple indices associated with a single target, but we only have profiling to show which targets are hot and we have no clue about which indices are hot. INDEX TARGET 0 4004f8 8 4004f8 10 4003d0 18 4004f8 Profiling data: TARGET COUNT 4004f8 10020 4003d0 17 In this example, we know 4004f8 is hot, but to make a direct call to it we need to check for indices 0, 8 and 18 -- 3 comparisons instead of 1. Therefore, once we know a target is hot, we must generate code to compare against all possible indices associated with this target because we don't know which index is the hot one (IF there's a hotter index). cmp %index, 0 je 4004f8 cmp %index, 8 je 4004f8 cmp %index, 18 je 4004f8 (... up to N comparisons as in --indirect-call-promotion-topn=N ) ijmp [JMPTABLE + %index * scale] (cherry picked from FBD5005620)	2017-05-01 14:04:40 -07:00
Bill Nell	3a3bcd767e	Don't add useless uncond branch to fallthroughs when running SCTC. Summary: SCTC was sometimes adding unconditional branches to fallthrough blocks. This diff checks to see if the unconditional branch is really necessary, e.g. it's not to a fallthrough block. (cherry picked from FBD5098493)	2017-05-19 14:45:46 -07:00
Maksim Panchenko	96adec51eb	[BOLT] Rework debug info processing. Summary: Multiple improvements to debug info handling: * Add support for relocation mode. * Speed-up processing. * Reduce memory consumption. * Bug fixes. The high-level idea behind the new debug handling is that we don't save intermediate state for ranges and location lists. Instead we depend on function and basic block address transformations to update the info as a final post-processing step. For HHVM in non-relocation mode the peak memory went down from 55GB to 35GB. Processing time went from over 6 minutes to under 5 minutes. (cherry picked from FBD5113431)	2017-05-16 09:27:34 -07:00
Rafael Auler	511a1c78b2	[BOLT] Add dataflow infrastructure Summary: This diff introduces a common infrastructure for performing dataflow analyses in BinaryFunctions as well as a few analyses that are useful in a variety of scenarios. The largest user of this infrastructure so far is shrink wrapping, which will be added in a separate diff. (cherry picked from FBD4983671)	2017-05-01 16:51:27 -07:00
Maksim Panchenko	457b7f14b9	[BOLT] Fix debug info for input with continuous range. Summary: When we see a compilation unit with continuous range on input, it has two attributes: DW_AT_low_pc and DW_AT_high_pc. We convert the range to a non-continuous one and change the attributes to DW_AT_ranges and DW_AT_producer. However, gdb seems to expect every compilation unit to have a base address specified via DW_AT_low_pc, even when its value is always 0. Otherwise gdb will not show proper debug info for such modules. With this diff we produce DW_AT_ranges followed by DW_AT_low_pc. The problem is that the first attribute takes DW_FORM_sec_offset which is exactly 4 bytes, and in many cases we are left with 12 bytes to fill in. We used to fill this space with DW_AT_producer, which took an arbitrary-length field. For DW_AT_low_pc we can use a trick of using DW_FORM_udata (unsigned ULEB128 encoded integer) which can take up to 12 bytes, even when the value is 0. (cherry picked from FBD5109798)	2017-05-22 17:17:04 -07:00
Bill Nell	4806b13835	[BOLT] Add jump table support to ICP Summary: Add jump table support to ICP. The optimization is basically the same as ICP for tail calls. The big difference is that the profiling data comes from the jump table and the targets are local symbols rather than global. I've removed an instruction from ICP for tail calls. The code used to have a conditional jump to a block with a direct jump to the target, i.e. B1: cmp foo,(%rax) jne B3 B2: jmp foo B3: ... this code is now: B1: cmp foo,(%rax) je foo B2: ... The other changes in this diff: - Move ICP + new jump table support to separate file in Passes. - Improve the CFG validation to handle jump tables. - Fix the double jump peephole so that the successor of the modified block is updated properly. Also make sure that any existing branches in the block are modified to properly reflect the new CFG. - Add an invocation of the double jump peephole to SCTC. This allows us to remove a call to peepholes/UCE occurring after fixBranches() in the pass manager. - Miscellaneous cleanups to BOLT output. (cherry picked from FBD4727757)	2017-03-08 19:58:33 -08:00
Maksim Panchenko	c789d5137b	[BOLT] Add option to keep/generate .debug_aranges. Summary: GOLD linker removes .debug_aranges while generating .gdb_index. Some tools however rely on the presence of this section. Add an option to generate .debug_aranges if it was removed, or keep it in the file if it was present. Generally speaking .debug_aranges duplicates information present in .gdb_index addresses table. (cherry picked from FBD5084808)	2017-05-17 18:35:00 -07:00
Maksim Panchenko	69b586326c	[BOLT] Support adding new non-allocatable sections. Summary: We had the ability to add allocatable sections before. This diff expands this capability to non-allocatable sections. (cherry picked from FBD5082018)	2017-05-16 17:29:31 -07:00
Maksim Panchenko	3adb52d80e	[BOLT] Update .gdb_index section. Summary: Update address table in .gdb_index section. (cherry picked from FBD5068255)	2017-05-15 15:21:59 -07:00
Maksim Panchenko	3f42fdf7da	[BOLT] Update function address and size in relocation mode. Summary: Set function addresses after code emission but before we update debug info and symbol table entries. (cherry picked from FBD5029609)	2017-05-08 22:51:36 -07:00
Maksim Panchenko	13c89e6ef1	[BOLT] Fix branch data for __builtin_unreachable(). Summary: When we have a conditional branch past the end of function (a result of a call to__builtin_unreachable()), we replace the branch with nop, but keep branch information for validation purposes. If that branch has a recorded profile we mistakenly create an additional successor to a containing basic block (a 3rd successor). Instead of adding the branch to FTBranches list we should be adding to IgnoredBranches. (cherry picked from FBD4912840)	2017-04-18 23:32:11 -07:00
Maksim Panchenko	075f076503	[BOLT] Don't abort on processing binaries with .gdb_index section Summary: While writing non-allocatable sections we had an assumption that the size of such section is congruent to the alignment, as typically such sections are a collections of fixed-sized elements. .gdb_index breaks this assumption. This diff removes the assertion that was triggered by a presence of .gdb_index section, and makes sure that we insert a padding if we are appending to a section with a size not congruent to section alignment. (cherry picked from FBD4844553)	2017-04-06 10:49:59 -07:00
Bill Nell	c7cccacc4f	[BOLT] Enable SCTC by default. (cherry picked from FBD4837849)	2017-04-05 13:23:58 -07:00
Maksim Panchenko	34c8a7c21b	[BOLT] Relocation support for non-allocatable sections. Summary: Relocations can be created for non-allocatable (aka Note) sections. To start using this for debug info, the emission has to be moved earlier in the pipeline for relocation processing to kick in. (cherry picked from FBD4835204)	2017-04-05 09:29:24 -07:00
Maksim Panchenko	a99005397f	[BOLT] Fix branch count in removeDuplicateConditionalSuccessor(). Summary: When we merge the original branch counts we have to make sure both of them have a profile. Otherwise set the count to COUNT_NO_PROFILE. The misprediction count should be 0. (cherry picked from FBD4837774)	2017-04-05 13:00:20 -07:00
Bill Nell	6c5c65e3a3	[BOLT] Fix double jump peephole, remove useless conditional branches. Summary: I split some of this out from the jumptable diff since it fixes the double jump peephole. I've changed the pass manager so that UCE and peepholes are not called after SCTC. I've incorporated a call to the double jump fixer to SCTC since it is needed to fix things up afterwards. While working on fixing the double jump peephole I discovered a few useless conditional branches that could be removed as well. I highly doubt that removing them will improve perf at all but it does seem odd to leave in useless conditional branches. There are also some minor logging improvements. (cherry picked from FBD4751875)	2017-03-20 22:44:25 -07:00
Maksim Panchenko	f7d32f7e7d	[BOLT] Detect and reject binaries built for coverage. Summary: Don't attempt to optimize binaries built with coverage support. (cherry picked from FBD4810330)	2017-03-31 07:51:30 -07:00
Maksim Panchenko	c166a8c1a7	[BOLT] Fix debug info update for inlining. Summary: When inlining, if a callee has debug info and a caller does not (i.e. a containing compilation unit was compiled without "-g"), we try to update a nonexistent compilation unit. Instead we should skip updating debug info in such cases. Minor refactoring of line number emitting code. (cherry picked from FBD4823982)	2017-04-03 16:24:26 -07:00
Maksim Panchenko	0bde796e50	[BOLT] Organize options in categories for pretty printing (near NFC). Summary: Each BOLT-specific option now belongs to BoltCategory or BoltOptCategory. Use alphabetical order for options in source code (does not affect output). The result is a cleaner output of "llvm-bolt -help" which does not include any unrelated llvm options and is close to the following: ..... BOLT generic options: -data=<string> - <data file> -dyno-stats - print execution info based on profile -hot-text - hot text symbols support (relocation mode) -o=<string> - <output file> -relocs - relocation mode - use relocations to move functions in the binary -update-debug-sections - update DWARF debug sections of the executable -use-gnu-stack - use GNU_STACK program header for new segment (workaround for issues with strip/objcopy) -use-old-text - re-use space in old .text if possible (relocation mode) -v=<uint> - set verbosity level for diagnostic output BOLT optimization options: -align-blocks - try to align BBs inserting nops -align-functions=<uint> - align functions at a given value (relocation mode) -align-functions-max-bytes=<uint> - maximum number of bytes to use to align functions -boost-macroops - try to boost macro-op fusions by avoiding the cache-line boundary -eliminate-unreachable - eliminate unreachable code -frame-opt - optimize stack frame accesses ...... (cherry picked from FBD4793684)	2017-03-28 14:40:20 -07:00
Maksim Panchenko	d5a0264a9e	[BOLT] Issue error in relocs mode if input is lacking relocations. Summary: If we specify "-relocs" flag and an input has no relocations we proceed with assumptions that relocations were there and break the binary. Detect the condition above, and reject the input. (cherry picked from FBD4761239)	2017-03-22 22:05:50 -07:00
Rafael Auler	ad81bd6779	Change dynostats dynamic instruction count policy Summary: Also add LOAD/STORE counters. (cherry picked from FBD4732284)	2017-03-17 10:32:56 -07:00
Bill Nell	b1ef186ca9	[BOLT] Don't allow non-symbol targets in ICP Summary: ICP was letting through call targets that weren't symbols. This diff filters out the non-symbol targets before running ICP. (cherry picked from FBD4735358)	2017-03-18 11:55:45 -07:00
Maksim Panchenko	e6f96de4d0	[BOLT] Add option to print only specific functions. Summary: Add option '-print-only=func1,func2,...' to print only functions of interest. The rest of the functions are still processed and optimized (e.g. inlined), but only the ones on the list are printed. (cherry picked from FBD4734610)	2017-03-17 19:05:11 -07:00
Maksim Panchenko	6cfd7ac2d5	[BOLT] Do not overwrite starting address in non-relocation mode. Summary: In non-relocation mode we shouldn't attemtp to change ELF entry point. What made matters worse - it broke '-max-funcs=' and '-funcs=' options since an entry function more often than not was excluded from the list of processed functions, and we were setting entry point to 0. (cherry picked from FBD4720044)	2017-03-15 19:31:20 -07:00
Maksim Panchenko	559a57a181	[BOLT] Improve dynostats output. Summary: Reduce verbosity of dynostats to make them more readable. * Don't print "before" dynostats twice. * Detect if dynostats have changed after optimization and print before/after only if at least one metric have changed. Otherwise just print dynostats once and indicate "no change". * If any given metric hasn't changed, then print the difference as "(=)" as opposed to (+0.0%). (cherry picked from FBD4705920)	2017-03-14 09:03:23 -07:00
Maksim Panchenko	351af0c895	[BOLT] Do not process empty functions. Summary: While running on a recent test binary BOLT failed with an error. We were trying to process '__hot_end' (which is not really a function), and asserted that it had no basic blocks. This diff marks functions with empty basic blocks list as non-simple since there's no need to process them. (cherry picked from FBD4696517)	2017-03-12 11:30:05 -07:00
Bill Nell	2e5c2e689f	Fix hfsort callgraph stats, add hfsort test. Summary: The stats for call sites that are not included in the call graph were broken. The intention is to count the total number of call sites vs. the number of call sites that are ignored because they have targets that are not BinaryFunctions. Also add a new test for hfsort. (cherry picked from FBD4668631)	2017-03-07 11:45:07 -08:00
Maksim Panchenko	f4825ea417	[BOLT] Fix gcc5 build. Summary: A <numeric> include is required for gcc5 build. (cherry picked from FBD4671953)	2017-03-07 18:09:09 -08:00
Maksim Panchenko	98737b34bb	[BOLT] Fix verbose output. Summary: Inadvertently, output of BOLT became way too verbose. Discovered while building HHVM on master. (cherry picked from FBD4669881)	2017-03-07 14:22:15 -08:00
Bill Nell	fed0980139	[BOLT] Update tests Summary: Fix validateCFG to handle BBs that were generated from code that used _builtin_unreachable(). Add -verify-cfg option to run CFG validation after every optimization pass. (cherry picked from FBD4641174)	2017-02-27 21:44:38 -08:00
Maksim Panchenko	0acba2bcf0	[BOLT] Detect unmarked data in text. Summary: Sometimes a code written in assembly will have unmarked data (such as constants) embedded into text. Typically such data falls into a "padding" address space of a function. This diffs detects such references, and adjusts the padding space to prevent overwriting of code in data. Note that in relocation mode we prefer to overwrite the original code (-use-old-text) and thus cannot simply ignore data in text. (cherry picked from FBD4662780)	2017-02-21 14:18:09 -08:00
Maksim Panchenko	f241e252fc	[BOLT] Detect and handle __builtin_unreachable(). Summary: Calls to __builtin_unreachable() can result in a inconsistent CFG. It was possible for basic block to end with a conditional branche and have a single successor. Or there could exist non-terminated basic block without successors. We also often treated conditional jumps with destination past the end of a function as conditional tail calls. This can be prevented reliably at least when the byte past the end of the function does not belong to the next function. This diff includes several changes: * At disassembly stage jumps past the end of a function are converted into 'nops'. This is done only for cases when we can guarantee that the jump is not a tail call. Conversion to nop is required since the instruction could be referenced either by exception handling tables and/or debug info. Nops are later removed. * In CFG insert 'ret' into non-terminated basic blocks without successors (this almost never happens). * Conditional jumps at the end of the function are removed from CFG. The block will still have a single successor. * Cases where a destination of a jump instruction is the start of the next function, are still conservatively handled as (conditional) tail calls. (cherry picked from FBD4655046)	2017-03-03 11:35:41 -08:00
Maksim Panchenko	6dc2351505	[BOLT] New CFI handling policy. Summary: The new interface for handling Call Frame Information: * CFI state at any point in a function (in CFG state) is defined by CFI state at basic block entry and CFI instructions inside the block. The state is independent of basic blocks layout order (this is implied by CFG state but wasn't always true in the past). * Use BinaryBasicBlock::getCFIStateAtInstr(const MCInst Inst) to get CFI state at any given instruction in the program. No need to call fixCFIState() after any given pass. fixCFIState() is called only once during function finalization, and any function transformations after that point are prohibited. * When introducing new basic blocks, make sure CFI state at entry is set correctly and matches CFI instructions in the basic block (if any). * When splitting basic blocks, use getCFIStateAtInstr() to get a state at the split point, and set the new basic block's CFI state to this value. Introduce CFG_Finalized state to indicate that no further optimizations are allowed on the function. This state is reached after we have synced CFI instructions and updated EH info. Rename "-print-after-fixup" option to "-print-finalized". This diffs fixes CFI for cases when we split conditional tail calls, and for indirect call promotion optimization. (cherry picked from FBD4629307)	2017-02-24 21:59:33 -08:00
Rafael Auler	965a373dc4	Fix warnings when compiling with clang (NFC) Summary: Fix inconsistent override keyword usages and initializes a missing field of a Relocation object when using braced initializers. (cherry picked from FBD4622856)	2017-02-27 13:09:27 -08:00
Maksim Panchenko	2029458f34	[BOLT] Strip 'repz' prefix from 'repz retq'. Summary: Add pass to strip 'repz' prefix from 'repz retq' sequence. The prefix is not used in Intel CPUs afaik. The pass is on by default. (cherry picked from FBD4610329)	2017-02-23 18:09:10 -08:00
Maksim Panchenko	88a461014b	[BOLT] Don't set code skew in relocations mode. Summary: We use code skew in non-relocation mode since functions have fixed addresses, and internal alignment has to be adjusted wrt the skew. However in relocation mode it interferes with effective code alignment, and has to be disabled. I missed it when was re-basing the relocation diff. (cherry picked from FBD4599670)	2017-02-22 11:29:52 -08:00
Maksim Panchenko	d3e33b6edc	[BOLT] Fix -jump-tables=basic in relocation mode. Summary: In a prev diff I added an option to update jump tables in-place (on by default) and accidentally broke the default handling of jump tables in relocation mode. The update should be happening semi-automatically, but because we ignore relocations for jump tables it wasn't happening (derp). Since we mostly use '-jump-tables=move' this hasn't been noticed for some time. This diff gets rid of IgnoredRelocations and removes relocations from a relocation set when they are no longer needed. If relocations are created later for jump tables they are no longer ignored. (cherry picked from FBD4595159)	2017-02-21 16:15:15 -08:00
Maksim Panchenko	88244a10bb	[BOLT] Move BOLT passes under Passes subdirectory (NFC). Summary: Move passes under Passes subdirectory. Move inlining passes under Passes/Inliner.* (cherry picked from FBD4575832)	2017-02-16 14:57:57 -08:00
Maksim Panchenko	f06a1455ea	[BOLT] Add support for *GOTPCRELX relocation type. Summary: gcc5 can generate new types of relocations that give linker a freedom to substitute instructions. These relocations are PC-relative, and since we manually process such relocations they don't present much of a problem. Additionally, detect non-pc-relative access from code into a middle of a function. Occasionally I've seen such code, but don't know exactly how to trigger its generation. Just issue a warning for now. (cherry picked from FBD4566473)	2017-02-14 22:55:10 -08:00
Maksim Panchenko	82965b963f	[BOLT] Emit short tail calls in relocation mode. Summary: To minimize size of the output code we should emit tail calls that are as short as possible. For this we have to convert a synthetic TAILJMPd into JMP_1 instruction. This should be one of the last passes as most of analysis passes could break since tail calls will no longer be marked as such. The total size of the code is smaller, but not by much - hot text was reduced by 192 bytes. (cherry picked from FBD4557804)	2017-02-13 23:05:12 -08:00
Maksim Panchenko	734a7a5437	[BOLT] Skip disassembly of padding at function end. Summary: Some functions coming from assembly may not have been marked with size. We assume the size to include all bytes up to the next function/object in the file. As a result, function body will include any padding inserted by the linker. If linker inserts 0-value bytes this could be misinterpreted as invalid instruction and BOLT will bail out on such functions in non-relocation mode, and give up on a binary in relocation mode. This diff detects zero-padding, ignores it, and continues processing as normal. (cherry picked from FBD4528893)	2017-02-08 09:14:10 -08:00
Maksim Panchenko	6b0b5bbae7	[BOLT] Reject sanitized binaries. Summary: Whenever input binary is suspected to have been sanitized we print an error message and exit. I've checked that "__asan_init*" symbol presence is the most conservative way to detect "sanitization". (cherry picked from FBD4525478)	2017-02-07 15:56:00 -08:00
Maksim Panchenko	c89821cee3	[BOLT] Detect and prevent re-optimization attempts. Summary: Whenever we try to re-optimize a binary with BOLT we should issue an error and exit. (cherry picked from FBD4525228)	2017-02-07 15:31:14 -08:00
Maksim Panchenko	e212805ea6	[BOLT] Update section names in output file. Summary: Re-write section header string table to reflect new names given to sections. Old sections get ".bolt.org" prefix. E.g. when we write ".eh_frame" section, we keep the old copy but rename it to ".bolt.org.eh_frame". Note: the new code section is named ".bolt.text" - it contains split function bodies, while original ".text" name is left unchanged. (cherry picked from FBD4524935)	2017-02-07 12:20:46 -08:00
Bill Nell	d74997c3cc	Indirect call promotion optimization. Summary: Perform indirect call promotion optimization in BOLT. The code scans the instructions during CFG creation for all indirect calls. Right now indirect tail calls are not handled since the functions are marked not simple. The offsets of the indirect calls are stored for later use by the ICP pass. The indirect call promotion pass visits each indirect call and examines the BranchData for each. If the most frequent targets from that callsite exceed the specified threshold (default 90%), the call is promoted. Otherwise, it is ignored. By default, only one target is considered at each callsite. When an candiate callsite is processed, we modify the callsite to test for the most common call targets before calling through the original generic call mechanism. The CFG and layout are modified by ICP. A few new command line options have been added: -indirect-call-promotion -indirect-call-promotion-threshold=<percentage> -indirect-call-promotion-topn=<int> The threshold is the minimum frequency of a call target needed before ICP is triggered. The topn option controls the number of targets to consider for each callsite, e.g. ICP is triggered if topn=2 and the total requency of the top two call targets exceeds the threshold. Example of ICP: C++ code: int B_count = 0; int C_count = 0; struct A { virtual void foo() = 0; } struct B : public A { virtual void foo() { ++B_count; }; }; struct C : public A { virtual void foo() { ++C_count; }; }; A* a = ... a->foo(); ... original: 400863: 49 8b 07 mov (%r15),%rax 400866: 4c 89 ff mov %r15,%rdi 400869: ff 10 callq (%rax) 40086b: 41 83 e6 01 and $0x1,%r14d 40086f: 4d 89 e6 mov %r12,%r14 400872: 4c 0f 44 f5 cmove %rbp,%r14 400876: 4c 89 f7 mov %r14,%rdi ... after ICP: 40085e: 49 8b 07 mov (%r15),%rax 400861: 4c 89 ff mov %r15,%rdi 400864: 49 ba e0 0b 40 00 00 movabs $0x400be0,%r10 40086b: 00 00 00 40086e: 4c 3b 10 cmp (%rax),%r10 400871: 75 29 jne 40089c <main+0x9c> 400873: 41 ff d2 callq %r10 400876: 41 83 e6 01 and $0x1,%r14d 40087a: 4d 89 e6 mov %r12,%r14 40087d: 4c 0f 44 f5 cmove %rbp,%r14 400881: 4c 89 f7 mov %r14,%rdi ... 40089c: ff 10 callq *(%rax) 40089e: eb d6 jmp 400876 <main+0x76> (cherry picked from FBD3612218)	2016-09-07 18:59:23 -07:00
Maksim Panchenko	6ff1795d96	[BOLT] Support overwriting jump tables in-place. Summary: Add an option to overwrite jump tables without moving and make it a default: -jump-tables - jump tables support (default=basic) =none - do not optimize functions with jump tables =basic - optimize functions with jump tables =move - move jump tables to a separate section =split - split jump tables section into hot and cold based on function execution frequency =aggressive - aggressively split jump tables section based on usage of the tables (cherry picked from FBD4448499)	2017-01-17 15:49:59 -08:00
Rafael Auler	6dfd16cb4c	Cover RSP-indexed accesses in frame optimization Summary: Add a new dataflow analysis to recover the value of RSP at a given point of the program. This value is expressed as an offset from the CFA. Use this information to detect redundant load in memory accesses performed via RSP as well, not only RBP as done previously. Bail when RSP value (as an offset of the CFA) can't be reliably determined with a simple dataflow analysis. (cherry picked from FBD4372261)	2016-12-28 17:09:52 -08:00
Maksim Panchenko	503c741d43	[BOLT] Report stale functions' percentage wrt all profiled functions. Summary: Report stale functions percentage with respect to all profiled functions instead of all simple functions in the binary. The new reporting format should make it more apparent if the profile is out-of-date. Compare: BOLT-INFO: 341 (16.7% of all profiled) functions have invalid (possibly stale) profile. vs old: BOLT-INFO: 341 (0.3%) functions have invalid (possibly stale) profile. (cherry picked from FBD4451746)	2017-01-23 13:08:40 -08:00
Maksim Panchenko	19859377f8	[BOLT] Fix debug info update for zero-length ranges. Summary: Due to a clowntown on my part we were generating wrong ranges when an empty range was seen on input. We were basically expanding the range to include all basic blocks following such range and setting wrong sizes at the same time. Add "-dump-cu" option to llvm-dwarfdump that allows to look at debug info of a single compile unit only. Saves time if we are only interested in a subset of information. (cherry picked from FBD4430989)	2017-01-18 10:09:54 -08:00
Maksim Panchenko	0894905373	[ICF] Don't re-fold functions in non-relocation mode. Summary: In-non relocation mode, when we run ICF the second time, we fold the same functions again since they were not removed from the function set. This diff marks them as folded and ignores them during ICF optimization. Note that we still want to optimize such functions since they are potentially called from the code not covered by BOLT in non-relocation mode. Folded functions are also excluded from dyno stats with this diff Also print the number of times folded functions were called. When 2 functions - f1() and f2() are folded, that number would be min(call_frequency(f1), call_frequency(f2)). (cherry picked from FBD4399993)	2017-01-10 11:20:56 -08:00
Maksim Panchenko	bc8a456309	ICF improvements. Summary: Re-worked the way ICF operates. The pass now checks for more than just call instructions, but also for all references including function pointers. Jump tables are handled too. (cherry picked from FBD4372491)	2016-12-21 17:13:56 -08:00
Maksim Panchenko	55fc5417f8	Relocations support for BOLT. Summary: Read relocation from linker and relocate all functions. (cherry picked from FBD4223901)	2016-09-27 19:09:38 -07:00
Rafael Auler	a75bbfc640	Add a frame optimization pass Summary: This is a first attempt to perform data flow analyses on bolt and try to rebuild the stack frame for functions. The goal of the frame optimization pass is to detect instructions that are accessing stack and, if loading values, evaluate whether this load is redundant and we can substitute the memory operation for a register load or immediate load. To find opportunities, this pass also builds a map of clobbered registers by function, so we use this in our analysis at call sites. If a call site is found out to not clobber a caller-saved register but the caller is spilling it anyway to the stack (to comply with the ABI), we should detect these cases and remove this unnecessary move. (cherry picked from FBD4337238)	2016-12-05 11:47:08 -08:00
Bill Nell	3a3dfc3dc2	BOLT: Use profiling info to control branch simplification optimization. Summary: An optimization to simplify conditional tail calls by removing unnecessary branches. It adds the following two command line options: -simplify-conditional-tail-calls - simplify conditional tail calls by removing unnecessary jumps -sctc-mode - mode for simplify conditional tail calls =always - always perform sctc =preserve - only perform sctc when branch direction is preserved =heuristic - use branch prediction data to control sctc This optimization considers both of the following cases: foo: ... jcc L1 original ... L1: jmp bar # TAILJMP -> foo: ... jcc bar iff jcc L1 is expected ... L1 is unreachable OR foo: ... jcc L2 L1: jmp dest # TAILJMP L2: ... -> foo: jncc dest # TAILJMP L2: ... L1 is unreachable For this particular case, the first basic block ends with a conditional branch and has two successors, one fall-through and one for when the condition is true. The target of the conditional is a basic block with a single unconditional branch (i.e. tail call) to another function. We don't care about the contents of the fall-through block. (cherry picked from FBD3719617)	2016-09-22 18:08:20 -07:00
Rafael Auler	06caefdb1d	Fix typo in time passes Summary: Previously NamedRegionTimer's constructor was being called with no local variable associated with it owing to a typo. We need a local variable to keep track of the time spent in the scope. At the end of the scope, the destructor will be called an then the timer will stop. (cherry picked from FBD4301844)	2016-12-08 13:34:56 -08:00
Rafael Auler	c570038d31	Add option to time passes Summary: As we begin to work on optimization passes for bolt, it is important to keep track of the time spent in each of these to measure their contribution to the time bolt takes to finish rewriting a program. (cherry picked from FBD4301136)	2016-12-08 12:15:20 -08:00
Rafael Auler	3888c5604f	Remove unused private var in CFIReaderWriter (NFC) Summary: This member variable is dead. (cherry picked from FBD4255342)	2016-11-30 16:03:53 -08:00
Rafael Auler	5c0e4b6a57	Fix undefined behavior in DebugInfo Summary: The CFI instructions parser in libDebugInfo was relying on undefined behavior to parse operands by assuming the order function parameters are evaluated in a function call site is defined (it is not). This patch fix this and makes our clang and gcc tests agree. It also fixes wrong LIT tests in our codebase with respect to the order of DW_CFA_def_cfa operands. (cherry picked from FBD4255227)	2016-11-30 15:52:24 -08:00
Rafael Auler	a331fa396b	Fix memory leak in DWARFRewriter Summary: Clang's Address Sanitizer caught this leak where MCAsmBackend and MCObjectWriter instances were being created but not freed. Fix this. (cherry picked from FBD4249941)	2016-11-29 20:11:32 -08:00
Rafael Auler	5cc9c58410	Avoid const_iterator on std::vector::emplace Summary: This is part of a series of clean-up patches to make bolt cleanly compile with clang 4.0. This patch fixes an error where clang will fail to compile because it does not support passing a const_iterator to std::vector<T>::emplace(Iter, ...). (cherry picked from FBD4242546)	2016-11-28 17:45:25 -08:00
Rafael Auler	b21bc02ac4	Remove pessimizing std::move Summary: This is part of a series of clean-up patches to make bolt cleanly compile with clang 4.0. This patch fixes the following warning: moving a temporary object prevents copy elision (cherry picked from FBD4242236)	2016-11-28 17:25:17 -08:00
Rafael Auler	7115706d02	Fix clang warning about switch covering all enums Summary: This is part of a series of clean-up patches to make bolt cleanly compile with clang 4.0. This patch fixes the following warning: default label in switch which covers all enumeration values (cherry picked from FBD4242168)	2016-11-28 17:17:14 -08:00
Maksim Panchenko	ac2621fbf4	Add stats for "-optimize-bodyless-functions". Summary: Print the number of calls eliminated. (cherry picked from FBD4010698)	2016-10-12 13:08:52 -07:00
Rafael Auler	8609ad51e5	Detect default CFI frame instructions for the target Summary: Make BOLT resilient to changes in the LLVM's X86 target library by not hardwiring the list of default CIE instructions, but detecting it at run time. (cherry picked from FBD4200982)	2016-11-17 14:56:42 -08:00
Maksim Panchenko	a7fb610eba	Relocate old .eh_frame section next to the new one. Summary: In order to improve gdb experience with BOLT we have to make sure the output file has a single .eh_frame section. Otherwise gdb will use either old or new section for unwinding purposes. This diff relocates the original .eh_frame section next to the new one generated by LLVM. Later we merge two sections into one and make sure only the newly created section has .eh_frame name. (cherry picked from FBD4203943)	2016-11-11 14:33:34 -08:00
Maksim Panchenko	809c28f585	Generate .eh_frame_hdr based on contents of .eh_frame's. Summary: We used to patch an existing .eh_frame_hdr and append contents for split functions at the end. However, this approach does not work in relocation mode since function addresses change and split functions will not necessarily be at the end. Instead of patching and appending we generate the new .eh_frame_hdr based on contents of old and new .eh_frame sections. (cherry picked from FBD4180756)	2016-11-14 16:39:55 -08:00
Maksim Panchenko	055dfe48e7	Another EH fix for cold fragments of functions that we fail to write. Summary: In a prev diff I disabled inclusion of FDEs for cold fragments that we fail to write. The side effect of it was that we failed to write FDE for the next function with a cold fragment since it had the same assigned address that we had put in FailedAddresses. The correct fix is to assign zero address to failed cold fragments and ignore them when we write .eh_frame_hdr. (cherry picked from FBD4156740)	2016-11-09 11:19:02 -08:00
Rafael Auler	355dbd769e	Fix DW_CFA_def_cfa CFI duping in output binary Summary: CFI instructions may live in CIEs or FDEs. CIEs hold common instructions used across many FDEs. When replaying CFIs to the output binary, llvm-bolt needs to replay both instructions from CIE and the corresponding FDE for the function. However, some instructions need not to be replayed because MCStreamer/MCDwarf and friends will write them by default in the output CIE. This patch fix the code that tried to recognize one of these default instructions but was failing, resulting in an extra CFI instruction in each FDE we outputted. With this patch, the output binary should be a bit smaller. (cherry picked from FBD4194753)	2016-11-16 17:47:31 -08:00
Rafael Auler	bc8cb088c0	Support DWARF expressions in CFI instructions Summary: Modify the MC layer (MCDwarf.h\|cpp) to understand CFI instructions dealing with DWARF expressions. Add code to emit DWARF expressions in MCDwarf. Change llvm-bolt to pass these CFI instructions to streamer instead of bailing on them. Change -dump-eh-frame option in llvm-bolt to dump the EH frame of the rewritten binary in addition to the one in the original binary, allowing us to proper test this patch. (cherry picked from FBD4194452)	2016-11-15 10:40:00 -08:00
Maksim Panchenko	99dce7d05e	Disable processing of functions with EVEX-encoded instructions (AVX-512). Summary: AVX-512 disassembler support in LLVM is not quite ready yet. Before we feel more comfortable about it we disable processing of all functions that use any EVEX-encoded instructions. (cherry picked from FBD4028706)	2016-10-16 18:56:56 -07:00
Maksim Panchenko	0eb2559fee	Fix EH for cold fragments that we fail to write. Summary: When we fail to write functions that are too big, we have to effectively cancel their effect on exception handling by ignoring their FDE entries in .eh_frame while writing .eh_frame_hdr. This can happen to functions that we split too. In such cases the cold part has its own FDE and we have to ignore that one too. This doesn't happen very often - I've only seen one case on hhvm binary, however it is a potential issue. The fix is to add the cold part address to the list of failed-to-write addresses. (cherry picked from FBD3987984)	2016-10-07 09:34:16 -07:00
Maksim Panchenko	e241e9c156	New function discovery and support for multiple entries. Summary: Modified function discovery process to tolerate more functions and symbols coming from assembly. The processing order now matches the memory order of the functions (input symbol table is unsorted). Added basic support for functions with multiple entries. When a function references its internal address other than with a branch instruction, that address could potentially escape. We mark such addresses as entry points and make sure they are treated as roots by unreachable code elimination. Without relocations we have to mark multiple-entry functions as non-simple. (cherry picked from FBD3950243)	2016-09-29 11:19:06 -07:00
Maksim Panchenko	9cf5d74ffb	Support for PIC-style jump tables. Summary: Added support for jump tables in code compiled with "-fpic". Code pattern generated for position-independent jump tables is quite different, as is the format of the tables. More details in comments. Coverage increased slightly for a test, mostly due to the code coming from external lib that was compiled with "-fpic". (cherry picked from FBD3940771)	2016-09-27 19:09:38 -07:00
Bill Nell	4a0c494bc1	BOLT: Remove restrictions on unreachable code elimination Summary: Allow UCE when blocks have EH info. Since UCE may remove blocks that are referenced from debugging info data structures, we don't actually delete them. We just mark them with an "invalid" index and store them in a different vector to be cleaned up later once the BinaryFunction is destroyed. The debugging code just skips any BBs that have an invalid index. Eliminating blocks may also expose useless jmp instructions, i.e. a jmp around a dead block could just be a fallthrough. I've added a new routine to cleanup these jmps. Although, @maks is working on changing fixBranches() so that it can be used instead. (cherry picked from FBD3793259)	2016-09-07 18:59:23 -07:00
Maksim Panchenko	4464861a02	Support for splitting jump tables. Summary: Add level for "-jump-tables=<n>" option: 1 - all jump tables are output in the same section (default). 2 - basic splitting, if the table is used it is output to hot section otherwise to cold one. 3 - aggressively split compound jump tables and collect profile for all entries. Option "-print-jump-tables" outputs all jump tables for debugging and/or analyzing purposes. Use with "-jump-tables=3" to get profile values for every entry in a jump table. (cherry picked from FBD3912119)	2016-09-16 15:54:32 -07:00
Bill Nell	ecc4b9e713	BOLT: Add ud2 after indirect tailcalls. Summary: Insert ud2 instructions after indirect tailcalls to prevent the CPU from decoding instructions following the callsite. A simple counter in the peephole pass shows 3260 tail call traps inserted. (cherry picked from FBD3859737)	2016-09-13 15:16:11 -07:00
Bill Nell	2f1341b51d	BOLT: Refactoring BinaryFunction interface. Summary: Get rid of all uses of getIndex/getLayoutIndex/getOffset outside of BinaryFunction. Also made some other offset related methods private. (cherry picked from FBD3861968)	2016-09-13 20:32:12 -07:00
Bill Nell	510f227cbd	BOLT: Add feature to sort functions by dyno stats. Summary: Add -print-sorted-by and -print-sorted-by-order command line options. The first option takes a list of dyno stats keys used to sort functions that are printed at the end of all optimization passes. Only the top 100 functions are printed. The -print-sorted-by-order option can be either ascending or descending (descending is the default). (cherry picked from FBD3898818)	2016-09-20 20:55:49 -07:00
Maksim Panchenko	62bff426c3	Do no collect dyno stats on functions with stale profile. Summary: Dyno stats collected on functions with invalid profile may appear completely bogus. Skip them. (cherry picked from FBD3879371)	2016-09-16 13:13:16 -07:00
Maksim Panchenko	2c9bf9afd6	Add PLT dyno stats. Summary: Get PLT call stats. (cherry picked from FBD3874799)	2016-09-15 15:47:10 -07:00
Maksim Panchenko	c4e36c1dd6	Fix issue with zero-size duplicate function symbols. Summary: While working on PLT dyno stats I've noticed that we were missing BinaryFunctions for some symbols that were not PLT. Upon closer inspection turned out that those symbols were marked as zero-sized functions in symbol table, but they had duplicates with non-zero size. Since the zero-size symbols were preceding other duplicates, we were not creating BinaryFunction for them and they were not added as duplicates. The 2 most prominent functions that were missing for a test were free() and malloc(). There's not much to optimize in these functions, but they were contributing quite significantly to dyno stats. As a result dyno stats for this test needed an adjustment. Also several assembly functions (e.g. _init()) had zero size, and now we set the size to the max size and start processing those. It's good for coverage but will not affect the performance. (cherry picked from FBD3874622)	2016-09-15 15:47:10 -07:00
Maksim Panchenko	8dbf0e2b3d	Add dyno stats for jump tables. Summary: Add dyno stats for jump tables. (cherry picked from FBD3871035)	2016-09-15 10:24:22 -07:00
Maksim Panchenko	2f3a859772	Add experimental jump table support. Summary: Option "-jump-tables=1" enables experimental support for jump tables. The option hasn't been tested with optimizations other than block re-ordering. Only non-PIC jump tables are supported at the moment. (cherry picked from FBD3867849)	2016-09-14 16:45:40 -07:00
Bill Nell	7483cd0fa6	BOLT: Clean up interface between BinaryFunction and BinaryBasicBlock. Summary: This is just a bit of refactoring to make sure that BinaryFunction goes through methods to get at the state in BinaryBasicBlock. I did this so that changing the way Index/LayoutIndex/Valid works will be easier. (cherry picked from FBD3860899)	2016-09-13 17:12:00 -07:00
Maksim Panchenko	b0f4031db3	Add cluster randomization layout algorithm. Summary: Add "-reorder-blocks=cluster-shuffle" for performance experiments. Use "-bolt-seed=<N>" to set a randomization seed. (cherry picked from FBD3851035)	2016-09-11 14:33:58 -07:00
Maksim Panchenko	52bfc3f92f	Fix switch table detection. Disassemble all instructions in non-simple functions. Summary: Switch table can contain __builtin_unreachable(). As a result, a compiler may place an entry into a jump table that contains an address immediately past the last instruction in the function. Sometimes it may coincide with a start of the next function in the binary. Thus when we check for switch tables in such cases we have to check more than a single entry until we see either an address inside containing function or some address outside different from the address past the last instruction. Additonally, don't stop disassembly after discovering that the function was not simple. We need to detect all outside references whenever possible. (cherry picked from FBD3850825)	2016-09-12 10:12:31 -07:00
Bill Nell	861d5a1586	BOLT: Remove double jumps peephole. Summary: Replace jumps to other unconditional jumps with the final destination, e.g. B0: ... jmp B1 (or jcc B1) B1: jmp B2 -> B0: ... jmp B2 (or jcc B1) This peephole removes 8928 double jumps from a test binary. Note: after filtering out double jumps found in EH code and infinite loops, the number of double jumps patched is 49 (24 for a clang compiled test). The 24 in the clang build are all from external libraries which have probably been compiled with gcc. This peephole is still useful for cleaning up after ICP though. (cherry picked from FBD3815420)	2016-09-02 18:09:07 -07:00
Maksim Panchenko	617c6a13b7	Use BB.getNumNonPseudos() in more places. Summary: Use BB.getNumNonPseudos() in more places. Fix analyze_potential script to pass the new parameter. (cherry picked from FBD3844416)	2016-09-09 14:42:35 -07:00
Bill Nell	71be567969	BOLT: Add per pass dyno stats + factor out post pass printing. Summary: I've added dyno stats printing per pass so we can see the results of each optimization pass on the stats. I've also factored out the post pass function printing code since it was pretty much the same after each pass. (cherry picked from FBD3843587)	2016-09-09 12:37:37 -07:00
Maksim Panchenko	c4c518ee9d	Rewrite SCTC pass to do UCE and make it the last optimization pass. Summary: For now we make SCTC a special pass that runs at the end of all optimizations and transformations right after fixupBranches(). Since it's the last pass, it has to do its own UCE. (cherry picked from FBD3838051)	2016-09-08 14:52:26 -07:00
Maksim Panchenko	6bef336cc2	Add dyno stats to BOLT. Summary: Add "-dyno-stats" option that prints instruction stats based on the execution profile similar to below: BOLT-INFO: program-wide dynostats after optimizations: executed forward branches : 109706407 (+8.1%) taken forward branches : 13769074 (-55.5%) executed backward branches : 24517582 (-25.0%) taken backward branches : 15330256 (-27.2%) executed unconditional branches : 6009826 (-35.5%) function calls : 17192114 (+0.0%) executed instructions : 837733057 (-0.4%) total branches : 140233815 (-2.3%) taken branches : 35109156 (-42.8%) Also fixed pseudo instruction discrepancies and added assertions for BinaryBasicBlock::getNumPseudos() to make sure the number is synchronized with real number of pseudo instructions. (cherry picked from FBD3826995)	2016-08-29 21:11:22 -07:00
Maksim Panchenko	17e691915b	Make BinaryFunction::fixBranches() more flexible and support CFG updates. Summary: The CFG represents "the ultimate source of truth". Transformations on functions and blocks have to update the CFG and fixBranches() would make sure the correct branch instructions are inserted at the end of basic blocks (or removed when necessary). We do require a conditional branch at the end of the basic block if the block has 2 successors as CFG currently lacks the conditional code support (it will probably stay that way). We only use this branch instruction for its conditional code, the destination is determined by CFG - first successor representing true/taken branch, while the second successor - false/fall-through branch. When we reverse the branch condition, the CFG is updated accordingly. The previous version used to insert jumps after some terminating instructions sometimes resulting in a larger code than needed. As a result with the new version 1 extra function becomes overwritten for HHVM binary. With this diff we also convert conditional branches with one successor (result of code from __builtin_unreachable()) into unconditional jumps. (cherry picked from FBD3802062)	2016-08-29 21:11:22 -07:00
Bill Nell	48b55300e0	BOLT: Make most command line options ZeroOrMore. Summary: This will make it easier to run experiments with the same baseline BOLT binary but different command line options. (cherry picked from FBD3831978)	2016-09-07 14:41:56 -07:00
Bill Nell	dcaffe64d3	Inlining fixes/enhancements Summary: A number of fixes/enhancements to inline-small-functions - Fixed size estimateHotSize to use computeCodeSize instead of the original layout offsets. - Added -print-inline option to dump CFGs for functions that have been modified by inlining. - Added flag to force consideration of functions without any profiling info (mostly for testing) - Updated debug line info for inlined functions. - Ignore the number of pseudo instructions when checking for candidates of suitable size. Misc changes - Moved most print flags to BinaryPasses.cpp (cherry picked from FBD3812658)	2016-09-02 11:58:53 -07:00
Maksim Panchenko	1cf200107e	Fix tail call conversion and test cases. Summary: A previous diff accidentally disabled tail call conversion. Additionally some test cases relied on output of "-v=2". Fix those. (cherry picked from FBD3823760)	2016-09-06 13:19:26 -07:00
Bill Nell	c27a6a5c63	Add verbosity level and clean up stream usage. Summary: I've added a verbosity level to help keep the BOLT spewage to a minimum. The default level is pretty terse now, level 1 is closer to the original, I've saved level 2 for the noisiest of messages. Error messages should never be suppressed by the verbosity level only warnings and info messages. The rational behind stream usage is as follows: outs() for info and debugging controlled by command line flags. errs() for errors and warnings. dbgs() for output within DEBUG(). With the exception of a few of the level 2 messages I don't have any strong feelings about the others. (cherry picked from FBD3814259)	2016-09-02 14:15:29 -07:00
Maksim Panchenko	43acb6a28a	Emit remember_state CFI in the same code region as restore_state. Summary: While creating remember_state/restore_state CFI sequences, we were always placing remember_state instruction into the first basic block. However, when we have hot-cold splitting, the cold part has and independent FDE entry in .eh_frame, and thus the restore_state instruction was missing its counter part. The fix is to adjust the basic block that is used for placing remember_state instruction whenever we see the hot-cold split boundary. (cherry picked from FBD3767102)	2016-08-24 14:25:33 -07:00
Maksim Panchenko	97f598fd17	Handling for indirect tail calls. Summary: Analyze indirect branches and convert them into indirect tail calls when possible. We analyze the memory contents when the address could be calculated statically and also detect epilogue code. (cherry picked from FBD3754395)	2016-08-22 14:24:09 -07:00
Maksim Panchenko	42c5894fe2	Write padding for .eh_frame_hdr to a file. Summary: We were applying padding to the calculated address but were never writing it to a file triggering an assertion for cases when .gcc_except_table size wasn't multiple of 4. (cherry picked from FBD3744638)	2016-08-19 13:54:35 -07:00
Maksim Panchenko	a10fb73ab3	Compute ClusterEdges only when necessary. Summary: We only need ClusterEdges in reordering algorithm optimized for branches and the computation is quite resource-hungry, thus it makes sense to only do it when needed. Some refactoring too. (cherry picked from FBD3721107)	2016-08-15 15:37:00 -07:00
Bill Nell	c1d1c2e7cd	Check if operands are immediates before trying shortening. Summary: Operands in the initial instruction stream should all have immediate operands for instructions that can be shortened. But if a BOLT optimization pass adds one of these instructions with a symbolic operand, the shortening operation will assert. This diff adds checks to make sure that the operands are immediate. I've also disabled shortening pass by default since it won't really be needed until ICP is submitted. It will still run at CFG creation time. (cherry picked from FBD3610646)	2016-07-22 20:52:57 -07:00
Bill Nell	406aa62083	Add additional info to BOLT graphviz CFG dumps. Summary: Add the following info the graphviz CFG dump: - Edges are labeled with the jmp instruction that leads to that edge. - Edges include the count and misprediction count. - Nodes have (offset, BB index, BB layout index) - Nodes optionally have tooltips which contain the code of the basic block. (enabled with -dot-tooltip-code) - Added dashed edges to landing pads. (cherry picked from FBD3646568)	2016-07-29 19:18:37 -07:00
Maksim Panchenko	003d106c0b	More refactoring work. Summary: Avoid referring to BinaryFunction's by name. Functions could be found by MCSymbol using BinaryContext::getFunctionForSymbol(). (cherry picked from FBD3707685)	2016-08-11 14:23:54 -07:00
Maksim Panchenko	36df6057b0	Refactoring. Mainly NFC. Summary: Eliminated BinaryFunction::getName(). The function was confusing since the name is ambigous. Instead we have BinaryFunction::getPrintName() used for printing and whenever unique string identifier is needed one can use getSymbol()->getName(). In the next diff I'll have a map from MCSymbol to BinaryFunction in BinaryContext to facilitate function lookup from instruction operand expressions. There's one bug fixed where the function was called only under assert() in ICF::foldFunction(). For output we update all symbols associated with the function. At the moment it has no effect on the generated binary but in the future we would like to have all symbols in the symbol table updated. (cherry picked from FBD3704790)	2016-08-07 12:35:23 -07:00
Theodoros Kasampalis	32739247eb	More aggressive inlining pass Summary: This adds functionality for a more aggressive inlining pass, that can inline tail calls and functions with more than one basic block. (cherry picked from FBD3677856)	2016-07-29 14:17:06 -07:00
Bill Nell	82d76ae18b	Add MCInst annotation mechanism to MCInstrAnalysis class. Summary: Add three new MCOperand types: Annotation, LandingPad and GnuArgsSize. Annotation is used for associating random data with MCInsts. Clients can construct their own annotation types (subclassed from MCAnnotation) and associate them with instructions. Annotations are looked up by string keys. Annotations can be added, removed and queried using an instance of the MCInstrAnalysis class. The LandingPad operand is a MCSymbol, uint64_t pair used to encode exception handling information for call instructions. GnuArgsSize is used to annotate calls with the DW_CFA_GNU_args_size attribute. (cherry picked from FBD3597877)	2016-07-28 10:34:50 -07:00
Theodoros Kasampalis	713e361f36	Fix for correct disassembling of conditional tail calls. Summary: BOLT attempts to convert jumps that serve as tail calls to dedicated tail call instructions, but this is impossible when the jump is conditional because there is no corresponding tail call instruction. This was causing the creation of a duplicate fall-through edge for basic blocks terminated with a conditional jump serving as a tail call when there is profile data available for the non-taken branch. In this case, the first fall-through edge had a count taken from the profile data, while the second has a count computed (incorrectly) by BinaryFunction::inferFallThroughCounts. (cherry picked from FBD3560504)	2016-07-13 18:57:40 -07:00
Maksim Panchenko	486ab273c7	Add printing support for indirect tail calls. Summary: LLVM was missing assembler print string for indirect tail calls which are synthetic instructions created by us. (cherry picked from FBD3640197)	2016-07-28 18:49:48 -07:00
Bill Nell	50e011f4e5	CFG editing functions Summary: This diff adds a number of methods to BinaryFunction that can be used to edit the CFG after it is created. The basic public functions are: - createBasicBlock - create a new block that is not inserted into the CFG. - insertBasicBlocks - insert a range of blocks (made with createBasicBlock) into the CFG. - updateLayout - update the CFG layout (either by inserting new blocks at a certain point or recomputing the entire layout). - fixFallthroughBranch - add a direct jump to the fallthrough successor for a given block. There are a number of private helper functions used to implement the above. This was split off the ICP diff to simplify it a bit. (cherry picked from FBD3611313)	2016-07-23 12:50:34 -07:00
Theodoros Kasampalis	ab599fe71a	Basic block clustering algorithm for minimizing branches. Summary: This algorithm is similar to our main clustering algorithm but uses a different heuristic for selecting edges to become fall-throughs. The weight of an edge is calculated as the win in branches if we choose to layout this edge as a fall-through. For example, the edges A -> B with execution count 100 and A -> C with execution count 500 (where B and C are the only successors of A) have weights -400 and +400 respectively. (cherry picked from FBD3606591)	2016-07-15 16:11:30 -07:00
Theodoros Kasampalis	a9bb3320ad	Identical Code Folding (ICF) pass Summary: Added an ICF pass to BOLT, that can recognize identical functions and replace references to these functions with references to just one representative. (cherry picked from FBD3460297)	2016-06-09 11:36:55 -07:00
Bill Nell	82401630a2	Factor out instruction printing and size computation. Summary: I've factored out the instruction printing and size computation routines to methods on BinaryContext. I've also added some more debug print functions. This was split off the ICP diff to simplify it a bit. (cherry picked from FBD3610690)	2016-07-23 08:01:53 -07:00
Theodoros Kasampalis	156a55209c	Simplification of loads from read-only data sections. Summary: Instructions that load data from the a read-only data section and their target address can be computed statically (e.g. RIP-relative addressing) are modified to corresponding instructions that use immediate operands. We apply the transformation only when the resulting instruction will have smaller or equal size. (cherry picked from FBD3397112)	2016-06-03 00:58:11 -07:00
Theodoros Kasampalis	17b846586c	Loop detection for BOLT's CFG. Summary: Loop detection for the CFG data structure. Added a GraphTraits specialization for BOLT's CFG that allows us to use LLVM's loop detection interface. (cherry picked from FBD3604837)	2016-05-26 10:58:01 -07:00
Bill Nell	ea53cffb2d	Add movabs -> mov shortening optimization. Add peephole optimization pass that does instruction shortening. Summary: Shorten when a mov instruction has a 64-bit immediate that can be repesented as a sign extended 32-bit number, use the smaller mov instruction (MOV64ri -> MOV64ri32). Add peephole optimization pass that does instruction shortening. (cherry picked from FBD3603099)	2016-07-21 16:40:06 -07:00
Maksim Panchenko	c6d0c568d4	Add BinaryContext::getSectionForAddress() Summary: Interface for accessing section from BinaryContext. (cherry picked from FBD3600854)	2016-07-21 12:45:35 -07:00
Maksim Panchenko	f2d82919d0	Move debug-handling code into DWARFRewriter (NFC). Summary: RewriteInstance.cpp is getting too big. Split the code. (cherry picked from FBD3596103)	2016-05-31 19:12:26 -07:00
Maksim Panchenko	bf46263eed	Shorten instructions if possible. Summary: Generate short versions of branch instructions by default and rely on relaxation to produce longer versions when needed. Also produce short versions of arithmetic instructions if immediate fits into one byte. This was only triggered once on HHVM binary. (cherry picked from FBD3591466)	2016-07-19 11:19:18 -07:00
Bill Nell	674dbcc0de	Fix crash in patchELFPHDRTable when no functions are modified. Summary: patchELFPHDRTable was asserting that it could not find an entry for .eh_frame_hdr in SectionMapInfo when no functions were modified by BOLT. This just changes code to skip modifying GNU_EH_FRAME program headers hen SectionMapInfo is empty. The existing header is copied and written instead. (cherry picked from FBD3557481)	2016-07-12 16:43:53 -07:00
Maksim Panchenko	84b5b9e462	Create alternative name for local symbols. Summary: If a profile data was collected on a stripped binary but an input to BOLT is unstripped, we would use a different mangling scheme for local functions and ignore their profiles. To solve the issue this diff adds alternative name for all local functions such that one of the names would match the name in the profile. If the input binary was stripped, we reject it, unless "-allow-stripped" option was passed. It's more complicated to do a matching in this case since we have less information than at the time of profile collection. It's also not that simple to tell if the profile was gathered on a stripped binary (in which case we would have no issue matching data). (cherry picked from FBD3548012)	2016-07-11 18:51:13 -07:00
Bill Nell	bdd4af2134	Store index inside BinaryBasicBlock instead of in map on BinaryFunction. Summary: Store the basic block index inside the BinaryBasicBlock instead of a map in BinaryFunction. This cut another 15-20 sec. from the processing time for hhvm. (cherry picked from FBD3533606)	2016-07-07 21:43:43 -07:00
Bill Nell	90c9323511	Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices. Summary: Use unordered_map instead of map in ReorderAlgorithm and BinaryFunction::BasicBlockIndices. Cuts about 30sec off the processing time for the hhvm binary. (~8.5 min to ~8min) (cherry picked from FBD3530910)	2016-07-07 11:48:50 -07:00
Theodoros Kasampalis	c20506c570	Fix in inferFallthroughCounts Summary: This fixes the initialization of basic block execution counts, where we should skip edges to the first basic block but we were not skipping the corresponding profile info. Also, I removed a check that was done twice. (cherry picked from FBD3519265)	2016-07-03 21:30:35 -07:00
Bill Nell	260f6fbdb6	Add option to dump CFGs in (simple) graphviz format during all passes. Summary: I noticed the BinaryFunction::viewGraph() method that hadn't been implemented and decided I could use a simple DOT dumper for CFGs while working on the indirect call optimization. I've implemented the bare minimum for the dumper. It's just nodes+BB labels with dges. We can add more detailed information as needed/desired. (cherry picked from FBD3509326)	2016-07-01 08:40:56 -07:00
Theodoros Kasampalis	6eb4e5b687	perf2bolt can extract branch records with histories Summary: Added perf2bolt functionality for extracting branch records with histories of previous branches. The length of the histories is user defined, and the default is 0 (previous functionality). Also, DataReader can parse perf2bolt output with histories. Note: creating profile data with long histories can increase their size significantly (2x for history of length 1, 3x for length 2 etc). (cherry picked from FBD3473983)	2016-06-21 18:44:42 -07:00
Theodoros Kasampalis	287fa51324	Fix for ignoring fall-through profile data when jump is followed by no-op Summary: When a conditional jump is followed by one or more no-ops, the destination of fall-through branch was recorded as the first no-op in FuncBranchInfo. However the fall-through basic block after the jump starts after the no-ops, so the profile data could not match the CFG and was ignored. (cherry picked from FBD3496084)	2016-06-27 14:51:38 -07:00
Theodoros Kasampalis	d09b00ebff	Refactoring of the reordering algorithms Summary: The various reorder and clustering algorithms have been refactored into separate classes, so that it is easier to add new algorithms and/or change the logic of algorithm selection. (cherry picked from FBD3473656)	2016-06-16 18:47:57 -07:00
Maksim Panchenko	f1192a7118	Support for multiple function names. Summary: With ICF optimization in the linker we were getting mismatches of function names in .fdata and BinaryFunction name. This diff adds support for multiple function names for BinaryFunction and does a match against all possible names for the profile. (cherry picked from FBD3466215)	2016-06-10 17:13:05 -07:00
Maksim Panchenko	70f82d9371	Reject profile data for functions that do not match. Summary: Verify profile data for a function and reject if there are branches that don't correspond to any branches in the function CFG. Note that we have to ignore branches resulting from recursive calls. Fix printing instruction offsets in disassembled state. Allow function to have non-zero execution count even if we don't have branch information. (cherry picked from FBD3451596)	2016-06-15 18:36:16 -07:00
Maksim Panchenko	88ac5d9d0e	[merge-fdata] Add option to print function list. Summary: Print total number of functions/objects that have profile and add new options: -print - print the list of objects with count to stderr =none - do not print objects/functions =exec - print functions sorted by execution count =branches - print functions sorted by total branch count -q - do not print merged data to stdout (cherry picked from FBD3442288)	2016-06-09 17:45:15 -07:00
Bill Nell	980a06265a	Revert "Indirect call optimization." This reverts commit 33966090e18545b64013614e7929ff1bdcdf10d5. (cherry picked from FBD28110782)	2016-06-08 17:38:13 -07:00
Bill Nell	8bcfd9a392	Indirect call optimization. (cherry picked from FBD28110629)	2016-06-07 16:27:52 -07:00
Bill Nell	45e2219ae4	Allocate BinaryBasicBlocks with new rather than storing them in the BasicBlocks vector. Summary: This will help optimization passes that need to modify the CFG after it is constructed. Otherwise, the BinaryBasicBlock pointers stored in the layout, successors and predecessors would need to be modified every time a new basic block is created. (cherry picked from FBD3403372)	2016-06-07 16:27:52 -07:00
Maksim Panchenko	6da0d95326	Fix large functions debug info by default. Summary: Turn on -fix-debuginfo-large-functions by default. In the process of testing I've discovered that we output cold code for functions that were too large to be emitted. Fixed that. (cherry picked from FBD3372697)	2016-05-31 19:29:34 -07:00
Maksim Panchenko	4460da0d81	Improvements for debug info. Summary: Assembly functions could have no corresponding DW_AT_subprogram entries, yet they are represented in module ranges (and .debug_aranges) and will have line number information. Make sure we update those. Eliminated unnecessary data structures and optimized some passes. For .debug_loc unused location entries are no longer processed resulting in smaller output files. Overall it's a small processing time improvement and memory imporement. (cherry picked from FBD3362540)	2016-05-27 20:19:19 -07:00
Theodoros Kasampalis	65ac8bbdf2	Better edge counts for fall through blocks in presence of C++ exceptions. Summary: The inference algorithm for counts of fall through edges takes possible jumps to landing pad blocks into account. Also, the landing pad block execution counts are updated using profile data. (cherry picked from FBD3350727)	2016-05-26 15:10:09 -07:00
Theodoros Kasampalis	485f9220b7	Taking LP counts into account for FT count inference (cherry picked from FBD28110493)	2016-05-24 09:26:25 -07:00
Theodoros Kasampalis	fb5f18b2dc	Correctly updating landing pad exec counts. (cherry picked from FBD28110316)	2016-05-23 16:16:25 -07:00
Maksim Panchenko	06b9c5b342	Better .debug_line for non-simple functions. Summary: Generate .debug_line info for non-simple functions in a way that if preferrable by 'objdump -S'. (cherry picked from FBD3345485)	2016-05-24 20:50:36 -07:00
Maksim Panchenko	7b97793b94	Fix for clang .debug_info. Summary: Clang uses different attribute for high_pc which was incompatible with the way we were updating ranges. This diff fixes it. (cherry picked from FBD3345537)	2016-05-24 14:54:23 -07:00
Maksim Panchenko	cfa5d753eb	Miscellaneous fixes for debug info. Summary: * Fix several cases for handling debug info: - properly update CU DW_AT_ranges for function with folded body due to ICF optimization - convert ranges to DW_AT_ranges from hi/low PC for all DIEs - add support for [a, a) range - update CU ranges even when there are no functions registered * Overwrite .debug_ranges section instead of appending. * Convert assertions in debug info handling part into warnings. (cherry picked from FBD3339383)	2016-05-23 19:36:38 -07:00
Maksim Panchenko	7ab3db129b	Create DW_AT_ranges for compile units. Summary: Some compile unit DIEs might be missing DW_AT_ranges because they were compiled without "-ffunction-sections" option. This diff adds the attribute to all compile units. If the section is not present, we need to create it. Will do it in a separate diff. (cherry picked from FBD3314984)	2016-05-17 18:10:14 -07:00
Maksim Panchenko	f047b9d43a	Overwrite contents of .debug_line section. Summary: Overwrite contents of .debug_line section since we don't reference the original contents anymore. This saves ~100MB of HHVM binary. (cherry picked from FBD3314917)	2016-05-16 17:02:17 -07:00
Bill Nell	e63984f325	Patch forward jumping tail calls to prevent branch mispredictions. Summary: A simple optimization to prevent branch misprediction for tail calls. Convert the sequence: j<cc> L1 ... L1: jmp foo # tail call into: j<cc> foo but only if 'j<cc> foo' turns out to be a forward branch. (cherry picked from FBD3234207)	2016-05-02 12:47:18 -07:00
Maksim Panchenko	b445f5eb7b	Fix issue with garbage address in .debug_line. Summary: While emitting debug lines for a function we don't overwrite, we don't have a code section context that is needed by default writing routine. Hence we have to emit end_sequence after the last address, not at the end of section. (cherry picked from FBD3291533)	2016-05-11 19:13:38 -07:00
Bill Nell	f7e7e25b88	Put all optimization passes under the pass manager. Summary: Move eliminate unreachable code, block reordering, and CFI/exception fixup into official optimization passes. (cherry picked from FBD3248991)	2016-05-02 12:47:18 -07:00
Gabriel Poesia	5fa128e748	Inlining of small functions. Summary: Added an optimization pass of inlining calls to small functions (with only one basic block). Inlining is done in a very simple way, inserting instructions to simulate the changes to the stack pointer that call/ret would make before/after the inlined function executes. Also, the heuristic prefers to inline calls that happen in the hottest blocks (by looking at their execution count). Calls in cold blocks are ignored. (cherry picked from FBD3233516)	2016-04-25 14:25:58 -07:00
Gabriel Poesia	d1f525499e	Optimize calls to functions that are a single unconditional jump Summary: Many functions (around 600) in the HHVM binary are simply a single unconditional jump instruction to another function. These can be trivially optimized by modifying the call sites to directly call the branch target instead (because it also happens with more than one jump in sequence, we do it iteratively). This diff also adds a very simple analysis/optimization pass system in which this pass is the first one to be implemented. A follow-up to this could be to move the current optimizations to other passes. (cherry picked from FBD3211138)	2016-04-15 15:59:52 -07:00
Gabriel Poesia	e6acc7bb53	Optimize calls to functions that are a single unconditional jump Summary: Many functions (around 600) in the HHVM binary are simply a single unconditional jump instruction to another function. These can be trivially optimized by modifying the call sites to directly call the branch target instead (because it also happens with more than one jump in sequence, we do it iteratively). This diff also adds a very simple analysis/optimization pass system in which this pass is the first one to be implemented. A follow-up to this could be to move the current optimizations to other passes. (cherry picked from FBD3211138)	2016-04-15 15:59:52 -07:00
Gabriel Poesia	459eb8c230	Fix "Cannot update ranges for DIE at offset" error messages. Summary: Fix the error message by not printing it :) Explanation: a previous diff accidentally removed this error message from within the DEBUG macro, and it's expected that we'll have a bunch of them since a lot of the DIEs we try to update are empty or meaningless. For instance (and mainly), there is a huge number of lexical block DIEs with no attributes in .debug_info. In the first phase of collecting debugging info, we store the offsets of all these DIEs, only later to realize that we cannot update their address ranges because they have none. A better fix would be to check this earlier and not store offsets of DIEs we cannot update to begin with. (cherry picked from FBD3236923)	2016-04-28 12:55:35 -07:00
Maksim Panchenko	de95a5b6a4	Make merge-fdata generate smaller .fdata files. Summary: A lot of the space in the merged .fdata is taken by branches to and from [heap], which is jitted code. On different machines, or during different runs, jitted addresses are all different. We don't use these addresses, but we need branch info to get accurate function call counts. This diff treats all [heap] addresses the same, resulting in a simplified merged file. The size of the compressed file decreased from 70MB to 8MB. (cherry picked from FBD3233943)	2016-04-27 18:06:18 -07:00
Maksim Panchenko	1258903b54	Fix for functions in different segments. Summary: In a test binary some functions are placed in a segment preceding the segment containing .text section. As a result, we were miscalculating maximum function size as the calculation was based on addresses only. This diff fixes the calculation by checking if symbol after function belongs to the same section. If it does not, then we set the maximum function size based on the size of the containing section and not on the address distance to the next symbol. (cherry picked from FBD3229205)	2016-04-26 23:42:39 -07:00
Maksim Panchenko	3811673a0c	Option to break in given functions. Summary: Added option "-break-funcs=func1,func2,...." to coredump in any given function by introducing ud2 sequence at the beginning of the function. Useful for debugging and validating stack traces. Also renamed options containing "_" to use "-" instead. Also run hhvm test with "-update-debug-sections". (cherry picked from FBD3210248)	2016-04-21 09:54:33 -07:00
Maksim Panchenko	87a90ae133	Fix ninja install-* for BOLT utilities. Summary: Make sure we can install all tools needed for processing BOLT .fdata files such as perf2bolt, merge-fdata, etc. (cherry picked from FBD3223477)	2016-04-25 22:13:12 -07:00
Maksim Panchenko	ff68b34553	Tool to merge .fdata files. Summary: merge-fdata tool takes multiple .fdata files and outputs to stdout combined fdata. Takes about 2 seconds per each additional .fdata file with hhvm production data. (cherry picked from FBD3216430)	2016-04-08 12:18:06 -07:00
Maksim Panchenko	43bc4a09ad	Changed splitting options and fixed sorting. Summary: Splitting option now has different meanings/values. Since landing pads are mostly always cold/frozen, we should split them before anything else (we still check the execution count is 0). That's value '1'. Everything else goes on top of that and has increased value (2 - large functions, 3 - everything). Sorting was non-deterministic and somewhat broken for functions with EH ranges. Fixed that and added '-split-all-cold' option to outline all 0-count blocks. Fixed compilation of test cases. After my last commit the binaries were linked to wrong source files (i.e. debug info). Had to rebuild the binaries from updated sources. (cherry picked from FBD3209369)	2016-04-20 15:31:11 -07:00
Maksim Panchenko	4f44d60947	Special handling for GNU_args_size call frame instruction. Summary: GNU_args_size is a special kind of CFI that tells runtime to adjust %rsp when control is passed to a landing pad. It is used for annotating call instructions that pass (extra) parameters on the stack and there's a corresponding landing pad. It is also special in a way that its value is not handled by DW_CFA_remember_state/DW_CFA_restore_state instruction sequence that we utilize to restore the state after block re-ordering. This diff adds association of call instructions with GNU_args_size value when it's used. If the function does not use GNU_args_size, there is no overhead. Otherwise, we regenerate GNU_args_size instruction during code emission, i.e. after all optimizations and block-reordering. (cherry picked from FBD3201322)	2016-04-19 22:00:29 -07:00
Gabriel Poesia	ad344c4387	Group debugging info representation and serialization code. Summary: Moved the classes related to representing and serializing DWARF entities into a single header, DebugData.h. (cherry picked from FBD3153279)	2016-04-07 15:06:43 -07:00
Gabriel Poesia	f6c8929799	Fix debugging info for simple functions that we fail to rewrite. Summary: Simple functions which we fail to rewrite after optimizations were having wrong debugging information because the latter would reflect the optimized version of the function. There are only 48 functions (at this time) in this situation in the HHVM binary. The simple fix is to add another full pass. Another more complicated path, which will be more efficient, is to reset only the BinaryContext and emit again, but then we need to recreate all symbols in the new MCContext and update the pointers. I started taking this path but it started getting too complicated for only those 48 functions (needed to create a new map of global symbols, recreate landing pads - which needed to have the internal intermediate labels in the functions kept to be updated too, etc). Because the overhead is quite large (another full emission pass - around 4m30s here) and the impact is small I put this behind a new command-line flag which is off by default: -fix-debuginfo-large-functions. (cherry picked from FBD3166576)	2016-04-11 17:46:18 -07:00
Gabriel Poesia	0e77c53b89	Update address ranges of inlined functions and try/catch blocks. Summary: Update address ranges of inlined functions and try/catch blocks. This was missing and lead gdb to show weird information in a core dump we inspected because of the several nestings of inline in the call stack. This is very similar to Lexical Blocks, so the change is to basically generalize that code to do the same for DW_AT_try_block, DW_AT_catch_block and DW_AT_inlined_subroutine. (cherry picked from FBD3169417)	2016-04-12 11:41:03 -07:00
Maksim Panchenko	e16b5d8b78	Option to pass a file with list of functions to skip. Summary: Take "-skip_funcs_file=<file>" option and don't process any function listed in the <file>. (cherry picked from FBD3160226)	2016-04-08 19:30:27 -07:00
Gabriel Poesia	2694e58fa2	Update unmatched and nested subprogram DIEs. Summary: readelf was showing some errors because we weren't updating DIEs that were not shallow in the DIE tree, or DIEs of functions with addresses we don't recognize (mostly functions with address 0, which could have been removed by the Linker Script but still have debugging information there). These DIEs need to be updated because their abbreviations are patched. (cherry picked from FBD3159335)	2016-04-08 16:24:38 -07:00
Gabriel Poesia	665b03a464	Fix behavior with multiple functions with same address. Summary: We were updating only one DIE per function, but because the Linker Script may map multiple functions to the same address this would cause us to generate invalid debug info (as some DIEs weren't updated but their abbreviations were changed). (cherry picked from FBD3157263)	2016-04-08 11:55:42 -07:00
Gabriel Poesia	784f6a8773	Emit debug line information for non-simple functions. Summary: Non-simple functions aren't emitted, and thus didn't have line number information emitted. This diff emits it for those functions by extending LLVM's generation of the line number program to allow for absolute addresses (it is wholly symbolic), then iterating over the relevant line tables from the input and appending entries with absolute addresses to the line tables to be emited. This still leaves the simple but not overwritten functions unhandled (there were 48 in HHVM in my last run). However, I think that to fix them we'd need another pass, since by the time we realize a simple function wont't fit, debug line info was already written to the output. (cherry picked from FBD3148468)	2016-04-05 19:35:45 -07:00
Maksim Panchenko	e513bfd86d	Only set output ranges when updating dbg info. Summary: Save processing time by setting output ranges when needed. (cherry picked from FBD3148791)	2016-04-06 18:03:44 -07:00
Gabriel Poesia	4b4db40174	Update DWARF location lists after optimization. Summary: Summary: Update DWARF location lists in .debug_loc and pointers to them in .debug_info so that gdb can print variables which change location during their lifetime. The following changes were made: - Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists) - Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated). - Added representation for location lists (LocationList.h, BinaryContext.h) - Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp}) - After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp) - After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp) (cherry picked from FBD3130682)	2016-04-01 11:37:28 -07:00
Maksim Panchenko	4349b63144	Re-enable conditional function spitting under an option. Summary: Add a parameter value to "-split-functions=" option to allow splitting only when the function is too large to fit: 0 - never split 1 - split if too large to fit 2 - always split We may use this option when the profile data is not very precise. In that case excessive splitting may increase iTLB misses. (cherry picked from FBD3137700)	2016-03-31 16:38:49 -07:00
Gabriel Poesia	0a07d9bf88	Don't skip non-simple functions on function address ranges update. Summary: This fixes a problem in which bolt was generating a malformed .debug_info section on the bzip2 binary. The bug was the following: - A simple and a non-simple function shared an abbreviation - The abbreviation was patched to contain DW_AT_ranges because of the simple function - The non-simple function's data was not updated, but then it didn't match the layout expected by the abbreviation anymore And because we were already creating an address ranges list in .debug_ranges even for non-simple functions, it doesn't make sense not to use it anyway. (cherry picked from FBD3129219)	2016-04-01 15:09:34 -07:00
Gabriel Poesia	ffa9641e16	Update DWARF lexical blocks address ranges. Summary: Updates DWARF lexical blocks address ranges in the output binary after optimizations. This is similar to updating function address ranges except that the ranges representation needs to be more general, since address ranges can begin or end in the middle of a basic block. The following changes were made: - Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h - Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly. - Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists. - Added a representation for Lexical Blocks (LexicalBlock.h) - Small refactorings in DebugArangesWriter: -- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges -- Refactored it not to depend on BinaryFunction but instead on anything that can be assined an aoffset in .debug_ranges (added an interface for that) - Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp) - Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code) - Added small test case (lexical_blocks_address_ranges_debug.test) (cherry picked from FBD3113181)	2016-03-28 17:45:22 -07:00
Maksim Panchenko	e8ef8a5619	Speedup section remapping. Summary: Before this diff LLVM used to iterate over all sections to find the one with an address we want to remap. Since we have extremely large number of section this process is highly inefficient. Instead we add a new interface to remap a section with a given ID (which effectively is an index into an array of sections), and pass the ID instead of the address. This cuts down the processing time of hhvm binary by 10 seconds, and brings the total processing time to a little under 2 minutes. (cherry picked from FBD3110015)	2016-03-28 22:39:48 -07:00
Maksim Panchenko	595d0885d9	Populate function execution count while parsing fdata. Summary: Populate function execution count while parsing fdata. Before we used a quadratic algorithm to populate the execution count (had to iterate over all branches for every single function). Ignore non-symbol to non-symbol branches while parsing fdata. These changes combined drop HHVM processing time from 4 minutes 53 seconds down to 2 minutes 9 seconds on my devserver. Test case had to be modified since it contained irrelevant branches from PLT to libc. (cherry picked from FBD3106263)	2016-03-28 11:06:28 -07:00
Gabriel Poesia	466cbae866	Update subroutine address ranges in binary. Summary: [WIP] Update DWARF info for function address ranges. This diff currently does not work for unknown reasons, but I'm describing here what's the current state. According to both llvm-dwarf and readelf our output seems correct, but GDB does not interpret it as expected. All details go below in hope I missed something. I couldn't actually track the whole change that introduced support for what we need in gdb yet, but I think I can get to it (2007-12-04: Support lexical bocks and function bodies that occupy non-contiguous address ranges). I have reasons to believe gdb at least at some nges). The set of introduced changes was basically this: - After disassembly, iterate over the DIEs in .debug_info and find the ones that correspond to each BinaryFunction. - Refactor DebugArangesWriter to also write addresses of functions to .debug_ranges and track the offsets of function address ranges there - Add some infrastructure to facilitate patching the binary in simple ways (BinaryPatcher.h) - In RewriteInstance, after writing .debug_ranges already with function address ranges, for each function do: -- Find the abbreviation corresponding to the function -- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and DW_AT_high_pc with DW_AT_producer (I'll explain this hack below). Also patch the corresponding forms to DW_FORM_sec_offset and DW_FORM_string (null-terminated in-place string). -- Patch debug_info with the .debug_ranges offset in place of the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4 bytes whereas low_pc occupies 8), and write an arbitrary string in-place in the other 12 bytes that were the 4 MSB of low_pc and the 8 bytes of high_pc before the patch. This depends on low_pc and high_pc being put consecutively by the compiler, but it serves to validate the idea. I tried another way of doing it that does not rely on this but it didn't work either and I believe the reason for either not working is the same (and still unknown, but unrelated to them. I might be wrong though, and if I find yet another way of doing it I may try it). The other way was to use a form of DW_FORM_data8 for the section offset. This is disallowed by the specification, but I doubt gdb validates this, as it's just easier to store it as 64-bit anyway as this is even necessary to support 64-bit DWARF (which is not what gcc generates by default apparently). I still need to make changes to the diff to make it production-ready, but first I want to figure out why it doesn't work as expected. By looking at the output of llvm-dwarfdump or readelf, all of .debug_ranges, .debug_abbrev and .debug_info seem to have been correctly updated. However, gdb seems to have serious problems with what we write. (In fact, readelf --debug-dump=Ranges shows some funny warning messages of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"), but I played around with this and it seems it's just because no compile unit was using these ranges. Changing .debug_info apparently changes these warnings, so they seem to be unrelated to the section itself. Also looking at the hex dump of the section doesn't help, as everything seems fine. llvm-dwarfdump doesn't say anything. So I think .debug_ranges is fine.) The result is that gdb not only doesn't show the function name as we wanted, but it also stops showing line number information. Apparently it's not reading/interpreting the address ranges at all, and so the functions now have no associated address ranges, only the symbol value which allows one to put a breakpoint in the function, but not to show source code. As this left me without more ideas of what to try to feed gdb with, I believe the most promising next trial is to try to debug gdb itself, unless someone spots anything I missed. I found where the interesting part of the code lies for this case (gdb/dwarf2read.c and some other related files, but mainly that one). It seems in some parts gdb uses DW_AT_ranges for only getting its lowest and highest addresses and setting that as low_pc and high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called). I really hope this is not actually the case for function address ranges. I'll investigate this further. Otherwise I don't think any changes we make will make it work as initially intended, as we'll simply need gdb to support it and in that case it doesn't. (cherry picked from FBD3073641)	2016-03-16 18:08:29 -07:00
Gabriel Poesia	9cdb7bdb55	Write only minimal .debug_line information. Summary: We used to output .debug_line information for every instruction, but because of the way gdb (and probably lldb as of llvm::DWARFDebugLine::LineTable::findAddress) queries the line table it's not necessary to output information for two instructions if they follow each other and map to the same source line. By not repeating this information we generate a bit less .debug_line data. (cherry picked from FBD3056402)	2016-03-15 16:22:04 -07:00
Maksim Panchenko	a60914427c	Update DW_AT_ranges for CU when it exists. Summary: If CU has DW_AT_ranges update the value. Note that it does not create DW_AT_ranges attribute. (cherry picked from FBD3051904)	2016-03-14 19:04:23 -07:00
Maksim Panchenko	d01172ffa8	Refactor existing debugging code. Summary: Almost NFC. Isolate code for updating debug info. (cherry picked from FBD3051536)	2016-03-14 18:48:05 -07:00
Gabriel Poesia	dc7cc1fb18	Fix default line number information for instructions. Summary: The line number information generated from a null pointer was actually valid, which caused new instructions without the line number information set to have a valid and wrong line number reference. This diff fixes this by making the null pointer be assigned to an invalid line number row. (cherry picked from FBD3048453)	2016-03-14 11:40:52 -07:00
Gabriel Poesia	80ea31b24e	Write updated .debug_aranges section after optimizations. Summary: Write the .debug_aranges section after optimizations to the output binary. Each function generates at least one range and at most two (one extra for its cold part). The writing is done manually because LLVM's implementation is tied to the output of .debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp), which we don't want to trigger right now. (cherry picked from FBD3043108)	2016-03-11 11:30:30 -08:00
Maksim Panchenko	e7e9e15b90	Check function data in symbol table against data in .eh_frame. Summary: At the moment we rely solely on the symbol table information to discover function boundaries. However, similar information is contained in .eh_frame. Verify that the information from these two sources is consistent, and if it's not, then skip processing the functions with conflicting information. (cherry picked from FBD3043800)	2016-03-11 11:09:34 -08:00
Maksim Panchenko	f2df1a8d97	Update stmt_list value to point to new .debug_line offset. Summary: After we add new line number information we have to update stmt_list offsets in .debug_info. For this I had to add a primitive relocations support for non-allocatable sections we are copying from input file. Also enabled functionality to process relocations in non-allocatable sections that LLVM is generating, such as .debug_line. I thought we already had it, but apparently it didn't work, at least not for ELF binaries. (cherry picked from FBD3037903)	2016-03-09 16:06:41 -08:00
Maksim Panchenko	9212a9ad69	Proper skipping of unsupported CFI instructions. Summary: Skip DW_CFA_expression and DW_CFA_val_expression instructions properly, according to DWARF spec. If CFI range does not match function range skip that function. (cherry picked from FBD3040502)	2016-03-10 23:03:17 -08:00
Gabriel Poesia	73c9f0abe3	Write updated .debug_line information to temp file Summary: Writes .debug_line section by setting the state in MCContext that LLVM needs to produce and output the line tables. This basically consists of setting the current location and compile unit offset. This makes LLVM output .debug_line in the temporary file, but not yet in the generated ELF file. Also computes the line table offsets for each compile unit and saves them into BinaryContext. Added an option to print these offsets. (cherry picked from FBD3004554)	2016-03-02 18:40:10 -08:00
Maksim Panchenko	d68b1c7b16	Extending support for non-allocatable sections. Summary: The is a set of changes that allow modification of non-allocatable sections in ELF binary. Primarily for the purpose of updating debug info. Extend LLVM interface to allow processing relocations in non-allocatable sections. This allows to produce .debug* sections with resolved relocations against generated code. Extend BOLT rewriting framework to allow appending contents to non-allocatable sections in the binary. Re-worked ELF binary rewriting to support the above and to allow future extensions (e.g. new section names). (cherry picked from FBD3023403)	2016-03-03 10:13:11 -08:00
Gabriel Poesia	77a6b72842	BOLT: Read and tie .debug_line info to IR. Summary: Reads information in the DWARF .debug_line section using LLVM and tie every MCInst to one line of a line table from the input binary. Subsequent diffs will update this information to match the final binary layout and output updated line tables. (cherry picked from FBD2989813)	2016-02-25 16:57:07 -08:00
Maksim Panchenko	62da18d32a	Always split functions under '-split-functions=1' option. Summary: Force the splitting of the function into hot/cold even when the function fits into original slot. This reduces BOLT optimization time by 50% without affecting hhvm performance. (cherry picked from FBD2973773)	2016-02-22 16:49:26 -08:00
Maksim Panchenko	73e9afe99c	Don't abort on unknown CFI instructions. Summary: If we see an unknown CFI instruction, skip processing the function containing it instead of aborting execution. (cherry picked from FBD2964557)	2016-02-22 18:25:43 -08:00
Maksim Panchenko	7f7d4af7e0	Add an option to use PT_GNU_STACK for new segment. Summary: Added an option to reuse existing program header entry. This option allows for bfd tools like strip and objcopy to operate on the optimized binary without destroying it. Also, all new sections are now properly marked in ELF. (cherry picked from FBD2943339)	2016-02-12 19:01:53 -08:00
Maksim Panchenko	50c895ad0c	Drop requirement for __flo_storage in the input binary. Summary: We used to require pre-allocated space in the input binary so that we can write extra sections in there (.eh_frame, .eh_frame_hdr, .gcc_except_table, etc.). With this diff there's no further need for pre-allocated storage as we create a new segment and can use as much space as needed. There are certain limitations on where the new segment could be allocated, and as a result the size of the file may increase. There's currently a limitation if the binary size is close to 4GB we cannot allocate new segment prior to that and as a result we require debug info to be stripped to reduce the file size. The fix is in progress. (cherry picked from FBD2916029)	2016-02-08 10:02:48 -08:00
Maksim Panchenko	e1a61e1eed	Keep intermediate .o file only under -keep-tmp option. Summary: We use intermediate .o file for debugging purposes, but there's no reason to generate it by default. Only do it if "-keep-tmp" is specified. (cherry picked from FBD2912098)	2016-02-08 10:08:28 -08:00
Maksim Panchenko	d1526083fc	Rename binary optimizer to BOLT. Summary: BOLT - Binary Optimization and Layout Tool replaces FLO. I'm keeping .fdata extension for "feedback data". (cherry picked from FBD2908028)	2016-02-05 14:42:04 -08:00
Maksim Panchenko	628d06b1e5	Preserve layout of basic blocks with 0 profile counts. Summary: Preserve original layout for basic blocks that have 0 execution count. Since we don't optimize for size, it's better to rely on the original input order. (cherry picked from FBD2875335)	2016-01-21 14:18:30 -08:00
Maksim Panchenko	b91d1f1299	Enable REPNZ prefix support. Summary: I didn't see a case where REPNZ were not disassembled/reassembled properly. (cherry picked from FBD2869229)	2016-01-26 17:53:08 -08:00
Maksim Panchenko	218c5f0916	Fix a bug with outlining first basic block. Summary: We should never outline the first basic block. Also add an option to accept a file with the list of functions to optimize. (cherry picked from FBD2868184)	2016-01-26 16:03:58 -08:00
Maksim Panchenko	89578e2314	Allow to partially split functions with exceptions. Summary: We could split functions with exceptions even without creating a new exception handling table. This limits us to only move basic blocks that never throw, and are not a start of a landing pad. (cherry picked from FBD2862937)	2016-01-22 16:45:39 -08:00
Maksim Panchenko	bbb745efa9	Don't create empty basic blocks. Fix CFI bug. Summary: Some basic blocks were created empty because they only contained alignment nop's. Ignore such nop's before basic block gets created. Fixed intermittent aborts related to CFI update. (cherry picked from FBD2844465)	2016-01-19 00:20:06 -08:00
Maksim Panchenko	4a44d187c6	Handle more CFI cases and some. Summary: * Update CFI state for larger range of functions to increase coverage. * Issue more warnings indicating reasons for skipping functions. * Print top called functions in the binary. (cherry picked from FBD2839734)	2016-01-16 14:58:22 -08:00
Maksim Panchenko	d9536e6092	Added an option to reverse original basic blocks order. Summary: Modified processing of "-reorder-blocks=" option and added an option to reverse original basic blocks order for testing purposes. (cherry picked from FBD2829862)	2016-01-13 17:19:40 -08:00
Maksim Panchenko	c9b7e3e09e	Write updated LSDA's. Summary: Write new exception ranges tables (LSDA's) into the output file. (cherry picked from FBD2828312)	2015-12-18 17:00:46 -08:00
Maksim Panchenko	b42c72cbf6	Fix issues with some CFI instructions with gcc 4.9. Summary: Fixes some issues discovered after hhvm switched to gcc 4.9. Add support for DW_CFA_GNU_args_size instruction. Allow CFI instruction after the last instruction in a function. Reverse conditions of assert for DW_CFA_set_loc. (cherry picked from FBD28110096)	2015-12-18 20:26:44 -08:00
Maksim Panchenko	a6efd11c05	Code/comments cleanup. Summary: Consolidate cold function info under cold FragmentInfo. Minor code and comment mods to LSDA handling. (cherry picked from FBD28109981)	2015-12-17 12:59:15 -08:00
Maksim Panchenko	e2fcb371a8	Ignore functions referencing symbol at 0x0. Summary: Binary code could be weird. It could include calls to address 0 and reference data at 0 (e.g. with lea on x86). LLVM JIT fatals while resolving relocations against symbols at address 0x0. For now we will stop emitting such code, i.e. we'll skip functions. (cherry picked from FBD28109837)	2015-12-16 17:56:49 -08:00
Maksim Panchenko	f7d7a85a24	Turn EH ranges support back on. Summary: Changed the way EH info is stored/extracted from call instruction. Make sure indirect calls work. (cherry picked from FBD28109629)	2015-12-15 17:06:27 -08:00
Rafael Auler	fb6e8c5d0b	Don't touch functions whose internal BBs are targets of interprocedural branches Summary: In a test binary, we found 8 cases where code in a function A would jump to the middle of another function B. In this case, we cannot reorder function B because this would change instruction offsets and break the program. This is pretty rare but can happen in code written in assembly. (cherry picked from FBD2719850)	2015-12-03 13:29:52 -08:00
Rafael Auler	9a73a8c446	Turns off basic block alignment by default Summary: We found out that the insertion of extra nops to preserve alignment of some loop bodies do not pay off the increased function size, since this extra size may inhibit us from rewriting a reordered version of this function. (cherry picked from FBD2718466)	2015-12-03 09:45:18 -08:00
Rafael Auler	04c80af012	Don't choke on DW_CFA_def_cfa_expression and friends Summary: Our CFI parser in the LLVM library was giving up on parsing all CFI instructions when finding a single instruction with expression operands. Yet, all gcc-4.9 binaries seem to have at least one CFI instruction with expression operands (DW_CFA_def_cfa_expression). This patch fixes this and makes DebugInfo continue to parse other instructions, even though it does not completely parse DWARF expressions yet. However, this seems to be enough to allow llvm-flo to process gcc-4.9 binaries because the FDEs with DWARF expressions are linked to the PLT region, and not to functions that we process. If we ever try to read a function whose CFI depends on DWARF expression, which is unlikely, llvm-flo will assert. (cherry picked from FBD2693088)	2015-11-24 13:55:44 -08:00
Rafael Auler	d6f01452d1	Change function splitting to be a two-pass process Summary: This patch builds upon the previous patch to create a two-pass process to function splitting. We first perform the full rewriting pipeline to discover which functions need splitting. Afterwards, we restart the pipeline with those functions annotated to be split. (cherry picked from FBD2691709)	2015-11-24 09:29:41 -08:00
Rafael Auler	c67a753e3c	Refactoring llvm-flo.cpp into a new class RewriteInstance, NFC. Summary: Previously, llvm-flo.cpp contained a long function doing lots of different tasks. This patch refactors this logic into a separate class with different member functions, exposing the relationship between each step of the rewritting process and making it easier to coordinate/change it. (cherry picked from FBD2691674)	2015-11-23 17:54:18 -08:00
Rafael Auler	ccbbb8f8b9	Teach llvm-flo how to split functions into hot and cold regions Summary: After basic block reordering, it may be possible that the reordered function is now larger than the original because of the following reasons: - jump offsets may change, forcing some jump instructions to use 4-byte immediate operand instead of the 1-byte, shorter version. - fall-throughs change, forcing us to emit an extra jump instruction to jump to the original fall-through at the end of a basic block. Since we currently do not change function addresses, we need to rewrite the function back in the binary in the original location. If it doesn't fit, we were dropping the function. This patch adds a flag -split-functions that tells llvm-flo to split hot functions into hot and cold separate regions. The hot region is written back in the original function location, while the cold region is written in a separate, far-away region reserved to flo via a linker script. This patch also adds the logic to create and extra FDE to supply unwinding information to the cold part of the function. Owing to this, we now need to rewrite .eh_frame_hdr to another location and patch the EH_FRAME ELF segment to point to this new .eh_frame_hdr. (cherry picked from FBD2677996)	2015-11-19 17:59:41 -08:00
Rafael Auler	38dac03e6b	Make llvm-flo print dynamic coverage of rewritten functions Summary: This is an attempt at determining the hotness of functions we are rewriting and help detect if we are discarding hot functions. This patch introduces logic to estimate the number of instructions executed in each function by using the profile data for branches. It sums the products of BB frequency and size. Since we can only do this for functions we have successfully disassembled, created the CFG and annotated with profiling data, all complex functions that were not disassembled are left out from this analysis. (cherry picked from FBD2654985)	2015-11-13 15:27:59 -08:00
Rafael Auler	75798a891b	Do not bail on functions with indirect calls Summary: Previously, we were marking functions with indirect calls as too complex to be disassembled, but this was unnecessarily conservative. This patch removes this restriction. (cherry picked from FBD2669627)	2015-11-02 09:46:50 -08:00
Rafael Auler	7886f4e81a	Ignore LSDA information for now Summary: Teach llvm-flo to drop on function with LSDA information until we know how to update them after block reordering. (cherry picked from FBD2640806)	2015-11-10 17:21:42 -08:00
Rafael Auler	1d248ec51b	Write .eh_frame and .eh_frame_hdr after reordering BBs Summary: This patch adds logic to detect when the binary has extra space reserved for us via the __flo_storage symbol. If this symbol is present, it means we have extra space in the binary to write extraneous information. When we write a new .eh_frame, we cannot discard the old .eh_frame because it may still contain relevant information for functions we do not reorder. Thus, we write the new .eh_frame into __flo_storage and patch the current .eh_frame_hdr to point to the new .eh_frame only for the functions we touched, generating a binary that works with a bi-.eh_frame model. (cherry picked from FBD2639326)	2015-11-10 15:20:50 -08:00
Rafael Auler	70db5677fb	Write updated CFI to temporary object file Summary: This patch is an intermediary step towards updating the CFI in the optimized binary. It adds the logic necessary to output our CFI annotations to a new .eh_frame in the temporary object file we create to hold rewritten functions. The next step will be to fully integrate this new .eh_frame into the optimized binary. (cherry picked from FBD2633728)	2015-11-09 11:08:02 -08:00
Rafael Auler	6c851dc2e3	Attempts to fix CFI state after reordering Summary: This patch introduces logic to check how the CFI instructions define a table to help during stack unwinding at exception run time and attempts to fix any problem in this table that may have been introduced by reordering the basic blocks. If it fails to fix this problem, the function is marked as not simple and not eligible for rewriting. (cherry picked from FBD2633696)	2015-11-08 12:23:54 -08:00
Maksim Panchenko	bc9d6e3b6c	Regenerate exception handling information after optimizations. Summary: Regenerate exception handling information after optimizations. Use '-print-eh-ranges' to see CFG with updated ranges. (cherry picked from FBD2660982)	2015-11-13 14:18:45 -08:00
Maksim Panchenko	56cca2fb5b	Fix LSDA reading issues. Summary: There were two issues: we were trying to process non-simple functions, i.e. function that we don't fully understand, and then we failed to stop iterating if EH closing label was after the last instruction in a function. (cherry picked from FBD2664460)	2015-11-17 11:02:04 -08:00
Maksim Panchenko	be2a19523c	Add exception handling information to CFG. Summary: Read .gcc_except_table and add information to CFG. Calls have extra operands indicating there's a possible handler for exceptions and an action. Landing pad information is recorded in BinaryFunction. Also convert JMP instructions that are calls into tail calls pseudo instructions so that they don't miss call instruction analysis. (cherry picked from FBD2652775)	2015-11-12 18:56:58 -08:00
Rafael Auler	2117362a09	Revert 45fc13b as it breaks HHVM rewriting Summary: Reverting this commit until we better investigate why it is necessary to change local symbol names with a prefix. (cherry picked from FBD28109521)	2015-11-12 10:41:46 -08:00
Rafael Auler	1df130ae17	Remove add PG prefix from symbols that are already local Summary: After discussion with Maksim, we decided to drop the lines that add the PG prefix if the symbol is already local, since they wouldn't be impacted by the way LLVM handles these symbols. (cherry picked from FBD28109400)	2015-11-12 10:02:12 -08:00
Rafael Auler	e80d11f27a	Fix bug in local symbol name disambiguation algorithm Summary: This bug would cause llvm-flo to fail to disambiguate two local symbols with the same file name, causing two different addresses to compete in the symbol table for the resolution of a given name, causing unpredicted behavior in the linker. (cherry picked from FBD2646626)	2015-11-11 23:56:24 -08:00
Rafael Auler	a30d04c3e2	Annotate BinaryFunctions with MCCFIInstructions encoding CFI Summary: In order to represent CFI information in our BinaryFunction class, this patch adds a map of Offsets to CFI instructions. In this way, we make it easy to check exactly where DWARF CFI information is annotated in the disassembled function. (cherry picked from FBD2619216)	2015-11-04 16:48:47 -08:00
Maksim Panchenko	de46e6fc07	Parse whole contents of .gcc_except_table even if we are not printing. Summary: We need to parse the whole contents of .gcc_except_table even if we are not printing exceptions. Otherwise we are missing type index table and miscalculate the size of the current table. (cherry picked from FBD2632965)	2015-11-09 12:27:13 -08:00
Rafael Auler	2088875656	Teach llvm-flo how to read .eh_frame information from binaries Summary: In order to reorder binaries with C++ exceptions, we first need to read DWARF CFI (call frame info) from binaries in a table in the .eh_frame ELF section. This table contains unwinding information we need to be aware of when reordering basic blocks, so as to avoid corrupting it. This patch also cleans up some code from Exceptions.cpp due to a refactoring where we moved some functions to the LLVM's libSupport. (cherry picked from FBD2614464)	2015-11-05 13:37:30 -08:00
Maksim Panchenko	7d592d0975	Verbose printing of actions from .gcc_except_table Summary: Print actions for exception ranges from .gcc_except_table. Types are printed as names if the name is available from symbol table. (cherry picked from FBD2612631)	2015-11-03 14:26:33 -08:00
Maksim Panchenko	21cc191ea8	Added function to parse and dump .gcc_except_table Summary: Use '-print-exceptions' option to dump contents of .gcc_except_table. (cherry picked from FBD2609925)	2015-11-02 11:50:53 -07:00
Rafael Auler	0e8998713c	Extract non-taken branch frequencies from LBR Summary: Previously, we inferred all non-taken branch frequencies with the information we had for taken branches. This patch teaches perf2flo and llvm-flo how to read and incorporate non-taken branch frequencies directly from the traces available in LBR data and by disassembling the binary. It still leaves the inference engine untouched in case we need it to fill out other fall-throughs. (cherry picked from FBD2589212)	2015-10-26 15:00:56 -07:00
Rafael Auler	13a520ab30	Implement two cluster layout heuristics Summary: Pettis' paper on block layout (PLDI'90) suggests we should order clusters (or chains, using the paper terminology) using a specific criterion. This patch implements two distinct ideas for cluster layout that can be activated using different command-line flags. The first one reflects Pettis' ideas on minimizing branch mispredictions and the second one is targeted at reducing I-cache misses, described in the Ispike paper (CGO'04). (cherry picked from FBD2588693)	2015-10-23 09:38:26 -07:00
Rafael Auler	2539539bde	Fixes priority queue ordering in llvm-flo block reordering Summary: Fixes a bug which caused the block reordering heuristic to put in the same cluster hot basic blocks and cold basic blocks, increasing I-cache misses. (cherry picked from FBD2588203)	2015-10-27 03:04:58 -07:00
Maksim Panchenko	d4d773458c	More control over function printing. Summary: Can use '-print-*' option to print function at specific stage. Use '-print-all' to print at every stage. (cherry picked from FBD2578196)	2015-10-23 15:52:59 -07:00
Maksim Panchenko	7f44331773	Issue warning when relaxed tail call is seen on input. Summary: Issue warning when we see a 2-byte tail call. Currently we will increase the size of these instructions. (cherry picked from FBD2575520)	2015-10-20 10:51:17 -07:00
Rafael Auler	546c4e6e84	Fix bug in BinaryFunction::fixBranches() in llvm-flo Summary: When the ignore-nops patch landed, it exposed a bug in fixBranches() where it ignored empty BBs. However, we cannot ignore empty BBs when it is reordered and its fall-through changes. We must update it with a jump to the original fall-through. This patch fixes this. (cherry picked from FBD2568244)	2015-10-21 16:25:16 -07:00
Rafael Auler	dc848b5376	Fix entry BB execution count in llvm-flo Summary: When we have tailcalls, the execution count for the entry point is wrongly computed. Fix this. (cherry picked from FBD2563112)	2015-10-20 16:48:54 -07:00
Rafael Auler	ab63ca9afb	Implement unreachable BB elimination in llvm-flo Summary: It is important to remove dead blocks to free up space in functions and allow us to reorder blocks or align branch targets with more freedom. This patch implements a simple algorithm to delete all basic blocks that are not reachable from the entry point. Note that C++ exceptions may create "unreachable" blocks, so this option must be used with care. (cherry picked from FBD2562637)	2015-10-20 12:47:37 -07:00
Rafael Auler	9f41a0d263	Do not schedule BBs before the entry point Summary: SPEC CPU2006 perlbench triggered a bug in our heuristic block reordering algorithm where a hot edge that targets the entry point (as in a recursive tail call) would make us try to allocate the call site before the function entry point. Since we don't update function addresses yet, moving the entry point will corrupt the program. This patch fixes this. (cherry picked from FBD2562528)	2015-10-20 12:30:22 -07:00
Rafael Auler	b0115a4536	Teach llvm-flo how to handle two back-to-back JMPs Summary: If we have two consecutive JMP instructions and no branches to the second one, the second one is dead code, but llvm-flo does not handle these cases properly and put two JMPs in the same BB. This patch fixes this, putting the extraneous JMP in a separate block, making it easy for us to detect it is dead code and remove it later in a separate step. (cherry picked from FBD2562465)	2015-10-20 10:17:38 -07:00
Maksim Panchenko	85b99eb7b7	Eliminate nop instruction in input and derive alignment. Summary: Nop instructions are primarily used for alignment purposes on the input. We remove all nops when we build CFG and derive alignment of basic blocks based on existing alignment and a presence of nops before it. This will not always work as some basic blocks will be naturally aligned without necessity for nops. However, it's better than random alignment. We would also add heuristics for BB alignment based on execution profile. (cherry picked from FBD2561740)	2015-10-20 10:51:17 -07:00
Rafael Auler	cd6250d1e3	Fixes branches after reordering basic blocks in a binary function Summary: Adds logic in BinaryFunction to be able to fix branches (invert its condition, delete or add a branch), making the new function work with the new layout proposed by the layout pass. All the architecture-specific content was designed to live in the LLVM Target library, in the MCInstrAnalysis pass. For now, we only introduce such logic to the X86 backend. (cherry picked from FBD2551479)	2015-10-16 09:49:04 -07:00
Rafael Auler	ef059af3d1	Fix bug in block reorder heuristic Summary: Tests with SPEC CPU2006 400.perlbench exposed a bug in the block reordering heuristic that happened when two blocks are both successor and predecessor of each other. This patch fixes this. (cherry picked from FBD2555835)	2015-10-19 10:43:54 -07:00
Rafael Auler	31e6bd1226	Fix missing sanity check in BinaryFunction::optimizeLayout() Summary: SPEC CPU2006 perlbench exposed a bug in BinaryFunction::optimizeLayout() where it would try to optimize the layout even though the function had zero basic blocks. This patch simply checks if the function has zero basic blocks and bails out. (cherry picked from FBD2556831)	2015-10-19 13:23:03 -07:00
Maksim Panchenko	b4ed5cc942	Make FLO work on hhvm binary. Summary: Fixes several issues that prevented us from running hhvm binary. (cherry picked from FBD2543057)	2015-10-14 15:35:14 -07:00
Rafael Auler	ec22caff1e	Fix comments. NFC. Summary: Updated comments in BinaryFunction class. (cherry picked from FBD28108888)	2015-10-16 17:15:00 -07:00
Rafael Auler	9a8d357d0b	Fix DataReader to work with new local sym perf2flo format Summary: In a recent commit, we changed local symbols to be specially tagged with the number 2 (local sym) instead of 1 (sym). This patch modifies the reader to don't choke when seeing a 2 in the symbol id field. (cherry picked from FBD2552776)	2015-10-16 17:00:36 -07:00
Rafael Auler	f9ed45893b	Teach llvm-flo how to reorder blocks in an optimal way Summary: This patch implements a dynamic programming approach to solve reorder basic blocks with profiling information in an optimal way. Since this is analogous to TSP, it is NP-hard and the algorithm is exponential in time and memory consumption. Therefore, we only use the optimal algorithm to decide the layout of small functions (with less than 11 basic blocks). (cherry picked from FBD2544124)	2015-10-14 16:58:55 -07:00
Rafael Auler	34f7085503	Teach llvm-flo how to reorder basic blocks with a heuristic Summary: This patch introduces a first approach to reorder basic blocks based on profiling data that gives us the execution frequency for each edge. Our strategy is to layout basic blocks in a order that maximizes the weight (hotness) of branches that will be deleted. We can delete branches when src comes right before dst in the new layout order. This can be reduced to the TSP problem. This patch uses a greedy heuristic to solve the problem: we start with a graph with no edges and progressively add edges by choosing the hottest edges first, building a layout order that attempts to put BBs with hot edges together. (cherry picked from FBD2544076)	2015-10-13 12:18:54 -07:00
Rafael Auler	9b58b2e64b	Make llvm-flo infer branch count data for fall-through edges Summary: The LBR only has information about taken branches and does not record information when a branch is not taken. In our CFG, we call these edges "fall-through" edges. This patch teaches llvm-flo how to infer fall-through edge frequencies. (cherry picked from FBD2536633)	2015-10-13 10:25:45 -07:00
Maksim Panchenko	f79f6302c1	Converted local offsets from uint64_t to uint32_t. Refactoring. (cherry picked from FBD2543557)	2015-10-14 16:46:59 -07:00
Rafael Auler	4c1da22ae9	Add branch count information to binary CFG Summary: Changes DataReader to organize branch perf data per function name and sets up logistics to bring this data to BinaryFunction::buildCFG(). To do this, we expand BinaryContext with a const reference to DataReader. This patch also adds the "-dump-functions" flag to force llvm-flo to dump the current state of BinaryFunctions once they are disassembled and their CFG built, allowing us to test whether the builder is sane with LLVM LIT tests. (cherry picked from FBD2534675)	2015-10-12 12:30:47 -07:00
Maksim Panchenko	d30423f872	Don't bail out if there's no input data file specified. Summary: Don't attempt to read data file if it was not specified by the user. (cherry picked from FBD2533440)	2015-10-12 14:46:18 -07:00
Maksim Panchenko	ffcc2be7fa	FLO: added support for rip-relative operands. Summary: Detect and replace rip-relative operands with relocations. (cherry picked from FBD2529818)	2015-10-09 21:47:18 -07:00
Maksim Panchenko	f166c4ab2b	Fix CFG building issue. Summary: Fixed getBasicBlockContainingOffset() to return correct basic block. (cherry picked from FBD2532514)	2015-10-12 12:12:16 -07:00
Rafael Auler	e1a539b0ec	Add initial implementation of DataReader Summary: This patch introduces DataReader, a module responsible for parsing llvm flo data files into in-memory data structures. (cherry picked from FBD2515754)	2015-10-05 18:31:25 -07:00
Maksim Panchenko	9a2fe7ebe4	Commit FLO with control flow graph. Summary: llvm-flo disassembles, builds control flow graph, and re-writes simple functions. (cherry picked from FBD2524024)	2015-10-09 17:21:14 -07:00
Maksim Panchenko	7927c14ff5	Fixed cmake. (cherry picked from FBD28108725)	2015-10-02 12:38:07 -07:00
Maksim Panchenko	a89c417357	Removed remote .arcconfig + comment change. (cherry picked from FBD2503821)	2015-10-02 12:06:31 -07:00
Maksim Panchenko	575b24d719	Initial FLO commit. Summary: Directory created. (cherry picked from FBD28105260)	2015-10-02 11:55:15 -07:00
Maksim Panchenko	25b976aa12	BOLT root commit	2022-01-10 17:58:05 -08:00

... 10 11 12 13 14 ...

1129 Commits