llvm-project

Commit Graph

Author	SHA1	Message	Date
Rafael Auler	698a4684ac	[BOLT] Fix merge-fdata and heatmap in BAT Summary: merge-fdata for legacy format was simply appending all input strings to output, but the real format supports some header strings that can't be simply concatenated to output. Check for the header string used by BAT before merging fdata to avoid creating an output file with invalid lines (header in the middle of the fdata file). For heatmap, avoid reading BAT tables, since they won't be used. (cherry picked from FBD17943131)	2019-10-11 13:32:14 -07:00
Xin-Xin Wang	d87f95065a	[BOLT] Add missing CMake test dependencies Summary: I noticed when setting up a new repository for bolt that bolt tests would fail unexpectedly when running `ninja check-bolt` and `ninja check-llvm`. This turns out to be because dependencies for bolt binaries were not specified in the CMake configuration so they were not built before running the tests. This diff adds the dependencies to the CMake configuration for check-bolt and check-llvm so that bolt binaries are built before running tests. (cherry picked from FBD17919505)	2019-10-14 16:03:54 -07:00
Maksim Panchenko	8c6ea8540a	[BOLT] Improve object discovery runtime Summary: (cherry picked from FBD17872824)	2019-10-08 11:03:33 -07:00
Rafael Auler	13948f376d	[BOLT] Do not emit BAT for non-simple in nonreloc Summary: Doing so causes corrupt entries to be emitted. (cherry picked from FBD17774505)	2019-10-04 16:28:03 -07:00
Mark Santaniello	c9f4bbdc22	[llvm-bolt] Bugfix jemalloc sized deallocation segfault Summary: C++14 "sized deallocation" introduces a 2-argument `delete` where the new 2nd argument is the original allocated size. It's useful for allocators like jemalloc to be "reminded" of the original allocation size, else they incur the cost of an address to size lookup. Jemalloc has provided this for a while as `sdallocx`, and recently it got wired up to the new 2-arg `delete`. Here I introduce typedefs for the SmallVectors so the "16" is consistent, which seems to fix the issue. (cherry picked from FBD17618981)	2019-09-26 16:51:22 -07:00
Rafael Auler	ba31344fa9	[BOLT] Fix build for Mac Summary: Change our CMake config for the standalone runtime instrumentation library to check for the elf.h header before using it, so the build doesn't break on systems lacking it. Also fix a SmallPtrSet usage where its elements are not really pointers, but uint64_t, breaking the build in Apple's Clang. (cherry picked from FBD17505759)	2019-09-20 11:29:35 -07:00
Maksim Panchenko	5e6d246b9c	[BOLT] Reword message for macro-op fusion optimization Summary: With the word "missed", the previous message about opportunities for macro-op fusion optimization could be misleading. (cherry picked from FBD17464603)	2019-09-18 15:33:03 -07:00
Maksim Panchenko	c823220116	[BOLT] Better check for compiler de-virtualization bug Summary: The existing check for compiler de-virtualization bug was not working when the relocation reference did not fall on a function boundary. As a result, we were falsely detecting "unmarked object in code". When running the check, the address could be arbitrary, except it shouldn't match any existing function. Additionally, check that there's a proper reference to the de-virtualized callee to avoid false positives. (cherry picked from FBD17433887)	2019-09-17 14:24:31 -07:00
Maksim Panchenko	e9c6c73bb8	[BOLT][non-reloc] Change function splitting in non-relocation mode Summary: This diff applies to non-relocation mode mostly. In this mode, we are limited by original function boundaries, i.e. if a function becomes larger after optimizations (e.g. because of the newly introduced branches) then we might not be able to write the optimized version, unless we split the function. At the same time, we do not benefit from function splitting as we do in the relocation mode since we are not moving functions/fragments, and the hot code does not become more compact. For the reasons described above, we used to execute multiple re-write attempts to optimize the binary and we would only split functions that were too large to fit into their original space. After the first attempt, we would know functions that did not fit into their original space. Then we would re-run all our passes again feeding back the function information and forcefully splitting such functions. Some functions still wouldn't fit even after the splitting (mostly because of the branch relaxation for conditional tail calls that does not happen in non-relocation mode). Yet we have emitted debug info as if they were successfully overwritten. That's why we had one more stage to write the functions again, marking failed-to-emit functions non-simple. Sadly, there was a bug in the way 2nd and 3rd attempts interacted, and we were not splitting the functions correctly and as a result we were emitting less optimized code. One of the reasons we had the multi-pass rewrite scheme in place, was that we did not have an ability to precisely estimate the code size before the actual code emission. Recently, BinaryContext obtained such functionality, and now we can use it instead of relying on the multi-pass rewrite. This eliminates redundant work of re-running the same function passes multiple times. 
Because function splitting runs before a number of optimization passes that run on post-CFG state (those rely on the splitting pass), we cannot estimate the non-split code size with 100% accuracy. However, it is good enough for over 99% of the cases to extract most of the performance gains for the binary. As a result of eliminating the multi-pass rewrite, the processing time in non-relocation mode with `-split-functions=2` is greatly reduced. With debug info update, it is less than half of what it used to be. New semantics for `-split-functions=<n>`: -split-functions - split functions into hot and cold regions =0 - do not split any function =1 - in non-relocation mode only split functions too large to fit into original code space =2 - same as 1 (backwards compatibility) =3 - split all functions (cherry picked from FBD17362607)	2019-09-11 15:42:22 -07:00
Wenlei He	615a318b60	[BOLT] Filter perf samples by PID Summary: `perf2bolt` accepts an executable name, and the tool will find all the PIDs associated with that executable. When different versions of an executable are running at the same time, the name alone may not be sufficient, as we will get samples from different versions of the binary aggregated together. The resulting fdata may look stale to BOLT, which makes BOLT bail out of optimizing functions. This change adds a `-pid` switch that lets the user specify a process ID in addition to the executable name so BOLT can target a specific process. (cherry picked from FBD17178898)	2019-09-03 22:24:06 -07:00
Wenlei He	8cd1ba599b	[BOLT] Ignore LBR from kernel interrupts Summary: This change adds a switch (`ignore-interrupt-lbr`) to ignore LBRs from perf input that are the result of kernel interrupts. These asynchronous user/kernel transitions make BOLT think that the profile is stale, so it bails out of optimizing functions. Ideally, a user-mode filter should be set for `perf record` so we don't have asynchronous LBRs. However, these are identifiable as kernel address space is known, so we can ignore any LBRs that come from or go into kernel addresses during aggregation. This is under a switch and off by default in case we need to BOLT a kernel module. (cherry picked from FBD17170107)	2019-09-03 10:01:26 -07:00
Rafael Auler	cc4b2fb614	[BOLT] Efficient edge profiling in instrumented mode Summary: Change our edge profiling technique when using instrumentation so that we do not instrument every edge. Instead, build the spanning tree for the CFG and omit instrumentation for edges in the spanning tree. Infer the edge count for these edges when writing the profile during run time. The inference works with a bottom-up traversal of the spanning tree and establishes the value of the edge connecting to the parent based on a simple flow equation involving output and input edges, where the only unknown variable is the parent edge. This requires some engineering in the runtime lib to support dynamic allocation for building these graphs at runtime. (cherry picked from FBD17062773)	2019-08-07 16:09:50 -07:00
Rafael Auler	52786928ff	[BOLT] Fix perf2bolt race in BAT mode Summary: We start a thread to preprocess the profile while the main thread continues to disassemble the input binary. We should not disassemble it in BAT mode, however, the test to check whether we have BAT in the input binary depends on the preprocessing thread, so there is a race where we may start disassembling functions just because the preprocessing thread didn't conclude we are in BAT mode. Fix this and make the main thread check for BAT without depending on the preprocessing thread. (cherry picked from FBD17124370)	2019-08-29 16:18:43 -07:00
Rafael Auler	1f6564f117	[BOLT] Support .plt.got section Summary: We decode the regular .plt section and we are able to perform optimizations on it with -plt=hot or -plt=all, however, .plt.got sections are not decoded by BOLT. This patch teaches BOLT how to read them. They are created by the bfd linker whenever there is no need for the dynamic linker to lazy-bind the symbol (when they are eagerly resolved at binary load time). These entries are 8-byte sized instead of 16-byte sized like the regular PLT, and contain a single indirect call instruction with 7 bytes and a nop. (cherry picked from FBD17060515)	2019-08-26 15:03:38 -07:00
Rafael Auler	243507db99	[BOLT] Fix aggregator w.r.t. split functions Summary: We should not rely on split function detection while aggregating data, but only look up the original function names in the symbol table. Split function detection should be done by BOLT and not perf2bolt while writing the profile. Then, BOLT, when reading it, will take care of combining functions if necessary. This caused a bug in bolted data collection where samples in cold parts of a function were being falsely attributed to the hot part of a function instead of being attributed to the cold part, causing incorrect translation of addresses. (cherry picked from FBD16993065)	2019-08-23 12:18:31 -07:00
Maksim Panchenko	f588d7a6ea	[BOLT] Tighter control of jump table detection Summary: We were too permissive by allowing more jump tables during the preliminary scan of memory. This allowed for jump tables to be falsely detected. And since we didn't have a way to backtrack the jump table creation, we had to assert. This diff refactors the code that analyzes jump table contents. Preliminary and final passes share the same code. The only difference should be the detection of instruction boundaries that are available during the final pass. This should affect strict relocation mode only. (cherry picked from FBD16923335)	2019-08-19 14:06:36 -07:00
Maksim Panchenko	bf030f336a	[BOLT] Fix misleading output Summary: BOLT prints "spawning thread to pre-process profile" message even when it is not running in the aggregation mode. Fix that. (cherry picked from FBD16908596)	2019-08-19 17:11:42 -07:00
Rafael Auler	821480d27f	[BOLT] Encode instrumentation tables in file Summary: Avoid directly allocating string and description tables in binary's static data region, since they are not needed during runtime except when writing the profile at exit. Change the runtime library to open the tables on disk and read only when necessary. (cherry picked from FBD16626030)	2019-08-02 11:20:13 -07:00
Rafael Auler	62aa74f836	[BOLT] Support instrumentation via runtime library Summary: To allow the development of future instrumentation work, this patch adds support in BOLT for linking arbitrary libraries into the binary processed by BOLT. We use the ORC relocation handling mechanism for that. With this support, this patch also moves code programmatically generated in X86 assembly language by X86MCPlusBuilder to C code written in a new library called bolt_rt. Change CMake to support this library as an external project in the same way as clang does with compiler_rt. This library is installed in the lib/ folder relative to BOLT root installation and by default instrumentation will look for the library at that location to finish processing the binary with instrumentation. (cherry picked from FBD16572013)	2019-07-24 14:03:43 -07:00
laith sakka	f77cccf681	Rename option (cherry picked from FBD16655093)	2019-08-05 13:56:48 -07:00
laith sakka	c1564a1026	Add test for parallel mode Summary: Add a flag that disables writing the bolt-info section and add a test that runs bolt in two modes, parallel and sequential, and asserts that the resulting binaries are the same. (cherry picked from FBD16575440)	2019-07-30 17:55:27 -07:00
laith sakka	cc8415406c	Rewrite frame analysis using parallel utilities Summary: Rewrite frame analysis using parallel utilities (cherry picked from FBD16499130)	2019-07-25 11:57:08 -07:00
laith sakka	5084534699	Rewrite ICF using parallel utilities Summary: Rewrite ICF using parallel utilities (cherry picked from FBD16472975)	2019-07-24 17:13:15 -07:00
Maksim Panchenko	8d5854ef09	[BOLT] Add option to verify instruction encoder/decoder Summary: Add option `-check-encoding` to verify if the input to LLVM disassembler matches the output of the assembler. When set, the verification runs on every instruction in processed functions. I'm not enabling the option by default as it could be quite noisy on x86 where instruction encoding is ambiguous and can include redundant prefixes. (cherry picked from FBD16595415)	2019-07-31 16:03:49 -07:00
Maksim Panchenko	79ff4ec1cb	[perf2bolt] Enforce strict mode for perf2bolt Summary: In strict relocation mode, we get better function coverage. However, if the profile used for optimization was converted using non-strict mode, then it wouldn't match functions exclusive to strict mode. Hence, we have to enforce strict relocation mode for profile conversion, so it can be used for either mode. I'm also adding parallel profile pre-processing unless `--no-threads` is specified. This masks the runtime overhead of function disassembly on multi-core machines. (cherry picked from FBD16587855)	2019-06-11 13:24:10 -07:00
laith sakka	1bce256e67	Fix race condition in buildCFG Summary: Switch to sequential execution when print-all is passed, since the function getDynoStats has unsafe access to the annotation allocators. (cherry picked from FBD16503502)	2019-07-25 14:41:57 -07:00
laith sakka	6443c46b9d	Run hfsort+ in parallel Summary: hfsort+ performs an expensive analysis to determine the new order of the functions. 99% of the time during hfsort+ is spent in the function runPassTwo. This diff runs the body of the hot loop in runPassTwo in parallel speeding up the total runtime of reorder-functions pass by up to 4x (cherry picked from FBD16450780)	2019-07-23 15:49:02 -07:00
Maksim Panchenko	a9b9aa1e02	[BOLT] Add code padding verification Summary: In non-relocation mode, we allow data objects to be embedded in the code. Such objects could be unmarked, and could occupy an area between functions, the area which is considered to be code padding. When we disassemble code, we detect references into the padding area and adjust it, so that it is not overwritten during the code emission. We assume the reference to be pointing to the beginning of the object. However, assembly-written functions may reference the middle of an object and use negative offsets to reference data fields. Thus, conservatively, we reduce the possibly-overwritten padding area to a minimum if the object reference was detected. Since we also allow functions with unknown code in non-relocation mode, it is possible that we miss references to some objects in code. To cover such cases, we need to verify the padding area before we allow it to be overwritten. (cherry picked from FBD16477787)	2019-07-23 20:48:41 -07:00
Maksim Panchenko	6722875047	[BOLT] Fix processing PLT without relocs Summary: Some binaries may not have a relocation section corresponding to PLT. Handle them properly. (cherry picked from FBD16477841)	2019-07-24 22:08:36 -07:00
Maksim Panchenko	98fdba2cc7	[BOLT][NFC] Fix white space (cherry picked from FBD16473918)	2019-07-24 17:54:14 -07:00
laith sakka	744a2417dd	Run findSubprograms in preprocessDebugInfo in parallel Summary: While reading debug info the function findSubprograms runs on each compilation unit. This diff parallelizes that loop, reducing its runtime by 70%. (cherry picked from FBD16362867)	2019-07-17 20:54:53 -07:00
laith sakka	b50500893d	Lock-based parallelization for updateDebugInfo Summary: BOLT spends a decent amount of time creating patches to update debug information when -update-debug-sections is passed. In updateDebugInfo patches are created to update .debug_info and .debug_abbrev sections while .debug_loc and .debug_ranges contents are populated. This diff uses a lock-based approach to parallelize updateDebugInfo functions and reduces the runtime of the function by around 30%. (cherry picked from FBD16352261)	2019-07-17 14:58:17 -07:00
Facebook Github Bot	86800abc81	[BOLT][PR] Target compilation based on LLVM CMake configuration Summary: Minimalist implementation of target configurable compilation. Fixes https://github.com/facebookincubator/BOLT/issues/59 Pull Request resolved: https://github.com/facebookincubator/BOLT/pull/60 GitHub Author: Pierre RAMOIN <pierre.ramoin@amadeus.com> (cherry picked from FBD16461879)	2019-07-24 11:05:08 -07:00
Maksim Panchenko	2c9c6b164b	[BOLT] Fix issue printing CTCs without annotations Summary: After stripping annotations, conditional tail calls no longer can be identified by their corresponding tag. We can check the number of basic block successors instead. Fixes facebookincubator/BOLT#58. (cherry picked from FBD16444718)	2019-07-22 20:57:19 -07:00
laith sakka	fde5a2b470	Run shrink wrapping in parallel Summary: Shrink wrapping is an expensive part of frame optimizations if performed on all functions. This diff makes it run in parallel, reducing wall time. (cherry picked from FBD16092651)	2019-07-02 10:48:43 -07:00
laith sakka	7d42835418	Run buildCFG in disassembly in parallel Summary: This diff parallelizes the construction of the call graph during disassembly. The diff includes a change to parallel-utilities where another interface is added that supports running tasks on binaryFunctions that involve adding instruction annotations. This pattern is common in different places, e.g. frame optimizations, and as such it justifies creating an interface that abstracts out all the messy details. (cherry picked from FBD16232809)	2019-07-12 07:25:50 -07:00
laith sakka	f4ab6e6924	run finalize functions in parallel Summary: (cherry picked from FBD16188733)	2019-07-10 10:59:56 -07:00
laith sakka	98539b0966	run aligner pass in parallel Summary: This diff parallelizes the aligner pass. (cherry picked from FBD16176327)	2019-07-09 17:59:41 -07:00
laith sakka	9977b03fea	Run reorder blocks in parallel Summary: This diff changes the reorderBasicBlocks pass to run in parallel. It does so by adding locks to the fix branches function and creating temporary MCCodeEmitters when estimating basic block code size. (cherry picked from FBD16161149)	2019-07-08 12:32:58 -07:00
Rafael Auler	1169f1fdd8	[BOLT] Support duplicating jump tables Summary: If two indirect branches use the same jump table, we need to detect this and duplicate jump tables so we can modify this CFG correctly. This is necessary for instrumentation and shrink wrapping. For the latter, we only detect this and bail, fixing this old known issue with shrink wrapping. Other minor changes to support better instrumentation: add an option to instrument only hot functions, add LOCK prefix to instrumentation increment instruction, speed up splitting critical edges by avoiding calling recomputeLandingPads() unnecessarily. (cherry picked from FBD16101312)	2019-07-02 16:56:41 -07:00
Rafael Auler	8880969ced	[BOLT] Restrict creation of jump tables Summary: Heuristic that creates a jump table for every memory access, including those we do not match against a pattern in an indirect jump, is too permissive and has false positives. Guard this logic under strict mode until we figure out a better strategy. (cherry picked from FBD16192205)	2019-07-10 15:41:34 -07:00
laith sakka	3cfc76cdbf	Create a general interface to implement parallel tasks easily and apply it to run EliminateUnreachableBlocks in parallel. Summary: Each time we run some work in parallel over the list of functions in bolt, we manage a thread pool, task scheduling and perform some work to manage the granularity of the tasks based on the type of the work we do. In this task, I am creating an interface where all those details are abstracted out, the user provides the function that will run on each function, and some policy parameters that set up the scheduling and granularity configurations. This will make it easier to implement parallel tasks, and eliminate redundant coding efforts. (cherry picked from FBD16116077)	2019-07-03 17:23:19 -07:00
laith sakka	f10d1fe0f3	Run cleanAnnotations within frame analysis in parallel Summary: This diff parallelizes the function FrameAnalysis::cleanAnnotations() (cherry picked from FBD16096711)	2019-07-02 13:42:17 -07:00
laith sakka	00c252f6d8	Clean SPTMap in frame analysis in parallel Summary: This diff parallelizes the STPClean() function, reducing its runtime from 5 seconds to 0.4 on HHVM and bringing the runtime of the frame optimizer down to 33 seconds on HHVM. (cherry picked from FBD15914371)	2019-06-19 18:01:00 -07:00
laith sakka	86b529bd54	run SPT in parallel, and split annotation allocator Summary: This diff includes two main changes: 1) When creating an annotation, a dedicated annotation allocator can be used, instead of the default allocator. This allows some annotation to be deallocated right after the end of their usage completely. Furthermore, having the ability to use dedicated allocators allows running SPT in parallel where each task uses a different allocator. 2) SPT is parallelized. (cherry picked from FBD15913492)	2019-06-14 19:56:11 -07:00
Wenlei He	4e90fc1e3b	[BOLT] Prioritize Jump Table ICP target by frequency and index count Summary: We select the top hot targets for indirect call promotion. But since we only have frequency for targets, not for actual jump table indices, we have to merge indices that share the same actual target. In order to do that we sort targets by pointer of target symbol before merging, which introduces instability. Later we stable sort merged targets by frequency. Due to the instability of sorting pointers, and depending on how many indices each merged target has, we could end up with unstable ICP. This commit changes the 2nd pass sorting to prioritize targets with fewer indices, and higher mispredicts, in addition to higher frequency. It improves stability of ICP, and also exposes more ICP because targets with fewer indices have a better chance of getting promoted. (cherry picked from FBD16099701)	2019-07-02 15:51:20 -07:00
Maksim Panchenko	078ece1691	[BOLT] Fix out-of-bounds entry points Summary: Check that a symbol address is less than the next function address before considering it for a secondary entry. (cherry picked from FBD16056468)	2019-06-28 11:53:34 -07:00
Maksim Panchenko	e89ad0db4b	[BOLT] Introduce strict relocation mode Summary: In strict relocation mode we rely on relocations to represent all possible entry points into a function. Most of the code generated by tested compilers (gcc and clang) will result in relocations against any internal labels for jump tables and for computed goto tables. In situations where we cannot properly reconstruct a jump table, or when we cannot determine a table that guides an indirect jump, e.g. when multiple computed goto tables are used, we conservatively assume that the indirect jump can end up at any possible basic block referenced by relocations. In strict mode, simple functions may include the aforementioned instructions with unknown control flow with a conservative list of destinations added to the containing basic block. This allows us to expand coverage of simple functions and to enable code reordering optimizations for more functions. The strict mode is recommended when BOLT is used with a well-formed code generated by a compiler. To use the strict mode, add "-strict" on the command line. Another effect of this diff, is that with relocations, we will always replace the immediate operand of an instruction with a symbol if the relocation exists against this operand. Also this diff fixes issues with Clang compiled with -fpic. (cherry picked from FBD15872849)	2019-06-28 09:21:27 -07:00
Maksim Panchenko	06e7a1e059	[BOLT] Ignore false function references Summary: A relocation can have an addend that makes it look as the relocated value is in a different section from the symbol being relocated. E.g., a relocation against a variable in .rodata could have a negative offset that will make it look like it is against a symbol in .text (a section that typically precedes .rodata). Unless the relocation is against a section symbol, we know exactly the symbol that is being relocated and there is no issue. However, when the linker leaves only a section relocation (i.e. a relocation against a section symbol when a temporary original symbol gets deleted), we have to guess the relocated symbol, and can falsely detect a function reference in the case described above. The fix is to keep a section relocation if the corresponding relocated value falls into a different section, and to detect and ignore false function reference. (cherry picked from FBD16030791)	2019-06-27 03:20:17 -07:00
Wenlei He	459add2827	[BOLT] Force non-relocation mode for heatmap generation Summary: BOLT operates in relocation mode by default when .reloc is in the binary. This change disables relocation mode for heatmap generation so we can use that for more cases. There's a small separate change that ignores zero-sized symbols in zero-sized code sections during function discovery. (cherry picked from FBD16009610)	2019-06-26 11:06:46 -07:00
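
Note on commit cc4b2fb614 (efficient edge profiling): the spanning-tree idea in that commit message can be illustrated with a small Python sketch. This is not BOLT's runtime code. The function names `spanning_tree` and `infer_counts` are hypothetical, and the sketch assumes the CFG is closed by a virtual exit-to-entry edge so that inflow equals outflow at every block. Only edges outside the spanning tree carry counters; tree edges are recovered bottom-up, one unknown (the parent edge) per node.

```python
from collections import defaultdict

def spanning_tree(edges, root):
    """DFS over the undirected CFG. Returns (parent_edge, order), where
    parent_edge maps each node to the index of its tree edge (None for root)
    and order lists nodes so that every parent precedes its children."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        if u != v:  # a self-loop can never be a tree edge
            adj[u].append((v, i))
            adj[v].append((u, i))
    parent_edge, order, stack = {root: None}, [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        for m, i in adj[n]:
            if m not in parent_edge:
                parent_edge[m] = i
                stack.append(m)
    return parent_edge, order

def infer_counts(edges, parent_edge, order, measured):
    """Fill in counts for tree edges from counts measured on non-tree edges."""
    counts = dict(measured)
    incident = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        incident[u].append(i)
        if v != u:
            incident[v].append(i)
    for node in reversed(order):  # children before parents
        p = parent_edge[node]
        if p is None:
            continue
        inflow = outflow = 0
        for i in incident[node]:
            if i == p:
                continue
            u, v = edges[i]
            if v == node:
                inflow += counts[i]
            if u == node:
                outflow += counts[i]
        # Flow equation at this node: the parent edge is the only unknown.
        counts[p] = outflow - inflow if edges[p][1] == node else inflow - outflow
    return counts

# Diamond CFG plus the assumed virtual exit -> entry edge (index 4).
edges = [("entry", "a"), ("entry", "b"), ("a", "exit"), ("b", "exit"), ("exit", "entry")]
parent_edge, order = spanning_tree(edges, "entry")
tree = {i for i in parent_edge.values() if i is not None}
instrumented = [i for i in range(len(edges)) if i not in tree]
```

In this diamond example only two of the five edges need counters; given their measured values, `infer_counts` reconstructs the remaining three.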


616 Commits All Branches Search
