llvm-project

Roman Lebedev 10151f6618 [SimplifyCFG] FoldTwoEntryPHINode(): consider total speculation cost, not per-BB cost

Summary:
Previously, if the threshold was 2, we were willing to speculatively execute 2 cheap instructions in both basic blocks (thus we were willing to speculatively execute cost = 4), but weren't willing to speculate when one BB had 3 instructions and the other one had no instructions, even though that would have a total cost of 3.

This looks inconsistent to me. I don't think `cmov`-like instructions will start executing until both of their inputs are available: https://godbolt.org/z/zgHePf
So I don't see why the existing behavior is the correct one.

Also, let's add a separate `cl::opt` for this threshold, with default=4, so it is not stricter than the previous threshold: it will allow folding when there are 2 BBs each with cost=2. And since the logic has changed, it will also allow folding when one BB has cost=3 and the other cost=1, or when there is only one BB with cost=4.

This is an alternative solution to D65148:
This fix is mainly motivated by the `signbit-like-value-extension.ll` test. That pattern comes up in JPEG decoding, see e.g. `Figure F.12 – Extending the sign bit of a decoded value in V` of `ITU T.81` (JPEG specification). That branch is not predictable, and it is within the innermost loop, so the fact that the pattern ends up being stuck with a branch instead of `select` (i.e. `CMOV` for x86) is unlikely to be beneficial.

This has great results on the final assembly (vanilla test-suite + RawSpeed): (metric pass - D67240)

| metric                                 |     old |     new | delta |      % |
| x86-mi-counting.NumMachineFunctions    |   37720 |   37721 |     1 |  0.00% |
| x86-mi-counting.NumMachineBasicBlocks  |  773545 |  771181 | -2364 | -0.31% |
| x86-mi-counting.NumMachineInstructions | 7488843 | 7486442 | -2401 | -0.03% |
| x86-mi-counting.NumUncondBR            |  135770 |  135543 |  -227 | -0.17% |
| x86-mi-counting.NumCondBR              |  423753 |  422187 | -1566 | -0.37% |
| x86-mi-counting.NumCMOV                |   24815 |   25731 |   916 |  3.69% |
| x86-mi-counting.NumVecBlend            |      17 |      17 |     0 |  0.00% |

We significantly decrease basic block count, notably decrease instruction count, significantly decrease branch count and very significantly increase `cmov` count.

Performance-wise, unsurprisingly, this has great effect on target RawSpeed benchmark. I'm seeing 5 major improvements:
```
Benchmark                                                                 Time       CPU  Time Old  Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue     0.0000    0.0000  U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean      -0.3064   -0.3064  226.9913  157.4452  226.9800  157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median    -0.3057   -0.3057  226.8407  157.4926  226.8282  157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev    -0.4985   -0.4954    0.3051    0.1530    0.3040    0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue      0.0000    0.0000  U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean       -0.1747   -0.1747   80.4787   66.4227   80.4771   66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median     -0.1742   -0.1743   80.4686   66.4542   80.4690   66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev     +0.6089   +0.5797    0.0670    0.1078    0.0673    0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue     0.0000    0.0000  U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean      -0.1598   -0.1598  171.6996  144.2575  171.6915  144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median    -0.1598   -0.1597  171.7109  144.2755  171.7018  144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev    +0.4024   +0.3850    0.0847    0.1187    0.0848    0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue      0.0000    0.0000  U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean       -0.0550   -0.0551  280.3046  264.8800  280.3017  264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median     -0.0554   -0.0554  280.2628  264.7360  280.2574  264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev     +0.7005   +0.7041    0.2779    0.4725    0.2775    0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue      0.0000    0.0000  U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean       -0.0354   -0.0355  316.7396  305.5208  316.7342  305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median     -0.0354   -0.0356  316.6969  305.4798  316.6917  305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev     +0.0493   +0.0330    0.3562    0.3737    0.3563    0.3681
```

That being said, it's always best-effort, so there will likely be cases where this worsens things.

Reviewers: efriedma, craig.topper, dmgreen, jmolloy, fhahn, Carrot, hfinkel, chandlerc
Reviewed By: jmolloy
Subscribers: xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67318
llvm-svn: 372009
2019-09-16 16:18:24 +00:00
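
The policy change described in the commit message can be illustrated with a small standalone sketch. This is not the code from SimplifyCFG: the function names `canSpeculatePerBlock` and `canSpeculateTotal` are hypothetical, and the costs are plain integers standing in for whatever cost model the pass uses. It only shows why a per-block budget of 2 rejects a 3 + 0 split while a shared budget of 4 accepts it, and why the shared budget of 4 accepts everything the old rule did.

```cpp
// Hypothetical illustration only; these helpers do not exist in LLVM.

// Old behavior as described: each of the two blocks is checked against
// the same threshold independently.
constexpr bool canSpeculatePerBlock(unsigned CostA, unsigned CostB,
                                    unsigned PerBlockThreshold) {
  return CostA <= PerBlockThreshold && CostB <= PerBlockThreshold;
}

// New behavior as described: both blocks draw from one shared budget.
constexpr bool canSpeculateTotal(unsigned CostA, unsigned CostB,
                                 unsigned TotalThreshold) {
  return CostA + CostB <= TotalThreshold;
}

// Old rule, threshold 2: 2 + 2 is accepted, but 3 + 0 is rejected
// even though its total cost is lower.
static_assert(canSpeculatePerBlock(2, 2, 2), "2+2 fits the per-block rule");
static_assert(!canSpeculatePerBlock(3, 0, 2), "3+0 is rejected per block");

// New rule, threshold 4: every split that sums to at most 4 is accepted.
static_assert(canSpeculateTotal(2, 2, 4), "2+2 still folds");
static_assert(canSpeculateTotal(3, 1, 4), "3+1 now folds");
static_assert(canSpeculateTotal(4, 0, 4), "4+0 now folds");
static_assert(!canSpeculateTotal(3, 2, 4), "5 exceeds the shared budget");
```

Any pair of costs accepted by the per-block rule with threshold 2 sums to at most 4, so under these assumptions the shared budget of 4 is never stricter than the old rule.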
..
Analysis	[SCEV] Add smin support to getRangeRef	2019-09-12 21:32:27 +00:00
Assembler	Debug Info: Support for DW_AT_export_symbols for anonymous structs	2019-08-23 17:19:21 +00:00
Bindings	…
Bitcode	[BitcodeReader] Check if we can create a null constant for type.	2019-08-21 18:20:11 +00:00
BugPoint	…
CodeGen	[ARM] Add patterns for BSWAP intrinsic on MVE	2019-09-16 15:20:10 +00:00
DebugInfo	Revert "Fix test failures after r371640"	2019-09-13 08:26:59 +00:00
Demangle	…
Examples	…
ExecutionEngine	[JITLink] Don't under-align zero-fill sections.	2019-08-27 15:22:23 +00:00
Feature	[FPEnv] Add fptosi and fptoui constrained intrinsics.	2019-08-28 16:33:36 +00:00
FileCheck	[FileCheck] Forbid using var defined on same line	2019-09-02 14:04:00 +00:00
Instrumentation	[NewPM][Sancov] Make Sancov a Module Pass instead of 2 Passes	2019-09-04 20:30:29 +00:00
Integer	…
JitListener	…
LTO	[IRMover] Don't map globals if their types are the same	2019-09-11 18:35:49 +00:00
Linker	Remove some unnecessary REQUIRES: shell lines	2019-09-10 00:06:52 +00:00
MC	[WebAssembly] Narrowing and widening SIMD ops	2019-09-13 22:54:41 +00:00
MachineVerifier	Remove unnecessary REQUIRES from a test.	2019-08-24 02:39:51 +00:00
Object	[llvm-ar] Uncapitalize error messages and delete full stop	2019-09-14 01:18:47 +00:00
ObjectYAML	[yaml2obj/ObjectYAML] - Cleanup the error reporting API, add custom errors handlers.	2019-09-13 16:00:16 +00:00
Other	Revert "Reland "r364412 [ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline.""	2019-09-10 10:39:09 +00:00
Reduce	Fix llvm-reduce tests so that they don't assume the source code is	2019-09-12 21:03:49 +00:00
SafepointIRVerifier	…
Support	…
SymbolRewriter	…
TableGen	[CodeEmitter] Improve testing for APInt encoding	2019-09-15 08:44:40 +00:00
ThinLTO/X86	Reland "clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM"	2019-09-11 16:19:50 +00:00
Transforms	[SimplifyCFG] FoldTwoEntryPHINode(): consider total speculation cost, not per-BB cost	2019-09-16 16:18:24 +00:00
Unit	…
Verifier	[Intrinsic] Add the llvm.umul.fix.sat intrinsic	2019-09-07 12:16:14 +00:00
YAMLParser	…
tools	[llvm-objcopy] Ignore -B --binary-architecture=	2019-09-14 01:36:31 +00:00
.clang-format	…
CMakeLists.txt	[llvm-ifs][IFS] llvm Interface Stubs merging + object file generation tool.	2019-08-30 18:26:05 +00:00
TestRunner.sh	…
lit.cfg.py	[llvm-ifs][IFS] llvm Interface Stubs merging + object file generation tool.	2019-08-30 18:26:05 +00:00
lit.site.cfg.py.in	…