llvm-project/llvm/test
Roman Lebedev 10151f6618 [SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost
Summary:
Previously, if the threshold was 2, we were willing to speculatively
execute 2 cheap instructions in both basic blocks (thus we were willing
to speculatively execute cost = 4), but weren't willing to speculate
when one BB had 3 instructions and other one had no instructions,
even thought that would have total cost of 3.

This looks inconsistent to me.
I don't think `cmov`-like instructions will start executing
until both of it's inputs are available: https://godbolt.org/z/zgHePf
So i don't see why the existing behavior is the correct one.

Also, let's add it's own `cl::opt` for this threshold,
with default=4, so it is not stricter than the previous threshold:
will allow to fold when there are 2 BB's each with cost=2.
And since the logic has changed, it will also allow to fold when
one BB has cost=3 and other cost=1, or there is only one BB with cost=4.

This is an alternative solution to D65148:
This fix is mainly motivated by `signbit-like-value-extension.ll` test.
That pattern comes up in JPEG decoding, see e.g.
`Figure F.12 – Extending the sign bit of a decoded value in V`
of `ITU T.81` (JPEG specification).
That branch is not predictable, and it is within the innermost loop,
so the fact that that pattern ends up being stuck with a branch
instead of `select` (i.e. `CMOV` for x86) is unlikely to be beneficial.

This has great results on the final assembly (vanilla test-suite + RawSpeed): (metric pass - D67240)
| metric                                 |     old |     new | delta |      % |
| x86-mi-counting.NumMachineFunctions    |   37720 |   37721 |     1 |  0.00% |
| x86-mi-counting.NumMachineBasicBlocks  |  773545 |  771181 | -2364 | -0.31% |
| x86-mi-counting.NumMachineInstructions | 7488843 | 7486442 | -2401 | -0.03% |
| x86-mi-counting.NumUncondBR            |  135770 |  135543 |  -227 | -0.17% |
| x86-mi-counting.NumCondBR              |  423753 |  422187 | -1566 | -0.37% |
| x86-mi-counting.NumCMOV                |   24815 |   25731 |   916 |  3.69% |
| x86-mi-counting.NumVecBlend            |      17 |      17 |     0 |  0.00% |

We significantly decrease basic block count, notably decrease instruction count,
significantly decrease branch count and very significantly increase `cmov` count.

Performance-wise, unsurprisingly, this has great effect on
target RawSpeed benchmark. I'm seeing 5 **major** improvements:
```
Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681
```

That being said, it's always best-effort, so there will likely
be cases where this worsens things.

Reviewers: efriedma, craig.topper, dmgreen, jmolloy, fhahn, Carrot, hfinkel, chandlerc

Reviewed By: jmolloy

Subscribers: xbolva00, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67318

llvm-svn: 372009
2019-09-16 16:18:24 +00:00
..
Analysis [SCEV] Add smin support to getRangeRef 2019-09-12 21:32:27 +00:00
Assembler Debug Info: Support for DW_AT_export_symbols for anonymous structs 2019-08-23 17:19:21 +00:00
Bindings
Bitcode [BitcodeReader] Check if we can create a null constant for type. 2019-08-21 18:20:11 +00:00
BugPoint
CodeGen [ARM] Add patterns for BSWAP intrinsic on MVE 2019-09-16 15:20:10 +00:00
DebugInfo Revert "Fix test failures after r371640" 2019-09-13 08:26:59 +00:00
Demangle
Examples
ExecutionEngine [JITLink] Don't under-align zero-fill sections. 2019-08-27 15:22:23 +00:00
Feature [FPEnv] Add fptosi and fptoui constrained intrinsics. 2019-08-28 16:33:36 +00:00
FileCheck [FileCheck] Forbid using var defined on same line 2019-09-02 14:04:00 +00:00
Instrumentation [NewPM][Sancov] Make Sancov a Module Pass instead of 2 Passes 2019-09-04 20:30:29 +00:00
Integer
JitListener
LTO [IRMover] Don't map globals if their types are the same 2019-09-11 18:35:49 +00:00
Linker Remove some unnecessary REQUIRES: shell lines 2019-09-10 00:06:52 +00:00
MC [WebAssembly] Narrowing and widening SIMD ops 2019-09-13 22:54:41 +00:00
MachineVerifier Remove unnecessary REQUIRES from a test. 2019-08-24 02:39:51 +00:00
Object [llvm-ar] Uncapitalize error messages and delete full stop 2019-09-14 01:18:47 +00:00
ObjectYAML [yaml2obj/ObjectYAML] - Cleanup the error reporting API, add custom errors handlers. 2019-09-13 16:00:16 +00:00
Other Revert "Reland "r364412 [ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline."" 2019-09-10 10:39:09 +00:00
Reduce Fix llvm-reduce tests so that they don't assume the source code is 2019-09-12 21:03:49 +00:00
SafepointIRVerifier
Support
SymbolRewriter
TableGen [CodeEmitter] Improve testing for APInt encoding 2019-09-15 08:44:40 +00:00
ThinLTO/X86 Reland "clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM" 2019-09-11 16:19:50 +00:00
Transforms [SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost 2019-09-16 16:18:24 +00:00
Unit
Verifier [Intrinsic] Add the llvm.umul.fix.sat intrinsic 2019-09-07 12:16:14 +00:00
YAMLParser
tools [llvm-objcopy] Ignore -B --binary-architecture= 2019-09-14 01:36:31 +00:00
.clang-format
CMakeLists.txt [llvm-ifs][IFS] llvm Interface Stubs merging + object file generation tool. 2019-08-30 18:26:05 +00:00
TestRunner.sh
lit.cfg.py [llvm-ifs][IFS] llvm Interface Stubs merging + object file generation tool. 2019-08-30 18:26:05 +00:00
lit.site.cfg.py.in