This is one of those wonderful "in theory X doesn't matter, but in practice it does" changes. In this particular case, we shift the IVs inserted by the runtime unroller to clamp the iteration count of the loops* from decrementing to incrementing. (A C-level sketch of the two IV shapes follows the notes below.)
Why does this matter? A couple of reasons:
* SCEV doesn't have a native subtract node. Instead, all subtracts (A - B) are represented as A + -1 * B, and any flags invalidated by that rewrite are dropped. As a result, SCEV is slightly worse at reasoning about edge cases involving decrementing addrecs than incrementing ones. (You can see this in the inferred flags in some of the test cases.)
* Other parts of the optimizer produce incrementing IVs, and they're common in idiomatic source code. We do have support for reversing IVs, but in general if we produce one of each, the pair will persist surprisingly far through the optimizer before being coalesced. (You can see this by looking at nearby phis in the test cases.)
Note that if the hardware prefers decrementing (i.e. zero tested) loops, LSR should convert back immediately before codegen.
* Mostly irrelevant detail: The main loop of the prolog case is handled independently and will simply use the original IV with a changed start value. We could in theory use this scheme for all iteration clamping, but that would be a larger and more invasive change.
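For concreteness, here is a C-level sketch of the two clamp-IV shapes (illustrative only; the unroller emits IR, and the names here are invented):
```
// Remainder-loop clamp for an unroll factor of 4 (illustrative C++, not the
// unroller's actual output).

// Decrementing form: SCEV models the rem - 1 step as rem + (-1) * 1 and may
// drop flags, so edge-case reasoning is weaker.
void prologDecrementing(unsigned count, void (*body)()) {
  for (unsigned rem = count % 4; rem != 0; --rem)
    body();
}

// Incrementing form: matches the IVs the rest of the optimizer produces, so
// nearby phis coalesce sooner.
void prologIncrementing(unsigned count, void (*body)()) {
  for (unsigned i = 0; i != count % 4; ++i)
    body();
}
```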
The division by constant optimization often produces constants that
are uimm32, but not simm32. These constants require 3 or 4 instructions
to materialize without Zba.
Since these constants are often used by a multiply whose LHS needs to be
zero extended with an AND, we can switch the MUL to a MULHU by shifting
both inputs left by 32. Once we shift the constant left, its upper 32 bits
no longer need to be zero, so constant materialization is free to use
LUI+ADDIW. This reduces the constant materialization from 4 instructions
to 3 in some cases, while also reducing the zero extension of the LHS from
2 shifts to 1.
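A minimal sketch of the arithmetic identity involved, assuming GCC/Clang's `__int128` extension (the constant below is an illustrative divide-by-5 magic number, not taken from the patch):
```
#include <cassert>
#include <cstdint>

// For 32-bit x and a uimm32 constant c, the high 64 bits of the 128-bit
// product (x << 32) * (c << 32) equal x * c, so a zext+MUL sequence can
// become SLLI+MULHU, and the shifted constant is cheaper to materialize.
static uint64_t mulhu(uint64_t a, uint64_t b) {
  return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}

int main() {
  uint64_t x = 0x12345678;  // value whose upper 32 bits would need clearing
  uint64_t c = 0xCCCCCCCD;  // uimm32 but not simm32 (divide-by-5 magic)
  assert(mulhu(x << 32, c << 32) == x * c);
}
```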
Differential Revision: https://reviews.llvm.org/D113805
Fix some confusion between the two types of `const` a pointer/iterator
can have. Users of a SwitchInst::CaseIterator should not (and do not!)
manually mutate the SwitchInst::CaseHandle that tracks its internal
state. Change operator*() to return `const CaseHandle&`, remove the
non-const-qualified operator*(), and const-qualify
CaseHandle::setValue() and CaseHandle::setSuccessor().
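A minimal analog of the resulting shape, using hypothetical stand-in types (not LLVM's actual classes):
```
#include <cassert>

struct Handle { int Value = 0; }; // stand-in for SwitchInst::CaseHandle

struct CaseIt {                   // stand-in for SwitchInst::CaseIterator
  Handle H;
  // Single const-qualified overload returning reference-to-const: callers
  // can no longer mutate the handle that tracks the iterator's state.
  const Handle &operator*() const { return H; }
};

int main() {
  CaseIt It;
  assert((*It).Value == 0);
  // (*It).Value = 1; // now ill-formed, as intended
}
```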
Differential Revision: https://reviews.llvm.org/D113788
Take advantage of class name injection to avoid redundantly specifying
template parameters of iterator adaptor/facade base classes.
No functionality change, although the private typedefs changed in a
couple of cases.
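A minimal sketch of the injected-class-name trick, with a hypothetical stand-in for the real base class template:
```
#include <type_traits>

template <typename DerivedT, typename WrappedT> struct adaptor_base {};

struct my_iterator : adaptor_base<my_iterator, int *> {
  // The base's injected-class-name is visible in the derived class, so the
  // base can be named without repeating its template parameters:
  using BaseT = adaptor_base;
};

static_assert(std::is_same_v<my_iterator::BaseT,
                             adaptor_base<my_iterator, int *>>);
```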
- Added a private typedef HashTableIterator::BaseT, following the
  pattern from r207084 / 3478d4b164, to pre-emptively appease MSVC
  (maybe it's not necessary anymore, but it looks like we do this
  pretty consistently).
- Otherwise, removed the private typedefs: filter_iterator_impl::BaseT
  and FilterIteratorTest::InputIterator::BaseT each had only one use,
  and the definition was no longer interesting.
* The format_arg attribute tells the compiler that the attributed function
returns a format string that is compatible with a format string being
passed as a specific argument. (A plain-C/C++ sketch follows this list.)
* Several NSString methods return copies of their input, so they would ideally
have the format_arg attribute. A previous differential (D112670) added
support for instancetype methods having the format_arg attribute when used
in the context of NSString method declarations.
* D112670 failed to account for the fact that instancetype can be sugared
in certain narrow (but critical) scenarios, e.g. by nullability specifiers.
This patch resolves that problem.
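For reference, here is what format_arg means in plain C/C++ (a sketch of the attribute's general behavior; the patch itself concerns the Objective-C instancetype case):
```
#include <cstdio>

// format_arg(1): the returned string is a format string derived from
// argument 1, so -Wformat diagnostics flow through the call below.
__attribute__((format_arg(1)))
static const char *translate(const char *Fmt) { return Fmt; }

int main() {
  int N = 42;
  std::printf(translate("%d\n"), N); // "%d" is still checked against N
}
```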
Differential Revision: https://reviews.llvm.org/D113636
Reviewed By: ahatanak
Radar-Id: rdar://85278860
We missed these tests in the earlier XFAIL-ing because the locale.fr_FR.UTF-8
feature wasn't available, but since an upgrade they are now showing up
on the CI.
Differential Revision: https://reviews.llvm.org/D113791
Fortran defines an ENTRY point name as being pure if its enclosing
subprogram scope defines a pure procedure.
Differential Revision: https://reviews.llvm.org/D113711
Similar to D113702, but for the LSDAs. Clang seems to emit all LSDA
relocs as section relocs, but ld -r can turn those relocs into symbol
ones.
Reviewed By: #lld-macho, oontvoo
Differential Revision: https://reviews.llvm.org/D113721
Dedup'ing unwind info is tricky because each CUE (compact unwind entry)
contains a different function address: if ICF operated naively and compared
the entire contents of each CUE, entries with identical unwind info but
belonging to different functions would never be considered identical. To work
around this problem, we slice away the function address before
performing ICF. We rely on `relocateCompactUnwind()` to correctly handle
these truncated input sections.
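A hypothetical sketch of the slicing idea (plain C++, not the linker's actual code):
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compare two CUEs while ignoring the leading function-address field, so
// entries with identical unwind info for different functions can be folded.
// Assumes both entries are at least Skip bytes long.
bool sameUnwindInfo(const std::vector<uint8_t> &A,
                    const std::vector<uint8_t> &B) {
  const std::size_t Skip = sizeof(uint64_t); // assumed address-field size
  return std::equal(A.begin() + Skip, A.end(), B.begin() + Skip, B.end());
}
```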
Here are the numbers before and after D109944, D109945, and this diff
were applied, as tested on my 3.2 GHz 16-Core Intel Xeon W:
Without any optimizations:
base diff difference (95% CI)
sys_time 0.849 ± 0.015 0.896 ± 0.012 [ +4.8% .. +6.2%]
user_time 3.357 ± 0.030 3.512 ± 0.023 [ +4.3% .. +5.0%]
wall_time 3.944 ± 0.039 4.032 ± 0.031 [ +1.8% .. +2.6%]
samples 40 38
With `-dead_strip`:
base diff difference (95% CI)
sys_time 0.847 ± 0.010 0.896 ± 0.012 [ +5.2% .. +6.5%]
user_time 3.377 ± 0.014 3.532 ± 0.015 [ +4.4% .. +4.8%]
wall_time 3.962 ± 0.024 4.060 ± 0.030 [ +2.1% .. +2.8%]
samples 47 30
With `-dead_strip` and `--icf=all`:
base diff difference (95% CI)
sys_time 0.935 ± 0.013 0.957 ± 0.018 [ +1.5% .. +3.2%]
user_time 3.472 ± 0.022 6.531 ± 0.046 [ +87.6% .. +88.7%]
wall_time 4.080 ± 0.040 5.329 ± 0.060 [ +30.0% .. +31.2%]
samples 37 30
Unsurprisingly, ICF is now a lot slower, likely due to the much larger
number of input sections it needs to process. But the rest of the
linker only suffers a mild slowdown.
Note that the compact-unwind-bad-reloc.s test was expanded because we
now handle the relocation for a CUE's function address in a separate code
path from the rest of the CUE relocations. The extended test covers both
code paths.
Reviewed By: #lld-macho, oontvoo
Differential Revision: https://reviews.llvm.org/D109946
These patterns were noted in the recent D113212 and follow-ups.
I did not bother to duplicate every test because it should be
clear if we recognize the swaps from a smaller sample. We have
complete coverage for the original patterns.
Previously we set the `isFuncEntry` flag to true when the funcName from DWARF was equal to the name in the symbol table, and we used this flag to avoid reporting callsite samples that come from intra-function branches. However, in HHVM, the symbol table name appears to be inconsistent with the DWARF function name, likely due to `OptimizeGlobalAliases`.
This change is a workaround on the llvm-profgen side to mark only one range as the function entry and to warn about the remaining inconsistencies.
It also fixes a missing `getCanonicalFnName` for the symbol name, which caused some of the mismatches.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D113492
Instead of pretending that function pointer type aliases or variables
are functions, and thereby losing the information that they are type
aliases or variables, respectively, we use the existence of a return
type in the DeclInfo to signify a "function-like" object.
That seems pretty natural, since it's also the return type (or parameter
list) from the DeclInfo that we compare the documentation with.
Addresses a concern voiced in D111264#3115104.
Reviewed By: gribozavr2
Differential Revision: https://reviews.llvm.org/D113691
Then we don't have to look into the declaration again. Also it's only
natural to collect this information alongside parameters and return
type, as it's also just a parameter in some sense.
Reviewed By: gribozavr2
Differential Revision: https://reviews.llvm.org/D113690
We noticed that the library structure causes link ordering problems in Google's internal build. However, we don't think the problem is specific to Google's build; it can probably be reproduced anywhere with the right library structure.
In general, splitting the Python bindings from their dependencies (the C API targets) creates the possibility that the two libraries end up in the wrong order on the linker command line. We can avoid this problem by reverting the structure of MLIRBindingsPythonCore to represent its dependencies in the usual way, rather than composing an incomplete `MLIRBindingsPythonCoreNoCAPI` target with its CAPI dependencies. It was probably a mistake to rewrite this particular `cc_library()` rule in terms of the two, since nothing guarantees that they will be correctly ordered by the linker when both are linked into the same binary; it was only an incidental "cleanup" done in passing.
Otherwise the previous PR (D113565) is fine, since that was about the case where the two are built into separate shared libraries. It just shouldn't have made this (unrelated) change.
Reviewed By: GMNGeoffrey
Differential Revision: https://reviews.llvm.org/D113773
Clang seems to emit all functionAddress relocs as section relocs, but
`ld -r` can turn those relocs into symbol ones. It turns out that we
weren't handling that case correctly when the symbol was a weak def
whose definition did not prevail.
Reviewed By: #lld-macho, oontvoo
Differential Revision: https://reviews.llvm.org/D113702
This op was part of the very first discussion here, but wasn't upstreamed
before because we didn't use it yet. Fix that for pending updates. This
patch just adds the op; a follow-up will add the lowering to codegen.
`const`-qualify ParsedAttr::iterator::operator*(), clearing up confusion
about the two meanings of const for pointers/iterators. Helps unblock
removal of (non-const) iterator_facade_base::operator->().
The unrolling code previously inserted new cloned blocks at the end of the function. With typical loop structures, the result is that the new iterations are placed far from the initial iteration.
With unrolling, the general assumption is that a) the loop is reasonably hot, and b) the first Count-1 copies of the loop are rarely (if ever) loop exiting. As such, placing Count-1 copies out of line is a fairly poor code placement choice; we'd much rather fall through into the hot (non-exiting) path. For code with branch profiles, later layout would fix this, but this may have a positive impact on non-PGO compiled code.
However, the real motivation for this change isn't performance; it's readability and human understanding. Having to jump around long distances in an IR file to trace an unrolled loop structure is error prone and tedious.
Profiling a basic internal real input read benchmark shows some
hot spots in the code used to prepare input for decimal-to-binary
conversion, which is of course where the time should be spent.
The library that implements decimal to/from binary conversions has
been optimized, but not the code in the Fortran runtime that calls it,
and there are some obvious light changes worth making here.
Move some member functions from *.cpp files into the class definitions
of Descriptor and IoStatementState to enable inlining and specialization.
Make GetNextInputBytes() the new basic input API within the
runtime, replacing GetCurrentChar() -- which is rewritten in terms of
GetNextInputBytes -- so that input routines can acquire more than one
input character at a time and amortize the overhead.
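A hedged sketch of the batched-input pattern; the names and signatures here are illustrative, not the runtime's exact API:
```
#include <cstddef>

struct FakeInputState { // illustrative stand-in for the I/O state
  const char *Buf;
  std::size_t Avail;
  // Return a pointer to the next run of input bytes and how many are there,
  // so callers can scan several characters per call instead of one.
  std::size_t GetNextInputBytes(const char *&P) {
    P = Buf;
    return Avail;
  }
};

static std::size_t CountLeadingDigits(FakeInputState &IO) {
  const char *P = nullptr;
  std::size_t N = IO.GetNextInputBytes(P);
  std::size_t Digits = 0;
  while (N > 0 && *P >= '0' && *P <= '9') {
    ++P; --N; ++Digits;
  }
  return Digits;
}

int main() {
  FakeInputState IO{"123.45", 6};
  return CountLeadingDigits(IO) == 3 ? 0 : 1;
}
```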
These changes speed up the time to read 1M random reals
using internal I/O from a character array from 1.29s to 0.54s
on my machine, which is on par with Intel Fortran and much faster than
GNU Fortran.
Differential Revision: https://reviews.llvm.org/D113697
Previously if you passed `-oso_prefix path/to/foo/` with a trailing
slash at the end, using `real_path` would remove that slash, but that
slash is necessary to make sure OSO prefix paths end up as valid
relative paths instead of starting with `/`.
Differential Revision: https://reviews.llvm.org/D113541
This fixes const-correctness of iterator adaptors, dropping non-`const`
overloads for `operator*()`.
Iterators, like the pointers that they generalize, have two types of
`const`.
The `const` qualifier on members indicates whether the iterator itself
can be changed. This is analogous to `int *const`.
The `const` qualifier on return values of `operator*()`, `operator[]()`,
and `operator->()` controls whether the pointed-to value can be
changed. This is analogous to `const int *`.
Since `operator*()` does not (in principle) change the iterator, there
should be only one definition, which is `const`-qualified. E.g.,
iterators wrapping `int*` should look like:
```
int *operator*() const; // always const-qualified, no overloads
```
ba7a6b314f changed `iterator_adaptor_base`
away from this to work around bugs in other iterator adaptors. That was
already reverted. This patch adds back its test, which combined
llvm::enumerate() and llvm::make_filter_range(), adds a test for
iterator_adaptor_base itself, and cleans up the `const`-ness of the
other iterator adaptors.
This also updates the documented requirements for
`iterator_facade_base`:
```
/// OLD:
/// - const T &operator*() const;
/// - T &operator*();
/// NEW:
/// - T &operator*() const;
```
In a future commit we might also clean up `iterator_facade_base`'s
overloads of `operator->()` and `operator[]()`. These already (correctly)
return
non-`const` proxies regardless of the iterator's `const` qualifier.
Differential Revision: https://reviews.llvm.org/D113158
This case was complicated because someone had added new non-autogened tests to an autogened file. In particular, those new tests used two variables (%J and %j) which differed only in capitalization. The auto-updater doesn't distinguish case, so the auto-gened versions of the new tests failed with non-obvious errors.
There are two key lessons here:
1) Please don't use two values whose names differ only in case. This is problematic for automatic tooling, and it's also hard for a human to follow.
2) Please DO NOT add new tests to an autogened file without running autogen again. If autogen doesn't pass on your new tests, put them in a separate file.
When an Fw.d output edit descriptor has a "d" value exactly
equal to the number of zeroes after the decimal point for a value
(e.g., 0.07 with F5.1), the Fw.d output editing code needs to
do the rounding itself to either 0.0 or 0.1 after performing
a conversion without rounding (to avoid 0.04999 rounding up twice).
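A quick illustration of the double-rounding hazard, using the host printf rather than the Fortran runtime:
```
#include <cstdio>

int main() {
  // Rounding once to one digit: 0.04999 -> 0.0 (correct).
  std::printf("%.1f\n", 0.04999);
  // Rounding in two steps: 0.04999 -> 0.05 first, then 0.05 -> 0.1 (wrong).
  std::printf("%.1f\n", 0.05);
}
```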
Differential Revision: https://reviews.llvm.org/D113698
We don't use Python 2 anymore, so let us do the recommended fix instead
of using the workaround made for Python 2.
Differential Revision: https://reviews.llvm.org/D107715
When the environment variable NO_STOP_MESSAGE=1 is set,
assume that STOP statements with a successful stop code
have QUIET=.TRUE.
Differential Revision: https://reviews.llvm.org/D113701
The check was failing because it was matching against the end of the range, not
the start.
This bug wasn't causing the ORC-RT MachO TLV regression test to fail because
we were only logging deallocation errors (including TLV deregistration errors)
and not actually returning a failure code. This commit updates llvm-jitlink to
report the errors properly.
We need to skip the length field when generating error strings.
No test case: This hand-hacked deserializer should be removed in the near future
once JITLink can use generic ORC APIs (including SPS and WrapperFunction).
The ORDER= argument to the transformational intrinsic function RESHAPE
was being misinterpreted in an inverted way that could be detected only
with arrays of rank 3 or higher. Fix this in both folding and the runtime,
and extend the tests.
Differential Revision: https://reviews.llvm.org/D113699
[NFC] As part of using inclusive language within the llvm project, this patch
replaces `m_master` in `ASTImporterDelegate` with `m_main`.
Reviewed By: teemperor, clayborg
Differential Revision: https://reviews.llvm.org/D113720
This brings back the original version of D81359.
I have found several use cases now.
* Unlike GNU ld, LLD's relocation processing is one pass. If we decide to
optimize (relax) R_X86_64_{,REX_}GOTPCRELX, we suppress GOT generation and
cannot undo the decision later. Optimizing R_X86_64_REX_GOTPCRELX can make
it easy to hit `relocation R_X86_64_REX_GOTPCRELX out of range` because
the distance to the GOT is usually shorter. Without --no-relax, the user
has to recompile with `-Wa,-mrelax-relocations=no`.
* The option would help during my investigation of the root cause of https://git.kernel.org/linus/09e43968db40c33a73e9ddbfd937f46d5c334924
* There is a need for relaxation control for AArch64 & RISC-V as well.
Implementing this for x86-64 improves consistency at little
target-specific cost (a two-line X86_64.cpp change).
Reviewed By: alexander-shaposhnikov
Differential Revision: https://reviews.llvm.org/D113615
[NFC] This patch replaces master and slave with primary and secondary
respectively when referring to pseudoterminals/file descriptors.
Reviewed By: clayborg, teemperor
Differential Revision: https://reviews.llvm.org/D113687
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.
We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefer to extend to fast v32i16 shifts).
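A scalar sketch of the per-lane identity the new lowering exploits, exhaustively checked over all i8 values and shift amounts:
```
#include <cassert>
#include <cstdint>

int main() {
  // For each i8 lane, a left shift by constant c equals a multiply by 1 << c.
  for (unsigned X = 0; X < 256; ++X)
    for (unsigned C = 0; C < 8; ++C)
      assert(uint8_t(X << C) == uint8_t(X * (1u << C)));
}
```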
Small patch that renames blacklisted_global to blocked_global, with a corresponding change in comments.
Reviewed By: pgousseau
Differential Revision: https://reviews.llvm.org/D113692