A common debugging pattern is to set a breakpoint that only stops after
it has been hit a given number of times. The current implementation never
resets the hit count of breakpoints; as such, if a user re-`run`s their
program, the debugger will never stop on such a breakpoint again.
This behavior is arguably undesirable, as it renders such breakpoints
ineffective on all but the first run. This commit changes the
implementation of the `Will{Launch, Attach}` methods so that they reset
the _target's_ breakpoint hit counts.
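To illustrate the semantics, here is a self-contained toy model (not lldb's actual classes) of a breakpoint that stops once a given number of hits has been recorded, and the reset that makes it fire again on a re-run:
```
#include <cstdio>

// Toy model only: a breakpoint that stops when a given number of hits
// has been recorded.
struct Breakpoint {
  unsigned StopAfter;    // stop when this many hits have been recorded
  unsigned HitCount = 0; // persists across runs unless explicitly reset
  bool ShouldStop() { return ++HitCount == StopAfter; }
  void ResetHitCount() { HitCount = 0; }
};

int main() {
  Breakpoint BP{3};
  for (int Run = 1; Run <= 2; ++Run) {
    BP.ResetHitCount(); // the reset Will{Launch,Attach} now performs
    for (int Hit = 1; Hit <= 4; ++Hit)
      if (BP.ShouldStop())
        std::printf("run %d: stopped (hit count reached 3)\n", Run);
  }
  // Without ResetHitCount(), run 2 starts with HitCount == 4 and the
  // breakpoint never fires again -- exactly the bug described above.
}
```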
Differential Revision: https://reviews.llvm.org/D133858
When working in a mixed environment of VS Code and Vim with a team-configured Prettier setup, clang-format and Prettier can end up fighting each other over the formatting of arrays, even simple arrays of elements.
This review aims to add some "control knobs" to the JSON formatting in clang-format to help align the two tools so they can be used interchangeably.
This will allow simple arrays like `[1, 2, 3]` to remain on a single line, while arrays with more complex contents are broken based on the context within the array.
Happy to change the name of the option (this is the third name I tried).
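Assuming the knob keeps the name `BreakArrays` (the name is still under discussion per the note above), usage through the clang-format library API might look like this sketch:
```
#include "clang/Format/Format.h"
#include "clang/Tooling/Core/Replacement.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::StringRef Code = "[1, 2, 3]";
  clang::format::FormatStyle Style =
      clang::format::getLLVMStyle(clang::format::FormatStyle::LK_Json);
  Style.BreakArrays = false; // assumed option name: keep simple arrays inline
  clang::tooling::Replacements Replaces = clang::format::reformat(
      Style, Code, {clang::tooling::Range(0, Code.size())}, "input.json");
  llvm::Expected<std::string> Result =
      clang::tooling::applyAllReplacements(Code, Replaces);
  if (Result)
    llvm::outs() << *Result << "\n"; // expect "[1, 2, 3]" to stay on one line
  return 0;
}
```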
Reviewed By: HazardyKnusperkeks, owenpan
Differential Revision: https://reviews.llvm.org/D133589
Use opaque pointers for the test case
llvm/test/Transforms/SimplifyCFG/preserve-llvm-loop-metadata.ll.
- Adjust the variable numbering accordingly, since bitcasts between different
pointer types are no longer necessary (e.g., a `bitcast i32* %p to i8*` goes
away once both operands are simply `ptr`).
Differential Revision: https://reviews.llvm.org/D134159
After BOLT's merge to LLVM, there are two (almost identical) versions of the
code layout algorithm. The diff unifies the implementations by keeping the one
in LLVM.
There are mild changes in the resulting block orders. I tested the changes
extensively, both on the clang binary and on production services, and didn't
see statistically significant differences on average.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D129895
Summary:
According to the nm documentation for AIX (https://www.ibm.com/docs/en/aix/7.2?topic=n-nm-command):
On AIX, the default is to process 32-bit object files (64-bit objects are ignored). The mode can also be set with the OBJECT_MODE environment variable; for example, OBJECT_MODE=64 causes nm to process any 64-bit objects and ignore 32-bit objects. The -X flag overrides the OBJECT_MODE variable.
On non-AIX systems, the default is to process all supported object files, and the OBJECT_MODE environment variable is not supported.
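A usage sketch of the AIX semantics being matched (flag spellings per the AIX documentation linked above; llvm-nm's exact spelling follows the patch):
```
$ OBJECT_MODE=64 nm foo.o   # consider only 64-bit objects
$ nm -X32_64 foo.o          # consider both; -X overrides OBJECT_MODE
```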
Reviewers: James Henderson
Differential Revision: https://reviews.llvm.org/D132494
In https://llvm.org/D56913, we added an emulation for the __atomic_always_lock_free
compiler builtin when compiling in Freestanding mode. However, the emulation
did not (and could not) give exactly the same answer as the compiler builtin,
which led to a potential ABI break for e.g. enum classes.
After speaking to the original author of D56913, we agree that the correct
behavior is to instead always use the compiler builtin, since that provides
a more accurate answer, and __atomic_always_lock_free is a purely front-end
builtin which doesn't require any runtime support. Furthermore, it is
available regardless of the Standard mode (see https://godbolt.org/z/cazf3ssYY).
However, this patch does constitute an ABI break. As shown by https://godbolt.org/z/1eoex6zdK:
- In LLVM <= 11.0.1, an atomic<enum class with 1 byte> would not contain a lock byte.
- In LLVM >= 12.0.0, an atomic<enum class with 1 byte> would contain a lock byte.
This patch breaks the ABI again to bring it back to 1 byte, which seems
like the correct thing to do.
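A self-contained illustration of the ABI-visible difference (mirroring the godbolt links above; the size assertions assume a typical platform where a 1-byte atomic is lock-free):
```
#include <atomic>

enum class Color : char { Red, Green, Blue }; // 1-byte enum class

// With __atomic_always_lock_free answering accurately, no lock byte is
// needed and the atomic stays 1 byte wide, as in LLVM <= 11.0.1 and again
// after this patch.
static_assert(std::atomic<Color>::is_always_lock_free, "");
static_assert(sizeof(std::atomic<Color>) == 1, "");

int main() {
  std::atomic<Color> C{Color::Red};
  C.store(Color::Blue);
  return C.load() == Color::Blue ? 0 : 1;
}
```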
Fixes #57440
Differential Revision: https://reviews.llvm.org/D133377
This feature relies on Relations in the index being complete.
An out-of-tree index implementation is missing some override relations, so
such renames end up breaking the code.
We plan to fix it, but this flag is a cheap band-aid for now.
Differential Revision: https://reviews.llvm.org/D133440
This revision adds a new op, `map_nested_foreach_thread_to_gpu_threads`, to the transform dialect. The op searches for `scf.foreach_thread` ops nested inside the `gpu.launch` and distributes them via the `gpu.thread_id` attribute.
Loop mapping is explicit and given by the `map_nested_foreach_thread_to_gpu_threads` op. Mapping is done one-to-one, therefore the loops disappear.
For the time being, trip counts that are dynamic or larger than the thread sizes are not supported. However, the compiler could support these cases in the future by generating a loop with static cyclic scheduling.
The current mechanism allows `scf.foreach_thread` ops to be siblings or nested, but there cannot be interleaving code between the loops when they are nested.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D133950
Implement the exp10f function, correctly rounded for all rounding modes.
Algorithm: perform range reduction, writing
```
10^x = 2^(hi + mid) * 10^lo
```
where:
```
hi is an integer,
mid * 2^5 is an integer with 0 <= mid * 2^5 < 2^5,
-log10(2) / 2^6 <= lo <= log10(2) / 2^6
```
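Concretely, one way to realize this reduction (a sketch; the constants used in the actual implementation may differ):
```
k   = round(x * 2^5 / log10(2))     // nearest integer
hi  = floor(k / 2^5)
mid = k / 2^5 - hi                  // one of 32 values in [0, 1)
lo  = x - k * log10(2) / 2^5        // hence |lo| <= log10(2) / 2^6
10^x = 10^(k * log10(2) / 2^5 + lo)
     = 2^(k / 2^5) * 10^lo
     = 2^(hi + mid) * 10^lo
```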
Then `2^mid` is stored in a table of 32 entries and the product `2^hi * 2^mid` is
performed by adding `hi` into the exponent field of `2^mid`.
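The exponent trick can be sketched in a few lines of self-contained C++ (illustrative only; the actual implementation manipulates the bits through the library's own utilities and handles overflow/underflow separately):
```
#include <cstdint>
#include <cstring>

// Multiply t = 2^mid (a table entry) by 2^hi by adding hi directly to the
// biased exponent field of the double. Valid as long as the result stays
// within the normal range (no overflow/underflow handling here).
static double mul_pow2(double t, int hi) {
  uint64_t bits;
  std::memcpy(&bits, &t, sizeof(bits));
  bits += static_cast<uint64_t>(static_cast<int64_t>(hi)) << 52;
  std::memcpy(&t, &bits, sizeof(t));
  return t;
}
```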
`10^lo` is then approximated by a degree-5 minimax polynomial generated by Sollya with:
```
> P = fpminimax((10^x - 1)/x, 4, [|D...|], [-log10(2)/64, log10(2)/64]);
```
Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp10f
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput : 10.215
System LIBC reciprocal throughput : 7.944
LIBC reciprocal throughput : 38.538
LIBC reciprocal throughput : 12.175 (with `-msse4.2` flag)
LIBC reciprocal throughput : 9.862 (with `-mfma` flag)
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp10f --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency : 40.744
System LIBC latency : 37.546
BEFORE
LIBC latency : 48.989
LIBC latency : 44.486 (with `-msse4.2` flag)
LIBC latency : 40.221 (with `-mfma` flag)
```
This patch relies on https://reviews.llvm.org/D134002
Reviewed By: orex, zimmermann6
Differential Revision: https://reviews.llvm.org/D134104
Given an opOperand uniquely determined by the operation `%op` and the operand number `num`,
`transform.get_producer_of_operand %op[num]` returns a handle to the unique operation
that produces the SSA value used as that opOperand.
The transform fails if the operand is a block argument.
Differential Revision: https://reviews.llvm.org/D134171
This is required because, if there is a pure loop-invariant instruction, Loop Rotation
may decide not to clone it and just hoist it instead. If SCEV has previously cached
that it was loop-variant (not being smart enough to prove invariance), we may end
up with an inconsistent cache state (which may later trigger false-negative assertion
failures checking that something was invariant).
This is a conservative fix that unconditionally drops the dispositions. We could
drop them only if the hoisting has actually happened, but it would take some time
to understand whether that is safe with respect to everything else this function does.
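An illustrative source-level sketch of the hazard (names are hypothetical; the cached disposition itself is SCEV-internal state):
```
// Illustrative only: what Loop Rotation's choice looks like at source level.
extern void use(int, int);

void before_rotation(int a, int b, int n) {
  for (int i = 0; i < n; ++i) {
    int inv = a + b; // loop-invariant, but SCEV may have cached it as variant
    use(inv, i);
  }
}

void after_rotation(int a, int b, int n) {
  int inv = a + b; // hoisted into the preheader rather than cloned; the stale
                   // "loop-variant" disposition must now be dropped, or later
                   // assertions comparing cached vs. recomputed state can fire
  for (int i = 0; i < n; ++i)
    use(inv, i);
}
```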
Differential Revision: https://reviews.llvm.org/D134167
Reviewed By: fhahn
Alive2 doesn't support verification of optimizations that use inter-procedural analyses.
Right now, clang uses GlobalsAA by default and there's no way to disable it.
This leads to Alive2 producing false positives.
The added flag allows us to skip global analyses altogether.
Differential Revision: https://reviews.llvm.org/D134139
The batch-reduce GEMM kernel essentially multiplies a sequence of input tensor
blocks (which form a batch) and the partial multiplication results are reduced
into a single output tensor block.
See: https://ieeexplore.ieee.org/document/9139809 for more details.
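In formula form (shapes shown for illustration; `n` is the batch size):
```
C = C + sum_{i = 1..n} A(i) * B(i),   A(i): M x K,   B(i): K x N,   C: M x N
```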
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D134163
This bug was found by a recent improvement in the SCEV verifier. The code in LoopFuse
directly reassigns blocks to be a part of a different loop; such a change should
automatically invalidate all related cached loop dispositions.
Differential Revision: https://reviews.llvm.org/D134173
Reviewed By: nikic
Previously this only used the UnsafeFPMath option; it now looks for the
fast math flags on the instructions, using the same flags as other
backends.
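A minimal sketch of such a per-instruction check, assuming a hypothetical `allowUnsafeFPOpt` helper (the real backend hook and the exact set of flags it consults may differ):
```
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Operator.h"
#include "llvm/Target/TargetOptions.h"

// Hypothetical helper: prefer the per-instruction fast-math flags over the
// global UnsafeFPMath target option.
static bool allowUnsafeFPOpt(const llvm::Instruction &I,
                             const llvm::TargetOptions &Opts) {
  if (const auto *FPOp = llvm::dyn_cast<llvm::FPMathOperator>(&I))
    return FPOp->isFast(); // all fast-math flags are set on this instruction
  return Opts.UnsafeFPMath; // fall back to the global option
}
```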