This test, specifically `TwoSignalCallbacks`, can be a little bit flaky, failing in around 5/2000 runs.
POSIX says:
> If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.
The problem is that the test setup creates a new thread with `std::async`, and that thread is occasionally not cleaned up. This leaves it available to eat the signal we're polling for.
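A minimal sketch of the hazard (illustrative only; the names and setup here are hypothetical, not the actual test code):
```cpp
#include <future>
#include <signal.h>
#include <unistd.h>

int main() {
  signal(SIGUSR1, [](int) {});  // ignore, so delivery doesn't kill the process
  // A worker created by std::async can linger briefly even after get()
  // returns: the shared state is fulfilled before the thread finishes exiting.
  std::future<void> f = std::async(std::launch::async, [] { /* test setup */ });
  f.get();
  // Per the POSIX wording above, this process-directed signal may be
  // delivered to *any* thread with it unblocked -- including the lingering
  // worker, which then "eats" it before the polling thread sees it.
  kill(getpid(), SIGUSR1);
  return 0;
}
```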
The need for this to be async no longer applies, so we can just make it synchronous.
With that change, the test passes in 10000 runs.
Reviewed By: labath
Differential Revision: https://reviews.llvm.org/D133181
This change adds a set of utilities to replace the result of a
`tensor.collapse_shape -> tensor.extract_slice` chain with the
equivalent result formed by aggregating slices of the
`tensor.collapse_shape` source. In general, it is not possible to
commute `extract_slice` and `collapse_shape` if linearized dimensions
are sliced. The i-th dimension of the `tensor.collapse_shape`
result is a "linearized sliced dimension" if:
1) The i-th reassociation group of the `tensor.collapse_shape` contains
more than one index (multiple dimensions of the input are collapsed), and
2) the i-th dimension is sliced by `tensor.extract_slice`.
We can work around this by stitching together the result of the
`tensor.extract_slice` while iterating over any linearized sliced dimensions.
This is equivalent to "tiling" the linearized-and-sliced dimensions of
the `tensor.collapse_shape` operation in order to manifest the result
tile (the result of the `tensor.extract_slice`). The user of the
utilities must provide the mechanism to create the tiling (e.g. a loop).
The tests demonstrate how to apply the utilities using either
`scf.for` or `scf.foreach_thread`.
The example below illustrates the pattern using `scf.for`:
```
%0 = linalg.generic ... -> tensor<3x7x11x10xf32>
%1 = tensor.collapse_shape %0 [[0, 1, 2], [3]] : ... to tensor<341x10xf32>
%2 = tensor.extract_slice %1 [13, 0] [10, 10] [2, 1] : ... to tensor<10x10xf32>
```
We can construct %2 by generating the following IR:
```
%dest = linalg.init_tensor [10, 10] : tensor<10x10xf32>
%2 = scf.for %iv = %c0 to %c10 step %c1 iter_args(%arg0 = %dest) -> tensor<10x10xf32> {
// Step 1: Map this output idx (%iv) to a multi-index for the input (%3):
%linear_index = affine.apply affine_map<(d0) -> (d0 * 2 + 13)>(%iv)
%3:3 = arith.delinearize_index %linear_index into (3, 7, 11)
// Step 2: Extract the slice from the input
%4 = tensor.extract_slice %0 [%3#0, %3#1, %3#2, 0] [1, 1, 1, 10] [1, 1, 1, 1] :
tensor<3x7x11x10xf32> to tensor<1x1x1x10xf32>
%5 = tensor.collapse_shape %4 [[0, 1, 2], [3]] :
tensor<1x1x1x10xf32> into tensor<1x10xf32>
// Step 3: Insert the slice into the destination
%6 = tensor.insert_slice %5 into %arg0 [%iv, 0] [1, 10] [1, 1] :
tensor<1x10xf32> into tensor<10x10xf32>
scf.yield %6 : tensor<10x10xf32>
}
```
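For illustration, here is a minimal plain-C++ model (not the MLIR implementation) of the index arithmetic in Steps 1 and 2 above, using the same offset/stride/basis as the example:
```cpp
#include <array>
#include <cstdio>

int main() {
  const int offset = 13, stride = 2;           // from the tensor.extract_slice
  const std::array<int, 3> basis = {3, 7, 11}; // collapsed source dimensions
  for (int iv = 0; iv < 10; ++iv) {
    // Step 1a: affine.apply -- map the output index to a linear source index.
    int linear = iv * stride + offset;
    // Step 1b: arith.delinearize_index -- convert the linear index into a
    // multi-index, peeling dimensions from the fastest-varying one.
    int d2 = linear % basis[2];
    int d1 = (linear / basis[2]) % basis[1];
    int d0 = linear / (basis[2] * basis[1]);
    std::printf("iv=%d -> (%d, %d, %d)\n", iv, d0, d1, d2);
  }
  return 0;
}
```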
The pattern was discussed in the RFC here: https://discourse.llvm.org/t/rfc-tensor-extracting-slices-from-tensor-collapse-shape/64034
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D129699
1) Tablegen patterns exist to use 'xtn' and 'uzp1' for trunc [1]. Cost table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
2) Without this, an IR instruction like `trunc <8 x i16> %v to <8 x i8>` is considered free and might be sunk into other basic blocks. As a result, the sunk 'trunc' ends up in a different basic block from its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
3) It's a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction computing the operand of the trunc could be faster (e.g., in throughput) than the instruction corresponding to "trunc (binary-vector-op ...)". For instance in [3], sinking %1 (as trunc operand) into bb.1 and bb.2 means replacing 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilizes the V1 pipeline), which is not necessarily good, especially since the ushr result needs to be preserved for the store operation in bb.0. Meanwhile, it's too optimistic (for the CodeGenPrepare pass) to assume machine-cse will always be able to de-dup the shrn from various basic blocks into one shrn.
[1] For {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4472).
For concat (trunc, trunc) -> uzp1, 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L5428-L5437)
[2] examples
- trunc(umin(X, 255)) -> UQXTN v8i8 (and other {u,s}{min,max} patterns for v8i16 operands) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L4515-L4528)
- trunc (AArch64vlshr v8i16, imm) -> SHRNv8i8 (the same is missed for SHRNv2i32) from 813ae2871d/llvm/lib/Target/AArch64/AArch64InstrInfo.td (L6743-L6748)
[3]
---
; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
%1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4> ; ushr, latency 2, throughput 1, pipeline V1
%2 = trunc <8 x i16> %1 to <8 x i8> ; xtn, latency 2, throughput 2, pipeline V
store <8 x i16> %1, ptr %addr
br i1 %cond, label %bb.1, label %bb.2
bb.1:
%4 = trunc <8 x i16> %1 to <8 x i8> ; xtn
bb.2:
%5 = trunc <8 x i16> %1 to <8 x i8> ; xtn
---
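For context, the updated cost entries have roughly the shape below (a hedged sketch of entries in the AArch64 conversion cost table; the cost values are illustrative assumptions, not the exact numbers from this patch):
```cpp
#include "llvm/CodeGen/CostTable.h"  // TypeConversionCostTblEntry, MVT
#include "llvm/CodeGen/ISDOpcodes.h" // ISD::TRUNCATE

using namespace llvm;

// Each TRUNCATE entry records the number of {xtn, uzp1} instructions the
// lowering actually emits, so a vector trunc is no longer modeled as free.
static const TypeConversionCostTblEntry ExampleConversionTbl[] = {
    {ISD::TRUNCATE, MVT::v8i8, MVT::v8i16, 1},  // one xtn (assumed cost)
    {ISD::TRUNCATE, MVT::v4i16, MVT::v4i32, 1}, // one xtn (assumed cost)
    {ISD::TRUNCATE, MVT::v8i8, MVT::v8i32, 3},  // xtn/uzp1 sequence (assumed)
};
```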
Differential Revision: https://reviews.llvm.org/D132784
Similar to 123ce97fac for dllimport: dllexport
expresses a non-hidden visibility intention. We can consider it explicit, and
therefore it should override the global visibility setting (see "NamedDecl
Implementation" in AST/Decl.cpp).
Adding the special case to CodeGenModule::setGlobalVisibility is somewhat weird,
but it allows us to add the code in one place instead of many places in AST/Decl.cpp.
Differential Revision: https://reviews.llvm.org/D133180
* If GCC is configured with `--disable-multiarch`, `--print-multiarch` output is an empty line.
* If GCC is configured with `--enable-multiarch`, `--print-multiarch` output may be a normalized triple such as `x86_64-linux-gnu` (on Debian, the 'vendor' component is omitted).
The Clang support from D101400 just prints the Debian multiarch-style triple
unconditionally, but that string is not really expected on non-Debian systems.
As I understand it, many Linux distributions and non-Linux OSes don't configure GCC with `--enable-multiarch`.
Instead of getting into the trouble of supporting all kinds of variants, drop the support, as before D101400.
Closes https://github.com/llvm/llvm-project/issues/51469
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D133170
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script from D103695.
As we're using 'typical' worst-case values, not all cost entries come from a single CPU - e.g. the latency/throughput may come from Haswell but the size-latency (uops) from Zen1/Alder Lake-E due to 'double pumping'.
Change the behavior of the libc++ `unordered_map` synthetic provider to present
children as `std::pair` values, just like `std::map` does.
The synthetic provider for libc++ `std::unordered_map` has returned children
that expose a level of internal structure on top of the key/value pair. For
example, given an unordered map initialized with `{{1, 2}, {3, 4}}`, the output
is:
```
(std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<const int, int> > >) map = size=2 {
[0] = {
__cc = (first = 3, second = 4)
}
[1] = {
__cc = (first = 1, second = 2)
}
}
```
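For reference, this output comes from inspecting a program along these lines (e.g. with lldb's `frame variable map`):
```cpp
#include <unordered_map>

int main() {
  std::unordered_map<int, int> map = {{1, 2}, {3, 4}};
  return 0; // break here, then: (lldb) frame variable map
}
```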
It's neither ideal nor necessary to have the numbered children embedded in the
`__cc` field.
Note: the numbered children have type
`std::__hash_node<std::__hash_value_type<Key, T>, void *>::__node_value_type`,
and the `__cc` fields have type `std::__hash_value_type<Key, T>::value_type`.
Compare this output to `std::map`:
```
(std::map<int, int, std::less<int>, std::allocator<std::pair<const int, int> > >) map = size=2 {
[0] = (first = 1, second = 2)
[1] = (first = 3, second = 4)
}
```
Here, the numbered children have type `std::pair<const Key, T>`.
This changes the synthetic provider for `unordered_map` to also
present children as `std::pair` values, just like `std::map`.
It appears the synthetic provider implementation for `unordered_map` was meant
to provide this behavior, but was maybe incomplete (see
d22a94377f). It has both an `m_node_type` and an
`m_element_type`, but uses only the former. The latter is exactly the type
needed for the children pairs. With this existing code, it's not much of a
change to make this work.
Differential Revision: https://reviews.llvm.org/D117383
These were missed in an earlier commit. The latency/codesize/size-latency numbers aren't different from the SSE2 values that they were falling through to, hence no test change, but it did mean we were wasting a lookup.
Detailed motivation here: https://docs.google.com/document/d/1xUNo5ovPKJMYxitiHUQVRxGI3iUmspI51Jm4w8puMwo
check-asan (with LSan enabled) and check-lsan are currently broken on recent macOS versions due to pervasive false positives. Whenever the Objective-C runtime realizes a class, it allocates data for it, then stores the pointer to that data with flags packed into its spare bits. This means LSan cannot recognize the value as a pointer while scanning.
This change checks every potential pointer on Apple platforms; if the high bit is set, it also attempts to extract a valid pointer by masking out the high bit and the flag bits. This is ugly, but it's also the best approach I could think of (see the doc above); very open to other suggestions.
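A sketch of the masking idea (simplified; the bit positions below are assumptions for illustration, not the runtime's documented layout):
```cpp
#include <cstdint>

// If a scanned word has its high bit set, it may be a realized Objective-C
// class pointer carrying flags, so also consider the value with the high bit
// and the (assumed) low flag bits stripped when looking for live pointers.
static uintptr_t MaybeUntagObjCClassPointer(uintptr_t candidate) {
  const uintptr_t kHighBit = 1ULL << 63;
  const uintptr_t kLowFlagBits = 0x7; // assumed flag bits
  if (candidate & kHighBit)
    return candidate & ~(kHighBit | kLowFlagBits);
  return candidate; // not tagged; scan the value as-is
}
```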
Differential Revision: https://reviews.llvm.org/D133126
LEGALPOS appears to only be used by LegalizeVectorOps. It needs
to point at a vector operand. Stores need to point at the second
operand since the result and the first operand are MVT::Other.
Reductions need to point at the second operand since the result
and the first operand are scalars.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D133048
The `QualType::getNonReferenceType` method is now wrapped by clang_getNonReferenceType.
A declaration for clang_getNonReferenceType was added to clang-c/Index.h
to expose it to users of the library.
An implementation for clang_getNonReferenceType was introduced in
CXType.cpp, wrapping the equivalent method of the underlying QualType of
a CXType.
An export symbol for the new function was added to libclang.map under
the LLVM_16 version entry.
A test was added to LibclangTest.cpp that tests the removal of
ref-qualifiers for some CXTypes.
The release notes for the Clang project were updated to mention the new
addition under the "libclang" section.
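A hedged usage sketch (my example; assumes a cursor `C` whose type is a reference such as `int &`):
```cpp
#include <clang-c/Index.h>

CXType stripReference(CXCursor C) {
  CXType T = clang_getCursorType(C);   // e.g. 'int &'
  return clang_getNonReferenceType(T); // yields 'int'; non-reference types
                                       // are expected to come back unchanged
}
```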
Differential Revision: https://reviews.llvm.org/D133195
This change refines the semantics of scf.foreach_thread. Tensors that are inserted into in the terminator must now be passed to the region explicitly via `shared_outs`. Inside of the body of the op, those tensors are then accessed via block arguments.
The body of a scf.foreach_thread is now treated as a repetitive region. I.e., op dominance can no longer be used in conflict detection when using a value that is defined outside of the body. Such uses may now be considered as conflicts (if there is at least one read and one write in the body), effectively privatizing the tensor. Shared outputs are not privatized when they are used via their corresponding block arguments.
As part of this change, it was also necessary to update the "tiling to scf.foreach_thread" transformation, such that the generated tensor.extract_slice ops use the scf.foreach_thread's block arguments. This is implemented by cloning the TilingInterface op inside the scf.foreach_thread, rewriting all of its outputs with block arguments, and then calling the tiling implementation. Afterwards, the cloned op is deleted again.
Differential Revision: https://reviews.llvm.org/D133114
This method allows declaring regions as "repetitive" even if the parent op does not implement the RegionBranchOpInterface.
This is needed to support loop-like ops that have parallel semantics but do not branch between regions.
Differential Revision: https://reviews.llvm.org/D133113
We haven't had a Geordi bot in years, and we moved the build bot to
another channel. This updates the documentation to better reflect
reality, and it also identifies whether each channel is actively
moderated.
Differential Revision: https://reviews.llvm.org/D133155
We are not building up a proper list of load-store candidates because
we are throwing away stores whose type doesn't match the load's.
This patch adds stores with matching store sizes as candidates.
Author of the original patch: David Sherwood.
Differential Revision: https://reviews.llvm.org/D130233
This reverts commit c911befaec.
It has broken the LLDB Arm/AArch64 Linux buildbots. I don't really understand
the underlying reason, so I am reverting for now to make the buildbots green.
https://reviews.llvm.org/D133036
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script from D103695, which I'll update shortly.
As we're using 'typical' worst-case values, not all cost entries come from a single CPU - e.g. the latency/throughput may come from Haswell but the size-latency (uops) from Zen1/Alder Lake-E due to 'double pumping'.
hasOnlyColdCalls skipped over calls to intrinsics, but it did so after
checking the linkage of the called function. This meant that the presence
of a call to a debug intrinsic could affect the outcome of the
optimization.
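A self-contained model of the ordering bug (hypothetical types; the real code operates on LLVM IR in GlobalOpt):
```cpp
#include <vector>

struct CallSite { bool CalleeIsIntrinsic; bool CalleeHasLocalLinkage; };

// Old order: the linkage check fires on llvm.dbg.* callees before they are
// skipped, so the presence of a debug intrinsic call flips the answer.
bool hasOnlyColdCallsOld(const std::vector<CallSite> &Calls) {
  for (const CallSite &C : Calls) {
    if (!C.CalleeHasLocalLinkage) return false; // dbg.declare trips this...
    if (C.CalleeIsIntrinsic) continue;          // ...so this skip comes too late
    // (cold-call-site checks would go here)
  }
  return true;
}

// Fixed order: intrinsics are ignored before any linkage-based checks.
bool hasOnlyColdCallsFixed(const std::vector<CallSite> &Calls) {
  for (const CallSite &C : Calls) {
    if (C.CalleeIsIntrinsic) continue;
    if (!C.CalleeHasLocalLinkage) return false;
    // (cold-call-site checks would go here)
  }
  return true;
}
```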
In my original reproducer (for an out-of-tree target) this was particularly
interesting, because the actual IR after GlobalOpt was not different with
debug intrinsics present, so -print-after-all printouts didn't show
anything there.
However, without debug info, GlobalOpt went further and ran
BlockFrequencyAnalysis and (more importantly) LoopAnalysis, and later in
the pipeline, InstCombine behaved differently when LoopInfo was
present.
So a call to dbg.declare prevented LoopAnalysis from running in
GlobalOpt, which later prevented InstCombine from doing an optimization.
The dbg-intrinsic-loopanalysis.ll test case tries to expose this.
Then I also noticed that adding a dbg.declare made the existing
test case coldcc_coldsites.ll generate different code, so I modified it
to check that it behaves the same way with and without the dbg.declare.
Reviewed By: nikic, fhahn
Differential Revision: https://reviews.llvm.org/D133193
Use the getPredicateOnEdge method if the value is a non-local
compare-with-a-constant instruction; this can give more precise
results than getConstantOnEdge.
Differential Revision: https://reviews.llvm.org/D131956
I'm not sure how much to add to the description, as we've tried to allow targets to interpret the TargetCostKind enums in their own way. But we need to make it clear that certain cost kinds need to match the threshold numbers used by various passes (and vice versa, when passes are determining a cost-benefit threshold).
I'm not keen on the "weighted sum of size and latency" description, but it's very difficult to come up with anything else that's suitably generic (e.g. X86 will use uop counts here to work easily with LoopMicroOpBufferSize thresholds, even though high-latency fdiv/fsqrt instructions still often have low uop counts).
Differential Revision: https://reviews.llvm.org/D132288
This removes a bunch of duplicated code by adding an intermediate
function, simplifyReduction, that takes a std::function argument
for the actual replacement of the code.
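The shape of the refactoring, as a hedged sketch (placeholder types; not the actual code):
```cpp
#include <functional>

struct ReductionOp { /* placeholder for the real operation type */ };

// The shared driver performs the common matching/rewriting steps once; each
// reduction kind supplies only its specific replacement as a callback.
void simplifyReduction(ReductionOp &op,
                       const std::function<void(ReductionOp &)> &replace) {
  // (shared legality checks and operand handling would go here)
  replace(op); // op-specific rewrite
}

void simplifySum(ReductionOp &op) {
  simplifyReduction(op, [](ReductionOp &o) { /* build the simplified sum */ });
}
```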
No functional change intended.
Reviewed By: vzakhari
Differential Revision: https://reviews.llvm.org/D132588
Prior to this revision, running `mlir-opt -test-vector-warp-distribute=rewrite-warp-ops-to-scf-if -canonicalize -verify-each=0` would produce IR resembling the following:
```
%4 = "vector.load"(%3, %arg0) : (memref<1x32xf32, 3>, index) -> vector<1x1xf32>
```
This fails verification, since the load needs 2 indices but only 1 is provided.
Differential Revision: https://reviews.llvm.org/D133106
We met this issue when building LLVM with LLVM_LIBDIR_SUFFIX=64: the
installation destination of scan-build-py does not respect the given
suffix.
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D133160