We can quickly extract multiple elements of a bool vector using MOVMSK ops - since we don't know what generated the vXi1, I've been optimistic and assumed we can use PMOVMSKB to extract the maximum number of bools with a single op.
The MOVMSK pattern isn't great for extract+insert round trips as vXi1 type legalization can interfere with this a lot - so this relies on us remaining good at using getScalarizationOverhead properly (and tagging both Insert and Extract modes) for those round trip cases.
The AVX512 KMOV codegen for bool extraction is a bit of a mess so for now I've not included that - the per-element cost is a lot more accurate for current codegen.
We ignored the cast if the enum was scoped.
This is bad since there is no implicit conversion from the scoped enum to the corresponding underlying type.
The fix is basically: isIntegralOrEnumerationType() -> isIntegralOr**Unscoped**EnumerationType()
This materialized in crashes on analyzing the LLVM itself using the Z3 refutation.
Refutation synthesized the given Z3 Binary expression (`BO_And` of `unsigned char` aka. 8 bits
and an `int` 32 bits) with the wrong bitwidth in the end, which triggered an assert.
Now, we evaluate the cast according to the standard.
This bug could have been triggered using the Z3 CM according to
https://bugs.llvm.org/show_bug.cgi?id=44030Fixes#47570#43375
Reviewed By: martong
Differential Revision: https://reviews.llvm.org/D85528
Per the guidance in
https://llvm.org/docs/Atomics.html#atomics-and-ir-optimization,
an atomic load from a constant global can be dropped, as there can
be no stores to synchronize with. Any write to the constant global
would be UB.
IPSCCP will already drop such loads, but the main helper in Local
doesn't recognize this currently. This is motivated by D118387.
Differential Revision: https://reviews.llvm.org/D124241
This patch restricts the value of `if` clause expression to an I1 value.
It also restricts the value of `num_threads` clause expression to an I32
value.
Reviewed By: kiranchandramohan
Differential Revision: https://reviews.llvm.org/D124142
`g_memdup()` allocates and copies memory, thus we should not assume that
the returned memory region is uninitialized because it might not be the
case.
PS: It would be even better to copy the bindings to mimic the actual
content of the buffer, but this works too.
Fixes#53617
Reviewed By: martong
Differential Revision: https://reviews.llvm.org/D124436
ConstantFolding currently converts "getelementptr i8, Ptr, (sub 0, V)"
to "inttoptr (sub (ptrtoint Ptr), V)". This transform is, taken by
itself, correct, but does came with two issues:
1. It unnecessarily broadens provenance by introducing an inttoptr.
We generally prefer not to introduce inttoptr during optimization.
2. For the case where V == ptrtoint Ptr, this folds to inttoptr 0,
which further folds to null. In that case provenance becomes
incorrect. This has been observed as a real-world miscompile with
rustc.
We should probably address that incorrect inttoptr 0 fold at some
point, but in either case we should also drop this inttoptr-introducing
fold. Instead, replace it with a fold rooted at
ptrtoint(getelementptr), which seems to cover the original
motivation for this fold (test2 in the changed file).
Differential Revision: https://reviews.llvm.org/D124677
This enables one to write generic code that can be instantiated for both
specific operation classes and the common base class without
specialization. Examples include functions that take/return ops, such
as:
```mlir
template <typename FnTy>
void applyIf(FnTy &&lambda, ...) {
for (Operation *op : ...) {
auto specific = dyn_cast<function_traits<FnTy>::template arg_t<0>>(op);
if (specific)
lambda(specific);
}
}
```
that would otherwise need to rely on template specialization to support
lambdas that take specific operations and those that take `Operation *`.
Reviewed By: rriddle
Differential Revision: https://reviews.llvm.org/D124675
X86 codegen uses function attribute `min-legal-vector-width` to select the proper ABI. The intention of the attribute is to reflect user's requirement when they passing or returning vector arguments. So Clang front-end will iterate the vector arguments and set `min-legal-vector-width` to the width of the maximum for both caller and callee.
It is assumed any middle end optimizations won't care of the attribute expect inlining and argument promotion.
- For inlining, we will propagate the attribute of inlined functions because the inlining functions become the newer caller.
- For argument promotion, we check the `min-legal-vector-width` of the caller and callee and refuse to promote when they don't match.
The problem comes from the optimizations' combination, as shown by https://godbolt.org/z/zo3hba8xW. The caller `foo` has two callees `bar` and `baz`. When doing argument promotion, both `foo` and `bar` has the same `min-legal-vector-width`. So the argument was promoted to vector. Then the inlining inlines `baz` to `foo` and updates `min-legal-vector-width`, which results in ABI mismatch between `foo` and `bar`.
This patch fixes the problem by expanding the concept of `min-legal-vector-width` to indicator of functions arguments. That says, any passes touch functions arguments have to set `min-legal-vector-width` to the value reflects the width of vector arguments. It makes sense to me because any arguments modifications are ABI related and should response for the ABI compatibility.
Differential Revision: https://reviews.llvm.org/D123284
The print output of loop cache analysis sometimes has a non-deterministic order
and therefore we have been using `CHECK-DAG` in its lit tests. This patch changes
the sorting of LoopCosts to llvm::stable_sort() where we compare loop cost numbers
and sort the loops. In case of the same loop cost numbers, llvm::stable_sort() now
would output a deterministic loop order.
Reviewed By: Meinersbur, fhahn, #loopoptwg
Differential Revision: https://reviews.llvm.org/D124725
This had the surprising behavior of using whatever instruction
happened to be first in the block as an anchor point to stick random
implicit defs on. Use a real implicit_def instead.
Many MIR reductions benefit from or require increasing the instruction
count. For example, unlike in the IR, you may need to insert a new
instruction to represent an undef. The current instruction reduction
pass works around this by sticking implicit defs on whatever
instruction happens to be first in the entry block block.
Other strategies I've applied manually include breaking instructions
with multiple defs into separate instructions, or breaking large
register defs into multiple subregister defs.
Make up a simple scoring system based on what I generally try to get
rid of first when manually reducing. Counts implicit defs as free
since reduction passes will be introducing them, although they
probably should count for something. It also might make more sense to
have a comparison the two functions, rather than having to compute a
contextless number. This isn't particularly well tested since overall
the MIR support isn't in a place where it is useful on the kinds of
testcases I want to throw at it.
The current implementation uses a discrete "pdl_interp.inferred_types"
operation, which acts as a "fake" handle to a type range. This op is
used as a signal to pdl_interp.create_operation that types should be
inferred. This is terribly awkward and clunky though:
* This op doesn't have a byte code representation, and its conversion
to bytecode kind of assumes that it is only used in a certain way. The
current lowering is also broken and seemingly untested.
* Given that this is a different operation, it gives off the assumption
that it can be used multiple times, or that after the first use
the value contains the inferred types. This isn't the case though,
the resultant type range can never actually be used as a type range.
This commit refactors the representation by removing the discrete
InferredTypesOp, and instead adds a UnitAttr to
pdl_interp.CreateOperation that signals when the created operations
should infer their types. This leads to a much much cleaner abstraction,
a more optimal bytecode lowering, and also allows for better error
handling and diagnostics when a created operation doesn't actually
support type inferrence.
Differential Revision: https://reviews.llvm.org/D124587
In some cases, it is not enough to freeze the final AND/OR operation
when chaining a number of invariant conditions together.
After creating a chain of ANDs/ORs, we assume all unswitched operands to
be either true or false. But if any of the operands is poison, the rest
of the operands could have any value after branching on the frozen
condition.
To avoid that, freeze individual operands, if needed. In some cases this
may lead to unnecessary freezes, but it seems required at least for some
cases (see trivial-unswitch-freeze-individual-conditions.ll)
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124554
Enable efficient implementation of context-aware joining of distinct
boolean values. It can be used to join distinct boolean values while
preserving flow condition information.
Flow conditions are represented as Token <=> Clause iff formulas. To
perform context-aware joining, one can simply add the tokens of flow
conditions to the formula when joining distinct boolean values, e.g:
`makeOr(makeAnd(FC1, Val1), makeAnd(FC2, Val2))`. This significantly
simplifies the implementation of `Environment::join`.
This patch removes the `DataflowAnalysisContext::getSolver` method.
The `DataflowAnalysisContext::flowConditionImplies` method should be
used instead.
Reviewed-by: ymandel, xazax.hun
Differential Revision: https://reviews.llvm.org/D124395
If the legalized src/dst types are the same, assume the "truncation" is free.
This fixes some edge cases such as mul lo/hi ops and bool vectors which will get legalized back to legal vector widths
Replace all the instances of `_LIBCPP_INLINE_VISIBILITY` with `_LIBCPP_HIDE_FROM_ABI` and `_VSTD` with `std`.
Reviewed By: Mordante, #libc
Spies: libcxx-commits
Differential Revision: https://reviews.llvm.org/D124662
As Fortran 2018 C1533, a nonintrinsic elemental procedure shall not be
used as an actual argument. The semantic check for implicit iterface is
missed.
Reviewed By: klausler
Differential Revision: https://reviews.llvm.org/D124379
Based off the script from D103695, we were exaggerating the cost of the OR(AND(X,M),AND(Y,~M)) expansion using instruction count instead of effective throughput
A global variable may have the same name as a label, and ptxas does not accept it.
Prefix labels with $L__ to fix this.
Reviewed By: MaskRay, tra
Differential Revision: https://reviews.llvm.org/D119669
Trivial unswitching can also introduce new branches on undef/poison.
Freeze the conditions if needed.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D124549