Add andm, orm, xorm, eqvm, nndm, negm, pcvm, lzvm, and tovm intrinsic
instructions, a few pseudo instructions to expand logical intrinsic
using VM512, a mechnism to expand such pseudo instructions, and
regression tests. Also, assign vector mask types and vector mask
register classes correctly. This is required to use VM512 registers
as function arguments.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D93093
Use SCEV to salvage additional @llvm.dbg.value that have turned into
referencing undef after transformation (and traditional
salvageDebugInfo). Before rewrite (but after introduction of new
induction variables) use SCEV to compute an equivalent set of values for
each @llvm.dbg.value in the loop body (among the loop header PHI-nodes).
After rewrite (and dead PHI elimination) update those @llvm.dbg.value
now referencing undef by picking a remaining value from its equivalence
set. Allow match with offset by inserting compensation code in the
DIExpression.
Fixes : PR38815
Differential Revision: https://reviews.llvm.org/D87494
This patch updates VPWidenMemoryInstructionRecipe to use VPDef
to manage the value it produces instead of inheriting from VPValue.
Reviewed By: gilr
Differential Revision: https://reviews.llvm.org/D90563
Vector element size could be different for different store chains.
This patch prevents wrong computation of maximum number of elements
for that case.
Differential Revision: https://reviews.llvm.org/D93192
Changes in this patch:
- Minor changes to the LowerVECREDUCE_SEQ_FADD function added by @cameron.mcinally
to also work for scalable types
- Added TableGen patterns for FP reductions with unpacked types (nxv2f16, nxv4f16 & nxv2f32)
- Asserts added to expandFMINNUM_FMAXNUM & expandVecReduceSeq for scalable types
Reviewed By: cameron.mcinally
Differential Revision: https://reviews.llvm.org/D93050
A vpt block that just contains either VPST;VCTP or VPT;VCTP, once the
VCTP is removed will become invalid. This fixed the first by removing
the now empty block and bails out for the second, as we have no simple
way of converting a VPT to a VCMP.
Differential Revision: https://reviews.llvm.org/D92369
These parameters set a default value of 0, so I believe they
should include a 0 suffix. This allows for versions which do not
set a default value in future.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D93187
The runtime library has two family library implementation for ppc_fp128 and fp128.
For IBM Long double(ppc_fp128), it is suffixed with 'l', i.e(sqrtl). For
IEEE Long double(fp128), it is suffixed with "ieee128" or "f128".
We miss to map several libcall for IEEE Long double.
Reviewed By: qiucf
Differential Revision: https://reviews.llvm.org/D91675
add a new goal MustReduceRegisterPressure for machine combiner pass.
PowerPC will use this new goal to do some register pressure related optimization.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D92068
JITLinkDylib represents a target dylib for a JITLink link. By representing this
explicitly we can:
- Enable JITLinkMemoryManagers to manage allocations on a per-dylib basis
(e.g by maintaining a seperate allocation pool for each JITLinkDylib).
- Enable new features and diagnostics that require information about the
target dylib (not implemented in this patch).
InstCombine canonicalizes X>C && X<C' style comparisons into
(X+C1)<C2. This type of expression is recognized by some analyses
like LVI, but currently not when used inside assumptions, because
AssumptionCache does not track affected values for it.
If a faux shuffle uses smaller shuffle inputs, try to recursively combine with those inputs directly instead of widening them immediately. Then widen all smaller inputs at the bottom of the recursion.
This will still mean we're generating nodes on the fly (PR45974) even if we don't combine to a new shuffle but it does help AVX2+ targets combine across xmm/ymm/zmm types, mainly as variable shuffles.
This recommits a87fccb3ff with a fix to mark the destination operand
of the marker instruction as def, to fix a machine verifier failure.
This reverts the revert commit c0f2cea7c0.
Default expansion leads to repeated extensions/truncations to/from vXi16 which shuffle combining and demanded elts can't completely unravel.
Better just to promote (any_extend) the input and perform a vXi16 reduction.
We'll be able to remove a lot of this if we ever get decent legalization support for reduction intrinsics in SelectionDAG.
BasicAA currently handles cases like Scale*V0 + (-Scale)*V1 where
V0 != V1, but does not handle the simpler case of Scale*V with
V != 0. Add it based on an isKnownNonZero() call.
I'm not passing a context instruction for now, because the existing
approach of always using GEP1 for context could result in symmetry
issues.
Differential Revision: https://reviews.llvm.org/D93162
In particular, if the successor block, which is about to get a new
predecessor block, currently only has a single predecessor,
then the bonus instructions will be directly used within said successor,
which is fine, since the block with bonus instructions dominates that
successor. But once there's a new predecessor, the IR is no longer valid,
and we don't fix it, because we only update PHI nodes.
Which means, the live-out bonus instructions must be exclusively used
by the PHI nodes in successor blocks. So we have to form trivial PHI nodes.
which will then be successfully updated to recieve cloned bonus instns.
This all works fine, except for the fact that we don't have access to
the dominator tree, and we don't ignore unreachable code,
so we sometimes do end up having to deal with some weird IR.
Fixes https://bugs.llvm.org/show_bug.cgi?id=48450
Some of the pattern matching in PPCInstrVSX.td and node lowering involving vectors assumes 64bit mode. This patch disables some of the unsafe pattern matching and lowering of BUILD_VECTOR in 32bit mode.
Reviewed By: Xiangling_L
Differential Revision: https://reviews.llvm.org/D92789
CVP currently handles switches by checking an equality predicate
on all edges from predecessor blocks. Of course, this can only
work if the value being switched over is defined in a different block.
Replace this implementation with a call to getPredicateAt(), which
also does the predecessor edge predicate check (if not defined in
the same block), but can also do quite a bit more: It can reason
about phi-nodes by checking edge predicates for incoming values,
it can reason about assumes, and it can reason about block values.
As such, this makes the implementation both simpler and more
powerful. The compile-time impact on CTMark is in the noise.
The getPayload/getMask/getPassThrough functions should return values
that could be composed into a masked load/store without any additional
type casts. The previous fix violated that.
Instead, convert scalar mask to a vector right before rescaling.
The last use of isLoop was removed on Apr 29, 2002 in commit
09bbb5c015 as part of an effort to
remove "old induction varaible cannonicalization pass built on top of
interval analysis".
AlignVectors treats all loaded/stored values as vectors of bytes,
and masks as corresponding vectors of booleans, so make getMask
produce a 1-element vector for scalars from the start.
The ABI demands a data16 prefix for lea in 64-bit LP64 mode, but not in
64-bit ILP32 mode. In both modes this prefix would ordinarily be
ignored, but the instructions may be changed by the linker to
instructions that are affected by the prefix.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D93157
This adds some basic MVE masked load/store costs, notably changing the
cost of legal loads/stores to the MVECostFactor and the cost of
scalarized instructions to 8*NumElts.
Differential Revision: https://reviews.llvm.org/D86538
When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.
Original patch by Anna Welker
Differential Revision: https://reviews.llvm.org/D86452
gcov computes the line execution count as the sum of (a) counts from
predecessors on other lines and (b) the sum of loop execution counts of blocks
on the same line (think of loops on one line).
For (b), we use Donald B. Johnson's cycle enumeration algorithm and perform
cycle cancelling for each cycle. This number of candidate cycles were
exponential and D93036 made it polynomial by skipping zero count cycles. The
time complexity is high (O(V*E^2) (it could be O(E^2) but the linear `Blocks`
check made it higher) and the implementation is complex.
We could just identify loops and sum all back edges. However, this requires a
dominator tree construction which is more complex. The time complexity can be
decreased to almost linear, though.
This patch just performs cycle cancelling iteratively. Add two members
`traversable` and `incoming` to GCOVArc. There are 3 states:
* `!traversable`: blocks not on this line or explored blocks
* `traversable && incoming == nullptr`: unexplored blocks
* `traversable && incoming != nullptr`: blocks which are being explored (on the stack)
If an arc points to a block being explored, a cycle has been found.
Let E be the number of arcs. Every time a cycle is found, at least one arc is
saturated (`edgeCount` reduced to 0), so there are at most E cycles. Finding one
cycle takes O(E) time, so the overall time complexity is O(E^2). Note that we
always augment through a back edge and never need to augment its reverse edge so
reverse edges in traditional flow networks are not needed.
Reviewed By: xinhaoyuan
Differential Revision: https://reviews.llvm.org/D93073